How do we use LLMs for Lab Operations in Pharma and Biotechs?

ChatGPT, AI and Large Language Models (LLMs) have been all over the news recently. But what is an LLM? Simply put, these ‘AIs’ are trained on large text datasets and are then used to predict a text output for a given text input. Most users interact with them either through a service their employer deploys or through a generic provider (such as the ubiquitous ChatGPT).

LLMs gained their powers in a series of steps: Google introduced the Transformer architecture in 2017, and OpenAI released GPT-2 in 2019 and GPT-3 in 2020. The GPT-3 model was already extremely capable, but it took another two years before OpenAI released ChatGPT in November 2022. The model behind ChatGPT was not that different from GPT-3; the real innovation and growth driver was that the OpenAI team nailed the chat user experience, provided clear examples of potential use cases, and offered a vision of how to make LLMs behave like a “product” rather than just a tool for machine learning engineers.

As a result, ChatGPT enjoyed a rise in popularity unmatched by any previous internet service. This suggests that context and packaging matter a great deal for the successful application of LLM technology.

LLMs are a game changer for the life science space. They open up new avenues to utilise the vast stores of existing data that have been aggregated digitally over the last 20 years. While data lakes, terabyte-scale storage and interconnected IT systems are enablers for accessing and gathering data – LLMs are the key to unlocking that knowledge for more users and applications than ever before, and in ways that would have seemed futuristic and impossible just a year ago.

Jonas Kulessa , Chief Technology Officer – LabTwin GmbH

So, how can we use LLMs? 

LLMs have been shown to be good text editors – they can create decent summaries, rephrase text, or make it more digestible. They can also find entities in text and pass them over to other systems, giving computer programs the ability to take natural language as user input and map it onto their algorithmic logic, as sketched below.
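To make the idea concrete, here is a minimal sketch (not LabTwin’s actual implementation) of how an LLM’s structured output could be mapped onto ordinary program logic. The `call_llm` helper, the prompt and the entity schema are all hypothetical; in a real system the model’s output would be validated against a schema before reaching downstream code.

```python
import json

# Hypothetical stand-in for a real LLM call (OpenAI, a local model, etc.).
# It returns a canned response here so the sketch runs without an API key.
def call_llm(prompt: str) -> str:
    return '{"action": "dilute", "sample_id": "S-42", "volume_ul": 500}'

PROMPT = (
    "Extract the entities from the user's request and return JSON with the "
    'keys "action", "sample_id" and "volume_ul". Request: {request}'
)

def parse_request(request: str) -> dict:
    """Turn free-form lab speech into a structured command via the LLM."""
    raw = call_llm(PROMPT.format(request=request))
    return json.loads(raw)

def dispatch(entities: dict) -> None:
    """Map the extracted entities onto ordinary application logic."""
    if entities.get("action") == "dilute":
        print(f"Diluting sample {entities['sample_id']} to {entities['volume_ul']} µL")
    else:
        print(f"Unsupported action: {entities.get('action')}")

dispatch(parse_request("Please dilute sample S-42 to 500 microlitres"))
```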

One example is natural language queries for databases or analytics systems, such as Databricks or Excel’s “Analyse Data” feature, which let users perform actions by asking questions in plain English.
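The pattern behind such features can be sketched roughly as follows. This is not how Databricks or Excel implement it; the `english_to_sql` helper stands in for a real LLM call that would be prompted with the table schema, and whose generated query would be validated before execution.

```python
import sqlite3

# Hypothetical stand-in for an LLM that translates an English question into
# SQL for a known schema. It returns a fixed query so the sketch is runnable.
def english_to_sql(question: str) -> str:
    return "SELECT reagent, COUNT(*) AS runs FROM experiments GROUP BY reagent"

# A tiny in-memory table standing in for a lab data store.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE experiments (reagent TEXT, yield_pct REAL)")
conn.executemany(
    "INSERT INTO experiments VALUES (?, ?)",
    [("NaCl", 81.2), ("NaCl", 79.5), ("KCl", 64.0)],
)

sql = english_to_sql("How many runs did we do with each reagent?")
for row in conn.execute(sql):
    print(row)  # e.g. ('KCl', 1) and ('NaCl', 2)
```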

Voice assistants like LabTwin are uniquely positioned to be a good entry point for LLMs. We use LLMs extensively to handle natural language input, assist our users with information and guide them through workflows within the app. With LabTwin, scientists are able to obtain reliable answers to scientific questions, retrieve operational information, perform accurate automatic calculations and even receive troubleshooting tips to resolve errors on the go.

The limitations of LLM applications stem from how the models are trained and which objective is used. LLMs ultimately predict text and are prone to “hallucinations” – they output the most “likely” text, which might not be factually correct. How LLMs ingest information is another strong limitation of the technology: they are trained on a corpus spanning much of the material found on the open internet, but first, not all relevant material is publicly available, and second, the material that is may contain mistakes or genuine disagreements about factual knowledge. The introduction of guardrails or checks can partially mitigate these issues, and hopefully we will eventually see fundamental improvements from LLM providers.
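As one hypothetical illustration of such a check, the sketch below only surfaces an answer when it can be loosely matched against approved reference text. Production systems would use retrieval, citation checking or a second verification model rather than this crude word-overlap heuristic.

```python
# A very simple guardrail sketch: only show an LLM answer if its sentences
# overlap sufficiently with approved source documents.

APPROVED_SOURCES = {
    "buffer-prep-sop": "Dissolve 8 g NaCl in 1 L of deionised water and adjust pH to 7.4.",
}

def is_grounded(answer: str, sources: dict) -> bool:
    """Crude check: every sentence must share at least three words with a source."""
    for sentence in filter(None, (s.strip() for s in answer.split("."))):
        words = set(sentence.lower().split())
        if not any(len(words & set(text.lower().split())) >= 3 for text in sources.values()):
            return False
    return True

llm_answer = "Dissolve 8 g NaCl in 1 L of deionised water."
if is_grounded(llm_answer, APPROVED_SOURCES):
    print(llm_answer)
else:
    print("Answer could not be verified against approved sources.")
```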

In pharma and biotech, LLM adoption will first be driven by routine office-work scenarios – data conversion, or generic R&D questions used to build a rough concept map of an area. A second adoption wave will probably come with the advent of corporate-tailored LLMs that can make use of proprietary data and carry a set of domain-specific checks. We already have examples such as Bloomberg, which trained BloombergGPT specifically on financial texts. Lab-suitable LLMs can’t be far behind, and they could cause one of the biggest changes to research output we’ve ever seen.

Dennis Shepelin

Dennis is a Data Scientist and Machine Learning Engineer at LabTwin GmbH, Berlin, with a background in Computational Biotechnology. He completed his Bachelor’s and Master’s degrees in Biology at Lomonosov University and his PhD at the Danish Technical University, and has worked on fundamental and applied challenges in biology. Dennis is interested in applying Natural Language Processing techniques to the Life Sciences.
