
Generative AI for Drug Information


The weekend of October 20th marked the kickoff of Data Aces' inaugural hackathon. The company's habit of adapting and innovating at every turn made Generative AI, the hottest emerging trend in the industry, the natural theme. I was incredibly excited to dive into the growing world of Gen AI and everything it has to offer. After an hour of socializing and some words of encouragement from the CEO, we were off to the races.


Ideation

The first stage was ideation, and like any good developers we tried to answer the age-old question: "what can we build that will add value to our clients?" Since we work with pharmaceutical companies and are familiar with the data sets and pain points in that industry, we decided to narrow our focus there. We came up with the idea for a simple chatbot that would allow doctors, reps, or just about anyone interested to query an array of drug-related data: Generative AI for Drug Information. The vision for what we wanted to achieve was clear, but there was one small issue: none of us had worked with Gen AI before. So, with a well-defined goal in mind, we began the exploration phase.


Exploration

I started off by exploring the fundamental ideas of Gen AI, LLMs, and transformers and their underlying workings. I came across fascinating concepts such as positional encoding, embeddings, attention, and self-attention. I scoured through hundreds of LLMs, not knowing which one to choose, and after some research into the relative pros and cons of the different models, I settled on roBERTa. roBERTa is a transformer model that can be used for applications such as text processing, contextual question answering, and text summarization. A model built on roBERTa, called RobertaForQuestionAnswering, fit our particular use case perfectly. For the coding environment, I initially chose Google Colab for its ability to scale and get the full potential out of the model without the restrictions of a local system. Next, we needed the data the model would reference. To build a quick proof of concept, we decided to focus only on the overactive bladder condition and the two most popular medications prescribed to treat it, GEMTESA and MYRBETRIQ. We accessed multiple online drug archives and retrieved the information relating to the above-mentioned drugs as well as the condition itself.
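As a quick sanity check before wiring up the app, the model can be loaded and queried in a few lines with the Hugging Face Transformers library. The sketch below is illustrative rather than the exact hackathon code: the `deepset/roberta-base-squad2` checkpoint is an assumed, publicly available RoBERTa model fine-tuned for extractive question answering, and the sample question and context are made up.

```python
# Minimal sketch: loading a RoBERTa question-answering model with Transformers.
# The checkpoint name is an assumption -- any RoBERTa model fine-tuned for
# extractive QA (e.g. on SQuAD) could be substituted.
from transformers import AutoTokenizer, RobertaForQuestionAnswering, pipeline

MODEL_NAME = "deepset/roberta-base-squad2"  # assumed checkpoint, not from the original post

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = RobertaForQuestionAnswering.from_pretrained(MODEL_NAME)

# The pipeline wraps tokenization, model inference, and answer-span decoding.
qa = pipeline("question-answering", model=model, tokenizer=tokenizer)

result = qa(
    question="What condition is MYRBETRIQ prescribed for?",
    context="MYRBETRIQ is a prescription medicine used to treat overactive bladder.",
)
print(result["answer"], result["score"])
```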


Process

First, we configured our Python environment and installed the libraries we needed for the project: Streamlit, PyPDF2, and Transformers, from which we used the RobertaForQuestionAnswering, AutoTokenizer, and pipeline utilities. The frontend of the app is built with Streamlit and consists of a simple UI: a text prompt where the user enters a question, and an answer board where all the question-answer pairs from a given session are stored. We also created a folder containing all the drug information we want the model to reference. The next step is to extract the textual content from the PDF files and store it to be used as the context for the model. The question entered by the user is converted to a vector and appended to the vector store used by the model. The following is a step-by-step walkthrough of the process, depicting the internal workings of the model (a minimal code sketch follows the workflow):


Gen AI for Drug Information Workflow


1. Context input – The user uploads the documents that they want the bot to reference

2. User input – The conversation begins with the user entering a question into the chatbot

3. Input processing – The input question as well as the contents of the drug files are processed and vectorized

4. Model processing – The processed input is passed to the roBERTa model where it analyzes the context and semantics of the user’s question to generate a meaningful representation

5. Response generation – Using the recognized intent and context, the chatbot generates a relevant and informative response to answer the user’s question.

6. Streamlit interface – The response is presented to the user through the Streamlit interface, providing a user-friendly and visually appealing display.
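To make the workflow concrete, here is a minimal sketch of how these steps could fit together in a single Streamlit script. It is an illustration under stated assumptions rather than the exact project code: the `drug_docs/` folder name and the `deepset/roberta-base-squad2` checkpoint are placeholders, and the extracted PDF text is passed directly to the QA pipeline as context.

```python
# Minimal Streamlit sketch of the workflow above (assumed file and variable names).
import os

import streamlit as st
from PyPDF2 import PdfReader
from transformers import pipeline

DOCS_DIR = "drug_docs"  # assumed folder holding the drug information PDFs


@st.cache_resource
def load_qa_pipeline():
    # Assumed checkpoint; any RoBERTa model fine-tuned for extractive QA works.
    return pipeline("question-answering", model="deepset/roberta-base-squad2")


@st.cache_data
def load_context(docs_dir: str) -> str:
    """Extract text from every PDF in the folder and concatenate it as context."""
    pages = []
    for name in os.listdir(docs_dir):
        if name.lower().endswith(".pdf"):
            reader = PdfReader(os.path.join(docs_dir, name))
            pages.extend(page.extract_text() or "" for page in reader.pages)
    return "\n".join(pages)


st.title("Generative AI for Drug Information")

qa = load_qa_pipeline()
context = load_context(DOCS_DIR)

# Answer board: keep all question-answer pairs for the current session.
if "history" not in st.session_state:
    st.session_state.history = []

question = st.text_input("Ask a question about the drug documents")
if question:
    result = qa(question=question, context=context)
    st.session_state.history.append((question, result["answer"]))

for q, a in st.session_state.history:
    st.markdown(f"**Q:** {q}")
    st.markdown(f"**A:** {a}")
```

Running the script (e.g. `streamlit run app.py`) serves the text prompt and answer board described above.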


Performance

The performance of the model was subpar compared to models such as Llama and Mistral-7B. The answers were usually single words or short phrases that could not comprehensively cover every aspect of the question being asked. The model also seemed unable to grasp the semantic meaning behind the question and often referred to the wrong portions of the context for its answers. Here are some of the interactions with the model.





Regardless of this model's performance, we are confident that the use case we have stumbled upon is a strong one. We are continuing to experiment with the idea, testing different models and methods of data delivery to improve the results.


Conclusion

In conclusion, embarking on the journey of building my first Generative AI chatbot with the powerful combination of roBERTa and Streamlit has been a rewarding and enlightening experience. Navigating the intricacies of natural language processing and understanding the nuances of transformer models has not only expanded my technical skills but also deepened my appreciation for the boundless possibilities within the realm of artificial intelligence. Through the seamless integration of roBERTa's language-understanding capabilities and Streamlit's user-friendly interface, I have witnessed the birth of a conversational agent that transcends mere functionality. As I reflect on this endeavor, I am not only proud of the tangible outcome but also inspired to continue exploring the frontiers of AI, eager to contribute to the ever-evolving landscape of innovative and intelligent applications.


