
Build and Deploy LLM Apps

Since the advent of ChatGPT, large language models (LLMs) have undoubtedly changed the way we think about how we interact with datasets and use them to power key features in our applications. However, it isn’t always easy or intuitive to adapt our standard app development processes to support LLMs. As we all know, LLMs come with limitations on what is possible as well as caveats such as hallucinations. In this article, we share the best practices we learned from our journey to build and deploy LLM apps.


Build your first LLM App

If you are just starting with LLMs, you have probably read an article about how easy it is to build an app such as a chatbot or assistant. You use a platform such as OpenAI or Anthropic and, in a few clicks, you have your first app!

From the heady success of the rapidly built prototype, a lot of grunt work then goes into building and deploying the LLM app to production.

But it doesn’t stop there! The app works well in production for a while until it runs into a series of corner cases. These cases were missing from your training / test datasets, and you quickly fall into the trough of disillusionment familiar from the Gartner Hype Cycle!


Gartner Hype cycle

How do you avoid falling into this trap? Some amount of pre-planning and knowledge of the common problems can help you build and deploy LLM Apps.


Low Accuracy Answers

In addition to cloud services like OpenAI or Anthropic’s Claude, we often use base LLMs (e.g., Mistral, Falcon, Llama 2) and deploy them in our private networks (private LLMs). This is because many enterprises are concerned about sending their data to a cloud service (for example, when it contains PII or other confidential information such as contract terms or legal arrangements).


One of the challenges we typically run into is that these base models are trained on general-purpose public datasets. So when we ask them questions about our own data, they are likely to give incorrect answers biased by the billions of training examples they have seen in the past.



To get past this, we have to help the LLM out by giving it specific data from a datastore. The LLM’s job becomes a lot easier if we can give it targeted data points relevant to the keywords in the user’s question.
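As a simple illustration of what giving the LLM targeted data means in practice, the retrieved snippets can be stuffed directly into the prompt ahead of the user’s question. The helper below is a minimal, hypothetical sketch; how the snippets are retrieved is covered in the next section.

# Minimal sketch: combine retrieved context with the user's question.
# retrieved_snippets is assumed to come from a search over our own datastore.
def build_prompt(question, retrieved_snippets):
    context = "\n\n".join(retrieved_snippets)
    return (
        "Answer the question using only the context below. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )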


Vector Databases and RAG

Vector databases are designed to do exactly this. They have their origins in search tools: Solr and Elasticsearch are both based on Apache Lucene and have been powering search functions across millions of websites and enterprise apps for over a decade. These search databases typically store JSON documents and focus on full-text search; vector databases store vector embeddings and are really good at ultra-fast similarity search.
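Under the hood, “similarity” usually means comparing embedding vectors with a metric such as cosine similarity. Here is a minimal sketch with NumPy; the vectors are made up purely for illustration, and real embeddings have hundreds or thousands of dimensions.

import numpy as np

# Toy embeddings; in practice these come from an embedding model.
docs = np.array([[0.1, 0.8, 0.2],
                 [0.9, 0.1, 0.3],
                 [0.2, 0.7, 0.4]])
query = np.array([0.15, 0.75, 0.3])

# Cosine similarity between the query and each document vector.
scores = docs @ query / (np.linalg.norm(docs, axis=1) * np.linalg.norm(query))
top_n = np.argsort(scores)[::-1]   # indices of the most similar documents first
print(top_n, scores[top_n])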


Retrieval-augmented generation (RAG) is the process of supporting an LLM with a vector store that has indexed our specific datasets, allowing the LLM to base its answers on the most relevant data returned by the vector search.



Step 1: Take your enterprise data and feed it to an embedding model for conversion to a vector representation. Enterprise data can include data in SQL tables, PDF documents, Word docs, Excel files, and other document stores.


Step 2: Store the converted vectors in your vector database. The vectors are indexed into a data structure such as HNSW, LSH, or IVF, which specialize in enabling ultra-fast vector similarity searches.


Step 3: A user asks the chatbot / assistant a question, which comes in as a ‘prompt’. This can be a combination of a ‘system’ prompt created by the logic in the assistant and the user’s input.


Step 4: The prompt is processed by an embedding model, creating an embedded query vector.


Step 5: This embedded query vector is used to search our vector DB.


Step 6: Based on the results of the vector similarity search, the vector DB passes back the top-n relevant contexts to the LLM.


Step 7: Using the original prompt from Step 3 together with the retrieved, ranked context, the LLM generates an answer for the user (a minimal sketch of this query-time flow follows the steps).
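Putting Steps 3 through 7 together, the query-time flow can look roughly like this. The embed_model, vector_db, and llm objects are placeholders for whatever embedding model, vector store, and LLM client you actually use; their method names here are assumptions for illustration, not any particular library’s API.

# Illustrative query-time flow covering Steps 3-7.
def answer(question, embed_model, vector_db, llm, top_n=4):
    query_vector = embed_model.embed(question)           # Steps 3-4: embed the prompt
    contexts = vector_db.search(query_vector, k=top_n)   # Steps 5-6: top-n similar chunks
    prompt = (
        "Use only the context below to answer.\n\n"
        + "\n\n".join(contexts)
        + f"\n\nQuestion: {question}\nAnswer:"
    )
    return llm.generate(prompt)                           # Step 7: grounded answer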


This pattern of using a vector store to index our data and give the LLM a targeted context to work with is highly effective in producing higher-quality answers in our assistants. Other approaches like fine-tuning can further adapt the base LLM’s parameters to better ‘understand’ our datasets and expected answers.

There are several available vector embedding methods and tools to help streamline your code for this. In ExpertSense RepoAI, we use LangChain for embeddings.
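As one concrete possibility, the indexing side (Steps 1 and 2) with LangChain and a local Chroma store might look like the sketch below. The imports vary by LangChain version, and the file path, chunk sizes, and model name are illustrative assumptions rather than part of any specific product.

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma

# Load and chunk a document (path and chunk sizes are illustrative).
docs = PyPDFLoader("contracts/agreement.pdf").load()
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = splitter.split_documents(docs)

# Embed the chunks and persist them in a local Chroma vector store.
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
vector_db = Chroma.from_documents(chunks, embeddings, persist_directory="./chroma_index")

# Later, at query time, retrieve the most similar chunks for a question.
hits = vector_db.similarity_search("What is the termination clause?", k=4)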

 

Scaling Considerations

Indexing process

Once we build our LLM app, go past the testing / training phase, achieve a good fit for our datasets, and are happy with the results, we start looking at our production datasets and deployment / scaling considerations.

Here, we first look at the volume of data we need to index. For small and mid-sized datasets, it might be easier to simply brute-force it and let the index job run for many hours overnight. For larger datasets spanning tens of thousands of documents, this is simply not feasible. We need to parallelize the indexing process to have it complete in a reasonable time. One option is to combine Ray and LangChain.
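A rough sketch of what that parallelization could look like, using Ray tasks to embed batches of chunks concurrently: the batch size, model name, and the chunk_texts variable are assumptions for illustration.

import ray
from langchain_community.embeddings import HuggingFaceEmbeddings

ray.init()

@ray.remote
def embed_batch(texts):
    # Each Ray worker loads its own copy of the embedding model
    # and embeds one batch of chunk texts.
    model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
    return model.embed_documents(texts)

# chunk_texts is assumed to be the list of chunk strings produced by your splitter.
batches = [chunk_texts[i:i + 512] for i in range(0, len(chunk_texts), 512)]
vectors = ray.get([embed_batch.remote(b) for b in batches])
# Flatten `vectors` and write them (with their texts/metadata) into the vector store.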

 

Vector Store

Similar to the indexing process, smaller datasets will work well with a single-node vector store such as ChromaDB or PGVector (an extension that can be installed on PostgreSQL to support vector embeddings and similarity search).
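For example, a single-node ChromaDB instance can be stood up and queried in a few lines; the storage path, collection name, and sample texts below are just placeholders.

import chromadb

# A local, persistent single-node Chroma store (no separate server needed).
client = chromadb.PersistentClient(path="./chroma_data")
collection = client.get_or_create_collection("enterprise_docs")

# Chroma embeds the documents with its default embedding function
# unless you supply your own.
collection.add(
    ids=["doc-1", "doc-2"],
    documents=["Payment terms are net 30 days.", "The contract renews annually."],
)

results = collection.query(query_texts=["When do we have to pay?"], n_results=2)
print(results["documents"])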

For larger datasets, we recommend a cloud service such as Pinecone, Astra DB, or Amazon OpenSearch, which puts the scaling headaches on the service provider. In addition, you can also consider a distributed, scalable database such as Cassandra or DataStax Enterprise.

