How do we serve a response with confidence, if we don’t know how confident we should be about the response?
Author: Soham Chatterjee
Over the last few weeks, we have been building an LLM product of our own: a Chrome extension that helps non-native speakers improve their English writing. The goal was to experience first-hand the challenges of taking an LLM to production. The extension is free to use and open-source; all you need is an OpenAI API key.
In no particular order, here are the major challenges we faced while building this product.
One of the biggest challenges with using LLM APIs is the lack of SLAs or commitments on endpoint uptime and latency from the provider. While building our application, response times from the API were inconsistent. Creative workflows demand fast responses so users can stay in a flow state; waiting even a few extra seconds for a model's result can break that flow.
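Without provider-side latency guarantees, the client has to defend itself. A common pattern is to wrap the API call in retries with exponential backoff and an overall deadline. This is a minimal sketch (not from the original post); `fn` stands in for whatever function makes the actual API request:

```python
import time

def with_retries(fn, max_attempts=3, base_delay=0.5, deadline_s=10.0):
    """Call fn(), retrying on exception with exponential backoff.

    Gives up once max_attempts are exhausted, or once the next retry
    would push past the overall deadline, so the user is never left
    waiting indefinitely for a slow endpoint."""
    start = time.monotonic()
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            out_of_time = time.monotonic() - start + delay > deadline_s
            if attempt == max_attempts or out_of_time:
                raise
            time.sleep(delay)
            delay *= 2
```

In a UI like a Chrome extension, the deadline matters more than the retry count: past a few seconds it is usually better to show an error than to keep the user waiting.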
Prompt engineering, the craft of writing prompts for the model, is another challenge: results from the same prompt can be unpredictable. Best practices for today's prompts may not carry over to future models, making their effectiveness hard to predict. The model's natural-language output can also be ambiguous and inconsistent, especially when you need to parse specific information out of it, which makes your product unreliable. Providing examples of the expected output in the prompt makes the output more predictable, but even these tend to fail on complex problems.
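Providing examples in the prompt is usually called few-shot prompting. A minimal sketch of assembling such a prompt for a writing-improvement task (the instruction and example sentences here are illustrative, not the ones we shipped):

```python
def build_few_shot_prompt(instruction, examples, query):
    """Assemble a few-shot prompt: each example demonstrates the exact
    output format we want the model to imitate, which also makes the
    response easier to parse reliably."""
    parts = [instruction, ""]
    for sentence, rewrite in examples:
        parts += [f"Sentence: {sentence}", f"Rewrite: {rewrite}", ""]
    # End with the bare label so the model completes just the rewrite.
    parts += [f"Sentence: {query}", "Rewrite:"]
    return "\n".join(parts)
```

Ending the prompt with the same label the examples use ("Rewrite:") nudges the model to answer in the format you expect, which is what you rely on when parsing the response.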
Chaining prompts in a complex product compounds these inconsistencies, leading to incorrect and irrelevant outputs, often called hallucinations. It also hurts reproducibility: the same prompt with the same settings may produce different results, making consistency in the product hard to guarantee. Worse, LLMs hallucinate with extreme confidence, which makes hallucinations difficult to spot.
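One cheap mitigation for non-reproducible output is to sample the same prompt several times and only trust an answer the samples agree on (a simple self-consistency check; this sketch is ours, not from the post, and `sample_fn` stands in for the actual API call):

```python
from collections import Counter

def majority_answer(sample_fn, prompt, n=5):
    """Sample the model n times on the same prompt and return the most
    common answer plus an agreement score; a low score is a cheap
    signal that the output should not be trusted as-is."""
    answers = [sample_fn(prompt) for _ in range(n)]
    best, count = Counter(answers).most_common(1)[0]
    return best, count / n
```

The agreement score gives you a crude confidence number to threshold on, at the cost of n times the API spend per query.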
Another significant challenge is the lack of adequate evaluation metrics for LLM output. It is hard to serve a result with confidence when you don't know how confident you should be about it.
An incorrect result in the middle of the chain can cause the remaining chain to go wildly off track. In many cases, getting the chain back on track with prompting is very difficult. But how can you even tell if your chain is off track? And how can you check if the correcting prompt has successfully brought the chain back on track?
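One partial answer is to validate each intermediate output before feeding it to the next step, so the chain fails fast instead of drifting. A minimal sketch (our illustration, with toy step and validator functions):

```python
def run_chain(steps, first_input):
    """Run a chain of (step_fn, validate_fn) pairs, failing fast at
    the first step whose output does not pass its validator, instead
    of letting a bad intermediate result derail every later step."""
    value = first_input
    for i, (step, validate) in enumerate(steps):
        value = step(value)
        if not validate(value):
            raise ValueError(f"chain went off track at step {i}: {value!r}")
    return value
```

The validators can be as simple as format checks (does the output parse? is it non-empty? is it in the expected language?), which catch a surprising share of off-track chains without needing a real evaluation metric.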
Our biggest problem, and the one that caused the most delays? API endpoint deprecation. When we started building our demo, we used OpenAI's DaVinci-002 model and created a whole set of carefully crafted, finetuned, few-shot prompts for that API. It worked really well. A few weeks into the project, however, OpenAI deprecated that API and advised developers to move to the DaVinci-003 endpoint. Unfortunately, this wasn't an easy transition.
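One lesson we would draw from this: keep the endpoint name and the prompts tuned for it versioned together in one place, so a deprecation means adding a new config and re-testing, not hunting through the codebase. A sketch of that idea (the config structure and prompt template are our illustration, not the extension's actual code):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ModelConfig:
    """Bundle an endpoint name with the prompts tuned for it, since
    prompts crafted for one model rarely carry over unchanged."""
    model: str
    prompt_template: str
    temperature: float = 0.0

    def render(self, **fields):
        # Fill the template with request-specific fields.
        return self.prompt_template.format(**fields)

# Hypothetical config; swapping endpoints becomes a one-line change.
DAVINCI_003 = ModelConfig(
    model="text-davinci-003",
    prompt_template="Improve this sentence: {sentence}",
)
```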
Trust and security issues also pose a challenge for deploying language models. There are concerns about how API providers use your data, especially in light of recent news about proprietary code leaks, and about whether OpenAI would use submitted data to train its next generation of models. Recently, though, OpenAI announced that you can opt out of having your data used for training.
The next trust issue is not knowing what data was used to train these models, since the training data directly shapes the model's output. One of the speakers at yesterday's conference (I believe it was Hanlin) mentioned that if you are building a financial model, you do not want it trained on data from r/wallstreetbets. More importantly, if you are building a model that will make medical diagnoses and suggest treatments, you definitely want to avoid training data that contains misinformation.
Finally, attacks on language models pose another challenge, as malicious actors can trick them into outputting harmful or inaccurate results. Tools like Guardrails (built by Shreya Rajpal, another speaker at yesterday's conference) are being developed to avert such attacks, but better and more reliable solutions are needed.
With these problems in mind, what are the solutions, and what are the best practices for deploying LLMs?
Many of these problems can be fixed by finetuning an LLM, or by training your own language model from scratch, instead of using an API. In our previous newsletter, we compared the two approaches and the economics of choosing one over the other. Using an API has a low barrier to entry and is a good way to build an MVP without investing in a team of engineers and data scientists. As your product attracts more users and grows in complexity, however, it becomes better to finetune or train your own model.
Beyond developing effective prompts, deploying language models in production calls for some architectural choices. Here are some guidelines we found helpful:
If you need to process a large amount of data, and your users might ask multiple questions about that data, use a vector database. The deployment pattern: create embeddings over the data once, then answer queries against those embeddings instead of sending everything back to the LLM API each time. Vector databases store and query embeddings quickly, and there are many to choose from; Chroma is one that integrates well with LangChain. Every query you answer from the index reduces your costs and your latency.
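To make the pattern concrete, here is a toy in-memory index that does what a vector database does at its core: store embeddings and return the nearest documents by cosine similarity. This is a pedagogical sketch only; in production you would use a real store like Chroma rather than this:

```python
import math

class TinyVectorIndex:
    """A toy stand-in for a vector database: keep (id, embedding)
    pairs and return the closest ids by cosine similarity at query
    time."""

    def __init__(self):
        self._items = []  # list of (doc_id, vector) pairs

    def add(self, doc_id, vector):
        self._items.append((doc_id, vector))

    def query(self, vector, n_results=1):
        def cosine(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
            return dot / norm
        # Rank stored vectors by similarity to the query vector.
        ranked = sorted(self._items, key=lambda item: cosine(item[1], vector), reverse=True)
        return [doc_id for doc_id, _ in ranked[:n_results]]
```

In the real pattern, the vectors come from an embedding model (embed each document once at ingestion, embed the user's question at query time), and the top results are either returned directly or stuffed into the LLM prompt as context.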
Using really long chains and complex agents is not something I would advise, as they don’t work reliably enough at this point to deploy to production. One way to address errors and malicious outputs is to use a watcher language model to watch the output of another language model. Your watcher language model can also fix and parse responses from the first language model. However, be aware of the increasing costs of this setup. Also, who watches the watcher?
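The watcher pattern can be sketched in a few lines. Here `primary` and `watcher` stand in for calls to two language models; the protocol (watcher returns `None` to accept, or a corrected string) is our illustrative choice, not a standard API:

```python
def guarded_generate(primary, watcher, prompt, max_fixes=1):
    """Ask the primary model for an answer, then let a watcher model
    review it. The watcher returns None to accept the output, or a
    corrected string, which we adopt, for at most max_fixes passes."""
    output = primary(prompt)
    for _ in range(max_fixes):
        fix = watcher(prompt, output)
        if fix is None:
            break
        output = fix
    return output
```

Capping `max_fixes` bounds the extra cost, though as noted above, the watcher itself can be wrong, and nothing here watches the watcher.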
While searching for best practices, I didn't find much out there. Many of the current best practices released by OpenAI, AI21Labs and others are from last year (they may as well be from the previous decade at the pace this field is moving, lol) and don't say much about production architectures or design patterns for deployment. I hope the community comes up with more of these soon.