A Survey of Production RAG Pain Points and Solutions

Posted Feb 28, 2024 | Views 2.6K

# LLMs

# RAG

# LlamaIndex

Speakers

Jerry Liu

CEO @ LlamaIndex

Jerry is the co-founder/CEO of LlamaIndex, the data framework for building LLM applications. Before this, he has spent his career at the intersection of ML, research, and startups. He led the ML monitoring team at Robust Intelligence, did self-driving AI research at Uber ATG, and worked on recommendation systems at Quora.

Demetrios Brinkmann

Chief Happiness Engineer @ MLOps Community

At the moment Demetrios is immersing himself in Machine Learning by interviewing experts from around the world in the weekly MLOps.community meetups. Demetrios is constantly learning and engaging in new activities to get uncomfortable and learn from his mistakes. He tries to bring creativity into every aspect of his life, whether that be analyzing the best paths forward, overcoming obstacles, or building lego houses with his daughter.

SUMMARY

Large Language Models (LLMs) are revolutionizing how users can search for, interact with, and generate new content. There's been an explosion of interest around Retrieval Augmented Generation (RAG), enabling users to build applications such as chatbots, document search, workflow agents, and conversational assistants using LLMs on their private data. While setting up naive RAG is straightforward, building production RAG is very challenging. There are parameters and failure points along every stage of the stack that an AI engineer must solve in order to bring their app to production. This talk will cover the overall landscape of pain points and solutions around building production RAG, and also paint a picture of how this architecture will evolve over time.

TRANSCRIPT

AI in Production

Slides: https://docs.google.com/presentation/d/1IABRSeATtbNeH2n3FeD_29GunGnhi8F2NxGhAzNDpNU/edit?usp=drive_link

Demetrios [00:00:00]: We've got one more person giving us a closing keynote. Indeed. It's my man Jerry. Where you at, dude? How you doing?

Jerry Liu [00:00:10]: Hey, Demetrios. It's great to be here. Thanks for having me.

Demetrios [00:00:13]: So when I saw you, like eight months ago, you asked me a question that I'm going to throw back at you because I feel like I didn't have much of a good answer to this question. You kind of caught me off guard with it, and I was like, damn.

Jerry Liu [00:00:27]: That's a great question.

Demetrios [00:00:29]: I'm the one that's usually asking the questions, too, but you hit me with it, and I was like, I'm not sure. What are you most excited about right now in this space?

Jerry Liu [00:00:42]: Yeah, I think part of that is the topic of the talk, which is really making a lot of these applications that people are building with LLMs production ready. And so that's actually going to be a big chunk of this talk. And then the other part is really just seeing how model advancements impact a lot of the abstractions that we have today. So Gemini Pro just came out today, right? Well, not like API access, but it has like a 10 million token context window. Of course, there's all these multimodal models coming out as well. How do you think about RAG and agents as these LLMs get better, faster, cheaper, and also expand to more modalities? I think that stuff really excites me. It's more experimental. It's kind of a function of the underlying models getting better.

Jerry Liu [00:01:23]: But that stuff is super exciting. But in the meantime, I think even just like, that's like top of funnel, right? And in the middle of funnel, everyone's already building apps, just figuring out the best practices to stabilize them. That stuff is obviously going to be incredibly useful for everybody in LLM application development right now.

Demetrios [00:01:39]: Excellent. Well, Jerry, I'm going to let you get cracking with this, and when you've finished, I'm going to sing you a lullaby. Well, I know where you're at, it's not really lullaby time, but for me it is. So go ahead and give your talk. I'll be back in 25 minutes, and anyone that has questions, feel free to throw them in the chat, and I will ask at the relevant moment. So it's all you, Jerry.

Jerry Liu [00:02:08]: Sounds good. Thanks, Demetrios. Hope you get some rest after this. The title of this talk is a survey of production RAG pain points and solutions. And so the main goal of this really is to go over how people are building RAG pipelines right now, what some of the main issues people are facing are, as well as cover a general survey of solutions to these issues. Obviously, in the past year, Gen AI has exploded, specifically around LLM application development. In a lot of enterprises that we've talked to, in terms of the applications they're building, they're building stuff like knowledge search and question answering, conversational agents, document processing and extraction, and also more agentic workflows where you can not only do reads, like search over information, but writes, like actually taking actions within the digital world where you can send an email, schedule a calendar invite, and more.

Jerry Liu [00:03:09]: So a brief primer on LlamaIndex for those of you who might not be familiar. LlamaIndex is a very popular open source framework that provides context augmentation for your LLM application. We provide all the tools to help you connect your LLMs and multimodal models with data. And so, you know, take your unstructured, semi-structured, image data, et cetera. Combine it with all the other components of the stack, vector stores, LLMs, embeddings, and you can build all these advanced applications or use cases over the data. This includes question answering, structured extraction, chat. All this falls roughly within the RAG category, as well as agents. So here's just a very simple quick start. To get started, you don't really need to do much or even really understand too much of the framework. All you have to do is load in some data, index it in some sort of in-memory storage, and then ask questions over it.

Jerry Liu [00:04:06]: And the idea is you can get some basic ChatGPT-like functionality over your own data in very few lines of code. But of course, if you've listened to some of our talks before or read through the documentation, you know that RAG goes way deeper than just five lines of code. And so let's dive into it and really talk about some of the challenges and complexities in making RAG production ready. So let's talk about the RAG stack. RAG is retrieval augmented generation. It's a popular way to try to augment your language model with data, and it consists of two main components, right? There is data parsing and ingestion, and then there's also data querying. Data parsing and ingestion is processing unstructured documents, parsing and chunking and embedding them, and putting them into a storage system, popularly a vector database. I'm sure you might have heard that term. There's Pinecone, Weaviate, Chroma. We have like 40 plus vector database integrations.

Jerry Liu [00:05:03]: And then once that data is in a vector database, then you can do querying on top of it. For instance, you can do retrieval from the existing database, take that context, put it within some prompts, and synthesize a response, and all of a sudden you have a basic version of a search or ChatGPT-like experience over your data. Naive RAG, like the five lines of code example, looks something like this. You take in your data, let's say it's a PDF document. Load it in with PyPDF, a very popular open source PDF reader. Do some sort of sentence splitting. So you set your chunk size to 256, and then split every 256 tokens.

Jerry Liu [00:05:45]: Fetch the top five chunks when you actually want to do retrieval, put them into the question answering prompt, and synthesize an answer. If you follow most introductory YouTube tutorials on any framework, this is roughly what you'll see. So naive RAG is easy to... I mean, this stuff is easy to prototype, but these RAG applications are hard to productionize. So naive RAG tends to work well for simple questions over a simple, small set of documents. Like, what are the main risk factors for Tesla, if you have a 10-K document, an annual report. If you have our quick start Paul Graham essay example, you feed in one of Paul Graham's many essays and you ask, what did he do during his time at Y Combinator? It's really looking for a specific fact within this knowledge base, and that's where naive RAG tends to work. Okay, the issue is that productionizing RAG over more questions and a larger set of data tends to be pretty hard. Once you actually try to scale this up to not just simple questions, but more complex questions, like maybe you have multipart questions, maybe you have questions that require drawing from disparate pieces of context from different document sources.
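The naive pipeline described above (fixed-size chunks, top-k retrieval, a question answering prompt) can be sketched in a few functions. This is a toy illustration in plain Python, not LlamaIndex code: the bag-of-words `embed` function stands in for a real embedding model, whitespace-separated words stand in for tokens, and every function name here is an assumption made for the example.

```python
import math
import re
from collections import Counter


def split_into_chunks(text, chunk_size=256):
    # Fixed-size splitting: each chunk holds at most `chunk_size` words.
    # (Words approximate tokens here; a real pipeline would use a tokenizer.)
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]


def embed(text):
    # Toy stand-in for an embedding model: a bag-of-words count vector.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, chunks, top_k=5):
    # Score every chunk against the query and keep the top_k most similar.
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:top_k]


def build_prompt(query, context_chunks):
    # Stuff the retrieved chunks into a question answering prompt.
    context = "\n---\n".join(context_chunks)
    return (
        f"Context:\n{context}\n\n"
        "Answer the question using only the context.\n"
        f"Question: {query}\nAnswer:"
    )
```

The string returned by `build_prompt` is what would be sent to the LLM. Every default here (`chunk_size=256`, `top_k=5`, the prompt wording) is a parameter that nothing in this pipeline tunes.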

Jerry Liu [00:07:03]: And as you add more data sources as well. Once you move beyond a single document, a single book, to multiple documents, multiple books, multiple PDF files, the performance really starts to degrade. Your response quality goes down. There's going to be a set of questions where you're not able to get back the right answers. You might face some symptoms, like bad retrieval, hallucinations. Some of the presenters might have talked about this, and it's a hard problem. One, your performance isn't very good, and two, you don't actually know how to improve this system. There's a lot of parameters throughout this entire process, and we'll take a bit of a look into why this is the case.

Jerry Liu [00:07:41]: But there's just a lot of parameters to tune, and it's actually pretty hard to try to optimize all the parameters in a way that really maximizes the accuracy of your overall system. To understand this a little bit deeper, let's look at the difference with writing LLM based applications versus traditional software. Traditional software is defined by you write code. So you define a set of programmatic rules. So you'd write a function, for instance, and given an input, you write the programmatic rules to define what the output should be. And it's fairly easy to reason, assuming you write your code in a somewhat modular way, about the expected output, right. Given this input, the set of things you're going to do on it is relatively easy to reason about, and so you can reason about what the expected output should be. Of course, there's always edge cases.

Jerry Liu [00:08:35]: If you've ever written code, there's always going to be edge cases where this thing is going to fail. But roughly speaking, it's still a simpler problem than AI, because AI powered software, if you, for instance, use a traditional machine learning model or an LLM, is defined by a black box set of parameters. The model takes in some high dimensional input, like unstructured text or images, and produces some sort of output. And it's really hard to reason about what this function space really looks like. You don't know beforehand, if you feed in a given input, what the output will look like through this model, because you'll see a set of numbers, but you don't really understand what that function space looks like. It's hard to reason about how to visualize this high dimensional space. When you optimize a machine learning model, the model parameters themselves, you do gradient descent on the model parameters and tune them. So that's fine.

Jerry Liu [00:09:28]: The parameters are fit to the data set that you define, but the surrounding parameters, the moment you try to use this model in an inference setting, the surrounding parameters are not tuned. Like, let's say you use an LLM. The LLM, of course, is pre-trained on a ton of data. But then once you actually try to use the LLM with a prompt, let's say a prompt template, that prompt template is a hyperparameter, because that has not been tuned. That's just something that you define. So one of the complexities of AI powered software is that if one component of the system is a black box, like specifically your AI model, then all the components of the system form this overall black box across your entire software stack that you define. So what this really means is, let's say your LLM is a black box because it contains a set of parameters that you can't easily observe. But the moment you try to, say, construct a RAG pipeline with data parsing, ingestion, an embedding model, retrieval,

Jerry Liu [00:10:28]: Then this entire thing just introduces more parameters, and this overall thing just becomes like a black box function that's hard to reason about, because every single one of the parameters that you introduce here affects the performance of the end system. Even if that original capability was written programmatically, because you're composing it as part of this overall ML system, you've basically constructed a bigger ML black box. A classic example is chunk size, right? Obviously, on its own, chunk size is easy to test for and reason about. If you want to split a sentence or a paragraph into 256 tokens each, you can do that and reason about it, write unit tests for what that looks like. But how this 256 tokens or 512 affects the overall performance of your LLM system, you don't really know until you actually write this entire software and actually try to evaluate the entire thing end to end on a data set. So the high level takeaway here is there's just too many parameters for developers to figure out. Like, there's a combinatorial explosion of parameters for developers to tune to try to optimize the performance of the entire RAG pipeline.
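To make the combinatorial explosion concrete, here is a small sketch that enumerates a hypothetical search space. The parser, model, and mode names are placeholders rather than real products, and `evaluate` is assumed to be a function that builds the whole pipeline for a configuration and scores it end to end on an evaluation set.

```python
from itertools import product

# Hypothetical search space; every option name below is a placeholder.
search_space = {
    "pdf_parser": ["parser_a", "parser_b"],
    "chunk_size": [128, 256, 512, 1024],
    "embedding_model": ["embed_small", "embed_large"],
    "top_k": [2, 5, 10],
    "retrieval_mode": ["dense", "sparse", "hybrid"],
}


def all_configs(space):
    # Yield one dict per combination of parameter values.
    keys = list(space)
    for values in product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))


def grid_search(space, evaluate):
    # `evaluate(config)` must run the full pipeline on an eval set and
    # return a score; there is no way to score chunk_size in isolation.
    return max(all_configs(space), key=evaluate)


configs = list(all_configs(search_space))
print(len(configs))  # 2 * 4 * 2 * 3 * 3 = 144 full pipeline evaluations
```

Even this small grid needs 144 end-to-end runs, and each added parameter multiplies that count, which is why exhaustive search stops being practical quickly.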

Jerry Liu [00:11:42]: At the end of the day, because you're basically trying to optimize this black box system, every single one of these parameter decisions affects the accuracy of your overall RAG pipeline. What are some of the parameters that you should tune? Just as an example, let's say we're trying to build RAG over a set of PDFs. Here is a set of options that you need to think about. One is, which PDF parser should I use? Second is, how do I chunk my documents? What strategy should I use? What's the chunk size? How do I process embedded tables and charts? Which embedding model should I use? During retrieval, if I'm doing dense retrieval, should I do top k equal to five, equal to two? Should I do sparse search, hybrid search? And of course, what's the LLM prompt template? The solution that we're proposing here is, let's categorize. As people are building these RAG pipelines, of course there's going to be a lot of parameters to figure out, but let's try to organize the pain points by categories. So we pick the most common pain points, and for each one of these pain points, locate where they are in this overall stack and try to figure out what the best practices are for resolving them. A paper that came out pretty recently, called "Seven Failure Points When Engineering a Retrieval Augmented Generation System," takes an initial stab at it, and they have a cool diagram here which basically outlines, right, here's the indexing process where you go from documents to chunker to database, and then here's the query process.

Jerry Liu [00:13:17]: As you go from query to rewriting, retrieval, re-ranking and synthesis. And throughout the stages of this RAG pipeline, you can identify pain points at these different stages and try to propose different solutions for each of these. An article from one of our community contributors, Wenqi, actually builds upon this even more. It expands from seven pain points to twelve pain points, and it's called "12 RAG Pain Points and Proposed Solutions." So it takes the original seven and adds on some additional aspects, like being able to query tabular data, being able to deal with complex documents, and handling security issues. So maybe for the rest of this talk we'll go over some of these pain points and solutions, and also briefly talk about what's next after this. So a lot of the pain points that users face in building RAG can be seen as response quality related. Basically, a lot of the issues boil down to, you ask a question, you're not able to get back the right result.

Jerry Liu [00:14:24]: So more specifically, there's seven pain points that reflect this overall issue that pops up. One is the context itself might be missing in the knowledge base, you just might not have this data available. The context might be missing in the retrieval pass, the context might be missing after re-ranking, the context isn't extracted, the output is in the wrong format, the output has an incorrect level of specificity, and the output is incomplete. We'll go over some of these components, but at a very high level, this is just seven pain points for response quality. Some other pain points are not response quality related. They relate to, for instance, scalability, security, and also more use case specific pain points. Like sometimes you have certain types of data, like tabular data or complex documents, where the naive RAG approach doesn't really work and you want to be able to propose a specific flow or architecture for being able to solve it. If we have time, we have a specific section on being able to parse complicated PDFs because that's our favorite section and we'll actually offer a sneak preview into what we're building.

Jerry Liu [00:15:45]: Great. So let's figure out solutions. The first is, this is kind of dumb, but sometimes the context might be missing in the knowledge base. You ask a question, you're not able to get back a good result, and it's because that data might not actually exist in your vector database, or it exists, but it's in a garbage format. There are some solutions to this. Picking a good document parsing solution is quite important. If you evaluate, we have ten plus different PDF parsers within LlamaHub, our community hub for different types of integrations, and you'll find that the result quality varies wildly depending on which one you use. If you're parsing HTML, for instance, leaving the HTML tags in versus not leads to a dramatic difference in performance. Another is just, sometimes the context exists, but it doesn't contain relevant global context.

Jerry Liu [00:16:46]: For the embedding model or the LLM to really figure out how to contextualize this piece of text, a popular approach to make sure your data is high quality is to add in the right metadata for each chunk of text. So let's say you have an overall document. Once you split it up into a bunch of chunks, you want to make sure you actually add proper annotations to each chunk, so that this tells the embedding model, the vector database's metadata filtering, and the LLM more information beyond just what's in the raw text. So it can actually figure out what article it's from, why it's relevant, that type of thing. Another common issue is that when you build a prototype RAG, you typically build it over a static data source. But in production, data sources typically update, and sometimes they update pretty frequently. And so sometimes you want to set up a recurring data ingestion pipeline so that you can actually properly process new updates over time. The next step here is that maybe the data exists but maybe it's not being retrieved.
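One way to picture the per-chunk metadata idea is the sketch below. The `Chunk` class and the helper names are illustrative assumptions, not a framework API: document-level metadata is copied onto each chunk, prepended to the text that gets embedded, and reused for metadata filtering.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    text: str
    metadata: dict = field(default_factory=dict)


def annotate(chunk_texts, doc_metadata):
    # Copy document-level metadata onto every chunk, plus its position.
    return [
        Chunk(text=t, metadata={**doc_metadata, "chunk_index": i})
        for i, t in enumerate(chunk_texts)
    ]


def text_for_embedding(chunk):
    # Prepend the metadata so the embedding model (and later the LLM)
    # sees which document the chunk came from, not just the raw text.
    header = "\n".join(f"{k}: {v}" for k, v in chunk.metadata.items())
    return f"{header}\n\n{chunk.text}"


def filter_by_metadata(chunks, **required):
    # Keep only chunks whose metadata matches every required key/value.
    return [c for c in chunks if all(c.metadata.get(k) == v for k, v in required.items())]
```

For example, `annotate(["Revenue grew."], {"title": "Annual Report", "year": 2023})` yields a chunk whose embedded text starts with `title: Annual Report`, so a query that mentions the annual report has something to match on even when the raw sentence never names its source.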

Jerry Liu [00:17:55]: Let's say you do top-k retrieval and you're not actually able to fetch the relevant context based on the user query. Some basic approaches here in terms of how do you think about this and how do you solve this. Some of it actually does relate to maybe you want to add in some metadata to contextualize the context a little bit more, but you really want to do hyperparameter tuning for your chunk size and your top k specifically. A quick way to debug this is, if you set top k to like 100 or something very big, is the context actually showing up out of the 100? If it's still missing, there's likely some issue with the data representation. But if it shows up, then this probably just means you need to refine your retrieval algorithm as well as the embedding model that you pick. A follow up and very related step is, oftentimes what people do in a retrieval system is you not only have an initial retrieval pass, but you set the top k purposely very big in the initial retrieval and then do some sort of re-ranking or filtering down the road so that you're able to get back a re-ranked set of results after. Of course, even that dense retrieval plus re-ranking might still not give you back the relevant context. It turns out there's a whole grab bag of different tips and tricks for you to try out even fancier retrieval methods, some of which are actually unique to LLMs. There's this whole world of information retrieval literature that existed pre-LLMs, but post-LLMs you can actually deal with the fact that you, for instance, can decouple chunks used for embedding versus chunks used for the LLM, and also use the LLM to do some sort of query planning reasoning for more interesting and fancy retrieval techniques.
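The top-k debugging check and the retrieve-then-re-rank pattern described above might look like the following sketch. The function names and the three diagnosis labels are assumptions for illustration, and the two scoring functions are passed in as plain callables so that any retriever or re-ranker could be plugged in.

```python
def diagnose_retrieval(ranked_ids, expected_id, top_k=5, wide_k=100):
    # `ranked_ids` holds chunk ids ordered best-first by the retriever.
    if expected_id in ranked_ids[:top_k]:
        return "ok"
    if expected_id in ranked_ids[:wide_k]:
        # The chunk is retrievable but ranked too low: tune top_k,
        # the retrieval algorithm, or the embedding model.
        return "tune retrieval"
    # Missing even from a very wide retrieval: suspect parsing,
    # chunking, or missing metadata.
    return "check data representation"


def retrieve_then_rerank(query, candidates, first_pass_score, rerank_score,
                         wide_k=100, final_k=5):
    # Stage 1: cheap scorer with a deliberately large k.
    shortlist = sorted(
        candidates, key=lambda c: first_pass_score(query, c), reverse=True
    )[:wide_k]
    # Stage 2: a more expensive scorer, applied to the shortlist only.
    return sorted(
        shortlist, key=lambda c: rerank_score(query, c), reverse=True
    )[:final_k]
```

Running `diagnose_retrieval` over a set of questions with known source chunks gives a rough split between data problems and retrieval problems before any tuning starts.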

Jerry Liu [00:19:52]: So I listed a few buzzwords here that are available in the docs, but roughly speaking, there are more interesting ways you can model this data beyond just dense retrieval plus re-ranking. Another example is you can try fine-tuning your embedding models on task specific data. This is not typically something that people do initially. In fact, when you prototype something you probably don't want to fine-tune. But if you're an enterprise company, you really have a lot of domain specific data, and sometimes you're just not getting back good retrieved results over this domain specific data. Maybe you should consider fine-tuning a model if you have a sufficient training set. Let's say that context is retrieved by the retriever, but the issue is that the LLM isn't actually able to extract out the information. This is something where we'll kind of see how the LLMs evolve over time, but it has been an issue for GPT-4 and Claude 2, and maybe not an issue for Gemini Pro as of today.

Jerry Liu [00:20:59]: Jeff Dean just posted a tweet saying the needle in a haystack analysis works great for Gemini Pro, so we'll see how that works. But the context is there, and the LLM doesn't understand it. And for a lot of current LLMs, if that context is somewhere in the middle of the context window of the prompt, the LLM just doesn't actually find it. So if you ask a question regarding that piece of context, the LLM can't answer the question. And this is a property of a lot of current models, and it's an issue because it means that you can't just arbitrarily stuff a ton of context into the prompt window of current models. I think Greg had a really nice analysis on GPT-4, and also Claude, and this is called the needle in the haystack experiment, where you inject some random fact, like Jerry likes Taco Bell or something, in some arbitrary piece of text, and then ask a question like, what does Jerry like? So some solutions to this include prompt compression. There's some benefits to this. Like you want to compress the context, you want to maybe reduce the token cost, reduce the latency, but also try to retrieve the best results.

Jerry Liu [00:22:09]: There's also some sort of context reordering, which is kind of a fun approach. Let's say this is the retrieved set and these chunks are ranked by relevance. So node one, node two, node three, node four. You can re-rank it so that the most relevant context is always at the ends. So you actually alternate between the front and the back. So node one is here, node two is here, node three is here, node four is here. So the least relevant context is somewhere in the middle. So if the LLM forgets it, that's fine. It's a fun trick. Your mileage may vary, but feel free to try it out. Another common concern is that the output is in the wrong format.
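The reordering trick described above, with the most relevant chunks at both ends and the least relevant in the middle, fits in a few lines. This is one possible reading of "alternate between the front and the back," written as a hypothetical helper rather than any framework's implementation.

```python
def reorder_for_long_context(ranked_nodes):
    # `ranked_nodes` is ordered most relevant first. Even positions fill
    # the front, odd positions fill the back, so relevance falls off
    # toward the middle of the returned list.
    front, back = [], []
    for position, node in enumerate(ranked_nodes):
        if position % 2 == 0:
            front.append(node)
        else:
            back.append(node)
    return front + back[::-1]


print(reorder_for_long_context(["node1", "node2", "node3", "node4"]))
# ['node1', 'node3', 'node4', 'node2']
```

With four nodes, node one opens the context, node two closes it, and nodes three and four sit in the middle where a model is most likely to overlook them.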

Jerry Liu [00:22:55]: I think a lot of people building RAG and just LLM applications in general expect structured outputs, structured JSON in particular. There's some general interesting practices for structured JSON, and roughly the three categories are these. How do you do text prompting to tell the LLM to output stuff in JSON? OpenAI specifically has a function calling mode where they fine-tune a model to output JSON more effectively. And there's also token level prompting methods like Guidance and LMQL, and some other ones, where you can actually just enforce the fact that it's JSON by directly inserting the tokens at decoding time. This is typically not something you can do via the high level LLM APIs that exist today, but if you have direct access to the model, you can just directly insert the template that is supposed to happen and just have the LLM fill in the slots. So basically the LLM just has to fill in some tokens in the middle. It's a cool project. We integrate with Guidance, so definitely make sure to give that a try. And the last bit before I maybe go to the concluding section is just the incomplete answer. So what if you have like a complex multipart question? Naive RAG is primarily good for answering simple questions about specific facts, but sometimes you ask a question and it's a multipart thing, and top-k RAG gives you back some context, but really not all of it.
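Of the three categories mentioned above, the text prompting one is the simplest to sketch: ask for JSON, validate what comes back, and retry on failure. The `llm` argument below is a hypothetical callable that takes a prompt string and returns the model's raw text; it is an assumption for this example and not a real client API.

```python
import json


def build_json_prompt(question, required_keys):
    keys = ", ".join(f'"{k}"' for k in required_keys)
    return (
        f"{question}\n"
        f"Respond with a single JSON object containing these keys: {keys}. "
        "Do not add any text outside the JSON."
    )


def parse_json_output(raw, required_keys):
    # Return the parsed dict if `raw` is a JSON object that contains
    # every required key, otherwise None.
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or any(k not in data for k in required_keys):
        return None
    return data


def ask_for_json(llm, question, required_keys, max_attempts=3):
    # `llm` is a hypothetical callable: prompt string in, raw text out.
    prompt = build_json_prompt(question, required_keys)
    for _ in range(max_attempts):
        parsed = parse_json_output(llm(prompt), required_keys)
        if parsed is not None:
            return parsed
    raise ValueError("model did not return valid JSON")
```

Prompting plus validation can only detect and retry bad output. Function calling and token level methods aim to avoid producing it in the first place.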

Jerry Liu [00:24:22]: To answer the question, this is an entire talk on its own, but this really goes into what's beyond naive RAG: moving from basic RAG approaches towards adding more agentic reasoning to break down complex questions into smaller ones, solve longer running tasks and more vague research problems. And how do you compose that on top of a RAG pipeline to give you back a response? So everything from a ReAct loop, like some sort of query planning and execution, to incorporating tool use with a RAG pipeline and other sources, these are all components in being able to build some sort of agentic reasoning layer on top of a RAG pipeline that can handle multipart questions. Great. I'm going to skip that last bit for now. And I think the very last bit I want to talk about is how do you do RAG over complex documents? I gave a little heart on that last section over there, and basically I want to specifically talk about complex PDFs with a lot of different sections and embedded tables. This sounds like a niche problem, but it actually turns out to be a pretty popular issue that a lot of people face. If you have documents with, for instance, embedded tables, and you do naive RAG, it fails terribly. If you do naive RAG, what you're going to do is you're going to end up splitting tables in half, you're going to end up collapsing stuff, destroying spatial information, and this of course leads to hallucinations.

Jerry Liu [00:25:56]: We were one of the first to basically come up with advanced retrieval techniques sometime last year to basically say, to really deal with this problem, you should think about modeling a PDF as some sort of node hierarchy. So you have the text, you can split the text itself section by section and make sure you have some sort of proper parsing algorithm for that. But for tables specifically, if you have a good table extractor, then you can represent that table by its text summary. And so that you basically have like a little document graph, like a PDF represented by a set of nodes. Some nodes are linked to text and some nodes linked to tables. And if you vector index this entire layer of nodes, then this allows you to do better semantic retrieval over different elements in your document. And from there you can ask different types of questions over this data. And so we were one of the first to come up with this, and we showed that assuming you can actually have a good table extractor, you can actually ask pretty complex questions and get back good answers.
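The node hierarchy described above can be sketched as a flat list of nodes in which table nodes are indexed by their text summary but carry the full table as payload. The class, the word-overlap scorer (a toy replacement for vector similarity), and the function names are all assumptions for illustration. The table summaries are assumed to come from a separate table extractor and summarizer.

```python
import re
from dataclasses import dataclass


@dataclass
class Node:
    node_id: str
    kind: str        # "text" or "table"
    index_text: str  # what gets indexed: section text or table summary
    payload: str     # what is handed to the LLM: section text or full table


def build_nodes(sections, tables):
    # `sections` is a list of section texts; `tables` is a list of
    # (summary, full_table) pairs from an assumed table extractor.
    nodes = [Node(f"text-{i}", "text", s, s) for i, s in enumerate(sections)]
    nodes += [
        Node(f"table-{i}", "table", summary, table)
        for i, (summary, table) in enumerate(tables)
    ]
    return nodes


def overlap(query, text):
    # Toy relevance score: number of shared lowercase word tokens.
    q = set(re.findall(r"[a-z0-9]+", query.lower()))
    t = set(re.findall(r"[a-z0-9]+", text.lower()))
    return len(q & t)


def retrieve_payloads(query, nodes, top_k=1):
    # Match against index_text, but return the linked payload, so a hit
    # on a table summary hands the LLM the whole, unsplit table.
    ranked = sorted(nodes, key=lambda n: overlap(query, n.index_text), reverse=True)
    return [n.payload for n in ranked[:top_k]]
```

Because the table is stored whole behind its summary, chunking never splits it in half, which is the failure mode of the naive approach.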

Jerry Liu [00:26:58]: So the only missing component is, how do I parse out these tables from this data? If you look at PyPDF again, it works for prototypes, but for a lot of very complex documents, it tends to break apart the spatial information within these tables, and it's not really able to give you back the right results. So it extracts this type of information into a messy format that's impossible to pass down into more advanced ingestion and retrieval algorithms. So this is something I wanted to offer to the audience today. Maybe don't share it publicly yet because we haven't actually announced it, but it's basically a beta preview where we built a specialized parser designed to let you build RAG over these complex PDFs. You can extract tables from PDFs, run recursive retrieval, and check out the link here. I can share these links too, but basically you can ask questions over semi-structured data. So it's a favorite problem that we've been trying to solve over the last few months.

Jerry Liu [00:28:07]: Great. And so I have an entire set of slides after this, but I'm out of time. And basically the key questions that we all want to focus on are, one, how do you make RAG production ready? And we covered a lot of the pain points and proposed solutions there. And the second is, what is next for RAG? Right? And the theory for us is, I think as these models get better, it's really just going to become more agentic, and these LLMs are going to be able to reason not just in a one-shot manner, but in a repeated sequence, and do search and retrieval over complex problems and also perform actions for you. So with that said, thank you, and I'll pass back to you, Demetrios.

Demetrios [00:28:50]: Oh dude, you ended quick there. I was still ready to keep going. I love that you gave us a little bit of a sneak peek at what you got coming. Man. That is so cool.

Jerry Liu [00:29:06]: Thanks. Yeah, we're excited to release it and just stay tuned. We'll have a public announcement soon.

Demetrios [00:29:12]: So awesome, man, you brought the heat. I knew you were going to make it amazing, but I was not expecting it to be that good. This literally was everything that I could have asked for and more. It was so real. You went through so many of these issues that we all are working on day in, day out. Anyone doing anything with RAG can relate so much. So, Jerry Liu, everyone. Thank you for coming. Fill out the good old feedback form and we're going to peace out.
