MLOps Community

Building Advanced Question-Answering Agents Over Complex Data

Posted Aug 15, 2024
# LLMs
# RAG
# LlamaIndex
Jerry Liu
CEO @ LlamaIndex

Jerry is the co-founder/CEO of LlamaIndex, the data framework for building LLM applications. Before this, he spent his career at the intersection of ML, research, and startups: he led the ML monitoring team at Robust Intelligence, did self-driving AI research at Uber ATG, and worked on recommendation systems at Quora.

SUMMARY

Large Language Models (LLMs) are revolutionizing how users search for, interact with, and generate new content, leading to a huge wave of developer-led, context-augmented LLM applications. Recent stacks and toolkits around Retrieval-Augmented Generation (RAG) enable developers to build applications such as chatbots over their private data. However, while setting up basic RAG-powered QA is straightforward, solving complex question-answering over large quantities of complex data requires new data, retrieval, and LLM architectures. This talk provides an overview of these agentic systems, the opportunities they unlock, how to build them, and the remaining challenges.

TRANSCRIPT

Jerry Liu [00:00:09]: Yeah, thanks for having me. Hey everyone, hope you're enjoying the AI Quality Conference so far. For the purposes of this talk, my talk will be on the importance of data quality for advanced RAG. I'm Jerry, co-founder and CEO of LlamaIndex. For those of you who don't know, LlamaIndex is a data framework and platform for helping you build LLM applications over your data, like RAG, agents, and a lot more. We have some exciting releases this week. Stay tuned. They're not in the slides, but yeah, keep your eyes out.

Jerry Liu [00:00:38]: Great. So most of you probably know what RAG is. That's probably why you're here. There are two main components to RAG: data parsing and ingestion, and data querying, which is the retrieval and LLM prompting piece. And I think many of you, if you're just starting off building RAG, have probably built something of the following form. You do some naive parsing using an open-source parser, you do some naive splitting where you just split your pages down the middle, every few sentences or so, or every paragraph. You do top-k dense retrieval, and then you stuff it all into a prompt.
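
A minimal sketch of that naive pattern, in the LlamaIndex quick-start style; the data directory, question, and top-k value here are placeholder assumptions rather than anything from the talk:

```python
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

# Naive parsing: load documents with an off-the-shelf reader
documents = SimpleDirectoryReader("./data").load_data()

# Naive splitting + embedding: default sentence splitting into a flat vector index
index = VectorStoreIndex.from_documents(documents)

# Top-k dense retrieval, stuffing the retrieved chunks into the prompt
query_engine = index.as_query_engine(similarity_top_k=3)
print(query_engine.query("What was operating cash flow in 2023?"))
```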

Jerry Liu [00:01:14]: So if you've built RAG systems, this pattern should seem relatively familiar to you. And I think one of the main challenges with this overall approach is that it doesn't really work all the time. It's very easy to prototype: you can build a prototype in about five to ten minutes if you have some experience with Python and follow one of our quick-start tutorials. But we've talked a lot about this: there are a lot of steps you need to take to ensure that you're actually able to get to production and meet the quality bar necessary to build something that can respond well to any question you want to ask. Naive RAG approaches tend to work well for relatively simple questions over a small set of simple documents. If you have five PDFs and you want to ask a question about a specific fact in one of those PDFs, naive RAG generally works pretty well.

Jerry Liu [00:02:10]: Embeddings are decent these days. They'll be able to surface the right chunk for you, and then LLMs can generally synthesize the right answer given the chunk in the context. However, productionizing RAG over more questions and a larger set of data is a lot more challenging. Some of the failure modes that we see when talking to developers within the enterprise include the following. The first is simple questions over complex data. Even if your question is simple, if the data itself is complicated, you might not be able to surface the right answer all the time; the system might hallucinate an answer for you.

Jerry Liu [00:02:48]: And we'll define what complex data means in just a bit. Another failure mode is simple questions over multiple documents. Let's say you're not just asking a question over one PDF, but over ten, or a hundred, or even a million. And the third one is just not being able to answer more vague, complex, multi-part questions. RAG systems tend to work well for relatively targeted questions, but the moment you try to ask something a little bit more complex, they tend to break. So the top priority should be figuring out how to get high response quality on the set of representative questions that you want to ask. And we've talked about this a little bit, but there are generally two main focus areas for improving your RAG systems. One is improving data quality, which is going to be the focus of this talk.

Jerry Liu [00:03:38]: And the other is this whole separate thread about agents. I'm sure many of you are hearing a lot about agents throughout the course of today, as well as generally on the Internet, and we have talked about this as well, but I only have about ten minutes left, so we're going to talk about data quality. Let's focus on improving data quality. This part is actually pretty underrated, because data quality is one of those things that every machine learning engineer understands: the importance of data preprocessing, feature engineering, and so on. But the thing about LLMs is that it's so easy to get started by just stuffing in a bunch of text that oftentimes you let these considerations slip. And this can introduce a barrier to actually getting to production with your RAG system. One of the first principles we believe in is the idea of garbage in, garbage out. It's a principle that's true in ML, and it's also true in LLM app development.

Jerry Liu [00:04:31]: Good data quality is a necessary component of any production LLM app. And the thing is, if you don't have the right data processing layer, you're not going to get clean data. And if you don't have clean data, LLMs are going to have a hard time giving you back good response quality, even the very powerful LLMs of today. Some of the main components of data processing include the following: there's parsing, there's chunking, and there's indexing. We'll talk about each of these components in just a bit. But first, one common use case we see over and over again within the enterprise is the idea of a complex document. A lot of documents can be classified as complex.

Jerry Liu [00:05:16]: So basically, instead of just having paragraphs of text, a document might have a lot of other elements in it. It might have embedded tables. It might be a PowerPoint presentation with different spatial layouts. It might have charts, either in the form of actual SVG shapes or as rasterized images. There might be headers and footers that you might want to extract as metadata. And so oftentimes we see that naive RAG indexing pipelines fail over these documents. When you just do the naive slicing and all this stuff, you get back a whole bunch of hallucinations when you ask a question over, for instance, a table or a chart or an image. One of the main reasons this is the case is just bad parsing.

Jerry Liu [00:06:01]: If you have a bad PDF parser, you're going to have a relatively underperforming RAG pipeline. So if you use PyPDF, for instance, to parse this table image right here, which is a table extracted from a financial report of a large bank, you notice that a lot of the text and numbers are extracted into this format. It tends to be relatively messy; all the numbers and text are blended together. And we find that when you ask questions over text that's badly parsed, even GPT-4o, even Opus, will oftentimes hallucinate the answer if you ask about a specific value in one of these tables. So one of the things that we built at LlamaIndex is LlamaParse, which is a special document parser designed to let you build RAG over complex docs. Of course, there are other players working on this problem as well. And I think the overall motivation of a lot of these projects is to take unstructured data that can be very complex and somehow structure it in the right way so that LLMs can understand it.
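
As a rough illustration of that naive parsing path (the file name is a placeholder), plain text extraction with pypdf tends to flatten table structure before the LLM ever sees it:

```python
from pypdf import PdfReader

# Placeholder path standing in for the bank's financial report from the slide
reader = PdfReader("bank_annual_report.pdf")
raw_text = "\n".join(page.extract_text() for page in reader.pages)

# Table rows, headers, and numbers come out blended together as run-on text;
# chunking and embedding this directly is the failure mode described above.
print(raw_text[:500])
```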

Jerry Liu [00:07:08]: Some of the core capabilities of LlamaParse include being able to properly extract and format tables and charts, take natural language parsing instructions, extract images so you can build multimodal RAG, and support a bunch of different document types that are common today: PDFs, PowerPoints, Docx files, HTML, and more. Taking a step back, one of the core ideas here is actually pretty interesting: parsing itself can improve performance by quite a bit. Even without advanced indexing and retrieval, if you're still just doing the dumb thing of chunking every few sentences and then doing top-k dense retrieval without any sort of hybrid search or BM25, good parsing by itself helps to reduce hallucinations by a lot. So what we did was we took the Caltrain weekend schedule and ran it through LlamaParse, and we get back a well spatially laid out text representation of the schedule. It turns out that models from OpenAI and Anthropic, and some of the other state-of-the-art LLMs out there, understand text formatting pretty well.
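
Here is a hedged sketch of what parsing a document like that schedule with LlamaParse can look like; the file path, parsing instruction, and result type are illustrative assumptions, and an API key is required:

```python
from llama_parse import LlamaParse

parser = LlamaParse(
    result_type="markdown",  # keep tables and layout in a structured text form
    parsing_instruction="This is a transit schedule; preserve table alignment.",
)
documents = parser.load_data("caltrain_weekend_schedule.pdf")  # placeholder path

# The parsed text preserves the spatial layout of the schedule,
# which state-of-the-art LLMs can read far more reliably than blended text.
print(documents[0].text[:500])
```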

Jerry Liu [00:08:21]: So when things are well spaced and well aligned, they can actually answer questions over this text a lot better than if that text was not formatted well. So this is just an example with PyPDF, and then this is an example with LlamaParse. If you feed in the formatted text from LlamaParse, this text representation right here, and ask a question about a specific train, you get back the right times for that train, and you're not able to get back the right response if you use a naive parser that messes up the formatting of the table. What's interesting is that you can combine advanced parsing with advanced indexing and retrieval. And so this is basically about steps towards a slightly more sophisticated pipeline than pure flat chunking and indexing. What we talk about here is the overall idea of hierarchical indexing and retrieval to model the heterogeneous types of data within a document, whether it's unstructured text, tabular, or multimodal. The example shown here is this diagram: a PDF can be broken down into a bunch of text chunks, images, and tables, and a generally good pipeline looks something like the following.

Jerry Liu [00:09:36]: You parse the documents into a set of multimodal elements: text chunks, tables, images, and more. And then for each one of these elements, you extract one or more text representations that can be indexed. So for a table or an image, you can extract a summary, or multiple summaries or descriptions, of that element. And what you want to do is embed and index the summaries, the text representations, which link to the underlying object. For tables, you might extract summaries or table cells as the thing that you actually feed to your embedding model. What you end up storing in a vector database are these text representations, but they all link back to the underlying entity. And during retrieval you want to do some sort of two-step retrieval approach.
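
A minimal sketch of that linking pattern using LlamaIndex's IndexNode and RecursiveRetriever; the element id, summary text, and query are illustrative, and in practice the elements and summaries would come from the parsing step described above:

```python
from llama_index.core import VectorStoreIndex
from llama_index.core.retrievers import RecursiveRetriever
from llama_index.core.schema import IndexNode, TextNode

# The full source element (e.g. a parsed cash-flow table), kept out of the embedding step
table_node = TextNode(text="<full cash-flow table text>", id_="table_3")

# A small summary node that gets embedded and indexed, linking back via index_id
summary_node = IndexNode(
    text="Summary: Netflix cash-flow statement, 2021-2023.",
    index_id="table_3",
)

# Step 1: retrieve over the summaries
vector_retriever = VectorStoreIndex([summary_node]).as_retriever(similarity_top_k=1)

# Step 2: follow the link from the summary to the underlying source element
retriever = RecursiveRetriever(
    "vector",
    retriever_dict={"vector": vector_retriever},
    node_dict={"table_3": table_node},
)
nodes = retriever.retrieve("What were Netflix's cash flows?")
```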

Jerry Liu [00:10:23]: We call this recursive retrieval, because you basically just follow the links along the document graph and continue retrieving until you get all the context: you retrieve the indexed bits, like the summaries and the sentences, and then you fetch the source elements, whether that's the raw document text, the table, or the image. This two-step retrieval approach oftentimes works a lot better, because it lets you have different representations of the same element, so depending on the question you want to ask, it'll be able to surface the relevant item, compared to flat indexing and naive retrieval. An example shown here is a pipeline that we built over an annual report, a financial document. You're able to get back answers without hallucinations when you ask about certain information about cash flows for Netflix, for instance, within this table, whereas a naive pipeline gives you back the wrong response. So that's one of the core ideas. The second idea here is that, besides parsing, there are also general chunking and indexing tips that people have come up with over the course of the past year and promoted as generally good best practices for anyone building a RAG pipeline. The first, and this is something that we found, is that page-level chunking is oftentimes a strong baseline for your documents.

Jerry Liu [00:11:44]: So if you have a bunch of PDFs or PowerPoints, generally speaking, a lot of the information that you need is going to be contained within a single page. Obviously there are exceptions: a section can span multiple pages, and you might have to devise clever ways of injecting metadata so you can filter for it later. But generally speaking, instead of worrying about a specific chunk size, if you just chunk at the level of pages, you're going to get decent results. A more advanced approach, once long-context models take off and the cost and latency go down even further, is something we're very interested in: document-level chunking, which will further reduce the need to worry about very fine-grained chunking parameters like chunk size, because you can just stuff entire documents into the LLM prompt window. Some other tips here include trying to preserve semantically similar content when chunking.
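
As a small sketch of that page-level baseline (the file path is a placeholder): LlamaIndex's default PDF reader already returns one document object per page, so you can index pages directly instead of tuning a chunk size:

```python
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.schema import TextNode

# The default PDF reader yields one document object per page
pages = SimpleDirectoryReader(input_files=["annual_report.pdf"]).load_data()

# Treat each page as a single retrieval unit rather than re-splitting it
page_nodes = [TextNode(text=page.text, metadata=page.metadata) for page in pages]

index = VectorStoreIndex(page_nodes)
query_engine = index.as_query_engine(similarity_top_k=2)
```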

Jerry Liu [00:12:42]: Preserving semantic coherence is related to the way that LlamaParse, for instance, parses complex documents, but generally speaking: don't break tables, don't break text in the middle of a section, and try to keep things relatively semantically coherent. Metadata extraction is also one of those underrated things that directly ties into this overall approach. Extracting a healthy dose of metadata, or even different types of metadata, gives you a semi-structured way of representing your text that you can then query across a variety of different dimensions, whether it's vector search or even something like SQL. On the indexing side, all of this is related: not only do you get back a parsed document graph, you also want to make sure that what you index includes the different representations of the same underlying object. Oftentimes indexing a text chunk with a single vector is not enough. You might extract different representations, whether it's a summary or a sentence, of the same underlying piece of text, and you want to be able to represent that with multiple different vectors, so that during retrieval you can retrieve the underlying source object and de-dupe as necessary to give you back the underlying data. Some other indexing tips here include actually having a docstore.
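
A small sketch of the metadata idea; the metadata keys, values, and example text are hypothetical. Structured fields are attached at ingestion time and then used as filters alongside vector search at query time:

```python
from llama_index.core import VectorStoreIndex
from llama_index.core.schema import TextNode
from llama_index.core.vector_stores import ExactMatchFilter, MetadataFilters

# Attach structured metadata to each chunk at ingestion time
nodes = [
    TextNode(
        text="Q4 2023 revenue grew 12% year over year...",
        metadata={"company": "netflix", "year": "2023", "section": "financials"},
    ),
    TextNode(
        text="Risk factors include increased competition...",
        metadata={"company": "netflix", "year": "2023", "section": "risk"},
    ),
]
index = VectorStoreIndex(nodes)

# Query across dimensions: dense retrieval constrained by a metadata filter
filters = MetadataFilters(filters=[ExactMatchFilter(key="section", value="financials")])
retriever = index.as_retriever(similarity_top_k=2, filters=filters)
results = retriever.retrieve("How did revenue change?")
```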

Jerry Liu [00:14:05]: On the docstore point: oftentimes people just think, oh, we only need a vector database. When building RAG, you really also need a key-value store, so you can store different types of hierarchical information as well as your source documents. You also need a docstore to help you do caching or incremental syncing. If you have a lot of data and you only want to propagate changes when that data updates, you want to make sure that you can sync just the stuff that has changed. And so a docstore can help you store the hashes of the documents, to double-check which documents have changed and which haven't. Finally, a lot of people are building chatbots and agents these days, and having some sort of storage system to store conversation history, and more generally longer-term memory, is a necessary component. We anticipate that this is going to become increasingly important: you're not only going to want to retrieve flat document chunks, but also return the conversation history in some well-represented way, potentially with knowledge graphs.
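
A hedged sketch of the docstore-backed incremental sync idea, using LlamaIndex's ingestion pipeline; the directory, transformations, and reader options are illustrative. The attached document store keeps document ids and content hashes, so re-running the pipeline only re-processes documents whose content changed:

```python
from llama_index.core import SimpleDirectoryReader
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.storage.docstore import SimpleDocumentStore

pipeline = IngestionPipeline(
    transformations=[SentenceSplitter()],
    docstore=SimpleDocumentStore(),  # stores document ids and content hashes
)

# First run: everything is new, so all documents are parsed and chunked
docs = SimpleDirectoryReader("./data", filename_as_id=True).load_data()
nodes = pipeline.run(documents=docs)

# Later runs: unchanged documents are skipped based on their stored hash;
# only added or modified documents are re-processed
docs = SimpleDirectoryReader("./data", filename_as_id=True).load_data()
nodes = pipeline.run(documents=docs)
```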

Jerry Liu [00:15:09]: In the last 30 seconds or so: I think there are still some very interesting challenges in data processing for LLMs, and I encourage all of us to think about this even as we build more advanced stuff on the query orchestration side. Data quality does matter quite a bit; parsing, chunking, and indexing all impact the capabilities of your end-to-end QA interface or knowledge assistant. Some of the main things that we're still thinking about include the impact of multimodality. As multimodal models get better, faster, and cheaper, it'd be interesting to think about the native chunk representation of a document being an image of the page, as opposed to a text representation, which would inherently allow you to capture things like diagrams and images. Then there's the impact of long context windows. Many people have wondered whether RAG is dead with long-context LLMs, and generally speaking, we think that retrieval and RAG will stay, especially over larger document corpuses, but the minute chunking decisions about chunk size within a page will probably go away. The third interesting thing is that a lot of people are talking about vector databases, but what we generally need is probably a more unified storage system that can search not only by vectors but across a lot of different types of query interfaces, whether it's SQL or knowledge graphs, and that unifies unstructured, structured, and multimodal data, especially as models become more multimodal themselves. So, interesting things to think about.

Jerry Liu [00:16:37]: And again, we have some cool releases coming up this week, so stay tuned. Thank you. Yeah.
