MLOps Community

Self-Improving RAG

Posted Aug 15, 2024 | Views 62
# RAG
# AI Quality
# LanceDB
speaker
Chang She
CEO / Co-founder @ LanceDB

Chang is the CEO and Co-founder of LanceDB, the Database for AI. Chang was one of the original coauthors of Pandas and has been building tools for data science and machine learning for almost two decades. Most recently he was VP of Eng at TubiTV where he focused on personalized recommendations and ML experimentation.

SUMMARY

Higher-quality retrieval isn't just about more complex retrieval techniques. Using user feedback to improve model results is a tried and true technique from the ancient days of (checks notes) recommender systems. And if you know something about the patterns in your data and user queries, even synthetic data can produce fine-tuned models that significantly improve retrieval quality.

TRANSCRIPT

Chang She [00:00:10]: Excited to be here. Thanks for coming to my talk. Just curious, how many of you were at Jerry's talk, like two talks ago? Okay, cool. And how many of you have put a RAG pipeline into production? Oh, nice. Way more than I thought. This is awesome. And how many of you have worked on recommender systems or, like, personalization systems in the past? Nice. Okay, cool.

Chang She [00:00:39]: So this is great. So we're going to be talking about components of self-optimizing RAG. It's basically sort of a fancy term for having a feedback loop, right? A little bit more than that. But at the core, that's what it is. What are the knobs you can turn, how to set up that loop? And what are the things you need to assemble and watch out for? So, my name is Chang. I'm the CEO and co-founder of LanceDB. I've been working on data tools for data science and machine learning for about two decades at this point. I was one of the original co-authors of the Pandas library.

Chang She [00:01:16]: It was a long time ago, and I spent a bunch of time in big data systems and recommender systems in the past. And I'm now building LanceDB, the database for multimodal AI. So I'll tell you a little bit about LanceDB later on and how it can be part of this RAG pipeline and be the storage and querying layer for these things. And I'm an engineer by training, but these days I spend about equal time tweeting and coding. And you can go to Shiptalkers to find out your own ratio, if you're curious. So I think the reason why I'm really interested in this topic is I think right now there's sort of a gap in taking RAG from demos into production. And so probably lots of you have put RAG into production. I think a lot of you have seen that pain point: getting to a compelling demo can be pretty quick and easy, but getting that reliably into production is much harder, way harder than you would extrapolate from the effort of that demo.

Chang She [00:02:30]: So part of that is retrieval quality. It's that 80/20 rule. It's 20% of the effort to get to that, like, 80% demo. But then how do you get that last 20% so you can get into production? And that usually, you know, I think usually that doesn't take 80% of the time. That takes like 120% of your time. And of course, I think getting RAG into production, as you probably know from your experience, is not about having that perfect setup right from the beginning. It's about how to set it up so that you have continuous improvement over time, and it just gets better and better the more you use it and the more you have user feedback. So having that composable system with different knobs to turn and different components to improve, plus a feedback loop where you can take errors that you encounter or synthetic data.

Chang She [00:03:25]: And then most importantly, having that eval system to be able to benchmark your results. If you put those three together, that's what I'm calling self-improving RAG. It's a continuous system for self-improvement of your RAG pipeline. So LanceDB is, you probably know LanceDB is a vector database, but my counsel for people starting with RAG is don't actually start with vector search. Start with BM25 as the benchmark, as the baseline. In some datasets, in a lot of datasets actually, full text search can work just as well, if not better. And it's less expensive to compute and index. So I would definitely start there. In this screenshot you can see us using LlamaIndex with the Llama 2 paper dataset, and we benchmarked different retrieval techniques.
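As a rough illustration of the BM25 baseline recommended above, here is a minimal in-memory sketch of Okapi BM25 scoring. It is not the implementation used in the talk's benchmark; the class name `BM25`, the `tokenize` helper, the Lucene-style IDF variant, and the default `k1` and `b` values are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())


class BM25:
    """Tiny in-memory Okapi BM25 index (illustrative, not optimized)."""

    def __init__(self, docs, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        self.doc_tokens = [tokenize(d) for d in docs]
        self.doc_freqs = [Counter(toks) for toks in self.doc_tokens]
        self.avg_len = sum(len(t) for t in self.doc_tokens) / len(docs)
        # Document frequency: number of documents containing each term.
        df = Counter(term for freqs in self.doc_freqs for term in freqs)
        n = len(docs)
        # Lucene-style IDF, which stays non-negative for very common terms.
        self.idf = {t: math.log(1 + (n - c + 0.5) / (c + 0.5)) for t, c in df.items()}

    def score(self, query, idx):
        """BM25 score of document `idx` for the query."""
        freqs, length = self.doc_freqs[idx], len(self.doc_tokens[idx])
        total = 0.0
        for term in tokenize(query):
            tf = freqs.get(term, 0)
            if tf == 0:
                continue
            # Term frequency saturation with document length normalization.
            norm = tf + self.k1 * (1 - self.b + self.b * length / self.avg_len)
            total += self.idf[term] * tf * (self.k1 + 1) / norm
        return total

    def search(self, query, top_k=3):
        """Return indices of the top_k documents with a positive score."""
        scored = [(self.score(query, i), i) for i in range(len(self.doc_tokens))]
        scored = [(s, i) for s, i in scored if s > 0]
        return [i for s, i in sorted(scored, key=lambda p: (-p[0], p[1]))[:top_k]]
```

Running the same evaluation queries through a baseline like this and through vector search gives a like-for-like comparison before any money is spent on embeddings.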

Chang She [00:04:27]: And you can see that the FTS here, full text search, actually does better than the vector search. I'm giving away spoilers on reranking and hybrid, but you'll see a little bit more in this presentation. So the second knob you can turn is the chunking strategy. So depending on your use case, depending on the type of data, there's lots of different ways you can divide up your text data. And you want to do it in a way such that it matches with your use case, so that you don't end up with a garbage in, garbage out kind of pipeline. So for example, if you have text versus JSON or HTML data, or if you're ingesting markdown versus code, those all have different structures, they have different syntactic rules. You want to make sure that you use the correct parser and chunker for those. And for each of those, there are tons of details that you can tune, like the window size, amount of overlap, the list of delimiters that you can give it, things like that.

Chang She [00:05:36]: And then you have to be aware of things like the token limit, so that your context actually fits both your LLM and your embedding model. Fortunately, LangChain and LlamaIndex come with tons of chunking processors. I have a link, but given that the Internet's not really working: we have a blog post as a survey on all of these different techniques. So I'll post the slides afterwards so you can kind of get a sense for what they are. And of course, good evals here make this experimentation process a lot faster. Lastly, just be careful of the money that you're spending on the embedding API. If you're using OpenAI or something like that, make sure you, let's say, sample the data so that each experiment is not like a billion embeddings or something like that. Okay, come on.
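The window size, overlap, and token limit knobs mentioned above can be sketched as a simple sliding-window chunker. This is a minimal sketch, not the LangChain or LlamaIndex API: the function names and defaults are hypothetical, and whitespace-separated words stand in for tokens, so a real pipeline would count tokens with the tokenizer of its own embedding model.

```python
def chunk_tokens(tokens, window_size=200, overlap=50):
    """Split a token list into windows of `window_size` that share `overlap` tokens."""
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    if not 0 <= overlap < window_size:
        raise ValueError("overlap must be in [0, window_size)")
    step = window_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + window_size])
        # Stop once the current window already reaches the end of the input.
        if start + window_size >= len(tokens):
            break
    return chunks


def chunk_text(text, window_size=200, overlap=50, token_limit=512):
    """Chunk text by whitespace words (an assumption; real tokenizers count differently)."""
    if window_size > token_limit:
        raise ValueError("window_size exceeds the model's token limit")
    words = text.split()
    return [" ".join(chunk) for chunk in chunk_tokens(words, window_size, overlap)]
```

For example, `chunk_text("a b c d e", window_size=3, overlap=1)` yields two chunks that share the word `c`. Varying `window_size` and `overlap` on a sample of the data, rather than the full corpus, keeps each experiment cheap.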

Chang She [00:06:47]: Okay. Another big knob you can turn, of course, is the embedding model. Right? And how to choose the right one is very important. So in my experience, it's almost always easiest to just start with the MTEB leaderboard. And I couldn't get Hugging Face to load, but let me see if I loaded it. Hugging Face looks like this right now, but I got something to load on GitHub, I guess, or maybe a refresh. But anyways, how many of you know what the MTEB leaderboard is? Okay, all right, so MTEB stands for Massive Text Embedding Benchmark. And essentially it's a large scale benchmark that you can run.

Chang She [00:07:40]: And the easiest way to look at the results is on Hugging Face. It's a maintained list of all the different embedding models and how they perform for different problems. And so you can just go there and take a look at the top ones and you can essentially choose. Most people, I think, would probably start with OpenAI. And if you want something that runs locally, or you don't want to pay that OpenAI cost but you have local infrastructure, you can always start with something from Hugging Face. And then if you really need the best quality for your problem, take a look at the different metrics that they have. That brings us to selecting a good metric. So NDCG is usually a pretty good default for ranking problems, and this is where a lot of people start. So NDCG is great for ranking problems and it's been very useful for recommender systems and, no surprise, pretty useful for measuring embedding models as well.
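For readers who have not worked with NDCG, a minimal sketch of NDCG@k follows. It uses the linear-gain form of DCG (some benchmarks use an exponential gain instead), and the function names and the `relevance_by_id` input format are assumptions for illustration.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain: relevance discounted by log2 of the position."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_ids, relevance_by_id, k=10):
    """NDCG@k for one query.

    ranked_ids: document ids in the order the retriever returned them.
    relevance_by_id: graded relevance labels; unlabeled documents count as 0.
    """
    gains = [relevance_by_id.get(doc_id, 0.0) for doc_id in ranked_ids]
    # The ideal ranking lists the labeled documents from most to least relevant.
    ideal = sorted(relevance_by_id.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    if ideal_dcg == 0:
        return 0.0
    return dcg_at_k(gains, k) / ideal_dcg
```

A perfect ordering scores 1.0, and pushing relevant documents further down the list lowers the score. Averaging this value over an evaluation query set gives one number for comparing retrievers or embedding models.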

Chang She [00:08:47]: And of course, if you are working under certain constraints, pay attention to the token limit of the model, the number of dimensions of the output vectors, and then the inference cost itself. So you take all that into account. And basically what that helps you do is choose the lightest model that will satisfy the task at hand. And most of the time you actually don't need sort of the absolute highest on the MTEB leaderboard, but it's more about choosing the one that's just good enough for the task. And how many of you have tried the latest OpenAI embedding models, like text-embedding-3-large? And have you guys tried the dimension reduction functionality there? Do you guys know about the functionality? So when you generate the embeddings in the new OpenAI embeddings API, you can actually tell it, I only want a certain number of dimensions back, and we don't know exactly what it does. But I think the best guess is it's basically just chopping it off at the first n dimensions that you give it. And again, let me see. Yeah, so we did some benchmarks, and if you compare just the ones from OpenAI, on the left you see ada-002 and then text-embedding-3 small and large, and then reduced to 1024, 512 and 256 dimensions.
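The speaker's guess about dimension reduction, keeping only the first n dimensions, can be sketched as truncate-then-renormalize. This is a sketch of that guess, not a statement of what the OpenAI API does internally; `shorten_embedding` and `cosine` are hypothetical helper names.

```python
import math


def shorten_embedding(vector, dims):
    """Keep the first `dims` values and rescale the result to unit length."""
    if not 0 < dims <= len(vector):
        raise ValueError("dims must be between 1 and len(vector)")
    head = list(vector[:dims])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0:
        return head
    # Renormalizing keeps cosine and dot-product similarity comparable.
    return [x / norm for x in head]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

Shortened vectors are only comparable with other vectors shortened to the same length, so the query and the stored documents need the same `dims`.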

Chang She [00:10:26]: So no surprise that as you reduce the dimensions for the same model, the quality goes down slightly. But what is surprising is that if you take text-embedding-3-large and you reduce that to 256 dimensions, that actually performs better in our benchmark than ada-002, which is 1536 dimensions. And 256 is much easier to store and index, and it's much faster to query as well. So I think the biggest gains with embeddings come not just from picking the best model out of the gate, but from actually fine-tuning it and continuously improving it over time. So, especially if your domain knowledge is a little bit different from the training data that went into the embedding model, or if your distributions end up being different, then fine-tuning can be a good way to make significant improvements. And you don't really need a ton of data. As this LanceDB user tweeted, Chris generated about $10 worth of synthetic data using OpenAI, and was able to take a much smaller open source model and essentially make that perform much better for his use case than sort of the biggest and best generic embedding models.

Chang She [00:11:53]: And so of course, selecting the right samples and labels matters here. And so, just like if you worked in personalization and recommendation, a lot of the labels will come in and you have basically positive labels. It's much harder to generate negative samples, and you have to be a little bit careful about what to label and exactly what to feed that model. And when you first start putting that into production, you don't have user data to start with. And if you need an initial boost in quality, synthetic data is often a great way to start. And so sort of a simplified loop is what I'm showing here. I'm just looking at the time here. So you can essentially set up this pipeline where you can figure out the right chunking strategy and then figure out the right embedding model.
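One common way to work around the missing negative labels described above is to sample negatives at random from documents that are not known positives for the query. The sketch below assumes that approach, which is not stated in the talk; `build_triplets` and its input formats are hypothetical, and random negatives can occasionally be relevant documents that simply were never labeled.

```python
import random


def build_triplets(pairs, corpus, negatives_per_pair=1, seed=0):
    """Turn (query, positive_doc_id) pairs into (query, positive, negative) text triplets.

    pairs: list of (query, doc_id) from user feedback or synthetic generation.
    corpus: dict mapping doc_id to document text.
    """
    rng = random.Random(seed)
    # Collect every known positive per query so none is sampled as a negative.
    positives_by_query = {}
    for query, doc_id in pairs:
        positives_by_query.setdefault(query, set()).add(doc_id)

    triplets = []
    for query, pos_id in pairs:
        candidates = sorted(d for d in corpus if d not in positives_by_query[query])
        count = min(negatives_per_pair, len(candidates))
        for neg_id in rng.sample(candidates, count):
            triplets.append((query, corpus[pos_id], corpus[neg_id]))
    return triplets
```

Triplets in this shape are a typical input for contrastive fine-tuning of an embedding model. Harder negatives, such as documents that rank highly under BM25 but were rejected by users, are usually more informative than random ones.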

Chang She [00:12:57]: And then once you have user feedback or you have synthetic data, that's where fine-tuning can happen. And this is what completes this feedback loop. So if you know how well your system is doing with good evals, and you have a good system to essentially collect user feedback and run the fine-tuning, you can then have a pretty basic feedback loop that will continuously improve your RAG pipeline over time. So the second part is, let's talk about a little bit more advanced loop. Okay, first I started with BM25 as the baseline, and then I moved on to vector search. But I can actually combine the two. Oftentimes, even if one is better than the other, it doesn't mean one strictly dominates the other in all samples and all queries. So they're typically good for different use cases. And you can add more things on top of that to create hybrid search with multiple recallers. Again, this is very similar if you've worked in recommender systems. So you can certainly filter the search before you get into that BM25.

Chang She [00:14:03]: And recently there's been more research into graph RAG and using graph databases for retrieving more complicated and sophisticated relationships. So filters: if you have explicit structure, use it if you can. Keywords and fuzzy search: BM25 is really good at that. Then of course vectors capture these semantic relationships. For example, if you search for person, BM25 is really good at returning everything related to person, persons, persona or whatever, but it won't return things like customer, user and things that have semantically similar meanings to the search term. And of course, if you have a pretty deep and explicit knowledge graph, graph databases might be a good choice here to add to it as well. So of course, if you have multiple recallers, that means you have multiple sets of results, but you can only present one to the user or give one to the LLM. So how do you sort the combined results? So you want to make decisions on whether you use the ranks of the results or a score, and then the scores that come out of each recaller might not be comparable.

Chang She [00:15:20]: So you need to think about calibration and normalization, and then once you combine them, you need to think about picking the right reranker and then improving that reranker over time. So in our experiments, the Cohere reranker via the API tended to perform the best. And then just a simple linear combination added gains and was the cheapest and easiest to do. And then there's a gamut of open source models that you can use for reranking, and you can actually even prompt OpenAI, or ChatGPT, into becoming a reranker model for you. So if you put all that together, this is sort of the whole system, which I'm calling the context engine, because there's way more than just a vector database that you can tune using the feedback that you get from the user. Essentially this automation becomes: okay, have a way to measure your progress, measure the quality of your results. That feedback loop where the user responses are being used to turn the right knobs in the system.
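The rank-versus-score decision and the normalization concern described above can be sketched with two small fusion functions: a linear combination over min-max normalized scores, and reciprocal rank fusion, which uses ranks only. Both are generic techniques shown under assumptions; the function names, the `weight` default, and the `k=60` constant are illustrative and are not taken from the talk's experiments or from any library API.

```python
def min_max_normalize(scores):
    """Rescale a {doc_id: score} mapping to [0, 1] so recallers become comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def linear_combination(fts_scores, vector_scores, weight=0.5):
    """Score-based fusion. Assumes higher is better in both inputs
    (convert vector distances to similarities before calling)."""
    fts = min_max_normalize(fts_scores)
    vec = min_max_normalize(vector_scores)
    combined = {
        doc_id: weight * vec.get(doc_id, 0.0) + (1 - weight) * fts.get(doc_id, 0.0)
        for doc_id in set(fts) | set(vec)
    }
    return sorted(combined, key=lambda d: (-combined[d], d))


def reciprocal_rank_fusion(rankings, k=60):
    """Rank-based fusion: each list contributes 1 / (k + rank) per document."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: (-scores[d], d))
```

Rank-based fusion sidesteps score calibration entirely, while the linear combination exposes a `weight` that can itself be tuned against an eval set as feedback accumulates.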

Chang She [00:16:32]: And that gives us the way to, say, continually self-optimize your RAG pipeline. Now of course these things won't be 100% automated, so don't expect that to be the case. It wasn't the case for recommender systems and it won't be the case for RAG. So what's important is to look at the metrics. You need to have visualization tooling, and then you need to pay particular attention to edge cases. So my expectation here is, just like in recommender systems, you'll have a really big step function in quality when you first start putting these fine-tuning processes and self-optimization processes in place. And then it'll taper off and become sort of a long tail problem. As for LanceDB, I would call us the all-in-one database for AI. So the context engine has different ways to query your data.

Chang She [00:17:28]: You can store different data, and if you went to Jerry's talk, right, in addition to the vectors, you also need a way to store the source documents, or the source images if it's multimodal, and the metadata that you can query as well. So LanceDB is an embedded open source database that is very flexible with storage options. So you can actually store your data and vectors directly on S3 and query it from your application or query it from a Lambda function. And LanceDB is multimodal. So you can store multimodal data. You can run multimodal workloads, from vector search to full text search to SQL. And the use cases are multimodal. So from search to analytics to fine-tuning and training, you can use LanceDB for all of those.

Chang She [00:18:15]: And there's an embedding registry. So we can essentially automate the embedding generation process. You can specify the embedding model that you want to use and have embedding generation be automatically taken care of by LanceDB. All right, and then I think I'm coming up on time, so I'm going to skip some of these slides. But essentially what LanceDB gives you is the ability to have a single table that can run all the different workloads that you need and house all the different data that you need for AI in general, from just SQL and analytics that you see on the left, PyTorch training and fine-tuning in the middle, and search over vector or full text search indices on the right. So thanks for listening to my talk. On the left you'll see the Discord for the LanceDB community. So please join us and check us out on GitHub.

Rahul Parundekar [00:19:20]: Awesome. Thank you so much, Chang She. We do have time for a couple of questions if that's okay. And then is Felix in the room? Okay, you're up next. So we're going there. Who has a question?

Q1 [00:19:38]: Thank you. Very exciting. One question: how do you formulate user feedback so that the models can actually learn? Is there a particular format? Is there some impedance mismatch? Because many times these signals can be noisy. Do you clean them up before you push them to the system? Or is it case by case?

Chang She [00:19:52]: Yeah, it's a little bit hard to standardize. But typically, yeah, you definitely want to clean up the data. The easiest would be if you have answers that you presented in response to the prompt and then you have some user action, like rejected or accepted. And then you can also use things like subjective evals, or something like that with human labeling, that can also serve as good data for fine-tuning. And then the synthetic data. For example, if you're doing search, if you have a product description or something like that, you can essentially ask ChatGPT what would be a good user query that would result in this product description, or one similar to it. So it's not really standard, unfortunately.
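As the answer above notes, there is no standard format for feedback, so the following is only one possible sketch of cleaning accepted/rejected events into fine-tuning labels. The event format, the function name `feedback_to_labels`, and the `min_votes` and `min_agreement` thresholds are all assumptions for illustration.

```python
from collections import defaultdict


def feedback_to_labels(events, min_votes=2, min_agreement=0.75):
    """Aggregate noisy feedback into binary relevance labels.

    events: iterable of (query, doc_id, accepted) where accepted is a bool.
    Returns {(normalized_query, doc_id): 1 or 0}. Pairs with too few votes
    or with mixed signals are dropped rather than guessed.
    """
    counts = defaultdict(lambda: [0, 0])  # [accepted, rejected]
    for query, doc_id, accepted in events:
        # Light cleanup so trivially different spellings of a query are merged.
        key = (query.strip().lower(), doc_id)
        counts[key][0 if accepted else 1] += 1

    labels = {}
    for key, (accepted, rejected) in counts.items():
        total = accepted + rejected
        if total < min_votes:
            continue
        share = accepted / total
        if share >= min_agreement:
            labels[key] = 1
        elif share <= 1 - min_agreement:
            labels[key] = 0
    return labels
```

The positive labels can feed the fine-tuning set directly, and consistently rejected pairs are a source of the negative samples that are otherwise hard to come by.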
