RagSys: RAG is Just RecSys in Disguise // Chang She // AI in Production Lightning Talk
Chang is the CEO and Co-founder of LanceDB, the Database for AI. Chang was one of the original coauthors of Pandas and has been building tools for data science and machine learning for almost two decades. Most recently he was VP of Eng at TubiTV where he focused on personalized recommendations and ML experimentation.
At the moment Demetrios is immersing himself in Machine Learning by interviewing experts from around the world in the weekly MLOps.community meetups. Demetrios is constantly learning and engaging in new activities to get uncomfortable and learn from his mistakes. He tries to bring creativity into every aspect of his life, whether that be analyzing the best paths forward, overcoming obstacles, or building lego houses with his daughter.
What once was old is new again. With increasing experience in RAG, more attention is being directed towards improving retrieval quality. RAG pipelines are evolving to resemble recommender pipelines, incorporating features such as hybrid search and reranking. This lightning talk briefly examines the parallels between the two approaches and demonstrates how to implement hybrid search and reranking with LanceDB to enhance retrieval quality.
AI in Production
RagSys: RAG is Just RecSys in Disguise
Slides: https://docs.google.com/presentation/d/1C0GlF1L5m_rIPFchOaDt38mZN71NiB81/edit?usp=drive_link
Colab Notebook: https://colab.research.google.com/drive/1_M8riqjd2IPbm0czKCZlveSxuQOtqNLh?usp=sharing
Demetrios [00:00:00]: Now, where is my man Chang? Where you at? There he is. What's up, dude?
Chang She [00:00:06]: Not much. I'm excited to be here.
Demetrios [00:00:08]: Great to have you, as always. We're a little bit behind and I've got, after your talk, we're to be doing a little bit of stand up comedy routine in the break, so I'm gonna let you go. And then we've got the many coming up next. I can only imagine that my man Mikhail has a lot planned for this comedy routine because Gemini 1.5 came out today and gave us something that nobody asked for, but apparently they thought it was going to be useful. 1 million tokens. I don't know if you saw that yet or you've been with your head in this 100% trying to write your talk. So, Chang, great to have you, man. Thanks to Lance DB for sponsoring us, and I'm going to get out of your way and let you cook, my man.
Chang She [00:01:04]: Sounds good to me. Let me get the screen share. Okay. Okay. I think I see something. I hope everybody sees the same thing too. Cool. Hey, everyone, thanks for being here.
Chang She [00:01:19]: Thanks for having me. Demetrius here to talk about Rag and how it's just Rexus in disguise. And if when I said that the Transformers theme music, play it in your head, ping me and dm me afterwards, we're going to be great friends. So, hi, my name is Chongshe. I'm the CEO, co founder of LandCB, the database for multimodal AI. So it's very easy to use installs in seconds. You can get up to billion scale vector search for Rag, and you can also store tensors and image data in it for training, fine tuning in our pytorch and tensorflow data loaders. So I've been working in this space for almost two decades at this point.
Chang She [00:02:02]: I started out being one of the co authors of the Pandas library. Most recently, I spent a number of years at two BTV, a streaming company, working on ML Ops recommender systems and ML experimentation. So if you want to talk shop, I'm on Twitter and GitHub using the same handle, changis Khan. Okay, so, rag, another slide about rag. Are we sick of it yet? Never, right. So with Rag, we can see it's sort of the open book exam of AI, where you can extend the model knowledge without having to touch the model itself. It's really, really easy to build a demo, and we've talked a lot about that. But what about production? So in production, quality of your retrieval really matters.
Chang She [00:02:51]: If you were on Twitter yesterday, you might have seen this tweet where a gentleman accidentally wrote 500 basis points when he meant to write 50 and the company stock tanked after they issued the correction. Now of course, if a human makes the mistake, you may be tempted to just say, hey, fire that person. If the AI makes that mistake, what do you do here? Do you fire the engineer that wrote the application? Do you fire the user? Do you sue OpenAI? It's not clear, but whatever the recourse, retrieval quality and accuracy really matters. So how have people really tried to increase quality? Well, one, through different chunking experiments before the embedding generation process, I've tried with different embedding models and then using different recall techniques for full type search or just SQL search or graph search. And then once you have results from all these different methods of searching, you have to have a way to combine all of them and maybe re rank them before you feed it to the LLM. If you've worked in sort of classical machine learning, this might look familiar because this is basically just a content or in general a recommender system. So the chunks are just features and different chunking experiments are more like feature engineering. In the content recommender system you might have recallers based on embeddings from user history, the content metadata, the user demographic data, and even like posters or subtitles and things like that.
Chang She [00:04:30]: And then it goes into a re ranking model and you have a final response, which is usually sort of what do you display to the user in the home grid for a Netflix or e commerce like Amazon or eBay or something like that. Okay, so let's take a diversion and see a quick example of how this works for rag and so for LANCB. This is a recent release, we're now supporting hybrid search with re ranking. And so we wrote this little notebook on Colab to use the Airbnb financial data set to demonstrate this feature. Although given yesterday, maybe I should be using lift financial data for this. But let's step through this really quickly. So here I've got the SEC filing PDF and then I'm going to use PiPDF loader to load it and then use Langchain's recursive character splitter to split that up. And then I'm going to create a Lansdb table to say hey, I'm going to use the OpenAI embedding model and then the data model is just okay, I've got the page content from the lane chain doc and then a vector field that will be generated by the OpenAI embedding model.
Chang She [00:05:52]: And then I can create this table using this data model and then just add the langchain documents to my table and so I can turn this table into a pandas data frame. And I can see there's a column of page content, the text and then the vectors. I didn't have to generate it myself. Land CB does that in the background using OpenAI. So vector search is pretty simple. So we want to ask the question, what are the specific factors contributing to Airbnb's increased operational expense in the last fiscal year? So this is just table search. Passing the query limit five, I want five responses and give it to me as a pandas data frame. So I get the top page content, the vector, and also the vector distance here.
Chang She [00:06:39]: And we can see that it's not bad if we look at one of these listings decline amongst different factors. And so this is all about financial headwind and things like that. Now, how does hybrid search work? Well, hubbard search means not only do we search via the vector field, we're also searching in this page content text. So what we're going to do is create a full text search index on the page content field. And then the only thing different I have to do during search is pass in query type equals hybrid. And now I get a data frame back. Instead of just the distance, it's relevant score. By default, what we do is a linear combination.
Chang She [00:07:24]: You normalize the scores from the vector search and the full text search and then you combine them into a relevant score using just a simple linear combination. And so what you can see is that the second document return is actually different. Now. So from the hybrid search we get something about this clause about the contractor may deliver a pricing certificate, blah, blah, blah. And I believe if you look at the original query, fiscal year is mentioned here. And so this is sort of what the full tech search is picking up here. And this is the second document based on vector search. Now this is pretty simple and one of the great things about this feature is that it's very customizable.
Chang She [00:08:13]: So out of the gate we support five different types of rerankers. So the best we've seen is actually the cohera reranker without fine tuning. So here you just say, I want to use the cohera reranker and then pass then that in to the search process and you get sort of a data frame that looks similar in structure, but the scores and the content is different. So as you can see, the cohere ranker actually surfaces to the top, something that's very directly relevant. And so if we look at the GPT response Chat GPT response based on the top context from vector search versus cohere, you can actually see that using the cohere responses, we see that we're able to tease out the big factor, which is international operations, particularly in China. So we support, in addition, we support Colbert Reranker, cross go to reranker, and an interesting OpenAI reranker, where we're actually prompting OpenAI to become an expert reranker. And then of course, with a few lines of code, you can actually write your own custom reranker and plug that into the search process. All right.
Chang She [00:09:32]: Okay. In general, I think so this was a quick demo of how do you go through that hybrid search and re ranking process and different ways that you can re rank your results. What's missing here is that from the analogy, recommender systems always get better over time the more you use it, and that is because of the feedback mechanism. And actually, in rag, I see a lot of users that are starting to experiment with this as well. So you can see one of our users talk about how he was able to, with $10 of synthetic data, to be able to fine tune the embedding model to be a lot better than the best in class generic embedding models. So those will be something that we're going to be working on next, and I think you'll be able to see great results from that. Now, last thing I'll say is that analogies have limitations, right. So it's not a perfect one to one parallel between rag and our recommender systems.
Chang She [00:10:42]: The end result is different. And in rag, there's an extra step of generation. What is relevant to whether the results in recommender systems are relevant to a particular user. Here in rag, the relevance is defined for a given question. And then for recommender systems ranking, that first spot in the home page is super valuable. So fine grained differences in ranking also matters for rag. If your top context answers fit within the context, the token limit, then that one, two, three position may not matter as much. So I think a lot of that means that how you measure the quality is going to be different.
Chang She [00:11:26]: And so it's not exactly a one to one match, but it's certainly worth thinking about and drawing parallels to get insights from traditional recommender systems. All right, so that's it for my talk. I think it's roughly ten minutes. And thank you for listening to me, ramble, and we'd love to have you join our discord channel and talk more about Rag and AI and whatnot.
Demetrios [00:11:57]: I think I just have to say one thing, which is you take the cake for best Twitter handle.
Chang She [00:12:06]: Thank you. It is actually my rename, so it is Changis Khan. C-H-A-N-G-H-I-S-K-H-A-N. I'm dating myself, but it's my aim screen name.
Demetrios [00:12:22]: Oh, it's so good, man. It is so good. Is there a link that you can drop in here or in the chat for that collab notebook that you were just playing with?
Chang She [00:12:30]: Absolutely. I'll drop it in the chat.
Demetrios [00:12:33]: Perfect. All right, man. Thank you for doing this. And we're going to keep on moving with these models. These models.