MLOps Community

The Future of RAG

Posted Mar 06, 2024 | Views 356
# Artifact Storage
# LLM Design Pattern
# ContextualAI
Aditya Bindal
Vice President, Product @ Contextual AI

Aditya is the VP of Product at Contextual AI. Contextual is building the next generation of RAG as part of its Enterprise AI platform. Enterprise customers use Contextual Language Models (CLMs) to create production-grade workflows that address all the pain points of RAG 1.0.

Before Contextual, Aditya was at AWS AI. At AWS, Aditya led product for large-scale deep learning and foundation models.

Adam Becker
IRL @ MLOps Community

I'm a tech entrepreneur and I spent the last decade founding companies that drive societal change.

I am now building Deep Matter, a startup still in stealth mode...

I was most recently building Telepath, the world's most developer-friendly machine learning platform. Throughout my previous projects, I had learned that building machine learning powered applications is hard - especially hard when you don't have a background in data science. I believe that this is choking innovation, especially in industries that can't support large data teams.

For example, I previously co-founded Call Time AI, where we used artificial intelligence to assemble and study the largest database of political contributions. The company powered progressive campaigns from school board to the Presidency. As of October 2020, we had helped Democrats raise tens of millions of dollars. In April 2021, we sold Call Time to Political Data Inc. Our success, in large part, is due to our ability to productionize machine learning.

I believe that knowledge is unbounded, and that everything that is not forbidden by laws of nature is achievable, given the right knowledge. This holds immense promise for the future of intelligence and therefore for the future of well-being. I believe that the process of mining knowledge should be done honestly and responsibly, and that wielding it should be done with care. I co-founded Telepath to give more tools to more people to access more knowledge.

I'm fascinated by the relationship between technology, science and history. I graduated from UC Berkeley with degrees in Astrophysics and Classics and have published several papers on those topics. I was previously a researcher at the Getty Villa where I wrote about Ancient Greek math and at the Weizmann Institute, where I researched supernovae.

I currently live in New York City. I enjoy advising startups, thinking about how they can make for an excellent vehicle for addressing the Israeli-Palestinian conflict, and hearing from random folks who stumble on my LinkedIn profile. Reach out, friend!


New LLMs are constantly appearing in the AI landscape, and retrieval augmented generation (RAG) has become a dominant LLM design pattern. What will the future bring? Join Contextual AI VP Product Aditya Bindal for a deep dive into the next generation of foundation models that prioritize customization and privacy.


The Future of RAG

AI in Production


Adam Becker [00:00:05]: Next up we have Aditya. Let's see. Aditya, are you around?

Aditya Bindal [00:00:10]: Hey, folks.

Adam Becker [00:00:12]: Okay, I can hear you. Aditya, how are you doing today?

Aditya Bindal [00:00:16]: Good, Adam, how are you?

Adam Becker [00:00:18]: This has been an excellent lineup so far, and I'm very excited to hear your talk. I alluded to it right at the beginning: what the future of RAG ought to be and how we should be thinking about it. Whether you're now going to be delivering the final hammer blow to RAG, or whether it will linger for some time, it's hard to know yet, but perhaps we'll be much wiser after your presentation.

Aditya Bindal [00:00:44]: Yeah, let's see. Hopefully there are some good questions for everyone to think about. So, should I get started?

Adam Becker [00:00:54]: Yes, please. Do you need to share your screen?

Aditya Bindal [00:00:57]: Yes, I'll share my screen.

Adam Becker [00:01:00]: Okay. Awesome. Take it away.

Aditya Bindal [00:01:03]: So, hi everyone. My name is Aditya Bindal. I'm the VP of Product at Contextual AI. Thanks so much for being here. I want to talk today about the future of RAG. So, the way you think about a language model is: given some previous context, what's coming next? Let's do a quick recap of where we are and where we see RAG headed. There's a lot of excitement about language models in the enterprise. I think people have a lot of enthusiasm for the types of workflows and agents that can be built using these language models, but there's also a lot of frustration.

Aditya Bindal [00:01:42]: So we've heard from customers, and we've seen ourselves, that there are some issues that are really holding back adoption of these technologies at scale and making them seem like a first-generation technology. Things like hallucination, often with high confidence; a lack of attribution, where you don't know where some information in the model's output is coming from; staleness and a lack of data compliance, where the model is relying on old data; poor data privacy, where a lot of confidential prompts, completions, and documents are going to third parties, and you don't have control over how that information gets used; and then a poor cost-quality trade-off, where a lot of use cases are deploying consumer-grade models that in many ways are not designed for the specialized enterprise use cases where they're being deployed. So one approach to solving these problems that you may have seen is retrieval-augmented generation, or RAG. The CEO and co-founder of Contextual, Douwe Kiela, pioneered and introduced RAG with his team back in 2020 when he was at Facebook AI Research. And if you're not familiar with the concept, the basic idea is that you make external knowledge accessible to the language model using retrieval. Typically, the way this looks today, in a kind of simplified view, is that you have these different frozen components. So you have a frozen language model, maybe something from OpenAI, and you have a frozen vector database with frozen embedding models.

Aditya Bindal [00:03:09]: And then you have maybe some documents that you chunk; you send those to the encoders, and they get put in the vector database. And now you're doing similarity search between the language model and the vector database. There are more complex versions of it, but the core idea is the same: all the components are actually frozen, and you're now just trying to prompt and chain these things together. And in some ways, if you go back to the original RAG paper (this is Table 6 from the 2020 paper, which is now four years old), Table 6 has these ablations that show you the performance on different benchmark datasets when you freeze this RAG system and when you unfreeze it. And the main point of the paper, the main technical contribution, was actually to show that freezing is suboptimal.
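The "frozen RAG" pipeline described here can be sketched in a few lines. This is a deliberately toy illustration: the `embed` function is a stand-in bag-of-words vector, not a real encoder, and the chunks and vocabulary are made up for the example.

```python
import math

# Minimal sketch of a frozen RAG pipeline: fixed chunks, a fixed (toy)
# embedding, and cosine-similarity retrieval. No component is trained.
def embed(text, vocab):
    # Toy stand-in for a frozen embedding model: bag-of-words counts.
    words = text.lower().split()
    return [words.count(w) for w in vocab]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, chunks, vocab, k=1):
    # Rank chunks by similarity to the query; the top-k would then be
    # pasted into a prompt for the frozen LLM.
    q = embed(query, vocab)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c, vocab)), reverse=True)
    return ranked[:k]

vocab = ["card", "interest", "rate", "rewards", "travel"]
chunks = [
    "the travel card offers rewards on travel",
    "the interest rate on this card is variable",
]
top = retrieve("what is the interest rate", chunks, vocab)
```

Every piece here is inference-only, which is exactly the "Frankenstein" pattern the talk is contrasting against: nothing in the pipeline receives a training signal.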

Aditya Bindal [00:03:56]: You shouldn't freeze these components; you should train and optimize them together, and you'll get much better performance. And this insight is in the paper, right there, four years ago. But the way we see RAG implemented is taking these off-the-shelf frozen components and just chaining them together. So internally at Contextual, we refer to this as kind of a Frankenstein's monster. It has the hands and legs and limbs, and it kind of moves around, but it isn't the real thing; it isn't doing what you really want. At the opposite end of that spectrum is something like the iPhone, where everything is end-to-end optimized. Fewer moving pieces; everything is designed to work together: the camera module, the chips, the touch screen.

Aditya Bindal [00:04:35]: So we are building what we call contextual language models, which go back to that core insight from four years ago and build upon it as the next generation. If you take the entire system and jointly optimize it, you can actually do much better than using just frozen components. So the way this looks, as we think about the future of RAG, is that instead of taking a maybe trillion-parameter model that's meant for more consumer use cases and then trying to do similarity search over a vector database, you're actually building specialized intelligence: models that have the right amount of compute so that they can comprehend the enterprise's domain, with everything else happening through retrieval. You end up getting smaller, more controllable systems that are very robust and production-grade at the specific things you want them to do. And then the next piece of this is the feedback loops. In all of these systems, we've seen, for example, RLHF and how it's made this huge improvement to ChatGPT. Reinforcement learning from human feedback allows you to learn and align the system's values and keep improving continuously based on user feedback.

Aditya Bindal [00:05:51]: The challenge with RLHF is that when you get, let's say, a thumbs-down from a user, you now need to send it to a human annotator and ask: what would a thumbs-up have looked like for that response? Then the human annotator writes the thumbs-up, you get the pair of the thumbs-down and the thumbs-up back, and you retrain and improve the model with it. There have been some attempts to make this process a little better with direct preference optimization, but the big bottleneck was that you needed all of this paired data, the thumbs-up and the thumbs-down. So what we've done is invent a new technique called Kahneman-Tversky Optimization. It's named after the behavioral economists; you might know them from the famous book Thinking, Fast and Slow. And with Kahneman-Tversky Optimization, or KTO, you can improve the model using just a single feedback signal. That could be a thumbs-up or a thumbs-down.
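The data-requirement difference between paired-preference training (DPO-style) and a single-signal approach like KTO can be made concrete. This sketch only illustrates the data shapes, not either training objective; the field names and log records are hypothetical, not any library's schema.

```python
# Hypothetical raw feedback logs: each record is just a prompt, the model's
# completion, and a single thumbs-up/thumbs-down signal.
logs = [
    {"prompt": "recommend a card", "completion": "the travel card",   "thumb": "down"},
    {"prompt": "recommend a card", "completion": "the cashback card", "thumb": "up"},
    {"prompt": "explain the fee",  "completion": "there is no fee",   "thumb": "up"},
]

def to_single_signal(logs):
    # KTO-style: every logged interaction is usable on its own as
    # (prompt, completion, desirable?).
    return [(r["prompt"], r["completion"], r["thumb"] == "up") for r in logs]

def to_pairs(logs):
    # DPO-style: only prompts that have BOTH a thumbs-up and a thumbs-down
    # completion yield a (prompt, chosen, rejected) pair.
    by_prompt = {}
    for r in logs:
        by_prompt.setdefault(r["prompt"], {})[r["thumb"]] = r["completion"]
    return [(p, d["up"], d["down"])
            for p, d in by_prompt.items() if "up" in d and "down" in d]

single = to_single_signal(logs)  # all 3 interactions usable
pairs = to_pairs(logs)           # only 1 pair; the lone thumbs-up is wasted
```

The point the talk makes falls out directly: unpaired logs keep all three examples under the single-signal scheme, while the pairing requirement discards everything without a matching counterpart.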

Aditya Bindal [00:06:45]: So the need for a balanced dataset of thumbs-up and thumbs-down pairs is no longer there. And we've found that a lot of our customers actually have existing data with a solitary feedback signal. So that's one aspect: you want to be able to improve the system in a very data-efficient and quick way. The second part, which we touched on, is having as much of the compute cached in the retrieval as possible. If you do this, you don't need to memorize over documents and data; you can actually cache a lot of information in the retrieval system, provided that the retrieval is good, and to make the retrieval good, that's where the end-to-end optimization kicks in. When you take these things together, you end up with a very different regime that looks nothing like the paradigm you see with RAG 1.0 today. With these frozen RAG systems, you have a frozen embedding model, a frozen language model, a frozen vector database and retrieval system, and you aren't able to fine-tune and align the whole system.

Aditya Bindal [00:07:45]: So you can do RLHF, potentially, on the language model or on any one of these components individually, but you aren't able to do the tuning and the alignment end to end. The approach that we think is the future of RAG, taking the original idea from the 2020 paper to its limit, is taking the entire end-to-end system, all the way from your document extraction, chunking, and encoding, to the retrieval and the generation, and being able to backpropagate through all of that. This end-to-end contextual language model, as we call it in-house, or any retrieval-augmented language modeling system built this way, would allow you to tune and align these systems end to end. And with KTO, with all the data-efficiency benefits, when you get a thumbs-down, that thumbs-down is now improving not just the generation; it's improving the retrieval, the embeddings, the extraction, the whole thing together as a system. That's one of the key things we believe is going to change with this next generation of RAG. We think about system optimization, not model optimization. And the main thing we've really seen in a lot of these cases is that a lot of our customers, a lot of folks we talk to, find it really easy to get started.

Aditya Bindal [00:09:04]: There are so many tools out there that make it trivially simple to have a vector database and a language model connected with cosine similarity, and now you have a RAG application. So you can make a lot of demos, do quick prototyping, show that to users, and get feedback. But then people seem to hit a wall, where they can't go beyond the demo and actually make it production-grade. At some point you start seeing diminishing returns from all the chaining and all the prompting you could do. And that's where we feel the next iteration, the next step up, is going to come from: this new architecture that's natively good at retrieval and optimizable end to end, so that you can actually specialize it for the use case and the workflow you're trying to build. And you now have a deployable system that can actually hit that production grade and get robust.

Aditya Bindal [00:10:01]: And if you look at the history of deep learning, this is the trend we've seen time and time again. With computer vision, with early NLP, end-to-end optimized systems always tend to win out. We think we're just seeing the same dynamic now taking place with large language models and foundation models. So that's a quick overview of how we see the world in terms of the future of RAG: retrieval augmentation as natively improving the model, having these systems deal with multimodal inputs and outputs, and optimizing the entire system, not just the model. We'd love to answer any questions you have and learn more about how you folks are thinking about using RAG. Thank you.

Adam Becker [00:10:45]: Okay, Aditya, thank you very much. If you could please keep your slides up, because I want to go one by one and dissect what's going on, to make sure that I understand what's happening. Okay, here it is. Before I do that, you had, I think, one question here. This is from Kay, who says: we just had a speaker who was saying fine-tuning is not appropriate for 90% of use cases. Do you agree with this, considering your point about CLMs?

Aditya Bindal [00:11:20]: So with any general statement like that, that fine-tuning is right or wrong, I would not agree. I think there's this false choice that people seem to make, or be given, of RAG versus fine-tuning. I don't think it makes sense to choose. You really want to just use the right technique algorithmically for the type of problem you're solving. So if you have a retrieval-augmented system, you don't want to memorize information that is in the documents you retrieve from, but you do want to tune and align the entire system around the objective. The example I always like to use is: let's say you're recommending credit cards. Each year you might have new credit cards, and there are new documents that describe the interest rates and the benefits and the perks. You don't want to train over those new documents.

Aditya Bindal [00:12:10]: You want to be able to retrieve them based in relevance to the query, not similarity. But you do want to instruction tune the entire system on becoming a really good credit card recommender. So it's able to answer with the right level of detail, it has the right style, and then you want to align that based on the thumbs up, thumbs down, feedback from users, or an explicit signal like did they actually make the purchase or did they buy the credit card? So I think this choice between fine tuning and rag doesn't make any sense to me. You want to use both, but you want to use them a little differently.

Adam Becker [00:12:44]: Okay, so again, let's now break this down step by step. And for the people in the audience for whom this analysis is going to be a bit too juvenile, I apologize. But if you appreciate me actually trying to get deep into the bones of this, then stick around. Okay, so can you please go to the previous slide, the one where you actually demonstrate all the different components that are about to be frozen? Was it this one? I think it was this one. Okay. Yeah. When you say freezing here, what is it actually? Is it the docs? Is it the chunking? You're saying we have a static repository of data; these are documents one through 15.

Adam Becker [00:13:29]: Boom. We've now embedded them, we've chunked them, we've embedded them. They live in the vector database. When you say frozen, what about this picture is frozen?

Aditya Bindal [00:13:39]: Great question. So if you think about any machine learning model, what makes it perform its task is that it has been optimized: it finds the optimal set of parameters to do the task, or to score highly on whatever evaluation criteria you use. In this case, if you look at the different types of models being used, there's a language model (that's the LLM here), there are the encoder models (maybe you have a different one for documents and for queries), and then you might have different AI or machine learning models you're using for chunking and extraction from documents. So when we say frozen, what we mean is that you've taken a preexisting model that someone else trained and you're now just doing inference against that model. You give it some input, which could be your files or a prompt, and you get an output. In a machine learning setting, what you want to do is optimize the system. Optimize here means that the parameters of these models, the language model and the encoder models, are actually changing, so that they become really good at the specific thing you want them to do.

Aditya Bindal [00:14:50]: And crucially, you want to do that together as one system so that the same accuracy metric or loss signal is back propagating from the language model all the way back through the chunker.

Adam Becker [00:15:03]: Got it. Okay. And so basically the thumbs up, thumbs down, whatever supervision, whatever signal we're going to then collect, you want that to propagate towards the optimization of each one of these little things. And that way we do some sort of like joint learning and joint optimizing of all the different subcomponents of this entire system.

Aditya Bindal [00:15:25]: Exactly. But there are actually two key stages. What you describe is what happens to specialize the system for one customer's workflow or domain. You also want to pretrain these things together. So you don't want to wait for, let's say, a GPT-4 and then try to make it better at retrieval. You want the language model to be good at retrieval from day one. So when you're pretraining the language model, you want to pretrain it with retrieval, so that it's actually learning to answer questions by searching for information based on relevance.

Adam Becker [00:15:56]: Okay, now I want to ask you about a thousand dumb questions. All right, so I'm going to start. Number one: first of all, is there a risk, if you allow each of these components to begin to jiggle as you're backpropagating (I use that metaphorically), that as you now own all of these systems, your document embedder is going to learn something, so the system itself might seem like it's improving, but the LLM itself might be degrading in its performance in some manner? Is there such a risk?

Aditya Bindal [00:16:34]: Yes, I think this is really a question about evaluation. In any type of machine learning modeling, you need to have a clean evaluation dataset, and you never want to taint that evaluation dataset by training on it. So whether you're doing it for a RAG LLM training regime or for any other model, like a computer vision model, as long as you keep that evaluation criterion separate, you'll be able to keep these things clean and untainted.
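The evaluation hygiene described here, a fixed eval set that user-feedback training data must never leak into, can be sketched as a simple guard. The query IDs and function name are illustrative only.

```python
# Sketch of keeping a held-out evaluation set disjoint from the feedback
# logs used for optimization. Names and IDs are hypothetical.
EVAL_SET = {"q1", "q2", "q3"}  # curated up front, then kept frozen

def build_train_batch(feedback_logs, eval_set):
    # Drop any interaction whose query is in the held-out eval set, so
    # later eval scores stay untainted by training.
    batch = [q for q in feedback_logs if q not in eval_set]
    assert not (set(batch) & eval_set), "evaluation data leaked into training"
    return batch

batch = build_train_batch(["q2", "q7", "q9"], EVAL_SET)  # "q2" is filtered out
```

The hard part, as the discussion goes on to say, is curating that eval set up front; the filtering itself is the easy mechanical step.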

Adam Becker [00:17:06]: Okay. But that point is actually a consequential point, because doing that rigorously seems to be very difficult. Because it is. Yeah.

Aditya Bindal [00:17:19]: I didn't mean to suggest it's an easy thing to do.

Adam Becker [00:17:22]: Right. But you'll constantly be tainting your data, and then risking overfitting on it every time you learn something new. Now you need to start with a fresh new batch. That seems to me to be the window.

Aditya Bindal [00:17:36]: Let me just give you an example of how this works in a real customer use case. They may have a domain where they're doing, let's say, investment research, or procurement, or insurance analysis, and they would have an evaluation dataset that's always separate from the real usage being generated by their end customers. So when you get thumbs-up/thumbs-down signals, that's a different dataset. The evaluation dataset they have is mostly fixed, and it's not used in any of the optimizations we've been talking about. But the hard work they put in is making that evaluation dataset up front.

Adam Becker [00:18:19]: Okay, so one more question. Now I'm getting from the audience. Apurva is asking from the user feedback, how do you figure out what component needs to be changed and how much of each component needs to be changed?

Aditya Bindal [00:18:31]: Great question. So there's two things we do. So one is we actually don't figure it out by hand, because there is no good way to figure it out manually. So we let the system optimize and see what can be improved in the signal. So when you get a thumbs down, we know what was retrieved for that query, we know what the generation was. And so we can now use that to help the system jointly optimize. So the encoders might change less, the language model might change more, or vice versa. That's just part of the algorithmic optimization.

Aditya Bindal [00:19:02]: But we've also seen, and this is the second technique, that sometimes it's valuable to really focus on improving the retrieval. And maybe that's a feedback signal we can collect on the retrieved document. So when you get maybe a document back, saying, this is some context that's relevant to the question, you can give a thumbs up or a thumbs down just on the retrieval, and then improve that specific component. So we do this kind of carve out, but overall, we don't need to specify which component to improve.

Adam Becker [00:19:32]: Yeah. Okay, so another question here from Kavita. Can you clarify how you now incorporate feedback all the way to the source?

Aditya Bindal [00:19:47]: Let me maybe describe it in two different ways. So the training regime is independent of any customer document. So the pretraining is happening before the contextual language model gets specialized for any one use case. And so that's just taking the traditional language model training regime. And now it's not stopping at the language model. It's allowing us to optimize the entire system together. So the same loss signal is now making its way through the language model, through the retriever, through the document chunker. And this is happening in the pretraining stage.

Aditya Bindal [00:20:27]: We don't have to actually go all the way back to the source documents. We just have to go back to the model that's maybe extracting or chunking from the source documents. Now, a lot of the secret sauce and a lot of the skill and art of making these models converge is in being able to do this efficiently and doing it end to end. Because language model training, I think there's a lot of great tools out there, there's a lot of great examples. But when you modify the training regime so that it has all these other components that are being jointly optimized, it actually gets very complex. And there's a lot of things you have to do in order to make this entire system now converge and become very accurate.

Adam Becker [00:21:07]: Can I ask about... if we were to zoom in, let's say, to the document embedder or the query embedder, and I actually now look at the network itself. I imagine you'd tell me that you're not thawing, or unfreezing, every single possible layer and parameter, because that sounds like overkill; perhaps you would need so much data to actually move all of these things around and optimize every one of these billions of parameters. Maybe you say, you know what, I want the first layer, or the last layer, or I want a particular subnetwork. How do you think about which aspects to actually thaw?

Aditya Bindal [00:21:53]: Yeah, so I think what you're pointing out is that it does get computationally expensive if you have to run this entire training regime. In some cases we actually do unfreeze everything, but in other cases we've learned tricks about what to unfreeze and what to leave as is, or how to update different parts at different points in time. I think that's one of the things where we've really built a lot of specialization and skill: making these things converge in a computationally efficient way.

Adam Becker [00:22:23]: Yeah, yeah. Okay, we have another question here from the audience, from Viplav. Would you say that the better the data model itself is, or the more well described the content is, the better the LLM can address the enterprise needs through pre training and fine tuning?

Aditya Bindal [00:22:41]: I think that's, in general, a true statement. I would say that a lot of the failure modes of language models in the enterprise come down to data. So if you have a really good data model, and a data flow that you can use to improve these systems, that tends to make the system a lot more robust and more accurate. But having the right data often isn't enough; you need to have it in the right form. Especially for retrieval augmentation, you need to have, for example, source documents and a lot of example queries and answers, so that you can take that combined system and use it to fine-tune and improve.

Adam Becker [00:23:23]: Yeah. Can you move on one more slide, to the next one, I think, to the ablation studies? Yeah. So, the way we normally think about RAG is the frozen-component paradigm. What you're saying is that we just didn't read the original RAG paper carefully enough: they show that unfreezing the components, or jointly optimizing them, is superior, or that freezing is suboptimal, as you put it. Do they go about unfreezing only certain components of the system, not just within the network but even between components? Do they actually test it on all the different legs of this journey? And if so (it might be a little difficult to read from here), how significant is the improvement they originally showed in 2020?

Aditya Bindal [00:24:16]: Yeah, so remember, this is from four years ago, and a lot has obviously changed since then. In the original paper, I think the goal was just to show that there is this optimization space that hadn't been explored before. That's why the paper was published: it shows there's this new direction you can go in, and it took just a few of those components and unfroze them to demonstrate what's possible. Since then, there have been additional papers that have built on this foundation and shown even bigger improvements on some of these datasets. And now we're taking that to the next extreme, unfreezing all the components end to end. But you can think of this as the starting point that really gave everyone the idea that you shouldn't freeze these things.

Adam Becker [00:25:10]: Yeah, we've got one more question here from the audience, from Kyle. He says: this is really interesting to me, because if your embedding model isn't frozen, it seems like you'd see embedding drift, where older embeddings could potentially differ from newer embeddings of the same or semantically similar content, such that it would be harder to find older documents or older chunks. Would love to hear some clarification on whether this is expected behavior, and if not, then why, short of periodically re-embedding all of the documents in the database using the new embedding model?

Aditya Bindal [00:25:45]: Yeah, so I think this relates in some ways to the question you just asked as well. When we unfreeze, we actually have the ability to re-optimize all of it, not just the embeddings. And if you go back to this table in the paper, one of the things it teaches us is that optimizing the embeddings by themselves is not what you should do; you should tune and optimize the embeddings and the language model together. When you do that, I think the question of drift is in some ways no different from how you would deal with it if you just had a language model and no retrieval component. If you had a language model and a set of queries that work well, and you make some updates to the language model, now there's drift.

Aditya Bindal [00:26:30]: So you need a good test suite, and you need an evaluation set that allows you to detect that drift and then try to fix it. I think that's the same paradigm. The benefit of this approach is that, because the embeddings are being tuned for that specific objective and set of instructions jointly with the language model, the system as a whole is being measured. So, for example, you wouldn't have embedding drift and language model drift separately; you'd be able to control it as system drift, and you can then fix it in much the same way, except now you don't need to fix each component individually. You fix the whole system as one.

Adam Becker [00:27:12]: Right. Do you imagine that... Can you move on to the next slide? One after, maybe another one? Okay. Yeah. You're operating in environments that sometimes don't have both that upvote and downvote, right? Thumbs up, thumbs down. Sometimes it's just the thumbs up, as in, let's say, somebody made the purchase. Although I guess you could say, well, they made the purchase or they didn't make the purchase, but not making the purchase shouldn't be as penalizing as making the purchase is rewarding, right? So you have to weigh them differently. And it seems like one of the things you're introducing here is this Tversky... what did you call it?

Aditya Bindal [00:28:05]: Kahneman-Tversky Optimization.

Adam Becker [00:28:08]: Can you tell us a little bit about it? If this is proprietary, feel free not to dig too deep, but give us a sense of what it's doing.

Aditya Bindal [00:28:16]: Yeah, so this is actually open source. There's a paper, which you can find on our blog along with a technical report, and all of this is available. So you're right that sometimes you don't want to penalize with a thumbs-down signal when someone, let's say, doesn't make a purchase, because that could mean many different things. What this gives the customer, though, or the end user, is the flexibility to define what they count as a thumbs-up and a thumbs-down signal and then use it to improve the system. You can make it an empirical question. Let's say you're an e-commerce retailer: you have a user who puts something in the cart but doesn't check out, and then you have someone who puts something in a wish list, or just browses a page, or just enters a search. You can think of it as a funnel, and you can define empirically which stage is the best one to label as a thumbs-down and see how it improves the system. And because you now have this super fast, cheap, and data-efficient feedback loop, you don't need to wait for that entire labeling cycle to take place.

Aditya Bindal [00:29:22]: You can try different things and empirically see: is that a good place to do it? With a few customers, we've seen that the domain makes it really easy to identify what actually is a thumbs up and what actually is a thumbs down. So there'll be a more explicit action taken by the user that you can interpret as a thumbs up or a thumbs down.
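The funnel-labeling idea Aditya describes can be sketched in a few lines. This is a toy illustration, not Contextual AI's actual API; the stage names and the cutoff parameter are hypothetical. The point is that the thumbs-up cutoff is a tunable, empirical choice:

```python
# Hypothetical sketch: turning an e-commerce funnel into binary feedback
# for KTO-style tuning. Stage names and the cutoff are illustrative.
FUNNEL = ["search", "page_view", "wishlist", "add_to_cart", "checkout"]

def label_event(stage: str, thumbs_up_from: str = "add_to_cart") -> bool:
    """Thumbs up if the user reached the cutoff stage or any later one."""
    return FUNNEL.index(stage) >= FUNNEL.index(thumbs_up_from)

# Relabel the same interaction logs under different cutoffs and measure
# which labeling actually improves the downstream system.
events = ["search", "wishlist", "checkout", "add_to_cart", "page_view"]
labels = [label_event(e) for e in events]
# → [False, False, True, True, False]
```

Because the feedback loop is cheap, each candidate cutoff can be evaluated empirically rather than decided up front.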

Adam Becker [00:29:41]: Yeah. Okay. It seems like the kinds of problems you often see in reinforcement learning, where you're not getting exactly this type of structure in your reward and you have to come up with proxy ways of doing it. It sounds somewhat similar. Can you go back now again, sorry, to that diagram with a different component?

Aditya Bindal [00:30:06]: The final one?

Adam Becker [00:30:11]: No, not the final one yet. The initial one, yes. This is something that has always been a little bit weird to me, and perhaps you can help me refine my thinking about it. What we're doing with the cosine similarity here is just trying to find some similarity between the query and the relevant document, right? But there is something about that that feels unholy to me, because it just feels wrong.

Adam Becker [00:30:45]: Because what if, in my question, I happen to mention certain things? The answer actually isn't relevant to the way I framed the question; the answer is relevant to the way the answer is framed. Right? Yet I am now searching in question space, not in answer space, and I'm hoping that there is some form of association between the two, which there may or may not be. Is there something to be done about that similarity search, that aspect of the retrieval, that can be further optimized?

Aditya Bindal [00:31:21]: Yeah, absolutely. I think you made a really good observation that something about that just feels intuitively wrong. And I think the reason for that is we've all seen search engines evolve and improve. Google is not doing similarity search and just looking at keywords, right? They have this ability to build a relevance engine that actually finds content based on what's relevant to the query, and they can understand the intent and do all those things that make it such a good search engine. In the same way, you can imagine the future of RAG is a language model with a native search capability, or a native search engine, that can find things based on relevance. So you don't want to do cosine similarity; what you want is relevance instead. And to do relevance, the language model needs to have some way of understanding that this content or this document is relevant to the query, even if the way the query was structured or phrased, with its keywords, is not hitting anything that's similar to those documents. In some ways, what you described is really at the heart of this future of RAG: a contextual language model that allows you to pretrain these components together, so that the language model gets this native search capability and an understanding of how to find relevant information, not similar information.
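Adam's worry can be made concrete with a toy example. Cosine similarity rewards documents whose embeddings point in the same direction as the query embedding, so a document phrased like the question can outscore one phrased like the answer. The vectors below are made up for illustration:

```python
import math

def cosine_similarity(q, d):
    """Score used by plain similarity search: angle between embeddings."""
    dot = sum(a * b for a, b in zip(q, d))
    norm_q = math.sqrt(sum(a * a for a in q))
    norm_d = math.sqrt(sum(b * b for b in d))
    return dot / (norm_q * norm_d)

q = [1.0, 0.0]            # toy query embedding ("question space")
d_similar = [0.9, 0.1]    # phrased like the question
d_relevant = [0.2, 0.98]  # phrased like the answer, different wording

cosine_similarity(q, d_similar)   # high: wording overlaps the query
cosine_similarity(q, d_relevant)  # low, even if it answers the question
```

A learned relevance model, trained jointly with the generator as Aditya describes, can rank `d_relevant` first despite the low cosine score.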

Adam Becker [00:32:42]: Yeah. Do you imagine, or have you seen, that certain components, or legs or connections between components, are actually more ripe for optimization than others? And if so, which are they? Perhaps if we freeze everything else and just allow the query embedder to continue modulating and learning, you can get all of the benefit you need. Or perhaps it's something to do with, for example... I suspect chunking isn't the most important one, but I could be wrong. Perhaps in certain regimes it is more important. What have you learned about the importance of these different modules?

Aditya Bindal [00:33:22]: Yes, I think there are actually quite a few examples where chunking makes a huge difference. What we found is that no single component seems to dominate. Even the language model is only around 20%. It's really about getting these things to operate together that gives you the huge gains. And that's the same core insight as the 2020 RAG paper: it's about the optimization together, because you can think of each of these models as having their own latent space, and now you're combining them into one and jointly finding the optimal states. That, I think, is really the secret sauce behind all of this. Any one component is not going to give you the same benefit.

Adam Becker [00:34:05]: Yeah. Okay, now, last thing, I promise. Can we go back to the final diagram? Okay, so I'm just trying to fully register it in my mind. We've got the docs; now we have a document encoder. This one has a color, and the color means that it is potentially learnable, right? Is that the idea? We're encoding, and then it goes into the database. This would be another vector database. Then the query. Okay, we've got an input.

Adam Becker [00:34:37]: The input is being encoded, the query encoder, and then that provides the context. Okay, can you walk us through this one more time?

Aditya Bindal [00:34:50]: Sure. So the goal here is to really treat this as one system. You start with the documents that are brought in by the customer, by the user, and you now have the embedder models, the encoder models, that are creating some kind of index, and then these are getting passed into context. So far, this is the same as the original RAG. The difference is that if you look at the boxes that show you the model components, the generator and the encoders in the simplified diagram, those are now being trained together, so that when you create the embeddings, those embeddings have been optimized for that specific use case and task. If you go back to the credit card recommendation example: the way you encode the query, the way you encode the documents, and the way the language model generates, all of these things have been tuned to make it really good at that one task. And then you're doing a lot of the traditional things you would with any language model on prompting and bringing things into the context window in order to generate an output. But the components that you use to make all of this happen, the query encoders and the generator in particular, are now end-to-end optimized.
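The pipeline Aditya walks through can be summarized structurally. This is a sketch of the diagram as described, not Contextual's code; component names are taken from the conversation, and the `trainable` flag marks the colored, learnable boxes that get optimized end to end:

```python
# Structural sketch of the final diagram: which RAG components carry
# parameters that are jointly trained, per the discussion above.
from dataclasses import dataclass

@dataclass
class Component:
    name: str
    trainable: bool  # colored box in the diagram = learnable

pipeline = [
    Component("document_encoder", True),   # embeds the customer's docs
    Component("vector_index", False),      # storage only, no parameters
    Component("query_encoder", True),      # embeds the incoming query
    Component("generator", True),          # produces the final output
]

# In RAG 2.0, the loss on the generator's output backpropagates through
# every trainable component, tuning embeddings for the specific task.
trainable = [c.name for c in pipeline if c.trainable]
# → ['document_encoder', 'query_encoder', 'generator']
```

In the "RAG 1.0" setup, by contrast, each of these components would be frozen and optimized in isolation, which is the pain point the joint training addresses.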

Adam Becker [00:36:09]: Aditya, thank you very much. One of the things that I believe everybody who's been playing around with these language models has understood is just how dependent they are on the prompt. If you don't get the prompt right, it feels like you just lost the game, and this is a massive space to optimize. One of the ways you're approaching it is: let's just redo the encoding or the embedding of that query, or let's allow that to be learned alongside every other component. There are other ways, I imagine, of doing it, because right now all of this is being done through guesswork. And the person that has taken a stab at this is Alex. Alex, are you with us too? Yep. Can you hear me? We can hear you, Alex.

Adam Becker [00:37:02]: It feels to me like there's just like perfect synergy between what you guys are doing. There's a lot of just stochasticity, just like in how we're trying to learn many of these things. So I'm very excited to be hearing from you, Alex, about the work that you're doing. Aditya, thank you very much. And thank you for allowing me to grill you and to try to squeeze as much as I could. So I appreciate it.

Aditya Bindal [00:37:23]: Thank you so much, Adam.

Aditya Bindal [00:37:24]: Thanks, everyone. This was a lot of fun.
