Supercharging Your RAG System: Techniques and Challenges // Tengyu Ma // DE4AI
Co-founder & CEO of Voyage AI // Assistant Professor of Computer Science at Stanford University
Retrieval-augmented generation is the predominant way to ingest proprietary unstructured data into generative AI systems. First, I will briefly state my view on the comparison between RAG and other competing paradigms such as finetuning and long-context LLMs. Then, I will briefly introduce embedding models and rerankers, two key components responsible for the retrieval quality. I will then discuss a list of techniques for improving the retrieval quality, such as query generation/decomposition and proper evaluation methods. Finally, I will discuss some current challenges in RAG and possible future directions.
Demetrios [00:00:03]: We've got our last keynote of the day coming at us. Where you at, Tengyu? Where you at? Where you at? Oh, closing doors, making sure that we got the vibe right, huh?
Tengyu Ma [00:00:16]: Yeah, I'm at Stanford, actually. I'm at home.
Demetrios [00:00:21]: Ah, nice. All right. I like it. I like it.
Tengyu Ma [00:00:24]: Yeah.
Demetrios [00:00:24]: So, man, I'm excited for your talk. I appreciate you closing it out for us. This is very, very awesome. I'm honored to have you at this virtual conference, and I'm going to just hand it over to you. And anybody that wants to ask questions, feel free to drop them in the chat. We're all here now, so I'll try to grab the best questions and shoot them over to you when you're done.
Tengyu Ma [00:00:56]: Yeah. Thanks so much. Thanks so much for having me here. I guess you can hear me okay. Yeah, this is a great event. So this is the first time I've heard a song before a talk, and maybe an original song next time. Maybe.
Tengyu Ma [00:01:11]: Maybe this is the last time I'm going to hear a song before a talk. So, yeah, it's really nice to be here. And my talk is about RAG and some slightly more advanced techniques for RAG. Let me just make it bigger. So I'm Tengyu. I'm an assistant professor at Stanford, and I also have a startup right now: I'm the co-founder and CEO of Voyage AI, which mostly works on the components in RAG.
Tengyu Ma [00:01:38]: So I'm trained as a scientist, and I know this is a data engineering conference, so I'll try to make the talk more technical, but technical doesn't necessarily mean more engineering, as you'll see. I would like to start the talk by very briefly describing what I see as the paradigm change in engineering and, more generally, in how things are deployed in industry. About five to seven years ago, when I started at Stanford, I began teaching the course CS 229, machine learning, with Chris Ré, my colleague at Stanford. In the slide deck there was a slide like this: the seven steps of ML systems. Acquire data, look at the data, and so on and so forth, and then you repeat. We still teach this, actually; I occasionally use these slides for that lecture. But in the foundation model, generative AI era, it seems like one gigantic pre-trained large language model can do almost everything.
Tengyu Ma [00:02:46]: All of these steps are required when you build a model, but once the model is ready, you don't really need any of them; a large language model can simulate most humans in many cases. This makes the paradigm a little bit different, because now you are using these models as a black box without thinking too much about deployment, since most of the cost is API calls. Even if you host the model on your own infrastructure, you still typically set it up so that you make API calls to the model. And the way you really work with it is that you describe the task in natural language and write some instructions and demonstrations, not very different from training a new employee and onboarding them at your company. So in some sense the task becomes writing good instructions in natural language, as opposed to writing the code to train the model, measure it, and so on and so forth.
Tengyu Ma [00:03:56]: However, the issue is that an off-the-shelf large language model doesn't and shouldn't know your proprietary data, and that's still missing. Even if you have an employee who is really, really smart, they can't just do the job at a company from day one, because they don't have the institutional knowledge about the company, they don't know the company's data; they still have to learn something specific about the company. And I think that's where another layer comes in on top of the large language model, which is what I'm going to focus on. My view is that even as large language models approach AGI, even if you have Einstein, you still need some retrieval system, some knowledge system, that helps ingest the proprietary, unstructured data from the proprietary environment, like a company. That's where RAG comes into play. And this is probably the predominant approach these days, after about 1.5 years of discussion between the different approaches. So how does RAG work? I assume most of the audience roughly knows how it works.
Tengyu Ma [00:05:07]: So I'm going to be very brief here. Basically you have unstructured proprietary data, and then you go through these four steps, which are for retrieving relevant information. Once you retrieve the relevant information, you give it to a large language model. The first step is ETL, where you connect to the data sources and then transform the data into a form that can be processed by a text embedding model; basically you just turn everything into text, so you get a JSON file out of this ETL step. Then you use an embedding model to vectorize the text, and you get a vector. If you have a lot of documents, of course, you get a lot of vectors, and you put these vectors into a vector database to organize and store them.
Tengyu Ma [00:05:54]: Then you do a search using the vector database's k-nearest-neighbor search module, and you get, say, 100 documents that are relevant to the query. Then you use a reranking step to refine the results down to, say, the top five or top three, and you give those top three documents to the large language model. The ETL and vector database steps are mostly engineering, I would say, because they don't really change the quality of the search results. The embedding models and rerankers are the AI components that decide the retrieval quality, and they also drive the end-to-end response quality. As you know, if you don't have the relevant information, the large language model may hallucinate, because it doesn't know how to answer the question. And if you do have the relevant information, these days even a relatively weak large language model can synthesize a good answer from it. I'd like to spend maybe two minutes discussing some of this.
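To make the four-step flow concrete, here is a minimal Python sketch. The `embed` and `rerank_score` functions are placeholders for whatever embedding model and reranker you actually call (the toy implementations exist only so the sketch runs end to end); this illustrates the flow, not a production pipeline.

```python
import numpy as np

def embed(texts):
    """Placeholder for an embedding model API. It returns deterministic random
    vectors only so the sketch runs; a real system calls a hosted model here."""
    return np.stack([
        np.random.default_rng(abs(hash(t)) % 2**32).normal(size=256) for t in texts
    ])

def rerank_score(query, doc):
    """Placeholder for a cross-encoder reranker, which would score the
    concatenated (query, document) pair with a transformer."""
    return float(embed([query])[0] @ embed([doc])[0])

# Step 1 (ETL): the raw sources have already been turned into plain-text chunks.
chunks = ["chunk 1 text ...", "chunk 2 text ...", "chunk 3 text ..."]

# Step 2: vectorize the chunks once, offline.
doc_vectors = embed(chunks)                       # shape: (num_chunks, dim)
doc_norms = np.linalg.norm(doc_vectors, axis=1)   # step 3: stored in a vector database

def retrieve(query, k_coarse=100, k_final=3):
    # Step 3 (continued): k-nearest-neighbor search by cosine similarity.
    q = embed([query])[0]
    sims = doc_vectors @ q / (doc_norms * np.linalg.norm(q))
    candidates = [chunks[i] for i in np.argsort(-sims)[:k_coarse]]
    # Step 4: rerank the ~100 candidates and keep the top few for the LLM prompt.
    candidates.sort(key=lambda d: rerank_score(query, d), reverse=True)
    return candidates[:k_final]

print(retrieve("please explain retrieval-augmented generation (RAG)"))
```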
Tengyu Ma [00:07:00]: I think these days it has become much clearer: RAG versus fine-tuning versus long context. These are three ways to incorporate proprietary information into a large language model. There's still probably some debate there, and this is my view. I like to start with an analogy to biological networks. RAG, in some sense, is analogous to retrieving the relevant books from a library to solve a problem. Fine-tuning is like working very hard to rewire your brain: because you change the parameters, in some sense you change the synapses in your brain, you rewire your brain so that you become very, very good at some type of problem. Without any additional context, your brain just memorizes, in the synapses, those types of problems. Long context is like reading the entire library into your short-term memory every time you solve a problem.
Tengyu Ma [00:08:02]: And from the analogy you can see the differences and the pros and cons. Fine-tuning, if it works, is great, because you don't have to add any additional context every time you solve a new problem. But it requires a lot of knowledge, a lot of data, to really change how your brain works, and that's the challenge these days with fine-tuning: without enough data, fine-tuning doesn't seem to be very effective. You may not be able to ingest the knowledge into the fine-tuned model, or you may overfit to the small amount of data. And long context is really just costly, right? It doesn't seem to make a lot of sense to read the entire library every time to solve your problem. So if you look at the details of the cost of RAG versus long context, this is my back-of-the-envelope calculation. I didn't do a very careful calculation because the gap is too big.
Tengyu Ma [00:08:56]: Let's assume you have 1 million tokens. The cost of a RAG query doesn't really scale with the size of the corpus, because it only depends on how many tokens are in the query. So that's on the order of 1e-5 dollars, because a query is roughly 100 tokens, and most embedding models these days cost around 1e-7 dollars per token; 100 tokens times 1e-7 is 1e-5. A query with a long context of 1 million tokens costs probably $10, because most of the large language models are something like $5 to $10 per million tokens, which is on the order of 1e-5 dollars per token, and 1 million times 1e-5 is about $10. So the gap is something like six orders of magnitude.
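As a sanity check, here is the same back-of-the-envelope arithmetic in Python, using the ballpark prices above; treat them as rough figures, since actual prices vary by provider.

```python
# Ballpark per-query cost: RAG embeds only the ~100-token query, while a
# long-context call pays LLM prices on the full 1M-token context every time.
QUERY_TOKENS = 100
CONTEXT_TOKENS = 1_000_000

EMBEDDING_PRICE_PER_TOKEN = 0.1 / 1e6   # ~1e-7 dollars per token
LLM_PRICE_PER_TOKEN = 10 / 1e6          # ~1e-5 dollars per token

rag_cost = QUERY_TOKENS * EMBEDDING_PRICE_PER_TOKEN          # ~1e-5 dollars
long_context_cost = CONTEXT_TOKENS * LLM_PRICE_PER_TOKEN     # ~10 dollars

print(f"RAG query:         ~${rag_cost:.0e}")
print(f"Long-context call: ~${long_context_cost:.0f}")
print(f"Ratio:             ~{long_context_cost / rag_cost:.0e}")  # ~1e+06, six orders of magnitude
```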
Tengyu Ma [00:09:41]: Of course, you can cache the activations of the long-context transformer, but the storage costs are also significant, because you either have to keep the GPU running or you have to store them some other way. You can probably shave a few orders of magnitude off the long-context side, but there's still a huge gap. Long-context large language models may become cheaper and cheaper due to other factors; for example, GPUs may get cheaper, and we may have better engineering to maximize the MFU, the GPU utilization, or other tricks to speed things up. But most of those tricks also apply to RAG, which means RAG will get cheaper too. So basically my take, and this is my personal view, is that long context is just way too expensive, and fine-tuning doesn't seem to work very well without enough data. Fine-tuning has been around for a long time, but in the last 1.5 years people have tried it a lot and it didn't seem to work very well.
Tengyu Ma [00:10:47]: And RAG seems to be the direction, the future direction, to go. It makes sense: it has this hierarchical structure where you first retrieve a small amount of information, put it in your short-term memory, which means the context, solve the problem, and then forget about it. Next time you retrieve some other small amount of information into your short-term memory and solve the next problem. So that's RAG, and the rest of the talk will go a little deeper into how RAG works and how to make RAG work better. Very briefly: an embedding model, which is the key component here, is a model that vectorizes documents or queries into vectors. Every document is mapped to a vector, which is a list of numbers, also called an embedding.
Tengyu Ma [00:11:38]: So every document or query has a vector representation in Euclidean space. Let's say it's 3D; in reality it's going to be a thousand dimensions, but if the vector is 3D, then it's a point in 3D space. The property you want from the embedding model is that the relatedness, the semantic relationship between documents, is captured by the geometric distance between the vectors. If two documents are close in the geometric space, they are also close in the semantic space. That's why you can read off the semantic relationship by looking at the geometric distance between two documents, and that makes it useful for semantic search. To search for similar documents, you just do a k-nearest-neighbor search in the geometric space: you are given a query, you turn the query into a vector in the same space, and then you look at which document vectors are closest to the query vector, and you find, say, document three.
Tengyu Ma [00:12:50]: And then you say, that's the most related document in a semantic sense. Another component, which is slightly less popular but which most sophisticated users are using, is the reranker. This is the fourth step I had in the RAG stack: an additional refinement step on top of the vector search. Suppose you have already done a search, let's say a vector-based search, or sometimes people use lexical search like BM25, and you already have 100 documents. You can still rerank these 100 documents. Maybe you don't know whether these 100 documents are ranked correctly, maybe they were obtained through multiple search algorithms, or you just want to refine them. Then you can use this reranker, which basically takes in a set of documents, reranks them, and gives a relevance score for every document.
Tengyu Ma [00:13:49]: You can use the scores to sort the documents and then take a subset of them. So you sort them and take maybe the top three, which are the most highly relevant documents at this stage. It's a refinement step. And the reason rerankers are useful is that there are structural differences in how the neural networks are implemented under the hood. If you use embedding models, just to recap: you have a query, it goes through a transformer, which is the embedding model, and you get a vector out; you are given a document, it goes through the transformer, and you get an embedding out. Only at that point do the embeddings of the query and the document interact: only at the end do you take a cosine similarity, and that's the only interaction between the two objects, the query and the document. The benefit of this two-tower, bi-encoder setup for an embedding model is that the embeddings, the vectors for the documents, can be precomputed prior to query time, and that saves a lot of time on the fly.
Tengyu Ma [00:15:05]: But the problem is that when you compute the embeddings for the documents, the model doesn't know the query at that time. The embeddings have to be prepared for all kinds of possible queries, and that creates a lot of burden on the embeddings, because the embedding for a document basically has to memorize almost every aspect of the document so that it is prepared for all kinds of possible queries down the road. That's a lot of burden, which means the embedding may not be able to memorize all the details of the document. However, with a reranker, the setup is that by the time you run it you already have a small set of documents, so you can afford to concatenate the query and the document and give them to the transformer together. It's called a cross-encoder, as you can see from the figure. This transformer sees both the query and the document, which means it can quickly drop all the irrelevant parts of the document, focus only on the relevant parts, and really process very carefully whether the document is relevant to the query. You have more focus, because the query is given to you while you are processing the document.
Tengyu Ma [00:16:28]: And that's why rerankers can give you highly accurate relevance scores. The downside is that the model has to see both the query and the document, so at query time you have to run it multiple times, depending on how many documents you want to process. You cannot really process a thousand or a million query-document pairs; you can probably only process about 100 of them. That means you can only work with about 100 documents, and that's why you need a pre-filtering step from the embedding model to narrow things down to 100 documents and then give them to the reranker. Reranking is much more accurate but also slower, and that's why you can only use it for a small number of documents.
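As a concrete illustration of the two-stage setup, here is a small retrieve-then-rerank sketch using the open-source sentence-transformers package. The checkpoint names are just common public models standing in for whichever bi-encoder and cross-encoder you actually use, and the three toy documents are placeholders.

```python
from sentence_transformers import SentenceTransformer, CrossEncoder, util

# Two-tower embedding model (bi-encoder) and cross-encoder reranker.
bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

docs = [
    "Retrieval-augmented generation feeds retrieved documents to an LLM.",
    "BM25 is a classic lexical search scoring function.",
    "Rerankers score a query and a document jointly with one transformer.",
]
doc_emb = bi_encoder.encode(docs, convert_to_tensor=True)   # precomputed offline

query = "how does a reranker differ from an embedding model?"
q_emb = bi_encoder.encode([query], convert_to_tensor=True)

# Stage 1: cheap vector search over the whole corpus, keep ~100 candidates.
hits = util.semantic_search(q_emb, doc_emb, top_k=min(100, len(docs)))[0]
candidates = [docs[h["corpus_id"]] for h in hits]

# Stage 2: accurate but slow cross-encoder scoring on the short candidate list.
scores = cross_encoder.predict([(query, d) for d in candidates])
reranked = [d for _, d in sorted(zip(scores.tolist(), candidates),
                                 key=lambda p: p[0], reverse=True)]
print(reranked[:3])
```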
Tengyu Ma [00:17:14]: Okay, so these are some of the basics. Now let's discuss how to improve the retrieval quality, which I think is the main bottleneck these days for RAG. If you don't retrieve the relevant document or information, you don't get a good answer, and in many cases the information is there in your pool of data; it's just that the embedding models and rerankers couldn't find it. I'm listing, I think, seven or eight suggestions, which are an aggregation of my past experience, of talking to our users and other people, and of reading blog posts. I will summarize which ones are more likely to work at the end, but I'll go through the techniques quickly. I don't have citations here, but for most of this you can just do a search and find relevant information online. I didn't want to promote references without carefully going through them.
Tengyu Ma [00:18:22]: That's why I didn't include the references. The first technique, which is the most commonly used, is query generation and decomposition. Sometimes the query is too terse for anybody to understand what it means, so you should help the embedding models and rerankers by making the query less ambiguous and more detailed. For example, if your original query is just "rag", sometimes you type these three letters and expect a high-quality answer, you can extend the query to "please explain more details about retrieval-augmented generation (RAG)". This gives more information to the embeddings, and if the embedding model is not very strong, it helps a lot, because now the embedding model sees many more keywords, which helps the matching. How do you do this? You can use rule-based methods, or you can use ChatGPT to extend the query, and so on and so forth.
Tengyu Ma [00:19:22]: And sometimes people generate multiple queries. You can say, please explain the retrieval step first, please explain the generation step first. How do you generate multiple queries? You can use a large language model and say, given this original query, can you give me three more detailed, step-by-step queries? There are many, many different ways to do this if you search online. Another approach is to do something similar for the rerankers, not by changing the query itself but by adding some instructions. The challenge is that the definition of similarity is not clear up front, right? Different people have different definitions of similarity, and the embedding models and rerankers don't know what your definition is; they just guess the most common-sense one. But you can always give some instructions in the query about which kind of similarity you care about. You may care about similarity only in the content, or maybe you care about whether the same entity names appear in both the query and the documents, or you may care about some other definition of similarity; you can say that in your query. Whether this works depends on whether the rerankers are responsive to it. Sometimes they aren't, but you can give it a try.
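Here is a sketch of what LLM-based query expansion and decomposition can look like, assuming the OpenAI Python client purely as an example endpoint; the prompt wording, model name, and helper function are my own illustration rather than anything prescribed in the talk, and any chat-style LLM works the same way. The same prompt can also carry an instruction about which notion of similarity you care about.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def expand_query(raw_query, n=3):
    """Turn a terse query into a few detailed, self-contained queries."""
    prompt = (
        f"Rewrite the following search query into {n} more detailed, "
        "self-contained queries, one per line. Spell out abbreviations and "
        "include the keywords a retrieval system would need.\n\n"
        f"Query: {raw_query}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",   # use whichever model you have access to
        messages=[{"role": "user", "content": prompt}],
    )
    lines = resp.choices[0].message.content.splitlines()
    return [line.strip("-*0123456789. ").strip() for line in lines if line.strip()]

# e.g. expand_query("rag") might return queries like
# "Please explain retrieval-augmented generation (RAG) and how it works".
# You then run retrieval once per expanded query and merge the candidate sets
# before reranking.
```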
Tengyu Ma [00:20:37]: Another technique that is pretty commonly used by sophisticated users is iterative and recursive retrieval. The rationale is that sometimes your documents have links to other documents, or the relevant document cannot be found in one step. So you design a workflow that retrieves documents iteratively: you first retrieve the nearest documents, and then you retrieve documents that are linked from the initially retrieved documents, or documents that are similar to the initially retrieved documents in some other sense.
Tengyu Ma [00:21:23]: You can make this rule-based, you can make it driven by a large language model, or you can use the embedding models again for the next round of retrieval. I don't want to go into too much detail here, because it really depends on the case-by-case situation; it depends on your data. Another dimension people explore is how to chunk the data properly. These days some embedding models have very long context; for example, at my company, Voyage AI, we have a 32K context, which we are proud of. But the longest possible context is not always the best; sometimes 5K or 10K is better than 32K.
Tengyu Ma [00:22:07]: The reason is that if you reduce the chunk size, you can retrieve more chunks at the same downstream cost. If you're sensitive to downstream cost, you may want to use smaller chunks in return for more retrieved chunks, and with more retrieved chunks there's a better chance you hit something useful. On the other hand, if you increase the chunk size, your chunks are much more self-contained and easier to retrieve correctly. You have to balance these two factors. For example, if the chunk size is 512, I often feel that's probably too small, because there isn't enough information in each chunk, and the models may misunderstand the chunk and fail to retrieve it. But 32K, it depends on whether you really need it.
Tengyu Ma [00:22:57]: Sometimes people don't need it, or they are cost sensitive, so they still want smaller chunks than 32K. Okay, the next slides have some personal suggestions that, in my view, are on a slightly different level. Number five: as you can see from some of the discussion before, you may want to look at your data, find out where the issues are, and then fix them, as if you were talking to a librarian. Right.
Tengyu Ma [00:23:36]: That's kind of how people came up with the first four methods, right? You find out that your issue is that the system doesn't deal with the links in the documents very well, so you intentionally, explicitly help the model use the links and do some recursive retrieval. Or, actually, I forgot to mention this: sometimes people also put the headers in every chunk. Even though the header or the title of the document is not naturally part of the chunk, no matter where the chunk sits in the document, you always copy the document title and prepend it to the chunk.
Tengyu Ma [00:24:26]: What's the rationale for that? It's because some people looked at their data and found that the chunks are not interpretable anymore, even by humans, without the title of the document, because you don't know the context. For example, you have a legal document, and you know that in a legal document, after the first few paragraphs, the two parties are referred to as the licensor, the licensee, the partner, and so on and so forth; the companies' names never show up after a certain point, right? And you may have a chunk that doesn't have those kinds of indicators. Oh, I guess I'm out of time. So I'll wrap up very quickly.
Tengyu Ma [00:25:12]: If you don't have those company names, the chunks are not meaningful, because you don't really know which companies the chunks are about. That's why you have to copy the document title, or the first paragraph, together with the chunk. Sometimes you just find out that you need a better off-the-shelf embedding model or reranker; for example, you can try the general-purpose or domain-specific embeddings by Voyage AI. Sometimes you can fine-tune embedding models if you have a lot of pairs, and here the idea is that you have positive pairs that share a semantic relationship, which could be title and document, question and supporting evidence, or caption and image, and then you use a simple loss to fine-tune the embeddings.
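The talk doesn't spell out the exact loss, but a standard choice for fine-tuning on positive pairs is a contrastive (InfoNCE) objective with in-batch negatives. Here is a minimal PyTorch sketch of that loss; the encoder, optimizer, and paired text batches are assumed to exist elsewhere.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(query_emb, doc_emb, temperature=0.05):
    """query_emb, doc_emb: (batch, dim) tensors where row i of each forms a positive
    pair (e.g. question and its supporting evidence, or title and document)."""
    q = F.normalize(query_emb, dim=-1)
    d = F.normalize(doc_emb, dim=-1)
    logits = q @ d.T / temperature                 # (batch, batch) cosine similarities
    labels = torch.arange(q.size(0), device=q.device)
    # Each query should score its own document highest; every other document in
    # the batch acts as a negative.
    return F.cross_entropy(logits, labels)

# One training step (encoder, optimizer, and text batches are assumed):
#   q_emb = encoder(batch_queries)      # (batch, dim)
#   d_emb = encoder(batch_documents)    # (batch, dim)
#   loss = info_nce_loss(q_emb, d_emb)
#   loss.backward(); optimizer.step(); optimizer.zero_grad()
```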
Tengyu Ma [00:26:00]: Or you can use one of the APIs, or one of the companies, to fine-tune embeddings for you, for example Voyage. The last dimension I want to discuss is that you also need a proper evaluation dataset. Even if you have done all of one through seven from the past few slides, you still need a proper evaluation set to check which method works, because no method is guaranteed to work without any experimentation. Maybe using a better embedding model is almost always guaranteed to help, but you still want to check whether it really works before deployment. There are a few ways to create datasets. One way is to create a labeled retrieval dataset; in the interest of time, I'm not going into the details here. Another way is to use a large language model as a judge: you basically do a shoot-out and use the large language model to decide which document is relevant.
Tengyu Ma [00:26:53]: The only thing is that you have to anonymize the model names and randomize the order; we have seen people use a fixed order, which creates biases. And there are pros and cons: it's simple and cheap for a single test, but it can become expensive because you cannot reuse the dataset. Okay, just as a summary: retrieval augmentation, RAG, in my opinion is always needed, even when large language models approach AGI. Embedding models and rerankers are responsible for the quality of the retrieval, and I've discussed the various techniques I'm aware of to improve the retrieval quality. So I think there are a few challenges here.
Tengyu Ma [00:27:33]: One is that some domains have very, very specialized domain knowledge. I've seen this with some of my customers, and here you probably just have to customize. For example, if you have chemistry documents, most embedding models won't understand the chemistry terms well enough, and then you need either domain-specific models that are really customized for your domain, or sometimes fine-tuning on your own data. Another issue I've alluded to is that the chunks don't know the global context, and you need embedding models that somehow know additional context about the document, as opposed to just doing this ad hoc thing where you copy-paste the headers or titles into the chunk. We're going to release some models that have these properties. And looking forward, I think for this community to really mature, we need a reasonably standardized and clean tech stack that can reliably achieve, for example, 95% retrieval accuracy. We are probably not exactly there yet.
Tengyu Ma [00:28:37]: On some datasets we achieve that with some of the advanced embedding models and some of these techniques, but in other cases the retrieval accuracy is probably still below 95%, and then you start to doubt whether you can use this in production. A few years later, the goal of the community will probably be to improve from 95% to 99% or 99.9% retrieval accuracy. That's all I have. Thank you so much.
Demetrios [00:29:05]: Well, thank you, sir. That was awesome. And no need to rush, because you are the final keynote, so you get all the time in the world. But there are a lot of questions coming through in the chat, and rightly so; many cool, cool questions. Let me see what the first one is. So many. I'm scrolling.
Demetrios [00:29:33]: All right. I imagine the embeddings will change based on how you break your documents into chunks, and it's hard to know what the optimal breaks are in advance. Any tips on how to deal with that?
Tengyu Ma [00:29:49]: Yeah, I probably went a little fast when I was talking about this. So first of all, at Voyage AI, sorry, I'm talking about my own company a little too much, we are going to release an automatic chunking strategy so that you don't have to worry about this in the future. After we launch this product, if you use it, you just send us a document and we'll do everything for you; it's actually the machine learning model that decides the optimal chunking. But at this moment, I think you are pretty much weighing these two factors.
Tengyu Ma [00:30:26]: What's your cost structure? Do you really want to send very, very long documents to GPT-3.5, GPT-4, or Anthropic's Claude? If you don't want that, you probably should have a smaller chunk size, right? But if you have bigger chunks, the performance is often better if you compare at the same number of chunks, though not at the same number of tokens, especially if you use an embedding model that is truly responsive to long context. Some embedding models claim to be long-context, but when you actually give them long context, they miss information in the middle. If the embedding model is truly responsive to long context, increasing the chunk size is almost always better, as long as you don't care about the cost. So, yeah, there's a trade-off there.
Tengyu Ma [00:31:25]: So yeah, sorry, the answer is really just that. Intuitively, that is the trade-off, but eventually you need to do some experimentation with evaluation to know which trade-off between cost and quality you want to choose.
Demetrios [00:31:44]: The other thing I've heard some people talk about is how you can get really messed up by cutting off a chunk in the middle of a sentence or the middle of a paragraph. Have you found any ways to help against that?
Tengyu Ma [00:31:58]: Oh yeah, that's actually, you know, not trivial, but it's a solved task to some degree now, because there are online packages and companies, like unstructured and LlamaIndex, and some open-source code, that let you avoid that. You just say, my chunk size is about 2,000 tokens, but I don't want to break any sentences, or even documents, or even sections, and then they will do it. They won't satisfy the chunk size exactly, but it will be approximately a 2,000-token chunk size. Another thing is you can have some overlap to also mitigate this issue. I think that mostly gets solved with these kinds of techniques.
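For readers who want to see the idea in code, here is a small sentence-aware chunker with overlap, a rough sketch of what libraries like LlamaIndex or unstructured do for you. The word-count budget is a crude stand-in for real token counting, and the optional title prefix implements the "copy the document title into every chunk" trick discussed earlier.

```python
import re

def chunk_text(text, title="", max_words=400, overlap_sentences=2):
    """Split text into chunks of roughly max_words words without breaking sentences."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], []
    for sent in sentences:
        current.append(sent)
        if sum(len(s.split()) for s in current) >= max_words:
            chunks.append(current)
            # Carry the last few sentences into the next chunk (overlap), so a
            # thought that straddles the boundary appears in both chunks.
            current = current[-overlap_sentences:] if overlap_sentences else []
    if current and (not chunks or current != chunks[-1][-overlap_sentences:]):
        chunks.append(current)
    # Optionally prepend the document title so every chunk stays interpretable
    # on its own, even deep inside a long document.
    prefix = f"{title}\n" if title else ""
    return [prefix + " ".join(c) for c in chunks]
```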
Demetrios [00:32:42]: Nice. How often do we need to retrain our RAG pipeline as new documents get added in?
Tengyu Ma [00:32:51]: One of the benefits of RAG is that when you have new documents, you don't have to retrain anything; you just vectorize the new documents and put them into the same system, so there's no retraining. It's really just adding more documents to the pipeline. The only thing you have to do is vectorize the new documents and put them in the vector database. That's it. But maybe sometimes you want to change your embedding model or the other components you are using. For that, I think a few months to a year is the right cadence, because you don't want to re-implement your RAG stack too often, and most embedding model providers release new models on the order of every six months to a year. That's at least what we are doing at Voyage.
Tengyu Ma [00:33:46]: Maybe OpenAI is even a little bit slower. So that's probably the timescale on which you should consider changing your embedding models to improve the quality.
Demetrios [00:33:55]: Is re-indexing mutually exclusive with re-ranking, or are they different?
Tengyu Ma [00:34:04]: They're not mutually exclusive. Re-indexing is a way to re-vectorize your... maybe let me go back to this slide. I guess it depends a little bit on how you define it, but basically these four steps are all very, very modular. At any time you can replace your embedding model with another embedding model, and you can replace your reranker with another reranker. The only thing is that if you replace your reranker, you don't have to do anything else; you basically just change that line of code to call a different API, and that's it, because it's stateless. Every time you run a reranking, you can forget about everything afterward.
Tengyu Ma [00:34:45]: For embedding models, it's slightly trickier to change, because if you change the embedding model, you have to change all the vectors you have computed; basically, all the vectors you got from the documents have to be re-vectorized. You say, I'm changing my embedding model to some other model, and I use that new model to go through my entire document base, vectorize all of the documents, put them into the vector database, and remove all the old vectors. It's not that hard; it takes probably a few hours.
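As a sketch, swapping embedding models boils down to a migration loop like the following; `new_embed` and `vector_db` are hypothetical placeholders for your embedding API and vector database client, since the exact calls depend on which products you use.

```python
def reindex(documents, new_embed, vector_db, batch_size=128):
    """Re-vectorize the whole corpus with a new embedding model.
    `new_embed(texts) -> vectors` and `vector_db` are hypothetical stand-ins for
    whatever embedding API and vector database client you actually use."""
    vector_db.delete_all()                    # drop the vectors from the old model
    for start in range(0, len(documents), batch_size):
        batch = documents[start:start + batch_size]
        vectors = new_embed(batch)            # one embedding API call per batch
        ids = list(range(start, start + len(batch)))
        vector_db.upsert(ids=ids, vectors=vectors, payloads=batch)
```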
Demetrios [00:35:24]: But I imagine it does have a bit of a cost component to it, so you have to think about that.
Tengyu Ma [00:35:30]: Exactly, exactly. Although the cost of embedding models these days is often a small fraction of the entire RAG stack; you will spend much more money on the large language models. That's why, at least at this moment, unless you have 20 billion tokens, it's not that expensive yet.
Demetrios [00:35:47]: Yeah, I can see it being that trade-off where, when a new model comes out, you think: all right, do we want to run all of our documentation through this new embedding model, or do we want to just wait? What's the performance-versus-cost trade-off that we're looking at?
Tengyu Ma [00:36:09]: Yeah. Another dimension that is interesting is the following. You replace your embedding model, but you could possibly save on the cost of your vector database. Suppose you switch to a smaller embedding model that has the same performance; the smaller model gives you lower-dimensional vectors, and that means your vector database cost goes down linearly in the dimensionality of the vectors, right? So you may actually save something, even though you have to re-vectorize your documents.
Demetrios [00:36:45]: Yeah. But then do you see the performance also go down?
Tengyu Ma [00:36:50]: No. You need to use a model that has smaller dimensionality but the same performance. Basically, I'm saying: suppose a new model comes on the market, and this new model has the same performance but smaller dimensionality; then you should consider using it, because it can save you cost on the other components.
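The storage arithmetic behind that point is simple: the raw index size scales linearly with the embedding dimensionality, so a model with the same quality at half the dimension roughly halves the vector-database footprint. The corpus size and dimensions below are made-up round numbers for illustration.

```python
NUM_VECTORS = 10_000_000        # e.g. ten million chunks
BYTES_PER_FLOAT = 4             # float32

for dim in (1536, 1024, 512):
    raw_gb = NUM_VECTORS * dim * BYTES_PER_FLOAT / 1e9
    print(f"dim={dim:>4}: ~{raw_gb:.0f} GB of raw vectors")
# dim=1536: ~61 GB, dim=1024: ~41 GB, dim=512: ~20 GB  (before any index overhead)
```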
Demetrios [00:37:13]: Yeah, makes sense. And then the last thing I wanted to ask was around fine-tuning your embedding models. I know there are a lot of really cheap ways to do that these days. It feels like there's some stuff out there; I've even seen people do it with BERT models, right?
Tengyu Ma [00:37:33]: Yep.
Demetrios [00:37:34]: So for you, one thing that I imagine you get asked a lot is: why wouldn't I just fine-tune a BERT model?
Tengyu Ma [00:37:43]: Yep. Yep. So we get lots of those questions, and people have tried it and it didn't work very well. You know, in some cases, you could try it.
Demetrios [00:37:52]: Try it. Go fine-tune a BERT model, and then I'll talk to you in a few weeks.
Tengyu Ma [00:37:57]: Yeah, I think the answer is that I wouldn't say it never works. It always gives you some lift on top of BERT. When I say it doesn't work, I mean it doesn't give you enough performance gain to beat even the off-the-shelf embedding models without fine-tuning. The reason is the following. Of course, as a scientist I probably shouldn't say this without strong evidence, but my intuition is that fine-tuning is mostly learning the style and the preferences of the users' queries.
Tengyu Ma [00:38:47]: Users' queries are sometimes ambiguous, but in certain cases you have the context, right? Suppose you only talk to high school students; they have a different way of using the language, and your embedding model doesn't know it. If you fine-tune, the model starts to learn their conventions, their abbreviations, all of that, and those are very easily picked up by fine-tuning. But so far I don't see strong evidence that fine-tuning can gain a lot of knowledge, like hardcore factual knowledge, if you only fine-tune on a small amount of data. If you have a lot of data, then you can fine-tune, really gain a lot of knowledge, and get a substantial lift. But if you don't have a lot of data, then you are mostly just learning the style.
Tengyu Ma [00:39:40]: The style and preferences will give you probably a five to ten percent improvement, which is very good if you start with a very strong embedding model. But if you start with BERT, a five to ten percent improvement is not enough to get you above the strong off-the-shelf embedding models, if that makes sense.
Demetrios [00:40:02]: A hundred percent. Well, thanks, dude. This has been great. I want to thank you so much for doing our final keynote of the day. This has been awesome.
Tengyu Ma [00:40:10]: Thanks so much. Yeah, it's really fun to be here. Thanks.