Inference Scaling for Long-Context Retrieval Augmented Generation
Sophia Skowronski is a Data Scientist at Breckinridge Capital Advisors with previous experience as a Business Analyst at Pledge 1%. Sophia has also worked as a Data Science Intern at Candid, an AI Investigations Intern at Deep Discovery, and held roles at Singularity University and the Global CO2 Initiative. Sophia holds a Bachelor of Arts in Astrophysics and Cognitive Science, as well as a Master's degree in Information & Data Science from the University of California, Berkeley.
I'm a tech entrepreneur and I spent the last decade founding companies that drive societal change.
I am now building Deep Matter, a startup still in stealth mode...
I was most recently building Telepath, the world's most developer-friendly machine learning platform. Throughout my previous projects, I had learned that building machine learning powered applications is hard - especially hard when you don't have a background in data science. I believe that this is choking innovation, especially in industries that can't support large data teams.
For example, I previously co-founded Call Time AI, where we used Artificial Intelligence to assemble and study the largest database of political contributions. The company powered progressive campaigns from school board to the Presidency. As of October 2020, we had helped Democrats raise tens of millions of dollars. In April of 2021, we sold Call Time to Political Data Inc. Our success, in large part, is due to our ability to productionize machine learning.
I believe that knowledge is unbounded, and that everything that is not forbidden by laws of nature is achievable, given the right knowledge. This holds immense promise for the future of intelligence and therefore for the future of well-being. I believe that the process of mining knowledge should be done honestly and responsibly, and that wielding it should be done with care. I co-founded Telepath to give more tools to more people to access more knowledge.
I'm fascinated by the relationship between technology, science and history. I graduated from UC Berkeley with degrees in Astrophysics and Classics and have published several papers on those topics. I was previously a researcher at the Getty Villa where I wrote about Ancient Greek math and at the Weizmann Institute, where I researched supernovae.
I currently live in New York City. I enjoy advising startups, thinking about how they can make for an excellent vehicle for addressing the Israeli-Palestinian conflict, and hearing from random folks who stumble on my LinkedIn profile. Reach out, friend!
This November Reading Group conversation covers advanced retrieval techniques, strategies like DRAG and IterDRAG for complex queries, and the impact of larger context windows on model performance. The Reading Group also examines challenges in generalizing these methods.
Binoy Pirera [00:00:00]: All right, we're live. Valdimar, take it away.
Valdimar Eggertsson [00:00:03]: All right. Hi. So we're looking together at this new paper from Google DeepMind with a very technical name. I was looking through it yesterday at a cafe with a friend of mine, and he just didn't understand a single word of the title: Inference Scaling for Long-Context Retrieval Augmented Generation. I assume many of you, or most of you, are at least familiar with Retrieval Augmented Generation, which this paper is mostly about. It's about how to use long context to scale up RAG, Retrieval Augmented Generation.
Valdimar Eggertsson [00:00:52]: I'll just go through the introduction and the related works, highlighting what I thought was most important. So what is the paper all about? They investigate what they call inference scaling. So it's basically about putting more content into the RAG process. Usually when we are retrieving documents to augment the generation of LLMs, we put in maybe 20,000 tokens or something. However, we have these long-context models nowadays which can handle over a million tokens, so basically you could read a whole book, which effectively allows us to scale up the capabilities at a cost of compute. And they are basically combining two different concepts: in-context learning, which is just showing examples of the task right before you ask the model to do the task, and iterative prompting, which is basically making some kind of loop or process with the LLM prompt.
Valdimar Eggertsson [00:02:19]: There are two questions; I'll quickly go through them. Question number one: how does Retrieval Augmented Generation's performance benefit from the scaling of inference computation when optimally configured? So they're looking at how to make it better by putting a lot more compute into it, and how to do that in a good way, finding the optimal config. And question two: can we predict the optimal test-time compute allocation for a given budget by modeling the relationship between RAG performance and inference parameters? So they want to be able to predict this, which Sophia will cover in her part of the talk. I'm going to look at the related work first, just to define the basic concepts, then the introduction. There are these three things we need to understand. First, there are the long-context LLMs.
Valdimar Eggertsson [00:03:25]: You know, GPT-3.5 could handle, I think, 4,000 tokens. GPT-4, when I started using it, was 8,000, I think. Then they made a 32,000 token version where you could have a long conversation before it just kind of crashed. And nowadays it's 128,000 tokens, I think. But the models from Google that they're using have, I think, 1.5 million token capabilities. I didn't think this would be possible given the transformer architecture, because you need a lot of memory to process a sequence of a million tokens; the memory complexity, the space complexity of the attention algorithm, is quadratic. However, as they mention here, early works on extending the context length involve sparse and low-rank kernels to reduce memory requirements, and recent advancements in efficient attention methods further enable them to train and infer on input sequences comprising millions of tokens.
Valdimar Eggertsson [00:04:40]: So I would love to understand how they managed to get away from this quadratic memory complexity; that's how we can do this, read a whole book, or read a sequence of the genome, genetic strings, very interesting applications of it. And in-context learning is basically something you do to enhance model performance at inference time. It's kind of a smart way to prompt the model, by conditioning on a few demonstrations of the task. And they cite the GPT-3 paper here, so I think that's what you did with GPT-3.
Valdimar Eggertsson [00:05:25]: You could supply a few examples and then it could solve your task quite well. And now that we have these million-token context windows, we can have really complicated examples of how to do something, which is basically what this paper is about: applying that to retrieval augmented generation, which is just searching a knowledge base. So you have maybe your business data, or something specific about a company that's not in the training data of ChatGPT, and you search through the documents and use them to augment the answer. This is what everybody's been doing for the last year, and I'm happy to see a paper about how to make it considerably better. Considerably better, though, at a high cost of computation.
Adam Becker [00:06:22]: Very interesting. If you scroll down to the methods that they use, if you scroll down a little bit more. Yeah, yeah. The chapter on long context. So, yeah, long context.
Binoy Pirera [00:06:37]: LLMs?
Valdimar Eggertsson [00:06:38]: Yes.
Adam Becker [00:06:39]: Yeah, yeah. I wonder, and I don't think they did it in this paper, but I think it'd probably be useful to see how the different methods for increasing the context affect the performance there. You know what I mean? And I wonder whether they do this in the other papers, but in particular for the performance of RAG. I wouldn't imagine that they do that in the other papers; maybe they do for long-context RAG. I don't know if we've done this paper before, but actually getting deeper into the weeds of how these long-context LLMs are constructed, that could be an interesting topic for next time.
Valdimar Eggertsson [00:07:24]: Exactly. I mean, I'm curious about just how it works, because it doesn't fully make sense to me yet. But as soon as it's cost efficient, I guess it'll be the whole new thing to just put a million tokens in; you don't need to do retrieval anymore if you can just put the whole database into the memory. So yeah, it's maybe something for next time. I noted a few things. So okay, what's this paper about? Combining larger context with showing in-context demonstrations will make RAG efficient and impressive.
Valdimar Eggertsson [00:08:14]: So I'm not going to read everything here, but some highlights. They mention that current long-context LLMs still have limited ability to effectively locate relevant information in ultra-long sequences. So it doesn't work that well to just put all the info there, because it gets lost in the noise. So that's why it makes sense; it's not just about saving money that you only put in, you know, 10,000 words or 30,000 words. Because usually, if you have a good retrieval system, most of the time the important info will be close to the top. And they show that it scales, they say linearly, with the context length, but it's linearly with the scale, the order of magnitude, of the context length.
Valdimar Eggertsson [00:09:06]: This is their method: the amount of data you put in, 10,000 words versus 100,000 or a million, it keeps getting better. You have to put in like a thousand times more data, and at least a thousand times as much compute. And please comment and ask questions whenever you want. So what they do is they introduce this demonstration-based RAG, or DRAG, where multiple examples are provided as demonstrations. So you basically have an example of RAG: here are the top 10 documents, or 100 documents, here's the question, you can find the information at this part, and use that to construct the answer.
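What Valdimar is describing, stacking several complete RAG examples (retrieved documents, question, answer) in front of the test question and its own retrieved documents, might look roughly like the sketch below. The helper functions and placeholder strings are hypothetical, not the paper's actual prompt template.

```python
# Minimal sketch of assembling a DRAG-style prompt: m in-context RAG
# demonstrations (retrieved documents + question + answer), followed by the
# test question with its own k retrieved documents. Hypothetical helpers and
# placeholders, not the paper's template.

def format_example(docs, question, answer=None):
    block = "\n".join(f"Document: {d}" for d in docs)
    block += f"\nQuestion: {question}\nAnswer:"
    if answer is not None:
        block += f" {answer}"
    return block

def build_drag_prompt(demonstrations, test_docs, test_question):
    # demonstrations: list of (docs, question, answer) tuples, length m
    # test_docs: the k documents retrieved for the test question
    parts = [format_example(d, q, a) for d, q, a in demonstrations]
    parts.append(format_example(test_docs, test_question))  # answer left blank
    return "\n\n".join(parts)

demos = [(["<demo doc 1>", "<demo doc 2>"], "<demo question>", "<demo answer>")]
print(build_drag_prompt(demos, ["<retrieved doc>"], "<test question>"))
```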
Valdimar Eggertsson [00:09:57]: The figure shows an effective example of it. And then they have this iterative demonstration-based RAG, IterDRAG, which is the same concept but done iteratively, with query decomposition. So instead of handling one query directly, IterDRAG learns to decompose input queries into simpler sub-queries. This is nothing new per se, but combining the two and using it to inspect how it scales, that's their novelty in this paper. You can see the graph here, with like five different bars. It's a challenging set of question answering, where you have the zero-shot question answering and the many-shot question answering, which is just not using retrieval at all. And that kind of sucks, because most of the time ChatGPT doesn't know the answer or just makes up some answer.
Valdimar Eggertsson [00:11:15]: But by using RAG, the performance gets considerably better. And then using the demonstration-based RAG, DRAG, you get a 5% increase, which is pretty good. And then you can go even further with iterations, by breaking into sub-queries. This is especially important for this dataset here: you get 50% with normal RAG, but breaking it down, it's 76.9%. So this is for questions that need to be decomposed. It's like: who is the author of the book which won the Academy Award in 2003 for Best Movie? You can't just search this using some kind of similarity in a database and find the answer right away. You have to break it down: okay, it was Return of the King that won the Oscar.
Valdimar Eggertsson [00:12:08]: And then, who wrote Return of the King? And I think this dataset, 2WikiMultiHopQA, is about that: you need to jump between articles in Wikipedia to get the right answer. Yeah. And Sophia and Adam will talk a bit about the inference scaling laws. This is a big thing with LLMs, showing the scaling laws of training: you can get better constantly, just throw energy at it. And we have something similar here; at least it doesn't just plateau. With RAG you can throw compute at it, and we can understand how it works.
Valdimar Eggertsson [00:12:49]: Yeah, there are these two scaling strategies, DRAG and IterDRAG, which I'll talk about a bit more in a minute. They evaluate them and show that it's pretty damn good, at least if you're willing to use a million tokens, which per question doesn't cost that much nowadays, but still, like $10 I think. And they managed to predict the optimal way to do this, which I'm not going to cover now; we'll wait for Sophia and the others. I'll just jump into chapter three about the methods, unless there's some question or comment by now.
Binoy Pirera [00:13:36]: Yeah, Valdimar, we have a question from Mahmoud. Do you see that in the chat? Do you feel that the use cases for IterDRAG and hybrid RAG are different?
Valdimar Eggertsson [00:13:48]: Yeah, hybrid RAG. I usually think of hybrid RAG as using text-based search, that is, sparse embeddings, as well as dense embeddings. I would use hybrid RAG, and I've done that, when you have texts that are very different from what the vector embedding models are trained on. I'd say the iterative one is just for answering complex questions. And I think actually most questions people have are not just straightforward questions. So for hybrid RAG, it depends on the types of texts that you're using.
Valdimar Eggertsson [00:14:28]: If it's some obscure language, like law, then maybe it's good to use text search. But the iterative one is for complex questions like that. Yeah, hope that answers it. I think we can just look at the picture to understand this. There's a lot of text here, but it's quite simple. So this is DRAG versus IterDRAG, and RAG is kind of like the standard version, where you just have an input query asking a question: who wrote the Return of the King? Then you search through documents, on Wikipedia, or your specific business data, or, like I was doing, tax law data for example. You find the documents that are relevant, and the LLM generates the answer.
Valdimar Eggertsson [00:15:38]: But since we can throw 100,000 tokens in there, we can just show the LLM: this is how it works, how to successfully find the answer in the documents. So we incorporate the in-context examples to get the final answer, which gives us this few-percentage-point performance boost. But then the iterative one has a loop, where you can ask five times, or N times; it depends on how many tokens you want to spend. You break the question down into sub-queries. So, which movie won the Academy Award in 2003 for Best Film? You get that answer, you know it's Return of the King, then you can ask: who wrote Return of the King? And every time you retrieve a document, you make an intermediate answer, and I think they just concatenate everything together in the end, up to 5 million tokens. Yeah, 5 million over five iterations.
Valdimar Eggertsson [00:16:43]: Actually, it was like 1 million each iteration, and you get the final response. Does that make sense? Every time, they include the in-context examples of how you can use a sub-query and the intermediate answer to continue the process, and they just iterate this again and again, and that's what gives the significant performance boost, if you look at the graph again; it was jumping from 50% to 76%. It's about doing these multi-hop questions, which involve going between Wikipedia articles: fetching different Wikipedia articles based on different sub-queries and combining them, synthesizing the information into the final answer. Which I think is pretty cool. Nice that we can do this now; we couldn't do this with 8,000 tokens with GPT-4 one year ago. Yeah, this is basically it. Sophia, are you going to take it away and go into this table?
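A minimal sketch of the IterDRAG-style loop just described, with stubbed-out `retrieve()` and `llm()` placeholders standing in for a real retriever and model call (assumptions for illustration, not the paper's code): each iteration produces a sub-query, retrieves documents for it, and records an intermediate answer, and everything accumulated is used for the final answer.

```python
# Sketch of an IterDRAG-style loop; retrieve() and llm() are hypothetical stubs.

def retrieve(query, k=5):
    return [f"<doc about: {query}>"] * k            # stub retriever

def llm(prompt):
    return f"<model output for: {prompt[:40]}...>"  # stub LLM call

def iter_drag(question, demos, n_iterations=5):
    # demos: in-context IterDRAG demonstrations shown before the test question
    state = demos + [f"Question: {question}"]
    for _ in range(n_iterations):
        sub_query = llm("Generate the next sub-query given:\n" + "\n".join(state))
        docs = retrieve(sub_query)                   # retrieval for this sub-query
        answer = llm("Answer the sub-query given:\n" + "\n".join(state + docs + [sub_query]))
        state += docs + [f"Sub-query: {sub_query}", f"Intermediate answer: {answer}"]
    return llm("Give the final answer given:\n" + "\n".join(state))

print(iter_drag("Who wrote the book behind the 2003 Best Picture winner?", demos=[]))
```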
Sophia Skowronski [00:17:57]: Maybe, and yeah, yeah, of course. I created slides, so I'm going to share my screen real quick. I'm covering the approach that they took to generate all those data points around optimal parameters, and then Adam will go into the parameter estimation piece of the paper. So let me give you an overview of that. I first just wanted to start with this equation that they had at the beginning of this section; again, it's their approach for generating these optimal parameters. This is all about starting with a maximum context length, L_max, and figuring out the best way to allocate that budget given different parameter configurations. So this is where they've abstracted the number of documents retrieved for RAG, which is K, and the number of in-context examples...
Sophia Skowronski [00:19:00]: Yeah, and they are calling that variable M. And then the number of iterations in IterDRAG, and they're calling that N. DRAG doesn't have iterative steps, so N will always equal one for DRAG. So these are the parameters that change for each experiment. And given a set of input queries and their ground truth answers, we use a strategy f(x; θ) to make a prediction.
Sophia Skowronski [00:19:32]: And the strategy is DRAG, IterDRAG, or some of the other baseline methods that they threw in for comparison, like many-shot QA or zero-shot QA. They then use a strategy and limit the context length for each strategy to ensure that it's below this L_max. And then, to test different configurations of θ, they select the one that gives the best average performance across the entire dataset while keeping the context length under L_max. So they're basically asking how to allocate limited resources to get the best predictions from the model: either retrieving fewer but better documents, or fewer but better examples, or adjusting how much context we feed into the prompt. And feel free to cut me off and ask a question. So again, just going into more specifics of what they're doing here, before I go into that table of results which Valdimar showed: they vary L_max by number of tokens, and they also had all these different parameter configurations; you can see the ranges here.
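In symbols, the setup Sophia is describing is roughly a budget-constrained search over θ = (K, M, N), where P is the evaluation metric, f(x; θ) the strategy's prediction, and l(x; θ) the resulting context length. This is a hedged reconstruction from the discussion, so the paper's exact notation may differ.

```latex
% Hedged reconstruction: pick the inference parameters theta = (k, m, n)
% (documents, in-context examples, iterations; n = 1 for DRAG) that maximize
% average performance subject to the context-length budget L_max.
\theta^{*}(L_{\max}) \;=\; \arg\max_{\theta}\;
  \mathbb{E}_{(x,\,y)}\!\left[ P\big(y,\, f(x;\theta)\big) \right]
  \quad \text{subject to} \quad l(x;\theta) \,\le\, L_{\max}
```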
Sophia Skowronski [00:21:04]: And then the strategies are IterDRAG, DRAG, zero-shot QA, many-shot QA, and just basic RAG. And they pointed out that not all of these strategies can actually scale to the maximum context length. Zero-shot QA is limited because you can't add context to it, whether that's documents or examples; all those other variables are set to zero. And then many-shot QA maxes out at 256 examples, and I think each example is capped at 1,024 tokens. So I think they said that places it around the 16k context limit.
Sophia Skowronski [00:21:57]: I'm not sure exactly how, because it seems like it might be able to go beyond that, but maybe not. And then they say that zero-shot QA also stops at this limit. And then RAG doesn't scale beyond 128... oh wait, okay, with the 1,024-token limit for each retrieved document, that's where this 128k limit comes from. So let me make sure that I get everything. Yeah, and then DRAG is just limited by the context limit of the LLM, because it's one prompt.
Sophia Skowronski [00:22:38]: So the largest context window is a million for the Gemini model; that's where that limit comes from. Whereas IterDRAG can split a query into multiple sub-queries that can each leverage that high context limit, and basically have unlimited context at their disposal. And so, here's the gist of all the results. There are a few takeaways. So again, they average the performance metric for each experiment, and you can see there are a number of grid-search experiments being done. I would be curious to see how expensive this was to generate, too.
Sophia Skowronski [00:23:25]: But in any case, the first takeaway, which you can kind of see from the bolded text, is that DRAG and IterDRAG improve as L_max grows. And the second is that DRAG is more effective at the smaller long-context windows, and IterDRAG is more effective at 128k tokens and above. So you can get a sense that, overall, increasing L_max is beneficial, but you need to figure out the optimal parameter configuration. And then it goes into this plot as well: performance versus effective context length. The effective context length is the lowercase l here, the context length that results from using those configured parameters, whether that's the number of documents, in-context examples, or iterative runs. So let me go back here. So yeah, they plot P versus effective context length and then apply this fitting here.
Sophia Skowronski [00:24:46]: It appears to be a linear line, but as Valdimar said, the x-axis is logged. So there appears to be linear scaling: it scales as you increase the parameters, K or M or N. So again, you can see, with the green overtaking here at the higher context lengths, that IterDRAG is stronger at larger scales. And then they did this also, this is in the appendix in the back, on the individual datasets.
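In other words, a straight line on the log-scaled x-axis corresponds roughly to performance growing linearly in the logarithm of the effective context length. This is only a paraphrase of the observation, not the paper's fitted equation.

```latex
% Illustrative only: approximately linear growth in the log of the effective
% context length l_eff, i.e. a straight line on a log-x plot.
P(l_{\mathrm{eff}}) \;\approx\; a \cdot \log l_{\mathrm{eff}} \;+\; b
```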
Sophia Skowronski [00:25:36]: And you can see RAG versus the other approaches, and how there's a diminishing-returns plateauing of RAG versus DRAG and IterDRAG, which let you get, I don't know how significant this is, yeah, same y-axis scaling, but they let you get a little bit further in performance when looking across these datasets, at least for this one. Which one is this? This is HotpotQA. So it's pretty interesting. And then, in the last part of this section, they show a normalized performance for a consistent configuration and then progressively increase either the number of documents or the number of examples, or shots. And their main takeaway here is that not all configurations are equally helpful. So for a fixed configuration, increasing the number of retrieved documents generally shows better performance.
Sophia Skowronski [00:26:52]: And then, let's see, increasing the number of shots is typically more beneficial for IterDRAG. And they point out this range right here, where going from 0 to 1 shot, IterDRAG does a lot more with that than DRAG does on its own. In both plots you can also see there seems to be a point for K or M where there are diminishing returns again, where it's plateauing, or in the case of IterDRAG here, decreasing. So there's a point at which increasing the context no longer helps, or appears to be confusing the model, in this range here. And that's the gist of how they approached generating all these experiments. Now they're going to apply some least-squares estimation to be able to predict what the optimal configuration could be, given, I believe, the dataset. What was it? There are three different things they take as input. The dataset...
Sophia Skowronski [00:28:18]: Yeah, I guess Adam will cover this. So I'll just hand it off to him before I start rambling.
Adam Becker [00:28:24]: So, right, you left off on a couple of points. It's interesting to see that if you hold, let's say, the number of shots constant, but then increase the number of documents, each of these different mechanisms is going to react differently with respect to performance. And the same thing for the number of shots, right? Let's say, okay, fine, we'll just show the model a single document, or 10 documents, or 100 documents, but let's vary the number of shots; then again, different mechanisms are going to react differently with respect to the performance. So then the question is, okay, fine.
Adam Becker [00:29:04]: There's a relationship between the number of shots, the number of documents that we retrieve, and the performance. I want to know how many shots and how many documents I should show, given a particular budget. How do we go about it? So then they take a step back and say, you know what, we have these sort of scaling laws; we can help you predict what would be the optimal set of parameters.
Adam Becker [00:29:31]: Is that clear for everybody? I think you can see it here. I'm not entirely sure if each one of these is a different configuration, but you can sort of see here that there are lots of different configurations, and they might just produce lower performance. We want to find the ones that produce the highest performance, and try to model the relationship between the hyperparameters and that performance. That's what they're doing in section five. It's basically a way they came up with to quantify the relationship and to predict it. I see there are some things in the chat. Okay.
Adam Becker [00:30:09]: They're trying to predict it. Okay. So they're going about it in section five in a couple of different ways. First, I created this mind map. We have the inference parameters; those are the things that Sophia already spoke about. So in the case of DRAG, this is the vector theta.
Adam Becker [00:30:26]: It includes a couple of things for DRAG, which are the number of retrieved documents and the number of in-context examples, and then IterDRAG adds the number of iterations of retrieval and generation. The way they split up the section is: first, let's construct the model, and then let's validate whether our construction makes any sense. So we're going to look at both. But before we do, Sophia, I believe that you put in the...
Adam Becker [00:30:53]: No, in the chat earlier, that we have some visual examples. For people who haven't yet looked at the paper, this might be particularly useful; I think it really just started making sense once I saw this image. So the idea here is, I think: we looked earlier at the plot that showed how RAG simply doesn't scale. Now we have this new technology and we have a bunch of context, right? We can just continue to add more things into the context. Well, we're seeing that just doing it normally with RAG doesn't scale.
Adam Becker [00:31:30]: We just keep plateauing. So then they come up with all these other mechanisms. And one of those mechanisms is: you basically dump a bunch of RAG examples into the context. We say, well, we can fit in now, with a million or 2 million tokens, something like 40 or 50 different novels; you're just packing in so much. So you're like, all right, here are five documents, and here is a query, and this is how a single RAG example would normally work: this is the answer that I expect from having retrieved these five documents. And each one of those is just a big example.
Adam Becker [00:32:10]: And now you just keep showing it a bunch of different examples, so that it learns how to filter through and discover the right answer from a set of different documents, because you've shown it a bunch of different examples of this. You can sort of see how they do it here. So anyway, below, if people are curious, you should be able to see the actual example of how they do it and what the query is. Okay, so again, now we have these hyperparameters: how many documents should we be retrieving in the first place, how many examples should we be showing it, and how many iterations. And the idea is to try to find the optimal configuration. They did that empirically in the beginning, and now we're trying to model it.
Adam Becker [00:32:54]: All right, so now let's see how they're going about modeling it. Okay, this is the equation that they're trying to fit. So P here, that's simply our performance; it could be accuracy. Theta is that hyperparameter vector, the one that includes, what is it, the number of demonstrations, the number of documents, and the maximum iterations.
Adam Becker [00:33:18]: And a, b, and c are going to be learned. I is pretty interesting here. So with I, we have to in some way account for the fact that there's a different informativeness level for each dataset, each task. It could be the case that for one task you're just getting really bad documents and they're not all that informative. And so going from, let's say, zero shots to one shot is already quite valuable, or going from no documents retrieved to any documents retrieved is quite valuable.
Adam Becker [00:33:58]: So they're coming up with this other vector, I, that essentially measures the informativeness of both the documents and the in-context examples. That one is going to be very dataset specific, very task specific, and we have to learn it for every single task. Essentially, it allows us to model the variance between the different datasets and between the different tasks. So this is the equation here; let me try to break it down into some intuition too. I is a vector that describes the informativeness of each additional document or each additional example. Why log theta here? The way I put it is: since performance improvements often show diminishing returns, they introduce a logarithmic relationship.
Adam Becker [00:34:47]: You don't expect that, as you keep tweaking, all of a sudden you're going to get some emergent, exponentially increasing performance; I don't think that's the expectation. Plus, you did see earlier that the performance increases linearly, but with the log, right? So I think that's how they're trying to capture that intuition. Also, I think for modeling purposes, you're modeling the performance as an inverse sigmoid, because if you were to try to directly model performance, just as P of theta, it might be challenging; it's nonlinear, and they describe the nonlinearity and the sublinearity and the different regimes.
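Putting those pieces together, the relationship being fit appears to have roughly this shape: an inverse sigmoid of performance expressed as a linear function of log θ, modulated by the task-specific informativeness vector i, with learned coefficients a, b, c. This is a hedged reconstruction from the discussion; which coefficients interact with i, and the exact form, are spelled out in the paper.

```latex
% Rough shape only, reconstructed from the discussion: sigma is the sigmoid,
% theta the inference parameters (documents, shots, iterations), i a
% task-specific informativeness vector, a, b, c learned coefficients.
\sigma^{-1}\big(P(\theta)\big) \;\approx\; \big(a + b \odot i\big)^{\top} \log\theta \;+\; c
```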
Adam Becker [00:35:31]: So they went about it with an inverse sigmoid. Okay, so that's how they're building up the framing. And a, b, c are going to be discovered along the way. Here, what you can see is basically their prediction versus where the actual values lay. And on the basis of that, they're able to figure out what actually makes the model useful. So this is 0 docs, 1 doc, 10 docs, 100 docs. You can see that, again, performance continues to increase as you increase the number of shots and as you increase the number of documents retrieved. Right. Everybody can see this.
Adam Becker [00:36:18]: So they're increasing in essentially both dimensions. We haven't necessarily predicted all that well for every single dataset, but we tend to be within the bounds. So now we've moved away from the construction of the model; now they're trying to validate it. Let's see if the model makes sense, and in which regimes it makes sense. The first thing that they do is this: they predicted the performance, and then they actually measured the performance, and then they run an ablation study where they try to turn off various knobs and see what the actual effect was.
Adam Becker [00:37:07]: So in the first case, they're just excluding b. You might remember b is the coefficient that is learned for the inter-task variance. Again, that one turned out to be quite important, because when you exclude it, the performance drops. Ultimately we're going to get to 0.9; this is 0.86. They also tried to model it with, I think, the quadratic of the log, which also doesn't seem to do as well as the linear one, and a linear sigma versus a sigmoidal sigma. You can see.
Adam Becker [00:37:44]: So basically, this is their model, ultimately. Next, they're trying to figure out whether or not they can generalize from the known domain to the unknown domain. This, in my opinion, is pretty important. And I'm not sure that I got a very clear sense from them that they generalize all that well. I think they say that they do, but I'm a little skeptical still, because basically what they're trying to say is, okay, perhaps we learned... and if somebody understood this differently, please correct me.
Adam Becker [00:38:27]: One way you could go about this is: let's say you learn from all of these examples, but you're leaving some examples out for testing. Fine. But you've still learned from all of these different tasks, and now you're trying to predict again within tasks that you had already seen; not examples you had seen, but tasks you had seen. That would be one way to go about it. But that doesn't prove that you can generalize well to unseen datasets, which is likely to be how it's used in production. So the other way might be: learn on a couple of different tasks, say these two tasks, and then see whether or not we generalize well to the unknown tasks. If that's been other people's impression, good.
Adam Becker [00:39:15]: Otherwise, please let me know if you've read that differently, because that feels to me like probably the most important thing here, and I would expect to see much more written on it either way. And I think the fact that, at inference time, you still have to derive I from the target domain makes it tricky; I don't know how people would go about actually operationalizing this, actually deriving it. So I don't know, maybe if we have some time for Q&A afterwards. I'm very curious about how people
Adam Becker [00:39:53]: would think about operationalizing this. So that's the first thing that they're doing: they're trying to say, okay, does it actually generalize? And it seems to them that it does. The next is: can we extrapolate from lower effective lengths to higher effective lengths? That's what they're doing here, actually; maybe you can see it in the diagram, or I think it might be in the example Sophia showed. If you learn this regime here, would you be able to extrapolate the performance over here? That's how they're trying to frame this, I think.
Adam Becker [00:40:32]: And the answer is yes, sometimes you can extrapolate, but it depends on which transition, which regime you're transitioning from. So from 16k to 32k: okay, the oracle prediction, the best prediction, is going to be 39, and you can predict 37.04. This is for the exact match.
Adam Becker [00:40:58]: So you can do a decent job from 16k to 32k. From 32k to 128k, it's not good. From 128k to a million, still decent; see, 48 versus 51. A million to 5 million: not good. So they're saying, okay, we'll just keep it below a million; that should be good.
Adam Becker [00:41:21]: But extrapolating from 32k to 128k, that's challenging, because as you saw before, DRAG performs best around 32k and IterDRAG excels at long context. So we almost have a kind of phase transition there. So that's how they're trying to model the length. Okay, some discussion here too. One thing they point to as very important is retrieval. The more documents you include, the more you increase your recall; it doesn't stop, right? At some point you're going to get to like 100%, fine.
Adam Becker [00:42:02]: But all of the other metrics don't seem to be increasing all that much, and you might be introducing noise. So they talk about this divergence: it suggests that while more documents lead to better recall, the relevance and ranking quality do not improve proportionally and can even introduce extensive noise. The takeaway for them is that we have to invest in much better methods of retrieval. And I think they also point out that we have to invest in making sure that what we are retrieving is accurate, and they point to this as another thing to focus on.
Adam Becker [00:42:44]: So we have retrieval here. Then they break it down into the different sources of error, and they say we have: inaccurate or outdated retrieval; incorrect or lacking reasoning; hallucination or unfaithful reasoning; and evaluation issues or refusal to answer. Evaluation issues might be, let's say, when you have a date that is misformatted. The first category highlights the need for enhancing retrieval methods and maintaining a reliable and up-to-date knowledge base, especially for complex questions that rely on multiple supporting facts. There's something here that's kind of interesting too, which is: how do you highlight the relevance of a particular document? They have some sorting mechanism, a ranking mechanism, and on the basis of the rank, I don't know if they put that in any one of the diagrams, but on the basis of the rank they place the highest-ranking document closer to the query.
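The ordering detail Adam mentions, placing the highest-ranked document closest to the query, could be sketched like this; it is only an illustration of the idea with a hypothetical helper, not the paper's implementation.

```python
# Sketch: order retrieved documents so the highest-scoring one ends up
# adjacent to the query appended at the end of the context.

def order_docs_for_prompt(docs_with_scores):
    # docs_with_scores: list of (document_text, retrieval_score) pairs.
    # Ascending sort by score puts the best document last, i.e. closest
    # to the query that follows the documents in the prompt.
    return [doc for doc, _ in sorted(docs_with_scores, key=lambda pair: pair[1])]

docs = [("doc A", 0.91), ("doc B", 0.42), ("doc C", 0.77)]
prompt = "\n".join(order_docs_for_prompt(docs)) + "\nQuestion: ...\nAnswer:"
print(prompt)
```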
Adam Becker [00:43:44]: Well, that sounds like a whole other set of hyperparameters that can be explored, right? What if you jiggle this? I don't think they put that here, but the idea is that the documents closest to the query are the ones that, at inference time, are most likely to be the most useful. Sounds a little bit like witchcraft. On long-context modeling: we find that retrieving more documents is generally beneficial for RAG performance; nevertheless, naively extending the context length in each generation step does not always lead to better results. I think I have one more thing here: the model's ability to identify relevant information from extensive context remains to be improved, especially when presented with a large quantity of similar documents. That's it.
Adam Becker [00:44:29]: I think that's all. That's all I have.
Binoy Pirera [00:44:32]: I don't think we have a question in the chat.
Adam Becker [00:44:34]: Right.
Binoy Pirera [00:44:35]: I don't know if you covered this already, but do you want to check that out? Question from Gavin.
Adam Becker [00:44:40]: "I was highly concerned about the optimal performance approach, as it seemed to be using test results to do parameter selection and then reporting performance of the selected parameters using the same test results. The discussion of ways to pick the parameters was good, but it wasn't clear to me if that was part of generating the earlier graphs in the results about log-linear performance. It seemed more speculative." I think that they... yeah, exactly. My takeaway was that
Adam Becker [00:45:03]: I don't know if they got into details sufficiently about the rigor of actually running that experiment. I don't know what they've held back, and I don't know how they did their validation, in a sense. Who knows? Does anybody else know? Has anybody else picked up on that?
Emre Şahin [00:45:27]: There's no comparison. All these tests are done with Gemini Flash, right? Gemini 1.5 Flash. There's no comparison with other LLMs, or with approaches like chain of thought, which might be an alternative to this approach; I think that kind of approach is very similar. For example, o1 may be performing better in these tasks or something. I'd like to see some comparison with that.
Sophia Skowronski [00:45:59]: Yeah, I think if we scroll to the appendix, they did have one specific parameter setup and compared chain of thought with IterDRAG, and they showed, I think it was K equals 4, M equals 5, that for this range of parameters IterDRAG is better than chain of thought. I'm trying to find what page the table was on real quick. But that's the only comparison with other strategies; I know there are a lot of other long-context modeling or RAG strategies. Oh yeah, it's Appendix B, chain of thought versus IterDRAG. But again, to your point, I was surprised, I imagine it's cost constraints, that they didn't compare with other large LLMs either.
Binoy Pirera [00:46:51]: All right, so does anybody else have any more questions?
Adam Becker [00:46:55]: Oh, here we go.
Binoy Pirera [00:46:56]: Let me read it out so everybody can hear it: does the paper talk about the optimal amount of relevant documents or chunks to be fit as part of the prompt? My current client implementation leverages the 100 most relevant chunks by default and includes them in the prompt. I am seeing better results reducing content.
Sophia Skowronski [00:47:15]: I think that was one of the variables that they were changing in generating these plots: they were changing the number of documents fed into the prompt as part of RAG. And it seemed like increasing documents overall led to higher performance, but it all kind of depends on the method, the evaluation metric, and the dataset that you're using. So it's non-trivial.
Adam Becker [00:47:46]: Can I add to that a little bit, Stephen? I think the other thing to keep in mind is that they're not evaluating just the number of documents as you would in a typical RAG paper. They're showing how you would pull a bunch of different documents within their mechanism. I don't know if you've implemented IterDRAG or DRAG; if you haven't implemented those, I wouldn't generalize to the typical kind of pure RAG case. Because they're showing a bunch of different documents, and then a bunch of examples of how to pull from those documents, and then they're feeding all of that into the prompt.
Adam Becker [00:48:36]: So unless it's a similar setting, I'm not sure I would extrapolate. Yeah, yeah, it makes sense. Thank you.
Binoy Pirera [00:48:46]: Well, all right. So thanks, everybody, for joining. Once again, if you want to keep the conversation rolling, I've dropped the link to join our Slack; you can join our reading group channel. If you have any suggestions as to what we should discuss next month, whatever it is you want to discuss, please go ahead and join us there and we can keep the conversation rolling. Again, a big thank you to Adam, Valdimar, and Sophia for joining. Brilliant presentation, guys. I'm sure all of you found tremendous value.
Binoy Pirera [00:49:13]: So we're doing this again same time, same place next month. So see you all.