Exploring Long Context Language Models
Korri Jones is a Sr Lead Machine Learning Engineer and Innovation Coach at Chick-fil-A, Inc. in Atlanta, Georgia where he is focused on MLOps. Prior to his work at Chick-fil-A, he worked as a Business Analyst and product trainer for NavMD, Inc., was an adjunct professor at Roane State Community College, and an instructor for the Project GRAD summer program at Pellissippi State Community College and the University of Tennessee, Knoxville. His accolades are just as diverse: he was in the inaugural 40 Under 40 for the University of Tennessee in 2021, was Volunteer of the Year with the Urban League of Greater Atlanta with over 1,000 hours in a single calendar year, and has received the “Looking to the Future” award within his department at Chick-fil-A, among many others, including best speaker awards in business case competitions. However, the best award he has received so far is being a loving husband to his wife Lydia.
Sonam is a data scientist turned developer advocate.
Hey! I’m Nehil Jain, an Applied AI Consultant in the SF area. I specialize in enhancing business performance with AI/ML applications. With a solid background in AI engineering and experience at QuantumBlack, McKinsey, and Super.com, I transform complex business challenges into practical, scalable AI solutions. I focus on GenAI, MLOps, and modern data platforms. I lead projects that not only scale operations but also reduce costs and improve decision-making. I stay updated with the latest in machine learning and data engineering to develop effective, business-aligned tech solutions. Whether it’s improving customer experiences, streamlining operations, or driving AI innovation, my goal is to deliver tangible, impactful value. Interested in leveraging your data as a key asset? Let’s chat.
Binoy Pirera [00:00:00]: So, good morning, good afternoon, good evening, wherever you guys are joining us from. So, welcome to the August session of the MLOps Community reading group. We haven't done these in a while, and so by popular demand, we're bringing it back. So we wanted to create this space for, you know, people in the AI and ML space that are actual practitioners who have to live the trials of doing AI and ML to kind of, you know, come together and read all these technical papers and, you know, really share some insights and knowledge. And so it's good to have you all here today. And also, if you guys haven't read the paper, I'm gonna drop the link to the paper in the call chat so you can check it out there. And it's totally fine if you haven't read it. We're gonna completely take you through it.
Binoy Pirera [00:00:47]: And we have three amazing moderators joining us today to help navigate this conversation. And all three of these people have been a part of this community for a while now. So, first off, we have Korri Jones joining us, who used to be one of the main organizers of the reading group a while back. He is a Sr Lead Machine Learning Engineer at Chick-fil-A. And we also got Nehil Jain, the co-founder of an ERP startup that's currently in stealth mode. I think he can tell you a little bit about that later on. And last but not least, we have the wonderful Sonam Gupta, who is the chapter lead at AICamp. So without further ado, I want to hand the mic over to Sonam.
Binoy Pirera [00:01:28]: Sonam can introduce the paper and get us started.
Sonam Gupta [00:01:31]: Thank you, Binoy. It's great seeing folks from all over the world here. That's pretty awesome. And yeah, this is my first time being a part of a reading group, so let's see how it goes. Hope people have got a chance to look at the paper. It's a very interesting paper from the authors at Google DeepMind. The title of the paper is "Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More?" The title itself is quite intriguing actually, to be honest. But before we get into the paper, I'd love to cover some basics.
Sonam Gupta [00:02:05]: So what exactly are LCLMs? It stands for long-context language models, and they are nothing but language models with a much larger context window. And by that, I mean that these models can handle and analyze longer pieces of text, like entire articles, books, or longer conversations, rather than just short snippets. And this ability allows them to generate more accurate and contextually appropriate responses. So this is just the basic overview of what these models really are and why that large context matters. As for use cases that I've seen people use LCLMs for: these models are particularly useful for tasks that involve, for example, summarizing a lengthy book or analyzing those complicated, comprehensive legal documents that I personally do not understand the wording of. But yeah, those could be some of the use cases. But when you think about having a large context and having the model read an entire book and parse through it, you would probably think that must be challenging for the model, because, yeah, it's great that you can parse an entire book with such models, but it could definitely be computationally expensive and may require more resources. So, sorry, those were some of the basics of these models.
Sonam Gupta [00:03:36]: Now, I wanted to understand what the motivation was for these authors to work on this. So basically, they were motivated by the limitations of existing language models, which typically handle relatively short context windows of at most a few thousand tokens. And these limitations would hinder the ability to effectively perform tasks that require understanding and generating text based on extensive context. But the primary motivation was to explore whether LCLMs could subsume traditional methods like RAG, which stands for retrieval-augmented generation, or SQL queries, and the fact that there is a huge need for rigorous evaluation of these models, because I feel, or the authors basically feel, that we haven't fully realized their capabilities on real-world use cases. So taking this motivation, what they built was the Long-Context Frontiers benchmark, abbreviated as LOFT, to evaluate the performance of LCLMs. And this benchmark included a variety of tasks designed to test the capabilities of these models in handling extended contexts of different sizes, such as information retrieval from large corpora of documents without any external or explicit retrieval mechanisms. Then also question answering systems, and also structured data queries to assess the model's ability to process and then generate responses to natural language queries that would typically require SQL.
Sonam Gupta [00:05:31]: So this was the brief idea of what was built. I'm sure my co-hosts will get into a much deeper analysis and the technical stuff behind it, but what were the datasets? Of course we would need datasets to proceed with this benchmark evaluation framework. The authors used a variety of datasets, and in the paper there's a nice table which lists out the different datasets. And they sampled up to 100 test queries, five few-shot queries, and then ten development queries. And the whole idea was, since these are long-context models, they wanted to test different context lengths for models like Gemini 1.5 Pro, I think, and then GPT-4o and also the newer version of Claude 3. And the context lengths range from 32k to 128k and a million tokens. And to keep the testing and evaluation fair, the authors processed each of these datasets to have the same evaluation queries across all these different context lengths that I mentioned, except for the SQL queries. And to talk a little bit more about the datasets without getting too deep into it: there were diverse retrieval and RAG datasets that cover tasks from the benchmark for information retrieval, also known as BEIR.
Sonam Gupta [00:07:07]: I don't know how people pronounce it, but that one, plus multi-turn conversational question answering and multimodal data using a shared corpus that includes gold and random passages. And they also adapted many-shot in-context learning datasets from Big-Bench Hard and LongICLBench, ensuring that there is a good representation of each of the classes from these datasets. And the SQL reasoning task was evaluated using, I think they're called the Spider and SparC datasets, with context-specific databases selected to fit within the specified context length. And each of these corpora was designed to accommodate the instruction and formatting needs. We'll learn more about it as we go further into the paper. Now, next up is how they designed and built the prompts. And this part is very interesting to me because I have not read about this kind of prompting before. So of course we are assuming that we are familiar with what prompt engineering is and how crucial it is to write a good and detailed prompt so that these LLMs can produce as accurate a result as possible.
Sonam Gupta [00:08:39]: Now, given that we have long-context language models, we would think that we would need a more tailored prompting technique. And in this paper, the authors introduced corpus-in-context prompting. What that means is it effectively combines the existing prompting strategies that we are aware of, refining them to leverage the unique abilities of LCLMs for different tasks like learning, retrieving, and even reasoning over in-context corpora. And they talked about how they designed the prompt. And I'll quickly run down the four main things that were mentioned in the paper. It was divided into four parts. First was instructions. Second was corpus formatting. Third was few-shot examples.
Sonam Gupta [00:09:30]: Fourth was query formatting. For instructions, you can think of it as a task-specific instruction to guide the behavior of the model. Second, for corpus formatting, they used careful formatting techniques: they put the document ID at the beginning and the end of each passage for the retrieval task. And in the few-shot examples, they gave samples of what the task or the question is for the LLM to answer, and also told it where it could find the answer in the document that the user provides. And one thing the authors mentioned was that chain-of-thought reasoning definitely helped a lot, especially in tasks that required multi-hop question answering or the complex kinds. And last is the query formatting.
Sonam Gupta [00:10:25]: The query that you want to evaluate must be formatted similarly to the few-shot examples that you provided. To refer back, there is a figure in the paper that shows a very good example of what a corpus-in-context prompt should look like. And so when I was reading about the prompt techniques, I was curious, and this question is for our next host or anybody in the audience: do you think that this approach can replace all traditional retrieval methods, or are there any limitations? That's my question. And with that I hand over the mic to Nehil.
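(To make the four-part structure concrete, here is a minimal sketch, not taken from the paper, of how a corpus-in-context prompt could be assembled. The helper name, ID scheme, and wording are hypothetical illustrations; only the layout of instructions, corpus formatting with IDs, few-shot examples, and query formatting follows the description above.)

```python
# Minimal sketch (not from the paper) of a corpus-in-context (CiC) prompt.
# The helper name, ID scheme, and wording are hypothetical; only the overall
# layout (instructions -> corpus with IDs -> few-shot examples -> query) is
# taken from the discussion above.

def build_cic_prompt(instruction, corpus, few_shot_examples, query):
    """corpus: list of (doc_id, passage); few_shot_examples: list of (question, gold_doc_id)."""
    parts = [instruction, "", "== Corpus =="]
    for doc_id, passage in corpus:
        # The document ID is repeated at the start and end of each passage,
        # mirroring the corpus formatting described for retrieval tasks.
        parts.append(f"ID: {doc_id} | {passage} | END ID: {doc_id}")
    parts.append("")
    parts.append("== Examples ==")
    for question, gold_id in few_shot_examples:
        parts.append(f"Question: {question}\nAnswer: the relevant document is ID {gold_id}")
    # The query goes last and is formatted exactly like the few-shot examples.
    parts.append("")
    parts.append(f"== Query ==\nQuestion: {query}\nAnswer:")
    return "\n".join(parts)

prompt = build_cic_prompt(
    instruction="You are given a corpus of documents. Answer using only that corpus.",
    corpus=[("0001", "Paris is the capital of France."),
            ("0002", "The Eiffel Tower was completed in 1889.")],
    few_shot_examples=[("When was the Eiffel Tower completed?", "0002")],
    query="What is the capital of France?",
)
print(prompt)
```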
Nehil Jain [00:11:07]: Does anyone have questions or thoughts to discuss?
Sonam Gupta [00:11:11]: Or if anybody has an answer to my question too, that would work too.
Samantha Zetlin [00:11:16]: Working with Opus, one thing we've noticed is that even though you've got a longer context, it still has trouble paying attention to the stuff in the middle of the prompt. So if you have a really big prompt, it still seems to forget things. And I don't quite understand what to do about that other than repeating what we really needed to pay attention to.
Matt Squire [00:11:35]: Wasn't there another paper titled like the missing middle or something?
Sonam Gupta [00:11:39]: Yeah, yeah, "Lost in the Middle." That's the one I was going to mention.
Samantha Zetlin [00:11:42]: Yeah, yeah, exactly. And did they talk about that in this paper? I didn't, haven't read the whole thing.
Sonam Gupta [00:11:47]: Yet, so I don't remember.
Nehil Jain [00:11:50]: Yeah, they talk about it. I think they're almost assuming that that is a problem that needs to be evaluated and that you have to get better on. So the analysis is written in a way where they know that lost in the middle is a problem, and they call it positional querying, I think, where you're basically trying to figure out what the right position is and how the performance degrades. I'll go a little bit over what they're saying in this paper around that, and there are other things, but yeah, that's definitely one of the biggest known problems of long context right now.
Sonam Gupta [00:12:24]: Right. And also I feel like the providing that document id in the beginning and the end, just to see, like, where to get the context from. Maybe that is a point that would help this problem. But yeah, I'll let you take over.
Nehil Jain [00:12:40]: Cool. Any other thought, guys? I can start sharing my screen. I just wanted to share, not like talking in the ether, but looking at my notes on the scribbles I made on the paper.
Matt Squire [00:12:53]: There was one thing. Oh, sorry. Yeah, one thing I was wondering about. I mean, there's the lost-in-the-middle point, which was made, but the other thing was, suppose that this can make RAG obsolete. Would you want to do that level of computation? It implies that you'd want to load all of your data, absolutely everything, and that can potentially be a lot of data and a lot of computation. So would you want to do that just because you can, if you see what I mean?
Nehil Jain [00:13:24]: That's perfect. That's the first paragraph I was going to talk about. So they start the analysis of the whole piece by talking about efficiency, the fact that if you're putting 1 million tokens in the context, that's very expensive computationally, both for the model and also for you to repeatedly ask different questions off of that. Whereas if you retrieve from a database using whatever the vector databases are doing for semantic search, for example, that's more efficient. But there is this technique called prefix caching, where you can process that part of the prompt once and then not do it again, and only change your few-shot examples and query again and again. And so you're still not spending the compute resources to work with that big corpus again and again. And so that was their idea on how to bypass that problem.
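(A rough sketch of the prefix-caching idea described here, assuming a hypothetical caching API; the functions below are placeholder names, not a real SDK. The point is simply that the long corpus prefix is processed once and only the short suffix changes on each query.)

```python
# Hypothetical sketch of prefix caching (placeholder functions, not a real SDK):
# process the huge corpus prefix once, then reuse it while only the short
# suffix (few-shot examples + query) changes on each call.

def cache_corpus_prefix(model_name, corpus_prompt):
    # In a real system this would return a handle to the model's cached
    # internal state (e.g. KV cache) for the corpus tokens.
    return {"model": model_name, "prefix": corpus_prompt}

def generate_with_cached_prefix(cached, suffix):
    # Only the suffix is processed from scratch; the cached corpus prefix is
    # reused, which is what makes repeated queries over ~1M tokens affordable.
    full_prompt = cached["prefix"] + "\n" + suffix
    return f"[{cached['model']} completion for a {len(full_prompt)}-char prompt]"

cached = cache_corpus_prefix("long-context-model", corpus_prompt="<entire corpus with doc IDs>")
for q in ["Which document mentions prefix caching?", "Who wrote document 0042?"]:
    print(generate_with_cached_prefix(cached, f"Question: {q}\nAnswer:"))
```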
Nehil Jain [00:14:19]: And because empirically they found that it is better to put the query at the end. Now I'm spilling the beans on the whole analysis, but if you put the query at the end, it is much better than putting it at the top, where performance is really bad. Which means that, again, you put all your corpus at the beginning and you cache it, and then you only keep changing the bottom part, and it will perform the best. That's their way of solving that. So, as Sonam was saying, it is very important to carefully evaluate all these systems, and we have seen so much drama happen around benchmarks as well. So here they're talking about evaluating the tasks on multiple different datasets and then different models and so on and so forth. Of course, they're trying to say that, you know, we are the best.
Nehil Jain [00:15:13]: There's always a little bit of that going on in all these papers, but not in SQL; there they're saying that we couldn't beat the specialized approach. And so let's just go section by section, which was very interesting for me, because this poses the question of, okay, if long-context models can solve something better than the current way of doing it with specialized models, SLMs, or even the common models, then should we just move to in-context learning and make everything use LCLMs and reduce the complexity? And there is value in doing that. So for text retrieval, which is the task where you have huge amounts of text, let's say you have Harry Potter and you want to find a given passage: can you retrieve the right passage? That's all text retrieval is. So it is the R of RAG. It's the same thing, but given a lot of data, can you retrieve the right data given some kind of query? The assumption mostly is that it's semantic; it's not like you're searching for specific keywords. And they're comparing against specialized models, which are the latest or SOTA models for that specific task.
Nehil Jain [00:16:23]: So in this case they're using Gecko, which is a Google embedding model. I didn't know about it; I came to know about it through the paper. They released it in March, and currently I think it holds the best results against these benchmarks on text retrieval. And they actually found that it is almost comparable, if not better, to use LCLMs for text retrieval. And they talk about this, and I think we were talking about the position of where you can retrieve stuff, lost in the middle, et cetera. And so there's this graph which shows that if the gold passage is at the beginning, retrieval is really good, and as it moves toward the end, or from the middle to the end, the performance is not the best.
Nehil Jain [00:17:12]: Although that being said, if you add few-shot examples near the part of the corpus that needs to be retrieved for your query, which is almost cheating in my mind, then part of the corpus is relevant and you're putting your few-shot examples there to engineer it that way, so you half know the answer already. So it's kind of cheating. But that's what they found. And yeah, they perform comparably, that's the main thing. And as you move further into the corpus, the performance degrades. So that's text retrieval. Any thoughts? More questions?
Emre Sahin [00:17:58]: Are the prompts updated and optimized for each LLM, or are they using the same prompt for all of them? I think they're using the same prompt for all of them. So they optimize for Gemini and use the same prompt for the others. Then Gemini will be the most successful one, and the other two will be less so. Yeah.
Nehil Jain [00:18:26]: Yeah, that makes sense.
Emre Sahin [00:18:28]: Yeah. Okay.
Nehil Jain [00:18:29]: Yeah.
Korri Jones [00:18:30]: I think you're getting at, like, the whole research question. Like, is it kind of repeatable in a way? Can I take this result and repeat it in a way that makes sense, and does the data hold up? Again, we're all data people, right? The math needs to add up. It's not optimized for each individual model to get the best possible outcome, because there are nuances within each one. Does that mean the numbers are valid, or are they invalid? And so you start asking those types of questions, but I'll be quiet. I was gonna talk a little bit about that, but go ahead.
Nehil Jain [00:19:04]: Awesome. Yeah, it's amazing how the question just leads us to the next thing we are going to talk about. So that's awesome. Visual retrieval, same thing. They're talking about retrieving frames from a list of frames, and CLIP, which is an OpenAI model, is the SOTA model here, the state-of-the-art model, which is doing the best. And they said Claude 3 doesn't even take the amount of data that they want to throw at it, so it's out of the benchmark. And I think what, again, they're finding is that it's better: Gemini is better than OpenAI's CLIP model, but, you know, take it with a grain of salt, and Korri will dive deeper into that.
Nehil Jain [00:19:52]: Audio retrieval, similar thing. I think they're saying that for audio, both the other models don't even take audio as input. Gemini is the first multimodal, API-based model which can take audio. So it's the best; it's better than whatever exists, better than the specialist models. Then come the interesting pieces, at least personally for me, since I've been building a lot with AI.
Nehil Jain [00:20:16]: RAG is a very common thing you end up doing. And then in RAG, well, what if I don't have to set up all the different pieces to build RAG and can just put everything in context and retrieve over it and reason over it? That would be the best. And it does sound like that. So the datasets they chose were very interesting datasets. There were two things I didn't know about, because I'm not up to date on reading all the papers and understanding all the different methodologies that are going on. But there are two types of tasks inside RAG that they were doing. One was multi-hop, what they call multi-hop datasets.
Nehil Jain [00:20:55]: And it's important for people building LLM applications to understand at least conceptually the meaning of what multi-hop is: you need to answer the question by hopping to multiple sources. That's why it's called multi-hop. So you can't get the answer just from one source. You have to look at multiple retrievals and then synthesize the information from there. Multi-target means even if you have one document, you're trying to answer multiple target questions or queries from it; that's multi-target. So I guess one is a little bit more complex on generation, one is more complex on retrieval. You can think about it that way, just to dumb it down. And so they tried to compare LCLMs' performance on both of them, and they did find that for multi-hop, it outperforms the current models and current ways of doing it.
Nehil Jain [00:21:49]: Of course, you have to add chain of thought, because what chain of thought helps the model do, if you read the original papers on chain of thought, is it lets the model do multiple passes over the same document or context, which helps it do better reasoning. And now that you can reason through the whole corpus, that just makes the reasoning so much more powerful. And that's why I think LCLMs are doing much better on multi-hop datasets. And multi-hop in general, I found, is a huge problem. I think the CEO or CTO of ocean recently claimed that multi-hop will never be solved with LLMs, and he's going to give up on this piece, blah, blah, blah. But maybe not, maybe LCLMs will solve it. So let's see. And then multi-target, I think, is there, but it's not really doing much better on multi-target; it's doing much better on multi-hop. And so that reduces the cost and complexity.
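(As a concrete illustration of the chain-of-thought, multi-hop style being discussed, here is a hypothetical few-shot example. The documents, IDs, and wording are invented; only the pattern of reasoning across several cited documents before answering comes from the discussion above.)

```python
# Hypothetical multi-hop, chain-of-thought few-shot example. The content is
# invented for illustration; the structure (reason over several documents,
# citing their IDs, before giving the final answer) mirrors the discussion above.
multi_hop_example = """\
Question: Which city is the author of document 0007 based in?
Reasoning: Document 0007 was written by A. Smith (ID: 0007).
Document 0031 says A. Smith works at a lab in Toronto (ID: 0031).
Final Answer: Toronto (IDs: 0007, 0031)
"""

# Such an example would go in the few-shot section of the corpus-in-context
# prompt sketched earlier.
print(multi_hop_example)
```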
Nehil Jain [00:22:49]: That was a mental note for myself, for using LCLMs for RAG. Korri, you were saying?
Bruno Lannoo [00:22:57]: Yeah, for the RAG, I had a question. Because you explained how the chain of thought helps a lot with the LCLM. But with the RAG, is this being compared to a single RAG call? Or you could also do an iterative RAG where, for every chain-of-thought step, you would allow the model to go back and do a retrieval for the next bit of information it needs. Is this being considered here? Yeah, I don't know if traditional RAG considers this kind of option.
Nehil Jain [00:23:28]: I think they are only doing... yeah, they're not making multiple calls. So what they're doing is they're doing RAG once, and they're saying you can effectively cover multiple hops by retrieving once, if the corpus is all there, so to speak.
Bruno Lannoo [00:23:41]: Yeah, but then you rely on already knowing, in step one, what you will need for step three and four of your reasoning. Whereas if you would do RAG in an iterative process, then you could use the result of the first retrieval and the reasoning after it to actually determine what you need to request from your database. Maybe that's the difference here: the chain of thought gets to do that because it is all integrated, and maybe that's also the simplicity of that system.
Nehil Jain [00:24:08]: Yeah, exactly. I think then the cost might go up, and what they're trying to evaluate is, in one shot, can you do everything better? That's my understanding, but yeah, what you're saying makes sense. I don't feel like they're doing multiple RAG calls or doing something else. So moving on to SQL, which is one of my favorites, because as Korri said, we are mostly all data people. And I would love a system which would just make self-serve analytics possible. And for that we need the ability to convert hard business questions into SQL queries on top of your data, which is what this task is.
Nehil Jain [00:24:52]: And DAIL-SQL is the state-of-the-art solution there right now. And what DAIL-SQL does, as I understand it, is divide and conquer. Think of it like chain of thought. So it says this question has parts A, B, C, D, then it generates prompts for A, B, C, D, then creates SQL for A, B, C, D, and then combines the answer later on. So that's what DAIL-SQL is doing. And it seems to have the best performance on Spider, but still nowhere close to the eighties or nineties. I think it's... I read it somewhere, actually.
Nehil Jain [00:25:29]: Yeah. Oh, no. Yeah, it's 0.6. So about 65% of the time they're able to get it, but mostly these models are not so solid on SQL reasoning, and the long-context models still are not able to do it at all. That's my take from these numbers. So that's still not solved. We don't have self-serve analytics just yet.
Nehil Jain [00:25:56]: Even with LCLMs, there's a lot of room for improvement. Nothing really fancy coming out of this analysis, in my opinion. I mean, they show this chart which says it is better at doing equality comparisons as compared to averaging or sums, those aggregate functions. But that's probably just because you have to figure out how to group by and how to actually filter the things to get to the right average, etcetera. So that's what's happening on SQL: not this year, maybe next year. And then we have many-shot ICL, which is basically: how do you teach new tasks, which have never been shown before, to the LLM? How do you teach new tasks? And there the best performance was Claude 3 Opus.
Nehil Jain [00:26:44]: So they were able to show that this model does even better than their own in the long-context situation. And, as we would expect, the more few-shot examples you put in, the better it will be at learning the new task. That's kind of my understanding here. And they have a lot of different benchmarks for those types of tasks. It's Big-Bench Hard, I believe; that's the benchmark they're using. Then they have multiple variations in there. So that's that section.
Nehil Jain [00:27:17]: Before I move on to the ablation piece, any thoughts or discussions you have here?
Sonam Gupta [00:27:25]: In the chat, Keith mentioned something, I think around Table 4, about the ablation: they think those are only results from Gemini. So the development of those techniques only happened on Gemini. Is that the case?
Nehil Jain [00:27:40]: I'm not sure.
Sonam Gupta [00:27:42]: Yeah, I was trying to look in the paper too, like, are there any other ablation studies that they did with the other models? But it seems like it was only with Gemini.
Nehil Jain [00:27:51]: Yeah, it looks like it's only with Gemini.
Sonam Gupta [00:27:53]: With Gemini, yeah. Yeah, that's the right observation, Keith. And there was one more question in the chat earlier, just before you move on to the next part. The question was: does it mean that we no longer need to chunk the documents, with the whole long-context thing?
Nehil Jain [00:28:15]: Yep, for sure. So that's what it means. This multi-hop thing does say that you reduce the complexity of doing the overall RAG, which means you don't need the retrieval system, which means you don't need the chunking piece. Of course, your context, or your corpus, still needs to fit in the 1 million token limit, et cetera, et cetera. So you may need to chunk it, maybe differently, but yes, eventually it means that you may not need chunking. And 1 million is so large that for a lot of tasks you may not need to do it.
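(A small sanity-check sketch of that point: whether chunking is still needed mostly comes down to whether the corpus fits in the model's context window. tiktoken's cl100k_base encoding is used here only as a rough approximation, not the target model's own tokenizer, and the 1M figure is the limit discussed above.)

```python
# Rough check of whether a corpus would fit a long context window without chunking.
# tiktoken's cl100k_base encoding is only an approximation of any particular
# model's tokenizer; the 1,000,000-token default is the limit discussed above.
import tiktoken

def fits_in_context(texts, limit_tokens=1_000_000):
    enc = tiktoken.get_encoding("cl100k_base")
    total = sum(len(enc.encode(t)) for t in texts)
    return total, total <= limit_tokens

total, fits = fits_in_context(["<document 1 text>", "<document 2 text>"])
print(f"{total} tokens; fits without chunking: {fits}")
```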
Sonam Gupta [00:28:49]: Yeah, I think you can pass in like, the one whole paper, I would.
Nehil Jain [00:28:53]: Say way more than a paper. I think, uh, I was listening to somebody, they were like, you can pass in two years worth of email, for example, of a single individual. So, yeah, 1 million is a lot of tokens.
Sonam Gupta [00:29:06]: Yeah.
Nehil Jain [00:29:08]: So that would be amazing if, if just like, prompting is all we need and not all the auxiliary systems you have built to work with LLMs and.
Sonam Gupta [00:29:16]: Imagine the next paper coming out says prompting is all we need. Wow, that would be interesting.
Korri Jones [00:29:22]: But I thought it was. It's everything.
Sonam Gupta [00:29:25]: Yeah. Yeah.
Emre Sahin [00:29:30]: You said something about optimizing that performance: uploading that 1 million tokens every time when you don't need it, you have to encode it or something. So new problems will emerge, I think.
Nehil Jain [00:29:43]: Yeah. Okay, so let's go to the ablation piece, which is probably quick, and we've discussed a lot of this through the analysis. They're just doing it again. What ablation is: they're saying, can I remove or change one part of the setup and see what the effect of that one change is on performance? And they found a few things overall which were interesting. Some of them we already know. So they're saying chain-of-thought reasoning generally improves performance. Well, nothing very, very new about that.
Nehil Jain [00:30:18]: Query at the beginning is not the best; at the end it gives much better performance, which we discussed in the beginning. Then they're also talking about the instructions for the query, or the instructions for the task, which means the prompt engineering needs to be precise, and the better your instructions, the better your output will be. They also mention that few-shot examples enhance recall. Makes sense. A very interesting thing that I found, which Sonam was talking about in the beginning: if you put in your documents and give the model a way in the prompt to refer to them through an ID, sort of a numerical ID, you know, like 1, 2, 3, 4, 5, 6 for the passages, so there is some kind of chunking in that sense if I think about it now, then it does much better in chain of thought and reasoning, and in retrieval of those things, and in thinking about what it needs to do. So having IDs, and repeating the IDs in the output, does give better retrieval accuracy.
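(One way to see why echoed IDs are handy, sketched here for illustration only and not taken from the paper's evaluation code: if the model repeats document IDs in its answer, retrieval quality can be scored by comparing those IDs against the gold IDs for each query.)

```python
# Illustrative only (not the paper's evaluation code): if the model echoes
# document IDs in its answer, retrieval can be scored by comparing the echoed
# IDs against the gold IDs for the query.
import re

def id_recall(model_output, gold_ids):
    predicted = set(re.findall(r"ID[:\s]+(\d+)", model_output))
    gold = set(gold_ids)
    return len(predicted & gold) / len(gold) if gold else 1.0

print(id_recall("The answer uses ID: 0007 and ID: 0031.", ["0007", "0031"]))  # 1.0
print(id_recall("The answer uses ID: 0007.", ["0007", "0031"]))               # 0.5
```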
Nehil Jain [00:31:25]: Those are some things overall that they found. I didn't find a completely new, revolutionary thing that they talk about here in the end. Chain of thought, more examples: we know all of this already. I guess the ID thing was interesting, but that's very specific to LCLMs, which we cannot take to other techniques just yet.
John Savage [00:31:47]: I thought it was interesting how they said that in the few-shot examples, when they used realistic ones, like real ones versus kind of fake ones, it was able to do a lot better. That was something I hadn't seen before, which, again, makes sense.
Nehil Jain [00:32:06]: Cool. And then the question of, well, is this only Gemini-specific, or is this only what Google thinks is the thing and we have to take it with a grain of salt? So that was the question we need to understand, and I think Korri will enlighten us around that piece. Is it repeatable?
Korri Jones [00:32:25]: I don't know. Is it repeatable? I think everybody is using Gemini for all of their prompting and all of their work right now, and so it should be second nature. But first off, I just want to just take a moment and a pause. I know we've been, there's a lot of information, a lot of things. I see the chats going. I'm loving the dialogue as people are having there like just a different perspective, folks questioning things and just sharing some really good insights. And so I'll press pause for just a moment before I start talking to just listen. Is there anything that anyone just has to get off your chest? Something you're like, hey, I just really, this portion of the paper was weak.
Korri Jones [00:33:01]: I know, Samantha, you were sharing some things right there in the chat. Anybody that has anything that you really want to get off your chest before we kind of talk about, overall, is this even repeatable? Is this something that we feel comfortable with? Is this something you want to take if you're building your own research; do you want to use this as a reference in your research? So I want to press pause and I want to listen to the community. There's a lot of people here.
Emre Sahin [00:33:21]: My personal experience is that Gemini is on par with Claude 3, but Claude 3.5 is better these days. This is my personal experience. I'm using it for contracts, legal ops, extracting data, for these kinds of retrieval tasks, and I'm testing each new one when it comes up. And I think these results I take with a grain of salt.
Korri Jones [00:33:57]: Thank you for sharing your experience. And I agree.
Matt Squire [00:34:01]: I mean, I guess for me, I'm thinking it's interesting research, it's interesting to know what the state of that research is, but I'm not going to start telling my customers, oh, we don't need RAG anymore, let's rip it all out; it hasn't changed my mind. So it's more like, I guess I'd want to see a technical demonstration of a long-context language model in a real-world scenario. And I'm sure that will kind of come over the years. So it's nice to be able to keep an eye on what's on the horizon, but at the same time I'm like, well, that's somewhere in the future, maybe.
Sonam Gupta [00:34:42]: Yeah, I second Matt. I think RAG, we still need it, and I think Samantha in the chat also agrees, and I think there are more and more advanced RAG techniques that are coming out. I'm recently exploring GraphRAG and all the benefits of GraphRAG. Then there's agentic RAG. So I think RAG is going to stay. It's like, yeah, as soon as some research comes out, there's a controversy associated with it. But I think, yeah, as Emre mentioned, we need to take it with a grain of salt. That's the best way to go.
Matt Squire [00:35:13]: You've reminded me, it's a very good point that there's a lot of depth to RAG beyond the kind of so-called naive RAG. And one of those things that we're looking at a lot right now is different patterns of retrieval. We're looking at hybrid search, and we're looking at re-ranking, things like that. It feels like the retrieval gets subsumed into the model's behavior in this mode of work, if I've understood it correctly. So I guess either we don't need those advanced retrieval techniques, or if we do need them, there's no way we can do them.
Nehil Jain [00:35:46]: From what I've understood, I think that really makes sense. But there was a sentence you said in the middle which is, do we need them? That's the thing. We don't know. Maybe instead of hybrid search and advanced retrieval on top, maybe you can just do it with long context; you don't need the keyword search and all of those. Which brings me to one of the things I look at in these papers that I read nowadays: how did they evaluate, and can I just use the evaluation technique to check my own work? Because it's very hard to know the performance apart from vibe checks. And so I come across these datasets or these techniques of how they're evaluating a given task and adapt that to the thing I'm building. I think that's another reason to read papers, not just for the insights of, oh, this model is better at this XYZ thing, but how do you evaluate?
Korri Jones [00:36:38]: I think you touched on it: can I trust this? Just a little bit, can I trust this? Yeah, I think that's really kind of the big question we always have to ask ourselves as we begin to read. The concept is fascinating, right? It tickles the brain in a different way. You start looking at things just a little bit differently, but then you go back and say, can I really trust this as my guiding compass? And I think it kind of goes back even to, and I'll just say this right now.
Korri Jones [00:37:08]: When we look at research papers, right? Like, ultimately, when I look at research, research ultimately is a systematic process that aims to generate new knowledge or deepen understanding of a topic or an issue, right? It's to deepen or expand new knowledge. And so I posed a question. As you read through the paper, we look at some of the numbers, we talk about some of the values, like, hey, there's a really wide range, can I repeat this? Like, you know, good research is reproducible, right? It's defendable, and it also broadens and provides you knowledge. I think this one brought new concepts. In some ways, it's really, really good. But when you look at it, does anybody feel that? And I don't know if anybody here works for Google or works on DeepMind, anything like that. We love you, not saying anything bad about it.
Korri Jones [00:37:53]: I want to make sure I qualify that. But I think when you look at this, do you feel that this is telling the true story of what it's really talking about, or is it giving you just a tailored view of, like, an ideal world, right? Gemini is going to do all of these things and all that other stuff. And how do you feel about that when you look at it? We call it a research paper, but do you truly feel like it's defensible research that broadens or deepens our knowledge? So I'll post that question out there for the group. I see a lot of head nods, I see folks smiling. So I posed that question. What do you got?
Sonam Gupta [00:38:34]: In my opinion? I feel like this is a good direction. But, you know, as mentioned earlier, take it with a grain of salt. It gets the wheels churning in your brain, just to think about it from that perspective. But seeing the claim that Gemini is the world's best model, I'm going to pause and think it over before deciding that, okay, this is the best one. So I think it's in a good direction, is what I would say. And there's still more research out there. I'm sure there are other problems that are associated with these kinds of models, or even with the benchmark evaluation, but I guess it gets people started, is what my opinion is.
Emre Sahin [00:39:17]: There is also this: we are still in the process of finding datasets, standards, measures, and we don't actually know how to measure these. We're just learning. It's not like what we had with the MNIST dataset, the handwritten digits dataset that everyone was measuring their computer vision systems with. There's no such standard dataset yet, no standard problems, no standard prompts or whatever. So there's a lot of room to improve this aspect as well, I think. Yes, we need to measure, but we are learning how to measure these LLMs as the field progresses.
Korri Jones [00:40:13]: For the record, I normally do like 11 seconds where we're quiet, just at that moment, just again, there's so much brain power here, and just know we appreciate everybody who's here. I just want to at least have that door open for other perspectives. So I'll give 11 seconds. Anything else from anyone else?
Keith Trnka [00:40:33]: To your question about, you know, is it defensible research: I would say each different conclusion has a different level of defensibility, in my view. Like, the benchmark that they produced feels defensible to me; that feels useful. I trust it as a benchmark. Their ablation study and how they designed that part feels quite biased. Their conclusions about Gemini feel quite biased. So I would take those with a grain of salt. So I don't think you look at a paper and say the whole thing is defensible, but there are some parts that I feel much more confident about. Or even the conclusion that a long-context language model can approach a specialized approach on some of these tasks: that feels like a reliable conclusion, whether it's Gemini or not.
Keith Trnka [00:41:26]: I'm not going to bet money on that, but, yeah, so there is value in there. It's not like everything is equally biased. Some parts are quite biased and some parts are less.
John Savage [00:41:38]: So I think it's interesting to think of it in the context of, I guess, the development lifecycle. And I can't remember who brought it up, but they talked about the idea of LLMs as, like, version one of your product, and that probably isn't what you can actually deploy, because it's too expensive or it's too slow; it's just a proving ground of, like, okay, this is good for product-market fit. Then take that and distill it, or make a, you know, classic ML model out of it, or whatever it is, so that you hit your latency or quality targets once you've proven out that product-market fit. And I think this is super useful, again, in that v1 of, like, okay, let's just throw our entire corpus into the context. It's going to be too slow, it's going to be too expensive, but we get to find out, is there product-market fit in whatever this thing is we're building? And then once we've got to the point of, okay, there is product-market fit, then, okay, let's build a RAG system or a fine-tuned system or whatever it may be. But I think, yeah, the ability to do these super complex things without having to build out a search system or whatever it is to do RAG seems super useful, looking at it from, like...
Korri Jones [00:42:58]: A product design point of view.
Nehil Jain [00:43:01]: Yeah, I think totally. I think the complexity reduces a lot and you can run experiments quickly. And then, as you said, I'm a believer that the cost of intelligence, or the cost of each call, is going to go to almost zero at some point. And at that point, you don't even have to rewrite these things; you just keep using the models and so on, so forth.
Emre Sahin [00:43:22]: One thing we can add, maybe, is that there will be improvements in the RAG space as well, not only in this LLM space. RAG has this problem with vector databases: their retrieval recall is actually quite low. We don't notice that while developing, but you may not get back the vectors you store in the database, in Pinecone or any other vector database, and that affects the performance considerably. So if someone comes along and improves this part of RAG, we can get very high retrieval performance from RAG systems, probably higher than this. But yeah, it's an open question.
Korri Jones [00:44:10]: Now, I love all of the exchange. I love looking at these names of people that have been contributing so much to the entire community, and just the great dialogue we've had as we started to dive into some of the technicalities of this paper. I also want to acknowledge, and I think it's always important to acknowledge: is there anybody on this call where you heard something and you were like, I have no idea what this is, I'm Google searching it right now? Has anybody done that? I mean, it's okay if you want to throw a hand up or something. I think there are a couple of folks. I see a hand or two.
Korri Jones [00:44:48]: I think that's really the heartbeat of what we get to do. Just the opportunity to hear something and say, oh man, the missing middle, like, yeah, the middle, I remember hearing about that paper, let me add that to the list. Hey, let me add this. And so I think there's just some beauty that's here, in being able to collect and come together as a greater MLOps community and just have these discussions. So I want to take a moment there and acknowledge the folks who don't have as many reps in this space, who are still learning and are here to learn.
Korri Jones [00:45:22]: And so I just want to say we celebrate you. We also want to celebrate those that have the experience and exposure, that have dove really deep into a lot of the different things. And some of this is secondhand and you're learning from it. So we want to say thank you for being pioneers, and thank those of you that are still learning and trying to ramp up, like, okay, I just heard about LLMs last week and I saw this and I'm here. We don't know who you are, but we want to say thank you. I just really think it's important to acknowledge all of the different spectrums, because hopefully our vision and our thoughts here, and, you know, Binoy, I see Demetrios is here as well, hopefully.
Korri Jones [00:45:58]: One of the things is you can really say that this is a safe place to come in and ask your questions, even if you're like, I feel like this is a dumb question, but I feel safe. Can I ask it? And so hopefully, as we continue to do more with the reading group, this becomes that place. There's no dumb question. There's no crazy question. There's just questions that no one else may have thought of that you're going to help us all get that much better. So just a moment to say thank you. A moment to acknowledge that this is a space to learn to grow and have a little bit of fun. So that being said, I'm going to be quiet.
Korri Jones [00:46:35]: I'm going to nip this. Hopefully that brought some joy as we begin to wrap this up. We've got about four minutes left. Somebody has a meeting in four minutes. I don't know who it is, but someone's over there doing multiple things, preparing for that next meeting. Other folks are running for lunch and someone's about to go to sleep. So we just want to acknowledge all the time zones and that. And so I do have one big question, and again, if y'all think I'm going off on a tangent here, fist to five.
Korri Jones [00:47:06]: I love a good metric. Five: I really enjoyed this one. A fist: I should have been asleep. Anything else in the middle, whatever you think. But I want to see: how do you feel about your time today, for this hour, for us talking? You could drop it in the chat, you can throw your hand up. I just think it's good to get that. I feel five.
Korri Jones [00:47:28]: I'm learning and I'm over googling all these papers.
Nehil Jain [00:47:31]: Okay.
Korri Jones [00:47:32]: Yes. Love this. I love this. Like 500,000. Uh, hey, can we get that in dollars? Wait, what's the world currency that's worth the most? Whatever's the most valuable currency right now, give it to me in that. But, awesome.
Korri Jones [00:47:52]: I love that. Lots of fives. All fives worth getting out of bed for. So, um, we have some more opportunities to connect and chat and other things. And we'll continue to have different papers and discussions, but know that this is our time, all of us. So thank you. And I didn't know if any other host has something to say. I just felt compelled to say those things.
Sonam Gupta [00:48:13]: I just want to drop a question in the chat for people to think about this paper later during the day. But otherwise, I definitely enjoyed my first reading group session ever. And yeah, the chat and the dialogue was amazing. Good luck to everyone and have a good day. Evening, night, whatever.
Binoy Pirera [00:48:32]: You guys are great. So I just want to mention that I just put the link to our Slack channel. If you guys want to ask more questions or suggest any papers that we should cover in the upcoming sessions, you can do that by joining our Slack channel; you can find the link there. And if anybody wants to host these sessions or wants to recommend some papers that maybe we should cover in the upcoming sessions, please feel free to reach out to us anytime you guys want. I also want to give a quick shout-out to Nehil, Sonam, and Korri. Wow, fantastic job, guys. I think this was a very productive session.
Binoy Pirera [00:49:06]: So thank you so much for all of that. And thank you so much for everybody else for joining as well. And maybe we can do the same thing next month. We'll keep you all in the loop. So if you have any suggestions, please feel free to send them to us.