Context Rot: How Increasing Input Tokens Impacts LLM Performance (MLOps Community Reading Group)
SPEAKERS

Kelly Hong is a researcher at Chroma, where she explores open questions in retrieval. She studied computer science at UC Berkeley before deciding to take a break from school to go all in on working in this space. Her recent work includes projects like generative benchmarking, driven by the motivation to help developers systematically evaluate and improve their AI systems.

I'm a tech entrepreneur and I spent the last decade founding companies that drive societal change.
I am now building Deep Matter, a startup still in stealth mode...
I was most recently building Telepath, the world's most developer-friendly machine learning platform. Throughout my previous projects, I had learned that building machine learning powered applications is hard - especially hard when you don't have a background in data science. I believe that this is choking innovation, especially in industries that can't support large data teams.
For example, I previously co-founded Call Time AI, where we used Artificial Intelligence to assemble and study the largest database of political contributions. The company powered progressive campaigns from school board to the Presidency. As of October 2020, we helped Democrats raise tens of millions of dollars. In April of 2021, we sold Call Time to Political Data Inc. Our success, in large part, is due to our ability to productionize machine learning.
I believe that knowledge is unbounded, and that everything that is not forbidden by laws of nature is achievable, given the right knowledge. This holds immense promise for the future of intelligence and therefore for the future of well-being. I believe that the process of mining knowledge should be done honestly and responsibly, and that wielding it should be done with care. I co-founded Telepath to give more tools to more people to access more knowledge.
I'm fascinated by the relationship between technology, science and history. I graduated from UC Berkeley with degrees in Astrophysics and Classics and have published several papers on those topics. I was previously a researcher at the Getty Villa where I wrote about Ancient Greek math and at the Weizmann Institute, where I researched supernovae.
I currently live in New York City. I enjoy advising startups, thinking about how they can make for an excellent vehicle for addressing the Israeli-Palestinian conflict, and hearing from random folks who stumble on my LinkedIn profile. Reach out, friend!

Matt is CTO and co-founder at Fuzzy Labs, a consultancy dedicated to using MLOps to help technical teams get the most out of AI and ML. He enjoys AI, bio-inspired computing, and functional programming.

Bauke Brenninkmeijer is an experienced AI engineer with a background spanning start-ups and large corporations. Having built ML and AI projects across multiple industries, Bauke brings a unique perspective on the challenges of execution in environments that demand diverse skillsets — and on whether those skills are best centralized or distributed.
With a focus on scalable solutions that deliver measurable business impact, Bauke balances the roles of builder and researcher. After gaining firsthand insight into the structured but complex world of corporate AI, Bauke has returned to the startup space — where one person often wears every hat, and the pace is simple yet chaotic.
Currently, Bauke is building LLM applications and tackling the growing complexity of frameworks and providers. Their interests include MCP, multi-agent orchestration, and the evolving frontier of LLM development.

Arthur Coleman is the CEO at Online Matters. Additionally, Arthur Coleman has had 3 past jobs including VP Product and Analytics at 4INFO.
SUMMARY
When Bigger Isn’t Always Better: How Context Length Can Break Your LLM
Longer context windows are the new bragging rights in LLMs — now stretching into the millions of tokens. But can models really handle the first and the 10,000th token equally well?
TRANSCRIPT
Arthur Coleman [00:00:00]: And let's hit record. Yeah, yeah. And Binoy, if you can let people in while I'm chatting.
Binoy Pirera [00:00:08]: Yeah, I can do that, but are we being recorded? Let me see.
Arthur Coleman [00:00:12]: Oh, and I got to put the. Yeah, we're good to go.
Bauke Brenninkmeijer [00:00:15]: All right.
Arthur Coleman [00:00:16]: Okay, so everyone, before I get started, I just put the link to the question and answer doc into the chat. You should click through to that, because the way we work is you put your questions in there and it works like a stack: we start with the first question and we go to the last. So we'll use that doc for purposes of getting your questions in, so we can answer them after the presentation part, when we go into the discussion section. Okay. Good morning, everybody.
Arthur Coleman [00:00:47]: I'm in a public place. I'm going to try to talk quietly. There's not too many people around, so I don't think I'll be in too much trouble. Welcome to our discussion today on context rot: how increasing input tokens impacts LLM performance. This is a fascinating topic, may I say, because I am a wordy prompt engineer, like all good data scientists, and I write very long prompts, so this was very relevant to me. Our speakers today are Kelly Hong. She's the author of the paper that we're going to review. She's a member of the technical staff at Chroma, which, I have to clarify, is not the embroidery AI company.
Arthur Coleman [00:01:23]: It is the database AI company. We have Adam Becker, who is the founder of Headon; I think all of you know Adam pretty well. I've got to say this right: Bauke Brenninkmeijer, who is an AI engineer at work AI. And of course our estimable Matt Squire, who's the CTO of Fuzzy Labs and who publishes all kinds of interesting newsletters on various AI topics. Now, we all love talking to you, so below you'll see how you can connect with us on LinkedIn. And do me a favor, because I get a lot of fake people coming: when you do, leave a note that says you are from MLOps.
Arthur Coleman [00:02:02]: So I know that it's you. I get about 10 new people after every one of these sessions, so please let me know it's you and I'll make sure to connect with you. The agenda today: we'll start with the introduction and related works; Kelly is going to cover that. Then we're going to talk about needle-question similarity and the impact of distractors; that's going to be Adam, and distractors are an interesting topic. Matt's then going to talk about the haystack in which the needle occurs. Then Bauke is going to talk about LongMemEval and the related work that goes with it, and we'll be doing some survey questions in between the speakers.
Arthur Coleman [00:02:37]: Although today I've kept that to a minimum. Guiding principles: for those of you who've been here before, this is old news, but let me just repeat it again. These are your sessions. This is not for us to present; this is for you to learn. Sessions are intended to be interactive: we'll present for 35 minutes and then we have great discussions. In past sessions we have had fabulous discussions. In fact, in many ways the discussions gave us more insight than the presentations did.
Arthur Coleman [00:03:04]: The more you participate, as a result, the better the outcome for all of us is going to be. And let me say, because I'm the guy who always asks the dumb question (ask the people on this call who join our organizing meetings whether I ask the dumb questions or not): this is a no-judgment zone, which means any question is fair game. Do not be afraid. While we're here, no one is going to judge anyone else. There are no dumb questions, and as I said earlier, the questions go into the Q&A document and will be answered in the order received.
Arthur Coleman [00:03:35]: Lastly, we have a post-event survey. We want to do better every time, so please fill out the post-event survey that you'll get an email about. And with that I will turn it over to Kelly. Let me stop sharing. Kelly, you are up.
Samantha Q [00:03:50]: Thank you.
Kelly Hong [00:03:51]: Also, do we have some sort of shared document to screen share from, or should I share on my screen?
Arthur Coleman [00:04:00]: I don't. Do you mean the doc for questions, or...
Kelly Hong [00:04:04]: Oh no, no, just like the technical report. Like I was wondering if we're going to share that on the screen or I can do that on my own too.
Arthur Coleman [00:04:11]: I think you should probably do that. Adam. Agreed.
Adam Becker [00:04:14]: Yeah, either that or I'm going to share the Miro board later. So I'm happy to do that now if you want to drive it. But if you have it easily accessible, it's probably fastest for you to do it; otherwise I will.
Kelly Hong [00:04:27]: Yeah, yeah, sounds good. Let me just share my screen. I just have one visual to reference, so let's see. Facing some trouble. Wait, I'll just present and then we can do the Miro board later. But yes. So I'm Kelly, I'm a researcher at Chroma. As Arthur mentioned, we're a vector database company, so we work on retrieval for AI. But we also work on a lot of research on the side as well.
Kelly Hong [00:05:05]: So Context Rot was one of the recent technical reports we published, and it's basically on how increasing input length degrades LLM performance. So I'll start with the motivation for this research. A lot of labs have been coming out with larger and larger context windows. You've probably seen Gemini, GPT-4.1, and most recently Sonnet 4; they all have 1 million token context windows. And this is typically presented in a way that makes it seem like these models maintain performance across their entire context window. So, regardless of whether you have a task at 1,000 tokens versus 1 million tokens, it's kind of assumed that they'll have good performance regardless of how long your input is.
Kelly Hong [00:05:50]: But there's a few problems with how this is presented, and I'll go into one of the reasons why. They commonly use this task called needle in a haystack, which we'll go very deep into, I think, throughout this reading group. But it's essentially a very simple retrieval task where you have a long document, say Paul Graham essays or research papers, and they all add up to a long token length.
Kelly Hong [00:06:17]: So they could be up to 1 million tokens, and we call that the haystack. And within that you have a needle-question pair. So you would have some fact within the middle of that very long document, and then you basically prompt the model to retrieve the fact. This is essentially a very simple retrieval task, and it's a good scalable test because it's easy to adjust the length of your inputs. But the problem is that this test is also very, very easy for LLMs, because, first of all, it's a simple retrieval task.
Kelly Hong [00:06:48]: You're only telling the model to pay attention to that one fact within the entire document. You don't really get insight into how it's processing the rest of the content. And the needle-question pairs are also perfect lexical matches a lot of the time, meaning that they contain a lot of word overlap. The model might not be semantically understanding the needle and question every time; it might just be doing word matching. So this doesn't really compare to how people use long context in practice. For example, take a coding agent: you can have hundreds of thousands of tokens of code files, tools, and previous conversation history, and you prompt the model to do some coding task. That kind of task is a lot more complicated than just a simple needle-in-a-haystack retrieval task.
Kelly Hong [00:07:38]: So it's kind of hard to generalize: how does this performance on needle in a haystack generalize to more complicated, realistic scenarios? So the conclusion we had was that these current benchmarks aren't very comprehensive, and thus they can be pretty misleading, leading people to think that models maintain performance regardless of context length. And I know this isn't entirely a new concept. I'm sure a lot of people have experienced context rot in practice; I've experienced it myself in my daily use. Basically, anytime you have a long conversation, whether that's just back and forth with ChatGPT or you're using some kind of coding agent like Claude Code, you can start to notice that you get lower quality outputs as the conversation progresses. So this is something that I think a lot of people have resonated with, but it wasn't something that had been thoroughly tested yet. I think the most that people have done is needle in a haystack and some other benchmarks too; there's been MRCR and Graphwalks as well, but these models haven't been very thoroughly tested and quantified for how they perform across their entire context window.
Kelly Hong [00:08:50]: And that's essentially what this research does and what we'll be going into today.
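For readers who want to picture the setup Kelly describes, here is a minimal sketch of a needle-in-a-haystack prompt builder. The corpus file, needle, question, and prompt wording are placeholders, not the exact format used in Chroma's report.

```python
# Minimal sketch of a needle-in-a-haystack prompt, as described above.
# Assumptions: the essay corpus, needle text, and prompt wording are
# placeholders, not the exact setup used in Chroma's technical report.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def build_niah_prompt(essays: str, needle: str, question: str,
                      target_tokens: int = 10_000, depth: float = 0.5) -> str:
    """Trim the haystack to ~target_tokens and insert the needle at `depth` (0-1)."""
    haystack_tokens = enc.encode(essays)[:target_tokens]
    split = int(len(haystack_tokens) * depth)
    before = enc.decode(haystack_tokens[:split])
    after = enc.decode(haystack_tokens[split:])
    context = f"{before}\n{needle}\n{after}"
    return (f"{context}\n\nAnswer the question using only the text above.\n"
            f"Question: {question}")

# Hypothetical usage (the essays file is illustrative):
prompt = build_niah_prompt(
    essays=open("paul_graham_essays.txt").read(),
    needle="The best writing advice I got from my college classmate was to write every week.",
    question="What was the best writing advice I got from my college classmate?",
    target_tokens=10_000,
)
```

Scaling the test is then just a matter of changing `target_tokens` and `depth`, which is exactly why the task is so easy to control.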
Arthur Coleman [00:08:58]: Hey, Adam, I think you're up.
Adam Becker [00:09:01]: Awesome. Okay, so, Kelly, thanks for the introduction. And also, first, thanks for being here with us and telling us about your research. It's interesting because we were maybe, I guess, also potentially guilty of some of the hype, because I remember, it must have been a few months ago, we were talking about a paper, maybe it was when Gemini Pro came out or something like that, and everyone was just railing and excited about the extra context window. And just having a sobering analysis of what the actual impact of this is, and of what effective length we can actually use, I'm not sure that was done back then. And so I think you've taken it to task. So it's very fun and energizing to see this work being done. So, okay, I'm gonna talk about a few things.
Adam Becker [00:09:55]: Let me share the Miro board. You guys know that I like to draw these things. If people want access to it, I'm happy to drop it in the chat, so go for it here. Okay, if anybody wants to follow along. So the way we're gonna do it is we have a bunch of different sections for each of the experiments that they ran. And I'm just gonna spend two minutes on the needle-question similarity, and then we'll open it up for questions.
Adam Becker [00:10:29]: Right, for Kelly, for a few minutes, and then we'll go on to the next section, which is the impact of distractors, and then come back to questions. I think that's the structure that I'm following. But, Arthur, if I'm off, then...
Arthur Coleman [00:10:44]: My bad. Adam, you're right. We decided to do it this way. This time I was following the old. So thank you for correcting me. That is correct. So use the. Use the word doc.
Arthur Coleman [00:10:53]: And we'll answer the questions in order after each section. Okay, cool.
Adam Becker [00:10:58]: So the first one is needle-question similarity, which I think is a fascinating topic even outside the context of large context windows; I think even just a needle question is very interesting. So, the classic needle in a haystack. That's the thing Kelly had already gone over. The idea here is: let's say we have a whole haystack of information that's not necessarily very pertinent to the answer. For example: "I've discovered a handy test for figuring out what you're addicted to. Imagine you were going to spend the weekend at a friend's."
Adam Becker [00:11:30]: Okay, it's something distinct, and all of a sudden, the needle, the right answer, is: "The best writing advice I got from my college classmate was to write every week." Okay, so this is the needle, and this is what we're trying to find. The question is: "What was the best writing advice I got from my college classmate?" So this would just be a typical, classic needle-in-a-haystack setup.
Bauke Brenninkmeijer [00:11:49]: Right?
Adam Becker [00:11:50]: We see the needle. It's buried in the haystack. We have a question. Let's identify it. Let's see if the LLM is able to pick it up. Okay, so now let's figure out what do we make of the relationship between how similar the needle is to the question? Because maybe it's very easy when they're very similar. It says classmate. It says classmate.
Adam Becker [00:12:10]: It says best writing advice. It says best writing. Okay, they're pretty similar. But maybe even if they're very similar, once you increase the context window it becomes very difficult. Or maybe if you shorten the context window, even if they're not that similar, maybe it's easier.
Adam Becker [00:12:25]: So let's actually see what the impact is of the similarity between the needle and the question. Okay, the goal is to vary the similarity between the needle and the question and see whether the performance of the LLM degrades with increasing context length. That way we can build intuition about how increasing length adds to the difficulty. So they go through a process here of how they actually went about it. I went over this a few times and realized that maybe the best way to think about it is not to think about it at all for a minute, and just ask: if you were to try to replicate this kind of experiment, what would you do? What actually matters here? How would you run an experiment like this? Once you build up that intuition, I think the flow makes sense. So I think the requirements need to be that the needles should naturally blend in, so that they're not the odd sentences out.
Adam Becker [00:13:19]: Right. So let's say we have the haystack here being, let's say, Paul Graham's essays. And now all of a sudden you put in a needle that talks about cooking. Okay, it might be very easy to spot because it's pretty different from everything else, and so it's easy to pick it up. So we shouldn't make it too easy; let's blend it in so it looks natural.
Adam Becker [00:13:39]: Next is we need to control how close the needle is to the question. So maybe instead of cooking, you talk about food, maybe you talk about nutrition, maybe you talk about diet. How close do you want those to be? And then we need to make sure that there aren't any other answers in the haystack already. Arthur, I might be running out of time; let me know if I am. But I'm getting close to the punchline. So we need to make sure that if, for example, the right answer has something to do with cooking, there aren't a bunch of other right answers there where the LLM could just guess, even at random, because there's a bunch of needles, or a bunch of hay that looks like needles. So let's make sure there's no needle-looking hay in the haystack. If those are the requirements, what you want to do is actually look through the haystack, try to pick up the right topics, understand what the topics are there, and then come up with a question, invent a bunch of answers, embed those answers in, and then run the test.
Adam Becker [00:14:43]: So this is what they do. They take all of, let's say, Paul Graham's essays and chunk them into one to three sentences. They take each of these chunks and embed them. Now we have, I think, something like 3,000 dimensions, and you can't really do clustering all that effectively there, so you reduce it down to 50 dimensions. Then you find the clusters, then you get 20 representative chunks from each of those clusters.
Adam Becker [00:15:06]: And now let's manually review those, so you get a sense of what they're actually talking about and what the style is. Then you find meaningful topics, and let's say the topic is writing advice: okay, Paul Graham's essays tend to talk about writing advice. Fine. So now let's invent the question. The question invented was: what was the best writing advice I got from my college classmate? Let's invent an answer. We could invent one answer and make sure that answer isn't already in the haystack, or we could do it a bunch of times: let's come up with eight different answers.
Adam Becker [00:15:40]: So eight different needles. Each of those is a little bit further away from the question. How do we know it's further away? The cosine similarity is a little bit lower. Now we take each of those and bury them in the haystack, and we see whether the LLM is able to pick them up even if we have a very large haystack. So that's basically the experiment. What are the results? Well, first they have, I think, 18 different models that they try this out on. And now they're mapping the results.
Adam Becker [00:16:10]: So at the top here, you see, this is for the models that take over 500k tokens. And you can see that the blue line at the top is the high similarity, that's when the question and the needle are quite similar, and the red one is when they're less similar. You can see that as you increase the input length from 1,000 to 10,000, by the time you get to 100,000 tokens even the highly similar needle and question start to degrade. And it certainly degrades if they're not all that similar; if they're not all that similar, it's a very fast drop. It's even worse for medium-performing models.
Adam Becker [00:16:48]: Right. So here we just saw the high-performing ones; you can get a sense of which ones are the high-performing ones and which are the medium-performing ones. You can see that even for the medium-performing models, even if the needle and question are highly similar, performance is going to start to drop. So I don't think there's anything too surprising here. But for low-performing models, if you just keep giving them a quite dissimilar question-needle pair, it starts out bad and it gets much worse.
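A rough sketch of the topic-discovery pipeline Adam walks through (chunk, embed, reduce to around 50 dimensions, cluster, pull representative chunks for manual review). The embedding model, UMAP, and KMeans below are common stand-ins, not necessarily the components or settings used in the report.

```python
# Sketch of the topic-discovery step described above: embed 1-3 sentence
# chunks of the haystack, reduce dimensionality, cluster, and pull the chunks
# closest to each cluster centroid for manual review. All component choices
# here are assumptions, not the report's exact pipeline.
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
import umap


def representative_chunks(chunks, n_dims=50, n_clusters=10, n_reps=20):
    """Return n_reps representative chunks per cluster for manual review."""
    model = SentenceTransformer("all-MiniLM-L6-v2")       # stand-in embedding model
    embeddings = model.encode(chunks)                      # (n_chunks, embed_dim)
    reduced = umap.UMAP(n_components=n_dims).fit_transform(embeddings)
    km = KMeans(n_clusters=n_clusters, random_state=0).fit(reduced)
    reps = {}
    for label in range(n_clusters):
        idx = np.where(km.labels_ == label)[0]
        dists = np.linalg.norm(reduced[idx] - km.cluster_centers_[label], axis=1)
        reps[label] = [chunks[i] for i in idx[np.argsort(dists)[:n_reps]]]
    return reps

# Hypothetical usage: reps = representative_chunks(essay_chunks), then read the
# representatives by hand to pick a topic (e.g. "writing advice") and write the
# question plus eight candidate needles at decreasing similarity.
```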
Arthur Coleman [00:17:18]: Okay, let's stop there. Questions. Let's go to the question board. And I'm on my own machine here; I have to pop between the view of you guys and the question sheet, so pardon me if I'm not looking at you right now. The first question was from Tudor. Tudor, you want to ask your question? Tudor? You there?
Binoy Pirera [00:17:50]: I think he might be at work, so could you just read it, Arthur?
Arthur Coleman [00:17:56]: Okay. Do you have plans to expand the current research? I guess that's for Kelly.
Kelly Hong [00:18:03]: Yes. I think he just sent a chat in the. Okay. Meeting chat as well. But I can answer it right now too. As of right now, no, we're not going to do like a technical report that directly builds on top of this. But yeah, so nothing like from Chroma directly, but we have been seeing a lot of people. I think like you mentioned here too, like you made like an interactive version which is really cool.
Kelly Hong [00:18:26]: I'd love to take a look at that if you share a link. But yeah, there have been people kind of building onto this research which has been really cool to see. But yeah, I think for Chroma ourselves we won't be doing something that directly builds on top of this.
Arthur Coleman [00:18:45]: Okay, I don't have names next to this next question. So, whoever asked it, could you jump in and ask your question? "How many real-world applications..." No? All right, I'll ask it.
Maine [00:19:04]: Yeah, that's mine. Yeah, I'm wondering: I'm seeing a lot of applications that are not in the form of needle in a haystack, for instance summarization, or analyzing a piece of data, or code generation, where the LLM needs to analyze the complete context that you provide to it. So how many real-world applications, I guess apart from RAG, are in the form of needle in a haystack?
Kelly Hong [00:19:36]: Yeah, that's a really good question. I'm glad you asked that. That was something I was thinking about a lot too as I was designing these experiments. Needle in the haystack specifically is not a very realistic task; it's a very controlled task. And we explicitly chose it because it's one where it's easy to control various variables. For example, you can increase input length just by appending more essays, we can control the position of the needle and question, and we can also control for things like similarity as well.
Kelly Hong [00:20:13]: So I think the main point here is that we were able to see non-uniform performance on this very easy task. Needle in a haystack is very easy relative to the tasks you mentioned, like agent tasks or summarization, where you have to attend to the entire context. For this, the model really only had to attend to that one sentence within the entire context, and it still started to fail at longer context lengths. So you can imagine that this kind of performance would translate over to more complicated use cases as well; maybe you would have even worse performance because the task is more complex. But those tasks were more difficult to experiment on, because if you have tasks that require attending to the entire context window, then you also have increasing difficulty as you increase context length. For example, if you have some agent task and you give it code files and a bunch of tools, a task that requires the model to attend to 1,000 tokens versus 100k tokens, the 100k token version is a lot more complicated for the model.
Kelly Hong [00:21:16]: So it's kind of hard to disambiguate whether model failures are due to just task difficulty or to context length. We wanted to separate out those two variables, which is why we didn't design experiments that are directly applicable to real use cases. But we're kind of doing the exact simple experiment showing that models fail on these very simple tasks; therefore you can imagine that they would also fail on more complicated tasks. I hope that makes sense and explains our motivation for how we designed these experiments.
Maine [00:21:50]: Yeah, that makes sense. Thank you so much for the answer.
Kelly Hong [00:21:53]: Yeah, thank you for asking.
Arthur Coleman [00:21:55]: Are there any other questions? There are none in the queue.
Bauke Brenninkmeijer [00:21:59]: I just added the question to the queue.
Kelly Hong [00:22:03]: Okay, let me see.
Bauke Brenninkmeijer [00:22:06]: So my question was that, looking at the results: Adam mentioned initially that if the needles are not blending in enough with the text, we expect them to stand out, which I'd relate to perhaps a higher chance of retrieval, so a better retrieval chance. But in these results we see the exact opposite, right? Where things that are not similar are harder to retrieve. So do you have any feeling of why that is the case? Because it seems a bit counterintuitive, perhaps.
Kelly Hong [00:22:47]: Could you clarify the counterintuitive part.
Bauke Brenninkmeijer [00:22:50]: So. So I think Adam kind of described the intuition in the beginning that if things are dissimilar from the sort of needle is dissimilar from the background that is blending in with it would stand out, which could perhaps lead to an easier. Yeah, easier retrieval for the model. Easy is maybe the wrong word here, but it stands out from the. The environment.
Kelly Hong [00:23:14]: Yeah. So for the needle-haystack similarity experiment, I mean, the needle-question similarity experiments, we did that across all needle and haystack pairs. So it could be a needle that's related to the writing advice, and it would be within a Paul Graham essay haystack. We also did that same needle on writing advice across the arXiv haystack as well. And we did a research question on the arXiv haystack and a research question on the Paul Graham essay haystack. So for needle-question similarity, we did it across all those combinations, and the only thing we were manipulating was the needle-question similarity. And we defined that as: a highly similar needle has very high lexical matching.
Kelly Hong [00:23:56]: So a highly similar needle-question pair has more word matches, whereas lower similarity pairs rely more on semantic similarity. So you could have a question like, oh, what was the best writing advice? A low similarity needle would be: oh, one time in my English class, my classmate told me, blah, blah. So it's a lot harder to match the needle to the question. And yes, we go more into needle-haystack similarity later on, but for this one specifically we are only changing the similarity of the needle and question.
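For reference, the needle-question similarity Kelly and Adam describe can be approximated as cosine similarity between embeddings, averaged over several embedding models. The two models named below are common stand-ins, not necessarily the ones used in the report.

```python
# Sketch of a needle-question similarity score: cosine similarity between the
# question embedding and each candidate needle embedding, averaged over a few
# embedding models. Model names here are illustrative stand-ins.
import numpy as np
from sentence_transformers import SentenceTransformer


def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def needle_question_similarity(question, needles,
                               model_names=("all-MiniLM-L6-v2", "all-mpnet-base-v2")):
    """Average cosine similarity between the question and each candidate needle."""
    scores = np.zeros(len(needles))
    for name in model_names:
        model = SentenceTransformer(name)
        q = model.encode(question)
        needle_vecs = model.encode(list(needles))      # one embedding per needle
        scores += np.array([cosine(q, v) for v in needle_vecs])
    return (scores / len(model_names)).tolist()


# Hypothetical usage:
# sims = needle_question_similarity(
#     "What was the best writing advice I got from my college classmate?",
#     ["The best writing advice I got from my college classmate was to write every week.",
#      "One time in my English class, a classmate told me to keep a weekly journal."])
```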
Arthur Coleman [00:24:30]: Okay.
Bauke Brenninkmeijer [00:24:30]: Yeah.
Arthur Coleman [00:24:30]: All right, we have to move on to Bauke, since you're up. But I will say, at the end I want to talk about ambiguity on this subject. What caught me about it, Kelly, was that we never defined it in the paper, and I will talk to you a little bit about what I discovered and whether it supports your approach or not. All right, so let's go on. Bauke, you're up.
Bauke Brenninkmeijer [00:24:54]: See, we can make this thing full screen.
Arthur Coleman [00:25:01]: Good questions, everybody, by the way. And also during questions, I would love it if people put their video on so we can actually see everybody. Like, it's a real human discussion. Yeah.
Binoy Pirera [00:25:13]: But I think a lot of people actually at work, and they're listening to us in the background, so thanks for that.
Arthur Coleman [00:25:20]: Yeah. All right, so whoever can. Whoever can. I understand the issue. It just be. I saw the tutor was at work, but I saw him he was on for a while.
Binoy Pirera [00:25:29]: For example, before Balka starts, let me just give a quick introduction. Barker is the chapter lead of Amsterdam chapter and he scaled up Amsterdam in person community quite a lot and he's an absolute legend for doing this. So, yeah, over to you.
Bauke Brenninkmeijer [00:25:45]: Bother.
Binoy Pirera [00:25:46]: Thanks for being here.
Adam Becker [00:25:56]: As he's pulling it up, I will also give a shout out. I went to hang out with them in the Amsterdam meetups a couple of months ago and it was a sick scene. So Baka, well done.
Arthur Coleman [00:26:11]: He's back. He's back.
Bauke Brenninkmeijer [00:26:13]: All right.
Arthur Coleman [00:26:14]: He hit the wrong button probably as he was trying to start the share.
Bauke Brenninkmeijer [00:26:19]: Not. Not the wrong button. No. I realized Zoom didn't have the permissions to share my screen. So now we're all good. So I could be rejoined as I was being introduced.
Arthur Coleman [00:26:29]: Yeah, you missed all the good words people said about you.
Bauke Brenninkmeijer [00:26:32]: I heard all the niceties. The beginning at least. Let's see. This should be the right one. All right. Okay. I hope you can see it now.
Matt Squire [00:26:45]: Yeah, yeah.
Bauke Brenninkmeijer [00:26:46]: Okay. So where did I put my things? Yeah, so I'm going to be talking about the Memeval experiment, which I thought was one of the. But you know, I don't want to pick favorites, but I think this was one of the more interesting ones, at least to me, relating it to real life usage, perhaps for myself. So the longmen evil is a benchmark. And essentially what we're going to do here is evaluate complex Q and A for long context with chat data. So the data here is, I think, actually coming from like a ChatGPT type system. And the benchmark has these different types of questions, for example knowledge updates, temporal reasoning and multi session that it does over these long context conversations. And so long context, I think it ranged from about 50,000 tokens to 100,000 tokens.
Bauke Brenninkmeijer [00:27:44]: But so really significant chunks of text in terms of the experiments or the retrieval tests were done both on the full input and then also on the focused input, which only contains the chunks that are relevant for the specific question being asked. Just to compare. Okay. How much is the impact of the additional distractors or the additional context that we're working with? Yeah, so quite a simple, neat experiment and I think we can just jump into the results immediately. So I did circle the results around a little bit because this one, I think this, yeah, looking at ChatGPT is a good start. And here we see quite interesting. So the focused chats are in orange here or red, and the blue describes the full context retrieval. And then we have all the different models on The Y or on the x axis and and their performance for each test.
Bauke Brenninkmeijer [00:28:49]: So we can see a clear decrease between the full context and the focused context. You know the more intelligent models like O3 they perform really well in both actually, so almost matching the focus context situation. But if we look at more older or smaller models we see there's a significant decrease in performance or 4.0 for example. It's almost like 30% overall holding up well. But you can see the impact of the full context here very clearly. As we move on to Gemini we see a fairly similar story. Smaller models drop more on the focused context. All of them work pretty well.
Bauke Brenninkmeijer [00:29:37]: And one trend we also see is that thinking generally does help a lot for the full context one. So here we go from 72 to 90. Here we went from yeah, low reasoning. This only is a very small increase but I think we'll see a similar story here. So for cloth, cloth is a bit of a different situation because Claude is very hesitant when it's unsure about the answer. So we see here with Claude Opus for thinking it's doing fairly okay in a full context. But with the non thinking variant it drops down to 38 which is very low and even below some of the non thinking models from earlier like 3.5 and 3.7. Yeah, what did I have here?
Matt Squire [00:30:24]: Yeah, yeah.
Bauke Brenninkmeijer [00:30:25]: When it. So it abstains when in doubt. I was really wondering a question to Kelly for afterwards. Is this a consequence of anti hallucination training or what could be a source for this? Because I, I do know that clause was really, you know they really want to have it work very well for coding and programming and those, those situations and there obviously that's a very important thing. So I thought they focused a bit on that. The research also focused on why or how thinking and non thinking impacts for the cloud model specifically because there is that big performance gap and we see that for temporal reasoning questions the we go from 27 on the full context up to 75. So a really large difference, a little bit less but still similar. For the multi session one we go from 36 to 61 and the knowledge update is a little bit less 62 to 86 but still very significant improvements not distributed equally over the different categories of questions.
Bauke Brenninkmeijer [00:31:41]: And then the last one that we have here is Gwen which does surprisingly well. I thought so the A, sorry the 235B obviously quite a large model, it's performing about as you would expect. And then the non reasoning model drops in both and it's doing surprisingly bad actually. In the non thinking focused context version here only 77. What surprised me here as well is how well the 32B is performing and it's almost matching the 235B on most of these benchmarks. Right? 68, 67, 93, 94 and the same for the non reasoning version of this. And 32B is like if you have a large MacBook you can run that and then the, the 8B is actually surprised me even more because it's, it's dropping almost no performance like you know, a couple percentage points here and there and 8B is, is runnable on your phone almost. So I was very impressed by the, the performance of this.
Bauke Brenninkmeijer [00:32:46]: Like I don't know how large flashlight from Google is for example. So I don't know if that's 8B or 32B but here at least we know the exact sizes and, and you can run them yourself locally. Yeah, I think that was the end of my Memeval section. I hope that was almost two minutes. Probably a bit longer.
Arthur Coleman [00:33:07]: You did great. Actually. Whoops. I was putting the. Trying to put the question link in. By the way guys, I have to apologize to Matt. I got the order wrong. Matt was supposed to go next.
Arthur Coleman [00:33:18]: I don't think it matters too much from the discussion because these are very unique and separate tests. But sorry about that Matt. That's what happens when I'm working from a single monitor. Yeah. So Arham had a question to Kelly. Do you test with text only input or did you try mixing the type of input to text and tabular data?
Kelly Hong [00:33:39]: We only do text only input.
Arthur Coleman [00:33:44]: Okay, that was fast. Next question from Samantha. Why are there no error bars? How many times were each of these run?
Kelly Hong [00:33:53]: Yeah, good question. We ran all these models on temperature zero. So they're all like deterministic outputs. So we just ran them one time. So yeah, if you set temperature to like zero, they're pretty much reproducible.
Samantha Q [00:34:09]: Disagree. Respectfully. My experience has been even with temperature zero you will see different results. So strongly, strongly recommend to anyone doing this kind of work always run things in triplicate at least.
Kelly Hong [00:34:21]: Yeah, okay, thank you for pointing that out. I did run it multiple times when I was just like testing it on my own too. And yeah, maybe adding error boards would be good to keep in mind. But I think for the most part, especially on this task too because for long min eval it's only. Yeah, I think this was. Yeah, basically like when I saw this I didn't see too much variation so I just included that one number and for this task specifically, when I was running it on like temperature zero, especially because like the task itself is like pretty simple too. It's just like question answering and you just have like a binary success. Like is it like correct or not? Like we saw very similar results.
Samantha Q [00:35:05]: Yeah, I think if it's all or nothing, it's a little bit easier. But we certainly saw even a temperature zero for some kinds of tasks.
Kelly Hong [00:35:13]: What kind of tasks did you run?
Samantha Q [00:35:16]: I mean some of these were similar sorts of things where we were trying to do retrieval but you know, doing like a query on a doc.
Arthur Coleman [00:35:27]: So.
Samantha Q [00:35:28]: Yeah, not, not that different from what you're doing. Sometimes we would just see it fail in and it, maybe it would be 10% of the time, it would be 1% of the time, but sometimes it would just completely fail. Even at temperature zero, Even with all the same inputs and same model, everything the same. So I wouldn't necessarily trust the temperature zero will always get you the same thing.
Bauke Brenninkmeijer [00:35:53]: Correct me if I'm wrong, but the numbers here are just the average accuracy or something. Right. So in terms of error bars, like this is one aggregation. So what would the error bar indicate here?
Kelly Hong [00:36:07]: I think if there's any multiple, multiple runs of this. Right?
Samantha Q [00:36:12]: Yeah.
Bauke Brenninkmeijer [00:36:13]: Yeah. Okay. Yeah, but yeah, it's an average. So like are we going to average three averages or five?
Samantha Q [00:36:21]: No, you always show the error bar. Even with any kind of average, you always show it. That's, that's the correct way to show an average.
Bauke Brenninkmeijer [00:36:32]: But, but this is binary, right? So it's either true or false. So.
Samantha Q [00:36:37]: No, but that's what I'm saying.
Kelly Hong [00:36:38]: It's.
Samantha Q [00:36:39]: If you have any outliers, your average is going to change. Right. If you have any time that it fails, but you're not going to know that unless you ran it enough times.
Bauke Brenninkmeijer [00:36:48]: What's an outlier in a binary output?
Samantha Q [00:36:51]: That, that you got a zero and normally you get a one.
Bauke Brenninkmeijer [00:36:56]: Yeah, okay, but I'm having like 300 zeros and ones.
Kelly Hong [00:37:02]: Right.
Samantha Q [00:37:02]: So then your average is not going to be 100% all the time. You're saying that the error bar should be the same for all of them. Is that, is that your point?
Bauke Brenninkmeijer [00:37:13]: Well, I think the, the number of experiments that are ran here are significant enough to not fluctuate a lot by running more of them. So we have, for each of these bars is, what is it like 300 data points. So if we, if we would have 900, I don't think the average would change a lot given that it might do like 1% point, but that's not a deviation. That is right.
Samantha Q [00:37:40]: Well, my point is you're making assumptions about that and unless we see the distributions, we're not going to know. Yeah, quantiles. I mean, there's other ways to show the distribution that could sometimes reveal there's more variation than you assume. And that's why you show what the distribution is and not just an average.
Adam Becker [00:38:02]: Okay.
Kelly Hong [00:38:03]: I mean, I think that would be good to make it more robust. So yes, I agree with you. I think that would be good for I guess having more like confidence on these results. I think for this specific specifically like we do have the confidence just given that like making our claim that models have non uniform performance across context length because we tested it across 18 different models and each bar is like an average of like 300 tasks. So yeah, that's like our main claim. But yes, I think adding error bars, that's like a good point to make it more robust.
Samantha Q [00:38:37]: Well, and also my other point there is that often when we've done these kinds of experiments and looked at the distributions, we find that it's very uneven. And so there are subsets of tasks where the differences are bigger and sometimes that can be really informative. So often you learn more by doing it that way.
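For context on the error-bar discussion, here is one way to put uncertainty on a binary pass/fail accuracy: repeat the runs and report a Wilson score interval. This is an illustration of Samantha's point, not something from the report.

```python
# Sketch: a 95% Wilson score interval for a binomial accuracy, which can be
# reported alongside the average even when each task is scored pass/fail.
import math

def wilson_interval(successes: int, trials: int, z: float = 1.96):
    """95% Wilson score interval for a binomial proportion."""
    if trials == 0:
        return (0.0, 0.0)
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return (center - half, center + half)

# e.g. 270 correct answers out of 300 tasks:
low, high = wilson_interval(270, 300)
print(f"accuracy 0.90, 95% CI [{low:.3f}, {high:.3f}]")   # roughly [0.86, 0.93]
```

Repeating the 300-task run a few times and pooling the trials would tighten the interval and also expose the occasional temperature-zero non-determinism Samantha mentions.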
Arthur Coleman [00:38:57]: So I just want to compliment Samantha for the way she approaches that kind of challenge. First of all, I come from a family where you had to fight to get awarded at the table. We all argued. But besides that, this is what these sessions are for, to bring your knowledge to bear to challenge assumptions of the folks. So thank you, Samantha for doing that. I'm going to turn it over to Matt now and Matt will let you run, you know, your, your period of time and then we'll get into open questions.
Matt Squire [00:39:24]: Fantastic. Well, yeah, I think, you know, we've, we've covered a lot of ground here and I guess in the interest of time I'll, I'll try to keep my bit relatively brief. I just wanted to check are we going to. And I can see somebody's screen share still. Just, just want to. If that's intentional, perhaps what I'll do is take over the screen share and show the document in just a second. But I'll, just before I, before I share the document, I'll just say a few, a few things in terms of background. So we've, we've talked about the sort of needle question similarity and one thing I wanted to disambiguate is needle question similarity versus needle haystack similarity, which is one of the things I'M going to talk about what I wanted to just check at the point of order is are we also going to cover distractors? Are we going to come back to Adam to cover distractors? Is that the plan?
Adam Becker [00:40:26]: I think I would love to. I think we're running out of time. Probably fine.
Matt Squire [00:40:30]: Let me at least close the loop on this needle haystack similarity point in that case, and something about structure and then we can have open questions. So I will share my screen because that is the. Oh, hang on. I can share a window. Lovely. Okay, never mind, never mind. Let's see.
Arthur Coleman [00:40:58]: Can you. Can you give him screen share capability? I don't know that I can.
Matt Squire [00:41:02]: It's not so much. Yeah, it's not so much that as I can't even describe the use. Yeah, yeah, no, no. The problem is literally the user interface on Zoom is so bad that I can't click the button because it's not on my screen. Instead, what I'll do is I'll suppose that people following along can see the paper. If you can't, I'll just make it accessible as I go through it and we'll do it that way.
Binoy Pirera [00:41:26]: You know what, it's interesting. We have a developer here from Zoom Ojis. I think he'll be able to fix it for you.
Matt Squire [00:41:32]: Zoom. Oh, just really? Yeah, this is on Linux. This is really, really bad, I'm sorry to say. Okay, so anyhow, the point of needle question, similarity is to talk about the impact that is found when we have. So just a little bit of. Just take a step back there. What we're doing is we hide the needle in some larger piece of context and we're going to ask a question, we're going to try and find that needle in the context. And the question is, well, if the question we're asking is very, very similar to the needle that we've hidden versus if the question is quite different on that spectrum, how is the performance impacted? Well, we can ask a similar question about the relationship between the needle and its surrounding context.
Matt Squire [00:42:26]: And that's what needle haystack similarity is looking at. So we want to know. And this was brought up earlier by one of the questions, right? It's. We want to know if we hide a piece of information in a text that is very closely related to its surrounding text. I think the example on the right here. Well, the example on the left, actually, I found this image not especially helpful. But the example on the left is like, okay, we have an essay, a piece of creative writing, and in the middle of it we say the best writing advice I receive from is to write every week. That's at least somewhat connected to the Paul Graham essay versus thing on the right, where it's very distantly related.
Matt Squire [00:43:15]: It's really got nothing to do with the surrounding text. You can at least imagine a narrative where we're making a point about writing in a piece of creative writing, but you can't imagine a sensible narrative where we suddenly start talking about low latency rankers. Re rankers. So that's the idea here, that's what we're contrasting. And the interesting thing that is found, which has been highlighted by one of the questions earlier, is that almost counterintuitively, and it was certainly counterintuitive to me as well, that if you have that needle stand out, if it's completely unrelated to everything surrounding it, the performance is better. The large language model is somehow better able to recall that from the text. What they also found, and we've got the results here, we can kind of compare the numbers here. So they used two different corpora to do this.
Matt Squire [00:44:11]: One was a bunch of archive papers, the other was essays from Paul Graham. Those are two really relatively different kinds of text if you think about it as well. All Graham's essays are long form creative writing about startups and things related to that. Archive papers are going to be scientific papers, papers like this and what we have. So on the left we're asking if the needle we've hidden is related to arXiv, how well does it retrieve, if it's an archive like needle, how well does it retrieve from an archive paper compared with a Paul Graham like needle? I suppose it's worth remarking that the numbers here, the accuracies, aren't hugely different in this case, on the right hand side, they're a little bit more different, but again, not, you know, substantially different. And I would echo the point that was made earlier about error bars and about asking ourselves, really repeating these experiments over and over again and asking ourselves about in the average case, how performance varies, which I feel is perhaps missing from this picture. In any case, that's the idea here. That's the idea.
Matt Squire [00:45:24]: And I hope, certainly I hope that the needle haystack similarity versus needle question, similarity is clear. The other thing that is worth looking at really briefly is structure and how structure impacts things. Because once again, this is something that I found surprising, somewhat counterintuitive, but maybe not surprising because our expectations about how a large language model reads a piece of text should not be. We shouldn't kind of overlay Our assumptions about how we might read a piece of text. And maybe that's where our intuition gets messed up here. So they asked how the structure of a piece of text impacts the performance of retrieval when we're trying to retrieve these needles and we can take the original piece of text. If you take an essay, then an essay should be a well connected piece of narrative. Each paragraph should lead to the next paragraph.
Matt Squire [00:46:25]: That's how we expect to read something, right? There's a logical structure to it. If you shuffle all of the sentences or all of the paragraphs in a piece of text, then it's not going to be very easy for us to comprehend. Well, it turns out that actually, and I just want to make sure I've got this right, that if you shuffle the sentences, actually performance at retrieval is better. And that's again, that's the thing that I thought was. That's kind of weird and almost. It's like the maybe because there is no logical structure, the needle stands out. Maybe there's an intuition there. But again, that's kind of weird, right? It's kind of, you know, if you were reading a piece of text.
Matt Squire [00:47:16]: Well, maybe it would be easier to find the needle if you're reading a piece of text that has no logical structure, because you would very quickly learn to not bother paying attention to the logical structure. In any case, I think that's all there is to remark on when it comes to structure. Just want to scan these. These last bits and that in that case, do we want to move on to whatever's next? I've kind of lost track. Arthur.
Arthur Coleman [00:47:43]: Sorry, I was on mute to save the noise in the background.
Matt Squire [00:47:45]: That's fine.
Arthur Coleman [00:47:48]: We have a couple of questions. We may have time. I really would like to get to distractors and I want to make sure I get a chance to say one thing toward the end as a sort of a question statement. But let's go to the questions. John, if you're able to chat. John Savage, I assume, because I can see you. John. Hi.
Arthur Coleman [00:48:06]: You want to ask your question?
Bauke Brenninkmeijer [00:48:09]: Yeah, I'm just curious about the impact of the thinking models and just more. Did you have a gut feel of why the thinking models perform better? Is it because they're kind of more reliably able to kind of like work with longer context? Or is it more like an artifact of. Of this exact task, this long mem. Task?
Kelly Hong [00:48:30]: I. I wouldn't have enough confidence to say like what it is attributable to. I think generally for like, more like complicated tasks, like having Reasoning typically boosts performance as we've seen like other benchmarks that these labs release too. So I think it's just like a result of the model being able to reason through, maybe think about where this needle is located. And then I think especially for long min eva, some of them require multi help reasoning. First you have to think about all the relevant parts of the context and then reason about the answer. Give the model the opportunity to think about that. It typically performs better.
Kelly Hong [00:49:10]: So I think could be generalizable. But yeah, we didn't run specific experiments to test like reasoning models, like the impact of them specifically. So not totally sure, but yeah, I would think that it would generalize to more like complicated tasks.
Arthur Coleman [00:49:33]: Okay, Balkay, you're at Docker.
Bauke Brenninkmeijer [00:49:38]: My question relates to the shuffling chunks increasing performance finding, which is is surprising. And also I feel like it should have an impact on like how we do rack pipelines or long, long context Q and A. But I don't look like I'm not sure what that step looks like. Like what are the things that we now are going to change in frameworks, for example, or should we change anything? So I'm curious to hear if you had any takeaways from that in terms of practice.
Kelly Hong [00:50:10]: I think for the haystack structure experiment, I wouldn't say there's a practical takeaway because I think the main motivation for this was to see, okay, how does structure just impact LLM behavior? And I think ideally if you're working in a Q and A or retrieval scenario, you would want to limit as much irrelevant context as possible. So you wouldn't want to have those chunks in the first place. But I think for this it was really interesting because we thought that having a logically sound haystack and then the needle in the middle, that would break the logical flow and therefore make it easier to stand out. But what we actually found was that if you have a needle and a shuffled haystack, we thought it would blend in more, but that actually made the models perform better. So I think it was just an interesting observation of model behavior and more so a direction for interpretability rather than practical applications. For this, I think for needle and haystack the main takeaway from that is to limit the amount of irrelevant context, limit the amount of distractors, just try to focus your context on as much relevant information as possible. So I don't think there is something to apply from this, but it was just an interesting observation that we had. Hope that makes sense.
Bauke Brenninkmeijer [00:51:27]: Yeah, sure.
Adam Becker [00:51:29]: Yeah.
Bauke Brenninkmeijer [00:51:29]: So maybe one follow up because to Me, it sounds like if I say I have 100k tokens in my programming assistant context window. To me, it sounds like if I want to ask a question about something that's happened earlier, it would almost be better to chunk it, redistribute it into the context window and ask the same question. And it would work better. Or maybe I'm misunderstanding then be defined.
Kelly Hong [00:52:00]: Potentially. But this wasn't really on conversational. So I think the type of context could matter to whether you're working with conversational inputs or this is just archive and polygram essays. So it might be different. But I mean, if you're working in a long running conversation and you want your next query to tend to a specific part of the context, I think I'd rather do a re ranking step where you maybe just do another LLM pass of okay, what is the most relevant part of this context in relation to my question? And then do another LLM pass asking the question. So I think I would rather filter the context for relevant content rather than shuffling it. I think that would lead to much better performance.
Bauke Brenninkmeijer [00:52:45]: Yeah, okay, but that's also much more expensive. But okay, I understand where you're going.
Samantha Q [00:52:53]: Yeah, I was just going to back Kelly up on that. We had a similar observation that filtering works much better than shuffling.
Kelly Hong [00:53:04]: Thank you.
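A sketch of the two-pass filtering Kelly suggests instead of shuffling: one LLM call to extract the relevant parts of a long context, then a second call to answer against only those parts. The model name and prompts are placeholders, not a recommendation from the report, and, as Bauke notes, the extra pass adds cost and latency.

```python
# Sketch of "filter, then answer" rather than shuffling the context.
# Assumptions: model name and prompt wording are illustrative placeholders.
from openai import OpenAI

client = OpenAI()

def ask(prompt: str, model: str = "gpt-4o-mini") -> str:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def filtered_answer(context: str, question: str) -> str:
    # Pass 1: extract only the parts of the context relevant to the question.
    relevant = ask(
        "Copy the passages from the text below that are relevant to the question, "
        "verbatim. If nothing is relevant, say NONE.\n\n"
        f"Question: {question}\n\nText:\n{context}"
    )
    # Pass 2: answer against the focused context instead of the full one.
    return ask(f"Context:\n{relevant}\n\nQuestion: {question}\nAnswer:")
```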
Arthur Coleman [00:53:06]: I'd love to. I want to talk distractors, because my comment has to do with that.
Adam Becker [00:53:12]: You want me to do a quick.
Arthur Coleman [00:53:13]: Yeah, do it quick because I want to talk about ambiguity. I really do.
Bauke Brenninkmeijer [00:53:16]: It's.
Arthur Coleman [00:53:17]: It's.
Adam Becker [00:53:18]: So let me share that one. I thought distractors were really very, very interesting if anybody wants to zoom back into them even afterwards. So the idea here, newer models are claimed to reliably handle any distractor, but is that true? And is that true as input length increases? So the idea is very simple. A distractor is a topically related concept. It's related to the needle, but it doesn't quite answer the question. And everything else that's not even related to the needle, we just call it irrelevant content. Right. So then the idea here is let's try to trip it.
Adam Becker [00:53:57]: Let's give a bunch of different distractors. So what was the best writing advice I got from my college classmate? Well, the distractor could be the best writing tip I received from a professor, or the worst from a classmate, or the best from a classmate or the something totally different. So the idea is to throw those in and then to see how performance degrades. And this is the case of no distractors here. When we throw one in and then we throw multiple in and we see that almost unsurprisingly, the more distractors you have, the worse it gets, certainly as the input length increases. There's a couple of interesting aspects here, but another thing that's very fascinating is that one of these distractors is particularly pernicious. Why? What's happening there? One of them tends to be really throwing everything off and then they go into a full error analysis modes to figure out which one it is, how to think about it a little bit. But that's the 32nd rundown.
Arthur Coleman [00:54:53]: Okay. And this was what was bothering me, Adam. That's why I wanted you to go first. So throughout this, Kelly, you keep talking about ambiguity, but you never defined it. And I was like, okay, wait a minute. First of all, what really bothered me is, as I said, I am a wordy, wordy prompt engineer. I mean, I have literally at one point hit the limit of the context window. Okay? So that's how bad it can be.
Arthur Coleman [00:55:19]: And basically you were telling me that, you know, I'm always going to do worse, that I need to be like the machine, I need to think like the machine. But I'm a product person and I've also been doing an, you know, AI of the robotics guy. I've always wanted the machine to come to me, not the other way around. And so what I asked ChatGPT to do, because I wanted to see this is I, I basically wanted to recreate your experiment for the other direction. So what I asked CHAT GPT to do is I said we're going to do a long conversation and in there I'm going to ask two questions that are similar and I want you to stop and tell me when you found the similar question. And so there were plenty of distractors. And I asked at the same time, if we're going to do this, you have to tell me why it's ambiguous to you. So come up with an ambiguity metric so that I know my relative level of ambiguity.
Arthur Coleman [00:56:14]: So we're doing this test and I realized, wait, let me look at length to your point, context length. And I want to share my screen here versus whoops. Thank you. Hold on. I'm going to share two images. So the first thing was, what was the ambiguity I had to go through? I've been doing with ChatGPT for three weeks, extensive work, like every day, hundreds of questions, hours of work. And I asked you to go through and take a sample of my questions and tell me, you know, what the average ambiguity was and came up with a metric I won't go to the detail. My average was 30 or so.
Arthur Coleman [00:56:55]: And you can see the comment about distribution. Here is the distribution of my ambiguity scores on the various questions. Now, it really is interesting. And Lakeshore, can you see this chart? The scatter plot? Can you see it?
Kelly Hong [00:57:11]: Yes.
Arthur Coleman [00:57:12]: Okay, so I asked it to look at the number of tokens. This is the number of tokens, and this is the ambiguity score. And basically the, the conclusion is there's very little relationship between the ambiguity score of my short versus long questions. And it's like, well, wait a minute now, do I interpret that as saying that slightly better, lower ambiguity, higher ambiguity with longer questions. And the answer was, wait a minute, maybe it's just that each individual has an ambiguity level in their questions and that's the real driver here is how much each individual actually is ambiguous in the way they speak. It's about natural language and how you talk to the machine. That's the question that I think I really would have loved to have us answer. And I actually could tell you the results of asking about the two questions.
Arthur Coleman [00:58:06]: But we are. I really wanted to point this out because I think the question was phrased the whole point of the research. I just would have come at it as a practical product person. A person deals with customers from this perspective rather than from asking how the LLM worked, ask how people work relative to the LLM. So that's my comment back to you. And I apologize if it sounds critical, but I wanted to, I wanted to at least people hear that, that viewpoint.
Kelly Hong [00:58:33]: Right? So I think, like, you're curious about how we define ambiguity. I guess, because I think if we just talk about ambiguity, it sounds, I mean, I think it's very subjective. So in our research, specifically how we defined it is we use cosine similarity between needle and question pairs. We can't really do that with just one individual question because to do cosine similarity, you need two embeddings. But basically, like, we like ran this across like five different embedding models and we would get like the cosine similarity between like each, each like question and needle pair. So I guess that kind of gave us like a quantified metric for like, how similar like needle question pairs are. And that's how we define ambiguity. Because I agree that just saying, like, oh, this needle question pair is like ambiguous.
Kelly Hong [00:59:18]: That's not really like, enough. So yeah, that's why we like quantified that through those metrics. But I think for your experiment, how did you get the histogram and Boris, I'm a little confused on that.
Arthur Coleman [00:59:33]: I will take that offline with you because we're two minutes over. I just wanted to point that approach, that turn it around approach issue to discuss it for further. But we're over time and I always try to keep us on time. I want to thank everybody for coming. I want to thank our speakers who did a great job talking through and Kelly, I want to thank you for taking time to come and present and not only that but to write the paper because it's all very interesting work and writing research takes huge amounts of time that probably you don't have given your day to day job. So thank you everybody and we'll see you all in three weeks. September 11th, right? Is our next session Hanoi, right? Yep. All right guys.
Arthur Coleman [01:00:18]: All right, very good. Take care.
Adam Becker [01:00:20]: Thank you everybody. Thanks Kelly.
Arthur Coleman [01:00:21]: Thanks everyone.
Bauke Brenninkmeijer [01:00:22]: Thanks Kelly.
Adam Becker [01:00:23]: Bye.
Matt Squire [01:00:23]: Thanks.
Kelly Hong [01:00:24]: Thank you.