Stress Testing and Diagnosing MCP-enabled Agents on Challenging Queries
SPEAKERS

Sophia Skowronski is a Data Scientist at Breckinridge Capital Advisors with previous experience as a Business Analyst at Pledge 1%. Sophia has also worked as a Data Science Intern at Candid, an AI Investigations Intern at Deep Discovery, and held roles at Singularity University and the Global CO2 Initiative. Sophia holds a Bachelor of Arts in Astrophysics and Cognitive Science, as well as a Master's degree in Information & Data Science from the University of California, Berkeley.

David DeStefano has joined Adonis as a Senior Machine Learning Engineer in the United States, previously serving as Lead Engineer for ML/AI Platform & Infrastructure at EvolutionIQ.

Valdimar E. was raised in Reykjavík and lives in Berlin. He studied computational and data science, did R&D in NLP, and started building LLM apps as soon as GPT-4 changed the game.

Arthur Coleman is the CEO at Online Matters. He has also held three prior roles, including VP Product and Analytics at 4INFO.
SUMMARY
As AI agents become more capable, their real-world performance increasingly depends on how well they can coordinate tools.
This month's paper introduces a benchmark designed to rigorously test how AI agents handle multi-step tasks using the Model Context Protocol (MCP) — the emerging standard for tool integration.
The authors present 101 carefully curated real-world queries, refined through iterative LLM rewriting and human review, that challenge models to coordinate multiple tools such as web search, file operations, mathematical reasoning, and data analysis.
TRANSCRIPT
Arthur Coleman [00:00:00]: Okay, good morning. Hopefully everybody can see my screen. Let me get up here where I can control it. Now I can't see you guys while I'm doing this, so actually what I'll...
Valdimar E. [00:00:15]: Do is I'll set it up here. There we go.
Arthur Coleman [00:00:28]: This double wide screen is just very interesting to deal with when you've got the meeting going. Okay. Good morning. There we go. Welcome everyone to our monthly reading group. The discussion today is on a very interesting paper about a new measurement technology for MCP performance, called LiveMCP-101, which stress tests and diagnoses the performance of MCP-based applications, agents and setups. Basically, it's a really fascinating paper.
Arthur Coleman [00:01:13]: There's some interesting, very interesting results in it on a number of dimensions and I think our speakers are going to give you a really interesting insight. I was trying to have a poll here to see how many people have set it up, but the polling system I use, Slido, is not cooperating this morning. If I get it to work, I will ask. Let me take you through our speakers. First of all, Valdimar is back, who is an AI team lead at Smart Data Inc. And Valdimar, do you run one of our chapters?
Valdimar E. [00:01:46]: Yeah, kind of the Berlin one, right?
Arthur Coleman [00:01:49]: That's what I thought. So we have one of our chapter leads from Europe, which is always a pleasure. And let me tell you something, guys, being a chapter lead is a lot of work. And so I greatly appreciate when someone who is already doing a lot of work for the organization takes the time to prep and do one of these sessions because it is a lot of work to prep these sessions. And the second is Sophia Skowronski, who is at Breckinridge Capital. I've introduced Sophia before and I've had her actually talk to my son, believe it or not, who is a physicist out of MIT and a researcher there in climate and sustainability. He was very grateful, by the way, Sophia, for your time. And she's a data scientist, which I haven't figured out.
Arthur Coleman [00:02:30]: Sophia, how you are a data scientist at. And by the way, we are recording, right?
Valdimar E. [00:02:37]: Let me check.
Arthur Coleman [00:02:38]: Yep, we're recording. ...how you can be a data scientist at a venture capital/PE firm. You're going to have to explain that to us someday. Okay, not at the moment. Not at the moment.
Sophia Skowronski [00:02:51]: And you can ask all about MCP.
Arthur Coleman [00:02:53]: Okay. You can see our LinkedIns. We do like you to contact us and connect with us. It's always great to be talking to members of the community. Let me just say, if you do connect with me, I get like 10 invites a day, please. In the note where you say, add a note when you connect, say I heard you or met you through the MLOps community, because otherwise I will reject the request. And I'm sure that's true for Valdimar and Sophia as well, especially since Sophia is in the fundraising business. I bet she gets hit like 10 times a day by people who want to connect with her to see if she has any money to invest.
Arthur Coleman [00:03:28]: Okay, so the agenda today is Valdimar is going to start. We lost one of our speakers. He couldn't make it today. So Valdimar is doing double duty. Thank you, Valdimar. He's going to start with the abstract and intro, as I understand it. Sophia then is going to go through the methodology, which is really interesting, and then also the results.
Arthur Coleman [00:03:46]: Valdimar will jump back to do the results. We have questions at the end. The link to the question document is in the chat. And so you put your questions there. You'll see. I actually added one already of my own. So you can follow that model where you put your name and then your question. Because I will call on you when the time comes.
Arthur Coleman [00:04:06]: I'm going to call on you if you put a question in to ask your question directly. I'm not going to be the mediator for question and answer. You can talk to the speakers directly now. Guiding principles of the reading group. These are your. And I do these all the time. You're probably bored of hearing me say it, but they're important guides. First of all, these are your sessions.
Arthur Coleman [00:04:26]: They are meant for you to learn. So the more that you participate, the more that you ask questions, the better. These are going to be okay. And we've had great interactions in the past. That's really added value, especially when people who are actually doing something like multiple agents on MCP say, hey, I was trying this, I found this. Why are you saying that? Because it's not what I'm seeing. Those kind of comments are really valid and very valuable. Excuse me, all comments are valid.
Arthur Coleman [00:04:54]: Also, this is a no judgment zone. I want to be very clear. There are no dumb questions. No one is here judging you to say, oh, boy, you know, is this guy smart or not? We're here to learn together in a safe environment. So please bring that attitude to the session. Lastly, Valdimar, I think you're going to show a Miro board. I just want to be clear. People are going to ask, hey, can I get a link to the Miro board.
Arthur Coleman [00:05:18]: That Miro board, Valdimar confirmed for me, is view only.
Valdimar E. [00:05:24]: Yeah, yeah, it's. It's. Me and Sophia have access, but I cannot make it public, so. Unfortunately not. But it's just snippets mostly from the paper and a couple of things from Sophia that you'll see in the presentation.
Arthur Coleman [00:05:40]: Very good. And always, always guys, we want to serve you. You are our customers, if you will. I'm a product guy, so I always think like that. And we want these sessions to be a great use of your time. Not even kind of a good use, but like a fabulous use of your time. Filling out the post event survey helps us do better every time. So please do fill out the post event survey that will come in your mail.
Arthur Coleman [00:06:05]: And at that point I'm going to stop sharing my screen and turn it over. How do I do that? Stop share and I turn it over to Valdimar. Go for it.
Valdimar E. [00:06:16]: Yeah, thanks. I'm sharing my screen. I will put this into presentation mode. Where is that? So we have this Miro board. I will scroll down. Do you see it? Yes, yes. All right, I'm gonna just. Yeah.
Valdimar E. [00:06:47]: Cover the introduction, hand it over to Sophia and then I'll talk about the results. We'll have some comments in the end. Just first, quickly go through it. It's not a very long paper. It's mostly just a description of how this benchmark was made. So what is it and what is it about? LiveMCP-101: Stress Testing and Diagnosing MCP-enabled Agents on Challenging Queries was basically about building this 101-example benchmark, or test set, test environment, for benchmarking or measuring the effectiveness of different agents. Sophia will talk more about the methodology, how they constructed this kind of synthetic data with human review, against MCP tools and servers and what they can do. Yeah, they made this evaluation approach, which was also one of the main contributions.
Valdimar E. [00:07:57]: It was a bit special, or they were kind of evaluating whether the agent went the right path in selecting the tools, because the benchmark is about these challenging input queries. The agents are not chatbots. They're more like, if you're familiar with code assistants, where you can just tell the code assistant, hey, let's make an app that does this and this, and it has to decompose it into a plan and return the right answer. So we'll learn more about that in a bit.
Valdimar E. [00:08:33]: And they use the benchmark to evaluate different frontier LLMs. So basically evaluating the different foundation models to see which ones are best. And yeah, it turns out you cannot quite yet just ask an LLM with MCP tools to do, you know, anything you throw at it, especially not these challenging queries. So they're actually a bit peculiar. I thought they were hard when I was trying to understand how to do them. They're also just a bit weird. We'll see them in a bit.
Valdimar E. [00:09:10]: Then we have a bit of a failure analysis, like what's failing in which models? It's different in the leading proprietary models versus the open source ones, for example, the smaller and bigger ones. We can have a little look at that. So here's the big picture, I guess, that starts here, kind of, with the user query in the middle. I'm assuming you see my mouse. We have the MCP tools, so maybe before going to that let's have a look at what MCP is. I'm kind of assuming people are familiar with these LLM AI assistants, and that's part of the related work, is that we have now these agents with tool use and you can make your own tools.
Valdimar E. [00:10:09]: But a lot of people in the last year or half a year have been publishing their tools using the Model Context Protocol that was made by Anthropic last year, which is like a standard protocol, similar to, you know, HTTP, or it's supposed to be like the USB-C of connecting tools between agents and systems, porting data everywhere. And you can make an MCP server with your tools and make it kind of open source, and there's a lot of options. I've used it a bit, but I heard that there are a lot of MCP servers and a lot of tools. But I'm not sure if there are a lot of users actually using them. I'm not sure how well that technology has been realized. But in theory you can, like they do in this paper, give the agent access to whatever, here they are: travel, yeah, you need to find the hotel or flights or whatever, coding tools, and just pull that in, add it into your AI agent's prompt. Basically you give it the documentation for the tool and it can use it to solve your query or your problem.
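To make the "give it the documentation for the tool" step concrete, here is a minimal sketch of the kind of tool description an MCP server advertises and a call an agent might emit against it. The tool name, fields, and values below are hypothetical illustrations, not taken from the paper.

```python
# Hypothetical example of the metadata an MCP server advertises for one tool.
# The agent only sees this documentation (name, description, input schema)
# and must decide when to call the tool and with which arguments.
airbnb_search_tool = {
    "name": "airbnb_search_listings",  # hypothetical tool name
    "description": "Search Airbnb listings for a location and a date range.",
    "inputSchema": {  # JSON-Schema-style description of valid arguments
        "type": "object",
        "properties": {
            "location": {"type": "string", "description": "City or neighborhood"},
            "check_in": {"type": "string", "format": "date"},
            "check_out": {"type": "string", "format": "date"},
            "max_price": {"type": "number"},
        },
        "required": ["location", "check_in", "check_out"],
    },
}

# A tool call the model might emit against that schema:
example_call = {
    "tool": "airbnb_search_listings",
    "arguments": {"location": "Toronto", "check_in": "2025-12-01", "check_out": "2025-12-04"},
}
```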
Valdimar E. [00:11:36]: And doing that, it has to make a plan. And what they do is they construct these queries and what this real plan should be. That's what their contribution was, and they end up with this benchmark of 101 different queries and accompanying paths that solve the task. So yeah, let's say you want to find something about hotels. One of the examples in the paper was related to finding accommodation not too far from a basketball game for a specific team. And these are different data that need to be pulled in and merged, kind of. And you'd use a tool for Airbnb for the hotels or whatever, Airbnb, booking.com.
Valdimar E. [00:12:29]: That's an MCP tool you call, and you have the Google Maps API and the NBA MCP tool that apparently somebody published. And then it's all evaluated using this LLM as a judge, which we. My headphones disconnected. Can you confirm that you hear me? Is someone there?
Sophia Skowronski [00:12:59]: Yeah, we can hear you.
Arthur Coleman [00:13:00]: We can hear you.
Valdimar E. [00:13:01]: I'll be without the headphones then. Yeah. It's a lot of AI generated stuff. So it's maybe a limitation that you have an AI evaluation of an AI-generated benchmark to test the AI, and that has some effects. For example, we know that certain language models prefer their own kinds of text. So it's all generated by GPT and not Claude or whatnot, and that might affect how it's evaluated; I think it's evaluated by OpenAI as well. Anyway, so this is the big picture: synthetic benchmark, AI evaluation. And you can use this to evaluate different foundation models like they do, but you could also use this benchmark just to test different prompts or different tool selection components or whatever you want to experiment with for your agent.
Valdimar E. [00:14:00]: And the benchmark isn't out yet. This paper is out. I found one of the authors, texted him, and he told me it will be out soon. But, like, it should be useful. I mean, yes, a benchmark should be used. It's not really science.
Valdimar E. [00:14:18]: It's more like an engineering tool. Now I hand it over to Sophia. Do you want to control the screen? I can also scroll.
Sophia Skowronski [00:14:29]: Yeah, I think I can control it, right? Yeah, looks like it. Sorry. First time using Miro, but seems like we set this up easy enough.
Valdimar E. [00:14:38]: Do you want to share? I'll give you that.
Sophia Skowronski [00:14:40]: Yeah, I'll share. Share Sounds good. All right. Yep. Okay, great. So you should be able to see same screen, right?
Arthur Coleman [00:14:54]: Yeah.
Sophia Skowronski [00:14:57]: Cool.
Valdimar E. [00:14:57]: You could go to presentation mode. Yeah.
Sophia Skowronski [00:15:03]: Let me see. Hold on. Sorry. Share. Join. There we go. Well, this is as close as we're gonna get. Sorry, guys, but.
Sophia Skowronski [00:15:18]: So I'm just gonna quickly walk through.
Arthur Coleman [00:15:21]: I see your screen. Sophia, you can't. No.
Sophia Skowronski [00:15:23]: Okay, so let's try it again. Sorry, guys. Okay, let me. The zoom. Share screen. Just.
Arthur Coleman [00:15:37]: I think I can see your Screen.
Arthur Coleman [00:15:39]: Sophia? Valdimar?
Sophia Skowronski [00:15:40]: Can you, can you not see?
Arthur Coleman [00:15:41]: Sorry, my bad.
Valdimar E. [00:15:42]: Yeah, it just looks the same, this looks the same as mine earlier, but I think it's, it's yours now.
Sophia Skowronski [00:15:46]: As me, I'm in control now. Yeah, it was like Zoom disappeared. Okay, anyway, so as Valdimar said, I will be covering the construction, the evaluation design and the experimental setup, which is in the middle of the paper. So we'll just start off with the construction of these queries and these execution plans. So the goal of this data set, this benchmark, is to push these LLM agents beyond just mere toy function calls, which they kind of describe as single step calls using a synthetic environment like a mock database. And they want to generate 101 tasks that combine realistic user queries that can be used across multiple domains and, most importantly, are in dynamically changing environments, because the real world is the actual test for LLMs in production. So they mentioned each of these tasks that they generated had to meet these three qualities: they had to be solvable, they had to have a verifiable end state, and they had to be easily mapped to some level of difficulty.
Sophia Skowronski [00:17:12]: So easy, medium and hard. Pretty straightforward. So the first step for generating these queries was to first sample from a domain. And in the longer paper that I put in the chat, they created this little pie chart that kind of covers all the domains that they found within these 41 MCP servers that they used for generating the benchmark data set. And so you can see the pie chart; because it's not a bar chart, it's hard to tell which one's the largest one. But in this case the largest domain was travel and leisure, followed by software development and Office. So they used GPT-4.1 to select the domains for each of the 101 tasks. Let's see if there's anything else there.
Sophia Skowronski [00:18:04]: Yeah, and so next step then. So they know what type of domains they're going to use to generate the queries. The next step is actually generating some queries to use in the benchmark. So they used GPT-4.1 to sample the domains and then they used the O3 reasoning model to create the initial natural language queries. And so they had these queries generated, they mentioned, conditioned on the specific domain and specific tool specifications. So they wanted to make sure each query just naturally lent itself to multi-tool reasoning. And they stated in the paper that some of these initial rounds of query generation were really messy. So they did multiple rounds with manual revision. And so the final outputs for each of these, they determined, met those three qualities that I listed, and especially that they're clear.
Sophia Skowronski [00:19:07]: Which seemed to be an issue with some of the generation of these early drafts. So from there that led to 101 tasks that could be segmented into these three different categories, with tool chain lengths between 2 and 15, the average being 5.4. Not that exciting. But so then we have these queries here, and how do we go about evaluating the plans for these queries? Will the final output be enough? And the authors make the case that no, it's not enough, and this is why we need this new type of benchmark, because MCP servers run on live, time-varying data. So using just ground-truth outputs of these queries at any given time may drift between runs. So they created these ground truth execution plans. And again the longer paper has these examples in the back.
Sophia Skowronski [00:20:05]: But so again, how did they create these execution plans? Very similar to the queries themselves. They used the O3 model to draft, and again they compared the drafts. Yeah, so they just did multiple rounds of drafting where they had humans in the loop, these 120 PhD hours, to check that there are no errors in the logic, the tool choice is correct, the parameters to invoke the tools are correct, and overall it's following a logical output. And so then they generate these execution plans, the PhDs got to go home, and the resulting set is 101 queries plus these gold standard tool invocation chains, as they kind of referred to it. And so this kind of begs the question, are these human-validated plans the only valid trajectory for all of these queries? It could be for some of the easier queries, probably the very simple ones, but you can kind of say that there are probably multiple valid solutions for some complex queries. So we just need to keep in mind that potentially we're favoring models in this benchmark that mimic these human-validated tool orderings. And so I think that's kind of the gist for construction.
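As a rough illustration of what one of these benchmark records might look like once the dataset is released, here is a hypothetical sketch of a query paired with its gold tool-invocation chain; the field names, tools, and arguments are invented for illustration, not taken from the paper.

```python
# Hypothetical shape of a single benchmark task: a natural-language query,
# a difficulty label, and the human-validated "gold" execution plan, i.e. the
# ordered chain of tool calls (with arguments) that solves it.
task = {
    "id": "task_042",            # hypothetical identifier
    "difficulty": "hard",        # easy / medium / hard
    "query": "Find a place to stay within walking distance of the arena ...",
    "gold_plan": [
        {"step": 1, "tool": "web_search",     "args": {"query": "NBA arena address ..."}},
        {"step": 2, "tool": "python_execute", "args": {"code": "# compute check-in/check-out dates"}},
        {"step": 3, "tool": "airbnb_search",  "args": {"location": "...", "check_in": "..."}},
        {"step": 4, "tool": "maps_distance",  "args": {"origins": "...", "destinations": "..."}},
        {"step": 5, "tool": "write_file",     "args": {"path": "report.md", "content": "..."}},
    ],
}

# The reported tool-chain lengths range from 2 to 15 steps, averaging 5.4.
assert 2 <= len(task["gold_plan"]) <= 15
```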
Sophia Skowronski [00:21:52]: And so that kind of covered this first half of, if you can see my mouse, of course, this first chunk of the diagram from the paper. So now we're kind of going to go through the evaluation framework pretty quickly.
Valdimar E. [00:22:07]: So. Sorry, sorry. Can we maybe look at one query and.
Sophia Skowronski [00:22:11]: Oh yeah. Oh yeah.
Valdimar E. [00:22:12]: Just to make it more concrete.
Sophia Skowronski [00:22:14]: Yeah, yeah, yeah.
Valdimar E. [00:22:15]: So hard one is weird one.
Arthur Coleman [00:22:19]: Yeah. So obviously don't have kids. Valdemar.
Sophia Skowronski [00:22:25]: They speak in riddles. Yeah, this one's kind of funny, as realistic user queries go. I think there's an even stranger one that I could go to. But this hard query starts off with a riddle first, like you need to know what NBA team your 9 year old son is referencing, and then from there come up with a travel itinerary using Airbnb pricing, figure out the walk score of all these Airbnb addresses, and produce a report to kind of surprise your son with a vacation. Yeah, with a spreadsheet. Yeah, they love that stuff. But yeah, so you can see the riddle itself is not part of the tool calling, because in theory the LLM has read all the Wikipedia for the NBA teams and all of Steven Spielberg's films. So in theory it.
Sophia Skowronski [00:23:25]: It's supposed to get that it's the which who is it? Toronto? I don't know Pascal, at all.
Valdimar E. [00:23:31]: It was the Toronto Raptors. The Raptors were famous. They were named in '95, back when Jurassic Park was the biggest movie. Yeah. Oh, sorry, no, just the velociraptors.
Arthur Coleman [00:23:46]: Yep.
Valdimar E. [00:23:46]: Anyway, go on.
Sophia Skowronski [00:23:49]: And it's the movie. Ready player one. Right. That's what we found.
Valdimar E. [00:23:53]: That was what Gemini told me. But it's definitely Jurassic Park. Because of the dinosaurs.
Sophia Skowronski [00:23:59]: Yeah, yeah. Okay, got it. So yeah, so the first step of this execution plan is to look out 60 days from now, when they want to travel, and figure out the ideal check-in and check-out dates based off of 60 days from now, which is the second tool call, which is just Python execution. Then the next tool is searching Airbnb listings given the region and the dates. And then step four is using a Google Maps distance matrix tool, which sounds pretty cool; I haven't used that. And so it allows them to basically get the Airbnb information with the walking data, and then in the final two steps, process it into a table and then write to a markdown file. And so I guess that could be the only way to do it.
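The Python-execution step in that plan is just date arithmetic; a minimal sketch of what the agent would have to compute is below. The 60-day horizon comes from the query as described; the three-night stay length is a hypothetical choice for illustration.

```python
# Minimal sketch of the pure "Python execution" step: derive check-in and
# check-out dates 60 days out from today. The three-night stay is a
# hypothetical choice for illustration, not a detail from the paper.
from datetime import date, timedelta

today = date.today()
check_in = today + timedelta(days=60)
check_out = check_in + timedelta(days=3)

print(check_in.isoformat(), check_out.isoformat())
```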
Sophia Skowronski [00:24:54]: But yeah, I'm trying to think. You can probably look at the appendix, Valdimar, or we can look at the end. But there's a really complicated one too that I wanted to highlight. But so yeah, that's kind of the gist of what some of these execution plans look like. So in terms of evaluating, as Valdimar already alluded to, there's a parallel design here, and that's to again ensure that the time-varying piece is not an issue for the live MCP servers. Each task runs just two agents at the same time. I don't know if it's just two PhD students pressing play at the same time or if they have some caching MCP server process going on. They didn't really go into that level of detail.
Sophia Skowronski [00:25:46]: I'm guessing that when the code and all of the queries are released, you could dig into it in more detail. But in any case. So there is a reference agent, which is GPT-4.1, that strictly follows the human-validated execution plan that I showed above. And its intent is to produce the reference output. And then there's a test agent, which is one of 18 reasoning LLMs, that only gets the natural language query and a predefined set of MCP tools. And so it must do independent reasoning. It must select its own tools.
Sophia Skowronski [00:26:27]: It has to choose and write the parameters and also indicate when it's done finding the output. And so again, temporal drift and APIs. So they do it at the same time. One of the examples they gave was YouTube engagement stats. So again, I imagine YouTube likes for a video can change by the minute. So again, having these run at the same time is very important for having similar, comparable outputs. So how did they set up each task run? You can kind of see it in some of these listed settings here.
Sophia Skowronski [00:27:09]: So each task is run once per test agent. And I'm not 100% sure if that's the case; it kind of seems like for each of the 18 models they just run through the 101 tasks one time. I'm not sure if that's for budgetary reasons. And then each task, or each agent run, is limited to 30 iteration rounds. And then I already mentioned that there is a predefined MCP tool pool, which contains the essential tools required to complete the correct solution. So this was also generated at the same time as the queries and the execution plans. But they also added, per task, different sets of distractor tools that are drawn from unrelated servers.
Sophia Skowronski [00:28:01]: And so that's in a range. So each task had its own specific tool pool of around 15 MCP servers, with a range of tools available depending on the task, again just to prove that each agent can discover and use the right tools and ignore the noise; a rough sketch of how such a pool might be assembled follows below. And they say that this replicates real world production conditions, which I'm not necessarily sure is the case, that you would throw in a lot of distractor MCP servers when you're building an application. But again, I think this is really just to stress test, for the purposes of making it as complicated as possible, to see how well they do. And so, yeah, let's see. So again, so the tool pool. And so each test agent has the same prompt using a ReAct framework.
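Here is that sketch: a hypothetical way to assemble a per-task tool pool by mixing the tools the gold plan needs with tools drawn from unrelated MCP servers. The paper does not publish this code, so the data structures and sampling details are assumptions.

```python
import random

def build_tool_pool(required_tools, all_servers, n_distractor_servers=10, seed=0):
    """Mix the tools the gold plan needs with tools from unrelated MCP servers.

    required_tools: list of tool specs (dicts with a "name") used by the gold plan.
    all_servers:    dict mapping server name -> list of tool specs it exposes.
    """
    rng = random.Random(seed)
    required_names = {t["name"] for t in required_tools}

    # Servers that contribute none of the required tools act as distractors.
    unrelated = [server for server, tools in all_servers.items()
                 if not any(t["name"] in required_names for t in tools)]
    chosen = rng.sample(unrelated, min(n_distractor_servers, len(unrelated)))

    pool = list(required_tools)
    for server in chosen:
        pool.extend(all_servers[server])
    rng.shuffle(pool)  # avoid leaking the answer through tool ordering
    return pool
```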
Sophia Skowronski [00:28:55]: And I think you can still see my screen, right? And so here's. I think some of us have either used ReAct before or have seen this before. But yeah, all they're getting is the query. And this is.
Sophia Skowronski [00:29:08]: These are the steps that they need to take in order to produce the output. So they want them to think about it, execute a tool call and then observe the results, and then go back to thinking again or determine if their thinking is done. So yeah, let's see if there's anything else. And then what's interesting is, again, they're using GPT-4.1 as the reference agent and using a separate one for the judge. Let's see. Yeah. Okay, and then for evaluation, there's a big table that Valdimar will run through.
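The think / act / observe cycle described here is the standard ReAct loop. Below is a minimal sketch of such a loop with the 30-round cap mentioned above; the `call_llm` and `call_mcp_tool` functions and their return shapes are hypothetical stand-ins for a real model client and MCP client.

```python
def run_react_agent(query, tool_pool, call_llm, call_mcp_tool, max_rounds=30):
    """Minimal ReAct-style loop: think, act (call a tool), observe, repeat,
    until the model declares a final answer or hits the iteration cap
    (30 rounds, matching the setting described above).

    call_llm(messages)        -> {"tool": ..., "args": ...} or {"final_answer": ...}
    call_mcp_tool(name, args) -> observation string
    Both callables are hypothetical interfaces, not a specific SDK.
    """
    messages = [
        {"role": "system", "content": f"You may use these tools: {tool_pool}"},
        {"role": "user", "content": query},
    ]
    for _ in range(max_rounds):
        step = call_llm(messages)  # the model's thought plus chosen action
        if "final_answer" in step:
            return step["final_answer"]
        observation = call_mcp_tool(step["tool"], step["args"])
        messages.append({"role": "assistant", "content": str(step)})
        messages.append({"role": "user", "content": f"Observation: {observation}"})
    return None  # ran out of iterations without a final answer
```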
Sophia Skowronski [00:29:48]: But just to give everyone a sense of what these two metrics are that are being reported. So the LLM as a judge looks at each test agent's outputs, compares them to the reference agent's outputs, and also compares the execution plans given the reference output. And there's a prompt at the back of the paper where they just ask the LLM judge to rate between 1 and 5, 5 being the best, obviously, but they are rating basically the outputs themselves and then the trajectory. And so for the results which are in the paper, they basically look at all of the runs for a given LLM test agent and look at the proportion of tasks with a perfect score and then the average of all scores. And same for these other two groups of metrics; they're mostly used in some of the additional result plots, but the trajectory score is just the same, the mean score of the trajectory per LLM. And then there's some sense of efficiency from average token consumption and average tool calls.
Sophia Skowronski [00:31:09]: So all these metrics together should capture whether an agent first solves a task and second how efficiently it does it. And so again, a big contribution here is the fact that they created a dynamic MCP benchmark, since, as they state, most prior benchmarks only compared static references and outputs. And so that's pretty much it for the method. So I'll hand it back over to Valdimar. I first must stop sharing. Yeah, there we go.
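Putting those reported metrics together, here is a small sketch of how task success rate, average result score, and the efficiency numbers could be computed from per-task judge scores on the 1-5 scale. The data layout and the choice to count only 5/5 as success are assumptions based on the discussion above, not the paper's released code.

```python
def summarize_scores(judge_scores, token_counts, tool_call_counts):
    """judge_scores: one 1-5 result score per task for a single model.
    Task success rate     = share of tasks with a perfect 5/5;
    average result score  = mean judge score;
    efficiency            = average tokens and tool calls per task."""
    n = len(judge_scores)
    return {
        "task_success_rate": sum(s == 5 for s in judge_scores) / n,
        "avg_result_score": sum(judge_scores) / n,
        "avg_tokens": sum(token_counts) / n,
        "avg_tool_calls": sum(tool_call_counts) / n,
    }

# Toy usage with made-up numbers:
print(summarize_scores([5, 3, 5, 4], [12000, 9000, 15000, 11000], [6, 4, 9, 5]))
```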
Arthur Coleman [00:31:57]: While we're resetting here, please guys, if you have questions, feel free to put them in the document. Question document.
Valdimar E. [00:32:14]: Well, okay. I was gonna try to go full screen here but option disappeared. It should be fine. I'm gonna look at the results. Do you see the main results table? Can also look at the graphs.
Arthur Coleman [00:32:37]: Can you take me off the screen?
Valdimar E. [00:32:41]: Yes, just close, close it.
Arthur Coleman [00:32:44]: Thank you.
Valdimar E. [00:32:47]: Okay, so what stands out here? We'll look at the graph in a bit. So what did they compare? Yeah, they took all the models from this summer. It's before the new Claude dropped. So GPT-5 was here. So a couple months ago: we have GPT-5, O3, the Claude models except the newest one, and then a few open source ones, there's Qwen 3 and Llama, and the small Gemini and the large Gemini. So Google, OpenAI and Anthropic. All right. They tested it on the 101 tasks, which can be easy, medium or hard, and looked at the overall task success rate and the average result score.
Valdimar E. [00:33:42]: So the top task success rate here is, what, 58%? I think it would be an even number if it were a hundred tasks. So it's the number of tasks it fully completed, got the 5 out of 5 score from the AI judge. So it varies from 1%, or one easy task fully handled by the 8-billion-parameter Llama, up to 58%. And the average score, which, yeah, I mean, I guess sometimes it would answer most of the query or question, would find you some hotel, maybe not the closest hotel to the basketball game or something, and get 4 out of 5. Then, yeah, 73 was achieved by GPT-5, and interestingly only like 41% for GPT-4o. Or, I thought it was interesting because I'm still using GPT-4o a lot. I felt GPT-5 was a disappointment, but apparently for these kinds of tasks, first of all, the thinking, the reasoning, really matters, and the size of the model.
Valdimar E. [00:35:11]: So I guess GPT-5 is very similar to O3 but considerably better at this. So it's not the failure a lot of people said it was. You can see it won in all the different categories. We have some graphs here. We plot just looking at both the task success rate and the average result score, which obviously are very correlated.
Sophia Skowronski [00:35:47]: And.
Valdimar E. [00:35:47]: It's colored; I'm not sure what the color was. But GPT-5 is clearly the best, then O3, and in the clusters we have these proprietary models, where Claude falls a bit behind. We have the older generation, which is considerably worse. So you can see there's been a jump, from like 25% to 40%, from the last generations or smaller ones to the current ones with extended thinking and post-training and whatever they do. And then the small ones, I mean, 8 billion parameters is not going to do much. You can run that; like, I'm using Qwen 3, 4 billion, running it on a local machine. It's a different thing than the supercomputer machines.
Valdimar E. [00:36:40]: Then we have how wasteful they are in terms of tokens, and actually, if you're looking at expenditure, so the cost: basically GPT-5 was using, you know, 16-17,000 tokens and it was calling a lot of tools, multiple tools. That's actually a lot, the number of tool calls. It would try some tool and it didn't really work, and try some other tools, but in the end it would, you know, solve half of the tasks correctly, while O3 is spending half as much and is considerably worse. But still, it's like from 60 to 50, and Claude is way cheaper. And I'm not sure what all these tokens are. It's just calling tools and it's just talking to itself. I think there are thinking traces where it's doing extended inner dialogue, so to say.
Valdimar E. [00:37:48]: And they looked a bit into that with an ablation study, where they tried to vary one component; here it's just the number of iterations. Probably should have kept the text here, but it's not that relevant. The top line is GPT-5, the winner. If you limit it to 15 rounds, then it wasn't that good, but when it had more attempts. So it would run, try a tool, call an MCP tool along with a self-thought or something. If you give it 20 or 25 attempts then it becomes better. And for the second place, which is O3 or something, or for the lesser ones, it helped them to have room for mistakes.
Valdimar E. [00:38:33]: But then it plateaued, and that's why in the end they used 30 for the final results, even though you don't need 30 tool calls to solve a problem; the highest according to the gold standard was 15. Sounds like a lot. And it's also worth mentioning that these gold standards are not necessarily the only way to solve the problem; there are infinite ways. And then they tried to vary the amount of noise, kind of the number of MCP servers and the number of tools they sample from. And if you add more tools, so you have like 15 servers with lots of tools about traveling or whatever, and it doesn't even matter because you're doing a programming task, then the best one, GPT-5, was quite robust, and those tiny ones got confused. That's what this one was about.
Valdimar E. [00:39:29]: And an important result was the error review here, where they made this table, heat map or whatever you call it, where they categorized the different types of errors. I can briefly talk about them. Semantic errors and syntactic errors were quite prominent. Those are about tool calls. You could have a syntax error when you make a tool call: you're going to find the hotels on the 5th of October, but it's a malformed date, or it's a string instead of a datetime parameter. That's the syntax error. And a semantic error is when the input to the tool, the parameter used to call the tool, is just wrong: it has the right form, but you're saying it's October last year or something instead of this year.
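To make the two error types concrete, here are two hypothetical bad calls against an invented hotel-search tool: the first is syntactically malformed (the date doesn't match the expected format), the second is well formed but semantically wrong (the wrong year for the user's intent).

```python
# Hypothetical tool: search_hotels(city: str, check_in: "YYYY-MM-DD")

# Syntactic error: the argument does not match the expected type/format,
# so the call fails validation before the tool even runs.
bad_syntax_call = {
    "tool": "search_hotels",
    "arguments": {"city": "Toronto", "check_in": "5th of October"},  # not ISO format
}

# Semantic error: the call is well formed and passes validation, but the
# value is wrong for the user's intent (last year instead of this year),
# so the tool returns a useless or misleading result.
bad_semantics_call = {
    "tool": "search_hotels",
    "arguments": {"city": "Toronto", "check_in": "2024-10-05"},  # wrong year
}
```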
Valdimar E. [00:40:29]: And all the models tended to fall into that, but the bigger the model, the less prone it is to that failure. Well, for the syntactic ones, the tiny model, the 8-billion-parameter one, just wasn't trained on tool calling. It's an older one, so it just failed, putting integers where it's supposed to be strings or something in the tool calling. Overconfident self-solving means when it just answers without calling a tool. So I've made a lot of these agents for companies; it's one of the basic problems, to remind them all the time to look it up and not say, like, yes, you can have a discount. Just look at the policy, call the tool. But GPT-5 is much better at that than, I don't know, Claude 4 Sonnet, for example.
Valdimar E. [00:41:24]: Then there were output parsing errors. I didn't fully get that; it was something to do with the JSON. So it came from the tools. We don't need to dive into the details. Tool selection is always a problem, especially if you have these big sets of tools. So yeah, that's what I would think the problems would be: the wrong tool selection, the semantic errors and the overconfident self-solving.
Valdimar E. [00:42:06]: I'm a bit curious about how high these numbers are. I guess the documentation for the tool isn't good enough, because it should provide you with a list of available options for the parameter. Maybe it was missing; it was like an API and GPT was just hallucinating what could go into the API, when actually there are only like 10 different categories you can pass into the hotel booking search, like two bedroom, three bedroom or whatever. But yeah, overall GPT-5 is on top. I think Claude 4.5 could be there also. There's a big difference between the different types, or a kind of big difference between the older and newer ones.
Valdimar E. [00:42:57]: I don't know what else to infer from this, if there are any useful comments. Unproductive thinking remains a problem. O3, I remember when O3 came out, I asked it, I kind of didn't know how to use it, so I asked it to give me the bash command to compress a folder into an archive, or something I needed to do at the moment, and it would just talk to itself and think about it for a minute instead of telling me the command. So I guess that's a problem with the thinking models, but overall the thinking models are much better than the old ones, and I'm gonna upgrade some things to GPT-5 after this. Finally, they evaluated how good the LLM as a judge was, that is, human-LLM agreement. So they were wondering, does the LLM as a judge really work? You make a person evaluate how well the task was solved, on the one-to-five scale or something, and then check the correlation using this kappa here. Okay, and overall it's pretty good, and regardless of which model was chosen, the humans tend to agree with it, kind of. However, this whole thing of having AI evaluate AI is kind of flawed. So that's one thing. I recall having read a paper a while back, when GPT-4 came out, but I think the same thing probably still holds, which demonstrated that GPT-4 exhibits a significant degree of self-preference bias. So using GPT-4, or any model, as a judge may lead to excessive influence from GPT-4's unique styles and policies.
Valdimar E. [00:44:52]: So the benchmark was generated by ChatGPT, you know, and then we're using ChatGPT to evaluate which answer was best. And here's an example from that old paper: describe your current outlook on the financial markets and the US economy, and GPT-4 gives just some basic, not very useful answer: I don't have real-time data or the ability to provide the current outlook on the financial... blah, blah, blah. While some simpler model gave an answer that the human preferred: currently the US economy is in a period of expansion, or at least it was, you know, a few years ago, and the financial markets have been performing well. But GPT-4 just likes its own stuff. So it's one of the drawbacks, or you might say limitations, of this paper. And we're done with the results. I had some comments I can maybe start with, and then, Sophia, you have some comments, and we'll open up a little dialogue.
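The human-LLM agreement check mentioned above is typically reported with Cohen's kappa. Here is a minimal sketch of that statistic computed between human and LLM-judge labels; collapsing the 1-5 scores to pass/fail labels is a simplification for illustration, and the paper's exact protocol may differ.

```python
from collections import Counter

def cohens_kappa(human_labels, llm_labels):
    """Cohen's kappa between two raters over the same items:
    kappa = (p_observed - p_expected) / (1 - p_expected)."""
    n = len(human_labels)
    observed = sum(h == m for h, m in zip(human_labels, llm_labels)) / n
    h_counts, m_counts = Counter(human_labels), Counter(llm_labels)
    expected = sum((h_counts[c] / n) * (m_counts[c] / n)
                   for c in set(h_counts) | set(m_counts))
    return (observed - expected) / (1 - expected)

# Toy example with made-up pass/fail judgments:
print(cohens_kappa(["pass", "fail", "pass", "pass"],
                   ["pass", "fail", "fail", "pass"]))  # 0.5
```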
Valdimar E. [00:45:58]: Before we finish, why don't we take some questions? Open up questions. Yes, sir.
Arthur Coleman [00:46:03]: From the floor first. I, I think your comments will go into. You're still sharing your screen.
Valdimar E. [00:46:13]: I can maybe open the doc or do you want to go. Yeah, you, you can share and look at the questions.
Arthur Coleman [00:46:18]: Well, it's just, I'm looking at me. I just don't like looking at myself on screen that way. All right, I'm going to start with Samantha, because, Samantha, you've obviously been doing some work in the field, watching your comments, and why not, first of all, tell us what you're doing. And by the way, everybody, we're in Q&A; if you want to show us, I know some people are in their bathrobes probably, but if you want to go online and show us, you know, who you are in the video, turn it on. That's always appreciated. So we're talking to each other, not just to pictures.
Arthur Coleman [00:46:53]: Samantha, tell us what you're doing, how many tools you're using in your implementation and what you found and where your questions are coming from. Because you have some concerns.
Samantha Zetlin [00:47:06]: Yeah, I mean, I think I've told you all about the projects that I've worked on in the past, but we did quite a bit of comparing across models on a few different kinds of tasks, but mostly things where we were asking the models to look at log data or to read like a SQL query or a human question and translate it into a SQL query. And we definitely saw some things that are consistent with what they're reporting in this paper. So that was kind of validating. But some of this stuff, like tool selection, I'm kind of wondering if it would be better to just have a lookup table or something. If the models are that bad at it, I would expect they should be able to get that right 99% of the time. And it seems like they're not. And I definitely have some concerns about using LLMs as a judge with only a five point scoring system because when we did that, we were finding it was very difficult to get a good enough spread to really evaluate. And especially since they didn't do enough replicas on the same questions, I would be pretty suspicious of their statistics.
Valdimar E. [00:48:18]: So.
Samantha Zetlin [00:48:18]: Yeah, so, I mean, I think it's really interesting, but I do, I have some questions about the numbers. I think the general thrust of what they're trying to do makes a lot of sense, but some of the implementation I might, I want to redo if I was going to try and draw conclusions from this.
Arthur Coleman [00:48:38]: Let me follow up on that. I don't mean to be the person asking the questions, but I do have a question specifically related to this from the Pytorch conference that I was at yesterday, which if anyone, we had free tickets and if you went, thank you for making use of them. I'm concerned that they only ran the task once, Samantha.
Samantha Zetlin [00:48:57]: Yes, because that's a big one for me.
Valdimar E. [00:49:00]: Yeah.
Arthur Coleman [00:49:01]: Because there are some studies, I actually have one on the screen, but I'm not going to share it, that show that in terms of reproducibility, model reproducibility on answers is very low; they vary over a wide range. And so how do you really know that the run represents the actual average of how the machine would perform? That's my concern. Do you agree? And Valdimar, Sophia, do you agree with that?
Samantha Zetlin [00:49:26]: That's absolutely been my experience and I've even had some kind of ridiculous conversations with people who said that they thought models should be deterministic or something. And I was like, that's not how these models work. But yeah, that's why I'm saying if they could run the same question at least three or four times, then you would get a better sense for are these really valuable results. But I could also see an argument for your typical user might not know that they should try it again if it doesn't work. And so you could claim if they only did it once, this would be their impression. But I think that's again, just a lack of understanding of statistical thinking and that whether it works on any one try is luck, basically.
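Samantha's point about replicas is straightforward to act on when you control the harness; here is a minimal sketch of running each task several times and reporting the mean and spread of the judge score. The `run_task` function is a hypothetical stand-in for one benchmark run.

```python
import statistics

def score_with_replicas(run_task, task, n_runs=4):
    """Run the same task several times against a nondeterministic agent and
    report the spread, instead of trusting a single run.

    run_task(task) -> judge score in [1, 5]  (hypothetical interface)
    """
    scores = [run_task(task) for _ in range(n_runs)]
    return {
        "mean": statistics.mean(scores),
        "stdev": statistics.stdev(scores) if n_runs > 1 else 0.0,
        "scores": scores,
    }
```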
Arthur Coleman [00:50:12]: Well, and let me, let me add something to that from yesterday also. The second paper that I was at, at the conference, which is one of the things that really caught my attention is in the. When they did the runs over, they, they had the same data but formatted in different ways, just in the.
Samantha Zetlin [00:50:29]: And that can make a huge difference. Yeah, exactly. Well, and that's why prompt engineering is an art. It's not just. Yeah, I feel like A lot of people don't understand the value of prompt engineering and why it matters. Anyway, I will be quiet now, but thank you. This has been very interesting.
Arthur Coleman [00:50:46]: Valdimar, Sophia, any comments on any of this? I see you nodding your heads yes.
Sophia Skowronski [00:50:51]: Yeah, go for it, Valdimar.
Valdimar E. [00:50:54]: Yeah, just, often researchers would run it, you know, five times and get a standard deviation or something, like error bars or something. And then sometimes something that seems to be better is just luck. And one thing that you just mentioned is the paraphrasing, the sensitivity to slight changes in the input; that's a big thing for neural networks in general. And for these queries, they were all phrased peculiarly, and if you would just reword one of them, then maybe the less smart models would be able to solve the task. So it would be cool to have different queries that should result in the same output, because people write in different ways and whatnot, and to test it a few times. But yeah, sure, it costs more, I guess.
Sophia Skowronski [00:51:50]: Obviously it seems like this is the first, it reads like the first draft of the paper, sort of, because you would imagine the longer version we found had a lot more detail. And so I'm guessing that if you were a reviewer you would ask them to use an ensemble LLM judge. You would ask the LLMs to not judge queries or executions written by the same family of LLM models; like, you wouldn't want GPT judging other GPT models. Basically, like everyone else is kind of reading into this paper, it seems like they lack some good comparisons in their experimental design that would have made it a bit more robust and, like, a heck yes, this is a great benchmark. But it seems like, for the most part, it's a one-time run-through, just using GPT to generate queries. And I think someone made a comment about the human evaluators; they didn't even go into the detail.
Sophia Skowronski [00:52:55]: They said it was a blinded human comparison test, but they didn't really go into the detail of like what part of the evaluation was blind. So yeah, it just seems like this is a good first draft of the paper.
Valdimar E. [00:53:10]: Yeah, I think just one thing, something that didn't happen before and we have now, is evaluating everything with an AI. Like, if you're publishing a benchmark, maybe you should spend those 120 hours to just construct the benchmark yourself. Or maybe it would take more hours, but it seems like they weren't saving that much time on making the queries if they had to spend a lot of time reviewing them, and it's just like vibe-code science, you know, so it's a bit disconcerting. What do you say?
Arthur Coleman [00:53:52]: Yeah. All right, let's move on to the next question. Temese, if you are. There you are. Temese Szalai, if I said it correctly. Temese, if you can show your face, that's great. You have a question that was about errors. I think we want to ask it.
Arthur Coleman [00:54:08]: It's a very good question. Could you. Could you phrase it in the. In a broader context that you meant it? Because it's a very short comment you made in the. The chat.
Temese Szalai [00:54:19]: Yeah, sure. Arthur, let me turn my video on. Thank you. So, yeah, they provided that table where they categorized the errors by type, which is certainly helpful to understand where things are going wrong. But I'm curious, I mean, I have a lot of background in classification and usually when a machine AI gets something wrong, you want to look at the things that it got wrong and what are the patterns in those things. Is there some similarity in the hard questions? Was there something about the structure of the hard questions, the topics of the hard questions, the train of thought of the hard questions, or anything like that? I have no idea. And I'm sorry I didn't read the paper beforehand, but I will read it now. I have no idea if they even included all of the queries in the paper, but I would be really curious to see that because I think that would overlaid on their table of error by type, I think could be very informative about where these LLMs go wrong and what kinds of things they're good at and what kinds of things they need to be better at.
Arthur Coleman [00:55:47]: Any comments from anyone in the community on that? I thought that was a really good insight. All right, we have two minutes. I need 30 seconds at the end to talk about something that's related that we're going to do as another paper. But Brahma, you had a question, and I think it's a great question. Could you ask it, please? Or if you're. Are you still on?
Valdimar E. [00:56:12]: Yeah. Hi.
Arthur Coleman [00:56:13]: Hi, Brahma.
Brahma Sen [00:56:15]: So you asked me about the 85% success rate.
Arthur Coleman [00:56:19]: Yeah, I think that's a very critical question. Could you. If you want to show your face, that'd be great. You don't have to. You don't have to.
Brahma Sen [00:56:26]: No, it's all right. Yeah, I turned my video on. No, all I wanted to ask, you know, all the success rate they're showing, and particularly for other LLMs, the success rate is so low. So it seems that we can't even use it for anything. It's just, I mean, for any practical purpose, we can't really use anything.
Sophia Skowronski [00:56:48]: That.
Brahma Sen [00:56:49]: That's my challenge right now. And also I made a comment: since they use GPT-4.1 as the judge, LLM as a judge, that might have, in my view, skewed the results in favor of GPT-5 being top. You know, I'm sure that definitely played a role, because if you use an LLM as a judge, that definitely favors their GPT-5 coming out on top.
Arthur Coleman [00:57:19]: One minute, guys. Just so our. Our speakers.
Brahma Sen [00:57:21]: All I want to say.
Valdimar E. [00:57:24]: Yeah, one thing about this is that it was a challenging query and it's a single attempt. It's more like you make a task for the AI and it's supposed to solve it by itself. But maybe if you had a chat with it and it would ask you questions back, the success rate would be higher. So I think chatbots can help you figure things out, like in dialogue, much better than these autonomous systems. So obviously we're far from autonomous agents, regardless of what people are preaching. Sophia, any comment?
Arthur Coleman [00:58:04]: Okay, I'm gonna share my screen for a minute because it relates to this, and we're going to say good night to everybody, and good morning. I mentioned this paper. We just talked about verification by humans versus machines. This is a different approach to validation and verification. Instead of having one large group of people do a series of answers and tests against the machine, this uses a number of small groups and then combines what they call weak verifiers into a single larger unit. Jon Saad-Falcon spoke on it. I've invited him to come speak and I haven't gotten any pushback from the organizers yet that we should do this.
Arthur Coleman [00:58:46]: So I'm going to be. He's agreed to speak on this subject. I think it's a really good topic. It relates to the question of how they measured the performance of, of the human input, not just the machine input. And so that's why I wanted to mention it. And I'll let you know. We'll let you know when that's going to happen. But that was just a follow on.
Arthur Coleman [00:59:04]: If you want to look at this, it's a. There's a product or software that they've built called Weaver that does this. So if you're interested, take a look at the paper and we'll catch up on a session on it when we have a moment. And with that I'm going to wish. I'm going to stop the recording and wish everyone.

