Smart Agents Start with Smart LLM Choices // Shai Rubin // Agents in Production 2025
SPEAKER

I’m a tech builder with a Ph.D. in computer science and over 25 years of experience turning ideas into real products. I’ve led teams and launched innovations at IBM, Microsoft, Citibank, and OwnBackup (acquired by Salesforce)—where I built a product from scratch that now generates millions in revenue. I’m passionate about combining deep tech with real customer impact. Today, I lead Studel AI, where we help support teams close highly technical tickets—fast.
SUMMARY
Everyone is building AI agents, but at their core is the LLM—and choosing the right one is critical. With new models launching every week, each promising game-changing productivity, how do we make informed, data-driven choices? In this talk, I’ll focus on LLM selection for a critical agent skill: code understanding. I’ll present a study applying 15 leading LLMs to real-world code summarization tasks, using practical, agent-relevant metrics like verbosity, latency, cost, human-aligned accuracy, and information gain. We’ll explore how these models actually perform in practice, beyond benchmarks and hype, and what that means for building effective, capable agents. Whether you’re building autonomous coding assistants, dev-focused copilots, or multi-modal agent systems, choosing the right LLM isn’t optional—it’s the foundation. This talk aims to cut through the noise and offer actionable insights to help you select the best model for your agent’s real-world success.
TRANSCRIPT
Shai Rubin [00:00:00]: Hello everyone. What you see here is essentially a recent MRI of my brain, and I assume it looks familiar to the audience too. It is a crossroads of LLM names, and what we have all been doing for the last two years, maybe, is trying to figure out which LLM to choose, while this crossroads keeps changing as models are added and removed. That is what we are going to talk about: how to bring a little more order to this area and how to choose an LLM for your agent. My name is Shai. I have more than 20 years of experience in software development across various areas; I have worked at big companies and small companies. I have a PhD in computer science from the University of Wisconsin-Madison, and currently I am a co-founder of Strudel, which is building an AI tool to bridge the gap between support teams and engineering teams.
Shai Rubin [00:01:19]: And most importantly, I'm passionate about software engineering. So what are we going to do today? Well, let's start with a big question: what makes a model the best? Maybe it is a big question, but it is not a very good one, because there is a better, more precise question: best for what? What are we trying to achieve with this model? Today's task is code understanding: what would be the best model for understanding code? I'm going to present an approach that works for me and that I believe can work for other tasks besides code understanding too. So let's start. The first step is to define exactly what you are trying to do. What exactly are you trying to achieve, and what do you mean by code understanding? Luckily, we don't need to think too hard these days. I asked ChatGPT what code understanding is, and this is its answer: accurately interpret the structure, behavior and intent of source code in order to reason about what it does and why.
Shai Rubin [00:02:33]: This seems pretty reasonable. If you ask someone in software engineering, you will probably get something close to that. So this is good. And just to be sure, I also checked with other experts, Gemini and Claude, asking the same question, and I got very similar answers. I used coloring here to show how similar they actually are in terms of concepts and words. More importantly, if you read those three answers, you will feel that, to a human reader, they are identical; for us they mean the same thing, or almost the same thing. And this raises a question: if you are going to use a more expensive model to answer your question, what exactly do you get back? Do you really need the more expensive model if what you get back is very similar across models? This is the question I'm trying to answer today.
Shai Rubin [00:03:47]: And the first step is to understand that "best" is defined by what you need, not by a benchmark: what is good for me, or good for you. The LLM response in this case is "explain this Python file", and there are several metrics that are important to me. First of all, the answer should be concise; otherwise, I will just read the code myself. I'm looking for low latency because I have a human in the loop, so if that matters to you too, you need to consider it as a metric. Of course, we want it to be accurate; we already saw that accuracy is kind of a fuzzy concept here, and we'll talk about that later.
Shai Rubin [00:04:31]: And of course, we want it to be cost effective: as cheap as possible without sacrificing quality. Okay, once you have defined your metrics, you can implement an experiment that compares models on those metrics. This is the setting for my experiment. Step one: I took 17 LLMs; we will see the list in a moment. I chose 60 random files from the PyTorch repository on GitHub. Then I combined them into pairs.
Shai Rubin [00:05:12]: Every model is paired with every file, for a total of 1,020 pairs. I take each model-file pair and run the model on the file with a prompt that is identical across all 1,020 pairs. The output is every file summarized by every model, which lets us compare what we need to compare. Cool. So, just to show you: on the left side you see the models I used. Some are very common, some are less common; just a list I chose. On the right you see the prompt I used, written, maybe, according to prompt-engineering best practices. I don't know.
Shai Rubin [00:06:03]: It took me 20 seconds to define it. For me, the most important part of the prompt is the "up to three sentences" emphasis, which conveys my intent: I'm looking for short answers that capture the essence of what the file, the Python code, is doing. Okay, so these are the models and this is the prompt. Here is an example of an output. This is a Llama model running on the file torch/_functorch/aot_autograd.py. It is one pair, and this is the summary for that specific model and file. You can see the file summary as well as other metadata related to running the model on the file with the prompt.
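A minimal sketch of what such a driver loop could look like, assuming an OpenAI-style client as a stand-in for whichever provider you use. The model list, the local PyTorch path, and the exact prompt wording are placeholders; only the "up to three sentences" instruction is quoted in the talk, and the speaker's actual open-source code is linked at the end.

```python
import random
import time
from pathlib import Path

from openai import OpenAI  # stand-in client; the real study spanned several providers

client = OpenAI()

# Assumptions for illustration: the talk lists 17 models, only a few are named here,
# and only the "up to three sentences" part of the prompt is stated explicitly.
MODELS = ["gpt-4o", "gpt-4o-mini", "gpt-3.5-turbo"]
PROMPT = "Explain what this Python file does in up to three sentences."

def summarize(model: str, code: str) -> dict:
    """Run one (model, file) pair and keep the metadata used later for evaluation."""
    start = time.time()
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": f"{PROMPT}\n\n{code}"}],
    )
    return {
        "model": model,
        "summary": resp.choices[0].message.content,
        "latency_s": time.time() - start,
        "input_tokens": resp.usage.prompt_tokens,
        "output_tokens": resp.usage.completion_tokens,
    }

# 60 random files from a local clone of the PyTorch repository (path is hypothetical).
files = random.sample(list(Path("pytorch").rglob("*.py")), 60)

results = []
for path in files:
    code = path.read_text()
    for model in MODELS:  # every model sees every file: 17 x 60 = 1,020 pairs in the talk
        record = summarize(model, code)
        record["file"] = str(path)
        results.append(record)
```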
Shai Rubin [00:06:56]: We will use this metadata to do our evaluation. Let's start with the analysis. Well, the first thing to analyze is cost, which is relatively easy. What you see here is the cost chart: the models are on the x-axis and the cost in dollars is on the y-axis, where cost covers both input and output tokens. What you can obviously see is that there is one model so expensive it is hard to imagine: 377 times more expensive than the cheapest model, which came in at about 0.1 cents. This again raises the question: if we pay so much, what exactly do we get back? And even if I take GPT-4.5 out of the equation, you can still see a lot of variation among the remaining models. If we treat LLMs as developers who help us do our job, you can see that the "salaries" of these LLMs vary, sometimes by a very big factor, which is not common in the high-tech industry, at least.
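Cost per summary is just the token counts multiplied by the provider's price sheet. A minimal sketch over the hypothetical records collected above; the per-million-token prices are illustrative placeholders, not current list prices.

```python
# Illustrative $-per-million-token prices (placeholders; check your provider's price sheet).
PRICE = {
    "gpt-4o":        {"in": 2.50, "out": 10.00},
    "gpt-4o-mini":   {"in": 0.15, "out": 0.60},
    "gpt-3.5-turbo": {"in": 0.50, "out": 1.50},
}

def cost_usd(record: dict) -> float:
    """Cost of one summary, from the token counts captured in the experiment metadata."""
    p = PRICE[record["model"]]
    return (record["input_tokens"] * p["in"] + record["output_tokens"] * p["out"]) / 1e6

for model in PRICE:
    runs = [r for r in results if r["model"] == model]
    avg = sum(cost_usd(r) for r in runs) / len(runs)
    print(f"{model}: ${avg:.4f} average per summary")
```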
Shai Rubin [00:08:21]: So it is a wide price swing, and as engineers we need to insist on matching value. Okay, so that is the first metric, cost. The second metric is conciseness. If you look at the academic work on what makes a good English sentence, there is a consensus that a sentence should be about 20 words or fewer. And you can see that most of the models have apparently read this academic guidance and produce answers that average about 60 words, which is about three sentences. This is great; this is what we asked for.
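Verbosity is the cheapest metric to compute: count the words per answer and average per model. A minimal sketch over the same hypothetical records; the 100-word flag threshold is an arbitrary choice for illustration.

```python
from collections import defaultdict

# Average answer length per model, in words; ~60 words is roughly the requested three sentences.
words_per_model = defaultdict(list)
for r in results:
    words_per_model[r["model"]].append(len(r["summary"].split()))

for model, counts in sorted(words_per_model.items()):
    avg_words = sum(counts) / len(counts)
    flag = "  <-- ignores the length instruction" if avg_words > 100 else ""
    print(f"{model}: {avg_words:.0f} words on average{flag}")
```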
Shai Rubin [00:09:10]: However, there are two models that absolutely ignore the instruction and produce answers that average 132 or 164 words, which is kind of annoying to read. So another principle I suggest adopting is "follow the prompt or hit the road". It is my way or the highway: if a model does not follow instructions, it is very difficult to use in software. Now let's come to latency. I want file summaries back relatively quickly because, in my scenario, I have a human in the loop, so I measured the average latency to get a file summarized.
Shai Rubin [00:10:05]: I defined three categories. Below the yellow line, which is about three seconds of latency, are the models I consider good for interactive use. Between the yellow line and the red line are models averaging between three and six seconds, which may be okay depending on how patient your audience is and what they are doing. But above six seconds, I would claim a model is not suitable for an interactive human in the loop; we should use those models for batch processing, not for interactive use, at least not when the task is code understanding. Now, once you have more than one metric, you can dig into the group of models that may suit you. In this example, if I care about latency and cost, these are the models that are suitable: less than about three seconds, for interactive use. The cost again varies; the most expensive one here is GPT-4o, the others are less expensive, and you can see the range.
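The latency cut-offs from the talk (under about three seconds for interactive use, three to six seconds borderline, above six seconds batch-only) translate directly into a filter. A minimal sketch, reusing the hypothetical `results` records and `cost_usd` helper from the earlier snippets.

```python
from collections import defaultdict
from statistics import mean

def latency_bucket(latency_s: float) -> str:
    """Latency categories from the talk: interactive, borderline, batch-only."""
    if latency_s < 3:
        return "interactive"
    if latency_s <= 6:
        return "borderline"
    return "batch-only"

by_model = defaultdict(list)
for r in results:
    by_model[r["model"]].append(r)

# Keep only models fast enough for a human in the loop, then rank them by cost.
candidates = []
for model, runs in by_model.items():
    avg_latency = mean(r["latency_s"] for r in runs)
    if latency_bucket(avg_latency) == "interactive":
        avg_cost = mean(cost_usd(r) for r in runs)
        candidates.append((model, avg_latency, avg_cost))

for model, lat, cost in sorted(candidates, key=lambda c: c[2]):
    print(f"{model}: {lat:.1f} s average latency, ${cost:.4f} per summary")
```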
Shai Rubin [00:11:39]: And now you have a reasonable place to start comparing the models. Now let's talk about quality. We measured conciseness, we measured latency, we measured cost, we removed the non-obedient models, and we compared latency against cost. Now we need to talk about accuracy. Well, we saw that this is difficult. Why? Because even in a simple code-understanding example, no answer was fundamentally more accurate than another. This led me to define what I call differential accuracy, which essentially says: look, if the outputs are similar, then accuracy doesn't really matter, right? Because everyone is saying the same thing, so just pick one.
Shai Rubin [00:12:34]: And I wanted to verify, or at least gain more confidence, that this concept and this method of evaluating models is valid. To do that, we need to compare similarity between models, and I will try to explain how. Here it's relatively simple: you have four pairs of pictures, let's call them 1, 2, 3, 4, and I ask you, as humans, how similar they are. You are probably going to give me answers like this. Pair number one: they are identical.
Shai Rubin [00:13:11]: It's the same picture. So they are identical. And let's give it the value of one similarity. Pair number two, they are similar. They are still similar. It's the same thing, same color, same symbol. So great, they are highly similar. Pair number three are less is less similar than pair number two, so somewhat similar.
Shai Rubin [00:13:31]: And pair number four, even though we put both of them on the head, they are barely similar. Beside that these two actually does not similar at all. So as humans it was really relatively easy to compare similarities, right? It was obvious. However, if you want to do that in Software and in computer system, this is not also very difficult. We have the technology already and it is not very, very new. It's called embedding and similarity function. So what you do, you embed or create vectors that represent each picture. And you can do that for test too.
Shai Rubin [00:14:16]: And when you have those vectors, you can compare them, something that is called similarity function. And there are many of them. And you get a number how much these two texts are similar to each other. So let's do that. Before we do that. Just let's get a sense of similarity based on this, based on embedding. Let's create some kind of a baseline, a benchmark to similarity. We are going to do the following experiment.
Shai Rubin [00:14:50]: We will take 17 different file on 70 different models and we will produce embedding and compare all the results to them. What do we expect if the files are different and the models are different? What do we expect if we compare two files that are different? We expect that similarity will be low, right? If one file is doing matrix multiplication and the other is doing bank transaction, then they need to be, besides the fact that they are files, their functionality is fundamentally different. And we expect the similarity metric similarity function to give us a low number close to zero. So I did this experiment and you can see a similarity matrix of different files. You can see that it is mostly blue, which means closer to zero. Remember, these are different files on different models. So very, very low level of commonality across the files. And the result and the visualization is actually because it illustrates to us that if you compare different things, you actually get low similarity, which is great.
Shai Rubin [00:16:03]: Now let's look at a result of the experiment. On the right side you have these random files that were chosen and compared and you can see the low similarity. On the right side you can see all models evaluating one file, which is rng prims py and you can see the similarities there. And the visually the difference is clear. The left matrix is much more similar than the right, much, much more similar than the right one. And the average similarity in this matrix is 0.85. So this is great again, because it's validate our assumption that we will get similar result for similar models. Now, there is one model that is unique here, and this is a Nova Pro V1.
Shai Rubin [00:17:00]: And you can see visually that it is less similar to the other models, which is interesting. And again, as engineers, if you want to understand the difference, you can dig in one level more, maybe even to human evaluation in order to understand how this model behaves differently. Okay, so I Didn't pick just one file and show you how similar it is. Here are samples from the 60 files that I have analyzed and the range of similarity is between 0.9 on the left upper side of the matrix to 0.72, which is the most not similar results that I got for the 65 on the left bottom. So what we learn is that models on the same files gives relatively similar results, which means that it's not clear that we need to actually use the, let's say the most expensive model or the fastest model, but actually look at the results that we get from similarity metrics to choose a model and let's try to do that in a small scale, a smaller scale, and let's evaluate the ChatGPT family on the left. You see the ChatGPT family on a single file and you can see that indeed the models, the four models are pretty close to each other. They are much more rare. ChatGPT 3.5 is a little bit, let's say behind and it is not always as similar to another model as other models to this model, right? You see that it is more orange than red.
Shai Rubin [00:18:53]: Okay. Another thing to Note that for GPT4O, GPT4O is actually close as much as to all the other models, more than any other model that is close to other models, which is kind of expressing average of result. And maybe it is a good choice in terms of what it says in terms of the quality, because it takes everything that is important. Let's assume from one from all other models on the right. Now the GPT4O, if you remember the cost is 14 cents per summary, which is still relatively expensive. If you ask yourself, okay, maybe I will use GPT4O at production, but in development, when I evaluate that, can I save some money? Is there another model that is also good enough or at least compared to what JGPT 4.40 says? The answer is yes. Nova Pro V1 is close to GPT4O and therefore to all other in a relatively nice way. And it is cost, if I remember correctly, 8 cents.
Shai Rubin [00:20:10]: So half the price or 7 cent half the price of GPT4.0. So I think that I will conclude here, don't be part of the crossword or maze or the mice race in this world. Define your goals, define your metrics and define a workflow that helps you identify what is important to you, what is less important to you and what is the best ROI that you can get from the various model. You will save time and a lot of money. All the work that I presented here is in GitHub. It is an open source. The code to run the experiment to the code to generate the charts and the code results themselves. So if you want to contribute, great, go ahead and do that and you can contact me on LinkedIn and we will work together.
Shai Rubin [00:21:12]: Thank you very much and I would be happy to answer questions.
Adam Becker [00:21:17]: Shai, thank you very much. That was brilliant. We do have at least one question here from the audience. Ricardo is asking: Shai, the conclusion we can take from this is that there are models with almost identical answers but extremely high cost variation. Is this valid only for code summarization, for summarization in general, or can we say this for every kind of answer in general?
Shai Rubin [00:21:43]: I don't know. The only thing I know about is code understanding, and this is what the results show. I suggest doing the same for other goals, like, I don't know, summarization of newspaper articles on the same subject, and seeing what you get if that is what you are trying to do. But my feeling is that for simple questions, like explaining a single thing, all or most models will say the same thing. And this is also somewhat reasonable, because they are trained on the same data, right?
Shai Rubin [00:22:31]: The amount of data in the world is actually finite, and, you know, they put a lot of effort into learning everything, so they know the same things.
Adam Becker [00:22:41]: Folks, want me to put up the GitHub link from the last slide one more time? Let me put it up; if you want, move back one slide maybe. Yeah, everybody grab it. Yeah, this one.
Shai Rubin [00:23:01]: Yeah. Do I have access to the chat, or do you?
Adam Becker [00:23:05]: It's just on the other link. It's in the actual... oh, I'm going to send this to you in our private chat. Please stick around there for folks who have more questions. Ricardo says: loved it, what an insight, great presentation.
Adam Becker [00:23:22]: Shai, I just want to double-click on a concept you shared that I thought was novel and interesting, and probably not very obvious until you stare at it and squint. First, a lot of people ask me: how should I plug AI into my business? What should it do? I always say: what's the actual problem that you have? You need to understand your customers.
Shai Rubin [00:23:48]: Exactly.
Adam Becker [00:23:48]: It's an invitation to better reflect on the value: forget about AI, what are you trying to deliver to the world? And I feel like what you've done is put that into a framework and said: those things that we're trying to deliver, the things that actually matter, let's define them as dimensions and then evaluate against those dimensions. And not just against those dimensions; what we should do is look at the differential of how one model compares against another. If you're already going to be using a model, well, then that's a standard, so how good do things get compared to that standard? And I feel like that's what you're zooming into: that difference. And given that difference, now you can play all sorts of interesting games, like embeddings and similarities, and continue to explore that difference.
Adam Becker [00:24:33]: And I think that's very interesting. So for people who came in late: I think that was the punchline, and it is a punchline worth repeating. Shai, thank you very much.
Shai Rubin [00:24:43]: Thank you. That was a great summary of the talk; at least you understood it, and I'm very happy. Thank you, everyone. You can reach me and find me on the web, and we can continue talking. Thank you very much.
Adam Becker [00:24:53]: Awesome. Thanks, Shai.
