Evals Aren't Useful? Really?
speakers


At the moment Demetrios is immersing himself in Machine Learning by interviewing experts from around the world in the weekly MLOps.community meetups. Demetrios is constantly learning and engaging in new activities to get uncomfortable and learn from his mistakes. He tries to bring creativity into every aspect of his life, whether that be analyzing the best paths forward, overcoming obstacles, or building lego houses with his daughter.
SUMMARY
Your AI agent isn’t failing because it’s dumb—it’s failing because you refuse to test it. Chiara Caratelli cuts through the hype to show why evaluations—not bigger models or fancier prompts—decide whether agents succeed in the real world. If you’re not stress-testing, simulating, and iterating on failures, you’re not building AI—you’re shipping experiments disguised as products.
TRANSCRIPT
AI Conversations Powered by Prosus Group
Chiara Caratelli [00:00:00]: Why do you even need evals in the first place? Right. We started to see where things were failing and started to build evals for that specifically. You know, it's easy to say, let's just test the agent end to end. What does end to end mean? In the times of machine learning, we were training a new model and you roll it out, you know, now it's coming to agents because agents are being productionized.
Demetrios Brinkmann [00:00:27]: How are you evaluating? How has that changed over?
Chiara Caratelli [00:00:32]: We learned a lot, obviously in the past years at Process, we've been building many agents for many applications. And it was a lesson. We quickly learned that you need to have a way to understand how your agent is doing. Now it's very easy to wipe code something and get it in production, but that doesn't scale. If you want to release this to millions of users, you will find that many times they break. And this is also the reason why people don't trust any agents 100%. But we're here to change that. The way we approach it is also from kind of scientific approach because I'm a data scientist myself by background and like it's easy to use models with an API, just run a prompt and get a response.
Chiara Caratelli [00:01:18]: But actually what you write in this prompt can change a lot of the results. And the way I see it is much more like a data science kind of task where when you write a prompt, you're kind of training a model because you're determining the output of the agent. Right. So it's very important to have good evaluation sets. And at the beginning you don't have really production data, so you need to kind of bootstrap this whole thing. And it was very important to have a very created set of examples. For instance, this could come from product manager insights discussion in the team, or maybe this is a specific use case that you want the agent to work on. And there we basically built all these examples.
Chiara Caratelli [00:02:12]: So at the beginning to bootstrap this agent, we created a curated set of test cases. We approach it like a software engineering project. In this case, we wanted to have some flows that we were sure that were going to work. Those flows you can use in CI CD pipelines and everything more like you would do for a software development project. And there you make sure the main use cases are covered, but you also might want to see all the parts that need testing and make sure that nothing breaks along the pipeline. That was the first level that we did. So understand what goes wrong, where fix it like iterate fast. And then we started to test the system ourselves like on a wider team within the company and getting feedback from users was very specific feedback because we had internal users that were helping us with this and with that.
Chiara Caratelli [00:03:10]: We started to see where things were failing and started to build evolves for that. Specifically something we did was also building evals for multi turn conversations, like for guardrails. This is obvious use case because when you have single turn conversations it's very hard to see if the guardrails can be broken or not. Sometimes you need a user who's very persistent. Like I'm not talking really bad stuff because that is usually covered if you're using foundation models. But there are things like platform circumvention that is a bit more sneaky, like trying to users trying to get information that is not supposed to be public. You know, there is a lot of.
Demetrios Brinkmann [00:03:55]: Things that we want or users trying to get discounts. I imagine that happens a lot.
Chiara Caratelli [00:04:01]: You know, once you release it, people are going to try their worst or their best.
Demetrios Brinkmann [00:04:07]: So yeah, wait, and can you break down this multi turn conversation evals and how you were basically it was four guardrails, but you also wanted to evaluate that it wasn't giving discounts when it needed to be or it wasn't telling secret information that it shouldn't be. Did you basically set up another agent to try and red team it? Were you specifically trying to red team it? How did that work?
Chiara Caratelli [00:04:32]: So we created Personas. We created an agent that was supposed to impersonate a user who had certain intention and we were defining this at the beginning and the agent had to be very persistent to try and break our own agent. So for instance, no matter what the other agent responds, try to get this information, try to be persuasive and so on.
Demetrios Brinkmann [00:04:56]: Nice.
Chiara Caratelli [00:04:57]: And sometimes it was working. And yeah, that's where you know you need to do something about it.
Demetrios Brinkmann [00:05:03]: How were you updating that? You realized something went wrong? You realized that the agent gave away information it didn't need to. What did you then do to you just changed the prompts or did you also add specific like hard coded rules?
Chiara Caratelli [00:05:18]: Yeah, it depends on the case. Like sometimes we didn't encounter it in our case, but sometimes you might have security issues that did not surface at the time and then you need to do changes in the actual code. Sometimes it's just prompting being a bit more aggressive. Or you can add a reviewer that evaluates the output and checks some sort of content moderation step before the response. There are different techniques. Always start simple and then build up if it doesn't work.
Demetrios Brinkmann [00:05:51]: You did talk to me a bit about the simulations that you were doing, because I think that is in line with what you're talking about now. You were simulating in different ways, like can you go into that?
Chiara Caratelli [00:06:04]: So the simulation that we did were kind of multi turn with a pretty fine number of turns. So we didn't drag them for very long. We just wanted to see whether the agent could break our agent in like four or five turns of conversation. And we had few phases to define these. Like first, first we came up with examples ourselves because we wanted to just make sure basics were covered. Then we started to get feedback. And then after this phase you can move to production data and see that maybe we can move to this other part of EVILS which is equally important, which is error analysis. So there you really want to pinpoint what's wrong and add this to your evaluation set.
Chiara Caratelli [00:06:50]: So for instance, if we see a case where user managed to sneak some information from the agent, then you add it also to this set and you make sure you run it every time you're updating the code or periodically, depending on how frequently you need it.
Demetrios Brinkmann [00:07:05]: One thing that I'm fascinated by is the evals you're kind of doing offline, right? But then you have prompt changes that you roll out and those are almost like online. Do you think about how to update the production system with the new prompts, but almost like do it in a champion challenger way or is it done slowly or do you just change it and see where things break afterwards?
Chiara Caratelli [00:07:37]: If I understand the question, like is it about basically measuring the impact of the change.
Demetrios Brinkmann [00:07:45]: On the production system? Basically. And, and especially because you change the prompt, how are you rolling it out?
Chiara Caratelli [00:07:54]: Yeah, this is same as rolling out a new model. Like in the times of machine learning we were training a new model and you roll it out. So first of all you want to see how it compares to the previous. So you have an evil set, you run it and you see if there are discrepancies, if it's better, if there are some, some tests that now fail and they didn't before. Like if you introduce new errors. This is really important to do. First of all, on a trusted EVIL set, you run it there and then you can start to roll it out. Yeah, you can do an A B test for instance.
Chiara Caratelli [00:08:28]: There are, there is a lot of literature about this because yeah, this is what data science community has been doing a lot before and you know, now it's coming to agents because agents are, you know, are being productionized so now all these little things are starting to connect. Finally it was. I didn't see this as much before, but now, yeah, it's really nice that we can use all these techniques and port them in the world of agents.
Demetrios Brinkmann [00:08:57]: Maybe that's a nice thread to pull on is what are some things that you can take from the traditional predictive machine learning world, like those eval sets that you're testing offline and then the slow rollouts are the champion challenger. A B testing when you do roll out the model. But now we're not rolling out a new model, we're rolling out new prompts. What are some other things that you're taking or what are some things that are completely different?
Chiara Caratelli [00:09:23]: Imagine you have an input, you have a black box which is your model, your prompt, and then you have the output. So that part needs to be tested. Well, so that I approach like a data science problem more than traditional machine learning sense. Even though there is a bit more explainability there, we can make easier changes in the prompt, but still that part is very analogous. The other is the fact that you're integrating this into a big software engineering project that is more complex than just getting an output from a model. So, for instance, tool calling is really important to get it right. Sometimes tools are whole workflows by themselves. Like tool can be almost an entire agent or sometimes agents.
Chiara Caratelli [00:10:14]: So there is much more complexity in an agentic system. So many more things that can go wrong. And if you start to, for instance, chain API calls where you have always a probability that things might go wrong, that you're going to get the wrong output. So it's really important to find out what these steps are and test everything. Yeah, you know, it's easy to say, let's just test the agent end to end. What does end to end mean? There are all these components and you really need to understand what goes wrong. Also at the end, why do you even need evals in the first place? Right? You want to see what's wrong, what to improve, Is the product good enough to be released? I think if you're a software engineering team, you want to know what to do next. That's the most important.
Chiara Caratelli [00:11:03]: You want to iterate fast, get output and release new version quickly. So you need evos to understand where things go wrong, what are the errors, why does certain things happen?
Demetrios Brinkmann [00:11:14]: Having agents use other agents as tools, those agents now, they can fail and maybe you don't really recognize where that happens. So debugging becomes very complex.
Chiara Caratelli [00:11:27]: It's the core of evaluating agent because at the end, users care about the end to end experience. Right. We want to give a good experience to the user. So end to end is really important. So there is a level in which you want to test like how things go, like how. What's the experience for the user itself. Yeah, so that's very important. It's not the only part of it.
Chiara Caratelli [00:11:51]: Like it's very important to also dig deep and see for every step what can go wrong. This is more similar to traditional software engineering where you have unit tests for functions. You want to have granular visibility into what can go wrong. We say we have the end to end experience where you can for instance, say this is what the user types and this is what the agent responds. The agent is doing web search, these are the search results, things like that. Or the user is searching for gluten free food. Did the agent return gluten free food or not? This is something important to check. Right.
Chiara Caratelli [00:12:28]: Because there are allergies in case of food. So to give you some ideas. But then you check that the user experience is fine. But then from the developer side, if I want to know what to fix, then I need to take all these tool calls, see what goes wrong and where. And I think it's very important to, to use the right tools for that also. We find many times this project have complex workflows and traditional platforms for evils are sometimes not enough. Like to visualize data and outputs. We sometimes had to build our own tooling.
Demetrios Brinkmann [00:13:11]: Oh, interesting. Tell me more about that.
Chiara Caratelli [00:13:13]: For instance, to evaluate conversations in order for people who are non technical to understand and see if if something was good enough or not. We needed a good way to visualize this. And you know, you have agents and tools, most frameworks allow you to visualize these things, but sometimes in the tool you also have other things going on that are very domain specific. So we had to build apps for, for them to. To visualize conversations. And I think this step is sometimes overlooked, but it's very important because you really need domain Experts, you need PMs to look at this data because they're guiding the development of the product itself. So they really need to understand what is going on and it needs to be easy for them.
Demetrios Brinkmann [00:13:59]: I heard about a friend who actually Willem, who came on the podcast probably three months ago and he was talking about how they built heat maps for where the agents were failing. And once they built these heat maps, they could see very clearly what the agent was good at and what it was not so good at.
Chiara Caratelli [00:14:19]: Yeah, I think this Is the core of evil, like understanding where things go wrong? We had it at the beginning, we had simple prompts to evaluate conversations which were like, is the agent returning a good response? Is the tone accurate? Like all these things.
Demetrios Brinkmann [00:14:38]: This is when you were using LLM as a judge.
Chiara Caratelli [00:14:41]: Yeah, yeah, yeah. This is for LLM as judge. And it's very easy for the judge to think that everything goes well, but you really need to instruct them to look at the errors, find out what is wrong. Because they don't have these capabilities by themselves. They're very eager to please you when you show them data. Usually talking about various foundation models here that you can use to evaluate, but what you want is not that. So yeah, this is a very cyclical process. You need to keep people in the loop.
Chiara Caratelli [00:15:14]: You need to pull people from technical and product side in the loop because you have evaluation sets, what you evaluate, but also how, what is the quality of the evaluation output? Because I think it's worse to get wrong outputs than to not get them at all. Like you think everything goes well and then it doesn't. Of course you don't want that.
Demetrios Brinkmann [00:15:37]: Yeah, those silent failures is something that I wanted to ask you about too because a lot of times maybe everything seems like on the surface it went well, but when you dig in and you look at that user experience, you go, wow, this was a horrible user experience.
Chiara Caratelli [00:15:51]: Yeah. I think nothing beats using the app and testing it.
Demetrios Brinkmann [00:15:56]: Being your own beta tester, basically, yes.
Chiara Caratelli [00:15:59]: I think dogfooding is one of the most important things to do if you're developing a product. You need to use it, you need other people to use it, you need your grandma to use it. Yeah, it's really important to see the flow because often these apps involve UI elements. It's not just the agent itself. It's easy to sit in a corner and just evaluate the agent choosing the tool. Well, yeah, but what if the user gets a completely wrong experience? So yeah, there is. Especially at the beginning of this project, there is a lot of information sharing and yeah, like involving multiple people in this.
Demetrios Brinkmann [00:16:37]: Well, there is a fascinating thing here too. When it comes to the evaluations that you were talking about and the visualization that you were mentioning, you want to be able to visualize it in ways so that non technical people can also get an idea of what's happening. I just mentioned heat maps with my buddy Willem. Did you have other ways of changing that data to visualize it for the others that you found particularly helpful?
Chiara Caratelli [00:17:08]: Yeah, I think distributions are really important. Sometimes you don't know what errors you're going to encounter. So it's also important to keep things free form, like have the agent summarize what went wrong simply and then aggregate everything later is very important because you can think, okay, what are the category of errors that I can get? What is the taxonomy? You can have a tool error, you can have, I don't know, timeout error. There are many, you can start writing them all. But at the end, if you gather data, then you can just use that. Right? I mean, these things are not mutually exclusive, of course.
Demetrios Brinkmann [00:17:47]: Yes. Yeah. So that's a. Interesting way of putting it is why not just let the agent tell you what went wrong and summarize what went wrong and then you can have that as another data point.
Chiara Caratelli [00:18:00]: Exactly. Yeah. Yeah.
Demetrios Brinkmann [00:18:02]: I imagine there's no shortage of data for this. So in. And in a way you can always get more data. You can always ask the agent for more data and say what went wrong? Or show. Then you can also look at the traces and logs from different tools. You can look at the eval sets and if it's failing any of those. Yeah, is there like a 80, 20 here that we. If you're going into it, you're saying these I 100% need to have as data points and they need to be this granular or this sophisticated before I feel confident in pushing my agent out to production.
Chiara Caratelli [00:18:46]: Yeah, that's. That's a very bold question. Because it's really hard, especially at the beginning to know how this is going to. How this is going to look. Right. Especially if you don't have production data, you're kind of imagining how things are going to work until you're allowed to use, you don't have this information. Even you can think, these are all the use cases I want to cover. Okay, 99% are cover will release or maybe 50% if you're more comfortable.
Chiara Caratelli [00:19:22]: But at the end you need real user data. So I would frame it the opposite way. I think that the goal is really get feedback as soon as possible and get this out. And if you release to some users, for instance, you will start to have conversation logs and then you can really see where all the failures are. So there are some that you might want to accept, some that you might not want. In the case of search, there is always precision and recall problem. For instance, with information about food, you want to be precise because if someone has certain dietary restrictions, they don't want to give them wrong information. So it really depends on the project.
Chiara Caratelli [00:20:08]: You really need to sit down and with people from engineering products and define what these things are. Yeah, maybe we haven't talked about it yet, but defining what metrics you're comfortable with is really important at the beginning. So you have an idea of what is good enough. Right. Will I tolerate 50% of the search is going wrong? Like, what is the tolerance level for errors? What do you accept?
Demetrios Brinkmann [00:20:39]: Yeah, again, it goes back to this idea of like, there are hard line things that we for sure cannot do, like giving free food away or telling a vegan that this food is vegan, but it's not, it's actually meat. Those are things that you're gonna piss some people off very quickly with. And so you recognize along these lines we need very, very low margin of error. But over here we can be a little bit more generous with our margins of error.
Chiara Caratelli [00:21:15]: And I mean, this is AI, we need to embrace it. Right. It's not like users also expect the agent to be conversational, so they might get something wrong and then recover and then give the right result at the end. There can be some hiccups along the way, especially at the beginning. I think with new AI products, people are a bit more comfortable with that by now. Of course, then you want to improve it, but you cannot do that without enough data. So it's a bit. You need to be comfortable with this uncertainty and leverage it.
Chiara Caratelli [00:21:51]: Use the feedback and build on top.
Demetrios Brinkmann [00:21:54]: Is there anything that you are particularly surprised about that you've learned along the journey of building these agents, putting them into production that you want to share right now, or that you can share with us, that if you were going back and you were building this on day one, you would focus a lot more attention on, I think in general.
Chiara Caratelli [00:22:19]: How complex these systems can be, like how many moving parts that can go wrong. I think at the beginning it's easy to underestimate. You're testing yourself a couple of use cases, it works and then you're good. I'm not talking about this agent in particular, like in general, my experience building agents, but then, yeah, you release it to users and they will find ways to multivariate or not using it the way you think, because you're someone building agents. So your mindset is completely different. So for me, the most important learning point was to really trust users and get feedback from them.
Demetrios Brinkmann [00:23:01]: Trust users to break the thing.
Chiara Caratelli [00:23:04]: Yeah, like get, get people. That's your. That whose opinion you value and ask them to test.
Demetrios Brinkmann [00:23:12]: Yeah. And how are you reincorporating that back into the product?
Chiara Caratelli [00:23:17]: One of the things you can do is to get a channel for feedback and have people posting feedback and then gathering all this feedback, aggregating it and defining a roadmap based on that. Like, what are the main issues to fix? You know you can use LLMs in every step here. Yeah, you can have an agent reviewing feedback and prompting you things to improve.
Demetrios Brinkmann [00:23:42]: When you're doing these different kinds of visualizations and looking at the distribution and all of that fun stuff and you find certain things that are out of whack, you're then just reincorporating that into your eval set or I guess, where is step one?
Chiara Caratelli [00:24:00]: Yeah, it's both. It's like when you find an error in an application and you fix it, like you find a bug fix it Usually good practice to add a unit. Test for that if you can. So I think having a curated set of examples and edge cases also is really valuable here. So when you find something that is a bit off and it's a good idea to include it, and depending on the severity, then you can, you can fix it. Like this is more of a product problem then. Are you comfortable with that? Yes or no?
Demetrios Brinkmann [00:24:36]: I think that's good. Is there anything else you want to hit on?
Chiara Caratelli [00:24:38]: If you want to work on good projects like this, we're hiring a process, so feel free to message us, send us your CV and yeah, let's keep building. Great stuff.
Demetrios Brinkmann [00:24:50]: Nice.
Chiara Caratelli [00:25:14]: Convinced.