MLOps Community

Iterating on Your AI Evals // Mariana Prazeres // Agents in Production 2025

Posted Aug 04, 2025 | Views 116
# Agents in Production
# AI Evals
# Run the eval loop

Speaker

Mariana Prazeres
AI Engineer @ Run the eval loop

I help early-stage teams turn AI demos into real products by setting up evaluation systems. Previously, I worked at product start-ups building AI features and did academic research on deep learning.


SUMMARY

Demos are easy to set up, but making AI Agents work consistently is hard. In this lightning talk, I’ll walk through a startup-friendly process for iterating on your agent evaluations. I'll cover:

  • How to define success by creating an evaluation dataset
  • Leveraging LLM-as-a-judge for quick iteration
  • Why you need to iterate on your evaluation, not just on your prompts and models

TRANSCRIPT

Mariana Prazeres [00:00:08]: Thank you so much. Yeah. So today I wanted to talk about iterating on your AI evals, and why I think iteration is actually the key to a great evaluation framework. So we know how it is with AI: we do a demo and it feels easy. It feels like we're almost there. A lot of the time, even just a simple prompt makes us feel like we're there. It feels 90% done.

Mariana Prazeres [00:00:37]: And going from zero to almost done feels so easy. So how hard could it be to go from almost done to done? But that's actually the hardest step. Reliable AI is hard. Sometimes it works, sometimes it doesn't. A lot of the time, fixing one bug breaks other parts of the AI, and when it fails, we don't really know where. I'm preaching a bit to the choir here. But the solution is to evaluate; I think we can all agree that we can't fix what we don't measure. Usually the way I would go about it, and probably the way you go about it too, is this.

Mariana Prazeres [00:01:16]: We define what success looks like for our agent. We run it, we have some evaluation framework in place, we diagnose issues from the results, and then we improve the agent based on those results. And we run in this loop. And of course, as we diagnose issues and evaluate results, maybe we're also going to change our definition of success. Maybe our AI agent changes based on the product, or sometimes it's its own behavior that makes us take a step back and think: wait, what does success actually look like? And unfortunately, there is a lot that can go wrong here. This is not an easy thing. Even though my diagram was quite simple, it can go wrong in terms of process. We can end up just perfecting before testing.

Mariana Prazeres [00:02:06]: We spend weeks crafting the evaluation framework before we even run it. There's a risk of using just vibes over some structure: if we feel it's good, is that enough? We can end up ignoring the framework. For example, we build it, we forget it, we never look at it again, so it serves no purpose. And there's even a fourth one: giving up too soon. You know, abandoning the framework before we even see some advantages from it, because the results just don't feel right. Maybe the evaluation says everything is good, all our scores are amazing, but the product experience is just not there.

Mariana Prazeres [00:02:55]: It can also go wrong in structural ways. Maybe there's some siloed ownership: the people evaluating are, for example, engineers, and product context is missing. It can also be that our evaluation data is a mess; it doesn't actually correspond to what could happen in the product. Maybe the metrics are wrong. Why? We could be looking at academic metrics when we actually care about user satisfaction in a product. Then we can also end up with stale evaluation criteria.

Mariana Prazeres [00:03:29]: For example, we never revisit the evaluation, even as the product and the AI features evolve on their own. Then finally, it can also just go wrong strategically. Maybe there's an unclear iteration path: okay, evaluation is done, scores are bad, what do we do next? Maybe we're also over-indexing on one score and using it as the guiding light instead of changing it. And maybe there's no tie to the product loop, so how important a metric actually is for the product and for the user experience is not reflected in the metrics being used in the AI agent evaluation. The truth is there's a lot that can go wrong, and none of it can really be avoided entirely.

Mariana Prazeres [00:04:19]: And of course we're going to make mistakes. I make mistakes all the time; sometimes it's impossible not to make mistakes. So that's why I want to talk about iteration. What does it actually mean to iterate on your evaluation? I think when we talk about iteration like this, most people would just think about iterating on the model. So it would be about evaluation results guiding the prompt or model changes: we use the low scores to tell us where to look, to try to understand why things fail. And understanding these failures is actually what helps us understand what the next steps on the model or the prompt could be.
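That loop (run the agent over some examples, score the outputs, use the low scores to diagnose what to change) can be sketched as a small skeleton. Everything here is a hypothetical placeholder: `run_agent` and `score` stand in for your real agent and metric.

```python
def run_agent(prompt: str, example: dict) -> str:
    # Hypothetical stand-in for calling your real agent.
    return example["input"].upper()

def score(output: str, example: dict) -> float:
    # Hypothetical stand-in for your real metric
    # (a heuristic check or an LLM-as-a-judge call).
    return 1.0 if output == example["expected"] else 0.0

def eval_loop(prompt: str, dataset: list) -> list:
    """Run the agent over the dataset and collect per-example results."""
    results = []
    for example in dataset:
        output = run_agent(prompt, example)
        results.append({"example": example, "output": output,
                        "score": score(output, example)})
    return results

dataset = [{"input": "hi", "expected": "HI"},
           {"input": "bye", "expected": "bye"}]
results = eval_loop("be terse", dataset)
failures = [r for r in results if r["score"] < 1.0]
# The low-scoring examples tell you where to look: diagnose them,
# change the prompt or model, then rerun the loop.
```

The point is less the code than the shape: every run produces per-example results you can inspect, not just one aggregate number.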

Mariana Prazeres [00:05:03]: But for me, iterating on an evaluation is also iterating on the metrics and on the data: iterating on the evaluation framework itself. On metrics, for example: if your goal changes, so should your score, so should your rubric. The idea here is that metrics don't need to be perfect from the start; it's still better to have something simple than nothing at all. And now, with LLM-as-a-judge, we can get a lot of fast, scalable feedback. We can use as many judges as we want.

Mariana Prazeres [00:05:38]: They can be as specific or as complex as we want, and we can change them as our goals change. And finally, if we reach a point where we have too many metrics, we can start weighing them, for example, in terms of importance, right? So that's on the metric side. Then on the data side, we also need to change things. We have probably collected a bunch of input-output examples to evaluate on, but when our goal changes, these examples also need to change. They can't just remain the same. And we can start as small as we want; five or ten examples can sometimes be enough, and eventually this can grow. We can include expected outputs or not.

Mariana Prazeres [00:06:22]: There's no reason we have to. A lot of the time it can be quite hard, since LLMs are non-deterministic, and that's okay too. And yeah, we should probably avoid synthetic data early on, but that doesn't mean we couldn't add it later. So our data, the examples we evaluate on, will change with time. How would this look in practice? I'm just going to share a bit of an example, but this will look different for every case. In the early days, maybe this is the time where you just want to ship something and you just need some sort of evaluation framework so you're not completely blind. So maybe you start with a small number of high-quality examples, manually curated by people who really know what they want the AI to do.
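A "small number of high-quality examples" can literally be a handful of hand-written entries. A sketch, where the field names and cases are invented for illustration:

```python
# A tiny, manually curated evaluation dataset: five to ten examples is
# a fine starting point. The fields and cases here are illustrative.
EVAL_EXAMPLES = [
    {"input": "My name is Ana. What's my order status?",
     "expected_name": "Ana"},
    {"input": "Hi, I'm Bob. Cancel my subscription.",
     "expected_name": "Bob"},
    # ...grown over time, eventually with real production examples
]

# Sanity-check the data before using it for scoring.
for ex in EVAL_EXAMPLES:
    assert ex["input"] and ex["expected_name"]
```

Note that the expected output here is partial (just the name to use), which sidesteps the non-determinism problem: you check a property of the output, not an exact string.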

Mariana Prazeres [00:07:19]: Maybe there's some simple heuristics. They can be, you know, a Python script that just checks: is this output equal to the number I wanted? Or it could be simple LLM-as-a-judge questions, as simple as: is it addressing the user with the correct name? Very, very simple. And maybe there's just a script that runs this and returns some scores. That's it. This would be the beginning. And then eventually maybe you start feeling like, oh, now I can get a loop going. You'll eventually grow from a few examples to hundreds of real-world examples. Now maybe you're grabbing some of these from production: actual examples of how users have used this AI agent or AI tool.
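Those two starter checks, a deterministic heuristic and one very narrow LLM-as-a-judge question, might look something like this. `ask_llm` is a hypothetical stand-in for whatever model call you use; here it is faked so the sketch is self-contained.

```python
def number_matches(output: str, expected: float) -> bool:
    """Heuristic check: did the agent output the number we wanted?"""
    try:
        return float(output.strip()) == expected
    except ValueError:
        return False

# One narrow yes/no judge question, kept deliberately simple.
JUDGE_PROMPT = (
    "Answer only YES or NO. Does the following reply address the user "
    "by the name '{name}'?\n\nReply:\n{reply}"
)

def judge_correct_name(ask_llm, reply: str, name: str) -> bool:
    """Very simple LLM-as-a-judge: ask one question, parse YES/NO."""
    answer = ask_llm(JUDGE_PROMPT.format(name=name, reply=reply))
    return answer.strip().upper().startswith("YES")

# A fake model call, just for illustration:
fake_llm = lambda prompt: "YES" if "Hi Ana" in prompt else "NO"
print(number_matches("42.0", 42.0))                    # True
print(judge_correct_name(fake_llm, "Hi Ana!", "Ana"))  # True
```

A plain script that runs checks like these over the dataset and prints the scores is a complete early-days evaluation framework.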

Mariana Prazeres [00:08:10]: Now you have a lot of LLM-as-a-judge metrics. Maybe the evaluation is not just one local script; maybe it's running in CI, getting scheduled, and scores are getting tracked over time. Maybe now you're using LangSmith or building up your own fancy internal tooling. But the point is, at this stage maybe you feel okay: the evaluation serves a purpose, you can feel the feedback loop going. But maybe there are still a lot of manual processes in here. And then maybe one day you could reach something like an advanced system.

Mariana Prazeres [00:08:44]: So now you could have everything that maybe you could have thought of in the early days, but it wouldn't have made sense to implement yet, because you didn't know how your product was going to evolve. And this could include categorizing examples, because maybe there are thousands now. Maybe there are humans in the loop. Maybe now you even fine-tune after you correct some evaluation examples that didn't turn out well in production. And then maybe you deploy with confidence, because once it passes some evaluation thresholds, you know it's correct. But once again, this is just an example, and I think different products require some of these things earlier or later. Maybe some things are never going to make sense. Some features maybe just need a few examples; others a few examples and many metrics; but others will be one very good metric with thousands of examples. And this is also my point: whether we are in the early days or in the comfortable loop, wherever we are in this process, we don't actually know what the best evaluation framework is, because as we build the AI feature, the framework will depend on how the feature itself evolves.
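The "deploy with confidence once it passes some evaluation thresholds" idea can be a tiny gate in CI. A sketch, where the metric names and threshold values are invented for illustration:

```python
# Gate a deploy on per-metric evaluation thresholds.
# Metric names and values here are illustrative, not prescriptive.
THRESHOLDS = {"correct_name": 0.95, "helpfulness": 0.80}

def passes_gate(scores: dict) -> bool:
    """Allow deploy only if every tracked metric meets its threshold.

    A missing metric counts as 0.0, so forgetting to compute a score
    fails the gate rather than silently passing it.
    """
    return all(scores.get(m, 0.0) >= t for m, t in THRESHOLDS.items())

print(passes_gate({"correct_name": 0.97, "helpfulness": 0.85}))  # True
print(passes_gate({"correct_name": 0.97, "helpfulness": 0.60}))  # False
```

In CI this would run after the evaluation step and fail the pipeline when the gate returns `False`.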

Mariana Prazeres [00:10:00]: I guess my suggestion here, and my point, is: don't try to skip these steps. Today, just build the smallest evaluation framework you can. Maybe you have something like this already, and then you just work on making it a little better every day. So thank you everyone, and feel free to reach out to me on LinkedIn or via email. I'm always happy to discuss how I would actually implement this for specific cases, and AI evaluation in general. And there are some great talks following me, so just stick around.

Mariana Prazeres [00:10:43]: I see you're here now.

Skylar Payne [00:10:45]: This was awesome. I really love the sort of pathway you laid out. It just reminds me of the old adage of crawl before you walk, before you run. So definitely love the ethos there. We did have a couple questions in chat that I definitely wanted to shoot over to you. So Ricardo from Indechium asks, is the user feedback included in the loop and if so, how would you do that?

Mariana Prazeres [00:11:15]: You could include it, yeah. I mean, you can always have user feedback in your evaluation. The same way you could have an LLM-as-a-judge or some other type of scoring, you could also just bring user feedback into your framework. I know you can do it in LangSmith: when you get a thumbs up or down, you can bring it in and that becomes a score, just the user feedback. And maybe you use it to collect examples to then evaluate later on. Maybe you use it for some fine-tuning if they're good, or, if they're bad, after you correct them. So it can fit in many different ways.
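Turning thumbs up/down into a score is essentially a mapping plus an aggregate. A sketch of the idea (the actual LangSmith integration has its own API; this only shows the shape of the data):

```python
# Convert user thumbs up/down feedback into numeric scores and aggregate.
FEEDBACK_TO_SCORE = {"thumbs_up": 1.0, "thumbs_down": 0.0}

def feedback_rate(events: list) -> float:
    """Share of positive feedback across collected feedback events."""
    scores = [FEEDBACK_TO_SCORE[e] for e in events]
    return sum(scores) / len(scores) if scores else 0.0

events = ["thumbs_up", "thumbs_up", "thumbs_down", "thumbs_up"]
print(feedback_rate(events))  # 0.75
```

The thumbs-down events are also the natural candidates to pull into the evaluation dataset as new examples.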

Skylar Payne [00:11:49]: Yeah, awesome. And we had one more question. Sange from Microsoft asked: do you mean advanced system, as in productizing the system? If so, that seems like human-in-the-loop is coming in pretty late.

Mariana Prazeres [00:12:05]: No, I just meant it more as an example of how things could look, and maybe it depends. Yeah, I think it depends a bit on the situation, but I just meant this is what an advanced system could look like for you. But yeah, I agree that in some situations human-in-the-loop comes a bit too late. I can agree with that.

Skylar Payne [00:12:28]: Cool. Maybe one more question, just from me, from my own experience running evals. I noticed you had a little bullet point on the early phase about simple LLM-as-a-judge. Do you run into issues where the LLM-as-a-judge is not well calibrated to human judgment early on, and if so, do you do anything to correct for that?

Mariana Prazeres [00:12:52]: Yeah, so there was a little bit of that point stuck in there. I think sometimes when people start with LLM-as-a-judge, they just make it so complex that it's very hard to calibrate. And that's why my example is: does it address the user by the correct name? You really want your initial LLM-as-a-judge questions to be ridiculously simple, but to actually translate what you really want to happen. It would be much harder to say: here are all our style guidelines, please score against them. That is going to be so much harder to deal with at the beginning.

Skylar Payne [00:13:26]: Yeah, totally. Awesome. Well, thank you so much for your time. This was an excellent talk. Very excited to connect with you; I'd love to chat more. That being said, we're gonna send you off and bring on our next speaker.

Mariana Prazeres [00:13:40]: Thank you.
