Evaluating AI Agents: Why It Matters and How We Do It // Annie Condon | Jeff Groom // Agents in Production 2025
SPEAKERS

Annie Condon is an AI Solutions Engineer at Acre Security, where she helps bring intelligent systems to the physical access control space—without letting any rogue AI lock people out of a building (on purpose, anyway). With over 8 years of experience across machine learning, data science, and AI, Annie’s journey has taken her from deploying traditional ML models to building cutting-edge AI agents.

Jeff Groom is an accomplished engineering leader specializing in AI-powered products. As the current Director of Engineering for AI-focused initiatives at Acre Security, he spearheads the development of domain-optimized solutions designed to enhance security across critical infrastructure sectors. With expertise that bridges advanced machine learning techniques and practical engineering execution, Jeff ensures that AI innovation directly aligns with operational and regulatory needs.
Prior to his company being acquired by Acre Security, Jeff led engineering teams in the security space, driving architecture, development, deployment, and continuous improvement of AI systems tailored to real-world threat landscapes.
Based in Denver, he is active in the AI and security technology community.
SUMMARY
As we integrate agentic AI into business products, robust evaluation of the agents is essential to delivering the highest quality. Proper evaluation ensures that AI agents are reliable, safe, effective, and aligned with user intent. Unlike traditional software or machine learning models, AI agents are non-deterministic and require specific types of evaluation. This talk outlines the importance of evaluating AI agents, the key components that we version and test at Acre Security, the metrics that matter for different types of agents, and how we currently achieve success evaluating AI agents that we build at Acre.
TRANSCRIPT
Demetrios [00:00:00]: [Music]
Annie Condon [00:00:08]: Thanks for having us. We are definitely here for all these talks. It's really awesome to hear all this content, and honestly the music is always why I come. So yeah, I'm Annie, I'm an AI solutions engineer at Acre Security, and I brought my manager Jeff along because a lot of these ideas are his thought leadership. So I'll let him introduce himself real quick.
Jeff Groom [00:00:40]: Yeah, cool, thanks Annie. I'm Jeff, I'm director of engineering for AI at Acre Security. I have just under 25 years of experience developing software and hardware, and not too long ago I founded a company that built some AI products specifically for the industry that we're in. We went through acquisition, joined up with Acre, and then I had the pleasure of meeting Annie. We're really excited to talk to you guys today. So thanks for having us.
Annie Condon [00:01:15]: Cool, thanks Jeff. And if you're listening in and you're building agents, but you aren't quite sure how to scale an evaluations process and make it robust enough to put agents in production, then you're in the right place, because we're going to show you how we do that. Just for context, a little high-level background on what we do: Acre is a physical security company, so things like badges and readers, access control, how you get in and out of spaces. AI agents are definitely becoming a part of this industry really quickly, and there's a race to solve these common problems. The people we're solving them for are installers, operators, and enterprise users, and the goal is to move from really technical processes to more natural-language-oriented processes. And I think by now most of you already have a definition for what an agent is.
Annie Condon [00:02:23]: But I think one of the things that's important is that it's still a software system. And even though parts of it are non-deterministic, it really needs to be evaluated in the same way as, or even a more expansive way than, traditional software. Jeff actually came up with the ideas for these cartoons, so I want him to explain his thinking around why evaluations really improve quality.
Jeff Groom [00:03:01]: Yeah, thanks Annie. My specific target audience right now is maybe people who don't have a lot of familiarity with agents and evals. The mental model that I have is here in the cartoons, and it shows that when the user interacts with the agent, we have a loop that's non-deterministic, and the agent is going to interact with maybe an LLM or multiple LLMs several times. We might not always know the decisions that are made at decision points, right? And so what we want, I have this analogy of the X-ray machine, is to peek inside or look under the hood in terms of what's actually happening when the agent interacts with the LLMs. The big takeaway here is that, at the end of the day, one of our mantras, part of our ethos, is quality. It's all about quality.
Jeff Groom [00:04:00]: And so anybody who's building agents, you can't really go off of, feel like if you feel that it's, it's, it's working well, you really need to understand what's going on. So if you want to drive quality and explainability, it's very important that we, we actually have evals.
Annie Condon [00:04:20]: Yeah, absolutely. And going off of that, there are a lot of ways that evaluations of agents are similar to how you would design software testing or even machine learning model testing. But because AI agents have parts that are non-deterministic, what really enhances agent evaluations is, one, the metrics that you're designing and choosing to evaluate your agent, which tend to get really creative because it's a different kind of problem that you're trying to evaluate. And two, the tools that you use, not literally the tools the agents are using, but the frameworks that you use for evaluation, to make it very systematic. I know this is a lightning talk, so I want to get to the sexy part of how we do some of our evals, and I'm going to speed through this. But at Acre, we version everything. First off, every version of an agent might include the prompts, the instructions, the code, and the tools that were used.
Annie Condon [00:05:44]: Some of this gets versioned within each version of the code. But then also we store a lot of these evaluation logs as well.
Demetrios [00:05:56]: So.
Annie Condon [00:06:00]: I want to show you really quick two of the tools that we use. Let me try to share this. One of the tools that we use is called Logfire, and Logfire is built by Pydantic. I'm not a Logfire salesperson, but we really like it because it's great for drilling down into agent runs. Let me see if I can zoom in on this a little bit. Basically, Logfire will log each time that your agent runs, and it will trace how long the run took. It will show you everything from the system prompt all the way to the user's prompt, so what was entered by the user.
Annie Condon [00:06:52]: And then it will show you in detail, kind of like the agent's output, which is really important. When Jeff was giving the X ray analogy, I think sometimes it's hard to determine which tools your agent used and if your agent used it in the correct order that you were expecting. So like this agent is just like a simple math agent. I think it has two tools for multiplication and additional. But we can see in this run the user prompt was 2/2 times 4. Then it breaks down in detail what the agent is doing and which tools it used. We find this really useful not only for evaluation before deployment, but also in production, having these logs so that we can trace the back the results of each tool. And then one other really cool tool that we're using is called Confident AI.
Annie Condon [00:07:58]: And it's like the library, the open source library behind it is Deep Evals, but similar to something like MLflow. It traces your runs of your evaluations and it will give you an overview of how many of your evaluations passed your metrics. And then you can also drill down into each case that you put into your actual evaluation and kind of troubleshoot from there. And just for context, also, like, we, we're a super lean team there. It's Jeff and I and one other developer. And so I think it's really like we could definitely build our own tools for this and that's definitely an option as well. But this makes it really easy for us to perform quality evaluations quickly. And then also like Deep Eval offers a lot of different metrics, like built metrics, especially for conversational agents and things like that.
Annie Condon [00:09:04]: And then they also offer, you can build your own custom metrics. So it gets really creative.
Jeff Groom [00:09:12]: Yeah, and I just want to double click on that. I mean, I think the main takeaway, like Annie said, we have a small team, so we don't want to reinvent the wheel and try to build our own tooling. But it's my opinion that evals sort of make or break an AI project or product like this. And so you need really good tooling. And if you can't roll your own, then it's important to find good stuff that's commercial, off the shelf, and you want to try to find something that, like Annie said, you don't have to reinvent the wheel on the eval platform itself, but that you can fine tune the Metrics so that they work for your. For your business case and you make sure that you're evaluating metrics and scoring. That's really tied off tightly to your business loop.
Annie Condon [00:09:54]: So, yeah, yep, that's kind of a perfect conclusion to this. So thank you so much and connect with us on LinkedIn if you feel like it. And thank you.
Demetrios [00:10:09]: So I've got one question, and it is how you chose Logfire. How and why?
Jeff Groom [00:10:19]: Go ahead, Annie. You want to answer that?
Demetrios [00:10:23]: I don't work for Log Fire, I don't sell for them at all. But you chose it and there's probably some reasons. So I want to know.
Annie Condon [00:10:31]: I'm going to defer to Jeff because I love Logfire now that we're using it. But I think Jeff, it was his idea originally.
Jeff Groom [00:10:40]: Yeah, I mean, the answer was pretty simple. There are probably other things out there, but I'm a big fan of the people that build all of the Pydantic products. So I know that we just saw talk about the merit of agentic frameworks, but I mean, I think Pedantic really stands on its own, just in general in the Python community. And now we have Pedantic AI, which is a framework that I think is one of the most, or one of the easiest frameworks to use. And it's kind of slept on a little bit, I think. You don't hear about it a lot, but I really love it. And so as an extension of that, the creators of Pydantic also make this logging utility and I think it's pretty comprehensive. I like the user experience on it, and it's not just for observability on agentic applications.
Jeff Groom [00:11:32]: You can use it to actually log it, to ship out logs for any kind of application that you want to do. But it was purpose built with the idea of running LLM powered or agentic applications. Like the makers at Pydantic, they thought they went out there to try to find something while they were building maybe LLM apps and there was nothing out there. So they were like, let's tailor something to what the developers are going to really want. And their ethos is like, we want to make tools that are really developer friendly. I don't know the people at Pydantic, but from what I read, this is their whole mantra is we want to make shit really easy for developers to use. And so we just started checking it out and I loved everything about it. And so until we find something that's better, like that's going to be, that's going to be the horse that.
Jeff Groom [00:12:19]: That we choose for this race.
Demetrios [00:12:21]: That's amazing. If you ever want to know more about Pedantic, I found the biggest fan, and that is the creator of Fast API.
Jeff Groom [00:12:30]: Oh, yeah. Yeah. He's amazing. Amazing dude. Yeah.
Demetrios [00:12:34]: And if you ever just. If you utter the word pedantic around him, you better clear your schedule.
Jeff Groom [00:12:41]: His story in its own right is amazing, too. Fast API is one of the most. In my opinion, one of the most amazing things ever built with. With Python. So cool, dude.
Demetrios [00:12:50]: Yes. I. I literally just did a podcast with him and mistakenly said, so why do you like Pydantic?
Jeff Groom [00:12:59]: We talk about that for, like, 20.
Demetrios [00:13:02]: Minutes of the podcast, so. Yeah, that's cool to hear about Log Fire. I appreciate it. And Lack was just talking about in the last talk. Pedantic is a great one to choose because it's, you know, very flexible in its ways.
Jeff Groom [00:13:14]: Yeah.
Demetrios [00:13:15]: So, folks, thank you so much for this. We're gonna jump to the next talk.
