Building Robust Autonomous Conversational Agents with Simulation Techniques from Self Driving // Brooke Hopkins // Agents in Production
Brooke Hopkins is the founder of Coval, a simulation and evaluation platform for AI agents. She previously led evaluation job infrastructure at Waymo. There, her team was responsible for the developer tools for launching and running simulations, and she engineered many of the core simulation systems from the ground up.
As conversational AI systems become increasingly autonomous and mission-critical, ensuring their reliability presents novel challenges that parallel those faced in self-driving vehicle development. This talk explores how simulation-based testing and evaluation strategies from autonomous vehicles can be adapted to build more robust AI agents. We'll examine how traditional software engineering practices are evolving in response to AI systems, where deterministic unit tests give way to probabilistic performance metrics and behavioral analysis. Drawing from real-world examples, we'll demonstrate how comprehensive telemetry — both in pre-production simulation and production environments — provides crucial insights beyond simple pass/fail metrics. The presentation will delve into the critical balance between computational cost, latency requirements, and signal quality in AI system evaluation. We'll introduce a framework for developing evaluation strategies based on reliability requirements across different use cases, from bug detection tools where any true positive provides value, to medical assistance systems that demand near-perfect accuracy. Attendees will learn practical approaches to implementing simulation-based testing pipelines, techniques for meaningful telemetry collection and analysis, and strategies for defining appropriate reliability thresholds based on application context. This session will benefit ML engineers, software architects, and engineering leaders working on production AI systems.
Adam Becker [00:00:04]: Next up we have Brooke Hopkins. Brooke, can you hear me?
Brooke Hopkins [00:00:07]: Yep, I can hear you.
Adam Becker [00:00:09]: Awesome. So I hope this was the right transition because I am so interested in simulations, especially in the context of conversational AI and I'm stoked to hear about what you have to say about this. Let's see. Do you have your screen up? You do? Awesome. So I'm going to be back soon. If anybody has questions until then, drop them in the chat and then I'll ask them in about 20 minutes. So Brooke, take it away.
Brooke Hopkins [00:00:37]: Awesome. Well, awesome talk to follow on from. So today I'm going to be talking about what we can learn from self driving when testing agents, specifically autonomous agent simulation. But first I think we should define what an agent is, because there are so many different catch-all terms for what an agent is today. The way we define an agent is an autonomous system that perceives its environment and can respond to it. What this has to do with self driving cars, or the similarity there, is that a self driving car is basically an agent on wheels. It navigates the world to achieve a task: a self driving car is trying to get from point A to point B, and for every iteration it goes through, it has to respond to the world in which it's navigating.
Brooke Hopkins [00:01:27]: Autonomous agents are really similar in this regard: they're interfacing with external systems, be it a user, a voice agent, another web agent, another website, or another API, and they're trying to autonomously navigate those systems. I think there are a lot of really interesting things we can learn from self driving, because testing agents is really slow and manual. Today, engineers will spend hours manually testing their system. This takes a really long time because not only do you have to do it manually, but you have to go through multiple steps in order to test the agent. For example, if you're testing a voice agent, you might call it over and over to make sure it's doing what you expect and going through all these different pathways. But each of these phone calls can take anywhere from 30 seconds to 10 minutes to test end to end. Every time you make a change to the agent, you vibe-check it, chat with the system for hours, find a variety of ad hoc cases that you think need improvement, and then make another change to improve those things. You end up playing whack-a-mole.
Brooke Hopkins [00:02:35]: Because each time you make a change, you would have to test every single possible case, which would just take far too much time. But self driving has been solving this problem for the last 10 years: how do you test what is basically an infinite plane of all the possible inputs and decide whether or not you're systematically improving over time? So that's how I ended up starting Coval: trying to take a lot of these learnings from self driving and apply them to building autonomous agents. I was previously leading evaluation infrastructure at Waymo, so my team was responsible for all of our developer tools around launching and running simulations, anywhere from building datasets and maintaining them to launching these simulations on distributed compute. The rest of our team has studied evaluation at Stanford and Berkeley, so we're taking an approach of drawing from state-of-the-art research as well as the experience of building that infrastructure at Waymo. So today I'll talk through the interesting parallels that I've seen between building out evaluation systems for autonomous agents, especially conversational AI, and how we've taken learnings from self driving and a lot of the iterations we went through there. At a high level, we'll talk through probabilistic evaluation, what that looks like for agents and how that's changed over time, how you can simulate different layers and the parallels with robotics here, how we can compare to humans when there are so many possible right answers and wrong answers, and then how you can build reliability when there are so many different layers of compounding errors. In software engineering, I think we're all aware that best practice around testing is to have layers of test cases: you have unit tests that you run every time you submit a commit or a piece of code.
Brooke Hopkins [00:04:45]: You have some integration tests or some larger regression tests that you might run nightly. And then you might have a bunch of release tests. These all have very clear pass or fail metrics. With traditional ML, testing was much more focused on F1 score, precision, recall: how do you make sure that in aggregate you're improving on certain numbers? But I think the challenge with foundation models, in both self driving and conversational AI or generative AI at large, is that you care about the overall statistics, but you also care a lot about individual examples. And that's where you have the parallels with both self driving and traditional ML: you want to know your aggregate statistics for how well you're doing across all the different cases, but you also care a lot about doing well on certain tasks. This is actually very similar to self driving, where you want to know how often you're driving successfully without a collision, but you're also looking at, for this specific stop sign, did we stop at the stop sign? What did the path look like? Something that self driving came to the conclusion of is moving away from testing very specific test cases, where you say, for this scenario I want exactly these things to happen, being very rigid in those test cases, then manually creating these and running anywhere from tens to thousands of them. But these are really expensive to maintain.
Brooke Hopkins [00:06:32]: You have to manually go in and create all of these. Self driving then shifted, while I was working in self driving, towards this much larger-scale probabilistic evaluation. How do you run thousands of different test cases that are all generated either from logs or from synthetic data, simulate all of those cases, and instead of looking at any specific one, look at aggregate events? How often are certain types of events happening? How often do we get to our destination? Was there a hard brake? Was there a collision? All these different metrics. What's really useful about this is that you don't have to maintain each of these individual tests; you can instead ask, how often are all these things happening across all of these cases? I think we should be doing the same thing with agents. Right now people are too focused on individual test cases: I want to test that when I put in this prompt of I want to book an appointment, it books the appointment exactly in the way I expect. But really what we should be doing is running thousands of different iterations of booking an appointment and then saying how often the goal was achieved, using LLM as a judge or heuristics, etc.
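To make the shift from rigid per-case assertions to aggregate metrics concrete, here is a minimal Python sketch. `run_booking_simulation` is a hypothetical placeholder for driving one simulated conversation and grading it with an LLM judge or heuristics; only the aggregation pattern is the point, not any particular platform's API.

```python
import random
from dataclasses import dataclass

@dataclass
class SimResult:
    transcript: list[str]
    goal_achieved: bool
    turns: int

def run_booking_simulation(seed: int) -> SimResult:
    """Placeholder for one end-to-end simulated conversation.

    In a real pipeline this would drive the agent with a simulated caller
    and grade the outcome with an LLM judge or heuristics.
    """
    rng = random.Random(seed)
    turns = rng.randint(3, 12)
    achieved = rng.random() < 0.9  # stand-in for the judged outcome
    return SimResult(transcript=[f"turn {i}" for i in range(turns)],
                     goal_achieved=achieved, turns=turns)

def aggregate(n_runs: int = 1000) -> dict:
    """Report aggregate statistics instead of asserting any single output."""
    results = [run_booking_simulation(seed) for seed in range(n_runs)]
    return {
        "success_rate": sum(r.goal_achieved for r in results) / n_runs,
        "avg_turns": sum(r.turns for r in results) / n_runs,
    }

if __name__ == "__main__":
    print(aggregate())
```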
Brooke Hopkins [00:07:57]: And this will allow us to show, from a non-deterministic perspective, that it's working 90% of the time, but also to create much more robust tests that stay dynamic as your agent changes over time. Another really interesting thing I think we learned from self driving is the idea of simulating different layers and enabling and disabling different modules. In robotics, when you simulate a piece of hardware, you might be testing different layers at different points: everything from car-onboard or hardware-onboard testing, where you're testing just the hardware, or the hardware plus software to see how it actually performs on an exact copy of that hardware, to testing just the software at a very high level. A bit of background on how self driving cars work, at an extremely simplified level: you have a bunch of sensor inputs, you have lidar, you have cameras, you have radars, you have audio. That's all being ingested into perception. Perception asks, how can I take all this sensor signal and turn it into: this is a person, this is a siren, this is a car in front of me? And then you have localization, and a bunch of other component modules that are ingesting other information and stitching it all together with the sensory information.
Brooke Hopkins [00:09:33]: And then you have planning, where you're deciding, based on perception and everything I know about the world around me, what do I think is going to happen next? So behavior prediction: is this person going to walk across the road? Is the car in front of me moving forward or is it stopped? Do I think it's going to move forward? Is it just double parked because it's waiting for someone to come out? Planning takes into account all of this information from downstream and says, given that I want to get to point B, what's my next action? And then control actually passes this back to the hardware, saying I want to turn left, I want to turn right, I want to accelerate. In simulation, you don't necessarily have to test every single piece of this stack in order to have really useful simulations. And I think we can take a lot of this into simulating agents because, for example, with conversational agents you might not actually need to simulate voice in order to understand how well your voice agent is doing. You might just need text to evaluate the logical consistency of your voice agent. Or if you're building an AI SDR, you might only evaluate the conversational piece of voice and chat, but mock out all of the web agents and all of the websites that it's interacting with. This is useful for optimizing the trifecta that you're always optimizing for in simulation, which is cost, latency and signal. You can get the most signal by running an infinite number of tests, but then you'll have very high costs; or you can have very low cost by running no tests, and then you'll have very low signal; or you can run them very slowly on very cheap compute, but then it takes a really long time.
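A rough sketch of this layer-enabling idea, assuming a hypothetical text-first voice agent stack: the `SimConfig` flags, `transcribe`, and `dialogue_policy` names are illustrative placeholders, not a real framework. Disabling the audio and web layers trades fidelity for cheaper, faster runs, which is the cost/latency/signal trade-off described above.

```python
from dataclasses import dataclass

@dataclass
class SimConfig:
    """Which layers of the stack to exercise in a given run.

    Disabling a layer swaps in a cheap mock, buying more runs (more signal)
    for the same cost and latency budget.
    """
    simulate_audio: bool = False  # full STT/TTS path vs. text only
    simulate_web: bool = False    # real browser/API calls vs. mocks

def transcribe(audio) -> str:
    raise NotImplementedError("real speech-to-text would go here")

def dialogue_policy(text: str) -> str:
    # The agent's reasoning layer, usually the part we care about most.
    return f"agent reply to: {text}"

def run_turn(user_input, config: SimConfig) -> str:
    if config.simulate_audio:
        text = transcribe(user_input)  # expensive, higher fidelity
    else:
        text = user_input              # input is assumed to already be text
    reply = dialogue_policy(text)
    if config.simulate_web:
        pass  # real downstream tools would be exercised here; mocked otherwise
    return reply

# Cheap, fast, text-only run: good signal on dialogue logic per dollar spent.
print(run_turn("I'd like to book an appointment", SimConfig()))
```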
Brooke Hopkins [00:11:38]: The next thing I think was really useful from building simulation at Waymo is this idea of, how do you compare to a ground truth human input? One of the hard parts of building autonomous agents is that there are many ways to get to the end result. If you're driving a car from point A to point B, you might take a left and then a right, or you might take a right and then a left. Maybe you pass a car, or maybe you wait for it to pass before you take a left. With autonomous agents, conversational agents, you have the same problem: to book an appointment, there are many ways you might say that, and many ways the conversation might go, but in the end you successfully book the appointment, get their email, get their phone number. So the ideas around comparing to humans are similarly difficult. One way to do this is to ask, are we slower or faster than the human was, and how can we measure that? So for conversational AI, what we do is look at the steps.
Brooke Hopkins [00:12:50]: We can ask an LLM as a judge what steps the agent took in order to achieve this goal, and then how long it took the agent to achieve that goal. It's not necessarily a hard and fast rule that you should always be faster or always be slower, but it's really useful for giving you a broad sense: if you're dramatically slower because the agent didn't understand, or the agent is going around in circles, or the agent keeps repeating itself, that can be really useful signal for saying my agent isn't doing what I expect. So what we'll do is take ground truth human labels and ask, how long does it take an average person to achieve this task, and what steps do they take, in the abstract: first you get their email, then you get their phone number, then you say you'll call them back. And you can ask, does that match up with what my agent is doing? There are also a lot of other really interesting parallels and metrics between self driving and agents; I won't go into all of them. But the high level is having a suite of metrics that you go through, and it's not necessarily, I need this metric to be passing at 90% or otherwise it's failing. A spectrum of metrics can be really useful when doing human review of these really large regression tests and asking, what are the trade-offs that we're willing to make here? Are we willing for an agent to take longer if it has higher accuracy? Cool. And I think the last parallel between self driving and autonomous agent evaluation is that you have a similar problem of compounding errors, or a butterfly effect, where if you have an error early on in the conversation, that can then cause an error in the next piece, which can then cause an error in the next piece.
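As an illustration of the human-comparison metrics described here, a small sketch that checks an agent trace against a hypothetical human baseline. The step names, the baseline numbers, and the `slow_factor` threshold are made-up examples; in practice the steps would come from an LLM judge applied to transcripts.

```python
from dataclasses import dataclass

@dataclass
class Trace:
    steps: list[str]       # abstract steps, e.g. extracted by an LLM judge
    duration_s: float

# Hypothetical human baseline for the same task, built from labeled calls.
HUMAN_BASELINE = Trace(
    steps=["get email", "get phone number", "promise callback"],
    duration_s=95.0,
)

def compare_to_human(agent: Trace, human: Trace, slow_factor: float = 2.0) -> dict:
    """Flag traces that skip expected steps or are dramatically slower.

    Not a hard pass/fail gate, just a signal that the agent may be looping,
    repeating itself, or misunderstanding the caller.
    """
    missing = [s for s in human.steps if s not in agent.steps]
    return {
        "missing_steps": missing,
        "speed_ratio": agent.duration_s / human.duration_s,
        "dramatically_slower": agent.duration_s > slow_factor * human.duration_s,
    }

agent_trace = Trace(steps=["get email", "promise callback"], duration_s=240.0)
print(compare_to_human(agent_trace, HUMAN_BASELINE))
```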
Brooke Hopkins [00:15:02]: And the flip side of this is that your potential for errors, you might argue, is actually the sum of all of the points along your system that could fail. This has often been an argument for why it's very difficult to build reliable agents and why we won't be able to get there. But I would argue the opposite. I think we've been solving this type of problem for a long time in software, where traditional infrastructure gets to six nines of reliability while being built on top of many systems that all have their own potential for errors. Your servers have a potential for failure. Your Internet has a potential for failure. There's packet loss, there are web application errors, there are all these layers that all have the potential to fail.
Brooke Hopkins [00:16:00]: How do you build really robust systems when you otherwise might encounter all of these different errors throughout the stack? What software infrastructure has done is build reliability into this through redundancy, through fallback mechanisms, et cetera. Self driving took a similar approach: just because a model fails or is non-deterministic, there are all these fallback mechanisms that introduce redundancy and also ensure that it can fail over gracefully. I think we should really be doing the same with agents: thinking not just about how do I get my prompt correct on the first try, but if I get it wrong, how do I create fallback mechanisms such that my agent can fail gracefully? There's redundancy, I can find the right result by calling multiple models, etc. Once we start to think about layers of redundancy, that can really dramatically increase the reliability of agents. And then there's a really interesting last piece, from the previous talk, where they were saying it's really important to always know that there's going to be a human in the loop. I would actually argue the opposite: it's really important to set your mind on building towards level 5 autonomy. In self driving there are these levels of autonomy, everything from level one, which is not autonomous at all, through levels two to four, which are variations of driver assist or advanced driver assist, and level five, which is full autonomy with no human intervention.
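A hedged sketch of the fallback pattern being described: try a primary model, fall back to a secondary one, and finally degrade to a cached, safe response. `call_model`, the model names, and the cached responses are placeholders for whatever LLM client and intents are actually in use.

```python
import logging

def call_model(name: str, prompt: str) -> str:
    """Placeholder for a real LLM call; assumed here to raise on failure."""
    raise TimeoutError(f"{name} timed out")

# Pre-written, safe responses to fall back on when every model call fails.
CACHED_FALLBACKS = {
    "book_appointment": "I'm having trouble right now. Let me take your "
                        "number and have someone call you back.",
}

def answer_with_fallbacks(prompt: str, intent: str) -> str:
    for model in ("primary-model", "secondary-model"):
        try:
            return call_model(model, prompt)
        except Exception as exc:  # timeout, refusal, malformed output, ...
            logging.warning("%s failed: %s", model, exc)
    # Last resort: degrade gracefully instead of failing silently.
    return CACHED_FALLBACKS.get(intent, "Let me transfer you to a person.")

print(answer_with_fallbacks("Book me for Tuesday at 3pm", "book_appointment"))
```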
Brooke Hopkins [00:18:08]: And I think a lesson that we've learned from self driving is that it's very difficult to go from level two to level five linearly. Any self driving car company that targeted level two and said, we're going to do driver assist and then systematically get better over time to get to level 5, has struggled; we have yet to see that approach succeed. Waymo, however, targeted level 5 autonomy from the beginning. What this allows you to do is it forces you to build redundancy, fallback mechanisms and reliability into your system at a very core level, which is hard to add in after the fact. So actually I would say, if you're building autonomous agents, you should shoot for full autonomy from the beginning, because it forces you to think about what we do for fallback mechanisms if the agent fails, how we handle that failure gracefully, and how we can create other systems that can make decisions, perhaps without generative AI, or using cached examples, or using multiple LLM calls. And I think it's very difficult, once you have a human in the loop, to improve that over time. I think we've seen that with, for example, email suggestions; it's hard.
Brooke Hopkins [00:19:27]: The better the email suggestions get, the more likely you are to just not look at them at all and send them off. You actually have to build in systems that allow you to see how all the emails that are being sent out are performing: can you flag emails that were potentially incorrect, can you go through some secondary review, et cetera? So I think it is really important to shoot for level 5 autonomy from the beginning. Yeah. So with all this said, I think we have been approaching agentic evaluation in the wrong way. What I've seen from the industry today is that we're focusing too much on building static tests, saying, for this prompt input I expect this output, and doing that for every step along the system. But really we need to create a new testing paradigm that's much closer to robotics: how do you build out simulation such that autonomous systems interfacing with the real world can interact with a dynamic testing environment that provides dynamic input as they're making decisions and navigating? So what we do is actually simulate a whole conversation back and forth, and we're moving into being able to do this for web agents, for any type of agent: simulating it within a sandbox and asking, for this agent on the left, how is it responding to our dynamic input? This way, when your agent behavior changes, all of your test cases don't break.
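One way to picture this dynamic-simulation loop, as a sketch: a simulated user improvises toward a high-level scenario, the agent under test replies, and an LLM-as-judge style check decides whether the goal was achieved. All three functions here are stubs standing in for LLM calls; none of this is a specific product's API.

```python
def agent_reply(history: list[str]) -> str:
    """The agent under test; a placeholder response here."""
    return "Sure, what day works for you?"

def simulated_user_reply(scenario: str, history: list[str]) -> str:
    """A simulated caller that improvises toward the scenario goal.

    In practice this would be another LLM prompted with the scenario,
    so it can react to whatever the agent actually says.
    """
    return "Tuesday afternoon, please."

def judge_goal_achieved(scenario: str, history: list[str]) -> bool:
    """LLM-as-judge placeholder: did the conversation achieve the goal?"""
    return any("tuesday" in turn.lower() for turn in history)

def simulate(scenario: str, max_turns: int = 10) -> bool:
    history: list[str] = []
    for _ in range(max_turns):
        history.append(simulated_user_reply(scenario, history))
        history.append(agent_reply(history))
        if judge_goal_achieved(scenario, history):
            return True
    return False

print(simulate("Caller wants to book an appointment next week"))
```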
Brooke Hopkins [00:21:12]: And we'll also go through and show you which paths you're going through and which paths you're exercising the most, which can be really helpful for understanding test coverage. That's another really tricky part of testing these systems with such a huge testing plane. So from this example, it's very clear that these paths weren't exercised, and either that's correct or incorrect, and we now understand which other parts of the system we have to test. Lastly, I think this testing starts to blur the lines. I think all generative AI has kind of started to blur the lines between software, product and sales. Showing how well your agent works is a really critical part of gaining user trust. And so we've seen something really interesting: actually using a lot of these tests to create user confidence that they can trust your agent, they can trust that it will be safe, that it will do what they're expecting, and they can also monitor that over time. Through these dashboards, you can actually see what's happening over time and how often you're getting things right.
Brooke Hopkins [00:22:31]: And on top of that, it also drastically reduces the amount of engineering time spent. You don't have to go through a thousand test cases manually by typing in inputs or calling your agent over and over. Cool. So those are some thoughts on how we can use learnings from self driving in evaluating autonomous agents. If you're building agents, we would love to talk to you and see how we can help. We're also hiring, so if
Brooke Hopkins [00:23:04]: this sounds really exciting to you, come work with us.
Adam Becker [00:23:08]: Thank you very much, Brooke. This has been absolutely fascinating, and I feel like I could keep you here for two hours and just ask you a ton of questions. Let's see how many we can actually go through, because I already see a couple in the chat. Let's see where I should start, because there's a bunch; I'll try to focus on the ones that I think are the most specific to the content that you've shared. Okay. One of them is: how do you create a rich set of different situations to test the agent on? And do you use LLMs for that as well?
Brooke Hopkins [00:23:47]: Yeah. So we can generate scenarios for a lot of different situations given certain context. If you give us workflows, or RAG documents, or other pieces of content that inform your agent, we can generate test cases from that. But I think another useful piece is that we actually simulate everything based on a high-level scenario. Say you want to create an account, give your details and ask for a professional plan. The advantage of defining these test cases in natural language is that they're very easy to create. So then you can run those same scenarios maybe 100 times and see different pathways for that high-level objective. On top of that, it's also much easier to set up for, say, creating 50 test cases in the same way that you would write unit tests for your application.
Brooke Hopkins [00:24:43]: Like I want to test all these things and then I expect it to run.
Adam Becker [00:24:51]: This is the question that I have. And I'll continue with the chat in a minute, but I do think it's a little bit more apropos right now. One of the reasons I feel like we often write unit tests as much more closed-form tests is because if something fails, I tend to have a clue as to why, and then I can go in and actually start to investigate and ultimately resolve it. When you go about testing in the more statistical sense, how do you nevertheless build up an intuition as to why certain errors occur?
Brooke Hopkins [00:25:28]: Yeah, I think the answer is probably twofold. It's similar to integration tests; we always recommend a setup similar to integration tests and unit tests. We're very focused on integration tests, which is how your system runs end to end. But you might also want to be running prompt evaluation, where you're iterating on a prompt and seeing how well you're producing expected outputs. That helps you target where in this pipeline things might be going wrong, based on intuition like, we're not following the path correctly, and this is the module that's responsible for following the path.
Brooke Hopkins [00:26:09]: The flip side of this is also being able to isolate different modules. So doing things like not simulating voice and understanding: is it a transcription error, or is it that we're actually not logically understanding what to do next? But I think this is similar to software engineering, where if you're getting a lot of errors on your server, you start to go down and isolate variables. It's a very similar engineering problem.
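To illustrate that isolation idea, a toy sketch that runs the same scenario suite with and without simulated audio; a large gap between the two pass rates points at transcription rather than dialogue logic. `run_suite` and the numbers it returns are purely illustrative assumptions.

```python
def run_suite(scenarios: list[str], simulate_audio: bool) -> float:
    """Placeholder: fraction of scenarios judged successful.

    With simulate_audio=False the suite feeds text straight to the dialogue
    layer; with True it also goes through text-to-speech and speech-to-text.
    """
    return 0.78 if simulate_audio else 0.92  # purely illustrative numbers

scenarios = ["book an appointment", "cancel an order", "update an address"]
text_only = run_suite(scenarios, simulate_audio=False)
with_audio = run_suite(scenarios, simulate_audio=True)

if text_only - with_audio > 0.05:
    print("Gap suggests transcription, not reasoning, is the main failure mode")
```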
Adam Becker [00:26:36]: Okay, so this carries over very nicely to the next question then; I suspect this is a nice transition. Lucas is asking, how do your system's layers communicate with each other? I believe this was when you were sharing the slide with the lidar, the camera, all the different sensors. He's asking, are you using custom-built or industry-standard protocols? The processing from sensor inputs to control, steering and throttle is impressive, is what he's saying. I think this gets at that a little bit: the modularity, and being able to then simulate different inputs and control these variables. Do you have some recommendations for how to go about doing that?
Brooke Hopkins [00:27:22]: Yeah, I think there's already a stack that's emerging around conversational AI, where you have speech to text, then you have LLMs generating the responses, and then text to speech. And those are pretty well understood, versus self driving, where there's a lot more complexity in how all these modules talk to each other. The other thing here is that, because you can isolate out these variables around just the LLM response completion, it's pretty easy to take out text to speech or speech to text. Where I think this will become really interesting, and I'm excited to see how we evolve there, is with OpenAI's new voice-to-voice model and all these new voice-to-voice models coming out. ElevenLabs also has their own voice-to-voice product now.
Brooke Hopkins [00:28:20]: How do you then, when you don't have visibility into how that text is being generated, isolate out those modules, or is it even necessary? I think also, as we're expanding into web agents, there's an interesting aspect there of how you generate web pages and kind of selectively do that. But I would say it's largely a much easier problem, just because you have fewer moving parts than you do in self driving.
Adam Becker [00:28:49]: Okay. So I suspect that Troy might take issue with that statement. He says, my Tesla with FSD (is FSD full self driving? Yeah, okay), my Tesla with full self driving has completed my last 20 drives without safety disengagement. Would you agree or disagree, having worked with both?
Brooke Hopkins [00:29:22]: You cut out just a bit, but I think I heard the question as: FSD has completed all of these drives and is really far ahead, so where is generative AI and language today?
Adam Becker [00:29:34]: That's right. And it seems to him that language agents are a much more difficult problem, because there are so many more possibilities. So yeah, would you agree or disagree that ultimately these agents are going to be a much more difficult problem to solve?
Brooke Hopkins [00:29:51]: I would probably agree and disagree. I think the advantage of self driving is that you do have a certain set of road rules that you can abide by. You know that everyone's supposed to stop at a stop sign; there are a lot of common conventions and also laws. In that sense it makes it easier. But you still have physics and the physical body of the robot, as well as operating in 3D space and all of the possible scenarios that can happen when driving. What happens when a kid in a costume jumps in front of your car? Can you recognize that as a child? All of these crazy edge cases. I think you're going to see a very similar scale of infinite possibilities in agents. And with voice and chat, as a starting point, you definitely have a lot more structure: you're only interfacing in one modality, going back and forth, usually one turn after another.
Brooke Hopkins [00:30:57]: It starts to become really complicated when you have multiple people speaking. Where I would agree is that I think there's a future where you have all of these agentic employees doing really complex tasks, like data entry and navigating different APIs to then execute a task, send emails, make calls, collect data: all of these tasks that we do today. And that most certainly is a much harder problem, one that I think we haven't even started to solve today.
Adam Becker [00:31:31]: I lost you for a minute. Can you still hear me? Am I still here?
Brooke Hopkins [00:31:36]: Yeah, I can hear you.
Adam Becker [00:31:38]: Where did I cut out? Here we are talking about packet losses and building reliable systems.
Brooke Hopkins [00:31:44]: Well, WebRTC is notoriously one of the hardest problems; I think it's still hard today, even after decades.
Adam Becker [00:31:54]: I'm dealing with that right now for my company, so it seems like a never-ending struggle. Last question here, and if you can, please stay on the chat afterwards; lots of folks have a bunch of questions there, so if you can just go back and engage with them. Jan is asking: humans seem much more, or are much more, erratic and multi-thematic in their conversations, especially in voice. How do you go about simulating that?
Brooke Hopkins [00:32:22]: One thing that we do, and this is also something borrowed from self driving, is actually re-simulating from logs. You can't take these transcripts exactly and just feed them back to your agents, because if your agent behavior changes, those responses no longer make sense. But simulating from logs means you try to follow that trajectory, and when your agent behavior changes, you can respond dynamically with an appropriate response. So we use that to simulate more of these erratic behaviors, where someone is saying something pretty unexpected, or they're saying very specific names or very specific cases that are throwing the agent off. That can be a way to create this development cycle of: as you find cases in production, re-simulate those and see if you can iterate to improve them. Because individual test cases are as important as the aggregate statistics. But yeah, definitely finding these crazy examples, where a dog is barking in the background and someone is describing the weather, which is totally irrelevant to the scenario, but then actually makes a request: how do you make sure that you're getting that right every time?
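A rough sketch of re-simulating from logs as described here: follow the logged caller turns while they still make sense against the new agent behavior, and switch to dynamically generated, in-persona responses once they don't. The coherence check and the improvisation would be LLM calls in practice; both are stubbed placeholders below.

```python
def still_coherent(logged_turn: str, history: list[str]) -> bool:
    """Placeholder for an LLM check: does this logged caller turn still
    make sense given how the new agent has responded so far?"""
    return len(history) < 2  # toy rule, for illustration only

def improvise_reply(history: list[str]) -> str:
    """Placeholder for an LLM answering in the original caller's persona."""
    return "Actually, Tuesday works better for me."

def resimulate_from_log(logged_caller_turns: list[str], agent_fn) -> list[str]:
    """Follow the logged trajectory while it still fits; respond dynamically
    once the new agent behavior diverges from the original conversation."""
    history: list[str] = []
    for logged in logged_caller_turns:
        caller = logged if still_coherent(logged, history) else improvise_reply(history)
        history.append(f"caller: {caller}")
        history.append(f"agent: {agent_fn(history)}")
    return history

demo_log = ["Hi, I need an appointment", "Monday morning", "Great, thanks"]
print(resimulate_from_log(demo_log, lambda history: "Okay, noted."))
```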
Adam Becker [00:33:36]: Brooke, thank you very much for coming and sharing this with us. Next on stage one, we have the great practitioner debate. So I'm going to direct everybody to go check that out and Brooke will see you in the chat. Thanks again.