Simulate to Scale: How realistic simulations power reliable agents in production // Sachi Shah
SPEAKER

Sachi is a Product Manager at Sierra, where she leads product for the Agent SDK, Developer Experience, and the simulation and testing framework. Previously, she was Head of Product at Semgrep, Director of Product at Lightstep and ServiceNow, and a Forward Deployed Engineer at Palantir. She studied Computer Science and Math at Wellesley College and holds an MBA from Harvard.
SUMMARY
In this session, we’ll explore how developing and deploying AI-driven agents demands a fundamentally new testing paradigm—and how scalable simulations deliver the reliability, safety, and human feel that production-grade agents require. You’ll learn how simulations allow you to:
- Mirror messy real-world user behavior (multiple languages, emotional states, background noise) rather than scripting narrow “happy-path” dialogues.
- Model full conversation stacks including voice: turn-taking, background noise, accents, and latency – not just text messages.
- Embed automated simulation suites into your CI/CD pipeline so that every change to your agent is validated before going live.
- Assess multiple dimensions of agent performance—goal completion, brand compliance, empathy, edge-case handling—and continuously guard against regressions.
- Scale from “works in demo” to “works for every customer scenario” and maintain quality as your agent grows in tasks, languages, or domains.

Whether you’re building chat, voice, or multi-modal agents, you’ll walk away with actionable strategies for incorporating simulations into your workflow—improving reliability, reducing surprises in production, and enabling your agent to behave as thoughtfully and consistently as a human teammate.
TRANSCRIPT
Sachi Shah [00:00:05]: Awesome, thanks so much. Yeah, super excited to chat about simulations today and how they help with agents. Maybe before I jump in, I'll just very quickly introduce myself and Sierra as well. So I'm Sachi, I'm a product manager. I started my career in engineering, spent the last several years working in product at developer tools companies, and then more recently moved over to Sierra at the beginning of this year. Sierra is based in San Francisco with offices all around the world, and we're building AI agents for the customer experience. What that means is, if we take an example like Sonos, the speaker company: if you call Sonos or chat with Sonos, the front door to that customer experience will be a Sierra agent, where Sierra will essentially figure out what your request is and try to help you through the integrations we have, RAG, and all of the agent capabilities we have. So if you call in and you're like, hey, I ordered this pair of speakers, help me troubleshoot them.
Sachi Shah [00:01:09]: The agent will walk you through that. And then if you ultimately want to return that speaker, the agent will go ahead and actually process that return for you. So very much front door customer experience. And as you can imagine, we work with companies across different industries, but especially in e-commerce and retail, and especially with peak season coming up, a lot of companies are interested in, hey, how do we actually test these agents at scale, right? We're putting these things out there in the wild in front of all of our customers as the front door. We need to make sure they're actually reliable, and not just up and running, but actually behaving the way we expect. So this brings me to my presentation and how we solve for this problem at Sierra. I think taking a step back, in the world of traditional software, over the last 20 years or so, we've developed all of these frameworks that make it easier to test your software, right? You have your unit tests, your integration tests, your smoke tests and so on. And you can run these tests at different points in your CI/CD pipeline. And of course the speed has gone up, the efficiency of these tests has gone up, we've shifted left so you can have these run in pre-commit hooks and so on, and all of that applies to agents.
Sachi Shah [00:02:27]: But there's also this new challenge, right? Because agents ultimately are non-deterministic, and so unlike traditional software, the same inputs to the agent don't always yield the same outputs. And you need to test for all of these different paths and variations that your agent can go down. And I think one of the arts here, and we spoke briefly in the last talk about context engineering, one of the arts with actually building an agent that sounds human and actually gets the job done, versus just following particular steps, is not programming those steps in. It's actually just making sure the agent has all of the context: the goals, any policies or guardrails it should follow, any conversation history or memory it should adhere to, and so on. Right. But if you're giving the agent so much flexibility so it can do its job, and it can do its job by reasoning and improvising, that does mean that the paths it can actually go down are truly infinite. And so you need a way to test through all of these variations.
Sachi Shah [00:03:26]: Right. I think another important thing to think about in the world of agents is when you're writing a test. If you're thinking of a traditional unit test, you'd check things like: did it actually return the right value? But with an agent, you don't want the criteria to be that specific, because the agent might actually be creative and do something else that's totally within the policies of your company. Right. So, for example, in the Sonos example that I'd shared, say I've called in and I want to return my speakers. A simplistic thing to do would be: oh, the agent should process the return. But what if the agent offers some kind of promotion instead? Or actually successfully helps me troubleshoot my speakers to the point where I don't need to return them anymore? Right. So ultimately you just need to test for: hey, did this agent actually help the user achieve their goal or not? So it's fundamentally a very different flavor of testing that you're trying to do as well.
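To make that concrete, here is a minimal sketch (an illustration, not Sierra's API) of the difference between a brittle, step-level assertion and outcome-level criteria that a judge model can grade, where any in-policy resolution of the user's goal counts as a pass.

```python
# A brittle, step-level assertion: it assumes the only acceptable resolution
# is processing a return, so a successful troubleshoot or a promotion would
# "fail" even though the customer's goal was met.
brittle_check = "The agent called the process_return tool for order 12345."

# Outcome-level criteria: any path that resolves the customer's underlying
# goal within company policy should pass.
outcome_criteria = [
    "The customer's speaker issue was resolved to their satisfaction",
    "Any resolution offered (troubleshooting, a promotion, or a return) stayed within company policy",
    "The agent did not make commitments outside the documented return policy",
]
```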
Sachi Shah [00:04:19]: And then, of course, as Demetrios mentioned earlier, we have agents in voice, chat, messaging, email, and so on. And I think the list of modalities, especially with all of the announcements and the changes in this new, exciting time we're living in, will just be endless. And so at Sierra, we truly think of the conversation itself as the interface. And so anytime we're testing our agent, we want to make sure that the simulations and the tests we have actually work across all of the modalities, in this build once, test anywhere, deploy anywhere kind of philosophy. And then of course the agent is ultimately speaking with customers in the wild. So there's a lot of variation in terms of where people might be, the emotional state they're in, the language they're speaking in, and so on. So I would love to jump into how we've solved this problem: how do you test agents in these infinite scenarios? Right.
Sachi Shah [00:05:16]: And no surprise, we've solved it through simulations. We've taken a page out of the playbook of self-driving car companies, airlines and so on, and we've really rethought the traditional software development life cycle into what we're calling the agent development life cycle. So this might look very familiar to all of you that have been working in this industry for many years, but really what we're doing over here is starting with the top left. You of course have to write your software. So you build the agent not by spelling out, okay, do A, B and C, but by really defining the goals and guardrails that you want it to adhere to. From there you move on to your testing phase, which is running simulations and regression tests and making sure the quality bar is maintained, and finally you go on to actually release your changes and look at how those changes are performing in the real world as well. Hopefully they mimic what you saw in your simulation phase, and then ultimately you're finding areas to optimize, because there's no way pre-release you're going to catch every single thing that might come up.
Sachi Shah [00:06:22]: There will always be room to optimize, and then you continue this iteration loop from there on. Right. So I would love to just dive deeper into the simulation phase. Once you've built the initial version of your agent, or maybe a big change to your agent, before you actually release it, what should you go and do? So this is where we've built simulations. What you can see over here is you can set up this entire test suite or simulation suite within a product like Sierra and ultimately run these automated conversations with mock user personas. You're placing these mock user personas in different scenarios and environments and using that to actually test against your agent. You could do a bunch of this manually, but there could be, like I said, infinite variations. So it would be extremely time consuming and you're not going to end up testing the same use cases every time.
Sachi Shah [00:07:16]: So one of the benefits of spinning up a suite like this is that it can also serve as a regression suite. Once you've defined your suite over here, anytime somebody I'm collaborating with comes and makes a change, I can rerun, or they can rerun, the entire simulation suite and make sure no regressions have been introduced. So beyond saving the manual work, there are also efficiencies and optimizations, and you're encoding what you want your source of truth in terms of test behavior to be. You can see a sample chat-based simulation on the right hand side of the screen over here, to give you a flavor of what these actually look like. And like I mentioned, this also enables repeated, scalable testing and handles all of that variation that you might see. So going a level deeper: what is actually happening in each of these simulations? Right. You can think of this as essentially three components to every simulation.
Sachi Shah [00:08:11]: The first component is the mock user. So there's this whole separate agent that we have just to be the user in this case. And the instructions that you give that user agent are: who are you? You might be Sachi calling in from, say, Germany and speaking in German. What device have you called in from? You might be on your laptop, or maybe you've called in via the phone, right? Why have you contacted support? So in this case you can see someone is contacting support because they're traveling, they've lost their credit card, they're out of the country and they need a replacement urgently, right? So the whole goal for this mock user calling in is that they need to get a credit card replacement. And it's super, super important that you give the user persona detailed instructions, because you're minimizing the variance in the conversation, and it'll help you test exactly what it is that you're looking for. Then what happens is the mock user ultimately ends up interacting with your actual agent. And that of course produces a conversation.
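As a rough illustration of what those persona instructions can look like in practice, here is a minimal sketch in Python. It assumes a hypothetical framework (this is not Sierra's SDK); the point is simply that the mock user is itself a prompted agent defined by who they are, how they contacted you, and what goal they are trying to accomplish.

```python
from dataclasses import dataclass, field

@dataclass
class MockUserPersona:
    name: str
    language: str                 # e.g. "de" for a caller speaking German
    channel: str                  # "voice" or "chat"
    situation: str                # the scenario the persona is placed in
    goal: str                     # what "success" means from the user's side
    style_notes: list[str] = field(default_factory=list)

    def to_system_prompt(self) -> str:
        """Render the persona as the system prompt for the user-side agent."""
        notes = "; ".join(self.style_notes) or "none"
        return (
            f"You are {self.name}, contacting support via {self.channel} in {self.language}. "
            f"Situation: {self.situation} Your goal: {self.goal} Style: {notes}. "
            "Stay in character and end the conversation once your goal is met."
        )

# The lost-credit-card scenario from the talk, expressed as a persona definition.
lost_card = MockUserPersona(
    name="Sachi",
    language="de",
    channel="voice",
    situation="You are traveling abroad and your credit card was lost.",
    goal="Get a replacement card shipped urgently to your hotel.",
    style_notes=["slightly stressed", "speaks quickly"],
)
```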
Sachi Shah [00:09:18]: So you can go and see the conversation transcript. And finally we actually have this third agent running, which is where you see these expected agent behaviors that we have defined. So the third agent is the judge agent, and what it does is it looks at the entire conversation transcript, it looks at why the user was calling in, and it grades the conversation based on: hey, did the agent actually do what it was supposed to do or not? And so these expected agent behaviors are also super important to codify well, because ultimately all of these are going through LLMs and you want to make sure the judge can actually understand them. So one example for the expected agent behaviors is we suggest not having OR conditions in there, like "the agent should do A or it should do B" or "it should do A and it should do B"; just split those up into separate criteria. That way it's very easy for the judge agent to go through these and actually determine whether your simulation is passing or not. So just taking a step back, the idea is you could set up hundreds, even thousands, of these and run them every time you're making a change, and you want to set these up in a way where, through the user agent, you're actually specifying some of these golden paths and common use cases that your agent might go down. So for an e-commerce company, maybe there would be things like someone's trying to return an order, but user A knows their order number, so it's kind of the smooth path. User B just does not know their order number, so you might have to look up their order through their email address or something.
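Sketching that judge step (again with a hypothetical call_llm() helper rather than any real SDK): the expected behaviors are kept atomic, with no compound "A or B" criteria, so each one can be graded independently against the full transcript.

```python
# Compound criterion (harder for the judge to grade consistently):
#   "The agent verified the customer's identity or escalated to a human."
# Split into atomic criteria instead:
expected_behaviors = [
    "The agent verified the customer's identity before discussing the card",
    "The agent confirmed the shipping address for the replacement card",
    "The agent offered expedited shipping because the customer is traveling",
]

def call_llm(prompt: str) -> str:
    """Placeholder for your LLM client of choice; expected to answer PASS or FAIL."""
    raise NotImplementedError

def judge(transcript: str, behaviors: list[str]) -> dict[str, bool]:
    """Grade each expected behavior independently against the conversation transcript."""
    results = {}
    for behavior in behaviors:
        verdict = call_llm(
            "You are grading a support conversation.\n"
            f"Transcript:\n{transcript}\n\n"
            f"Criterion: {behavior}\n"
            "Answer PASS if the criterion is satisfied, otherwise FAIL."
        )
        results[behavior] = verdict.strip().upper().startswith("PASS")
    return results
```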
Sachi Shah [00:10:49]: And user C thinks they know their order number, but they actually have the wrong one. So how does the agent gracefully manage that situation as well? The idea is you run all of these before making your changes live, to get that confidence and release knowing that your agent is doing what you said you wanted it to do. Then comes the next challenge, right? So first we built these chat-based simulations, and then we thought about, okay, how can we actually bring this to voice as well, in that same thinking of building your agent once and deploying it across any channel or modality. Now voice, especially when you think of the user persona, adds a whole new degree of complexity in that emotions come into play a lot more. The user might be calling in and they might be super angry because they just missed a flight, or something has happened to them, or their order got stolen from their doorstep. Right. So can your agent actually respond with empathy, but also not be so empathetic that it doesn't get to the meat of it and solve the issue as well? So this lets you test for some of these softer qualities of the agent.
Sachi Shah [00:11:59]: The user might have a ton of background noise. They might be speaking with a heavy accent. They might be speaking super fast or super slowly. So you want to make sure the agent actually performs well in all of these different scenarios too. Voice is also just inherently more complex than chat, because you have speech-to-text, then the reasoning that goes on, then text-to-speech. Maybe you're running a speech-to-speech model. So you generally want to test the entire stack, which goes much further and deeper than a traditional chat-based conversation. So for example, if you have transcription, you might want to make sure that if I read out a super complex number with a ton of background noise,
Sachi Shah [00:12:42]: the agent is able to transcribe it successfully. Or on the synthesis front, things like: can the agent actually read back a complex number, like a birth date or a license plate number, accurately as well? So it lets you test for all of these things. And then also the agent might have genuinely different behaviors, right? Like in voice, you can't send around magic links. You might need someone to authenticate by using an OTP, or by saying something, or something along those lines. So the logic that actually powers your agent might be totally different for a different modality. So it's important to test all of those things as well. But when we're chatting about voice simulations, architecturally this was also relatively complex to build out. Ultimately, what we've done is we've built these dual voice loops.
Sachi Shah [00:13:34]: Right. So on one hand you have your traditional voice loop that your agent is using, where you're running your actual agent itself. And then you also have this simulated user loop that goes on. So when we actually have that mock user speaking with your agent, what we're doing is we're injecting things like background noise, and we're also doing it smartly. So if the user's calling in from an airport, we make sure that there's airport-related background noise, to actually test for more of these real-world use cases. So what we're doing is we're getting the mock user to actually generate audio, then muddying that audio with things like background noise, and also instructing the user to speak fast, to speak in a different language, and so on. And then what we do is we start streaming bits of this audio to the Sierra voice loop, where your agent is actually able to transcribe it, understand it, generate the response, and send the response back. And this ultimately keeps happening until the conversation ends.
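A rough sketch of that dual-loop shape is below. The synthesize(), mix_background(), and agent_voice_loop pieces are hypothetical stand-ins, not Sierra's implementation; the point is that one loop plays the simulated user (generating and muddying audio) while the other is the agent under test, with audio streamed between them chunk by chunk until the conversation ends.

```python
# Hypothetical helpers -- swap in your own TTS, noise mixing, and agent client.
CHUNK_BYTES = 6400  # ~200 ms of 16 kHz, 16-bit mono PCM

def synthesize(text: str, accent: str, speaking_rate: float) -> bytes:
    """Hypothetical TTS for the mock user's turn."""
    raise NotImplementedError

def mix_background(audio: bytes, scene: str) -> bytes:
    """Hypothetical noise injection, e.g. scene='airport' adds boarding announcements."""
    raise NotImplementedError

def run_voice_simulation(user_agent, agent_voice_loop, scene: str, max_turns: int = 30):
    """Simulated-user loop: speak, stream to the agent loop, listen, repeat."""
    transcript = []
    for _ in range(max_turns):
        user_text = user_agent.next_utterance(transcript)
        if user_text is None:  # the mock user's goal is met (or they gave up)
            break
        # Generate the user's audio, then muddy it with scene-appropriate noise.
        audio = mix_background(
            synthesize(user_text, accent=user_agent.accent, speaking_rate=user_agent.rate),
            scene=scene,
        )
        # Stream the audio chunk by chunk into the agent's voice loop, which
        # transcribes it, reasons, and returns the agent's reply (as text here).
        chunks = (audio[i:i + CHUNK_BYTES] for i in range(0, len(audio), CHUNK_BYTES))
        agent_reply = agent_voice_loop.stream_turn(chunks)
        transcript.append(("user", user_text))
        transcript.append(("agent", agent_reply))
    return transcript
```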
Sachi Shah [00:14:34]: So you can think of it as building almost a dual-agent architecture in some of these cases. And as I wrap some of these insights up: ultimately we built these simulations, but of course the next challenge becomes how you actually run these at scale. Right. I mentioned at the beginning of the talk that in traditional software testing, we've optimized the speed for these tests to run in CI/CD and so on. So it's kind of the same thing with simulations and agents. So a few other things we've done to help with this challenge. One, a common problem with software is a lot of areas of your code base just go untested, because ultimately it's time consuming to spin up a test suite.
Sachi Shah [00:15:17]: And spinning up simulations is even more complex. One of the main things that we did at the very beginning was making it easy to automatically spin up a simulation suite. What this means is, as a user, you can upload any sources of truth that your agent might have, like knowledge bases, conversation transcripts, SOPs and so on, and we automatically generate that simulation suite based on those inputs that you've given. And then from there you can go on, edit and tweak it and so on. We've also made it possible, and I think this next piece is super important, to actually integrate these simulation runs within your CI/CD pipelines. So for example, you can set up GitHub Actions, use it to invoke a command line, and have different checks like: I want this set of simulations to always run pre-release, or pre-commit, so my main branch always remains clean, and so on. And this is super important because if you do it too late in the process, it's just going to slow down the entire development life cycle. With testing, it's always a balance of velocity and reliability.
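As one way to picture that CI/CD integration, here is a minimal gate script under assumed names: run_suite() and the "pre-release-regression" suite are hypothetical, not Sierra's actual CLI. A GitHub Actions step would simply run a script like this on pull requests, so a failing simulation blocks the merge.

```python
import sys

def run_suite(suite_name: str) -> list[dict]:
    """Hypothetical: runs every simulation in the named suite, returns per-simulation results."""
    raise NotImplementedError

def main() -> int:
    results = run_suite("pre-release-regression")
    failures = [r for r in results if not r["passed"]]
    for failure in failures:
        # Assumed result fields, for illustration only.
        print(f"FAILED: {failure['simulation']} -- {failure['unmet_criteria']}")
    print(f"{len(results) - len(failures)}/{len(results)} simulations passed")
    return 1 if failures else 0  # a non-zero exit code fails the CI check

if __name__ == "__main__":
    sys.exit(main())
```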
Sachi Shah [00:16:22]: Like I mentioned, we're optimizing for different modalities across voice, chat and so on, and actually letting you retroactively understand what agent behavior looks like as well. And so today we focused a lot on the testing use case, but as you can imagine, you can also use these simulated conversations for more traditional quality assurance, if you want to get a sense of how your agent is doing before it goes live through some kind of manual QA, and also for things like predictive analytics. So if you want to know, hey, what is the latency of every turn for a voice conversation with my agent? Is that something I need to optimize? You can start gleaning these insights before your agent ever speaks with your customers. So I hope you enjoyed that talk and learned a thing or two about why testing agents is difficult and some of the ideas that we've had to actually address this. Apologies for the video clip as well, but it's on the Sierra website if you're interested.
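To illustrate that predictive-analytics angle, here is a small sketch that mines simulated conversations for per-turn latency before the agent ever talks to a real customer. The log schema (speaker, started_at_ms, ended_at_ms) is an assumption for the example, not a real format.

```python
from statistics import median, quantiles

def turn_latencies_ms(conversation: list[dict]) -> list[float]:
    """Time from the end of each user turn to the start of the agent's reply."""
    latencies = []
    for prev, curr in zip(conversation, conversation[1:]):
        if prev["speaker"] == "user" and curr["speaker"] == "agent":
            latencies.append(curr["started_at_ms"] - prev["ended_at_ms"])
    return latencies

def summarize(conversations: list[list[dict]]) -> dict:
    """Aggregate per-turn latency across all simulated conversations."""
    all_latencies = [l for convo in conversations for l in turn_latencies_ms(convo)]
    return {
        "p50_ms": median(all_latencies),
        "p95_ms": quantiles(all_latencies, n=20)[-1],  # 95th percentile
        "turns": len(all_latencies),
    }
```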
Demetrios Brinkmann [00:17:17]: Excellent. I love that there's a lot of questions in the chat, so I'm just going to get started with the top rated ones. First one is: is there a way to evaluate an LLM more deterministically, or are we always going to have an LLM that evaluates an LLM that evaluates an LLM's output?
Sachi Shah [00:17:41]: That's a good question. I think you can do things more deterministically if you want, but I also genuinely think that the solution to AI is more AI, because when you're faced with this level of non-determinism, there are just so many paths your agent can go down. I think the most effective way to actually test these is through more AI. And I think the really important thing over here when we're doing this is you actually want to set these criteria and make sure the judge agent gets enough fine-tuning, if you will, and you're setting it up with as much care as you would set up your actual agent.
Demetrios Brinkmann [00:18:22]: Do you have humans? Because one question from Sirop in here is saying who is evaluating the chat simulations in terms of scores that are being given out. Is that a human or is it another AI?
Sachi Shah [00:18:37]: So it is another AI. It's this LLM-as-a-judge that we have. But a human can if they want to. So maybe two examples. If as a human, say you're a customer building an agent on Sierra, you have access to all of these conversation logs for every simulation in the full history. So it's easy enough to go in and be like, okay, great, my judge was right. Oh wait, actually my judge was wrong in this case.
Sachi Shah [00:18:59]: I want to go and update the judge. So you can do that and tweak things along the way, and we have workflows for that. And another thing is, if you're actually trying to release a new version of your agent, you see which simulations have failed. So you can actually go and look at those and edit those. And in some cases, you know, maybe you're fine with the failure and you really just need to get something out.
Demetrios Brinkmann [00:19:21]: And that's also fine because it's Friday afternoon and you just want to ship.
Sachi Shah [00:19:25]: Exactly.
Demetrios Brinkmann [00:19:28]: I'm just joking, except when I'm actually serious. So how much efficiency gain in response have you seen by using simulations?
Sachi Shah [00:19:39]: Sorry, how much what? Gain in what?
Demetrios Brinkmann [00:19:41]: Sorry, efficiency gain. Or is there a certain metric that you look at differently to recognize that these simulations are valuable?
Sachi Shah [00:19:52]: You know, I would say, just as an industry, we're still in relatively early days. It's similar to testing: with traditional software, it took a long time for us to get to SLOs and tie those back to, okay, are my tests actually effective, and so on. So I think we're starting to enter that phase in the industry, but we're not quite there yet in terms of, hey, this is the exact metric to tell you whether this is helpful or not. But a few things we do look at: is your agent performing well in the real world if you have run simulations? Right. So, fewer issues coming up in real conversations. We also track, of course, traditional metrics like containment rate and so on that are more relevant to the customer experience industry. So is your agent actually hitting its goals once it's live, assuming you've run those tests beforehand? And then the whole question of whether this actually makes shipping software, or agents, faster or slower, it's very fuzzy, but there are some parallels you can draw there as well.
Demetrios Brinkmann [00:20:50]: Incredible. Sachi, thank you so much for joining us. As you know, I'm a huge fan of simulations. I'm a huge fan of what y'all are doing. This has been awesome. I really appreciate you coming on here.
Sachi Shah [00:21:03]: Thank you so much for having me.

