MLOps Community

Fast, Trustworthy and Reliable Voice Agents: MLOps That Blend LLM Annotation with Human QA // Erik Goron // Agents in Production 2025

Posted Jul 30, 2025 | Views 19
# Agents in Production
# Voice Agents
# LLM
# HappyRobot

SPEAKER
Erik Goron
ML / AI Engineer @ HappyRobot

Erik is an ML/AI Engineer at HappyRobot, where he focuses on building voice AI agents for real-world applications. With a strong background in training, deploying, and optimizing ML speech models, Erik has contributed to the development of AI solutions across various domains, including logistics and healthcare. Prior to HappyRobot, Erik worked at Maaind, Huawei, and Homni Health, where he led the development of speech emotion recognition systems and AI-powered healthcare applications. He holds a Master's degree from ETH Zürich.


SUMMARY

Voice agents live or die on latency and trust. I’ll share how HappyRobot’s MLOps pipeline turns raw production audio into high-accuracy, low-latency models:
1. Synthetic labels first: we generate large-scale annotations with reasoning LLMs.
2. Human in the loop: a targeted subset of samples is reviewed by human annotators to correct drift and refine prompts (DSPy-style).
3. Distill & specialize: small, domain-tuned models are fine-tuned via LoRA/distillation.
We’ll walk through our MLOps stack, from observability to AI-assisted data generation and model fine-tuning and optimization.
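
As a hedged illustration of the "distill & specialize" step, here is a minimal LoRA sketch using Hugging Face's transformers and peft libraries; the base model, target modules, label count, and hyperparameters are placeholders for illustration, not HappyRobot's actual setup.

```python
# Minimal LoRA fine-tuning sketch (illustrative only; model, labels and
# hyperparameters are placeholders, not HappyRobot's actual pipeline).
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForSequenceClassification, AutoTokenizer

base = "distilbert-base-uncased"  # hypothetical small base model
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForSequenceClassification.from_pretrained(base, num_labels=2)

# Wrap the base model with low-rank adapters so only a small fraction of
# parameters is trained on the domain-specific, LLM+human-labeled data.
lora_cfg = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.1,
    target_modules=["q_lin", "v_lin"],  # DistilBERT attention projections
    task_type="SEQ_CLS",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()
# ...then train with a standard Trainer on the distilled / annotated data.
```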


TRANSCRIPT

Erik Goron [00:00:08]: Yeah, thank you very much for the invitation. I'm Erik, I'm a machine learning engineer here at HappyRobot, where we're building AI workers that can do all sorts of tasks. And one of our main focuses is, of course, voice. We've been building voice for a few years now. I'm really excited to talk about voice AI, to share our story and how we've been improving our voice experience overall. I think the last conversation was all about voice agents, so I'm not going to give you a detailed explanation of how these systems work. As you know, there are these cascading architectures where we have a bunch of different models playing and interacting together that need to be really fast and really performant.

Erik Goron [00:01:04]: But we also have now these speech-to-speech models. Right now we really believe that the cascading models are production ready, and we're slowly also transitioning and thinking about these speech-to-speech models. But the main discussion will be more about the cascading ones and how we can evaluate them. So, a quick recap or storyline of how we got here at HappyRobot. We built an initial prototype: the founders, a couple of years ago, built this voice orchestration platform that's not based on any framework like LiveKit or Pipecat; we built it ourselves. And that's when we got really excited, because a lot of customers were like, oh, this has a lot of value. We got a lot of wows from people around, a lot of validation, and we kept iterating.

Erik Goron [00:02:00]: There was much more about Vibe metrics. Of course, the foundation models were extremely good and we were making sure that we were using the best ones. But the orchestration part itself, it was a lot about, you know, how we felt when we talked to the agent and we did a lot of hacks in order to improve the model. And then slowly we transitioned towards a more production grade and scalable solution. Because we started scaling. We have almost a million minutes of calls happening in our platform. So we have a lot of data and we can, we can measure and monitor a lot of things, right? So that's when we slowly transition towards, okay, we were going to be measuring, we're going to be a metrics driven company, we're going to be measuring the system as a whole with something that we call like North Star metrics, which can be something like, for example, just user frustration, right? Or like, which are metrics that are much more aligned with like, what are our business objectives? Right? Then we also have something that we Call targeted metrics, which are typical ML metrics, to evaluate each individual model by itself. For example, the transcriber.

Erik Goron [00:03:19]: We want to make sure the word error rate is as small as possible. But our main idea now is to reverse the order: we want to make sure we understand the North Star metrics first, and then we can prioritize the features and which models we want to improve based on those. Because you can easily go crazy into, okay, let's build the best transcriber out there. But does it really make sense at the end of the day? Because, as we can see, the LLMs can actually compensate for a lot of mistakes from the transcriber. I want to talk a bit about interruptions, which is one of the metrics we've been optimizing a lot at HappyRobot, and it's one of the most critical and hardest things to improve in voice agents in general. As you may know, there are these models called turn detectors, which predict whether the bot should talk right away or wait a little bit, right? I know in the last conversation they were talking about semantic VAD.

Erik Goron [00:04:32]: So a semantic bad would be exactly this model, right? Would be a turn detector. But you have all sorts of models for that purpose. You have models that are text based. You also have models that are audio based or multimodal. And we've been thinking really about this problem through a lens of we want to make sure that the model has a lot of awareness of the context because as we may know in a conversation, it depends a lot on the flow of the conversation and of the use case. If you're talking on a customer service, it will change a lot, the pace and the interruptions than if you're talking with your friend, right? So that's when we started thinking much more about the MLOPs or the thinking about fine tuning our models to more specific domains or use cases. What we did was originally we started with heuristic models, as most people probably have tried, like for example, just the silence of the VAT is a great indicator of the sentence being finalized. We also have fillers or punctuations.

Erik Goron [00:05:55]: These are heuristics that are really important and they tell us a lot of information about the end of the sentence. And then we started exploring these more ML based models that are multimodal or that, for example, the semantic VAT also. But the way we approached this was much more in an mlops manner, which is, okay, let's get data from production, let's Observe it. Let's have this monitoring tooling and let's try to find a way to fine tune models for each specific use case or at least try to understand this problem of having different sorts of interruptions in different scenarios. So one critical point that we, one point that it was really relevant for us was the annotation part. To build these models, you need data and you need to make sure that the data is accurate. There was a combination there of using humans to annotate these samples, but also using LLMs as annotators. And by aligning them through, through basically starting building test sets with humans and then aligning what the prompts from the LLM annotators with the human signal, like DSPY kind of things, we managed to create really accurate data sets across many use cases.

Erik Goron [00:07:26]: And we kind of closed the loop of improving these turn detectors for different use cases. So this is about interruptions and here we're mainly talking about the turn detector model. But it's also really important to think that there's a lot of different things that can impact interruptions. So it's not only turn detection, it can also be, for example, the latencies of the LLM or the accuracy of the vad. So there's a lot of components that also play into the interruption metric. What we're trying to build at Happy Robot right now is a much more holistic platform for evaluations and for monitoring all of these sorts of issues that are happening. We have from one side, these North Star metrics, which are from a lot of times we're using these LLM as judges to annotate transcripts or to annotate audio. And we need to make sure that these North Star metrics are well defined and are aligned with human annotators.

Erik Goron [00:08:41]: And then we also have a bunch of ML targeted metrics like worldwide rate or the turn detector accuracy. And they both play together. So that's why we are thinking about. And we've been building this monitoring platform for voice agents, but in general it's also for all sorts of agents where we can track down what's happening, what are the issues, and which are the models that are impacting these issues. So there's a lot of things related to building dashboards to observe, to understand which are the calls which are failing. A lot of things also about labeling. We need to make sure there's human signal to align the LLMs as judges in these scenarios. Then there's the next part which is, okay, let's try to clean that data and try to use it to improve our transcribers, to improve how our LLMs perform and et cetera, that.

Erik Goron [00:09:43]: That's what we're trying to build at Happy Robot is this ML monitoring tower. And yeah, we're really excited because it's a new field and it's extremely promising. And yeah, that's pretty much it. So we're Happy Robot and we are hiring across the board. So if you guys are excited about what we're building and you want to join us, please reach out because we need hands and we need to build cool shit. So, yeah, please reach out and let's build.

Adam Becker [00:10:20]: Nice. Erik, thank you very much. I want to see if we have folks in the audience that have any questions. And until those come up, you still have a few more minutes, so I would love to go a little bit deeper into some of these slides with you.

Erik Goron [00:10:36]: Okay, let's go.

Adam Becker [00:10:40]: So one of the things that I believe people might want to further understand is. Oh yeah, you're pulling up the slides, right? Yeah, they're off the stage. Nice. Okay, cool. Yeah, let's just go all the way to the top. So. Yes. Okay.

Adam Becker [00:11:03]: So in the last conversation we heard From Peter from OpenAI and Brooke was sort of, I believe, like assuming that people are using, let's say like these end to end speech to speech models that come, let's say, out of the box. Right? So let's say they, you know, you're using the, you know, the OpenAI voice API to what? And you know, many of those already come with some of these modules already built in. Right. They have the vad. Yes, right. And VAD for people that don't know. Can you just. That was the.

Adam Becker [00:11:41]: What is that?

Erik Goron [00:11:41]: Voice activity detector. So it tells you if there's some. If there's speech happening or if there's silence. Right.

Adam Becker [00:11:49]: So they do this, they do the turn detection as well. Right. And so what a lot of people might do if they're just using it out of the box is do they have to already think about things at this level or is this the level that you need to think about it? Only if you're trying not to use some of these things out of the box and to build your own.

Erik Goron [00:12:10]: Yeah, that's a good question. And it's a question we ask ourselves sometimes. But I think what this cascading system gives you, well, from one side. So it's much cheaper also right now, audio tokens are really expensive also we know that currently these speech to speech models are not as good as these LLMs for tool calling. So there's a lot of things that are still not there yet, we believe, but outside of that, we think that with this cascading system, you have much more control and much more visibility on what's happening in all parts. And you, you don't observe it as a black box because you can monitor each individual part. So I think eventually we're going to transition to speech, to speech once it's good enough. And we can actually also build observability things on top of it.

Erik Goron [00:13:07]: So we need to know what's happening inside, what are we transcribing, etc. But I think for the time being, and probably for at least I believe for the next year, these architectures are really powerful and they provide a lot of usefulness.

Adam Becker [00:13:27]: The idea is there's a lot of benefits to building these things on your own. Would you recommend that somebody who starts out for, let's say just to test if there's any demand for their product, they start by already engineering these things themselves, or if is it best to really begin to think about how to unpack existing off the shelf commodity models only once you hit a particular scale, and then costs and transparency and visibility and debugability become more important. At what point should people really be building these things on their own?

Erik Goron [00:14:05]: Well, I think nowadays there's a lot of toolings that a lot of frameworks like Live Kit and Pipe Cut, where you can really easily build these things, I would honestly suggest that they build their own. It's true that there's a lot of noise because, for example, for the transcriber, there's so many transcribers. Which one should I choose for which language? It's true that if you want to build a good model, a good system, you need to spend a lot of time on each part. But yeah, I believe that with live kit and PipeKat you can go really, really far already.

Adam Becker [00:14:44]: Yeah, let me second that. I've been using PipeCat recently and I've been pretty amazed and surprised by almost the seeming maturity of this type of frame. I mean, I thought things would be much more early and I feel like they've done excellent work already. And yeah, just plugging in many of these modules I think is much more straightforward than somebody who might just first be introduced to this would ever kind of imagine. So definitely seconding that. What's the next thing people might want to. Let's imagine that you're already putting this into place. Next you need to build some type of guardrails.

Adam Becker [00:15:26]: This is A question from Ravi in the chat. How do you think about implementing guardrails in voice agents?

Erik Goron [00:15:37]: Well, with these systems, we use the basic LLM guardrails because that's the only part that we need to actually guardrail. So at least from my understanding, what we're focusing more is on guardrailing the brain of the agent, which is the LLM. Like the TTS itself, sometimes it hallucinates, but it's not often. So we really focus on LLMs. And these are, we use the most classical ways to guard rail LLMs.

Adam Becker [00:16:05]: You know, Eric, how can people stay in touch with you?

Erik Goron [00:16:10]: Yeah, of course. Reach out, reach out. Hit me on LinkedIn, on X or Twitter.

Adam Becker [00:16:16]: Do you want to put your, do you want to go back, I think to the last slide, I think where you have maybe your email and LinkedIn.

Erik Goron [00:16:23]: And join us if you're excited and you want to. Like, we're really using the latest technology and with like we're, you know, we're a product and enterprise company, but we are extremely tech savvy and we, we are, we're building all of that ourselves. Like, we're not even using Live Kit and Pipeket. Like we have built really, really cool shit. So please join us.

Adam Becker [00:16:44]: Eric, thank you very much. Thank you for the energy. Folks that have questions or would like to join him, find him and his LinkedIn. Maybe I'll drop your LinkedIn in the chat below. I want to reach out to you, I want to show you what I've been building and pick your brain about it just as well. So stay tuned for that.
