Building Real-Time, Reliable Voice AI: From Simulation to Production // Brooke Hopkins | Peter Bakkum // Agents in Production 2025
SPEAKERS

Brooke Hopkins is the Founder and CEO of Coval, where her team is building the enterprise-grade reliability infrastructure for conversational AI. Coval provides simulation, observability, and evaluation tools that help companies rigorously test and monitor voice and chat agents in production.
Previously, Brooke led evaluation job infrastructure at Waymo, where her team was responsible for the developer tools that launched and ran simulations to validate the safety of autonomous vehicles. This work was foundational to scaling Waymo’s simulation capabilities and ensuring the reliability of self-driving systems.
Now, she’s applying those learnings from testing non-deterministic self-driving agents to the world of voice AI, bringing proven simulation and evaluation methods to a new class of complex, real-time AI systems.

Peter Bakkum leads the API Multimodal Team at OpenAI, where he built the Real-time API powering the next generation of voice AI agents. His work makes it possible for developers to create responsive, conversational AI experiences that combine voice, vision, and language—enabling products that feel fast, natural, and interactive.
At OpenAI, Peter’s team was instrumental in shipping the infrastructure behind GPT-4o’s real-time voice capabilities, bringing latency down to human-like levels and unlocking new applications in voice assistants, AI companions, and multimodal agents.
Before OpenAI, Peter was an engineering leader at Stripe, where he rebuilt the company’s financial backend, and Quizlet, where he helped scale the platform’s core systems during rapid growth. His career has focused on building the infrastructure that powers complex, real-time, and data-heavy applications—now applied to pushing the boundaries of voice AI.
SUMMARY
Join Brooke Hopkins (Founder & CEO, Coval) and Peter Bakkum (API Multimodal Lead, OpenAI) for an insightful fireside chat focused on the cutting-edge voice-to-voice architectures powering modern voice AI applications. They’ll unpack the unique challenges of designing and deploying real-time, multimodal systems that enable seamless, natural conversations between users and AI agents. Drawing from Brooke’s expertise in simulation and evaluation at scale and Peter’s experience building OpenAI’s real-time APIs, this conversation will dive into how infrastructure, latency optimization, and rigorous testing come together to create reliable, production-ready voice AI experiences.
TRANSCRIPT
Adam Becker [00:00:00]: [Music]
Brooke Hopkins [00:00:08]: Hey everyone. Great to see all of you here. So today we're going to be doing something a little bit different, more of a fireside chat with Peter Bakkum from OpenAI. He leads the Realtime API at OpenAI and has been working a lot on how you really scale voice-to-voice in production applications. And this has been something that I'm super curious about, because I think there's so much potential to have all of this really interesting, higher-fidelity data to feed into your voice AI applications. So I'm Brooke, I'm the founder of Coval, and we're building simulation and evaluation for voice agents. So we think a lot about how you scale these and really take a voice agent from a really cool-looking demo to running millions and millions of conversations at very high reliability. My background is from Waymo, where I led our evaluation job infrastructure and my team was responsible for all of our developer tools for launching and running simulations.
Brooke Hopkins [00:01:05]: But I think there are a lot of really interesting parallels from Waymo, which is: how do you take these really large-scale, non-deterministic models and scale them into a very high-reliability system that's also very autonomous and can handle new situations well? Peter, do you want to start with the intro?
Peter Bakkum [00:01:22]: Yeah. I'm Peter. I work on the Realtime API at OpenAI, and this is our platform for streaming speech-to-speech voice applications. So native audio in, native audio out applications. I spend my time building the service, working with researchers, and growing the applications that customers are able to build on top of OpenAI.
Brooke Hopkins [00:01:53]: Definitely. And I think Peter has been an awesome thought partner on when it makes sense to use a real-time solution and when it makes sense to use, for example, other OpenAI products. And so that's really what I want to get into today: pick Peter's brain on where he's been seeing voice AI excel in voice-to-voice, and what the future of voice-to-voice looks like, because everyone is very eager about that future. Maybe to start with just a quick recap for those that are newer to voice: there are two different architectures that we're seeing out there today. One is more of a cascading system, and this is where you chain together a variety of models. So speech to text, an LLM, and then text to speech. So basically a model for perceiving the world around it, a model for doing the reasoning and logic, and then a model for voicing that response.
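[Editor's note: a minimal, non-streaming sketch of the cascading chain Brooke describes, in Python. The model choices ("whisper-1", "gpt-4o-mini", "tts-1") and the one-shot turn flow are illustrative assumptions; a production agent would stream every stage.]

```python
from openai import OpenAI

client = OpenAI()

def cascaded_turn(audio_path: str, system_prompt: str) -> bytes:
    """One conversational turn through the chained STT -> LLM -> TTS pipeline."""
    # 1) Perception: transcribe the caller's audio to text.
    with open(audio_path, "rb") as f:
        transcript = client.audio.transcriptions.create(model="whisper-1", file=f)

    # 2) Reasoning: decide what to say next with a text LLM.
    reply = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": transcript.text},
        ],
    )
    reply_text = reply.choices[0].message.content

    # 3) Voicing: synthesize the reply with a TTS model.
    speech = client.audio.speech.create(model="tts-1", voice="alloy", input=reply_text)
    return speech.content  # raw audio bytes to play back to the caller
```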
Brooke Hopkins [00:02:50]: What Peter and the Realtime API provide is a voice-to-voice interface. So instead of having to chain together all these models, you have this capability out of the box. Peter, I'm curious, from your perspective, where does it make sense to use real-time versus a cascading architecture?
Peter Bakkum [00:03:12]: Yeah. So, you know, voice is difficult, there are different shapes and sizes of applications, and I have to say I'm a little biased because I personally work on speech to speech, so that's the perspective I'm coming from. And to be clear, we have customers that do both architectures and make both architectures work well. The way that I would compare and contrast them is that speech to speech is very much native audio in, native audio out. So you don't necessarily get the same kind of modularity that some people care about with chains; some folks want to run a whole classifier on the input before it goes to the LLM in a voice application. But with speech to speech you do get generally lower latency, and you get native audio understanding in and native audio understanding out.
Peter Bakkum [00:04:18]: Something I hear about from customers is that language switching or code switching is actually quite difficult with cascading apps, because you're often choosing models specifically for certain language properties. And so, you know, it's a dumb answer, but it really depends on the application. Like I said, I'm very invested in speech to speech, and I'll tell you part of why. When I first started working with this technology, I was sitting with my one-year-old daughter, who was saying a few words but not really sentences, and I was just chatting with the model. I said, okay, I'm sitting with my daughter, and she kind of made a sound in the background, and the model was like, oh, she sounds happy. Right? And I had a moment, because it was very cool. But it's also, I think, an example of something that just doesn't really work with the chained or cascading approach; it really is native audio in and native audio out.
Brooke Hopkins [00:05:20]: Yeah, yeah. I think that is actually something that I'm really excited about. In self-driving, what we saw was this constant movement toward kind of an omni-model approach, and you have all these benefits: you have a lot more signal coming in, so you can make higher-fidelity decisions. It's the same reason why talking to someone on a video call is easier than talking to someone over chat, because you have all of this extra signal around tone of voice, you have who's talking in the background, and you can correct for transcription errors through contextual understanding. The same was true with self-driving.
Brooke Hopkins [00:05:59]: You can correct for where this person is because you saw where the person was five seconds ago. So I think there's so much potential here. But it's probably not news to you that people are saying real-time is not ready for production. What do you think people are getting wrong there? Where are you seeing production use cases today, and where do you think people are getting it wrong?
Peter Bakkum [00:06:23]: Yeah, well, speech to speech has been a nascent technology, really. We first launched it in October of last year, and we've been working hard on it. I work with it every day, so I'm intimately aware of where it fails and what's good and bad about it. It's really been a process of launching and then making it great and making it fit customer applications. Since we launched, we've done a lot of model-level work to improve model behavior, and then we've done a lot of service-level work. I think part of what's difficult about voice is just the delivery and management; latency is its own set of problems. I think we started off new with a lot of issues, and we've pushed on it, and we have a lot of customers in production now. And then I'm especially excited about a model we launched in June.
Peter Bakkum [00:07:41]: We pushed really hard on function calling and instruction following and some of these other behaviors. In terms of production readiness, I think it's actually mostly there. Like I say, I work with it every day and I know all the problems, but we have pretty significant customers that are in production. So for example, Perplexity has a voice experience where you can load it and talk to it. And so I'm bullish on the future here. And like I said, I'm a big believer in speech to speech. I think this is where the industry is headed.
Brooke Hopkins [00:08:23]: Yeah, I think that's something that we're seeing from our customers as well: the people who are getting a lot of value from real-time are the ones who care a lot about latency and naturalness and conversational quality. So for example coaches, AI tutors, interview assistants. I think that's where we're seeing a really good fit, because that gain in latency and that extra awareness of what's happening in the conversation is really powerful for creating that incrementally better experience. And again, we've seen huge gains in the tool-call capabilities of the Realtime API. So you might ask: why doesn't everyone just use real-time voice-to-voice instead of a cascading model today? Where we're seeing the fit for a cascading model, at least with our customers, is when you have really complex instruction following, and latency and naturalness aren't as important, but getting it right every single time is super important. So things like very complex customer service workflows, for example. Where do you expect the model to go next, and what kind of applications do you think it's going to grow into?
Peter Bakkum [00:09:43]: Yeah, great question. So there's a pretty well-documented gap between pure text models and audio models on certain benchmarks; there's basically more going on in the application or in the model. So there's a trade-off, and people strategize around that. We see a lot of customers do kind of a conversation graph, or what we call an agent pattern, which is: you set up a pathway through a conversation that you expect the user to follow, so you achieve objective A and you move on to state B, that kind of pattern. And then we also see customers mix and match models. One architecture pattern we've seen is what we call thinker-responder, where you have a really critical function call. Let's say you're going to do a refund, and it's for money, and you need to make sure that you've really checked off every box and you're adhering to your policy and things like that. So even with chained, cascading apps, it's a pattern we see: choose the really critical moments and offload to a reasoning model at that point.
Peter Bakkum [00:11:14]: And I think that's actually been pretty effective from what we've seen.
Brooke Hopkins [00:11:19]: Yeah, I think that was something really interesting that I've heard from you: you don't necessarily have to use the real-time model end to end, or you can layer additional models on top of it. What are some of the different configurations that you've seen of layering models or using real-time models in interesting ways?
Peter Bakkum [00:11:42]: Yeah, that's a great point. So it's very mix-and-matchable, this speech to speech, and we see customers with all kinds of different patterns. One pattern we see is: I want a different voice, maybe a different text-to-speech provider. So I talked to a big customer who uses real-time for basically everything except the voice: turn detection with VAD, ASR, and then the actual model. They run the model as a text model, just because it's the fastest path they found, and then pipe to a different text to speech. We also have a mode that I think most people don't know about, which is that you can use this purely for turn detection, and I think it's actually pretty novel.
Peter Bakkum [00:12:38]: You can basically pipe audio to the Realtime API in transcription mode, and it will do turn detection with VAD. And then we have a semantic VAD product that sits on top and does more classification, more intelligent "are you done with your turn" logic, and then spits out ASR as well. So it really is pretty mix-and-matchable, and we see a lot of different patterns.
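[Editor's note: a rough sketch of the shape of the transcription-only configuration Peter describes: stream audio to the Realtime API, let it handle turn detection and ASR, and keep your own models downstream. The event and field names below are approximations for illustration only; consult the current Realtime API reference for exact names.]

```python
# Assumed shape of a transcription-mode session update sent over the Realtime websocket.
transcription_session = {
    "type": "transcription_session.update",  # assumed event name
    "session": {
        "input_audio_transcription": {"model": "gpt-4o-transcribe"},  # example ASR model
        "turn_detection": {
            "type": "semantic_vad",  # classify "is the speaker done?" rather than silence alone
            "eagerness": "medium",   # assumed knob for how quickly to end the turn
        },
    },
}
# After sending this, you would append audio chunks and listen for transcription and
# turn-boundary events, feeding them into your own LLM and TTS of choice.
```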
Brooke Hopkins [00:13:06]: Yeah, that's something else we've seen as well: people are mixing and matching these different capabilities. I really like what you said about using this for turn detection, because turn detection is still one of the biggest problems we're seeing today and one of the hardest unsolved challenges. And having this additional signal when deciding whether or not it's your turn is so valuable, because even when you're reviewing transcripts, it's very hard to tell whether it was an interruption or the person just took a long time to respond. So you can imagine that for LLMs and models this is very difficult as well. Having that extra context and being able to stream it makes a lot of sense to me. Or even being able to use it just for, say, transcription, turn detection, and generating the response, but then layering on additional LLM calls, for example a tool call or RAG-style retrieval or additional context that you feed in to amend that context. I think that could be a really powerful architecture as well.
Peter Bakkum [00:14:20]: Yeah, I would say turn detection is still one of the hardest parts of this architecture, and with chained and speech to speech, it's difficult in both. And something that's interesting is that it's actually a different problem for different applications. Customer support applications, or a conversation like this, have one pattern of speech, and some other applications have a different one. If you're doing language learning or that kind of thing, your speech is probably going to slow down and there are going to be big thinking gaps. As a human you would smoothly understand that the person is not done, but for a simple VAD model that's actually kind of a hard problem. So I can't say it's a completely solved problem, but the new thing we're doing in this space is called semantic VAD, and basically it classifies whether it thinks you're done speaking.
Peter Bakkum [00:15:29]: Right. So if you end with an emphasis or a question, that's a pretty good signal that, okay, my turn is done. Or if you trail off, it gives a very low probability. And then it basically tells the model how long to wait after that point. So if you trail off mid-thought, it'll give you space to continue.
Brooke Hopkins [00:15:51]: Yeah, I'm super excited about this, because from our customers, our interruption detection metrics are some of the most in demand. This trade-off between latency and interruptibility is probably one of the hardest problems, and it impacts the prosody, the naturalness of the conversation, so much: if the bot is constantly interrupting you, then no matter the quality of the output, it's very hard to have a natural, flowing conversation. Totally. Yeah. I think something else that comes to mind, one of the challenges in deploying real-time applications or voice-to-voice, is: once you're streaming out the audio, how do you verify that it's in fact what you meant to stream out? What have you seen people do to ensure guardrails, to ensure reliability in that final output?
Peter Bakkum [00:16:49]: Yeah, so the first concern that people usually have is: is the model going off the rails? Is the model saying things that we don't want it to say, et cetera? First of all, I would say we put a lot of time into making our models safe and reasonable, and we have our own safety systems, but often people want their own controls. With the chained approach, you can classify the LLM output text and make a decision based on that, but there's a latency trade-off there. Right.
Peter Bakkum [00:17:31]: You're running a stage in between LLM generation and text to speech, so there's a cost to that. What I tell customers in this space is basically to accept that guardrails should be asynchronous, and you should not block the conversation while you're running your guardrails. This means they have to be fast enough. The architecture we advise is: you put guardrails on the text output from the Realtime API, which is generally faster than the audio output, then you run a classifier and cut off when you see something you don't want. This is actually pretty effective, and it's what we do for our own safety systems. Even something really simple like a regex looking for specific words on the output is pretty simple and effective. Our SDK has some tools in this space as well.
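[Editor's note: a minimal sketch of the asynchronous guardrail Peter describes: watch the text channel, which arrives ahead of the audio, and cut the response off when a blocked pattern appears, without ever blocking the conversation. The banned-term list and the cancel_audio() hook are hypothetical placeholders for your own integration.]

```python
import re

# Hypothetical list of terms the agent should never say out loud.
BLOCKED = re.compile(r"\b(ssn|social security number|competitor_name)\b", re.IGNORECASE)

async def guard_text_stream(text_deltas, cancel_audio):
    """text_deltas: async iterator of partial text from the realtime session.
    cancel_audio: async callback that truncates or stops the in-flight audio response."""
    seen = ""
    async for delta in text_deltas:
        seen += delta
        if BLOCKED.search(seen):
            await cancel_audio()  # stop playback / truncate the response
            break                 # optionally prompt the model to recover gracefully
```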
Brooke Hopkins [00:18:33]: Yeah, I think there's going to be something really interesting there. We've already seen this with voice: how do you create UX experiences that account for some of the shortcomings of voice? So for example, saying ums and ahs to make it sound less robotic, or filling the gaps in silence while waiting for a response, or when there's long latency. Something that's interesting is how you create really graceful cutoffs, saying, oh, sorry, I actually didn't mean to say that, or let me revise what I just said, or I just checked and it actually looks like this answer is supposed to be X, Y, or Z.
Peter Bakkum [00:19:16]: Totally. Yeah. Another pattern that we see is folks will take the conversation context and run a separate classifier asynchronously in the background: is this going well? Am I adhering to the system message? Am I adhering to the policies? And then you come to an answer, and you can actually prompt the model: okay, say this now, or we need to go in this direction. It's hard to make it completely smooth, but it is actually pretty effective. And you can recompose the conversation in kind of arbitrary patterns of classification or guidance; you can do a lot with that model.
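[Editor's note: a minimal sketch of that background "is this going well?" check under stated assumptions. The judge model, prompt, and JSON keys are illustrative; how the resulting guidance gets injected back into the live session is left to your own integration.]

```python
import json
from openai import OpenAI

client = OpenAI()

def check_conversation(transcript: str, policy: str) -> dict:
    """Score a live conversation against the policy; run this off the hot path."""
    result = client.chat.completions.create(
        model="gpt-4o-mini",  # example judge model
        response_format={"type": "json_object"},
        messages=[
            {"role": "system",
             "content": 'Given a policy and a live transcript, return JSON '
                        '{"on_track": true or false, "issue": "...", "guidance": "..."}'},
            {"role": "user",
             "content": "Policy:\n" + policy + "\n\nTranscript so far:\n" + transcript},
        ],
    )
    return json.loads(result.choices[0].message.content)

# If on_track comes back False, feed the "guidance" string to the live agent as an
# instruction rather than interrupting the user mid-turn.
```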
Brooke Hopkins [00:20:06]: It's like kicking the model under the table when it starts to say something it's not supposed to.
Peter Bakkum [00:20:10]: Yeah, exactly. Yeah.
Brooke Hopkins [00:20:11]: I think something that's different in conversational AI from other LLM applications, where, say, you're returning a summarization and there's one final output, is that the conversation is evolving as you go along, and oftentimes conversational failures are slow deaths: you get stuck in loops, or you keep repeating yourself, or you misunderstand the user's intent, these types of things. And especially in customer service, or cases where the user has a goal they're trying to achieve, they actually will be pretty patient if they don't get it on the first or second turn. So having something that can check "this conversation isn't going well, what do we need to do to get it back on track?" is probably one of the more powerful guardrails: are you following the workflow, did you miss any steps, did you understand what the user is trying to say, is there new context, et cetera. So if you're moving from a cascading model to real-time, what are some of the things you should look out for in productionizing that application? How does context engineering change? How does prompt engineering change?
Peter Bakkum [00:21:30]: Yeah, it's a good question. I think it actually overlaps to a greater degree than people might realize, in that a lot of the same strategies apply. Right. So writing a clear system message, breaking up your conversation into different states; I think these strategies are still great. Something that people often don't realize about speech to speech is that you have a lot of control over how the model talks in terms of tone and energy and speed, and this is all pretty steerable. When I'm testing and making sure everything is right, I always ask it to talk like a pirate briefly.
Peter Bakkum [00:22:21]: Right. It's really pretty amazing. This is one of the advantages of speech to speech, in my opinion: the steerability in terms of the voice output. And so we see people experimenting and tuning and finding the right adjectives to give the model in terms of what they want to hear.
Brooke Hopkins [00:22:43]: Yeah, I think that's super interesting. I think people don't prompt their TTS systems, or these different cases, nearly enough, like using tags or trying to influence how something is said to achieve that more natural response. What are some of the best prompts you would recommend for production applications?
Peter Bakkum [00:23:09]: Talk quickly. Talk professionally is one. I mean, it depends on the application. We had one where Match Group, it was quite funny, wanted to create a dating scenario where you're supposed to flirt with the model, so they prompted it to be friendly and flirty and fun and bubbly and stuff like that. Most people don't want that for customer support.
Peter Bakkum [00:23:39]: So you can do guidance. I've also seen people prompt for specific kinds of backgrounds. I was working with a customer who was prompting, I want you to talk like you're from northern India, right, use that diction and language and that kind of thing. So it's really across the spectrum, and it's pretty powerful. But it's hard to define exactly what it will and won't do, so I just advise people to try things and experiment.
Brooke Hopkins [00:24:15]: Yeah, that's super cool. I don't think I knew that you could actually change the speed at which it talks. I think that's super interesting, because we hear this from customers all the time: if you're in healthcare, speaking slowly for an older demographic and making sure everything is really clear is important. But if you're a note taker or personal assistant, speaking slowly and repeating everything makes it way less usable. So, I think there have been huge leaps in terms of your tool-call accuracy. What led to that, and how were you able to improve it over time?
Peter Bakkum [00:24:53]: Yeah, the story with our models is often that we push them forward, then we try things, people build applications, we hear feedback, and then we push the models harder in the direction people are asking for. So with this last release, we spent a lot of time working with specific customers and actually writing lots and lots of evals about model behavior in different situations. When do you call a function versus not call it, what we call relevance? Are the arguments correct? Is it able to say something before it calls a function? I would say it was a lot of very manual experimentation and creating scenarios and that kind of thing. And then with certain customers, we worked with them and got a better sense of their production data. And actually there are incentives with OpenAI where you basically get discounts if you share data with us; this is off by default.
Peter Bakkum [00:26:18]: And if you explicitly turn it on and share it, we're able to use some of that data to improve the models as well. So a lot of model work is, I suppose, the answer here.
Brooke Hopkins [00:26:34]: Yeah, totally. I think people really underestimate how much just having evals and a clear sense of what you're trying to get to can improve things, both from a large-model perspective and, in voice, for your specific use case. This is actually something that's really interesting about voice: a large portion of the use cases are very similar across different contexts. Compared to LLM applications, there's a lot more homogeneity in what these applications look like: customer service, job interviews, coaches. Booking an appointment, I think, covers a massive number of applications today. And so I can imagine that your team is probably looking at those.
Brooke Hopkins [00:27:25]: And even if you just focus on booking appointments, I think you could have huge gains just by looking at those. And then on the evals front for teams specifically, I think people underestimate how much you get if you actually just define, we really want to nail these ten scenarios: book an appointment, reschedule an appointment, cancel an appointment, et cetera, and then run each one a hundred times. You can actually start to see, here's an edge case, I didn't account for this edge case, and really hill-climb on those edge cases.
Peter Bakkum [00:28:01]: Yeah. Something that I like about working on tool calling and instruction following is that even with specific evals, model capabilities tend to be super general, in that effective tool calling is a very general capability. We see people use it for customer support applications; yesterday I saw someone use it for a desk toy that had a tentacle arm, and he could tell the toy what to do. It's the same function-calling mechanism, is my point. Right. That's pretty amazing.
Brooke Hopkins [00:28:42]: Yeah, definitely. And a lot of times the failure modes of voice AI applications are the ones that people just didn't think were going to happen: you accidentally get stuck in a loop, or your customer starts speaking a different language, these types of things. I think that's where observability is also super important, because you can start to see what people are actually trying to do in production. You probably have less ability to do that because you don't have access to customer data. But that's where, if people can share those use cases with you, these are the edge cases we're seeing, a lot of times they're not crazy hard to solve for.
Brooke Hopkins [00:29:23]: It's the unknown unknowns that get you. At least that's what we see with voice AI: the unknown unknowns get you.
Peter Bakkum [00:29:30]: Yeah, it's quite a hard space, because I think what applications often want is a kind of subtle, unexpressed behavior. If you had to write down everything you want a customer support agent to do, it's kind of a hard problem. Right. You want them to talk a certain way. You want them to be friendly and professional, and probably not talk about your competitors. And in this situation you do X, except for when you do Y, or when you heard something else. It really is a pretty complex thing.
Brooke Hopkins [00:30:20]: Yeah, yeah, I think that's very true. And the other thing, and this is something we did at Waymo in self-driving, is that instead of trying to say "I want you to do exactly this," for voice AI you really have to move to probabilistic evals, this probabilistic nature, where you say "I never want this type of thing to happen" or "I'm trying to increase the probability that this thing happens." Trying to say "this is exactly how the conversation is going to go" is hard, because it makes your agent less autonomous. And the really cool part of real-time is that you can get very high levels of autonomy, because you have all of the signal and context. But then the question, and I think this is really a hard product question, not just an evals question, is: what does it mean to be successful in your application? What do you not want to happen? And how can you start to measure the probabilities of those things happening?
Peter Bakkum [00:31:19]: Totally, totally. One thought that I give people is that using reasoning models as evaluators for voice sessions after the fact is actually really effective, and it's one of the most effective tools that we have, because it really is a complex question. Reviewing a 20-turn conversation and deciding whether it was good is kind of a hard question. And then, was it good in a way that you can be quantitative about, or even use to improve the model, is also a pretty difficult question. But just using the smartest models we have is one of the best strategies for that.
Peter Bakkum [00:32:09]: And it's like, we do this a lot, actually.
Brooke Hopkins [00:32:12]: Yeah, we actually have an exciting launch coming up of something along these lines. But this is definitely something we've thought a lot about: how do you evaluate a transcript not only in isolation but in the context of all of my conversations? How can the model gain awareness of what good looks like, so it has that capability itself, versus you having to take everything you know about your business and everything you've seen from conversations and encode it into specific metrics?
Peter Bakkum [00:32:46]: Yeah, totally.
Brooke Hopkins [00:32:48]: Where do you see the models going in the next six to twelve months? Sam Altman always talks about skating to where the puck will be, and assuming that models are going to get cheaper and faster. What do you think is the equivalent in multimodal? What should people assume? What should people bet on getting better?
Peter Bakkum [00:33:12]: Yeah, I mean, there's a lot I can't talk about here, but I would say that we are pushing. This is a big theme for us and it's an area that we're really investing in. And so there will be new models and new capabilities and lots of kind of improvements.
Brooke Hopkins [00:33:37]: It was.
Peter Bakkum [00:33:37]: Yeah. Yes, it will get better. And I'm just extremely bullish on voice in general. Honestly, one of the reasons I work in this space is that I think we will reach a point where we're just going to talk to our computers more. As the computers get smarter, we will talk to our computers more. And I think we'll reach a threshold where this happens, and in hindsight it will seem really obvious that, oh, there are tasks where this is a lot nicer than typing.
Peter Bakkum [00:34:13]: Right. So that's my overall thesis, and I just think there are a lot of applications to be built that are more voice-targeted, voice-native. It's really just now that we're getting to the point where you can have a pretty natural conversation and really rapidly fire off tools and handle that in a smooth way. So, honestly, my advice to people would just be to try things in the space, build things, and experiment. Customer service is one application. Language learning is one that I really like. But just from a using-your-computer perspective, I think there's a lot to build in terms of what I call interface control: I'm working with my computer and I want to do X or connect to Y.
Peter Bakkum [00:35:15]: These models actually work really, really well with MCP, maybe better, I would argue, than text models, because it fits in as tool calling. And there are cases where you would have a voice model call an MCP tool in a way you maybe wouldn't actually type: press this button. I'm extremely bullish on the space, and I would just advise people to try things that exploit the new capabilities, the function calling, instruction following, et cetera.
Brooke Hopkins [00:35:51]: Yeah, totally. The reference point I always use is that you are so quick to jump on a call when you're trying to do something complicated with someone, when you're brainstorming or honing in on a really ambiguous idea, because typing that out takes a lot longer. And I think that reference point makes a lot of sense for voice. I'll use ChatGPT on my way home to regroup on the day, or think through what I need to do next, or think through a new architecture for a system we're building. For these really ambiguous tasks I find it really useful, because it can help give structure, reflect back your thoughts, and start to hone in on that. Yeah. I'm curious, do you use it for anything personally that you're really excited about in voice?
Peter Bakkum [00:36:40]: I've been experimenting with MCPs a lot. This is my new thing, and I'm excited about it. Like I said, I like this paradigm because it is really intentionally built to match the model capabilities. MCP is a broader framework, but the really core thing, in my opinion, is exposing a list of tools to the model. The model can choose when it wants to use one, and it fires off an RPC, effectively.
Peter Bakkum [00:37:13]: So I wrote a little demo for myself that connects to Notion, actually, and I would talk to it and it would write paragraphs of text in Notion. Sometimes you want to do more dictation, and sometimes you want, okay, summarize and then add a paragraph summary at the bottom, that kind of thing. So I'm excited about the space, and I'm still experimenting myself.
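[Editor's note: a minimal sketch of the tool-exposure idea behind that demo: the model sees a list of tool definitions, decides when to invoke one, and your code performs the RPC. The append_paragraph tool and its backend are hypothetical; the schema follows the standard OpenAI function-calling format, and a Realtime or MCP setup would expose a similar list.]

```python
# Tool definition the model can choose to call when the user dictates something.
tools = [
    {
        "type": "function",
        "function": {
            "name": "append_paragraph",
            "description": "Append a paragraph of text to the end of a named document.",
            "parameters": {
                "type": "object",
                "properties": {
                    "document": {"type": "string", "description": "Document title"},
                    "text": {"type": "string", "description": "Paragraph to append"},
                },
                "required": ["document", "text"],
            },
        },
    }
]

def append_paragraph(document: str, text: str) -> str:
    # Hypothetical backend call (e.g. an MCP server fronting a notes app).
    print(f"[{document}] += {text[:60]}...")
    return "ok"
```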
Brooke Hopkins [00:37:40]: Yeah, yeah, totally. I think people are thinking about voice AI as replacing phone calls that are already happening today. They're essentially thinking about all the IVR trees they've ever interacted with and how you can replace them. But for one, people are already pretty okay with IVR trees in many regards, especially in very low-margin places, even if they're not ideal. So how can you start to move into places where that's a bad fit, or places where software wasn't able to reach? Something I'm really excited about with voice is that for a lot of legacy industries, where you aren't able to infiltrate with software because it's too disjoint or there are too many players to all switch onto an API format, you can now have this:
Brooke Hopkins [00:38:27]: one side is automated while the other side isn't, or maybe both sides are automated, but they never have to do a technical API integration. So we're seeing this a lot with, for example, trucking and logistics, or healthcare, where you have tons and tons of providers in healthcare, or lots of truck drivers booking freight. You're able to automate only one side of that and then really scale it.
Peter Bakkum [00:38:55]: Yeah, I think voice phone calls are probably here to stay, just because it's a very convenient shape and it's what a lot of businesses already have. And there will be a spectrum of how automated it is. Classic IVR, the phone-tree stuff, has always been frustrating in certain dimensions, and annoying, and I think we're at the point now where you can really do a lot better than that, and it's reaching the point where you can do a lot more. If you still have the customer support line, that's a phone channel to your backend systems; I think that's a nice shape, and we'll probably stay there, but shift into a more automated form. Honestly, a motivator for me in working on this technology is that I'm terrified of calling into customer support; if my own service is bad, I will be subject to it myself.
Brooke Hopkins [00:40:03]: Yeah, totally. And the thing I'm most excited about is: what does the Waymo-like experience for voice look like? With self-driving, people said it's not going to work, or it's going to be very robotic, or it will never hit production, and self-driving today is simultaneously very autonomous but also incredibly reliable and scalable and in production. How can we do the same for voice agents, where it's not just going through a phone tree, but you create these really autonomous experiences where it can actually decide: actually, I'll make an exception for you because there was a crazy weather event for this flight, so we can override this policy; or actually, I'm going to run this pretty complex workflow and automation on our side, even though it requires some reasoning capabilities.
Peter Bakkum [00:40:56]: It's hard, but we're going to get there.
Brooke Hopkins [00:40:58]: Cool. Well, this was awesome. What would you like from the audience? What are you most eager to hear about in terms of use cases from those building voice AI applications?
Peter Bakkum [00:41:12]: Yeah, like I said, I love it when people try things and build novel voice-driven experiences, things that haven't been reimagined for voice yet. So an example: someone sent me an app a couple of months ago which was like flashcards, but totally voice-driven. The stuff that comes up on the screen is basically prompting you, you're saying something, and if you got it correct, it's green. Right. This is something where, if you want it to be voice-native, you really have to reimagine it, but it's a very obviously useful shape. So I'm really eager for people to try this, and if you do, please send it to me. I'm excited to see it.
Brooke Hopkins [00:41:58]: Amazing. Where can people find you? Twitter?
Peter Bakkum [00:42:01]: Yep, I'm on Twitter: pbbakkum, that's P B B A K K U M. Send me a message.
Brooke Hopkins [00:42:09]: Yeah, honestly, I'm going to copy that. I'm also curious to hear about people's voice AI applications and the interesting things they're doing, so I'm also on Twitter if you find me. And I think we're both hiring, so we're always looking for great engineers.
Adam Becker [00:42:22]: Brooke, where are you on Twitter? How do we.
Brooke Hopkins [00:42:26]: Hopkins. BN Hopkins.
Adam Becker [00:42:29]: BN Hopkins. Okay, I'll put both of your handles in the chat. Folks, this has been absolutely incredible to listen to. Should we open it up to questions from the audience, if anybody has any? I kind of want to wait. Brooke, can you say something? I think I lost you for a moment.
Brooke Hopkins [00:42:53]: Sorry. Yeah.
Adam Becker [00:42:55]: Nice. I kind of want to set some context for people who might be listening in and trying to bridge their perspective from what they're used to, which is mostly traditional text-based models, to these multimodal models. I think a lot of people, when they first start out, might want to integrate some type of voice component, but the first thing they would do is just take the voice, transcribe it, go back to text, and then proceed as normal. And perhaps they come back and do text to speech at the end, or maybe they just remain in the text domain. How should they think about making the transition from transcribing to text and operating in text, versus going end to end from speech to speech? What are the constraints? Help them navigate the options there.
Brooke Hopkins [00:43:53]: Yeah. Do you want to take that, Peter?
Peter Bakkum [00:43:56]: Yeah, sure. I think there's a paradigm shift into a more streaming world. So you have to write your code a little bit differently. You have to assume that, okay, I'm not just doing a turn and then another synchronous call, text to speech, and I get the audio and I play it. It's: okay, the mic is always on, the audio is streaming back, et cetera. I don't mean to overemphasize how difficult it is, because there are actually pretty robust tools for this now. Something that I really like about the voice space right now is that the in-browser tools for audio and for WebRTC are actually very strong, and often better than the tools outside the browser. So plugging in mic and audio and those things is actually pretty robust and good.
Peter Bakkum [00:45:05]: So I think there's a paradigm shift to streaming. And then you probably edit your prompts to be aware of audio. So you say, okay, speak quickly, or speak this way, that kind of thing. Yeah, those are the two things that come to mind.
Brooke Hopkins [00:45:24]: I also think there are frameworks like Pipecat that help you actually run these voice applications, and that's why they're super useful: it's not impossible, but WebRTC is definitely annoying. So orchestration layers actually make a lot of sense in voice AI, compared to maybe LLM applications, where it might be easier to roll your own. Orchestration layers in voice AI let you focus on the actual meat of what you're trying to say. And they let you swap things in, as we discussed earlier: you can use real-time just for the speech to text and turn detection, and maybe even the LLM, but then chain in another TTS model if you want to sub in your own voice. Being able to do these types of things.
Peter Bakkum [00:46:16]: Yeah.
Adam Becker [00:46:16]: The point about having to recraft your prompts is interesting, because first of all, I love talking to ChatGPT. It might be the primary way in which I engage with it. I must spend over an hour a day just talking to it. And I've realized that, especially on long drives, driving from LA to SF, I end up, oh my, two and a half hours, just talking to the thing.
Brooke Hopkins [00:46:42]: I bet the OpenAI finance team loves that.
Adam Becker [00:46:48]: They always tell me, okay, you're about to hit your daily limit. So one thing I've noticed is that it takes me a while to actually get to the question. A lot of my time is just spent trying to figure out what I want to ask it in the first place, and there isn't the kind of editing that normally happens with text prompts, where I write it out and then go back and really try to craft something powerful. I imagine that when you build an application that takes this into account, you have to recraft the prompt to accommodate that relatively low signal to noise. Is that one of the things you're seeing?
Brooke Hopkins [00:47:36]: Yeah, go ahead, go ahead. I think something that's interesting with prompt engineering is there's this write-once, use-many-times prompt engineering. I think Lenny actually had this on his recent podcast around prompt engineering: how do you write a really robust prompt that you're going to use over and over in a product? But then there's this more conversational prompting, where you say one thing and then you say, actually, can you also add this, actually, can you also add this? And I think with voice, we're going to start to see people get used to similar conversational patterns, where maybe they start with an initial prompt and then add on. Something I like to do is first explore one thought, then explore another thought, then explore another thought, and then have it weave it all
Brooke Hopkins [00:48:28]: together at the end. Yeah. I'm curious, Peter, what your take is, but that's something I think will happen: people will have a different conversational pattern with voice.
Peter Bakkum [00:48:40]: A common pitfall I see is that people will build applications with voice, and they'll write their prompt and it'll be very "okay, in this situation, say this; if you hear this, say that." And I often tell people to be a little more generalized, or maybe goal-based is a way of describing it. Let's say you're doing customer support, and really what you want the voice to do is understand the problem. And let's say it's a refund: you maybe need to check three boxes, you need to get their account number and their address and their email. The way to write the prompt in that case is, okay, here's what you need to achieve the goal, rather than here's what to say in this situation.
Peter Bakkum [00:49:33]: So I think there's an art to crafting it. And also, what everyone does is try things and experiment, and that is still true.
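[Editor's note: an illustration of the shift Peter recommends, stating the goal and the required information instead of a line-by-line script. Both prompts are invented examples, including the file_refund tool name.]

```python
# A brittle, script-style prompt: enumerates reactions line by line.
SCRIPTED_PROMPT = """If the caller says "refund", say "I can help with that."
If the caller gives an order number, say "Thank you, one moment."
If the caller is angry, say "I'm sorry to hear that."
"""

# A goal-based prompt: states the objective, the required information, and the constraints,
# and lets the model handle the conversation path on its own.
GOAL_BASED_PROMPT = """You handle refund requests. Your goal is to confirm eligibility and
file the refund. To do that you need three things: the caller's account number, email,
and shipping address. Collect them in whatever order is natural, confirm each one back,
and call file_refund once you have all three. Be concise, warm, and professional;
do not discuss competitors or internal policy details.
"""
```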
Brooke Hopkins [00:49:43]: I think that's actually, what you said about telling someone what the goal is versus how to do it, I mean, I guess that's management 101, but it's true with prompts as well: you can get much higher levels of autonomy by sharing more about what you're trying to do and the context. And there's this trade-off of autonomy versus constraints. If you say exactly, step by step, how everything should go, you'll get something that does that one thing really well. But trading that off against giving it broader context, so it can autonomously handle new situations it's never seen before, is an interesting thing we're seeing.
Adam Becker [00:50:24]: Yes, a couple of questions from the chat here. Marco: could you tell us more about an AI voice assistant's architecture?
Brooke Hopkins [00:50:39]: I would guess he's referring to the cascading ones, which are probably the more interesting case here. The beauty of real-time is that you can kind of plug in voice and have voice to voice, whereas the more traditional architecture is what we were talking about earlier with cascading models. So you have speech to text that does transcription, then usually voice activity detection, which decides: are you done talking, should I start to come up with a response? That's what Peter mentioned with the new models they have around semantic VAD. Then you have an LLM that decides what to say next, comes up with the reasoning, et cetera, and then text to speech that speaks it out. So you can imagine you have a pretty low latency budget per model in those cases, because you're chaining together lots of models. And then for voice to voice,
Brooke Hopkins [00:51:35]: Peter, can you do a summary of architectures for voice to voice?
Peter Bakkum [00:51:39]: Yeah, voice to voice is a little bit more end to end for the voice layer. If you're doing a voice assistant, I would advise you to try it simple first: try it in the browser or on the command line, prompt it roughly how you want it to help, and have some examples. Then I think the other challenge with voice assistants is just: what can it do? How do you connect it to things? How do you hook it up at the application layer? That's the domain where, as I was saying, MCP has proved a really effective strategy with voice, in that there are growing tools and APIs that are very pluggable and usable in this context.
Adam Becker [00:52:32]: We got another one coming from Guillermo: in production environments, is the evaluation of an LLM's response consistency across voice and image inputs handled the same way, or do these modalities require distinct evaluation techniques?
Brooke Hopkins [00:52:52]: I think image and voice are going to be very different in terms of how you're evaluating them. You can use multimodal as a judge, actually; even for audio, you can use a multimodal judge to say, was this good audio quality, were there any vocal hallucinations, was the agent pausing at the right times. I imagine you could totally do this for image as well, but we don't do any image evaluation. I don't know, Peter, if you've seen image evals.
Peter Bakkum [00:53:30]: I haven't worked with image evals much specifically, so I don't have a lot of thoughts there. But multimodal evals are hard in the sense that there's an extra layer of complexity. We see a lot of people who write evals with audio: they'll take a text eval, render that to speech with a text-to-speech model, feed it to the speech-to-speech model, get the audio output, and evaluate that. And, like I mentioned before, we also see a lot of people take the transcript from a conversation, shove it into a reasoning model, and evaluate that way.
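[Editor's note: a minimal sketch of that audio-eval pattern. The TTS rendering step uses calls that exist in the OpenAI Python SDK; the run_agent(), transcribe(), and judge() steps are placeholders for your own harness, and the model, voice, and file name are illustrative choices.]

```python
from openai import OpenAI

client = OpenAI()

def render_eval_audio(user_turn: str, path: str) -> str:
    """Render a text eval case into audio so it can be fed to a speech-to-speech agent."""
    speech = client.audio.speech.create(model="tts-1", voice="alloy", input=user_turn)
    with open(path, "wb") as f:
        f.write(speech.content)
    return path

audio_in = render_eval_audio("Hi, I'd like to reschedule my appointment to Friday.", "case_01.mp3")
# agent_audio = run_agent(audio_in)          # your speech-to-speech system under test
# verdict = judge(transcribe(agent_audio))   # e.g. the reasoning-model judge sketched earlier
```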
Brooke Hopkins [00:54:13]: Yeah, yeah. I would also say there are more and more conversational applications that have buttons or images, especially AI tutors that show images and diagrams. So this is something that's on our radar: you might want to be doing multimodal evals there.
Adam Becker [00:54:33]: Awesome. Peter, if you can drop your Twitter handle here as well, I'll put both of those in the chat. Awesome. Thank you very much. I have one lingering question. Speaking to these AI solutions often has been getting me to think about the tone with which they answer me, and I feel like that tone has been evolving, and the intonation. And every time I try to ask it not to do it, like, can you actually,
Adam Becker [00:55:05]: can you change the tone, can you sound a little bit more like this as opposed to a little bit more like that, I don't know if it's actually picking it up and modifying, to be honest. And I wonder whether getting the tone right for different customers and different users is something anybody has complained about to either one of you, and how do you think about that? Because I imagine the tone with which you speak to different people is different. How do you cohere that?
Peter Bakkum [00:55:36]: Yeah, I have a bunch of thoughts on this. It's definitely something we hear a lot about. There's actually an interesting distinction: the ChatGPT product is intentionally more conversational, and the API is a more general shape, so we see it as important to support a really wide range of tones and behaviors and moods and things like that. Between those models you will see some difference. And our goal is that with the API model, you start at kind of a neutral default, and then you can prompt it in the direction you want.
Brooke Hopkins [00:56:28]: Awesome. Awesome. Thank you so much, Peter, for joining. This was really fun, talking through everything around real-time. I'm super excited about the future of real-time.
Peter Bakkum [00:56:40]: Yeah. Oh, thanks for including me and I enjoyed chatting about all this stuff.
Adam Becker [00:56:46]: Awesome, folks. Thank you very much for coming. The chat has their Twitter handles as well, so please keep ideas and use cases flowing to Brooke and to Peter as they come through. Brooke, Peter, thank you very much. Folks, it's been great having you.
Brooke Hopkins [00:57:05]: Thank you so much, guys.
