The Challenge with Voice Agents

Paul van der Boor is a Senior Director of Data Science at Prosus and a member of its internal AI group.


At the moment Demetrios is immersing himself in Machine Learning by interviewing experts from around the world in the weekly MLOps.community meetups. Demetrios is constantly learning and engaging in new activities to get uncomfortable and learn from his mistakes. He tries to bring creativity into every aspect of his life, whether that be analyzing the best paths forward, overcoming obstacles, or building Lego houses with his daughter.
Demetrios, Paul, and Floris explore the latest in Voice AI agents. They discuss real-time voice interactions, OpenAI's real-time Voice API, and real-world deployment challenges. Paul shares insights from iFood’s voice AI tests in Brazil, while Floris highlights technical hurdles like turn detection and language processing. The episode covers broader applications in healthcare and customer service, emphasizing continuous learning and open-source innovation in Voice AI.
Demetrios [00:00:00]: What's up, everybody? We are back for this limited edition series that we're doing with the Prosus team, talking about all of the work they have been doing on AI agents. Today, it's voice. And we go deep into the experiences that they've had building out voice AI agents. I talk with Paul a little bit at the beginning about the landscape, about what he's seeing, what I've been seeing. And then we go into the tactical stuff with Floris to talk through how he has been using the OpenAI Real Time API, and specifically what the learnings have been. Because voice is a whole different beast, as we will find out. Let's get into this conversation. All right, we're back with episode two.
Demetrios [00:00:52]: We made it this far and we're coming in hot out of the gate with voice AI agents, because we talked yesterday and I want to give a bit of background. The way that we were going to do episode two was to say, hey, let's talk about all the frameworks you guys used while building out some of these agents, what you liked, what you didn't like. And then we said, you know, frameworks have been around for the last two years. There's a lot of content on the Internet about frameworks. You have some people that love them, some people that hate different ones. What we don't have a lot of content or experience stories around is people building voice AI agents.
Paul van der Boor [00:01:37]: Yeah. So what we saw already for a long time is that the ability to generate synthetic content with voice, where you basically take text and you get voice, allows a new way of interfacing or interacting with AI systems. So it's an important next step for us to be able to open up agents to consumers, in particular in a B2C world. And for a long time we were experimenting, doing voice cloning, working with avatars, Synthesia, ElevenLabs and many others out there, just to see how this could look and feel. And typically what was the limiting factor is that this was not possible in real time. You could generate some voice; there are a lot of voice libraries that you could use, standard voices that you could give text and they would generate audio, text to speech, essentially. And they were great. They were becoming very good, but they were still...
Paul van der Boor [00:02:43]: It was an offline interaction. You have to take the text, generate the voice, wait a little bit and then go back. And that wasn't really real time.
Demetrios [00:02:49]: Yeah.
Paul van der Boor [00:02:50]: And about a year ago or so, there was a moment where this started to become possible. I remember very vividly a demo we had with a team from Groq, with a cube. And of course their whole aim is to accelerate.
Demetrios [00:03:06]: Yeah, they're fast, they're very fast.
Paul van der Boor [00:03:07]: That's what they optimize: their inference. And they were using that to enable real time voice interactions. And then of course we saw the real time Voice API from OpenAI.
Demetrios [00:03:18]: Well, we should mention with Groq, there are almost two ways of doing this these days. It's clear that there's this large push towards speech to speech models, so the whole model can do everything that you need done: the model takes input as speech and the model gives output as speech. But before that became the trend is what you're talking about with Groq, because Groq is so fast on the inference side that you could set up these pipelines: you input voice, transcribe that, send it to a large language model that is on Groq, and it can do things really fast. I think someone told me the token streaming speed is something like 300 tokens per second, so it's insanely fast. And then you output that with some text to speech model. So there's that pipeline: three steps.
Paul van der Boor [00:04:17]: You had three steps, you had to basically bridge the modalities within different models, and that changed. Then you had the OpenAI real time voice API and others that we were doing early testing with, and so on. And all of a sudden you could have agents that you could talk to, that you could stream voice into and they would stream voice out. And we did some tests also with iFood, one of the largest food delivery companies in the world, which operates out of Brazil, seeing if we could use that, for example, for riders as they were out there delivering. And we were trying to figure it out. And by the way, just take this in for a second: one thing is to have a demo where you and I can play with one of these voice agents. Now take that same model and put it into the context of a food delivery partner who is out on the streets in Sao Paulo, who speaks Portuguese with a Brazilian accent, maybe even a colloquial accent, with a lot of traffic noise in the background, who is in a hurry trying to figure out which restaurant they need to go to to pick up the order, or what direction to take, or the road is closed or whatever. And they then expect the voice model to be able to operate.
Paul van der Boor [00:05:33]: So that was a great test, because it takes something that works in a lab to being tested, you know, in real life. That's complicated for so many reasons.
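To make the three-step pipeline described above concrete, here is a minimal sketch of the transcribe, reason, synthesize loop using the OpenAI Python SDK. The model names and the blocking, one-turn structure are illustrative assumptions, not iFood's actual stack; the point is that every hop adds latency, which is why this approach never felt truly real time.

```python
# Minimal sketch of the pre-realtime, three-step voice pipeline:
# speech-to-text -> LLM -> text-to-speech. Model names are illustrative.
from openai import OpenAI

client = OpenAI()

def voice_turn(audio_path: str) -> bytes:
    # Step 1: transcribe the caller's speech.
    with open(audio_path, "rb") as f:
        transcript = client.audio.transcriptions.create(model="whisper-1", file=f)

    # Step 2: generate a reply with a (fast) LLM.
    reply = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "You are a delivery support assistant."},
            {"role": "user", "content": transcript.text},
        ],
    )

    # Step 3: synthesize the reply back to audio. Each of the three hops
    # adds latency, so the exchange is turn-based rather than real time.
    speech = client.audio.speech.create(
        model="tts-1",
        voice="alloy",
        input=reply.choices[0].message.content,
    )
    return speech.content  # raw audio bytes to play back to the caller
```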
Demetrios [00:05:42]: Yeah. And this was them calling the restaurant to say, hey, where's the order?
Paul van der Boor [00:05:46]: They were calling iFood to basically ask the question. So normally they would otherwise have to stop the vehicle, take out their phone, and text. So, to not lose time and to do that safely, the ability to interact in real time through voice was one of the use cases that we thought was a great way to really stretch the limits of the technology while also figuring out if it would work for a real use case. And of course, there are lots of things you want to test. For us, it was, one, technical feasibility: can we actually get a model into this app in a way that lets us test this? But also desirability. Right. Does the rider find this good enough? Is it useful? Does it help iFood answer questions faster, safer, and so on? And then you enter a whole new world of complexity.
Paul van der Boor [00:06:34]: Right. Because, well, what formats is this audio being streamed in? You're actually streaming, not doing batch processing. And you still need to retrieve real data that you normally have in your agent tool-calling workflows on the iFood side.
Demetrios [00:06:46]: And that memory aspect that we talked about last episode is so important. How do you keep different things in context without losing what is important and the thread of the call? The one thing that I am not clear about when it comes to voice agents is the memory aspect. When you're doing the pipelines, are you just shoving new things into the prompt, or are you shoving the whole conversation into the context window? How does that work?
Paul van der Boor [00:07:19]: Yeah, so a variety of things. Of course, you have a system prompt, and you need to stream the content in; it depends on the model we're using. In this case, we were putting everything in the context window. That, of course, has its own limitations. We tested that also on a bunch of other use cases internally where we wanted it to be an expert on a certain topic. We took an internal education course and people could call. For example, we created an audio interface for Toqan, our internal assistant. They could call Toqan and ask about this course.
Paul van der Boor [00:07:49]: And you could see that there are a lot of these behaviors where, if you just use a simple text-to-text interaction with an LLM, you know how the stuff you put in the context will perform. The moment it generates voice, it didn't behave the same way. Of course we're testing different languages. So you start with Portuguese, and the moment you mention an English term, it starts responding in English and it doesn't go back to Portuguese anymore, even though that's in the system prompt. It's much more prone to hallucination. So this thing that we had kind of solved in LLMs, in the voice model world, boom, it was back. It was just making up stuff.
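Paul's "everything in the context window" approach is the standard chat pattern: every transcribed turn is appended to one growing message list that gets replayed to the model on each call. A minimal sketch, where the system prompt wording and the trimming cap are illustrative assumptions:

```python
# Sketch of "put everything in the context window": the whole conversation
# so far is re-sent to the model on every turn. The cap of 50 messages is
# an arbitrary illustrative guard; real systems summarize old turns instead.
history = [{
    "role": "system",
    "content": ("Answer in Portuguese. Stay in Portuguese even if the "
                "user mixes in English terms."),
}]

def add_turn(role: str, text: str) -> None:
    history.append({"role": role, "content": text})
    if len(history) > 50:
        # Drop the oldest user/assistant pair but always keep the system prompt.
        del history[1:3]
```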
Demetrios [00:08:29]: One thing that I wanted to talk about was a bit of the landscape in this whole voice AI space. You've got different players that are doing different things, and you've got folks that are focusing on different ends of this spectrum. Like ElevenLabs, which is doing the text to speech aspect: if you have the text, you can turn it into speech. And then there's Deepgram, which is doing the speech to text. Yeah, I'm getting all confused with which way we're coming in, but voice to text is one aspect. And they also started going from text to speech and creating that whole voice API piece, because I think everyone knows that's where the future is. And then you've got other ones.
Demetrios [00:09:22]: Like Cartesia, which is doing a speech to speech model. Obviously OpenAI is doing a speech to speech model and doing it in real time. Are there other ones that you can think of in this space that have grabbed some attention? Because this is like the foundation of what you need, and then you build apps on top of it. And when you think about what you need for this, you are looking at a way to get the voice to the model.
Paul van der Boor [00:09:52]: Yeah.
Demetrios [00:09:53]: So whether that is on a Zoom call or via phone, you need to get that information to the model. The model then needs to do something with it. It's either the speech to speech model and it spits back out speech, or you do that pipeline that we were talking about. And oh, the other one that I wanted to mention, just as a little tangent, is Hume. Hume is doing emotion detection and all that kind of stuff, and they're creating that toolkit. But what else is there?
Paul van der Boor [00:10:20]: Yeah, I think you touched on a lot of them. I think there's one more which we typically also look at: the ability to clone voices. That's another great example where you used to need a lot of good, rich, high quality audio content to be able to clone someone's voice. Now you can do that with tens of seconds; you can create a pretty high fidelity voice clone. So take the elements you mentioned, the ability to transcribe, text to speech, doing that in high fidelity. But that in itself requires, for example, language detection.
Paul van der Boor [00:11:00]: Right. Some models, like OpenAI's Whisper, do that well. And depending on whether you're doing transcription for meetings or conversations with customers, you may need speaker detection. So whenever we've put these things into systems, agents, products, it's typically a bunch of things that need to come together. Like I mentioned: speaker detection, language detection, the actual transcription. We do benchmarking against various languages, because again, Prosus is a global group. We've got lots of different languages; we hardly work only in English. So if you do this in Hindi or any other Indian language, or in South America with its languages, or Eastern Europe, we typically try to measure how well the model performs.
Paul van der Boor [00:11:44]: Just for that transcription task: word error rates in that language. Then you've got translation, the ability to actually move across languages. If you have a use case, for example, where you want to create educational content: we've got a lot of education companies in the portfolio, Udemy, Skillsoft, Stack Overflow, Brainly and many others, or even others that do corporate learning and so on. They typically want to translate, dub, and so on. Well, you need to be able to do that with high quality. That can be done offline, so you don't need to do it like you and I are in a conversation now.
Demetrios [00:12:20]: Yeah.
Paul van der Boor [00:12:21]: Then the other kind of environment is the one we mentioned in a B2C context, like iFood or a marketplace, where you want to be able to interact with a consumer or a driver or a restaurant in real time. And there you have the ability to use this much richer medium than just characters in a chat message, because like you said, you can detect speakers, emotions, intonation, accent, all those different things. And using that to generate better answers is also something that opens up a whole new space. You need to actually know how to do that. You can't just say, well, I'm getting voice and I'm giving the right answer. You have to think, well, maybe is this person annoyed? Are they?
Demetrios [00:13:04]: There's a lot of different ways of saying exactly one word.
Paul van der Boor [00:13:07]: Yeah. And in a learning context, are they still engaged? Right. And so, you know, language learning: how do you help them with their accent? So all those different signals you can now get out of this much richer medium. Utilizing them is a whole new problem space as well.
Demetrios [00:13:22]: Well, there are two things about these buckets that you're talking about: it's almost like the first one is not so agentic and the second one is very agentic. Because the first one, if we're just dubbing my voice. Which actually I have a funny story about: I was on a Brazilian podcast, and for the first half of the Brazilian podcast I stumbled through with my broken Portuguese, and it was all about data, and I was talking about MLOps. So you can imagine how fun that was for me, to be sweating talking about MLOps in Portuguese. And then for the second half, I think they just got sick of hearing me stumble in my Portuguese. They said, okay, now talk in English. And they would ask me the questions in Portuguese, I would answer them in English. And later, after the fact, they went and dubbed it in Brazilian Portuguese.
Demetrios [00:14:11]: And the agentic part, really where you start to see the voice AI medium come in, is when you can go back and forth and get information or get the agent to do something.
Paul van der Boor [00:14:22]: Yeah, and that's, I think, where this is really at the frontier now. Right. So being able to use function calling with these voice models in real time: we found we were struggling with that, and it's going to be required to get rid of these hallucination issues. System prompt and instruction following is, depending on the use case, also not quite there yet, but we're getting there. I mean, it's like all these things, right? This is not a point in time, it's a vector. So we see the improvements over the last 12 months on this front, and once we get those things right, in particular the function calling, that will make it a lot easier to also use the agentic workflows, being able to access tools and so on, to get higher fidelity conversations.
Demetrios [00:15:13]: Let's talk about some of the use cases that we've seen because I feel like there are some really cool ways folks are building apps on top of this base that we just mapped out.
Paul van der Boor [00:15:25]: Yeah, a ton. So we're focused on the e-commerce space. In a B2C world, there are lots of customer touch points that you can now think about changing because of voice. Our geographies: we sit here in Amsterdam, but the people we work with, the consumers of our platforms, they're in Brazil, like I mentioned, India, Eastern Europe. They're very different in the way they use technology. There's much higher usage of voice in general. The percentage of voice messages on WhatsApp in Brazil, for example, is significantly higher than it would be here in the Netherlands.
Paul van der Boor [00:16:02]: So voice is also in many ways a much more natural capability in some of those markets, especially in the B2C application. As we work with our restaurant partners, with real estate partners that are listing stuff on our platforms, being able to confirm certain things with them, check that the menu is fine, and use that as a more seamless touch point. And then there's the category I still think is heavily unexplored: as we give users the ability to interact with our platforms through voice, how do we use that to better detect the intent that they have? Because they speak to us and say, hey, I'm looking for, let's say, something simple and light tonight. Right? Then you say, well, Paul is using this account, he's ordering such and such a thing, he's maybe agitated or whatever.
Paul van der Boor [00:17:02]: So give him a fast order with a discount.
Demetrios [00:17:05]: Right.
Paul van der Boor [00:17:06]: But it's a much richer medium. And then all these things apply not just to the food ordering space but also, as you may have seen, we recently announced that we'll be partnering with Despegar, which is a travel platform in South America. They will join the group. And so there's going to be a ton of additional travel use cases. So I think this is a fairly unexplored space. Lots of ideas and many things to test and try and build.
Demetrios [00:17:37]: Still, some use cases that I've seen that have been quite novel to me have been in, funny enough, the very regulated space of healthcare: health tech, and helping doctors or psychologists or whoever it may be work faster and spend less time on admin than they otherwise would need to. Because we don't know this, because we're not doctors. Right. Or at least as far as I know, you're not a doctor.
Paul van der Boor [00:18:07]: I'm a doctor in the wrong kind of engineering, so not very useful.
Demetrios [00:18:11]: All right, doctor, I didn't know that. That's cool. Now, the thing about it is, let's just take the doctor example: doctors can spend a lot of their time on admin work, whether that is for insurance stuff or just writing down what they think the patient has, where they think there are maybe pain points, what prescription they're giving and why, et cetera, et cetera. So there's the less agentic use case, where the doctor brings in their phone and records the conversation, and then it helps them fill out that form after. Or you can have it where an agent actually asks the doctor afterwards: what do you think about this? And the doctor talks to the agent like they would talk to their nurse or whoever.
Paul van der Boor [00:19:08]: Yeah, well, there's one company in the group, it's called Corti, and as I mentioned previously, we invest in lots of companies, many of them AI companies. Corti is one of them, in the healthcare space. And what these guys have done: they were in the space already for some time, but this is a good example of a super smart, very capable, AI-native team building on all of these waves of new capabilities that come out, including voice. And one of the important parts of their proposition is helping healthcare workers, doctors, but also others that work in emergency departments, picking up phones and so on, to better assist whoever's on the other side of the call. And they do that by basically giving them an interface where, during the call, their models listen in and help the healthcare worker ask the right questions. It starts to code the call and says: this is likely an emergency of this and this type, you may want to ask such and such questions.
Paul van der Boor [00:20:16]: So they help them basically triage and navigate down the tree of possible options.
Demetrios [00:20:20]: Next best action type of thing.
Paul van der Boor [00:20:22]: Exactly, but in a very assistive way, to help the folks as they go through the calls. And a lot of that is live, in real time. So they've had to build a lot of their own technology. Of course, in the healthcare space you can't afford to make mistakes, so you need to make sure these models are accurate. So they have a fairly sophisticated setup to evaluate the accuracy of the models, but they also had to do a lot of, let's say, their own training and deep model development to be able to extract healthcare-specific terms. And that is in fact helping thousands of folks around the world, as they put this in front of healthcare workers on a daily basis.
Demetrios [00:21:08]: Yeah, and that one's fascinating because it's not necessarily that full pipeline that we talked about. It is voice in and then agents helping, but you don't have voice out. So it still is agentic workflows, it still is using voice AI, but it is not necessarily everything we talked about. And I like that you gave that example, because it's a bit more creative thinking.
Paul van der Boor [00:21:28]: Yeah. You actually bring up another topic which, again, isn't specific to voice, but matters especially as we think about more sophisticated agents: the ability to proactively reach out to you at the right time. Right now all of these systems, whether voice or not, let's say these agentic systems, are basically dependent on you coming in with a question or task, and if you're lucky, maybe the agent will ask you one or two clarifying questions before it starts executing the task. Because most of the time they're just greedy executioners. They just go and they execute, as...
Demetrios [00:22:03]: ...we learned in the last episode: it can't say no.
Paul van der Boor [00:22:06]: It can't say no. When you are, let's say, in a process, you're trying to do a task, in this case file your taxes, which may take you hours. You want the ability for an assistive system to come in at the right moment in time and say, hey, shouldn't you think about this? Or haven't you forgotten about that? And that's almost like another prediction: should it intervene? Should it proactively reach out to you? Hey, haven't you forgotten to also include this form or whatever? And of course, voice is just one way to connect with you. But it's not the only one. Right.
Demetrios [00:22:45]: One that I have not seen in the voice space, and that I know we've made predictions around, is that there are going to be agents in the e-commerce area. But I want my personal agent to go and do things for me, especially in the voice realm, where I'm stuck on the phone with Vodafone, for example. And it is so painful, because they're passing me around to different agents, actual human agents, but each human agent doesn't know what I'm calling for. And I would like to be equipped to send my agent out to do my bidding for me and figure out my phone bill. Why did I get charged extra? I can tell the agent: go call Vodafone, figure out why I got charged extra.
Demetrios [00:23:29]: And so it's almost like this personal agent that can sit on the phone and interact with all these companies. Especially because the majority of the companies that you are faced with on a day to day basis have some really bad voice systems. It's not agents, it's just robotic call menus, and you have to interact with that. And so it would be really nice if a voice agent could do that for me.
Paul van der Boor [00:23:56]: I completely agree. It's kind of an asymmetric reality: the big companies' call centers that you're trying to reach have their agents first, IVR and then something a little bit more sophisticated, while we are still stuck spending our own human time.
Demetrios [00:24:12]: I know, exactly.
Paul van der Boor [00:24:12]: So we should have our own calling agent. Yeah. And by the way, I don't think that's that hard to implement, to build or test. You know, we'll have Floris talk about some of the tests that we've done where we've used this interface, where basically you send out an agent that you interact with through voice to do things for you in an e-commerce space. Go order me some food, go buy me this, go find me this. So I think you'll find 2025 to be a good year for you to hopefully send these agents out on behalf of yourself.
Demetrios [00:24:50]: Yeah, well, speaking of Floris, let's bring him in now and let's have a conversation on what he has found while creating different voice agents. I know he's played extensively with the Real Time API. I also know he's had some insights, because again, this is like, let's pull back the curtains. You've banged your head against the wall so many times; how can we help other developers not do that same thing? Welcome back, dude. Talk to me about voice and the differences that you've experienced between working with voice AI agents versus text.
Floris Fok [00:25:27]: Okay. Yeah. So it was a journey. I can remember the first time I heard about the audio agents going live, I was like, oh, we can just copy-paste the prompt we have for a certain agent.
Demetrios [00:25:42]: It's easy.
Floris Fok [00:25:42]: We put it in the Real Time API and it still works. Yeah, that was kind of the dream I had, and it wasn't true.
Demetrios [00:25:49]: It didn't happen, huh?
Floris Fok [00:25:50]: No, it didn't happen. And yeah, so there were a few things. I think the first one was, of course, that if you type something and I type something, it's exactly the same.
Demetrios [00:26:05]: Yeah.
Floris Fok [00:26:06]: But the moment I'm talking, the moment I'm speaking that sentence out loud and you're speaking that sentence out loud, they're quite different.
Demetrios [00:26:16]: That's true. Well, just in the pauses or in the intonations, the way that I pronounce words, there's so many variables there.
Floris Fok [00:26:25]: Yeah, yeah. And they add up. It's an error that can propagate: if at the beginning of the sentence something was unclear, and then another word is unclear, that whole paragraph you're saying to the agent just becomes super vague to it. Which is a dimension that is really hard to tackle right from the start.
Demetrios [00:26:48]: Well, it's funny that you mention there are different dimensions that you have to be thinking about, because I know you said to me at some point, you have so many different events that fire when you're working with agents, voice agents specifically. Right. It's not just, oh hey, there's some text and now we interpret that text. There are so many different pieces, so maybe you can map out some of those pieces.
Floris Fok [00:27:14]: When we have text, you know, we send the messages, and the messages are really structured: we have the tools and the messages, and it's like, the user says this, the assistant says this. But if we move to real time, this all becomes asynchronous. You're connected over a websocket, and you're not saying, hey, these are the messages, now give me one response. You're constantly...
Floris Fok [00:27:41]: You're getting responses all the time, and you can send responses all the time. So knowing exactly how that is processed and saved in the session, as they call it, of the Real Time API is really important, because then you also know how you can manipulate it. So a funny thought experiment: what if I would now continue to talk and you would interrupt me halfway?
Demetrios [00:28:04]: Yeah.
Floris Fok [00:28:04]: I would still remember what I wanted to say to you. With the old system, where it was speech to text, then the LLM, then text to speech, the LLM would have in its memory the part it never actually got to tell you. The Real Time API changed this, because they have the interruption event (not sure that's exactly the right word): they have this event saying the speaker interrupted you, and they trim the output from the LLM's perspective as well. So the LLM knows where it got interrupted and knows that you don't know that part. So if you ask about something, it won't say, oh, I told you that last time, when actually that was in the trimmed part.
Floris Fok [00:28:48]: So that's a whole new way of working with that session. And they have a few nice visuals on the API reference page, so for the viewers, I would recommend looking at those instead of me trying to visualize it with voice.
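The session mechanics Floris describes map onto the Realtime API's documented websocket events. Below is a minimal sketch of the interruption handling: when the server signals that the user started speaking over the assistant, the client stops playback and sends a `conversation.item.truncate` event so the session memory keeps only what the user actually heard. The playback bookkeeping (`played_ms`) is elided and illustrative:

```python
# Sketch of an asynchronous Realtime API session: one websocket, events
# flowing both ways, and an explicit truncate on interruption so the model
# "forgets" audio the user never heard.
import asyncio
import json
import os

import websockets  # pip install websockets

URL = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview"
HEADERS = {
    "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
    "OpenAI-Beta": "realtime=v1",
}

async def run() -> None:
    # Note: the keyword is additional_headers on websockets >= 14.
    async with websockets.connect(URL, extra_headers=HEADERS) as ws:
        current_item, played_ms = None, 0
        async for raw in ws:
            event = json.loads(raw)
            if event["type"] == "response.audio.delta":
                current_item = event["item_id"]
                # ...play the chunk and keep played_ms up to date here...
            elif event["type"] == "input_audio_buffer.speech_started":
                # The user talked over the assistant: stop playback and trim
                # the assistant's turn to what was actually heard, so the
                # session memory matches the user's experience.
                if current_item is not None:
                    await ws.send(json.dumps({
                        "type": "conversation.item.truncate",
                        "item_id": current_item,
                        "content_index": 0,
                        "audio_end_ms": played_ms,
                    }))

asyncio.run(run())
```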
Demetrios [00:29:05]: You can say things a certain way, you can say them in certain accents. So you have all of these different vectors that you pull on, but you also have the latency requirements that you have in any tech, and you also have the piece around. Like, is what is being said actually relevant to the question I asked?
Floris Fok [00:29:31]: Yeah, yeah. And one of these parameters within the Real Time API which is quite difficult is the turn detection. So if I'm talking now, there are some pauses which are natural, which are just like, you know, give me room to think, or because I can't find the next word.
Demetrios [00:29:51]: Yeah.
Floris Fok [00:29:52]: But others are like, I'm done talking.
Demetrios [00:29:54]: Yeah.
Floris Fok [00:29:55]: If I'm done talking, you know, you want this agent to talk back. And that's what turn detection is doing. It wants to have the least amount of silence, but still leave room for you to have these thinking moments.
Demetrios [00:30:14]: Yeah. Not interrupt you.
Floris Fok [00:30:15]: But this is highly personal.
Demetrios [00:30:17]: Yeah.
Floris Fok [00:30:17]: So if I would now make a voice agent for me, I would tune these parameters so that my way of talking is perfectly aligned.
Demetrios [00:30:27]: Interesting.
Floris Fok [00:30:27]: And then I give you this agent and I say it's so incredible, it will never interrupt you. And you start talking to it and it just starts to interrupt you, because your style is different.
Demetrios [00:30:36]: Exactly. I have long pauses, thinking or whatever, and the agent doesn't realize that. And so you have to tune it in a different way. But speaking of which, I had a friend who is building agents, and he was saying that a lot of times when the user says something, the agent will think the user is trying to take a turn. So normally you have an agent that will be talking, and if it were a human, I would be nodding along and maybe I would say, yeah, okay. And my friend was saying that a lot of times the agent will stop talking because it thinks, oh, you want to talk, I just heard you say something.
Floris Fok [00:31:19]: This is one of those parameters you can change. You can define what counts as talking: if you talk for longer than X amount of milliseconds, that's considered talking. So a short "yeah" can be detected as a back-channel rather than something you want to say. It's all these different things, and like I said, highly personal.
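The knobs Floris is alluding to correspond to the Realtime API's server-side voice activity detection, configured with a `session.update` event. The values below are illustrative of tuning for a slow, pause-heavy speaker: a longer silence window so thinking pauses don't end the turn, and a higher threshold so short back-channel sounds ("yeah, okay") don't register as the user taking over.

```python
# Illustrative turn-detection tuning via the Realtime API's server VAD.
import json

session_update = {
    "type": "session.update",
    "session": {
        "turn_detection": {
            "type": "server_vad",
            "threshold": 0.6,            # how confident the VAD must be that this is speech
            "prefix_padding_ms": 300,    # audio kept from just before speech starts
            "silence_duration_ms": 900,  # pause length that ends the user's turn
        }
    },
}

# Sent over an already-open Realtime websocket:
# await ws.send(json.dumps(session_update))
print(json.dumps(session_update, indent=2))
```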
Demetrios [00:31:41]: Yeah. Actually, another insight from my friend who's building these agents: he said that he found the product worked better when they made the agent a little less good, in a way. If you make the agent worse, not as good as it could possibly be, then humans are more sympathetic and more patient with it. They also talk slower and enunciate their words more, and that's much better for him and his product. If you instead make the agent as good as it is currently capable of being on these different vectors, you mistakenly have the customer thinking that it is better than it is, and you get a worse result.
Floris Fok [00:32:37]: That's an amazing view. You have this mirroring aspect that people have naturally, where you kind of adjust to the way the other person is talking. That's actually a really cool one. I've not heard that before, so I'm going to write that down for later.
Demetrios [00:32:52]: So make it worse. Yeah, tell me if making it worse works.
Floris Fok [00:32:55]: Amazing.
Demetrios [00:32:56]: Now, I know you guys did some really wild stuff with evals to see what was working, what wasn't working. Can you talk to me about the custom evals that you did for the voice agents?
Floris Fok [00:33:07]: Yeah. So it was pretty cool. You cannot just launch this, you know; you cannot just say, hey, put it out in the wild, we have an assistant that works over the phone, good luck. So we needed a way to know the boundaries. We talked about it earlier: knowing where it fails also kind of breeds confidence in deploying it.
Floris Fok [00:33:30]: And since the Real Time API was a month old, there was no testing framework. So we came up with this crazy idea: okay, why don't we use the Real Time API to test the Real Time API? Because we had dozens of experiments where we were able to make the Real Time API move in all these weird directions: talk with a German accent, or talk really slow or super fast, with lots of emotion. Yeah, exactly: you were able to prompt it to talk differently. So we built an entire system where we would randomize different characters with different properties, saying, you know, talk fast or slow, talk with an accent or not, British or American, but also be sad, or be happy.
Floris Fok [00:34:24]: And yeah, we had a wild range. We went all out, because we figured that if we can automate this, then we can just test whatever. That's the amazing thing about automation: you will do much more the moment you automate it. So we tried this, and we just ran hundreds and hundreds of experiments saying the same stuff and then seeing how the phone agent would react. It needed to order pizza, but in all these different styles. And the fun thing was that the result of that experiment was that we saw that Italian and Spanish were mix and match. The moment there was Italian or Spanish, it would just swap over to one another.
Floris Fok [00:35:02]: Like it didn't really understand that there was a different language, which was really funny.
Demetrios [00:35:06]: It's the hybrid language.
Floris Fok [00:35:08]: Yeah. I make the same mistakes sometimes, you know. Like, amigo, that's Italian, right? So it's good to see that these smart models also struggle with that. And it's the same where the German accent would sometimes convince it to talk German back. So with the Italian accent it was fine.
Floris Fok [00:35:31]: But then, you know, when you had the German one, it would just start talking back German.
Demetrios [00:35:36]: The model would respond in German sometimes.
Floris Fok [00:35:38]: Yeah. So we saw that accents could indeed trigger it to cross that language barrier. And the final thing, which I discussed with the turn detection: that is something we saw with the slow and fast speaking. The moment we had slow speaking, it would randomly trigger the turn detection. So people who were talking really slow were really difficult to handle consistently, because if the pause was just a few milliseconds too long, it would blow the experiment. Those were all the things we were seeing by generating this synthetic data set. So that was quite an innovative approach, and since this whole thing was a month old, we were probably the first doing this, and it was pretty exciting.
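A sketch of the "Real Time API testing the Real Time API" idea: randomize synthetic caller personas across accent, speaking speed, and mood, then run each persona against the phone agent and score the call. The persona grid and the `run_call` hook are hypothetical stand-ins for the team's actual harness:

```python
# Persona-randomizing eval sketch: one Realtime session plays a synthetic
# caller, the other is the phone agent under test.
import itertools
import random

ACCENTS = ["British", "American", "German", "Italian", "Spanish"]
SPEEDS = ["very slowly, with long pauses", "at a normal pace", "very fast"]
MOODS = ["cheerful", "sad", "impatient"]

def caller_prompt(accent: str, speed: str, mood: str) -> str:
    return (f"You are a customer ordering a pizza over the phone. "
            f"Speak with a {accent} accent, {speed}, and sound {mood}. "
            f"Never break character.")

def build_suite(n: int = 100, seed: int = 42) -> list[str]:
    # Sample n random personas from the full accent/speed/mood grid.
    random.seed(seed)
    grid = list(itertools.product(ACCENTS, SPEEDS, MOODS))
    return [caller_prompt(*random.choice(grid)) for _ in range(n)]

for prompt in build_suite(3):
    print(prompt)
    # result = run_call(agent_session, synthetic_caller=prompt)  # hypothetical hook
```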
Demetrios [00:36:25]: And do you think that turn detection will become more personalized later on and be able to adapt in real time? How do you see that problem being solved?
Floris Fok [00:36:38]: Yeah, it's a difficult one. I think, first of all, people just need to learn how to work with the normal settings. That's the quickest approach. But I do see the opportunity for companies, because in the Real Time API you can also say: I do my own turn detection. You don't have to do it on the OpenAI side; OpenAI's is just one turn detection model. So it would be amazing if the open source community would now work on a turn detection model that is adaptive. Because if we can clone a voice in 10 seconds, why can't we detect the turn-taking style of a person in 10 seconds?
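Bringing your own turn detection looks roughly like this with the Realtime API: disable the server-side VAD in the session, then, whenever your own detector decides a turn has ended, commit the buffered audio and explicitly request a response. The client events below are the documented ones; the detector itself is left abstract:

```python
# Sketch of client-side turn detection with the Realtime API.
import json

# 1) Turn off server VAD: the API now waits until you say a turn has ended.
disable_server_vad = {
    "type": "session.update",
    "session": {"turn_detection": None},
}

# 2) When your own detector fires, commit the audio buffered so far as the
#    user's turn and explicitly ask the model to respond.
def end_of_turn_events() -> list[str]:
    return [
        json.dumps({"type": "input_audio_buffer.commit"}),
        json.dumps({"type": "response.create"}),
    ]
```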
Demetrios [00:37:22]: 100%. Well, yeah, speaking of open source, there is a model that came out and I think you were playing with it a little bit. Right.
Floris Fok [00:37:29]: Which model was it? Yeah, Kokoro.
Demetrios [00:37:32]: Yeah.
Floris Fok [00:37:33]: Kokoro. I'm not sure if I'm pronouncing that right; I'm not good with names. But this open source model, it blew me away. And it was not, you know... we've seen ElevenLabs: great quality, amazing work.
Demetrios [00:37:53]: For the text to speech aspect. Right?
Floris Fok [00:37:56]: Yeah. So this is text to speech.
Demetrios [00:37:57]: Yeah.
Floris Fok [00:37:58]: It's not real time, but it deserves a shout out in this episode because it's really amazing. They managed to get really good performance from 82 million parameters.
Demetrios [00:38:16]: Wow.
Floris Fok [00:38:17]: So all these models were in the billions.
Demetrios [00:38:18]: Yeah.
Floris Fok [00:38:19]: And these guys just say, okay, we have a model here where, if I run it on my MacBook, I get five times real-time speed out of this text to speech. So it can generate speech five times faster than it takes to play it. Just on a MacBook.
Demetrios [00:38:38]: Yeah.
Floris Fok [00:38:38]: You know, and they claim to reach over 100 times if you have a proper GPU. And that's liberating, because you're basically democratizing text to speech: everyone can now download that model and generate speech.
Demetrios [00:38:57]: Start playing with it for free.
Floris Fok [00:38:59]: And that's a big move. So I'm wondering if ElevenLabs is sweating at this time.
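For anyone who wants to try it locally, here is a minimal sketch assuming the community `kokoro` package and the `KPipeline` interface shown on the Kokoro-82M model card; the voice name and 24 kHz sample rate are taken from its examples, not verified beyond that:

```python
# Local text-to-speech with the 82M-parameter Kokoro model.
# pip install kokoro soundfile
import soundfile as sf
from kokoro import KPipeline

pipeline = KPipeline(lang_code="a")  # 'a' selects American English
text = "Open weights mean anyone can generate speech on a laptop."

# The pipeline yields (graphemes, phonemes, audio) per text segment.
for i, (graphemes, phonemes, audio) in enumerate(pipeline(text, voice="af_heart")):
    sf.write(f"segment_{i}.wav", audio, 24000)  # Kokoro outputs 24 kHz audio
```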
Demetrios [00:39:06]: Before we wrap up, I want to talk a little bit about how you think about flows with voice AI agents, specifically comparing them to something like the Toqan web app or Slack bot that you can interact with. We heard from Paul that it's really difficult to do all this in real time; the latency starts to get really high and it's not quite feasible yet.
Floris Fok [00:39:34]: Yeah. So feasibility is a big word, but there are definitely things you need to rethink. I still remember the first time I had the agent running on my laptop with some tools, and I was like, I'm so amazing, I made this voice agent work, it will definitely be great. And I started talking to it and it called the tool. I was really happy. But then it just went silent, and it was awkward.
Floris Fok [00:40:05]: I was like, oh, yeah, because you...
Demetrios [00:40:09]: ...don't know if it's working. You might have known because you were looking at logs and you were seeing it. But if it was me on the other end of a phone call, I wouldn't have known what's going on. Are you still there? What's going on?
Floris Fok [00:40:21]: Yeah, so it would just stop talking because it was executing the tool, because that was the logical next step to take, because I had prompted it to do that: if the user says this, call the tool. It was waiting on the tool output, and then it was reading out the tool output, like, hey, I just got this tool response and it's saying this. And I was like, oh, that's an awful experience. It was truly awful. And I was like, this is not going to work. And then we knew that we needed to rethink these things.
Floris Fok [00:40:51]: So there are a few tricks you can do. If a model calls a tool, going back to text, it can do text and a tool call. Audio is the same: it can do audio and a tool call. So you can prompt it to say, if you call a tool, tell the user that you are doing it, and maybe explain what the parameters were...
Floris Fok [00:41:17]: ...in the tool call. That can kind of buy you a bit of time. But the other side is, of course: make the tool really quick. Don't add tools that take long to a real time conversation. Then it's just better to offload it. Then you're just saying, hey, that thing you want me to do, it's being initiated.
Floris Fok [00:41:36]: You will get a progress message, or when it's done, I will send you an email or a text. So you separate the workflow. Don't do stuff in real time that cannot be handled at real time speeds. So that's the first realization we had: okay, we're only going to keep the fast tools, and we're going to offload these jobs. So the actual ordering of the food, that is a job, and we just say: we're going to order you food, that's it. You're not going to call the pizzeria and have them leave you on the line until the pizza is done.
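Floris's two tricks combined in one hypothetical sketch: the system prompt tells the model to narrate its tool calls so there is no dead air, and slow work returns an immediate acknowledgement while finishing out-of-band. All names here are illustrative, not the actual implementation:

```python
# Sketch: (1) narrate tool calls to avoid awkward silence, (2) keep only
# fast tools synchronous and offload slow jobs.
import threading

SYSTEM_PROMPT = (
    "When you call a tool, first say out loud what you are doing and which "
    "details you are using, so the caller never hears unexplained silence."
)

def place_order_tool(order: dict) -> str:
    # Don't keep the caller on the line until the pizza is done: kick off
    # the slow job in the background and answer instantly.
    threading.Thread(target=process_order, args=(order,), daemon=True).start()
    return "Order initiated. You'll get a text message when it's confirmed."

def process_order(order: dict) -> None:
    ...  # hypothetical slow workflow: charge the card, notify the restaurant, send SMS
```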
Floris Fok [00:42:10]: Right. That would be really weird. So, yeah, it's kind of the same with real time voice. And a second one is that you need to rethink flows in a way that avoids spelling. So if I say, hey, my name is Floris, get me my medical record (to come back to the medical discussion), it can then spell my name in 20 different ways. It can detect a normal English word just fine, but it can be really awful at getting my spelling right.
Floris Fok [00:42:49]: And that's essential for the next step. So you can kind of work around that. Let's take more of an HR setting. It could ask me, hey, what team are you in? I'm like, hey, I'm in the AI team. And it says, okay, what's your name? And then it already has a shortlist of names; it has like 25 options. And then I say Floris. And then it's precise, because from all those names it knows which one looks the most like this spelling.
Floris Fok [00:43:18]: And then it can map it. So by redoing that workflow, you are overcoming the spelling issue, so it's not a limitation. So, coming back to where we started, is it viable? Some things are not viable as they are, but there are workarounds; we just need to be more creative. And I'm not sure if spelling will ever be a solved problem.
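The shortlist workaround reduces to constraining the answer space and fuzzy-matching against it, instead of trusting the model's free-form spelling. A sketch using only Python's standard library; the team roster is made up:

```python
# Map a possibly misspelled transcription onto a known shortlist of names.
import difflib

AI_TEAM = ["Floris", "Paul", "Demetrios", "Flavia", "Lorenzo"]

def resolve_name(heard: str, shortlist: list[str]) -> str | None:
    # Return the closest known name, or None if nothing is similar enough.
    matches = difflib.get_close_matches(heard, shortlist, n=1, cutoff=0.6)
    return matches[0] if matches else None

print(resolve_name("Flores", AI_TEAM))  # -> "Floris"
```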
Demetrios [00:43:45]: Yeah, well, you saying that brings up a really cool company that I've seen called Canonical AI. They basically do product analytics for voice AI agents, and one of the things they do is see where these voice agents are falling flat. It gives you, as the creator of the agent, the ability to click in and recognize: we have a blind spot when people start to say their names, or we have seen that, for some reason, on this flow a lot of users disengage after the agent comes back. And so you get to recognize, from the way people are interacting with your voice agents, where it's not working. Because it could be, again, that it took too long, or it's just giving the wrong answer, or it didn't respond that it was thinking. There are so many X factors, going back to the initial part of this conversation, and you want to try to find where these flows are breaking down, or where customers and users of the agents are not having the best experience, so that you can tweak the flow a little bit more.
Floris Fok [00:45:01]: Yeah, that's super valuable, you know, especially at this point where nobody really knows.
Demetrios [00:45:06]: Yeah.
Floris Fok [00:45:07]: Where these things break. The more insights, and especially real time insights, the better. If you're able to follow your agent in real time, you can react fast, and you know the pain points and you know how to improve.
Demetrios [00:45:20]: Yeah. Or think about if you're a call center and all of a sudden you have 20 or 50 or 500 calls that are hanging up after the same step. A call should really take a minute, but now the average is like 3 seconds for this call. You want to be alerted to that. It's real time insights to say, hey, something's going on here. It's like your Datadog or your Prometheus in software.
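The Datadog/Prometheus analogy reduces to baseline monitoring over call logs: flag any flow step where hang-ups suddenly spike. A toy sketch with an illustrative threshold and record format:

```python
# Flag call-flow steps with an abnormal hang-up rate.
from collections import defaultdict

def find_broken_steps(calls: list[dict], max_hangup_rate: float = 0.3) -> list[str]:
    totals, hangups = defaultdict(int), defaultdict(int)
    for call in calls:
        totals[call["last_step"]] += 1
        if call["hung_up"]:
            hangups[call["last_step"]] += 1
    return [step for step, n in totals.items() if hangups[step] / n > max_hangup_rate]

calls = [
    {"last_step": "name_spelling", "hung_up": True},
    {"last_step": "name_spelling", "hung_up": True},
    {"last_step": "order_confirm", "hung_up": False},
]
print(find_broken_steps(calls))  # -> ['name_spelling']
```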
Floris Fok [00:45:48]: Yeah, it just needs to be there if you're building software products, having these real time insights. I think it's an amazing company. One of the insights we actually gained from looking at these logs was that it was overconfident. If people would mispronounce or mis-say something, it would never ask the question: can you repeat that? And that's such a normal thing for humans to do if they're uncertain. It's exactly the same as with text: they're overconfident and they would never give you a no. So I'm really looking forward to the first audio model that will tell me: can you repeat that?
Demetrios [00:46:37]: Or it just says, huh? Yeah, that's...
Floris Fok [00:46:40]: ...the next frontier, you know: the "can you repeat that" model.
Demetrios [00:46:43]: Oh, that would be good. All right, that's it for today. A huge shout out to the Prosus team for their transparency, because it is rare that you get companies talking about their failures, especially companies that are this big in the AI sector, and really helping the rest of us learn what they had to go through, so painfully sometimes. A mention that they are hiring: so if you want to do cool stuff with the team that we just talked to, and even more, hit them up. We'll leave a link in the show notes. And if you're a founder looking for a great design partner on your journey, then I highly encourage you to get in touch.
Demetrios [00:47:28]: We'll leave all the links for all that good stuff in the show notes.