Voice Agent Use Cases
Speakers

Anurag Beniwal is an Agent Engineering Lead at ElevenLabs. He previously served as an ML and Engineering Leader at Amazon, where he managed a research and engineering organization of approximately 45 people focused on worldwide customer support. In this role, he led efforts to scale agent-based systems to handle nearly one billion customer interactions, with a particular emphasis on post-training, evaluation frameworks for support-oriented agentic tasks, and the design of agent orchestration systems.
Earlier in his career, he was the technical lead for real-time recommendation systems for Amazon Style, as well as for recommender systems powering Prime Wardrobe. He was also a founding scientist for AWS SageMaker, contributing to the development of one of Amazon’s core machine learning platforms.

At the moment, Demetrios is immersing himself in machine learning by interviewing experts from around the world in the weekly MLOps.community meetups. He is constantly learning and engaging in new activities to get uncomfortable and learn from his mistakes, and he tries to bring creativity into every aspect of his life, whether that's analyzing the best paths forward, overcoming obstacles, or building Lego houses with his daughter.
SUMMARY
Anurag Beniwal (Member of Technical Staff at ElevenLabs) breaks down the real-world challenges of building voice agents—from latency, transcription accuracy, and turn-taking to the tradeoffs between cascaded systems and end-to-end speech models. The conversation explores why production systems rely on “constellations” of models, how to design for non-technical users (especially in customer support), and why voice unlocks richer context—but introduces far more complexity than chat. Ultimately, it’s a deep dive into making voice AI practical, reliable, and usable at scale.
TRANSCRIPT
Anurag Beniwal: [00:00:00] How do we make cascaded systems feel like speech-to-speech, but still have enough controls?
Anurag Beniwal: There's a challenge with controls: there are so many of them. If you have a developer team within a product-focused company, they can handle it. But what I often see with customers is that if you give too many controls and they don't know exactly how to set things up, it's like, oh, this just doesn't work.
Anurag Beniwal: Right? So I think there's always a tension between pre-configuration and giving more flexibility.
Demetrios: Mm-hmm.
Anurag Beniwal: But if you give more flexibility, then, for example, sometimes people ask why the latency is so high for speech generation. It's because we maintain a certain buffer size to make sure that
Anurag Beniwal: [00:01:00] the speech generation has enough context — to be prosody-aware, to have the right tone and maintain consistency. And the higher the buffer size, the higher the latency.
Demetrios: Yeah.
Anurag Beniwal: But now if you let them reduce the buffer size, the speech quality goes down, and then they start —
Demetrios: Complaining about it.
Anurag Beniwal: Complaining, right. So I think that's a general challenge: what is the right abstraction? We should also talk about the general abstraction of agents. Many of the companies building orchestration on the tech side — LangChain, for example — are building tools meant for software engineers.
Anurag Beniwal: You build agents on top of them. But look at customer support: support is more of a cost center for a company, which means you'll have fewer engineers there. The [00:02:00] real customer is the operations leader who manages the workforce and runs the end-to-end operation.
Anurag Beniwal: But the current interfaces of existing orchestration systems are not meant for them. And that's where we recently released a product, ElevenLabs for Support, where we define these interfaces so non-technical authors can define the behavior of agents.
Anurag Beniwal: Even before AI agents — I was at Amazon, and if you have, say, a hundred thousand support agents across the world, many of them are not full-time. They can come and leave at any point, which means there isn't really an opportunity to train them on specific support processes.
Anurag Beniwal: So customer support has always had these pages of documents called SOPs. Every time you call someone and they say, "give me a moment to look up your information," they're literally reading them: oh, this person is reaching out for a refund — what should I do? I should look up [00:03:00] information about Demetrios,
Anurag Beniwal: then check his eligibility for a refund, his past purchase behavior, and so on. That takes time, right? These SOPs were written by operations leaders or customer support managers for human agents to follow.
Anurag Beniwal: And the second mode of interaction between them: once the human agents are following those SOPs, sometimes they break the rules or are non-compliant, and the operations manager has mechanisms to listen to calls and
Anurag Beniwal: give them feedback. So my thought process is: if we need to make agents usable by non-technical authors — which is most of the population who would be using agents — and they still need autonomy over the behavior of the agent, what interfaces [00:04:00] do we need to build?
Anurag Beniwal: On customer support specifically: can I keep the interface the same? Instead of a human agent, there's an AI agent — but if I'm the support manager, I still define SOPs, just like I used to. Except now I can go very detailed and long, because I no longer have the human constraint of not being able to read long documents,
Anurag Beniwal: or large tables. So that part can change, but the regular interface stays the same. And second, just as I provided feedback to human agents, can I provide feedback to the AI agent? You gave wrong information there, or you broke this rule, or you didn't follow the SOP — or sometimes the SOP isn't clear enough, so the agent got confused.
Anurag Beniwal: Even humans get confused. How do we take that feedback and iterate on the specification and the procedure? I think [00:05:00] that's quite interesting: how do we keep the interface of interaction for domain experts — who may not be software engineers — the same as how they worked with human agents?
Anurag Beniwal: And if you were to do that, there's a lot more heavy lifting to be done — which means continual learning, a pretty hot topic right now: how do we make these agents improve based on natural-language feedback? So we can talk about some design patterns for that —
Anurag Beniwal: how we're doing it, how other companies are doing it.
Demetrios: Why do you feel like voice is harder than chat?
Anurag Beniwal: Yeah — even chat is not easy. Chat has not been solved, and I've spent more time doing chat than voice, to be honest. First, there are no real benchmarks for end-to-end dialogues.
Anurag Beniwal: Tau-bench is one, but it is so much easier than real-world situations. So [00:06:00] for example —
Demetrios: Why is that? Just because you have so many random sounds, or —
Anurag Beniwal: Exactly.
Demetrios: Different accents
Anurag Beniwal: Yes.
Demetrios: That type of thing.
Anurag Beniwal: Yes. So first, if we start with chat: dialogues are often multi-turn,
Anurag Beniwal: and the longer the dialogue, the higher the chance of failure. For chat, the challenge is how the agent recovers. You change your mind, you change your intent, you say something unclear and the agent misunderstands — what are its self-correcting abilities?
Anurag Beniwal: What are its abilities to ask the right follow-up question at the right time? Because if it's not the right follow-up question at the right time — people have this historical baggage of not trusting chat agents.
Anurag Beniwal: Which means that, since they don't always trust you, you have to be very precise about when to ask a follow-up question for [00:07:00] clarification, or when you think you misunderstood them and need to pivot. These variations are already present in the text stack — and now you add voice on top. Where are you sitting? Most people are not in a studio setting like this,
Anurag Beniwal: somewhere quiet. And another important factor: the share of the population that interacts with support purely through voice is vastly different between the US, Europe, and, say, India.
Anurag Beniwal: In India, probably 75% of customer conversations happen over voice,
Demetrios: Mm-hmm.
Anurag Beniwal: while that number could be around 30 to 35% in the US. So the difference is very stark. And when most people are calling from a phone, they're usually outside: there's background noise, there could be multiple speakers present.
Anurag Beniwal: How do you filter that out? And transcription — [00:08:00] people don't talk enough about transcription in voice agents. People want to hear the most beautiful voice from text-to-speech, but for reliability in specific domains like customer support, you need
Anurag Beniwal: very accurate transcription. For example, if you call me and I ask for your order number, your email, or your name, and I get any of that wrong — and I don't have mechanisms to correct it, or my transcription isn't accurate — there's no way the conversation will succeed.
Anurag Beniwal: It just fails.
Demetrios: Yeah.
Anurag Beniwal: So, like we discussed, chat agents have a few degrees of freedom where they can fail, and that increases significantly in voice because of transcription errors, background noise, multi-speaker diarization. And turn-taking is still an open problem.
Demetrios: Yeah. Because I heard, with [00:09:00] turn-taking, that you can't just set a number
Anurag Beniwal: Yes.
Demetrios: for how long you wait, or say that if X amount of time has gone by —
Anurag Beniwal: Mm-hmm.
Demetrios: oh, cool, now I can talk. So there's a probability.
Anurag Beniwal: Yes.
Demetrios: Is it a whole separate model that you're using?
Anurag Beniwal: Yes.
Demetrios: Tell me about that.
Anurag Beniwal: Yes. So you can have a separate turn-taking model, based on transformer architectures,
Anurag Beniwal: fine-tuned for the turn-taking task, where you collect data. The data should have enough variation — different accents, but also different pace and intonation. Say I speak very fast, but someone else takes longer pauses. So you need all those variations
Anurag Beniwal: across accents and languages.
Demetrios: Yeah,
Anurag Beniwal: I don't think those rich datasets exist today in voice. A large part of turn-taking has moved to these neural models, but I do feel — and we can talk [00:10:00] more about that — that the old-school ways of extracting voice features, like pitch and RMS energy,
Anurag Beniwal: and using simpler heuristics over them — I think that is still helpful.
Demetrios: Like what? In what way?
Anurag Beniwal: Yeah — using these features to determine when the user has stopped speaking. So largely, turn-taking models are transformer-based architectures, but I do see a hybrid as well.
Anurag Beniwal: Because if you look at turn-taking models, most of them take both speech and text as input. So imagine, in a cascaded world: you speak something, I transcribe it, then I pass the speech as well as the text. That already adds a bit to the latency.
Demetrios: Yeah.
Anurag Beniwal: But let's say, in some cases,
Anurag Beniwal: the features of your voice are strong enough — where these acoustic features tell me with very [00:11:00] high confidence that the user has just stopped speaking. Then I don't even need to wait until the transcription is over to detect the turn.
Anurag Beniwal: So the hybrid approach also helps in reducing end-to-end latency.
Demetrios: But when you're running those models, it's two separate models you're running, right?
Anurag Beniwal: Yes, yes. But the old-school acoustic feature extraction is fast compared to the neural models.
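To make the hybrid concrete, here is a minimal sketch (not ElevenLabs' actual implementation — all function names, thresholds, and frame sizes here are illustrative): a cheap acoustic end-of-turn heuristic based on trailing RMS energy, which short-circuits the slower neural turn-taking model only at high confidence.

```python
import math

def frame_rms(samples, frame_size=160):
    """Split audio into fixed-size frames and compute RMS energy per frame."""
    frames = [samples[i:i + frame_size] for i in range(0, len(samples), frame_size)]
    return [math.sqrt(sum(s * s for s in f) / len(f)) for f in frames if f]

def acoustic_end_of_turn(samples, silence_rms=0.01, min_silent_frames=30):
    """Cheap heuristic: declare end of turn if all trailing frames fall below
    an energy threshold. Returns (decision, confidence in [0, 1])."""
    rms = frame_rms(samples)
    tail = rms[-min_silent_frames:]
    if len(tail) < min_silent_frames:
        return False, 0.0  # not enough audio to decide
    confidence = sum(1 for r in tail if r < silence_rms) / min_silent_frames
    return confidence == 1.0, confidence

def detect_turn_end(samples, neural_model, override_confidence=0.95):
    """Hybrid: trust the fast acoustic heuristic only at high confidence;
    otherwise fall back to the slower neural turn-taking model."""
    ended, conf = acoustic_end_of_turn(samples)
    if ended and conf >= override_confidence:
        return True  # skip transcription + the neural model entirely
    return neural_model(samples)
```

A real system would run this on streaming frames and feed both the audio and the partial transcript to the neural model, as discussed above; the sketch only shows the override logic.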
Demetrios: And you have something set up so that the acoustic model is the one that always overrides?
Demetrios: So if that says, hey, we're done —
Anurag Beniwal: At a very high confidence threshold, right? Otherwise you still use the neural models for a large number of use cases. So that's one pattern to somewhat reduce latency through turn-taking, and to have more confidence by using both approaches. But it's still an
Anurag Beniwal: open problem. There are not enough good open-source benchmarks for turn-taking [00:12:00] — something I want to work on once I have more time. And there are not enough good open-source models. Pipecat has done some work with the Smart Turn v2 model, where I think they created an interface to publicly collect conversation data and then used it to fine-tune a turn-taking model.
Anurag Beniwal: But it's often very hard to get datasets from the real distribution, because when I'm talking to customer support, more often than not you won't have access to that data. So many of the conversations these models are trained on are either synthetic, or collected by data companies that probably pay users to produce them.
Demetrios: Yeah.
Anurag Beniwal: But when the two of us are talking and I know what the purpose is, that could change things — subconsciously. So I feel that's a very interesting [00:13:00] area: how do you collect this more natural data and use it to tune your turn-taking? It also depends a lot on the domain —
Anurag Beniwal: the distribution of how people talk in customer support, versus the same person talking to a friend,
Demetrios: Yep.
Anurag Beniwal: versus talking to a customer support agent, versus talking to an AI agent — the pitch and intonation can change. So I feel that fine-tuning on domain data is very useful.
Anurag Beniwal: And I think that's still possible, because when we're working with an enterprise customer, we can say: okay, we can have turn-taking specifically for your customer support. There's a certain pattern — say most of your customers are in the US, there's a general [00:14:00] pattern to those conversations, the noise levels, versus someone sitting in India or in London.
Anurag Beniwal: So fine-tuning definitely helps on a specific dataset. And since we don't have great benchmarks here, creating internal company benchmarks on real data is a lot more powerful than just benchmarking on open source, because those datasets have the challenges I was just talking about.
Anurag Beniwal: Yeah.
Demetrios: All right, y'all — this episode is brought to you by the good folks at MLflow, the open-source platform for developers who want to build production-ready AI applications. Enhance your AI applications with end-to-end AI observability, all in a single integrated platform. With MLflow's GenAI capabilities,
Demetrios: you can evaluate AI applications using a suite of built-in or custom judges, visualize trace executions and agentic analytics, and continuously monitor evaluations,
Demetrios: all while tracking [00:15:00] every run in one place. Ship better agents faster — you know, that's the name of the game. Get started at mlflow.org.
Demetrios: We had Zach in here from Sierra a few months ago,
Anurag Beniwal: Mm-hmm.
Demetrios: and he was talking about how they're running a constellation of models. You've seen that? That's very common,
Anurag Beniwal: Yes.
Demetrios: in your eyes? Can you break down how many models, and what that constellation looks like?
Anurag Beniwal: Yeah, that is true. And we can talk about constellations of models in different contexts.
Anurag Beniwal: One example was turn-taking: like I said, one could be a simpler, acoustic-features-based model, and the second a neural one. Now you go to LLMs — that's the big challenge. The easiest thing about chat is that there's a higher tolerance for latency: you're chatting with something and there's the "dot dot dot"
Anurag Beniwal: [00:16:00] thinking indicator. But in voice, if that's complete silence, people will just drop the call. And I think there are historical reasons for that: voice agents or voice chatbots were traditionally so dumb that people generally don't trust them.
Demetrios: Yeah.
Anurag Beniwal: So now, if you put in two seconds of pause, they'll be like, okay, I don't think I'll get anything out of this, and they'll disconnect. But there's also a need from our customers to use the best model: if Claude Opus is there, or Sonnet is there, they still want to use that as the intelligence layer.
Anurag Beniwal: How do you do that? I feel there's something in between the dumb cascaded way of just chaining things together and speech-to-speech, where you come up with more intelligent patterns of cascaded architecture, or whatever you call it. For example, when the two of us are talking and you ask me a hard question
Anurag Beniwal: that requires me to think more, I would not go completely silent after you ask. I would [00:17:00] often ask you to give me a moment, and if it's taking more time, I'd say, hey Demetrios, I apologize, but give me five more seconds. And that's better.
Anurag Beniwal: But even better: say you ask me something about voice agents, a very open-ended question where I'd have to go do deep research to give you a comprehensive answer. Can I keep the conversation going? Can I start with something very basic about voice agents — what they are,
Anurag Beniwal: where they're used? And for that, do I need the most expensive model? No. So what we often see, working with our customers but also in general: I can have a smaller model for these more cursory turns — high-level chitchat, turns that require low intelligence —
Anurag Beniwal: [00:18:00] where these models keep you engaged while delegating the more intense task in the background to a more expensive model.
Demetrios: Ah-huh
Anurag Beniwal: So now there's no complete silence: I have a model that's trying to be helpful, and at some point, if it doesn't have enough context for what you're asking, it can ask you to wait — still keeping the conversation going while the background tool or more expensive model returns results.
Anurag Beniwal: This is a very common pattern, and you can have more than two models in it. So that's one very common use case for multiple models. The second thing I've seen, in customer support specifically: at Amazon, we went back and forth multiple times. 2024 was very confusing — you'd see these great model releases from GPT or Claude, and when you chat with them they're [00:19:00] just amazing.
Anurag Beniwal: But on task-oriented dialogue tasks like customer support — where I don't really care about the overall general intelligence of the model, I want higher reliability on those tasks — the prompt-based approach or a ReAct-style looping method would often fail.
Anurag Beniwal: So then we would have a constellation of models again. For certain tasks, say tool calling, Claude was excellent, even in late 2024 — so I'd rely on Haiku. But response generation is something where I want more control, because I'm building models for Amazon's customer support, which spans so many languages and impacts probably two billion-plus interactions.
Anurag Beniwal: I can't just let it loose on Haiku and prompt the heck out of it. So [00:20:00] there I'd use a smaller fine-tuned model for response generation, and for tool calling I'd get the best from Haiku. That's another reason to use a constellation of models.
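The "keep the caller engaged while delegating" pattern Anurag describes can be sketched with plain asyncio. The model calls below are stand-in coroutines (the names and latencies are illustrative, not a real API): the expensive model is kicked off immediately, and a cheap model fills the silence while it runs.

```python
import asyncio

async def fast_model(query: str) -> str:
    """Stand-in for a small, low-latency model: a cursory but immediate answer."""
    await asyncio.sleep(0.01)  # simulated fast inference
    return f"Here's a quick overview of {query}..."

async def slow_model(query: str) -> str:
    """Stand-in for an expensive model or deep-research tool call."""
    await asyncio.sleep(0.2)  # simulated slow inference / API latency
    return f"Detailed answer about {query}."

async def answer(query: str, speak) -> None:
    # Kick off the expensive call immediately, in the background.
    deep = asyncio.create_task(slow_model(query))
    # Keep the user engaged with the cheap model while it runs.
    speak(await fast_model(query))
    # By the time the filler has been spoken, surface the richer answer.
    speak(await deep)

spoken = []
asyncio.run(answer("voice agents", spoken.append))
# spoken[0] is the quick filler, spoken[1] the detailed answer
```

In a real voice pipeline, `speak` would stream text to TTS, and the fast model's filler turn buys time for the slow call without leaving dead air on the line.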
Demetrios: Huh. I wonder how many is too many? If there ever is too many.
Anurag Beniwal: Yes — and too many shouldn't be too many, in my opinion, because it gets very hard to update and improve them. That was a challenge early on, and there, too, we went back and forth. In early 2023 we were like, okay,
Anurag Beniwal: LLMs are here — and we were already using a constellation pre-LLM, because generative capabilities weren't there. For intent detection you'd fine-tune a BERT-style classifier, and for response generation you really didn't use generative models in production back then.
Anurag Beniwal: We had these ranking models where you still use the understanding ability of BERT-style models, but keep the [00:21:00] dialogue contained so it doesn't say anything off-script. For each domain — say, refunds — from our historical transcripts we'd only see around 500 types of possible responses.
Anurag Beniwal: Given that it's a constrained domain, the long tail is fine: we can transfer those to a human. And we'd ask the model to understand what the customer is saying, then pick the right template and fill it in — "Hi Demetrios" — putting in the placeholders and choosing the template.
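A toy sketch of that pre-LLM pattern — the keyword scorer here is a hypothetical stand-in for the fine-tuned BERT-style ranker Anurag mentions, and the templates and slot names are invented for illustration: pick one of a fixed set of response templates, then fill its placeholders.

```python
# Each domain template pairs a trigger keyword (toy stand-in for a fine-tuned
# BERT-style ranker) with a canned response containing placeholders.
TEMPLATES = {
    "refund_status": ("refund", "Hi {name}, your refund for order {order_id} is {status}."),
    "greeting": ("hello", "Hi {name}, how can I help you today?"),
}

def pick_template(utterance: str) -> str:
    """Score each template against the utterance and return the best key."""
    text = utterance.lower()
    return max(TEMPLATES, key=lambda key: int(TEMPLATES[key][0] in text))

def respond(utterance: str, **slots: str) -> str:
    """Fill the chosen template's placeholders with customer-specific values."""
    _, template = TEMPLATES[pick_template(utterance)]
    return template.format(**slots)
```

The appeal of this design is exactly what the transcript describes: the model can never say anything outside the approved template set, and anything in the long tail falls through to a human.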
Anurag Beniwal: So back then we were using a bunch of models together. And then we were like, okay, LLMs are here — they're intelligent, they should be able to do multiple things together. So our first iteration was fine-tuning Flan and Mistral on our own data for multiple tasks: action prediction —
Anurag Beniwal: these days it's called tool calling — response generation, intent detection, [00:22:00] and dialogue state tracking, which is still very common. We had bespoke models for each of those and tried to consolidate them. And then you see interference: if the model regresses on one task,
Anurag Beniwal: then I add, say, more data to the post-training mix for that task, and it ends up impacting my other tasks as well.
Demetrios: Mm-hmm.
Anurag Beniwal: It was whack-a-mole. And of course you still have to release stuff in production. So we went from a constellation of models to a multitask model, back to a constellation — but a smaller constellation.
Anurag Beniwal: And in many production systems I haven't seen one model doing everything, especially for more complex situations. If it's a small-to-medium-sized business — largely a Q&A use case, a couple of tool [00:23:00] calls — then one model is just fine.
Demetrios: Yeah.
Anurag Beniwal: Yeah.
Demetrios: But then,
Demetrios: for the various models you're seeing in the constellation — what are the different use cases for the different models? You mentioned deep research.
Anurag Beniwal: Mm-hmm.
Demetrios: I imagine, okay, you kick off some tool calling — maybe you want to go grab some data from somewhere.
Anurag Beniwal: Yes, yes.
Demetrios: Those are all pretty obvious ones. Are there others that I'm missing?
Anurag Beniwal: It's not just deep research. Deep research is the most common use case, but sometimes there are just APIs that take longer. With voice agents, if you see demos from different companies, they'll say, oh, we do everything in 500 milliseconds.
Anurag Beniwal: But in the real world, when [00:24:00] you're deploying in production, the APIs themselves take a couple of seconds, or sometimes one second. So it's not a constellation of models per se, but you often also use these models to hide the latency of those more expensive tool calls or legacy APIs.
Anurag Beniwal: Another one is retrieval. If you use retrieval today, for example: I still need to come up with a quick answer, so I'll have a fast retriever — largely keyword-based or grep-based — to give you a quick answer, and then a more expensive retrieval
Anurag Beniwal: that does its job, more like deep research.
Demetrios: And so this is something where you would experience it as the voice agent saying, okay, let me talk to you about that more.
Anurag Beniwal: Yes,
Demetrios: yes. And then boom, by the time it finishes that
Anurag Beniwal: — phrase, yes, exactly. Or I'll give you something even more useful than that.
Anurag Beniwal: If you ask me "what are voice agents?", smaller models are still fine: if I just grep my documentation and search for [00:25:00] "voice agents," I'll still have a reasonable answer to give you. So I don't even need to ask you to wait — I can give you something, and then I'm like, hey Demetrios, do you want me to go deeper?
Anurag Beniwal: But I was already going deeper in the background, fetching the response for you. So by the time you say yes, I've had enough time to use the response from the deeper research or more expensive retrieval.
Demetrios: And that's where you're masking
Anurag Beniwal: the latency. Yes — that's latency masking, yeah.
Demetrios: Yeah, that's awesome. Okay, so what else? I know there are so many things.
Anurag Beniwal: There are other examples of having a constellation — for example, in support. Most of my examples are going to be from support, because that's —
Demetrios: you're knee-deep in it these days.
Anurag Beniwal: So it makes sense.
Anurag Beniwal: For example, if you just do the cookie-cutter way of building agents — put in a knowledge base and some tools — at best it's going to [00:26:00] match the performance of level-one customer support, which is usually not the best quality in a given organization. Every organization has level one and level two.
Anurag Beniwal: Level two are probably more like permanent employees, domain experts. And no matter what company it is, it's so hard to beat their performance, even today.
Demetrios: Yeah.
Anurag Beniwal: So one of the ideas was: okay, how do we make these agents smarter, get closer to those level-two agents?
Anurag Beniwal: Can we learn from their trajectories? How are they pivoting in a conversation? How are they making decisions on exceptions? What are they saying to customers that's reassuring enough that the customer doesn't disconnect? And how do we fetch that context
Anurag Beniwal: and give it as an example for our agent to follow?
Demetrios: Mm-hmm.
Anurag Beniwal: And for that, we had another model that processes them — because these conversations look [00:27:00] very similar, but based on just one or two values, say the age of the account or the location of the account, the policies could be very different.
Anurag Beniwal: So the model can make mistakes. So we had another sub-agent to fetch these high-quality conversations, within a similar context, from the expert agents; give them to an LLM to see if there's something to learn from them; and pass that as context to the main LLM —
Anurag Beniwal: our main agent. That was another multi-agent setting where it was helpful.
Demetrios: So you're doing this dynamically? It's not like training.
Anurag Beniwal: We did end up training that, too. Because the conversation trajectories that you retrieve —
Anurag Beniwal: The top five are so similar, right When you retrieve them that you would even, we would not be able to find which is the correct one and which is [00:28:00] according to the policy and the SOP, right? These domains are very compliance or SOP heavy, right? So you still need like some kind of a. Like fine tuning over that to make it understand the difference between the correct and the wrong one based on some minor details, let's say, um, um, the, the, the loyalty status of the customer or the tier of the customer or their credit usage or, uh, whatnot, right?
Demetrios: Yeah.
Anurag Beniwal: And for that, I think you still need to do some kind of preference tuning to make the model understand those very specific differences.
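One way to build data for the preference tuning Anurag mentions is to pair near-duplicate trajectories where one follows the SOP and the other violates it on a minor attribute. This is a sketch, not his team's actual pipeline; the SOP check, overlap threshold, and field names are all assumptions.

```python
def lexical_overlap(a: str, b: str) -> float:
    """Crude near-duplicate check; production would use embeddings."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(len(wa | wb), 1)

def build_preference_pairs(trajectories: list[dict], follows_sop) -> list[dict]:
    """Pair near-identical conversations where one follows the SOP and the
    other gets it wrong on a minor attribute (tier, location, credit usage).
    The (chosen, rejected) pairs feed a DPO-style preference-tuning run, so
    the model learns the discrimination out-of-the-box models lack."""
    pairs = []
    for i, a in enumerate(trajectories):
        for b in trajectories[i + 1:]:
            if lexical_overlap(a["text"], b["text"]) < 0.7:
                continue  # only near-duplicates teach the fine distinction
            ok_a, ok_b = follows_sop(a), follows_sop(b)
            if ok_a and not ok_b:
                pairs.append({"chosen": a, "rejected": b})
            elif ok_b and not ok_a:
                pairs.append({"chosen": b, "rejected": a})
    return pairs
```

The point of requiring high overlap is that the resulting pairs differ only in the detail that matters, which is exactly the signal preference tuning needs.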
Demetrios: Yeah. Those variables are only understood by the subject matter expert.
Anurag Beniwal: Yes.
Anurag Beniwal: Exactly. Because two of our conversations could be very similar, but then in one of the conversations, probably the location is different, or some attribute is different.
Anurag Beniwal: And that makes all the difference, right? Whether I should be giving you a refund or not giving you a refund. Mm-hmm. So if you don't have something that [00:29:00] learns to have this sort of discrimination, out of the box I haven't seen anything working there.
Demetrios: Yeah. That discretion is so important.
Anurag Beniwal: Yeah. Yeah. And for some reason, these level two agents have that discretion,
Demetrios: Uhhuh.
Anurag Beniwal: Right. And I think it's still an open problem: how do we learn from their behavior and get the current state of agents, which are at best level one, closer to that? It is also about context capture.
Anurag Beniwal: Right. These days, agents largely work on internal knowledge bases and tools. But the judgment of these experts? A lot of that information is in their heads,
Demetrios: but do you think they know why?
Anurag Beniwal: Sorry?
Demetrios: Like, do you think the experts know why? And could they explain, if they were asked to?
Anurag Beniwal: Yes. And that is something I'm very passionate about that how do we capture that data?
Demetrios: Yeah.
Anurag Beniwal: Right. Like, I don't think they log it anywhere. But with voice becoming so much better, can you [00:30:00] ask them to just talk through their reasoning for doing that, and make them log at least some of it?
Demetrios: Yeah.
Anurag Beniwal: Right. And then close that gap through that information, but also by observing their behavior and emulating it. I think that's a very important area of applied research: how do we get this information that's in their heads, or in their experience?
Anurag Beniwal: Either by observing their past trajectories or, like you said, explicitly asking them, but making it easier for them to say it, because agent handle time is a big constraint. If I'm a human support agent, especially an expert one, the amount of money I make depends on how many calls I take in a day, right?
Anurag Beniwal: Which means if I'm spending more time recording my rationale, I'm making less money. So I think the incentive structure has to change there, or there has to be a more natural way of capturing that feedback.
Demetrios: Yeah. You [00:31:00] almost have to give them as much time, or money, compensation for doing this, as if they were taking calls.
Anurag Beniwal: Yes.
Anurag Beniwal: and there's also like a huge tension within the operations org in a company and the org that is building AI agents to make those operations more efficient. Because, uh, for operations org, if I want to like, use the expertise of these humans, uh, uh, it increases their handle time.
Anurag Beniwal: Increases their cost, at least in the short term.
Demetrios: Yeah.
Anurag Beniwal: While without this information, I would not be able to make my agents better. So there is quite a back and forth I've seen across different places.
Demetrios: Yeah. There's a little bit of a catch 22.
Anurag Beniwal: Yes, yes.
Demetrios: You know what always fascinates me with voice is how much more willing I am to give extra context. When I interact with someone through voice, I will say and explain much more [00:32:00] than if it's with text.
Anurag Beniwal: Yes. I feel that is the real unlock, in my opinion, more than people saying at a very high level, oh, voice is the most natural way of communication. Of course it is.
Anurag Beniwal: But how does it help? It manifests in exactly what you said. Because it is a more natural way, I can speak more, I can provide more context, I can correct you more. While my patience level when I'm just chatting is so much lower: I will type the minimum number of words, and that is going to be ambiguous.
Anurag Beniwal: The agent is going to get confused. But in voice, as long as you don't frustrate the heck out of me, I am happy to provide more context to you. Yeah.
Demetrios: Yeah. And naturally, I would say, I provide it without being prompted extra.
Anurag Beniwal: Mm-hmm.
Demetrios: Because as I'm thinking through it, I'm also talking through it.
Anurag Beniwal: Yes, yes.
Demetrios: Normally when I write with text, I'm thinking [00:33:00] through something, but I'll think so much faster and then I'll type out the summary.
Anurag Beniwal: Yes, exactly. And the same goes for other voice applications, right? This is just chat, or a real customer conversation, but
Anurag Beniwal: think about the amount of time in, let's say, customer support that these associates spend writing an email. That is a significant part of the handle time, sometimes as much as the time they spend helping you out. They have to do some compliance stuff, add information, frame an email, right?
Anurag Beniwal: And LLMs have already helped quite a bit with that part, and now with voice it becomes so much easier. If you have to frame the email, you just talk. Or if an LLM pre-populates the email based on a summary of the conversation and you need to correct it, you can just say it, right?
Anurag Beniwal: Yeah.
Demetrios: Yeah. And so the thing that I've noticed is that I am [00:34:00] starting to get really lazy when it comes to typing. Yeah. And I think that's the wow moment that a lot of people have with Wispr Flow.
Anurag Beniwal: Yes.
Demetrios: Is when you can say, oh no, I didn't mean to say that. Yes,
Anurag Beniwal: yes.
Demetrios: Or when it automatically formats with bullet points.
Anurag Beniwal: Yes.
Demetrios: That's incredible. Because normally dictation is really shitty, and when you say, oh, I didn't mean to say that, it will write...
Anurag Beniwal: It'll write that. Exactly. And that is where I feel there's another area which I haven't personally explored, but which can get important: fusing the ASR and the LLM layer.
Anurag Beniwal: Yeah. The reason most people don't like that in production is that they want full control over their LLM. But for applications like this, where I don't want it to transcribe verbatim everything I'm saying,
Demetrios: Mm-hmm.
Anurag Beniwal: it needs to be intelligent enough to have the context that, okay, whenever I was back-channeling, saying mm-hmm,
Anurag Beniwal: or [00:35:00] saying stop for a moment, or said something irrelevant in between, it shouldn't be transcribing that. Right. Yeah. And those models are getting there. You can literally fine-tune an open source model that can take multimodal input, speech and text both, to be able to
Anurag Beniwal: basically fuse the first two layers, right? Mm-hmm. I know there are companies and some customers doing that to manage latencies, but for these personal use cases, where you don't want transcription to be completely verbatim, you also need to understand the context of the situation
Anurag Beniwal: before you, let's say, send an email. So you can basically bake the ASR and the intelligence layer together.
Demetrios: Oh, I hadn't realized that, but it makes a lot of sense.
Anurag Beniwal: I think there's this voice AI startup called Ultravox. They used to be called Fixie.ai.
Demetrios: yeah.
Anurag Beniwal: They, they follow this approach.
Demetrios: Really?
Anurag Beniwal: Yes, yes.
Demetrios: Oh, I didn't realize.
Anurag Beniwal: They're the first one, from what I know.
Demetrios: That's hard to believe, because I remember back in the day, they were trying to do like [00:36:00] a LangChain competitor.
Anurag Beniwal: Yeah. No, I think now, from what I know, they're a voice agents platform. Things have changed very fast these last couple of years.
Demetrios: Oh.
Anurag Beniwal: But their general thesis was that somewhere in between speech to speech and completely cascaded, you can fuse two components, right? So make it less cascaded. Uhhuh. But again, there's a spectrum; then you lose some control, right?
Demetrios: Yeah.
Demetrios: Yeah.
Anurag Beniwal: Yeah.
Demetrios: What other use cases do you see a lot of with the voice agents?
Anurag Beniwal: Customer support, inbound sales, outbound sales. There are companies building on top of us for outbound sales as well. Mm-hmm. And inbound is much easier, because if you look at the sales cycle, right,
Anurag Beniwal: there are different types of customers. Some are strategic, key accounts, right? Where maybe you want to deploy your best salesperson.
Demetrios: right?
Anurag Beniwal: Yeah. But then there's a long tail, [00:37:00] right? And maybe eventually you might still want a human there, at least with the current capabilities. But the initial basic conversations, right?
Anurag Beniwal: Like, if you look at your LinkedIn and I look at my LinkedIn, there would be GTM folks from some startups reaching out: oh, we are launching this product, are you interested, happy to set up a call with our founder, or whatever, right? I think that initial call or initial email, that is a very common use case mm-hmm.
Anurag Beniwal: with voice agents or text agents today, where the first part of the outbound sales cycle, which is a long tail of customers, can be handled by that.
Demetrios: Don't you think that would piss people off, though? If they get on a call and then it's like, oh, this is just a fucking voice agent. Get outta here.
Anurag Beniwal: Yes. And that is why I feel inbound is more common. Inbound is when the customer is reaching out to you: I want to use your product. Yeah. And you want to do some kind of lead qualification: okay, are they the right customer?
Demetrios: instead of having them fill [00:38:00] out a form, you
Anurag Beniwal: have
Demetrios: to get them on a call and you
Anurag Beniwal: just walk them through.
Anurag Beniwal: So that is a very clear one. Outbound, I know a few startups, uh, which are specifically companies built for outbound sales agents. They're building over us and they seem to be doing quite well. Yeah. So, uh, I'm pretty sure there are, like, they have found a segment of customers who are just fine with this early calls being with voice agents, especially with how natural the voice has become.
Anurag Beniwal: Yeah. And to be honest, at least for me, it was never about talking to a human versus talking to an AI. I'm fine talking to an AI if I have enough trust that it's going to solve my problem, right? Mm-hmm. I was just speaking to a startup founder who is solving this for plumbing,
Anurag Beniwal: and I've seen that problem myself, and with friends. You're looking for a plumbing contractor today. You go on the website, you submit your information, you ask for a quote, and then a couple of days later someone reviews it, gets back to you, [00:39:00] asks you for the best time to talk. But I want that thing to be fixed today.
Anurag Beniwal: Right.
Demetrios: Especially plumbing. Plumbing can have some serious problems.
Anurag Beniwal: Exactly. Right. So I don't think that works. For plumbing, for example, it is so much easier now. And that's a great use case, in my opinion, where I can basically get a quote immediately. Mm-hmm. And I think plumbing also has this problem of seasonality.
Anurag Beniwal: So imagine a layer above plumbing. In many cases I'm not fixated on a contractor; I want a contractor who is charging me a reasonable rate and is available today. Right? So something that can basically make calls
Demetrios: Yeah.
Anurag Beniwal: to contractors on my behalf, tells them what I'm willing to pay for it, and then gets me something tonight or tomorrow.
Anurag Beniwal: Right.
Demetrios: Yeah. It reminds me of these folks who will run Google ads
Anurag Beniwal: Yes. [00:40:00]
Demetrios: For like landscaping businesses. Yes. But they don't have a landscaping business. What they have is when somebody clicks on there and says, I want to get a quote.
Anurag Beniwal: Yes, yes.
Demetrios: They then take that lead and they'll sell it to an actual landscaping business.
Anurag Beniwal: Yes. And then booking and reservation is a super common use case. Mm-hmm. If you're a restaurant, and I'm an authenticated user, and you get a call from me to book a reservation, as long as you're sure that I'm the person calling, do you care if it's an AI agent on my behalf or it's me?
Anurag Beniwal: No.
Demetrios: No. So booking from
Anurag Beniwal: the customer side? Yes. Yes.
Demetrios: Yeah. Booking any appointment. Yes,
Anurag Beniwal: any appointment, which means in future, I do imagine to have like. Uh, once the whole authentication layer, uh, the payment part is, is taken care of, and I know, um, uh, uh, companies like Visa and others are building a layer for that, uh, for like payment authentication with agents where I do have ability to authorize [00:41:00] my agent.
Anurag Beniwal: To take X, y, Z actions on my behavior. It could be shopping, it could be, uh, making a reservation. And I think once, and I think it'll come like very soon, and then I can have my representation, right, that can make these calls, book these appointments. Because today what happens is like. Uh, I call, especially in asf, uh, reservation in restaurant, they don't pick up phones.
Anurag Beniwal: Mm-hmm. So I think this solves both sides of the problem, right? Yeah. The, the user side, but also, uh, the vendor side or the restaurant side, because there's one person who is also serving, but also taking calls. They'll not pick up the calls. Right. Uh, but if you have, it's, it's very simple, right? Like, I know how many people are reserved for today, how many sort of spots I have.
Anurag Beniwal: Um, um, and then, you know, I could, I could basically, um, book a reservation automatically, or worst case if that information is not there yet, right. I could still like ask the person to call you back, right? Mm-hmm. So that I don't have to call it back right when they're available. So [00:42:00] in both cases, I feel like this significant like time we are spending and we are like so unproductive doing this, that this should already be solved now.
Demetrios: Yeah, dude, this is great. What else do you wanna talk about? What are some other things? I know you mentioned you rattled off a ton at the beginning.
Anurag Beniwal: Mm-hmm.
Demetrios: And I can't remember everything 'cause you said so much.
Anurag Beniwal: I think the whole cascaded versus speech-to-speech debate, right? That is the most common question that comes from customers.
Anurag Beniwal: Especially customers who are more abstracted out, at leadership levels, who don't want to get into the details, but they clearly feel that, okay, when they talk to speech-to-speech APIs, it sounds cool. Why do we need this complex orchestration? Yeah. Why can't one model do everything, right?
Anurag Beniwal: And um, um, yeah. So, uh, I was talking to, uh, uh, in, in an informal setting with some open AI researcher and, uh, I think they're still sort [00:43:00] of, um, working on finding how do you sort of keep the speech quality good. Uh, because the model has to be smaller to to, to to, to have the latency to be fast, right? Yeah.
Anurag Beniwal: It has to be fast. But then, then we already know, uh, based on scaling laws right now, right. A bigger model is better. Yeah. Right. So I don't know, like I think that tension would always exist, but more than that, even if these, uh, speech to speech models keep getting better, right? There's always, even with the latest, let's say, um, um, uh, release of Opus, right?
Anurag Beniwal: It'll not satisfy me a hundred percent of the time if I'm enterprise, which means like in some cases I need to like, have ability to fix it. Right? Uh, sometimes it could just be prompt or maybe adding, uh, like tools or maybe using a different model, right? Like we talked about constellation of model. Yeah.
Anurag Beniwal: So I think this whole like cascaded architecture gives me that flexibility. As long as we figure out, like we, we talked about that. There are different layers of users for [00:44:00] voice agents today. There are like non-technical business owners who want to automate their like, concierge experience, right? Like, uh, taking bookings and stuff like that.
Anurag Beniwal: Um, what is the interface for them? They want us to do all the heavy lifting.
Demetrios: Yeah.
Anurag Beniwal: No customization needed, right? Yeah. Then there's a layer below, which is customers who are small enough but have a few engineers, and they want a little bit more control. There, we do offer some controls, but
Anurag Beniwal: I think it's still up for debate what the right amount of control is. Because if you expose all the control, then there are just too many knobs to change, and if you don't set them right, your experience is bad, and then you churn. Right?
Demetrios: It's witchcraft.
Anurag Beniwal: It's witchcraft after that, right?
Anurag Beniwal: Yeah. So I feel the biggest challenge of cascaded is not that you can't make it work like real-time speech to speech in terms of voice quality, and with better intelligence; it's the number of knobs and the effort it [00:45:00] takes to tune them. And that is why we see more and more that customers want to control the tech stack and offload the voice orchestration to us.
Anurag Beniwal: That is getting very common. And I think there's a spectrum between pure dumb cascaded, which is just chaining the models, and speech to speech. And there are different patterns, like we spoke about. You can fuse the ASR plus your LLM layer. You can fuse the ASR plus the turn-taking layer, right?
Demetrios: Mm-hmm.
Anurag Beniwal: Uh, they can be the same models.
Demetrios: Oh yeah.
Anurag Beniwal: They don't need to be different, right? Then, mm-hmm, you can look at the TTS. A couple of weeks back, we released these expressive-mode TTS models. So far, TTS models were already so good, but they were not trained or fine-tuned for long conversations. And by long conversations,
Anurag Beniwal: I mean five to ten turns, which is pretty standard [00:46:00] in customer support and other areas. These models were just trained on: okay, here is a text with these emotional cues, like laughing or whispering, generate speech based on it, right? Yeah. But the real-world application of these models, if you look at a more enterprise setting, which is not on the creative side of things, is:
Anurag Beniwal: I want to make sure that when I generate speech, it has the context of my conversation so far, right? Mm-hmm. All of a sudden it should not start laughing when the conversation was going quite seriously, right? Yeah. And so far in this pipeline, all of that was being offloaded onto the LLM, where the LLM is responsible for generating a sentence in a way that the TTS will not laugh:
Anurag Beniwal: either using the right emotion tags, or keeping the language such that the TTS model knows this is not a happy moment. Right. But now, conversational [00:47:00] TTS is nothing but training it in a different way, where you pass it the final utterance it needs to say, but also the historical context of the conversation.
Anurag Beniwal: Mm-hmm. And that's super important in domains like customer support, booking, and sales, for example. So there are some of these design patterns: TTS models getting more conversational, taking the full context of the conversation rather than just the immediate turn. And the ASR models and the LLM, you can fuse them together.
Anurag Beniwal: Um, having like this, uh, foreground and background approach of model that you want to use, like. Uh, expensive model, the best model, but also want to keep it natural. So have this like small model that is called, you can call it a masking model, um, that is not as dumb to just say that. Like, say that, give me a moment every time.
Anurag Beniwal: It's smarter than that, but less [00:48:00] smarter than the big model. And so I feel like there are enough design patterns in between, right. Which could basically make it sound as good as speech to speech, um, uh, but with so much more flexibility that when an enterprise customer says that, okay, why I am seeing hallucination at X, Y, Z, or, you know, my, uh, uh, there's an outage on like, you know, Google, Gemini, and I want to like now change the model.
Anurag Beniwal: Now how do you do that with like, speech to speech? Like, what if I'm using a speech to speech API from Gemini? Or, uh, open AI and there's an outage, what do I do there?
Demetrios: You're screwed.
Anurag Beniwal: I'm screwed, right. So I feel, and that's my personal opinion, that for some time now, this somewhere-in-between is the right approach, right?
Anurag Beniwal: Yeah. Not the old cascaded way of just chaining them; that doesn't work. You will have so many edge cases that it would end up sounding as dumb as the old chatbots, despite improvements in ASR and TTS. [00:49:00] So, based on your appetite: do you have a small enough use case that you can use an open source model and don't need to rely on, let's say, the cloud?
Anurag Beniwal: In that case, can you fuse the ASR and the LLM? Or maybe you fuse the foreground model, the smaller one, with the ASR, and the bigger model stays separate and is accessed as a tool. Mm-hmm. Right? That is one way of doing it.
Anurag Beniwal: Second, how do you have a more conversational TTS? You can potentially also combine the turn-taking with this. So there are multiple permutations, and I feel the right
Anurag Beniwal: developer platform, and we are also improving on that side, would, at least for the core developer layer, give these options, making it easier to implement and experiment with these design patterns. Mm-hmm. I feel most platforms have a long way to go there.
Demetrios: [00:50:00] Yeah.
Anurag Beniwal: Right. Yeah.
Demetrios: I haven't heard that from anyone.
Anurag Beniwal: Yes, yes. I feel that's because it is non-trivial, right? We talked about these ideas, but it's easier said than done.
Anurag Beniwal: Because I do implement them, and in this whole async event loop, making all these events tie together at exactly the right time is non-trivial. So I think the dev platform layer needs to provide this flexibility. Then, once people experiment with these permutations, they can choose within the spectrum, not just between cascaded and speech to speech.
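The async event loop Anurag alludes to, with a foreground masking model covering for the big background model, can be sketched with `asyncio`. The sleeps stand in for real model calls; all names and latencies are hypothetical.

```python
import asyncio

async def masking_model(query: str) -> str:
    """Small, fast 'foreground' model: a context-aware acknowledgement,
    smarter than a canned 'give me a moment' but much cheaper to run."""
    await asyncio.sleep(0.01)   # stands in for a ~100 ms small-model call
    return f"Let me pull up the {query.split()[0]} details for you."

async def main_model(query: str) -> str:
    """Large 'background' model doing the real reasoning (slow)."""
    await asyncio.sleep(0.05)   # stands in for a multi-second big-model call
    return f"Here is the full answer about {query}."

async def respond(query: str) -> list[str]:
    """Start the big model, speak the masking line while it is still
    thinking, then speak the real answer when it lands: one way a cascaded
    pipeline can feel as responsive as speech to speech."""
    big = asyncio.create_task(main_model(query))
    spoken = [await masking_model(query)]   # fills the silence immediately
    spoken.append(await big)
    return spoken
```

In a real pipeline, each returned string would be streamed to the TTS layer as it becomes available, rather than collected into a list.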
Anurag Beniwal: There are options in between that give them enough trade-offs between, let's say, [00:51:00] control versus quality, and reliability versus quality.
