MLOps Community

Voice model performance optimization // Madison Kanna // Agents in Production 2025

Posted Aug 04, 2025 | Views 57
# Agents in Production
# Voice Agents
# Baseten.co

SPEAKER

Madison Kanna
Growth Engineer @ Baseten

Madison Kanna is a developer, educator, and content creator working on LLM inference at Baseten. When she’s not writing code, she’s tweeting too much, experimenting with AI, and exploring San Francisco.


SUMMARY

Whether you're transcribing a conversation or vocalizing an agent response, STT and TTS models have to run fast. But these modalities introduce new challenges in both runtime performance and network overhead. By optimizing open source models, you can achieve consistent low latencies for both transcription and speech synthesis. In this talk, we'll cover key optimization strategies for Whisper and Orpheus with a focus on real-time workloads, plus a couple of common mistakes to avoid to reduce network overhead.


TRANSCRIPT

Adam Becker [00:00:00]: [Music]

Madison Kanna [00:00:08]: Thanks so much for having us. We wanted to introduce ourselves first. So I'm Madison. I work at Baseten. I just moved to San Francisco five, six months ago. A bit more about me: I think I'm the last person in the world still using Vim as my text editor, and possibly the only woman still using it as well.

Kaushik Chatterjee [00:00:28]: Hey everyone, I'm Kaushik. I'm one of the forward deployed engineers at Baseten who specializes in the TTS and voice model stuff. Super excited to be here and give this talk about taking your voice models to the next level.

Madison Kanna [00:00:44]: Awesome. We're talking about voice model performance optimizations. Before we do that, we want to dive into two important metric categories for measuring the performance of these voice models. And the first one would be latency. So the speed of conversation essentially. So obviously in voice AI latency isn't just like this one number, and it can be the difference between a really natural sounding conversation with voice models and an awkward one. So these are just a couple of the things that we need to measure. Time to first token: how quickly does your system start processing after the user stops speaking? We also have end to end response time.

Madison Kanna [00:01:23]: So this measure is really the complete journey from the moment a user submits input to when they see a response back. And then we also have token generation speed for LLMs, basically measuring tokens per second. This measures how quickly your language model produces subsequent tokens after the first one. Then we also have a second really important metric category for measuring the performance of voice models, which is throughput. So throughput is telling you how many concurrent conversations your system can handle. And there's a couple of key measurements we really want to focus on here. So one of them is concurrent sessions per GPU. This measures how many simultaneous voice conversations a single GPU can handle effectively.

Madison Kanna [00:02:13]: We also have requests per second. This can measure how many individual API calls or processing requests your system handles per second. And then we also have queue depth and wait times. So this is measuring how many requests are waiting to be processed, while wait times measure how long they wait. And throughout all of this, when it comes to optimizing voice models, we have to keep in mind that throughput and latency are in constant tension. These are some trade-offs you have to keep in mind, where increasing batch size improves throughput but can hurt individual request latency. So these two things are really in tension, and you have to be intentional about the conflict between them. And I guess, Kaushik, do you want to take it away?
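As a concrete illustration of the latency metrics above, here is a minimal Python sketch of measuring time to first token, end-to-end time, and tokens per second over any streaming response; the generator at the bottom is a stand-in for a real model stream, not an actual endpoint.

```python
import time

def measure_stream(token_stream):
    """Measure time to first token, end-to-end time, and tokens/sec.

    `token_stream` is any iterator that yields tokens (or audio chunks) as a
    model produces them, e.g. the response iterator of a hypothetical
    streaming inference client.
    """
    start = time.perf_counter()
    ttft = None
    n_tokens = 0

    for _ in token_stream:
        if ttft is None:
            ttft = time.perf_counter() - start  # time to first token
        n_tokens += 1

    total = time.perf_counter() - start  # end-to-end response time
    # Token generation speed: tokens after the first one / time spent on them.
    tps = (n_tokens - 1) / (total - ttft) if n_tokens > 1 else 0.0
    return ttft, total, tps


if __name__ == "__main__":
    # Fake generator standing in for a real model stream.
    def fake_stream(n=50, delay=0.01):
        for _ in range(n):
            time.sleep(delay)
            yield "tok"

    ttft, total, tps = measure_stream(fake_stream())
    print(f"TTFT {ttft*1000:.1f} ms | total {total*1000:.1f} ms | {tps:.1f} tok/s")
```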

Kaushik Chatterjee [00:02:59]: Yeah, of course. Thank you, Madison, I think that was a great overview of the two main categories of metrics we use, both at Baseten and industry-wide, to measure performance. Now I'll dive into the specific optimizations you can do which really help in scaling voice models and productionizing them to fit your specific use cases. I think one of the most common, and honestly one of the easiest, is connection pooling. It's something we've seen a lot, especially as a forward deployed engineer: connection pooling can take you from above 500 milliseconds for the TTS model down to sub-300, sub-250. Essentially, connection pooling means that instead of having to send requests one at a time as you might with Python's requests library, you use asyncio with aiohttp and open up a session.
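As a rough sketch of that connection pooling pattern (the endpoint URL and JSON payload below are placeholders, not a real Baseten API), a single shared aiohttp session keeps a pool of open connections that every request reuses:

```python
import asyncio
import aiohttp

# Hypothetical TTS endpoint and payload -- substitute your model server's URL.
TTS_URL = "https://example.com/v1/tts"

async def synthesize(session: aiohttp.ClientSession, text: str) -> bytes:
    # Each call reuses a connection from the shared session's pool, so no
    # extra TCP/TLS handshake is paid per request.
    async with session.post(TTS_URL, json={"text": text}) as resp:
        resp.raise_for_status()
        return await resp.read()

async def main(texts):
    # One ClientSession for the life of the client = one connection pool.
    connector = aiohttp.TCPConnector(limit=20)  # up to 20 pooled connections
    async with aiohttp.ClientSession(connector=connector) as session:
        return await asyncio.gather(*(synthesize(session, t) for t in texts))

if __name__ == "__main__":
    clips = asyncio.run(main(["Hello there.", "How can I help you today?"]))
```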

Kaushik Chatterjee [00:03:54]: The logic behind this is that opening and closing connections is expensive as your traffic scales up. That's the handshake as well as the cleanup. It really adds up, especially as you scale up and handle more and more traffic. Connection pooling removes the overhead associated with these operations and makes it a lot easier to handle concurrency and handle a lot of requests at once. Moving on to the model-specific stuff, we highly recommend you use an inference engine for this. TensorRT-LLM is our engine of choice. It's from NVIDIA. You can think of it as a framework that lets you optimize transformer-based models.

Kaushik Chatterjee [00:04:32]: We'll be talking about Whisper from OpenAI as well as Orpheus from Canopy Labs. These are both transformer-based models. Luckily this means you can make the most of them using TensorRT-LLM. It gives you things like in-flight batching, quantization, speculative decoding and a bunch more to really get the most out of your models. Moving on to the infra-heavy stuff now, and once again this applies to any models you might have in your agentic pipeline. KV cache-aware routing: the KV cache enables efficient inference. It's a massive driver in reducing latency and TTFT for language models. Now we bring it over to TTS and STT models.

Kaushik Chatterjee [00:05:11]: So KV cache-aware routing, as the name suggests, routes requests to where the caches are to really minimize the latency as you scale up and add more and more replicas, so that even though you might have a bunch of different pods holding these caches, you'll always get the lowest latency with this routing. Finally, weight caching. Cold starts are often overlooked. This is the time taken to go from no replicas, your model sleeping, to waking it up and getting to the first replica. By bundling the weights into the container itself instead of having to download them separately from Hugging Face, you can cache these weights, and it can really save a bunch of time when you prevent this download from happening separately. And it's just a great way to make sure that the experience when you get started is as good as when your models are already running. Moving on to the specific optimizations now, I'll talk a bit about Whisper, which is OpenAI's speech-to-text model, and the specific things Baseten has done to optimize it. If you could go to the next slide, Madison. Yeah, thank you.
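A minimal sketch of the weight caching idea above, assuming a build-time script baked into the container image so the weights ship with it; the repo IDs and target paths are illustrative rather than Baseten's actual packaging:

```python
# bake_weights.py -- run at container *build* time (e.g. from a Dockerfile RUN
# step) so the weights ship inside the image instead of being downloaded from
# Hugging Face on every cold start. Repo IDs and paths are illustrative.
from huggingface_hub import snapshot_download

MODELS = {
    "openai/whisper-large-v3": "/models/whisper-large-v3",
    "canopylabs/orpheus-3b-0.1-ft": "/models/orpheus-3b",
}

for repo_id, local_dir in MODELS.items():
    snapshot_download(repo_id=repo_id, local_dir=local_dir)
```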

Kaushik Chatterjee [00:06:14]: One thing we did, we implemented a WebSocket version. You can think of WebSockets as an alternative to using a REST API with connection pooling. With WebSockets, the biggest advantage is that you get partial transcriptions: if you're talking and you want live captions as it goes along, WebSockets is a fantastic way to achieve this while maintaining low latency. We've seen sub-50 milliseconds for each audio chunk that's generated. Something else we did was move from the Python TensorRT-LLM runtime to the C++ runtime. This enables actual multithreading capabilities. Unfortunately, because of the global interpreter lock, the GIL, you can't actually achieve true multithreading with Python; by shifting to the C++-backed runtime we noticed about an 18% increase in speed alone.
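A rough sketch of what a WebSocket transcription client like the one described might look like; the endpoint URL, chunking scheme, and end-of-stream marker are assumptions, since the real protocol depends on how the Whisper deployment is exposed:

```python
import asyncio
import websockets  # pip install websockets

# The endpoint URL, chunk size, and end-of-stream marker below are assumptions;
# adapt them to whatever protocol your Whisper deployment actually exposes.
WS_URL = "wss://example.com/v1/transcribe"
CHUNK_BYTES = 3200  # ~100 ms of 16 kHz, 16-bit mono PCM

async def stream_audio(path: str) -> None:
    async with websockets.connect(WS_URL) as ws:

        async def send_audio():
            with open(path, "rb") as f:
                while chunk := f.read(CHUNK_BYTES):
                    await ws.send(chunk)      # raw audio chunk
                    await asyncio.sleep(0.1)  # pace roughly in real time
            await ws.send(b"")                # assumed end-of-stream signal

        async def print_partials():
            async for message in ws:          # partial transcripts as they land
                print("partial:", message)

        await asyncio.gather(send_audio(), print_partials())

if __name__ == "__main__":
    asyncio.run(stream_audio("sample.wav"))
```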

Kaushik Chatterjee [00:07:15]: Then finally, what we do with Whisper is dynamic batching. Instead of having to wait for a bunch of requests to come in to form a batch and send it out, we do this in flight. We continually send out the requests and we can batch them while they're in flight to really minimize any latency and improve the time to first byte for Whisper. Then Orpheus, which is a fairly new TTS model that has changed the game. It's a fantastic model. One thing we do is that we torch compile all models. I recommend this even outside the domain of just voice agents. Any language models, please torch compile them. This optimizes the model for the specific hardware you're using, for the kernels you're running on the GPU. Orpheus has two models.

Kaushik Chatterjee [00:07:48]: It has a Llama model and a SNAC codec model. We torch compile both of these to get the best performance for them on the hardware we're running them on. We quantize Orpheus as well as its KV cache to FP8. This really saves a lot of memory on the GPU if you can quantize it. This actually allows for more concurrency, because you have more memory to run more requests in parallel. If you want to improve throughput, you want to improve concurrency while reducing any potential queue times, and quantization is a very easy way to achieve those goals.
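A minimal torch compile sketch along the lines described above, assuming a Hugging Face checkpoint for the Llama-style half of Orpheus; this is illustrative only, not the exact production setup, which the talk describes as running through TensorRT-LLM:

```python
import torch
from transformers import AutoModelForCausalLM

# Illustrative only: compile the Llama-style language-model half of the TTS
# pipeline so its kernels are specialized for the GPU it runs on. The repo id
# is an assumption, not necessarily the exact checkpoint used in production.
MODEL_ID = "canopylabs/orpheus-3b-0.1-ft"

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.float16)
model = model.to("cuda").eval()
model = torch.compile(model, mode="reduce-overhead")  # compiles on first call

# The SNAC audio codec can be wrapped the same way, e.g.:
# snac = torch.compile(snac_model)
```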

Kaushik Chatterjee [00:08:22]: Then finally, a fun little quirk of Orpheus: the base model, if you guys do use it, has about 600 milliseconds of silence at the beginning. This is just a byproduct of how the model was trained. You can just cut this out and you reduce the time to first audio byte, which is not necessarily equal to the time to first byte. Just by doing this, you can get better end-to-end latency and you just get a smoother experience. So once again, those are kind of the two main models we've seen being used. Very popular, really performant. And these are just the ways we recommend you get the most out of them.
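A minimal offline sketch of trimming that leading silence; in a streaming setup you would instead skip the equivalent number of leading samples before emitting the first chunk. The 600 ms figure comes from the talk and should be verified against your own checkpoint, and the 24 kHz sample rate is an assumption:

```python
import numpy as np

def trim_leading_silence(audio: np.ndarray, sample_rate: int, ms: int = 600) -> np.ndarray:
    """Drop a fixed amount of leading audio (the 600 ms default comes from the
    talk; verify it against your own checkpoint before hard-coding it)."""
    return audio[int(sample_rate * ms / 1000):]

# e.g. for 24 kHz mono output (sample rate assumed):
# trimmed = trim_leading_silence(samples, sample_rate=24_000)
```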

Kaushik Chatterjee [00:09:00]: Yeah, and I think that wraps up our talk. We discussed the metrics, which Madison did a fantastic job going over, and these are the easiest ways to actually measure your performance, like an apples-to-apples comparison for your models on one provider versus another. And then the optimizations we do and what we recommend you all do if you plan to scale up voice models: to make sure that A, you can productionize them effectively so they can handle the traffic you're seeing at a production level, and B, that you can make the most out of these models. And there are some really easy things you can do just to improve the performance, as we've discussed. So, yeah, once again, thank you so much.

Madison Kanna [00:09:36]: Thank you.

Adam Becker [00:09:38]: Kaushik and Madison, thank you very much. Can I ask you to flip to a couple of slides back just in case anybody wants to take a screenshot? I think not this one. Maybe one before. One before.

Madison Kanna [00:09:54]: Probably this one that Kaushik made. Getting the most.

Adam Becker [00:09:57]: Yeah, that is the power slide right there. Okay. How difficult is it to build these systems? Connection pooling, the KV cache-aware routing, what should I have in mind? And are there good solutions out there that can get people started?

Kaushik Chatterjee [00:10:17]: I think a good philosophy is kind of just Occam's razor for a lot of things: the simplest solution, simple fixes, can go a long way. Connection pooling is just something we've seen a lot of people overlook. They're sending one request at a time. It's like, man, why is my latency so high? Why is my TTFB so high? It's because you're not sending these concurrent requests at once; they're waiting one after the other. Connection pooling just makes REST APIs a lot faster because you mitigate the overhead of making multiple handshakes and establishing multiple connections. It's like, hey, I've already made 10, 20 connections.

Kaushik Chatterjee [00:10:51]: The handshake is established, I don't have to set it up again, let me just keep reusing these connections I've established. And we've noticed that in production this can lead to immense speedups: connection pooling with REST APIs, or using WebSockets. These are the two kind of client-side improvements you can make to really get benefits. KV cache-aware routing, of course, gets into the infrastructure. This is not an easy thing to set up, especially if you aren't using a provider like Baseten or Fireworks or Together, which give you these capabilities along with the engineers. If you're using AWS or Azure, it's a lot tougher.

Kaushik Chatterjee [00:11:27]: You don't have the control you would want to have for this. But at least on Baseten, we've set up KV cache-aware routing so that whenever you make a request, if it contains the same prefixes as previous requests, we send you to the pod that served those, so that it has a KV cache hit. Because once again, even with cache-aware routing, one thing we do is that when we spin up a new replica, we copy over the cache to improve the hit rate. But this pre-filling of the cache does take some time. With KV cache-aware routing, even when you do scale up to handle more traffic, you never take a latency hit.
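To illustrate the prefix-affinity idea only (a toy sketch, not Baseten's actual router): hashing the leading portion of the prompt and mapping it to a replica sends repeat prefixes to the same pod, which is what makes KV cache hits likely:

```python
import hashlib
from typing import Sequence

def pick_replica(prompt: str, replicas: Sequence[str], prefix_chars: int = 256) -> str:
    """Route requests that share a leading prefix to the same replica, so they
    are more likely to hit that replica's KV cache."""
    key = hashlib.sha256(prompt[:prefix_chars].encode("utf-8")).hexdigest()
    return replicas[int(key, 16) % len(replicas)]

# pick_replica("You are a helpful voice agent... <user turn>", ["pod-a", "pod-b", "pod-c"])
```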

Adam Becker [00:12:03]: Yeah. Can you go on to the next couple of slides?

Madison Kanna [00:12:08]: Yeah, of course.

Adam Becker [00:12:12]: Yeah. I imagine a lot of people might be starting with just Whisper. They might do this in a demo, kind of just on a single machine. But then at some point, and we've already covered kind of the gradual journey of going from some of the out-of-the-box voice APIs to starting to do this yourself, you're hitting all of these scaling challenges. And I believe that some people might already be there. They might just be going through the struggles of how to handle that scale. Are there good resources to figure this out? Do you just learn this by trying everything, or at what point is there a more consolidated way of learning how to scale these things?

Kaushik Chatterjee [00:13:05]: I think once again that's kind of the benefit of going to an inference provider, like Baseten, Fireworks, Together, I think a common model as well: we can handle these scaling challenges on your behalf. Otherwise, you're setting up on AWS, you need the load balancer, you need EC2, you need SageMaker and, you know, Bedrock and whatnot. I think it becomes a lot tougher to manage, and to manage costs effectively. Oftentimes there's a really easy solution, like, hey, just get a bigger GPU, just get 10 replicas. But obviously cost is a real limiting factor, especially for startups and people just starting out in the space, ramping up, hobbyists, whatnot.

Kaushik Chatterjee [00:13:48]: Right. So I think even then, the best you can do is learn good scaling principles, which generalize really well, because, you know, what we do on the engineering team at Baseten is different from how AWS or Azure or GCP handle it. So I think the best policy is to just learn the components you need and what goes into each of those, like a solid load balancer and how to effectively delegate. And the infrastructure side is one aspect of optimization. You have the model side, which is probably the easiest to get into, because you can use vLLM or SGLang or TensorRT-LLM and you can get model-side optimizations which can make a big difference. I think it's a lot harder to manage infrastructure- and networking-side optimizations just because they require a lot more niche, domain-specific knowledge, which is a lot more difficult to acquire when it's not a product you have control over necessarily.

Adam Becker [00:14:49]: Awesome. What is a good way to keep in touch with you guys, to follow your work, to connect with you if folks have more questions? Actually, there is already one more question. But first, how do we stay in touch?

Kaushik Chatterjee [00:15:04]: I think. Yeah, go ahead Madison.

Madison Kanna [00:15:08]: Yeah, I'm mostly on Twitter, @madisonkanna, but we also have baseten.co and we have an amazing blog on Baseten where we talk about a lot of issues like this. But go ahead, Kaushik.

Kaushik Chatterjee [00:15:20]: Yeah, I can like, I'd be happy to like drop my email or LinkedIn in the comments if that's the best way.

Adam Becker [00:15:26]: Awesome. And good to know about the blog, that's excellent. Ravi, before we let everybody go, Ravi's asking: any consideration for multilingual voice?

Kaushik Chatterjee [00:15:36]: Ooh, multilingual voice. I'm assuming this is on the TTS side. I think Kokoro is probably the best model for that. I think fundamentally, optimization-wise, there's no difference, because this happens at a model level. I actually don't remember the architecture for Kokoro off the top of my head, but if it is a Llama-like architecture, a language-model-based architecture, you can use TensorRT-LLM, you can use vLLM, you can get these optimizations down the list. Ultimately, multilingual is just a product of how the model is trained itself. So any model-level optimizations I've discussed can translate over to that, assuming once again the architecture is something analogous to a transformer-based model.

Adam Becker [00:16:22]: Thank you very much, Madison, Kaushik, thanks for coming.
