Building Conversational AI Agents with Voice
Senior product manager at Deepgram.
I'm a tech entrepreneur and I spent the last decade founding companies that drive societal change.
I am now building Deep Matter, a startup still in stealth mode...
I was most recently building Telepath, the world's most developer-friendly machine learning platform. Throughout my previous projects, I had learned that building machine learning powered applications is hard - especially hard when you don't have a background in data science. I believe that this is choking innovation, especially in industries that can't support large data teams.
For example, I previously co-founded Call Time AI, where we used artificial intelligence to assemble and study the largest database of political contributions. The company powered progressive campaigns from school board to the Presidency. As of October 2020, we had helped Democrats raise tens of millions of dollars. In April of 2021, we sold Call Time to Political Data Inc. Our success, in large part, is due to our ability to productionize machine learning.
I believe that knowledge is unbounded, and that everything that is not forbidden by laws of nature is achievable, given the right knowledge. This holds immense promise for the future of intelligence and therefore for the future of well-being. I believe that the process of mining knowledge should be done honestly and responsibly, and that wielding it should be done with care. I co-founded Telepath to give more tools to more people to access more knowledge.
I'm fascinated by the relationship between technology, science and history. I graduated from UC Berkeley with degrees in Astrophysics and Classics and have published several papers on those topics. I was previously a researcher at the Getty Villa where I wrote about Ancient Greek math and at the Weizmann Institute, where I researched supernovae.
I currently live in New York City. I enjoy advising startups, thinking about how they can make for an excellent vehicle for addressing the Israeli-Palestinian conflict, and hearing from random folks who stumble on my LinkedIn profile. Reach out, friend!
Michelle Chan, product manager for Deepgram's new text-to-speech product Aura, shares tips with the community on how to build conversational AI voice agents with a good user experience.
Building Conversational AI Agents with Voice
AI in Production
Slides: https://docs.google.com/presentation/d/1SILexYPvRH_P41UU3XNhNMO94EL7Q9WsKnFrVtew2Xc/edit?usp=drive_link
Demetrios [00:00:05]: You are awesome. Don't forget that. You're doing great work. Keep it up.
Michelle Chan [00:00:12]: Hi, everyone. I'm Michelle, and I'm a product manager at Deepgram. And today I'll share more about building conversational AI agents with voice. So a little bit more about me. I'm Michelle and I'm a senior product manager at Deepgram, leading our new text-to-speech product, Aura. Deepgram has been a leader in the transcription space in the last few years, and we are launching a new product with text to speech. So these are some of the clients that Deepgram is working with, including Spotify, NASA, Twilio, et cetera. And the reason why we're building text to speech after our transcription product is that we think that with the rise of conversational AI agents in the future, there's a lot we can do with voice.
Michelle Chan [00:01:09]: And when we have both text to speech and speech to text, we can greatly improve the conversational experience as we move forward. So what is a conversational AI voice agent? You can imagine that on the right there is an LLM. You have a customer who dials in, let's say, over the phone, and their voice is transcribed into text. That text is sent to an LLM, which generates a response that is then turned into speech and delivered back to the customer. So some typical use cases will be when you're calling on the phone for appointment booking, customer support inquiries, outbound sales, interviews, et cetera. And oftentimes a conversational AI agent buys you time. For example, there are many use cases where a user calls a company for customer support. There may be a triage phase where the user first talks to an AI agent to collect all the information before it's passed on to a human agent.
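To make the loop Michelle describes concrete, here is a minimal sketch of one conversational turn in Python. The `transcribe`, `generate_reply`, and `synthesize` names are hypothetical placeholders for whatever speech-to-text, LLM, and text-to-speech providers you use; this is not Deepgram-specific code.

```python
# Minimal sketch of the voice-agent loop described above.
# transcribe(), generate_reply(), and synthesize() are hypothetical
# placeholders for your STT, LLM, and TTS providers.

def handle_turn(audio_from_caller: bytes) -> bytes:
    """One turn of the conversation: speech in, speech out."""
    user_text = transcribe(audio_from_caller)   # speech-to-text
    reply_text = generate_reply(user_text)      # LLM produces a response
    reply_audio = synthesize(reply_text)        # text-to-speech
    return reply_audio                          # played back to the caller


def transcribe(audio: bytes) -> str:
    raise NotImplementedError("call your speech-to-text provider here")


def generate_reply(prompt: str) -> str:
    raise NotImplementedError("call your LLM here")


def synthesize(text: str) -> bytes:
    raise NotImplementedError("call your text-to-speech provider here")
```

In a real phone integration each of these stages would stream rather than run sequentially on complete inputs, which is exactly what the latency discussion that follows is about.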
Michelle Chan [00:02:21]: So this can also evolve in the future. But there are a lot of use cases that a conversational voice AI meets. So what makes a good conversational voice AI experience? After talking to many users and diving into the space, I'll share some key tips for building conversational AI agents. The first one is latency, which is very important. There is research out there that evaluates, when a person is talking to, let's say, a robotic agent or another person, what the maximum latency is that the person can tolerate before the conversation starts to feel a little off or a little slow. And the research shows that ideally, in a two-person conversation, a latency of around 300 milliseconds is the benchmark to aim for. This is still a little bit far off once you include an LLM, as many of us have experienced with LLM latency, but this is where humans start to feel a difference and this is what we're aiming at.
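One way to use that 300 millisecond figure is as a budget split across the pipeline stages. The per-stage numbers in this sketch are illustrative assumptions, not measurements from the talk:

```python
# Illustrative latency budget for one conversational turn (milliseconds).
# The per-stage numbers are made-up examples, not benchmarks.
budget_ms = 300

stages_ms = {
    "speech_to_text (endpoint + final transcript)": 100,
    "llm (time to first usable chunk)": 120,
    "text_to_speech (time to first audio byte)": 60,
    "network / audio transport": 40,
}

total = sum(stages_ms.values())
print(f"total: {total} ms vs budget: {budget_ms} ms")
for stage, ms in stages_ms.items():
    print(f"  {stage}: {ms} ms")
if total > budget_ms:
    print("over budget -- streaming and chunking are needed to hide LLM latency")
```

Even with optimistic made-up numbers like these, a naive sequential pipeline overshoots the target, which is why the streaming and chunking tips later in the talk matter.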
Michelle Chan [00:03:48]: So some other tips that I'll share in this presentation, other than latency, are some of the things that we need to figure out throughout the entire journey. The first problem is, let's say the human talks on the phone, and it is then transcribed. But how do we know when the human stops talking? Because that's also a very nuanced question. It can depend on, for example, the tone of the person, the context, and the last sentence that the user is talking about. And so as the builders of conversational AI agents, you'll need to look into things like endpointing, which is what we typically call solving this problem. So Deepgram speech to text also offers endpointing, and you can use that as part of your transcription to know whether the person has stopped talking.
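As a rough illustration of how an endpointing signal gets used, here is a small Python sketch. The event shape mirrors Deepgram's streaming speech-to-text responses as best I recall them, in particular the `speech_final` flag that fires when endpointing decides the utterance is complete; treat the exact field names as assumptions and verify them against Deepgram's documentation.

```python
# Sketch: using an STT provider's endpointing signal to decide when the
# caller has stopped talking. The event shape (notably `speech_final`)
# is assumed from Deepgram's streaming docs -- verify field names.

def handle_transcript_event(event: dict, pending: list) -> "str | None":
    """Accumulate partial transcripts; return the full utterance once
    the endpointer says the speaker has finished."""
    text = event.get("channel", {}).get("alternatives", [{}])[0].get("transcript", "")
    if text:
        pending.append(text)
    if event.get("speech_final"):  # endpointing fired: utterance complete
        utterance = " ".join(pending)
        pending.clear()
        return utterance
    return None


# Example: two streaming events making up one utterance.
events = [
    {"channel": {"alternatives": [{"transcript": "I'd like to book"}]}, "speech_final": False},
    {"channel": {"alternatives": [{"transcript": "an appointment for Friday"}]}, "speech_final": True},
]
pending = []
for e in events:
    utterance = handle_transcript_event(e, pending)
    if utterance:
        print("send to LLM:", utterance)
```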
Michelle Chan [00:04:54]: The third thing that is important in a conversational AI experience is voice quality. In the research world we call this prosody, which includes the naturalness in speech, including rhythm, pitch, intonation, and pauses. And something to note when you're building conversational AI agents is that there are different types of voices out there with text to speech from different vendors. Some may be optimized for, for example, movie narration. Some may be optimized for reading a news article. Some may be optimized for an ad in a video. For Deepgram, we are looking into voice qualities that can make a conversation sound like a conversation. So it sounds as if you're talking to a person about your daily life questions, typical conversations.
Michelle Chan [00:05:56]: And so we care a lot about this aspect of the voice quality as we design our product. When you are thinking through building your own voice AI agent, it may be helpful to figure out what type of voice you're looking for. For example, what are the voice branding guidelines you're looking into? What are some reference voices that you find helpful? What are the characteristics in the tone, the emotion, and the accent? What's the demographic you're targeting? That will be very helpful as you nail down a voice that you would like to use in your conversational AI. So the fourth thing, branching off from naturalness, is fillers. A little bit different from, let's say, a text to speech reading from a news article is that in human conversations, oftentimes, what makes them sound natural is things like the ums (I'm also having a lot of ums right now), breaths, thinking time, and all of that in the appropriate locations. So what we've seen from our users playing around with conversational AI agents is that they would ask the LLM to speak as if it is having a conversation. You can also add pauses.
Michelle Chan [00:07:22]: So for example, three dots, like dot, dot, dot, and you can see some vendors having breaths, having pauses, and all of that. And you can also try figuring out filler words as well. So this is also another aspect of working with speech to text and LLMs. With the requirements that I mentioned in the last few slides, one of the tips with latency is that, as we all know, LLMs are autoregressive, outputting word by word. And so how should we think about chunking these words and sentences together before sending them to text to speech? Some vendors may offer real-time input streaming. You can also figure out yourself, depending on your conversational data, some logic that you'd like to use to chunk them together to optimize for the latency and also the quality of the voice. Because if you're just sending it out per word, there are cases where the output could sound more chopped up as well. It also depends on the model. The second is quality, and so LLM outputs by default don't sound like a conversation.
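Here is a minimal sketch of that chunking idea: buffer the LLM's streamed tokens and flush to text-to-speech at sentence boundaries instead of word by word. The `llm_token_stream` and `send_to_tts` names are hypothetical placeholders, and the boundary heuristic is deliberately simple.

```python
# Sketch: buffer a streaming LLM output and send sentence-sized chunks
# to text-to-speech, instead of sending word by word (choppy) or
# waiting for the full response (slow).

SENTENCE_ENDINGS = (".", "!", "?", "...")

def stream_llm_to_tts(llm_token_stream, send_to_tts, min_chars: int = 20):
    buffer = ""
    for token in llm_token_stream:
        buffer += token
        chunk = buffer.strip()
        # Flush on a sentence boundary, but only once the chunk is long
        # enough to avoid sending tiny fragments to the voice model.
        if chunk.endswith(SENTENCE_ENDINGS) and len(chunk) >= min_chars:
            send_to_tts(chunk)
            buffer = ""
    if buffer.strip():  # flush whatever is left at the end
        send_to_tts(buffer.strip())


# Example with a fake token stream and a print-based "TTS" stub.
tokens = ["Sure", ",", " I", " can", " help", " with", " that", ".",
          " What", " time", " works", " for", " you", "?"]
stream_llm_to_tts(iter(tokens), send_to_tts=lambda chunk: print("TTS <-", chunk))
```

A real implementation would also handle abbreviations, numbers, and vendor-specific pause markers such as the "dot, dot, dot" trick mentioned above.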
Michelle Chan [00:08:42]: And so, as I mentioned, some of the things that I've seen users working on are playing around with the prompt, the output, the pauses, and everything so that it sounds more natural when you're using text to speech. So lastly, I'm just going to mention that we are launching a new text-to-speech product called Aura. You can check out the Aura preview announcement; I can share the link with the team to share it out. We're targeting low latency, under 250 milliseconds, conversational voice, and a comparable cost as well. So if you're interested in a preview, feel free to email me. But yeah, thank you.
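For completeness, this is roughly what a request to a hosted text-to-speech endpoint like Aura can look like. The URL, model name, and payload shape below are assumptions based on Deepgram's public documentation rather than anything stated in the talk, so verify them before use.

```python
# Sketch: requesting speech from a hosted TTS endpoint such as Aura.
# The URL, model name, and payload shape are assumptions -- check
# Deepgram's docs for the current API before relying on this.
import os
import requests  # pip install requests

def synthesize(text: str) -> bytes:
    response = requests.post(
        "https://api.deepgram.com/v1/speak",
        params={"model": "aura-asteria-en"},  # assumed voice/model name
        headers={"Authorization": f"Token {os.environ['DEEPGRAM_API_KEY']}"},
        json={"text": text},
        timeout=30,
    )
    response.raise_for_status()
    return response.content  # raw audio bytes

audio = synthesize("Thanks for calling! How can I help you today?")
with open("reply.mp3", "wb") as f:  # audio format depends on request options
    f.write(audio)
```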
Adam Becker [00:09:27]: Nice. Michelle, I hadn't quite thought about just how complex, how difficult a task this is to do, and to do it well. It sounds like there's just a vast jungle of complexity there. So thank you for walking us through at least some of it. There's a couple of questions here from the audience. I'll just go through them very quickly. To know when a human stops talking, can't you just treat it as a probabilistic task for an LLM? Isn't that sufficient to figure that out? What makes that task difficult?
Michelle Chan [00:10:03]: So I think it has to do with, let's say, a user is talking, and it also isn't just an LLM task. It depends on the user's tone. So let's say you're detecting whether a human is saying um and then thinking, or it sounds a little bit like they want to continue. It's not only a text problem. Yeah.
Adam Becker [00:10:32]: One more thing here. How about slang? Is there a way to incorporate slang into this? Do you have different... How do you teach the LLM or the agent here to speak in a more colloquial language? On one hand, you already said that the text that, let's say, traditional LLMs are producing isn't like the spoken word, right? So you have to chunk it. You have to ask it to be much more specific in how it structures its output. Are there other things you can do to just change the actual words that are used?
Michelle Chan [00:11:10]: Yes, you can change the words that are used, but I think the technology is still developing on, okay, based on an LLM output, how can you actually make this conversational? There are prompts that you can use, but for slang, it only sounds real when the accent goes together with the slang. And so there's a lot from the dataset side of things that you can work on as well.
Adam Becker [00:11:39]: Awesome, Michelle, thank you very much. Drop us the link, stay in the chat, please, and keep up the good work. Thanks for coming.