SlimFast: Unleashing Speech Intelligence through Domain-Specific Language Models (DSLMs)
Andrew is a computational scientist and theoretician turned Deep Learning researcher. He earned his PhD in Mechanical Engineering from MIT where he developed scalable algorithms to simulate fragmentation on giant supercomputers. After graduating, he worked for an energy company designing shaped charges and other explosive devices using AI. He now leads the Research organization at Deepgram building products at the frontiers of speech intelligence and generative AI.
Despite their advanced capabilities, Large Language Models (LLMs) are often too slow and resource-intensive for use at scale in voice applications, particularly for large-scale audio or low-latency real-time processing. SlimFast addresses this challenge by introducing Domain-Specific Language Models (DSLMs) that are distilled from LLMs on specific data domains and tasks. SlimFast provides a practical solution for real-world applications, offering blazingly fast and resource-conscious models while maintaining high performance on speech intelligence tasks. We demo a new ASR-DSLM pipeline that we recently built, which performs summarization on call center audio.
Link to slides
We have only 10 minutes here, so I'm going to try to hit the high points. This talk is entitled "SlimFast: Unleashing Speech Intelligence through Domain-Specific Language Models." My name is Andrew, VP of Research at Deepgram.

I would be remiss if I didn't first tell you a little bit about Deepgram. We're a mid-stage speech-to-text startup, founded in 2015. We're a Series B company with $85 million raised so far. We have what we believe to be the most accurate and fastest speech-to-text API on the market. So far in Deepgram's existence, we've processed over one trillion minutes of audio, and we have a lot of customers who we have delighted. So I'm going to tell you about some things that we're working on and that we're interested in today.

Here are some of the fundamental beliefs that motivate us. The first is that language is the universal interface to AI. We believe that language is the primary way we interact, that it carries the most information, and that it will be the universal interface that unlocks the full potential of AI. That is starting to happen now, and businesses are going to be able to realize the power of language AI. But we think businesses need adapted AI to make it useful.

So I'm going to start off by making some predictions about language AI. Over the next two years, many businesses will start to derive tremendous value from language AI products. And in the short term, the most impactful products people build are going to combine existing technologies in a multimodal pipeline. So what does that mean? Well, this pipeline is going to have three stages.
There's going to be a perception layer, which is fundamentally ASR: something that takes audio and turns it into text. That's a speech recognition system, probably complemented by a diarization system that predicts from the audio who is speaking when, which lets you format the transcript nicely and separate out the speaker turns. There's going to be an understanding layer that takes that transcript and applies a language model, probably a large language model, or perhaps a medium-sized or small one, that understands what's said in the transcript and produces some kind of distilled, useful output, like a summary of the transcript or a detection of the topics or the sentiment. Finally, you'll have an interaction layer, which takes, say, the output of an LLM generating a response to the audio input and turns it back into audio using, say, text-to-speech. So we have this language AI model pipeline.
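To make the shape of that pipeline concrete, here is a minimal sketch in Python. The layer internals are stubbed out with canned strings; in a real system each stub would call an ASR service, a language model, and a TTS engine, none of which are named in the talk.

```python
# A minimal, illustrative sketch of the three-layer language AI pipeline.
# Each layer is a stub; real implementations would call external services.

def perception_layer(audio: bytes) -> str:
    # Stub: a real implementation would run ASR plus diarization here and
    # return a speaker-labeled transcript.
    return ("Speaker 0: Thanks for calling, how can I help?\n"
            "Speaker 1: Yeah, hi, my internet is down.")

def understanding_layer(transcript: str) -> str:
    # Stub: a real implementation would apply a (domain-specific) language
    # model to summarize the call, detect topics, or classify sentiment.
    return "Customer reported an internet outage; agent began troubleshooting."

def interaction_layer(text: str) -> bytes:
    # Stub: a real implementation would synthesize speech with a TTS engine.
    return text.encode("utf-8")

def language_ai_pipeline(audio: bytes) -> bytes:
    transcript = perception_layer(audio)
    distilled = understanding_layer(transcript)
    return interaction_layer(distilled)

print(language_ai_pipeline(b"<audio bytes>"))
```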
We think businesses are going to derive maximum benefit from language AI products that are cost-effective, reliable, and accurate; you need all three together. The purpose of this talk is to argue that this will require small, fast, domain-specific language models as opposed to large foundational language models. That's my thesis.

We're going to argue this point in the context of a specific application: call centers. So what is a call center? Well, it's a centralized office or facility used by companies to handle very large volumes of incoming and outgoing telephone calls. Call centers are staffed with agents who are trained specifically to handle the customer inquiries and complaints that come in, and to provide technical support or perform sales-related tasks. Their training is specific to the business that the call center supports.

There are a number of AI products that are going to be built for the call center; some are already being built and are in their initial stages. These products will help both the customer experience and the employee experience, like that of the agent. A couple of important ones would be voice bots and real-time agent assist systems. Other important ones would be audio intelligence features: basically taking the output of many phone calls and summarizing them, predicting their topics, or predicting the sentiment of the customer in those calls.

So you might ask: why not use a prompted foundational large language model to build language AI products for a call center? The first reason I would argue against this is the scale of large language models. They're ridiculous: 100-billion-plus parameters. If you measure it in terms of how many resources it would take to serve such a model on GPUs, it would require dozens of them. They're also very slow, inferring on the order of 100 milliseconds per token, so generating a long response might take many seconds. Based on the scale and the inefficiency alone, you could argue against it.
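A quick back-of-the-envelope check of that latency claim, taking the roughly 100 ms per token cited above and assuming a 200-token response (an assumed length for something like a call summary, not a figure from the talk):

```python
# Rough latency estimate for autoregressive generation at ~100 ms/token.
ms_per_token = 100      # figure cited in the talk
response_tokens = 200   # assumed response length

latency_seconds = ms_per_token * response_tokens / 1000
print(f"~{latency_seconds:.0f} s to generate a {response_tokens}-token response")
# -> ~20 s: far too slow for real-time agent assist or high-volume batch work
```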
Looking a little more closely at large language models and what they can do: they have broad general knowledge, and they can be aligned to do many tasks without explicit training. They do many, many things. However, conversational text generated by a call center has a high degree of specificity. It covers narrowly distributed topics, the people speaking have unique speech patterns associated with where they are in the world, and there's a long tail of rare words that a foundational language model has probably never seen before. To get an idea of the task specificity involved, we can look at, say, the top 15 tasks a call center agent performs. They're highly specific and yet complex tasks that require language understanding.

Another point we can make is that domain-specific conversational text is generally out of distribution for foundational LLMs. We can get an idea of this by asking ChatGPT to continue a call center conversation. If we give it a brief prompt, with an intro by an agent and a customer starting to speak, and look at the transcript that gets generated, we see that it's very unrealistic. The speech is way too clean, as if it were written, and the output tends to follow a predictable script: a greeting, the customer describes an issue, the agent takes an action, the customer accepts the action, and the call ends. Very unrealistic.
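For readers who want to try the same probe, here is one way to reproduce it programmatically. The talk used ChatGPT directly, so the model name and prompt below are illustrative assumptions, not the exact setup from the experiment.

```python
# Reproducing the continuation probe via the OpenAI API.
# Requires OPENAI_API_KEY in the environment; model and prompt are assumptions.
from openai import OpenAI

client = OpenAI()

prompt = (
    "Continue this call center conversation:\n"
    "Agent: Thank you for calling Acme Support, this is Dana. How can I help?\n"
    "Customer: Yeah hi, um, I'm calling because my last bill looks wrong..."
)

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
# The continuation tends to be suspiciously clean: no cross-talk, no
# disfluencies, and a tidy greeting -> issue -> fix -> thanks script.
```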
If we look at real examples, we see that they feature things like cross-talk, people trying to talk over each other, which makes transcripts very difficult to read and interpret. Real transcripts also have disfluencies, where people stutter and stumble while they're speaking and use filler words, which likewise makes transcripts hard to read.

So we argue that we should shoot for domain-adapted language models for call centers, and that these models can be small. To prove this point, we took a small language model of 500 million parameters that was trained on general internet data, and then we transfer-learned it on an in-domain dataset of call center transcripts. We found that the model improved dramatically on all metrics: next-token prediction loss and perplexity. And when we use it, we find that we can continue a prompted call center conversation and it generates very realistic text.

We've then gone further and used that model to train a summarization model, which I'll demo for you now in this Jupyter notebook. We're going to make three calls here. First, we call a function that hits the Deepgram API, transcribes the phone call audio, diarizes the transcript, and then sends the transcript to our SLM to perform summarization. You see that the call took 8.4 seconds. Next, we print the transcript we got back from ASR and diarization, and we see that it's highly accurate. Finally, we print the summary we got back, and we see a very nice, human-readable summary that is actually quite accurate in describing what happened in this call. (A sketch of what such a pipeline call might look like appears below, after the transcript.)

And that's what I have for you folks today.

Wonderful. Thank you so much for taking the time to chat with us. It was great. OK. Thank you. Yeah, of course.
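For reference, here is a hedged reconstruction of what the notebook's pipeline call might look like: transcription plus diarization via the public Deepgram REST API, followed by summarization. The summarizer endpoint below is a hypothetical placeholder; the demo used an internal Deepgram SLM whose API isn't documented in the talk.

```python
# Sketch of the demo pipeline: Deepgram ASR + diarization, then SLM summary.
import requests

DEEPGRAM_KEY = "YOUR_DEEPGRAM_API_KEY"

def transcribe_and_diarize(path: str) -> dict:
    # Deepgram pre-recorded transcription with speaker diarization enabled.
    with open(path, "rb") as f:
        resp = requests.post(
            "https://api.deepgram.com/v1/listen",
            params={"diarize": "true", "punctuate": "true"},
            headers={"Authorization": f"Token {DEEPGRAM_KEY}",
                     "Content-Type": "audio/wav"},
            data=f,
        )
    resp.raise_for_status()
    return resp.json()

def summarize(transcript: str) -> str:
    # Hypothetical call to a hosted domain-specific summarization model;
    # the URL is a placeholder, not a real Deepgram endpoint.
    resp = requests.post("https://example.com/slm/summarize",
                         json={"transcript": transcript})
    resp.raise_for_status()
    return resp.json()["summary"]

result = transcribe_and_diarize("call.wav")
transcript = result["results"]["channels"][0]["alternatives"][0]["transcript"]
print(transcript)
print(summarize(transcript))
```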