MLOps Community

Why Specialized NLP Models Might be the Secret to Easier LLM Deployment

Posted Apr 27, 2023 | Views 1.6K
# LLM
# LLM in Production
# LLM Deployments
# TitanML
SPEAKER
Meryem Arik
Co-founder/CEO @ TitanML

Meryem is a former physicist turned tech entrepreneur. She is the co-founder and CEO of TitanML, which solves the core infrastructure problems of building and deploying generative AI so that customers can build better enterprise applications and deploy them in their own secure environments. Outside the world of AI, Meryem lends her energy and expertise to supporting diverse voices in the tech scene and mentoring female and minority-group founders.

SUMMARY

One of the biggest challenges of getting LLMs into production is their sheer size and computational complexity. This talk explores how smaller, specialised models can be used in most cases to produce equally good results while being significantly cheaper and easier to deploy.

TRANSCRIPT

Link to slides

Hello everyone, thank you so much for coming. I'm Meryem, one of the co-founders of TitanML. What we're building is a specialization platform for NLP models, and today I want to talk about how we can make NLP deployment much, much easier using specialization and compression methods.

Earlier, Diego gave us a really great explanation of what LLMs are. They're essentially really, really big neural networks that know a lot of things, and we can customize them for the things we actually care about. These models have been driving NLP development over the last five years or so: it has all been down to these fantastic foundation models like BERT, T5, and GPT.

The benefits of these when building NLP applications are really clear. You have a very low data requirement, because they're pretrained on almost every piece of data on the internet; state-of-the-art accuracy and performance come from these foundation models; and they're really easy to build into version-one applications, as Diego talked about earlier.

However, these foundation models and LLMs come at a cost: they're really, really big, which makes them very difficult to deploy. You get slow inference, and you need to run them on very expensive hardware, so you get very expensive cloud costs. They're expensive to deploy and they're slow, and all of this is really because of the "L": these are really large models.

The way these models work, and the reason they have such good natural language understanding, is that they've seen so much information, and as a side effect they have huge capabilities. For example, ChatGPT can do tasks as broad as writing plays and brainstorming strategy. However, as a business building an application, you usually only need it to do something very, very narrow.
But when you have it do this very narrow thing, the way you currently deploy it means you also pay for it being able to do all those other things, like writing love letters or reviewing contracts, when you only need it to categorize resumes.

So what we do with specialization, an idea I don't think is particularly well understood, is take large language models like your BERTs or your GPTs and keep only the parts of the model that are relevant for your task. If this blue blob is my whole foundation model, only a really, really small part of it is actually relevant to the task. That is how you're able to build a much smaller model which is equally accurate and much, much easier to deploy.

These models are just far better from an MLOps point of view: deployment is much cheaper, and it's much easier to get really fast inference from them. But you also get very good quality models. A lot of the state-of-the-art results on specific tasks are held by these specialized models, not the really big models we see coming out of places like OpenAI.
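(The TitanML specialization pipeline itself isn't shown in the talk. One standard, publicly documented technique for transferring a large model's behaviour into a much smaller task-specific one is knowledge distillation. The sketch below, in plain NumPy with toy logits standing in for real model outputs, shows only the core loss term from Hinton et al.; it is an illustration of the general idea, not of TitanML's method.)

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-scaled softmax; higher T softens the distribution."""
    z = np.asarray(z, dtype=float) / T
    z = z - z.max(axis=-1, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """KL(teacher || student) on temperature-softened outputs.

    Training a small "student" to match the large "teacher's" full output
    distribution (not just the hard labels) is one way a compact model can
    inherit the big model's behaviour on a narrow task.
    """
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    return float(T * T * np.sum(p_t * (np.log(p_t) - np.log(p_s))))

# Toy 3-class logits for a single input (hypothetical values).
teacher = np.array([4.0, 1.0, -2.0])   # confident large model
student = np.array([3.5, 1.2, -1.8])   # small model, roughly agreeing
loss = distillation_loss(student, teacher)
```

A real distillation run would minimize this loss, usually mixed with the ordinary cross-entropy on labels, over a fine-tuning dataset.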
Here's a quick illustration of what this process might look like and how it works. Here we have a graph of the accuracy versus size trade-off, and this trade-off is always going to exist. Currently, your options are what's available open source. You might have your original model, which might be very big, maybe a one-billion-parameter BERT, and then you'll have open-source checkpoints along the way; the one we've pointed out here is maybe a BERT-base or a DistilBERT. What we're able to do with specialization is take the original model, specialize it, and make it much, much smaller, giving these specialized models, as we call them, along the top. You then get this Pareto front of models which are much smaller but don't show the huge accuracy drop that we see when moving from the large open-source models to the more resource-efficient open-source ones. So you're able to get a better accuracy, latency, and size trade-off than you could previously.

Since we don't have very much time, I'll whiz through and talk very quickly about the kinds of results you can see with specialization. On the left you can see the graph of latency and model size as we move from the large variants all the way to the Titan variants, which are much, much smaller, on the order of 100x smaller. We can then compare that with the accuracy of these models, and they beat fine-tuned BERT-base and DistilBERT on most of the natural language understanding tasks. So you're able to get models which are between 10x and 100x smaller while sometimes actually improving accuracy on benchmarks, which is really impressive. And obviously, because they're smaller, they're much, much easier to deploy.

Now, this process of specialization is very difficult. The way it's done currently is that individual ML engineers run one-off projects to specialize their models. They might use a combination of pruning and quantization, or graph compilation and neural architecture search. The issue is that this is a very expensive and long experimentation process which quite often fails.
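(To make one of those techniques concrete: below is a minimal, illustrative NumPy sketch of unstructured magnitude pruning, which zeroes a layer's smallest-magnitude weights. Real pipelines combine this with retraining, structured sparsity, or the other techniques named above; this is only the core idea, not TitanML's implementation.)

```python
import numpy as np

def magnitude_prune(weights, sparsity=0.9):
    """Zero the smallest-magnitude entries, keeping the largest (1 - sparsity) fraction."""
    flat = np.abs(weights).ravel()
    k = int(flat.size * sparsity)                   # number of weights to remove
    if k == 0:
        return weights.copy()
    threshold = np.partition(flat, k - 1)[k - 1]    # k-th smallest magnitude
    return np.where(np.abs(weights) > threshold, weights, 0.0)

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 64))               # stand-in for one layer's weight matrix
W_pruned = magnitude_prune(W, sparsity=0.9)
achieved = float((W_pruned == 0).mean())    # fraction of zeroed weights, roughly 0.9
```

Note that sparse weights only translate into real speedups on hardware or kernels that exploit sparsity, which is part of why naive one-off compression attempts often disappoint.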
What the TitanML platform does is wrap all of these techniques up into defined pipelines: you put in your model and your fine-tuning dataset, and the platform automatically specializes the model for you, creating one that is 10 to 100 times faster and cheaper and therefore much, much easier to deploy.

I'll finish off with a very quick case study of what we did with an early client of ours. What they were building was document classification, and the original model was an ELECTRA-type model. The problem they were struggling with was that they couldn't get the latency they required on sensible hardware. The only way they could hit the latency target was by using two A100s, which for inference is pretty silly. They'd already tried standard things like ONNX Runtime and quantization. What we did was take that same original model, put it through our specialization and compression pipeline, and end up with a model that still had very good accuracy but could be deployed on a single T4 with the same latency. That's a bit of a game changer when it comes to getting these models into production and running at sensible latency and sensible cost.

So the TL;DR is: large language models are very, very big, but for most use cases they really don't need to be, because you're dealing with one specific use case. Turning those large language models into much smaller specialized models means you have a much easier time getting to deployment and meeting the latency and memory constraints you might have as a business.

Thank you so much to the MLOps folks for organizing this. You can reach me by email, which is down there, or on LinkedIn. And let me know if you have any questions in the chat; I'll be happy to stick around and answer them later.
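(A footnote on the quantization the client had already tried: post-training weight quantization stores parameters as 8-bit integers plus a scale, cutting memory 4x versus float32. The NumPy sketch below shows only symmetric per-tensor weight quantization; real toolchains such as ONNX Runtime also quantize activations and fuse kernels, so this illustrates the idea, not any specific product.)

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor int8 quantization: int8 values plus one float scale."""
    scale = float(np.abs(w).max()) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(1)
W = rng.normal(scale=0.1, size=(256, 256)).astype(np.float32)  # toy weight matrix
q, s = quantize_int8(W)
W_hat = dequantize(q, s)

memory_ratio = W.nbytes / q.nbytes          # 4.0: float32 -> int8
max_err = float(np.abs(W - W_hat).max())    # rounding error, bounded by scale / 2
```

The reconstruction error per weight is at most half the scale, which is why accuracy often survives quantization; when it doesn't, that is when pipelines fall back to the heavier techniques discussed in the talk.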

