Challenges in Providing LLMs as a Service
Hemant works on Machine Learning Inference at Cohere. Prior to this, he spent 3 years at NVIDIA developing Triton Inference Server, an open-source solution used to deploy machine learning models into production. He has a Master's in Data Science from the University of Washington.
This lightning talk explores the challenges encountered in offering Large Language Models as a Service. As LLMs become larger and more capable, certain challenges arise that must be addressed to ensure the efficient and reliable delivery of LLMs as a Service. This talk delves into key challenges such as scalability, model optimization, cost-effectiveness, and data privacy.
Hello, so... oh, there he is. Okay, sorry I missed you for a second. How are you doing? Alright, good. Thanks for having me. Of course, thank you so much for your patience. We're a little behind schedule, but that's what happens when you pack a bunch of amazing talks into one day.
Yeah, all of them have been pretty great so far. All right, so here are your slides, take it away. All right, just confirming you can still see it.
All right, great. Hi, I work at Cohere. I'm a software engineer focusing on ML inference. I'm going to talk to you about some challenges we've faced while productionizing and providing language models as services. To give you some context, Cohere has APIs that allow people to integrate NLP into their solutions.
These APIs come in different flavors, but at the core of it, they're all powered by some language model or another. To dive directly into the challenges and some possible solutions, I'm going to break them down into three categories: limitations inherent to language models, or generative models in general; steps for model optimization; and lastly, something really important, the data privacy and responsibility side of offering these language models as a service. In terms of limitations of language models, the first thing I'd like to call out is the model footprint. And here you can think of not just the resource footprint, but the overall time and investment that goes into the actual model.
Not only are these models really large in terms of their memory and compute requirements, they also hit the point where a single AI accelerator isn't always enough to serve them. So you have to split them across multiple accelerators, and now you have to deal with an entire new realm of problems.
I try to think of this as a communication versus computation trade-off, and you need to find ways to cope with it.
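To make that trade-off concrete, here is a rough NumPy sketch, illustrative only and not Cohere's serving stack, of splitting one linear layer column-wise across two hypothetical devices; the gather at the end is where the cross-device communication cost comes from.

```python
import numpy as np

# Hypothetical illustration: column-parallel split of one linear layer across
# two "devices". In a real serving stack each shard would live on its own
# accelerator, and the concatenate step would be a cross-device all-gather,
# the communication cost traded against smaller per-device compute.

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 1024))        # a batch of hidden states
W = rng.standard_normal((1024, 1024))     # full weight matrix of one layer

# Shard the weight matrix column-wise across two devices.
W_dev0, W_dev1 = np.split(W, 2, axis=1)

# Each device computes its half of the output independently...
y_dev0 = x @ W_dev0
y_dev1 = x @ W_dev1

# ...then the halves must be gathered back together (the communication step).
y_sharded = np.concatenate([y_dev0, y_dev1], axis=1)

assert np.allclose(y_sharded, x @ W)      # same result as the unsplit layer
```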
The modeling teams that actually train these models and create their architectures need to take the inference characteristics of these models into account. For example, what kind of tokenizer are you using? What kind of attention mechanism are you using, and so on. Then use that to inform decisions down the line as the models are optimized for inference. It's not something that can be an afterthought anymore. The other really interesting thing to call out is that the popular accelerators are in short supply, so you should plan ahead and diversify.
If you only make sure your models run well on one very specific kind of hardware, it's very likely you'll run into problems down the road, so you want to invest in that earlier rather than later. The second part, which becomes really important when customers are using your models to actually solve real problems, is that many enterprises and customers want to fine-tune these models for specific use cases.
And not only is fine-tuning the entire model slow, it's also not always needed, and it's sometimes wasteful. You should only fine-tune when necessary, and you should also choose the right fine-tuning strategy: one that is efficient not just to fine-tune, but also to serve.
I'll come back to that in a bit. What that means is you can't always fine-tune the entire model. You want to fine-tune subsections of the model, and as a bonus, you can focus on fine-tuning with an adapter-network kind of mechanism, where you don't fine-tune the underlying weights of the model but instead train additional weights that you can then use to serve these models. Now, that does come with a qualitative trade-off, but it's significantly faster to fine-tune. It's a balance between the best-quality fine-tune, which is expensive to maintain, and a more scalable form that you can actually serve.
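As an illustration of that idea, here is a generic low-rank adapter sketch, not Cohere's fine-tuning product: the base weights stay frozen, and only two small matrices are trained and shipped.

```python
import numpy as np

# Minimal sketch of an adapter-style (low-rank) fine-tune on a single linear
# layer. The base weight W stays frozen (and can stay owned by the model
# provider); only the small A and B matrices are trained and served alongside it.

rng = np.random.default_rng(0)
d_model, rank = 1024, 8

W = rng.standard_normal((d_model, d_model))      # frozen base weights
A = rng.standard_normal((d_model, rank)) * 0.01  # trainable down-projection
B = np.zeros((rank, d_model))                    # trainable up-projection (starts at zero)

def forward(x: np.ndarray) -> np.ndarray:
    # Base path plus low-rank adapter path; with B = 0 this is exactly the base
    # model, so the tuned and un-tuned models differ only by the tiny (A, B) pair.
    return x @ W + (x @ A) @ B

x = rng.standard_normal((2, d_model))
assert np.allclose(forward(x), x @ W)   # before training, output matches the base model
```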
Apart from those mechanics, the other complexity with fine-tuning is that data privacy is generally a problem. Customers don't always want to share their datasets; you might not want some third-party company to have access to a very proprietary dataset, especially if they're going to use it to create models that might compete with you.
The other part is the actual ownership of that fine-tune. A lot of folks don't want the specific set of weights produced by the fine-tuning to be owned by the company that performed it; they want to own it themselves. Adapter networks give the companies building the service a trade-off where the model is effectively partitioned into two parts: you own your fine-tuned part of the model, while the other company owns the baseline weights. These are just some of the trade-offs you have to make. The next part is cost. The cost of language models, as you already know, is prohibitively expensive not only for training but also for inference.
You generally do not want to be training from scratch; you want to find a good baseline model to start from. It could be something from a third-party company, or it could be an open-source model that's already been pre-trained and exposed publicly. You also want an inference framework that supports model parallelism.
And I stress this because when you're dealing with really large models, you're not going to be able to fit them on a single AI accelerator. Having a framework that efficiently splits the model across multiple AI accelerators and takes advantage of that parallelism in the right way can be a make-or-break moment.
Another really important part I want to talk a little more about is that the inference stage requires some kind of intelligent batching, where you dynamically batch requests from multiple users, or even multiple requests from the same user, into a more efficient form so that your inference throughput stays high, specifically for generative models. Because of the dimensionality variance between requests, a smarter batching policy can make all the difference.
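Here is a toy scheduler, illustrative only and not Cohere's serving code, that groups queued requests by similar prompt length up to a maximum batch size so padding waste stays low.

```python
from dataclasses import dataclass

# Toy dynamic-batching policy: group queued generation requests by similar
# prompt length so padding waste stays low, and cap each batch at what the
# accelerator can handle.

@dataclass
class Request:
    request_id: str
    prompt_tokens: int

def build_batches(queue: list[Request], max_batch_size: int = 8,
                  max_length_spread: int = 128) -> list[list[Request]]:
    batches: list[list[Request]] = []
    current: list[Request] = []
    for req in sorted(queue, key=lambda r: r.prompt_tokens):
        if current and (len(current) >= max_batch_size
                        or req.prompt_tokens - current[0].prompt_tokens > max_length_spread):
            batches.append(current)
            current = []
        current.append(req)
    if current:
        batches.append(current)
    return batches

queue = [Request("a", 40), Request("b", 1900), Request("c", 64), Request("d", 1800)]
print(build_batches(queue))  # short prompts batch together, long prompts batch together
```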
These models themselves are really large, in the tens or hundreds of gigabytes, and because of that, serving them usually requires some kind of quantization. You can quantize to the level you're comfortable with in terms of the accuracy-to-performance trade-off, but that's a decision you need to make based on your use case; it's not one-size-fits-all. In fact, some training strategies have been found to be quantization-friendly, especially at scale. There's an interesting paper from folks at Cohere For AI on the intriguing properties of quantization at scale that I recommend you have a read of.
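As a minimal sketch of what that knob looks like, here is symmetric int8 weight quantization with a per-row scale; real deployments would use optimized kernels and libraries, this only illustrates the accuracy-versus-footprint trade.

```python
import numpy as np

# Sketch of symmetric int8 weight quantization with a per-row (per-output-channel)
# scale, one common way to shrink very large checkpoints for serving.

def quantize_int8(W: np.ndarray):
    scale = np.abs(W).max(axis=1, keepdims=True) / 127.0
    W_q = np.clip(np.round(W / scale), -127, 127).astype(np.int8)
    return W_q, scale

def dequantize(W_q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return W_q.astype(np.float32) * scale

rng = np.random.default_rng(0)
W = rng.standard_normal((1024, 1024)).astype(np.float32)
W_q, scale = quantize_int8(W)

error = np.abs(dequantize(W_q, scale) - W).mean()
print(f"stored at ~1/4 the size, mean abs error {error:.4f}")  # accuracy vs footprint knob
```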
One more thing that's important about cost is that you can't keep vertically scaling to deal with large amounts of traffic. So you want to start investing in the ability to horizontally scale, and in all the systems that go around it. It isn't just a one-point solution; you now have an entire orchestration layer that you need to be concerned about.
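As a toy example of the kind of decision that orchestration layer makes (illustrative only; in practice this lives in an autoscaler plus routing, health checks, and model-weight distribution), a replica count can be derived from current load and per-replica capacity.

```python
import math

# Toy horizontal-scaling decision: choose a replica count from in-flight load
# and per-replica capacity, bounded by a minimum and maximum fleet size.

def desired_replicas(inflight_requests: int, per_replica_capacity: int,
                     min_replicas: int = 2, max_replicas: int = 64) -> int:
    needed = math.ceil(inflight_requests / max(per_replica_capacity, 1))
    return min(max(needed, min_replicas), max_replicas)

print(desired_replicas(inflight_requests=900, per_replica_capacity=32))  # -> 29
```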
Model optimization itself is another challenge with large language models, and I'd stress it in a different way: it's not just the model you're optimizing. It's a very multidimensional problem, in the sense that you have to optimize for multiple things at once: the latency, the throughput, the cost, and the quality,
but also the availability, especially if you're offering this as a service. And I like to think of quality as not just the measured accuracy of your model in an offline scenario, but the actual utility or usability of these models as assessed by the customer for their use case. As I said, no one size fits all, so you should assess what your priority is and give people the knobs they need to get the right qualitative or quantitative performance out of these models.
You could batch better and have higher throughput, but you'd be sacrificing latency. You could try to find ways to scale better and configure an architecture that allows you to do that, but it might come at the cost of needing more and more services around it to orchestrate everything.
You could cut cost and build a very lean solution, but would that really be reliable? Would it really scale to large amounts of traffic and be tolerant to different kinds of failure? I want to spend some more time on the next part, just to drive it home. Data privacy and responsibility is one of the most interesting and most profound challenges when it comes to these models.
We've seen the open-source community make strides in improving model optimization. We've seen a lot of companies share architectures and designs that make the limitations I described earlier easier to deal with. But one of the things that doesn't get enough attention, and one of the things that makes it specifically hard to offer these language models as a service, is the ability to have privacy as part of your design.
You should give users the ability to opt out of their data being used to retrain other models, specifically competing models. One way that Cohere does this is by allowing customers to deploy our models in a self-managed container through SageMaker, which ensures data privacy. And this is by design.
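As a hypothetical illustration of what privacy-as-a-design-choice can look like at the data layer (the field names here are made up for the example), records whose owners have opted out simply never enter the training pipeline.

```python
from dataclasses import dataclass

# Hypothetical illustration: records whose owners have opted out are filtered
# out at ingestion time, before anything is stored or logged for training.

@dataclass
class Record:
    customer_id: str
    text: str
    allow_training_use: bool   # set from the customer's data-usage preference

def training_corpus(records: list[Record]) -> list[str]:
    return [r.text for r in records if r.allow_training_use]

records = [Record("acme", "proprietary support tickets", False),
           Record("beta", "public docs", True)]
print(training_corpus(records))   # only opted-in data survives
```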
So there are steps and considerations you need to take to make the deployment of your models secure for your customers. Beyond that, there's also the concept of data security. You want to make sure you don't leak data. This isn't just the data coming in from your customers or the data that goes into your model for training;
it's also the underlying data that goes into your overall evaluation of the model. You want to make sure you have very clean evaluation systems in place, and that you evaluate on a range of benchmarks, not just one specific kind of benchmark, so you don't bias your model toward being good at only one thing. Vet the data that goes into your model,
and know really well what goes in, because it's garbage in, garbage out. One of the last parts is that there's a tremendous amount of responsibility when it comes to not just building but exposing these systems. Not only do you have to make sure your systems are reliable and scalable and can handle user traffic with tight latency SLOs and
high availability, you also want to make sure they are tolerant to misuse and able to handle attacks such as prompt injection. One way to do this is to consider red teaming: stress-test your model and find ways for third parties to come and try to attack it, to see whether it can be misused for, well, nefarious purposes.
That's all I have for you today. I'm leaving a link here so you can go join our Discord community and learn more about Cohere, our language models, and how you can use them to build your own. Thank you. Awesome, thank you so much. And Nathaniel in the chat says, really informative talk, so thanks. Great to hear, Nathaniel. Nathaniel, message me and we'll hook you up with some swag.
So yeah. All right, thank you for joining. Thank you. Alright, bye.