Considerations and Optimizations for Deploying Open Source LLMs at Your Company
- Co-founder of Mystic AI
- Y Combinator W21
- University of Bath
Oscar talks about best practices to ensure security, reliability, scalability and speed of LLM deployments.
Our next speaker on this lightning round is Oscar from Mystic AI, which is actually a Y Combinator company. I think they're based out of the UK, in Bath. And he is going to talk to us about something that I'm personally really curious to hear more about: open source LLMs. Hello, Oscar.
How's it going? Hi, Lily. Thanks for having me. Very good. Yeah, of course. All right, here are your slides in all of their glory. I will let you take it away. Awesome, thank you, Lily. So we're going to talk about some of the considerations when it comes to deploying these open source LLMs. I'm sure many of you are familiar with this scenario: the management team goes to the data scientists or ML engineers and says, all these models are amazing.
Let's see how we can make them valuable for our product. But how do you actually get there? Really, the question everyone is asking is: how do I go from this big LLM to a fast, secure, scalable API endpoint? Something similar to what OpenAI does with their own proprietary models.
How can they get that same experience, but with open source models, or with an LLM they've trained themselves? At the end of the day, it becomes a very similar software engineering problem to the other ML models we've been deploying; it just potentially requires a lot more memory to actually deploy the model.
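To make the starting point concrete, here's a minimal sketch of that naive endpoint: one open-source model behind a plain HTTP API. The model name and route are illustrative, and it assumes a FastAPI plus Hugging Face transformers stack rather than anything Mystic-specific; everything discussed next is about what this sketch leaves out, such as environments, GPU sizing, scaling, caching, streaming, and cost.

```python
# Minimal single-GPU endpoint: one model, one process, no batching or scaling.
import torch
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "tiiuae/falcon-7b-instruct"  # illustrative open-source LLM

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.float16, device_map="auto"
)

app = FastAPI()

class Prompt(BaseModel):
    text: str
    max_new_tokens: int = 128

@app.post("/generate")
def generate(prompt: Prompt):
    # Tokenize, generate, and return the whole completion in one response.
    inputs = tokenizer(prompt.text, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=prompt.max_new_tokens)
    return {"completion": tokenizer.decode(output_ids[0], skip_special_tokens=True)}
```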
So if we go through some of these challenges: we know that each model is different, and because of that variability, each one requires different libraries and different packages to run. So you need a system that can manage whatever environment is required to run the model.
Now, as we mentioned, LLMs are among the bigger models, so they require a lot more GPU memory than the models we may be used to having in production. Thinking carefully about which GPUs can actually run them is very important, and you need a system that can handle the large memory requirements involved. With that in mind, you have to work out which GPU is the best one to deploy the model on. The solution we're heading towards is one where you don't even have to think about these things, but today, every time you deploy a model you have to figure out exactly what it requires, both in terms of libraries and memory.
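To make that memory question concrete, here's a rough back-of-the-envelope sketch. It only accounts for weight storage plus an assumed overhead factor; real usage also depends on batch size, sequence length, and the KV cache.

```python
def estimate_gpu_memory_gb(num_params_billion: float,
                           bytes_per_param: int = 2,
                           overhead: float = 1.2) -> float:
    """Rough GPU memory needed just to hold the weights, plus a fudge factor.

    bytes_per_param: 2 for fp16/bf16, 4 for fp32, 1 for int8 quantization.
    overhead: assumed multiplier for activations / KV cache; real usage varies.
    """
    return num_params_billion * bytes_per_param * overhead

# e.g. a 7B model in fp16: ~7e9 params * 2 bytes * 1.2 ≈ 17 GB,
# which already rules out most 16 GB cards before a real KV cache is added.
print(estimate_gpu_memory_gb(7))   # ~16.8 GB
print(estimate_gpu_memory_gb(40))  # ~96 GB -> multi-GPU territory
```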
One of the classic pitfalls we're seeing right now is the limited GPU availability for running all of these models: they consume so much memory, the cards are very expensive, and everyone wants them. So the question becomes how to access a huge pool of GPUs when one cloud provider doesn't have capacity.
How can you easily go to a different cloud provider that actually has the GPUs your model requires? Then, when you start thinking about scaling: it's one thing to deploy on one instance with one GPU, but you also need to consider how many users are going to be using this model.
Is it going to be just one user? How many requests per second will it be handling? You may have to plan for potentially hundreds of requests per second, and if those run concurrently, that means hundreds of GPUs running at the same time. So you need software that can dynamically scale according to how much traffic each model is getting. At the same time, we obviously want to make sure we don't spend thousands and thousands per hour running some of these models, so you also need software that can dynamically work out the most cost-effective GPU to run this on.
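As a toy illustration of the replica and cost math behind that kind of autoscaling decision, here's a small sketch. The request rates, latencies, concurrency, and GPU prices are made-up placeholders, not benchmarks of any particular model or provider, and the formula ignores queueing effects.

```python
import math

def replicas_needed(requests_per_second: float,
                    seconds_per_request: float,
                    max_concurrency_per_gpu: int = 1) -> int:
    """GPU replicas needed to keep up with steady traffic (no queueing model).

    Each replica serves max_concurrency_per_gpu requests at once (e.g. via
    batching); seconds_per_request is the observed end-to-end latency.
    """
    in_flight = requests_per_second * seconds_per_request
    return max(1, math.ceil(in_flight / max_concurrency_per_gpu))

def hourly_cost(replicas: int, gpu_price_per_hour: float) -> float:
    return replicas * gpu_price_per_hour

# Illustrative numbers only: 200 req/s, 1.5 s per generation, 8 concurrent per GPU.
r = replicas_needed(200, 1.5, max_concurrency_per_gpu=8)     # -> 38 replicas
print(r, hourly_cost(r, gpu_price_per_hour=2.0))             # pricier GPU/provider
print(r, hourly_cost(r, gpu_price_per_hour=0.7))             # cheaper GPU/provider
```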
There are different techniques for that, like using spot instances, or leveraging multiple cloud providers where one may be cheaper than the other, et cetera. Now, obviously, you want the end user experience to be as good as possible, so we are targeting, or ideally the industry should be targeting, sub-50-millisecond API latency, which makes for a much smoother experience with something like streaming. Then there are other optimizations to think about.
The main one is: how do you make sure that everything is ready on the GPU when the user is about to make an API call to the model that has been deployed? That's something we call preemptive caching: making sure the model is cached on the GPU ahead of the request actually being processed.
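Here's a minimal sketch of that idea, assuming a worker that loads and warms the model before it ever accepts traffic; the function names and cache structure are illustrative, not Mystic's actual implementation.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

_CACHE: dict[str, tuple] = {}  # model name -> (tokenizer, model) already on GPU

def preload(model_name: str) -> None:
    """Load weights onto the GPU and run a dummy generation so CUDA kernels and
    memory allocations are warm *before* the first real request arrives."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(
        model_name, torch_dtype=torch.float16, device_map="auto"
    )
    warmup = tokenizer("hello", return_tensors="pt").to(model.device)
    with torch.no_grad():
        model.generate(**warmup, max_new_tokens=1)
    _CACHE[model_name] = (tokenizer, model)

def handle_request(model_name: str, prompt: str) -> str:
    # No cold load on the request path: the model is already resident on the GPU.
    tokenizer, model = _CACHE[model_name]
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=64)
    return tokenizer.decode(out[0], skip_special_tokens=True)

# Called at worker startup (or when a scheduler predicts traffic), not per request:
# preload("tiiuae/falcon-7b-instruct")
```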
As I mentioned, maybe the nicest experience we're seeing with LLMs is that kind of streaming, as if you were talking or chatting to someone: the words come out as soon as possible, as opposed to a whole chunk of text being thrown at you.
So you need software that can provide that kind of streaming, and that also lets you deploy these types of models in the easiest way possible for the data science team, so they don't have to go through massive layers of software development, or ask other teams: hey, I have my Jupyter notebooks or my models, please deploy them and make sure it all works. How can we bring that barrier closer to the data scientists and empower them to do it themselves? Then, obviously, you need to be monitoring the full system, because things break, so that you can quickly resolve whatever breaks.
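Going back to the streaming point, here's a minimal token-streaming sketch using Hugging Face transformers' TextIteratorStreamer. It's a generic approach, not Mystic's streaming layer, and the model name is illustrative.

```python
from threading import Thread

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer

MODEL_NAME = "tiiuae/falcon-7b-instruct"  # illustrative

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.float16, device_map="auto"
)

def stream_completion(prompt: str, max_new_tokens: int = 128):
    """Yield the completion piece by piece instead of one big chunk at the end."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    streamer = TextIteratorStreamer(
        tokenizer, skip_prompt=True, skip_special_tokens=True
    )
    # generate() blocks, so run it in a background thread and consume the streamer.
    thread = Thread(
        target=model.generate,
        kwargs=dict(**inputs, streamer=streamer, max_new_tokens=max_new_tokens),
    )
    thread.start()
    for token_text in streamer:
        yield token_text  # e.g. forward as server-sent events to the client
    thread.join()

# for chunk in stream_completion("Explain preemptive caching in one sentence:"):
#     print(chunk, end="", flush=True)
```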
And especially with open source, one of the reasons it's so exciting for so many companies, at any scale, is being able to run these things on your own premises or in your own cloud VPC, so you want to make sure you optimize for privacy and security. That means you need software that can be deployed on whatever premises the company you're working for requires.
So how are people currently solving all of these challenges? I think there are three high-level approaches. Option one is: look at the current software engineering team and the skills they already have, and leverage classic Kubernetes plus Docker, where each Docker image holds a different model and Kubernetes handles the scaling. That's what we're used to, because we run a lot of microservices, mostly on CPU compute, and we assume it's going to be the same for GPU workloads. Surprise: it's actually not the same, and a lot of challenges come up when you try to deploy on Kubernetes specifically for GPU compute.
It's a reasonable way to approach it, because it gives you full control and lets you deploy on infrastructure you may already have in place, but it also requires the experience and expertise of knowing how to build this, with all the challenges I've just mentioned, and of making sure all of them get solved with the software you already have in your company.
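For teams taking that first route, the GPU-specific part usually starts with requesting `nvidia.com/gpu` resources on the pod spec. Here's a minimal sketch using the official Kubernetes Python client; the image name, memory sizes, and namespace are placeholders, and note that this alone still leaves autoscaling, GPU bin-packing, and cold starts for you to solve.

```python
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside the cluster

container = client.V1Container(
    name="llm-server",
    image="registry.example.com/falcon-7b-server:latest",  # hypothetical image
    ports=[client.V1ContainerPort(container_port=8000)],
    resources=client.V1ResourceRequirements(
        # The GPU request is what pins this pod to a GPU node.
        requests={"nvidia.com/gpu": "1", "memory": "24Gi"},
        limits={"nvidia.com/gpu": "1", "memory": "32Gi"},
    ),
)

deployment = client.V1Deployment(
    api_version="apps/v1",
    kind="Deployment",
    metadata=client.V1ObjectMeta(name="falcon-7b"),
    spec=client.V1DeploymentSpec(
        replicas=1,
        selector=client.V1LabelSelector(match_labels={"app": "falcon-7b"}),
        template=client.V1PodTemplateSpec(
            metadata=client.V1ObjectMeta(labels={"app": "falcon-7b"}),
            spec=client.V1PodSpec(containers=[container]),
        ),
    ),
)

client.AppsV1Api().create_namespaced_deployment(namespace="default", body=deployment)
```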
And obviously, building something like this, an internal ML platform, from scratch does take a lot of time. The second option is to go to a cloud provider and use whatever solution they offer, like Vertex AI, AWS SageMaker, et cetera. The problem with that is you're going to be limited to whichever cloud provider you choose from the beginning.
So there's a lot of cloud lock-in, of course, you still need expertise in that specific cloud vendor, and it will still take time and resources to maintain all of that infrastructure on the vendor you've chosen. And then finally there's the third option, which is rising a lot more now: new competitors and players trying to give you all of this infrastructure out of the box.
So, going back to the first question: given this model, how do you get an API endpoint out of the box? That is the goal. This is the kind of solution I'm going to talk about, the product we've built at Mystic, which really lets companies get to market fastest, because everything is already there.
All the challenges I've mentioned are solved, and it lets teams deploy these LLMs immediately, wherever they want, in any cloud provider, et cetera. The only drawback is that you're not building the infrastructure yourself, so maybe you don't get those learnings, but companies have so many other challenges they should be focusing on instead of deployment infrastructure.
In the same way that we don't build our own payments infrastructure, Stripe handles that, maybe we should let other companies handle the infrastructure for deployment. And this is where Pipeline Core comes in, which is the product we built at Mystic. It's very much built for data scientists.
With data scientists in mind, the question is: how can we empower them to deploy these models on reliable infrastructure immediately? It all starts with a small set of decorators. For instance, in the example I'm showing here, you just decorate the different functions that define your machine learning pipeline.
The beauty of this is that it's not limited to any framework. You can do whatever you want: run a combination of different frameworks, include pre-processing code, post-processing code, whatever Python code you want to run, and it supports it. It's not limited to a specific file format or framework, as we see with some other software.
This first part is for loading the model; in this case it's very simple, I'm just loading Falcon 7B. Then there's another function where you define how the inference part should work, and then you define the step-by-step of the machine learning pipeline.
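The actual slide code isn't captured in the transcript, so here's a plain-Python stand-in with the same shape Oscar describes: a load step for Falcon 7B and an inference step, which is roughly what Pipeline Core's decorators would wrap. The decorator API itself isn't shown because it isn't in the transcript.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Step 1: load the model (in Pipeline Core this would be the decorated "load" function).
def load_falcon_7b():
    tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-7b")
    model = AutoModelForCausalLM.from_pretrained(
        "tiiuae/falcon-7b", torch_dtype=torch.float16, device_map="auto"
    )
    return tokenizer, model

# Step 2: inference (the decorated "predict" function), plus whatever pre/post-processing
# you want — plain Python, not tied to a single framework.
def generate(tokenizer, model, prompt: str, max_new_tokens: int = 100) -> str:
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

# Step 3: the pipeline is just these steps chained together.
# tokenizer, model = load_falcon_7b()
# print(generate(tokenizer, model, "The three options for deploying an LLM are"))
```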
And then, finally, I wanted to show you a bit of a demo, but I'm running out of time. Basically it's a load test: as soon as you define this pipeline, you upload it and you get an API endpoint you can hit at scale with the LLM you just uploaded to the platform. Out of the box you also get a dashboard to monitor how well your platform is doing; this is one of our customers, who are getting a p95 of 25 milliseconds.
So you can think of Pipeline Core as this: as simply as we can, we give companies the power of an experienced dev team for ML workloads, with dynamic scaling and cost optimization. You're able to deploy across any cloud or on-prem instantly, run streaming, batch, and online inference, and a bunch of other amazing things. We are onboarding companies right now, and we are handpicking them, so please go try it.
Book a demo and one of our team members will be in touch with you; we'd love to show you more about what we've built with Pipeline Core and help you get those models deployed as soon as possible, onto whatever infrastructure you desire. Thank you so much for your time.
If you need anything from my end, you can reach out to me on LinkedIn or email, or feel free to reach out to us at Mystic AI, or please do book a demo and I can show you a lot more about Pipeline Core. Thank you so much. Awesome, thanks so much, Oscar; so excited to hear about all the new stuff that's going on at Mystic AI.
Thanks for giving us a preview of what's going on. Thanks, Lily. Cool. All right, we'll send you to the chat if people have questions for Oscar.