Graduating from Proprietary to Open Source Models in Production
Philip Kiely is a software developer and author based out of Chicago. Originally from Clive, Iowa, he graduated from Grinnell College with honors in Computer Science. Philip joined Baseten in January 2022 and works across documentation, technical content, and developer experience. Outside of work, he's a lifelong martial artist, a voracious reader, and, unfortunately, a Bears fan.
At the moment, Demetrios is immersing himself in Machine Learning by interviewing experts from around the world in the weekly MLOps.community meetups. Demetrios is constantly learning and engaging in new activities to get uncomfortable and learn from his mistakes. He tries to bring creativity into every aspect of his life, whether that be analyzing the best paths forward, overcoming obstacles, or building LEGO houses with his daughter.
Model endpoints are a good way to prototype ML-powered applications. But in a production environment, you need security, privacy, compliance, reliability, and control over your model inference — as well as high results quality, low latency, and reasonable cost at scale. Learn how AI-native companies from startups to enterprise are using open source ML models to power core production workloads performantly at scale.
AI in Production
Slides: https://docs.google.com/presentation/d/1RnoNUYjTl_tmwoGbjWpd8Lg297UQUHbSZz16oN-WIkk/edit?usp=drive_link
Demetrios [00:00:00]: I'm excited to have my man Philip join us. Philip, what's up, dude?
Philip Kiely [00:00:05]: Hey, everyone. How's it going? I'm really excited to be here. Everything coming through loud and clear?
Demetrios [00:00:10]: Yep, yep. You don't hear me double echo, triple echo, do you?
Philip Kiely [00:00:15]: I think there's an echo.
Demetrios [00:00:18]: That's the special effects that I put on my voice to sound cooler, but I can quickly ease the pain by getting off the stage and letting you present all about how you can go from proprietary to open source models, which is a very hot topic that lots of people in here are thinking about. So I'll get out of here, I'll get this QR code out of here, and I'll let you do your magic, man. I'll be back in about 20 minutes. 25 minutes.
Philip Kiely [00:00:49]: Awesome. Thank you so much. Unfortunately, I left my excellent hat at home. But hi, I'm Philip, I'm from Baseten, and today I'm going to be talking about graduating from proprietary to open source models. So I have here a little intro slide. While I go through this, what I want you to do is go to chat, drop your name, and drop where you're calling in from today, so I can get to know who I'm speaking with. With a virtual conference, it's sometimes hard to make that connection, but I really want this to feel like we're all in the same room together. So, I'm Philip, I do developer relations at Baseten, and I'm calling in from Chicago.
Philip Kiely [00:01:26]: A little fun fact about me is I'm a big martial arts fan. I'm going to drop a couple of metaphors like that throughout the presentation. And my favorite open source model is Playground v2 Aesthetic, which you're going to see a whole ton of today. So we've got 20 minutes together, and what are we going to cover? I've got three topics for you today. Number one, we're going to cover the benefits of open source models. We're going to talk about how you can own your data end to end and ensure security, compliance, observability, reliability, all that great stuff. Then we're going to talk about open source as best-in-class models. If your model isn't any good, it doesn't really matter what else you're building on top of it.
Philip Kiely [00:02:06]: So we're going to make sure that we're getting really high quality models and getting the best results out of those models. And then finally, we're going to talk about getting this into production, because this is the AI in Production conference. So we're going to talk about how you build an inference endpoint, which is basically a way to make a request, have it run your model for you, and give you the output. Let's jump in. Let's get started. I'm seeing some great hellos in the chat. Hello to Utah, to New Orleans, to Philly. I was just there last weekend.
Philip Kiely [00:02:39]: Awesome. So what is an open source model? This can be a little bit of a subject of debate, so just for this presentation, an open source model is any model where you can grab the weights, off Hugging Face or GitHub or wherever, grab a GPU, either a GPU on your computer or a GPU in the cloud, and run that model. So this is not going to be something like GPT-4. That's a proprietary model. We're going to be talking about open source models like Stable Diffusion, where you can own your whole infrastructure end to end. All right, so what are the benefits of open source models? If you've ever used an open source model in production, by the way, I'd be really curious to know what benefits you got. So if you could go in chat and let me know what benefits you found from open source models, I would love to hear it.
Philip Kiely [00:03:31]: I'm going to talk through some of the stuff that we've seen and some of the stuff that our customers have seen from using open source models. All right, so I told you there'd be a lot of Playground v2. All these friendly guys in the corner of my slides are Playground v2 images, just to keep you company and keep you entertained throughout this presentation. So the first benefit of open source models is that there are a lot of them. There are something like 500,000 open source models up on Hugging Face, and even more published every single day. What that means is that you get a lot of variety, so you can pick whichever ones work best for you. But you also get specialization. You can find models that are specifically created for your use case in a way that a more general proprietary model might not be.
Philip Kiely [00:04:17]: You can also use multiple models together for stuff like retrieval-augmented generation and multimodal image generation pipelines. But what else? You also get independence and control. I don't know about the rest of you, but I'm an AT&T customer, and when I woke up this morning, my phone didn't work. That's what's called a single point of failure: you're relying on a single provider to do something for you, and if they go down, there's nothing you can do. Fortunately, it's working again. So we're here, we're having the conference.
Philip Kiely [00:04:49]: But when you have an open source model, instead of being reliant on a single provider and their endpoint, and that endpoint's uptime, you're able to build your own infrastructure and make sure that you have fallbacks and reliability. You can also choose from multiple options. Maybe you like the way that Mistral works better, or the way that Llama works better, in terms of outputs and alignment. You can pick your own infrastructure provider. I'm going to give you a little hint: I'm going to tell you at the end you should pick Baseten. And you can also avoid something that's really important, which is deprecation and platform shifts. If you have the model weights, those model weights don't change unless you change them. What that means is that instead of a model maybe changing underneath you, you get to decide how long you run a version of the model and when you update to a new version as new versions come out.
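The fallback idea described here can be illustrated with a minimal sketch. It assumes two hypothetical model deployments, represented by the stand-in functions `primary_deployment` and `backup_deployment`; none of these names come from the talk, and a real setup would call actual endpoints and catch narrower errors.

```python
from typing import Callable, List, Tuple


class AllBackendsFailed(Exception):
    """Raised when every configured backend has failed."""


def generate_with_fallback(
    prompt: str,
    backends: List[Tuple[str, Callable[[str], str]]],
) -> Tuple[str, str]:
    """Try each backend in order; return (backend_name, output) from the first success."""
    errors = []
    for name, call in backends:
        try:
            return name, call(prompt)
        except Exception as exc:  # real code would catch timeouts and 5xx errors specifically
            errors.append(f"{name}: {exc}")
    raise AllBackendsFailed("; ".join(errors))


# Hypothetical stand-ins for two independent deployments of the same open source model.
def primary_deployment(prompt: str) -> str:
    raise ConnectionError("primary is down")


def backup_deployment(prompt: str) -> str:
    return f"echo: {prompt}"


used, output = generate_with_fallback(
    "hello",
    [("primary", primary_deployment), ("backup", backup_deployment)],
)
print(used, output)
```

Because the weights are the same in both deployments, the backup should produce comparable output, which is harder to guarantee when the fallback is a different proprietary endpoint.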
Philip Kiely [00:05:43]: The final benefit we're going to cover today is privacy and security. And I see that in chat: open source models for data privacy. You can ensure that your inputs and outputs are private and that they're not going to be used for future training, because you're controlling this end to end. You can set your own security and access policies, and, like we have at Baseten, you can achieve SOC 2 and HIPAA compliance. Cool. So those are some of the benefits. But like I said at the beginning, even if open source models offer all these great benefits, none of it matters if the output quality isn't there. So let's go through some different open source models, talk about how to get great results from them, and talk about what's on the market that you should be taking a look at.
Philip Kiely [00:06:29]: Like I said at the beginning, my favorite model is Playground v2. But if you have a favorite open source model, I'd love it if you could head over to the chat and let me know what it is. Open source models have a huge range of use cases. You can do chat, autocomplete, code generation. I'm not even going to run through this whole list. I think it's maybe even cut off here, because there's so much stuff you can do with those hundreds of thousands of open source models that are out there. Today we're going to take a look at four categories that are maybe a little more popular, and look at some of the best models within those.
Philip Kiely [00:07:04]: So the number one category of ML models, right, is language models. And we've got this llama over here. There are a lot of different foundation model families to choose from. A foundation model family is a bunch of different models, all released under the same architecture by the same model provider. Some examples are the Mistral family of LLMs, the Llama family, and Google's brand new Gemma that they announced this week. Alibaba has Qwen, Microsoft has Phi, and Stability has StableLM. Those two are a bit smaller. But with all of these choices, the question is, how do you decide between them? So we're going to take a look at evaluating LLMs.
Philip Kiely [00:07:47]: LLMs come in what I like to call weight classes. Like I said, I'm a big martial arts guy. In jiu-jitsu, you compete in weight classes to make sure that a guy three times my size isn't just tossing me out the window. In LLMs, there's a similar idea with different sizes of models. You compare 7 billion parameter LLMs to 7 billion parameter LLMs, 13 to 13. What this means is that with these different LLM sizes, you can run them on different GPUs and you can expect different performance out of them, both in terms of inference speed and in terms of results quality. Now, this can get a little more complicated when you address things like the mixture of experts architecture that Mixtral uses. So there are models that kind of exist in between these weight classes.
Philip Kiely [00:08:39]: It's kind of like a UFC champion who might be really good and fight up and fight down to take on as many great competitors as possible. But in addition to the size angle, there's also this variant angle, where maybe you want to use the foundation model directly, or maybe you want to use a fine-tune of it. I'm seeing a lot of people in chat saying that they really like Code Llama. They're able to use that foundation model directly, and that's awesome. But sometimes you have a fine-tune that maybe does a little better on the leaderboard, or maybe does a little better on an eval that you run. Something like Zephyr, which is Mistral, just taken by the Hugging Face team and additionally fine-tuned to be a little bit more of that aligned chat model that we're used to. And the thing about choosing between foundation models and fine-tunes is that every day on the leaderboards,
Philip Kiely [00:09:35]: there's some new fine-tune that ekes out another point on the benchmarks versus the last one, and this is a really useful signal. These benchmarks exist, and they're popular for a reason, but they're not the whole truth. What really matters when evaluating large language models is whether it's going to work for your use case. Shout out to Austin, who right before this was talking through some of this material, so I'm not going to do a worse job saying what he just said. But overall, you want to make sure that you're evaluating end to end how this model performs in your system from both a results and a quality perspective, not just saying, oh, this model scored 27.52 instead of 27.51 on this eval. So you've got to look at your whole system. Now, on to image models.
Philip Kiely [00:10:22]: You've got another great range of options, all kind of in the Stable Diffusion family. You have Stable Diffusion and SDXL. You've got the Turbo version, which goes really fast: not quite as high quality, but really fast. We're talking frames-per-second fast when you get it up on an H100. And then you can mix in stuff like ControlNet to do masks and make image generation pipelines that produce better results than what you can do with just the foundation model. And then, of course, Playground, which made all these images that you're seeing in my slides, like this friendly guy. He's maybe going on a hike, mountain exploration, I don't know, but he looks prepared. Audio models are really a place where I think open source shines. We've got Whisper, which is now on Whisper v3. It's really a best-in-class audio transcription model. It's actually by OpenAI, but it's open source and has a ton of great use cases.
Philip Kiely [00:11:18]: And then there's this newer model, Piper text-to-speech, which does speech synthesis in dozens of different voices and languages. You also have experimental models for sound effects, music, all sorts of really interesting things coming out of these audio models. And then finally you have multimodal models. This is a place where open source is in some ways catching up and in some ways way ahead. You have stuff like Qwen-VL and LLaVA, which are general purpose visual LLMs that are somewhat similar to GPT-4V. But I'm not going to stand here and pretend that the performance on those is at that level yet. It's going to be, but it's not quite there. Where open source does have a huge edge is in really specialized models, like I talked about before: stuff like document QA, where you're extracting information from PDFs in a very structured format. There are all these very specialized models for tasks like OCR and question answering that can be combined in interesting ways to produce really high quality multimodal systems.
Philip Kiely [00:12:22]: So now that we've seen that there's a huge variety of models out there for you to build with, the question is, how do you build with them? What are you going to build? The first question I think a lot of people turn to is infrastructure. If you have a GPU that you use to run models locally, go ahead and tell me about it in chat. If you have a 4090, if you've got an H100 sitting in your closet, this is your time to go flex. If you're like me and you only have maybe eight gigs of VRAM on the laptop that you're running this presentation on, don't worry, I'm going to talk you through how you can get stuff on the cloud and get access to these great GPUs. So there are a couple of terms that I just want to run through as we get started here. And unfortunately, this guy is our last friendly helper in this presentation. After this, it's going to be a couple of infrastructure diagrams that maybe aren't quite as fun to look at. But model serving is our major term here. That's running inference, which is putting in inputs and getting out outputs, on ML models in production. For that, you're going to need some resources.
Philip Kiely [00:13:33]: You're going to need a GPU and a CPU, and you're going to need an endpoint, which is a way that you can hit an API and get back your model responses. You're going to get this by creating a model deployment, which is a dedicated instance of everything you need to run the model. And then we're going to talk about autoscaling, which is how you're going to scale these deployments up and down in response to traffic, so you're not just spending all of your money on GPUs all the time. So to get into it a little bit, the first thing is model customization. Again, the whole point of open source, right, is that we're able to take these really powerful models and make them work exactly how we want. We can set exactly the Python and system requirements we're going to need. We're going to pick the GPU that gives us the right speed and cost trade-off. And because we're making our own API endpoint, we can specify exactly what we want the input to be, exactly what we want the output to be, and anything we want to happen to integrate this into our application, like, say, saving the output to a database or parsing the input before it goes to the model. From there, the really big question is how you serve this model in an optimized manner. And this is something that I've spent a ton of time learning about in the past few months.
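A minimal sketch of the customization described above: a model wrapper that loads once, parses the input before it reaches the model, and saves the output afterwards. The class name, method names, and the toy scoring logic are all hypothetical illustrations, not any particular serving framework's API, and the in-memory list stands in for a database.

```python
from typing import Any, Dict, List


class SentimentModel:
    """Hypothetical wrapper: load once, then preprocess -> predict -> postprocess per request."""

    def __init__(self) -> None:
        self._loaded = False
        self.saved_outputs: List[Dict[str, Any]] = []  # stand-in for a database table

    def load(self) -> None:
        # A real deployment would load model weights onto the GPU here.
        self._positive_words = {"great", "good", "love"}
        self._loaded = True

    def preprocess(self, request: Dict[str, Any]) -> List[str]:
        # Parse and validate the input before it goes to the model.
        text = request.get("text")
        if not isinstance(text, str) or not text.strip():
            raise ValueError("request must include a non-empty 'text' string")
        return text.lower().split()

    def predict(self, tokens: List[str]) -> float:
        # Toy "inference": fraction of tokens that are positive words.
        if not self._loaded:
            raise RuntimeError("call load() before predict()")
        hits = sum(1 for token in tokens if token in self._positive_words)
        return hits / len(tokens)

    def postprocess(self, score: float) -> Dict[str, Any]:
        # Shape the output exactly how the application wants it, then save it.
        result = {"score": round(score, 2), "label": "positive" if score > 0 else "neutral"}
        self.saved_outputs.append(result)
        return result

    def handle(self, request: Dict[str, Any]) -> Dict[str, Any]:
        return self.postprocess(self.predict(self.preprocess(request)))


model = SentimentModel()
model.load()
response = model.handle({"text": "I love this great model"})
print(response)
```

Keeping the parsing and saving steps next to the model call is the design choice being illustrated: the endpoint's contract is defined by the application, not by a provider.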
Philip Kiely [00:14:51]: So you start with the hardware, right? You've got a GPU, you've got a CPU, you've got, you know, a connection to the network, and that's great. Everyone has one of those. The question is, how do you get fast results out of that? The first thing to turn to is a highly optimized serving engine. We've been using TensorRT a lot, which is by NVIDIA, but there are a ton of other ones, like vLLM and TGI, that can also offer great results on certain models. And you're going to want to make sure that you implement your model server to use that engine to get these optimized results. We've seen stuff like a 40% improvement in latency on Stable Diffusion XL using TensorRT; we just published about that today. We've also seen great results with Mixtral on TensorRT-LLM. You're also going to want to decide what quantization to run your model at.
Philip Kiely [00:15:43]: This is another place where you can trade off between quality and speed and cost. Now, this could be a whole 20-minute talk on its own, so I'm not going to go super deep into it, but overall, you're usually running models at float16, FP16, and you can bring that down to INT8 and INT4 if you do it carefully and validate that your model is still working well. This can give you big speedups: 2x to 4x faster inference, or 2x to 4x lower cost. The last thing that we really need to talk about is batching, which is going to let you trade latency for throughput. That's kind of the secret that allows you to serve big audiences and high traffic endpoints. There's also caching, which is going to let you reduce cold start times. What are cold start times? Why do we care about those? Well, that's part of autoscaling. The thing is, you could set all of this stuff up. You could have your GPU, your serving engine, everything's great, it's ready to go.
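One reason quantization moves the cost needle is simple arithmetic on weight size. The sketch below estimates the memory needed for model weights alone at each precision mentioned above; the function name is hypothetical, 1 GB is taken as 10^9 bytes, and the estimate ignores activations, the KV cache, and engine overhead, so real memory use is higher. Actual speedups depend on the hardware and serving engine and are not computed here.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params: float, precision: str) -> float:
    """Approximate memory for model weights only, in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


# A 7 billion parameter model at each precision.
for precision in ("fp16", "int8", "int4"):
    print(f"7B @ {precision}: {weight_memory_gb(7e9, precision):.1f} GB")
```

Halving the bytes per parameter halves the weight footprint, which can let the same model fit on a smaller, cheaper GPU, provided the quantized model still passes your own quality checks.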
Philip Kiely [00:16:44]: You're ready to blast through hundreds of requests at once. But if you don't have any traffic, you're just paying for that while it sits there. So what you need to build on top of that is an autoscaling layer that's going to be able to scale up to handle spikes in traffic and scale to zero to save money when stuff's not in use. You're also going to need logging and observability on top of that. So building with open source models does offer a great set of advantages around privacy and security, around control and independence, but there's a bunch of stuff you have to build on top of that as well. And that's why Baseten exists. That's what we do all day, every day: put these open source models, and people's proprietary models that they build themselves, fine-tunes, all that kind of stuff, into production and serve them in an optimized fashion. All right, I think that's my 20 minutes, more or less.
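The core decision inside an autoscaling layer like the one described above can be sketched as a single function. This is an illustrative assumption, not how any specific platform implements it: the function name, parameters, and defaults are hypothetical, and a real autoscaler would also wait through a delay window before scaling down and account for cold start time when scaling up from zero.

```python
import math


def desired_replicas(
    in_flight_requests: int,
    concurrency_per_replica: int,
    min_replicas: int = 0,
    max_replicas: int = 10,
) -> int:
    """Return how many model replicas to run for the current load."""
    if concurrency_per_replica <= 0:
        raise ValueError("concurrency_per_replica must be positive")
    # Enough replicas to cover all in-flight requests, rounded up.
    needed = math.ceil(in_flight_requests / concurrency_per_replica)
    # Clamp between the floor (0 allows scale to zero) and the cost ceiling.
    return max(min_replicas, min(max_replicas, needed))


print(desired_replicas(0, 8))      # no traffic: scale to zero
print(desired_replicas(20, 8))     # moderate traffic
print(desired_replicas(1000, 8))   # spike: capped at max_replicas
```

Setting `min_replicas=1` instead of 0 trades idle GPU cost for avoiding cold starts, which is the trade-off the caching remark in the talk is about.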
Philip Kiely [00:17:40]: So I'm going to just say thank you so much for coming to this talk. You can find me on Twitter as Philip Kiely and on LinkedIn as Philip Kiely, as well as Baseten. Come say hi. Come give us a follow. I'm publishing every week about these topics, and I'd love to help you out with them. I want to shout out Mark Texan for helping me prep for this talk, and I think that there's a Q&A portion at the end, so I'm going to turn it back over to Demetrios to help answer some of your questions.
Demetrios [00:18:08]: Most definitely. So while people are getting all their questions coming through in the chat, I want to ask you about vLLM versus what you're using, TensorRT. Can you give us a breakdown of why and what the choices were there?
Philip Kiely [00:18:27]: Yeah, so we use both, and there are definitely great advantages to both libraries. We've been working with the NVIDIA team directly on TensorRT for a few months. They've been helping us create these really optimized model serving engines. Ultimately, what it comes down to is that TensorRT takes advantage of a lot of the features of the brand new GPUs, like the L4, the A100, and most especially the H100, which is where you're going to see the best performance. So being able to take advantage of the features of these newer, more powerful GPUs is a great reason to use TensorRT. And that's where we've seen a lot of our performance gains coming from.
Demetrios [00:19:09]: Excellent. So we've got one question coming through here, and I want to take a swing at this question. Bish was asking what your thoughts are on data security promised by OpenAI after subscriptions or enterprise plans.
Philip Kiely [00:19:23]: Yeah. So you can take any endpoint provider at their word on their security promises, or you can take an open source model and use a provider like Baseten to have a dedicated deployment of it, and then you don't have to take anyone's word for it.
Demetrios [00:19:44]: Yeah, that's zero trust. I also was just reading a paper the other day that I mentioned last week, and it was all about the ways that models will leak data unintentionally. So even if OpenAI is actually doing what they say they're doing, they are leaking data in ways that they don't even realize. And so that is another piece of the puzzle to be worried about, or just be cognizant of. Maybe not worried. Let's not live in fear. So another question coming through here: is TensorRT capable of continuous batching like vLLM?
Philip Kiely [00:20:28]: I know that we do a lot of batching with TensorRT to make our price per million tokens make sense, even though we work in more of a per-GPU-hour format. I am personally not the engineer on that project, so I don't want to say 100% that it is. But overall, I know that we're doing a lot with batching, and I know we're using it for production use cases. So I know that, in some way, the batching is working the way you would want it to.
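The link between per-GPU-hour pricing and price per million tokens mentioned in this answer is a unit conversion, sketched below. The function name and the dollar and throughput figures are made-up illustrations chosen for round numbers, not Baseten pricing or measured performance.

```python
def cost_per_million_tokens(gpu_cost_per_hour: float, tokens_per_second: float) -> float:
    """Convert an hourly GPU price and a sustained throughput into dollars per 1M tokens."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    tokens_per_hour = tokens_per_second * 3600
    return gpu_cost_per_hour / tokens_per_hour * 1_000_000


# Illustrative numbers only: the same hypothetical $3.60/hour GPU at two throughputs.
unbatched = cost_per_million_tokens(3.60, 100)    # one request at a time
batched = cost_per_million_tokens(3.60, 1000)     # many requests batched together
print(f"unbatched: ${unbatched:.2f} per 1M tokens")
print(f"batched:   ${batched:.2f} per 1M tokens")
```

Since the hourly price is fixed, every increase in total tokens per second from batching lowers the effective per-token cost by the same factor.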
Demetrios [00:21:03]: Excellent. Do you know any open source models that perform well in the medical and healthcare fields?
Philip Kiely [00:21:11]: Yeah. So one thing that I think is really promising for medical and healthcare is retrieval-augmented generation. I'm probably the millionth person here at this conference to say those words, so I won't get into that too much. But we've seen a ton of great use cases around different healthcare companies and making sure that their models are HIPAA compliant. I also think that there's a ton that ML can offer on the less sexy side, stuff like OCR and document parsing, that all kinds of companies, including healthcare companies, definitely need.
Demetrios [00:21:51]: Excellent, man. All right, last one for you. This is a bit of a contemplative one. If you have a crystal ball, what do you think about Groq? How long will it be before companies start serving with LPUs, if ever?
Philip Kiely [00:22:06]: Yeah, I've definitely done a bit of research on that, but I'm far from an expert, so I don't want to go on record with any sort of hard opinion one way or another. I think it's super...
Demetrios [00:22:15]: Oh, you're a diplomat. I like it.
Philip Kiely [00:22:18]: ...a bunch of companies trying their own hardware, and you definitely can't argue with the performance metrics they put out. But at the same time, when you look at the overall cost of the system, I've read some breakdowns on that. I think that there's still a lot of room in the market for multiple different approaches.
Demetrios [00:22:39]: Excellent, dude. This has been a blast. Thank you so much for coming on here and making it fun with the mixed martial arts references and also your little copilots that joined us along the way. I think you're able to go on to the chat right now, because there are a few more questions coming through, so you can go and answer those as you please. This has been great, man. I'm going to sign off here because we're going to keep it rocking and rolling.
Philip Kiely [00:23:06]: Thanks, everyone. It was fun being with you today.