Opportunities and Challenges of Self-Hosting LLMs
Meryem is a former physicist turned tech entrepreneur. She is the co-founder and CEO of TitanML, which solves the core infrastructural problems of building and deploying Generative AI so that customers can build better enterprise applications and deploy them in their own secure environments. Outside the world of AI, Meryem lends her energy and expertise to supporting diverse voices in the tech scene and mentoring female and minority group founders.
I'm a tech entrepreneur and I spent the last decade founding companies that drive societal change.
I am now building Deep Matter, a startup still in stealth mode...
I was most recently building Telepath, the world's most developer-friendly machine learning platform. Throughout my previous projects, I had learned that building machine learning powered applications is hard - especially hard when you don't have a background in data science. I believe that this is choking innovation, especially in industries that can't support large data teams.
For example, I previously co-founded Call Time AI, where we used Artificial Intelligence to assemble and study the largest database of political contributions. The company powered progressive campaigns from school board to the Presidency. As of October 2020, we had helped Democrats raise tens of millions of dollars. In April 2021, we sold Call Time to Political Data Inc. Our success is due, in large part, to our ability to productionize machine learning.
I believe that knowledge is unbounded, and that everything that is not forbidden by laws of nature is achievable, given the right knowledge. This holds immense promise for the future of intelligence and therefore for the future of well-being. I believe that the process of mining knowledge should be done honestly and responsibly, and that wielding it should be done with care. I co-founded Telepath to give more tools to more people to access more knowledge.
I'm fascinated by the relationship between technology, science and history. I graduated from UC Berkeley with degrees in Astrophysics and Classics and have published several papers on those topics. I was previously a researcher at the Getty Villa where I wrote about Ancient Greek math and at the Weizmann Institute, where I researched supernovae.
I currently live in New York City. I enjoy advising startups, thinking about how they can make for an excellent vehicle for addressing the Israeli-Palestinian conflict, and hearing from random folks who stumble on my LinkedIn profile. Reach out, friend!
LLM deployment is notoriously tricky, leaving ML teams with little time left to focus on driving business value. So what can we do? If you run or are a part of a data science team working with LLMs, this one’s for you.
AI in Production
Slides: https://docs.google.com/presentation/d/16VIzJA9dtn5R_46QxHMUGOeEXaYKirB-FEv7NNfw440/edit?usp=drive_link
Adam Becker [00:00:04]: Today, Meryem, you're going to talk to us about hosting. Self-hosting, right? Which I imagine would be yet one more consideration for AI product management.
Meryem Arik [00:00:13]: I know they don't have enough to think about.
Adam Becker [00:00:17]: Exactly. Okay, the floor is yours. I'll be back in ten minutes.
Meryem Arik [00:00:22]: Awesome. So today I'm going to talk about the opportunities and challenges of self-hosting LLMs. Over the last year we've seen a lot of people working with LLMs, but it's been mainly API-based, and we're really interested in self-hosting them. So I'm going to talk about why you might want to do this and what pitfalls you might experience along the way. So firstly, very quickly, I'll define what I mean by self-hosting LLMs. Broadly, I think of there being two camps. You can use an API-based model, which is what most people are used to. So using a service like OpenAI or Cohere or Anthropic, where the model is deployed in the third party's environment.
Meryem Arik [00:01:02]: The alternative, which is what I'm going to be talking about today, is self-hosting your language models. So typically this means you use an open-source model, maybe something like Mistral, Mixtral, Llama, BERT, or T5. There are literally thousands of them. And crucially, you deploy it in your own environment, whether that's your cloud VPC, your on-prem, or wherever you would deploy your normal machine learning models. So when I think about self-hosting, that's what I mean by it. Why am I talking about this? So I'm the co-founder and CEO of TitanML. At TitanML we work on making self-hosting LLMs easier so ML engineers can focus on getting back to building great applications. You can check out my LinkedIn as well and hopefully we can speak there.
Meryem Arik [00:01:50]: So I'm broadly going to talk about three things. Firstly, I'm going to talk about the opportunities. So why do I think self-hosting is a good option to begin with? I'm then going to talk about the challenges involved in a self-hosted system. And then thirdly, I'm going to briefly touch upon what a self-hosted stack might look like. How do you build that kind of infrastructure? So firstly, why might you self-host? So I think there are three really compelling reasons why you might want to self-host your large language models. The first one is decreased cost. So at the beginning you might decide to use OpenAI because you're just playing around with it. And the cost there is really, really low.
Meryem Arik [00:02:35]: That is, if you're not using it very much. However, once you've proven product-market fit and you want to actually get it into production at scale, those costs ramp up really quickly. If you're using a self-hosted model, on the other hand, the upfront costs may be a bit larger, but they tail off really quickly. It becomes really, really efficient at scale, so you can think of much decreased costs if you're deploying enterprise-scale applications. Secondly, you have improved performance. So this might be counterintuitive, because a lot of people think that the best performance is on GPT-4, which is true for general use cases. However, when you're fine-tuning or doing very domain-specific tasks, we actually find that the performance is better with smaller, more well-defined models. So that's the second reason.
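The cost argument above can be made concrete with a small break-even calculation. The sketch below is illustrative only: the prices, the fixed monthly cost, and the names `ApiPricing`, `SelfHostedPricing`, and `break_even_tokens` are assumptions made up for this example, not figures from the talk or from any provider's price list.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ApiPricing:
    # Hypothetical blended price per 1,000 tokens, in dollars.
    usd_per_1k_tokens: float


@dataclass
class SelfHostedPricing:
    # Hypothetical fixed monthly cost (hardware rental, operations), in dollars.
    fixed_usd_per_month: float
    # Hypothetical marginal cost per 1,000 tokens on top of the fixed cost.
    usd_per_1k_tokens: float


def api_monthly_cost(pricing: ApiPricing, tokens_per_month: float) -> float:
    """Pay-per-use: cost grows linearly with volume from zero."""
    return pricing.usd_per_1k_tokens * tokens_per_month / 1000


def self_hosted_monthly_cost(pricing: SelfHostedPricing, tokens_per_month: float) -> float:
    """Fixed cost up front, then a (usually smaller) per-token cost."""
    return pricing.fixed_usd_per_month + pricing.usd_per_1k_tokens * tokens_per_month / 1000


def break_even_tokens(api: ApiPricing, hosted: SelfHostedPricing) -> Optional[float]:
    """Monthly token volume above which self-hosting is cheaper.

    Returns None when the API's per-token price is not higher than the
    self-hosted marginal price, because the fixed cost is then never recovered.
    """
    saving_per_1k = api.usd_per_1k_tokens - hosted.usd_per_1k_tokens
    if saving_per_1k <= 0:
        return None
    return hosted.fixed_usd_per_month / saving_per_1k * 1000


if __name__ == "__main__":
    # All numbers below are invented for illustration.
    api = ApiPricing(usd_per_1k_tokens=0.01)
    hosted = SelfHostedPricing(fixed_usd_per_month=2000.0, usd_per_1k_tokens=0.002)
    print(break_even_tokens(api, hosted))
```

With these made-up inputs the break-even point is roughly 250 million tokens per month: below that volume the API is cheaper, above it the self-hosted deployment is. Real comparisons would also need to account for engineering time and utilization, which this sketch folds into a single fixed number.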
Meryem Arik [00:03:26]: And the third reason is about privacy and security. So if you live in Europe like we do, you need to make sure that you have data residency. If you're a healthcare organization, you have extra provisions you need to abide by. And when you self-host, it just removes a lot of the complexity you have to deal with. You don't have to have your lawyers read terms of service from third parties. It's already in your environment, it's already where you're processing data. There's nothing really you need to worry about. And there is a fourth reason, which I don't have here, but actually I'm going to mention it because it's relevant and timely.
Meryem Arik [00:04:02]: Today there was a huge outage at OpenAI for multiple hours, which is a great reason why you should have diversity. So even if you do use OpenAI, you should have backup options for when there are outages. But it's not all roses and lovely. There are some challenges involved with having a self-hosted system, and the biggest one is just complexity. So dealing with a self-hosted system means that you are taking responsibility for infrastructure that previously you didn't have to look after. So you might recognize this guy. This is Clem. He's the CEO of Hugging Face.
Meryem Arik [00:04:43]: I've also heard rumors he's the nicest man in AI, and he tweeted or posted a while ago that it's not fair to compare an open-source model and an API-based model. He said it's more like comparing an engine and a car. I think it was actually him quoting someone else. And he's totally right. When we use an API-based service, there's so much going on beneath the layers that we don't see, whereas when we just have the model, we then have to build all of this infrastructure around it. So it really is like comparing an engine to a car. If we want to self-host, we go from just jumping in the car and driving to just getting the engine and having to deal with the rest ourselves. So if I use an API-based model, broadly, my interaction looks like this.
Meryem Arik [00:05:39]: So I have a user and I query an API and I get a response, right? Everyone knows how these things work. However, beneath the surface of this API there is a lot going on, and this is stuff that you have to deal with if you're self-hosting. You have to deal with a batching server, you have to make sure that your models are fast enough and cheap enough. You have to deal with your Kubernetes infrastructure. OpenAI also deals with things like function calling. And this is a lot of infrastructure that you previously took for granted that now you have to build. So that is by far the most tricky bit.
Meryem Arik [00:06:17]: Now what's also frustrating is that getting this infrastructure right really, really matters. There are huge differences in performance between when teams put a lot of effort into their self-hosting stack and when teams don't put a lot of effort into it. So on the screen you can see a comparison between two infrastructures. One is very, very simply just using Hugging Face, PyTorch, and FastAPI. And you can see that my model is significantly slower. I think in this case it's three times slower. But this also depends on the hardware. We've seen cases where it's up to 12 or even 20 times slower, whereas you can see on the right, when you throw the kitchen sink at it, and you really make sure you're building great infrastructure, you get much, much faster applications, and you can run them on cheaper hardware.
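To give a feel for one of the pieces mentioned above, the batching server, here is a deliberately simplified sketch. It groups incoming prompts into fixed-size batches so the model is called once per batch rather than once per prompt. The names `make_batches`, `serve_batched`, and `fake_model` are hypothetical, and `fake_model` is a stand-in for a real model call; production servers typically form batches dynamically as requests arrive, which this example does not attempt.

```python
from typing import Callable, Iterator, List


def make_batches(prompts: List[str], max_batch_size: int) -> Iterator[List[str]]:
    """Split a list of prompts into consecutive batches of at most max_batch_size."""
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be at least 1")
    for start in range(0, len(prompts), max_batch_size):
        yield prompts[start:start + max_batch_size]


def serve_batched(
    prompts: List[str],
    model_fn: Callable[[List[str]], List[str]],
    max_batch_size: int,
) -> List[str]:
    """Run model_fn once per batch and return outputs in the original order."""
    outputs: List[str] = []
    for batch in make_batches(prompts, max_batch_size):
        outputs.extend(model_fn(batch))
    return outputs


def fake_model(batch: List[str]) -> List[str]:
    """Hypothetical stand-in for a model forward pass over a whole batch."""
    return [f"echo: {prompt}" for prompt in batch]


if __name__ == "__main__":
    print(serve_batched(["a", "b", "c", "d", "e"], fake_model, max_batch_size=2))
```

The point of the sketch is the shape of the interface: the model function takes a list and returns a list, so the cost of one call is shared across every request in the batch.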
Meryem Arik [00:07:11]: So we've seen up to 90% compute cost savings on an ongoing basis by self-hosting with great infrastructure, 3x to 20x latency reductions, 8x memory reductions, and being able to get guaranteed structured outputs like JSON. So getting this infrastructure right really, really matters if self-hosting is right for you. So I'm going to talk very briefly about the self-hosting stack. I think I only have a minute or two left. So the serving toolkit is pretty broad and pretty deep. Here are just a few of the things that we need to think about when we're looking to serve. How efficient is our server? Do we have guaranteed JSON output constraints, for example? Are we quantizing our model? And if we are quantizing our model, are we doing it in a way that preserves the accuracy? LoRA adapters are another thing that are going to become increasingly important in 2024, as more people are looking to fine-tune. How can we serve hundreds of LoRAs and hundreds of models on a single GPU in a single server? Things like caching and Kubernetes and so, so much more.
Meryem Arik [00:08:22]: This is a really, really deep serving toolkit, far more than I could talk about in ten minutes. And on the right, you can see a very high-level schematic of what we do at Titan with our inference server. So I'm going to end there. What I hope that you guys have learned today is that firstly, self-hosting language models is a really great thing to do. However, if you do decide to self-host, you need to think really carefully about the infrastructure that you're going to use to make sure that you can build great scalable applications. If you want to learn more, you can email me, on the left, or on the right I've also put my LinkedIn. And yeah, thanks for listening, guys.
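One way to see where a memory reduction of this size can come from is to look at weight storage alone. The sketch below is back-of-the-envelope arithmetic under stated assumptions: it uses a hypothetical 7-billion-parameter model, counts only the weights, and ignores activations, the KV cache, and the per-group scale values that real quantization schemes store alongside the weights. It is not a description of how any particular inference server reaches its numbers, and the function names are made up for this example.

```python
def weight_memory_gb(num_params: int, bits_per_weight: int) -> float:
    """Approximate memory needed to hold the weights, in gigabytes (10**9 bytes)."""
    bytes_total = num_params * bits_per_weight / 8
    return bytes_total / 1e9


def reduction_factor(bits_before: int, bits_after: int) -> float:
    """How many times smaller the weights become when the bit width changes."""
    return bits_before / bits_after


if __name__ == "__main__":
    params = 7_000_000_000  # hypothetical model size, chosen for illustration
    for bits in (32, 16, 8, 4):
        print(f"{bits:>2}-bit weights: {weight_memory_gb(params, bits):.1f} GB")
    print(f"32-bit to 4-bit: {reduction_factor(32, 4):.0f}x smaller")
```

Going from 32-bit to 4-bit weights shrinks the weight storage by a factor of eight in this idealized accounting. Whether accuracy survives that compression is a separate question, which is why the talk stresses quantizing in a way that preserves it.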
Adam Becker [00:09:00]: Meryem, if the perfect timing that you've demonstrated in this presentation is at all indicative of what you guys do at Titan, I recommend everybody check it out. Thank you very much. This was fascinating. It feels like we could keep you here and just grill you for many, many hours. I think everybody wants to know how to self-host and how to deploy these things on your own and all of the challenges that are involved. So I hope that you're going to stick around in the chat. I think there are a couple of questions there for you. I'm going to put the link for you in our private chat so that you know how to get there, and then I hope to see you on Slack as well.
Adam Becker [00:09:36]: Meryem, thank you very much for coming.
Meryem Arik [00:09:38]: Thank you so much, guys.