Challenges in Deployment Automation for AI Inference
Aarash Heydari, based in New York, is one of the engineers responsible for the deployment and reliability of Perplexity's first-of-a-kind Online LLM API.
At the moment Demetrios is immersing himself in Machine Learning by interviewing experts from around the world in the weekly MLOps.community meetups. Demetrios is constantly learning and engaging in new activities to get uncomfortable and learn from his mistakes. He tries to bring creativity into every aspect of his life, whether that be analyzing the best paths forward, overcoming obstacles, or building lego houses with his daughter.
Perplexity uses Kubernetes to serve AI models. In this talk, I will describe some challenges we have faced. This includes loading weights, templatization of raw Kubernetes manifests, and more.
Challenges in Deployment Automation for AI Inference
AI in Production
Slides: https://drive.google.com/file/d/14aTkgYBRBe2u_2lss4cD-gwXINPT9any/view?usp=drive_link
Demetrios [00:00:05]: Now we've got Aarash coming up from good old Perplexity AI. Dude, y'all are blowing up. And I feel honored that you are giving us a talk today. I'm so excited.
Aarash Heydari [00:00:17]: Thank you so much for having us, Demetrios. Yeah, doing very well. I'm excited to be here and talk a little bit about AI inference and how we're doing it at Perplexity and what we're up to. I guess I'm already here, so I'll go ahead and get into the presentation. So, yeah, my name is Aarash. I work on infrastructure at Perplexity. And so I'll talk about some of the challenges that we had and some of the things that we've got for you.
Aarash Heydari [00:00:44]: So in case you don't know Perplexity, we're providing an answer engine, a conversational answer engine, where we just answer your question very quickly. And everybody likes trustworthy information; we try to provide that to you by citing sources. So this is what our UI looks like. You can ask it a question that even involves real-time data, like, when are the next MLOps Community conferences? And it knows about us. It knows about us being here right now. And this is arguably my favorite slide: some of the feedback.
Aarash Heydari [00:01:17]: So in the top left here, this is a text I got from my friend recently, who's like a PhD in sociology. And it's just awesome to see a friend of mine actually getting value out of the thing that I built. There's just so many instances where people reach out to me or our teammates and talk about how much they like it. For example, bottom left here, Mr. Emmett Shear. That's Mr. 24-hours OpenAI CEO, who talked nicely about us on Twitter and shared a tweet that was pretty cool. And then here on the right, I don't actually know where this video came from, but some people were comparing the token throughput of different third-party open source LLM providers.
Aarash Heydari [00:01:58]: And that's us at the top, almost a 2x factor above the others here. And so I was pretty stoked to see that. It means we must be doing something right, I suppose. Yeah. So we serve a lot of AI models at Perplexity. We have sort of our own in-house capacity that we're running these models on. And so there's this link, labs.perplexity.ai, which is sort of our playground for trying different models. And actually, you can see at the bottom there, Gemma, which just dropped yesterday.
Aarash Heydari [00:02:29]: We were quick to add it. It's already there. You can play with it there. And, yeah, we have a public API that you can use in your application to use this as a platform. And we have these first of a kind online LLMs that are grounded by search on the Internet to sort of answer your questions and have access to real time data. So go ahead and try that out. As for the talk, I guess I just wanted to talk about why we have AI inference and why we use Kubernetes to basically get the job done and what sort of arises from doing that. So, no surprise to anybody, you can make awesome generative AI apps without ever hosting your own model.
Aarash Heydari [00:03:17]: In fact, I think that's what people are normally doing, and the important thing is to have a good product, and that's sort of what will work for you. The reasons you might start hosting your own models are if these third-party APIs are expensive for you and you think you can save money by doing it correctly yourself. If for some reason you need to avoid sending your data to these third parties, then that would not be an option. And then finally, the most obvious thing is if you're relying on a third-party API and it's not giving you the answer that you want, there's not much you can do. And so if you have a specific data distribution that you're trying to run whatever generative AI application on, eventually you'll probably do well to fine-tune it to learn your preferences. And that's sort of what spurred us initially to start serving our own models. And so yeah, the next question is, why use Kubernetes? And really, you can't beat the primitives that Kubernetes has. It's just so optimized for running reliable infrastructure at scale.
Aarash Heydari [00:04:24]: And it's so deeply customizable; even the most complex systems, like Datadog themselves, run on it. You know, there's actually opinionated toolkits out there specifically for AI inference. I'm aware of KServe, and I'm sure there's lots of other ones, but actually we just did vanilla Kubernetes natively, sort of the old-fashioned way, and we enjoy being directly on top of the Kubernetes primitives. As the cloud infrastructure engineer, you want your teammates to feel like the people on the right of this bus and not the people on the left. So there's a lot of things that can be confusing or feel annoying, especially if you're introducing Kubernetes to an organization that's not using it yet. There's just a lot of conceptual stuff, and the way that networking and containerization work, and the fact that file systems are usually ephemeral, and all these things might be unfamiliar to people. And if this complexity is all exposed directly to your AI team, they're going to be like, hey, I have other things I need to be working on. But if you're trying to run a reliable service that has high availability, you sort of need it, and it gives you this repeatable, environment-agnostic way to deploy. Obviously we're a fan of infrastructure as code and GitOps to make sure that things are repeatable and we understand what the current configuration is.
Aarash Heydari [00:06:04]: When I'm talking about Kubernetes to my AI team and trying to create their experience for them, it's all about hiding the complexity; they have enough to worry about. If they're talking about how they need to deploy things, it should be simplified as much as possible. And what that looks like is you don't give them raw manifests to edit. And that's sort of how we started out naturally when we were just a small number of models or doing sort of the first cut. But eventually we're at such a scale, you saw how many models we're serving, that there would just be an incredible amount of duplication and other things. So obviously every Kubernetes engineer eventually starts using some kind of template engine. And I think every organization does this a little bit their own way. Like Helm is sort of the obvious way to do it, but there's other options too.
Aarash Heydari [00:06:57]: But it's all about making sort of the language that my AI team is speaking is like I want to deploy four of these llamas and here are sort of some configuration stats for them. And they don't need to worry about what is a deployment, what is the service, all these other things. As for AI, from Kubernetes perspective, what are the unique challenges of, okay, maybe you know, how to run Kubernetes applications in general. What's sort of special about hosting in house like LLMs and stuff like that on it? The sort of first and most obvious thing is you're probably going to be wanting to use GPUs. It's sort of also possible to not use GPUs, but I think for sufficient. If you're looking for a cutting edge level of performance, you sort of need it and it's going to be a scheduling bottleneck for you. Because GPs are expensive and they're limited, not sort of as easily elastic and auto scale capability as your average CPU or memory resources. So this will be something that has an upper bound on sort of what your capacity is to deploy.
Aarash Heydari [00:08:08]: And you got to want to make sure that your GPUs are busy, so you need to be watching their utilization. And if they're not being utilized, well, then you're not getting your money's worth. Another sort of challenge that arises is cold start. So what is actually an LLM inference server doing? It's got some weights that it needs to serve in an API and those weights are actually huge, like they're up to 200GB. And the first thing that you need to do is load all that data into your GPU memory. And that's why inference is fast, is because it all sort of happens in these huge matrix multiplications on the GPU. That's sort of your startup time is like how quickly can you load the weights into the GPU? And that sort of depends on a lot of things. Are you running sort of in a situation where the first thing you do is download those weights? Well, downloading things takes time, especially on depending on your network setup and the disks that you're using for your machine suddenly matters.
Aarash Heydari [00:09:22]: For your startup time, you have to read all of it from disk. If you're downloading it, or even if you've already downloaded it, there's multiple ways you could go about doing this. You could mount s three directly. You could have a persistent volume that's a read write many mount. That is sort of a local cache of it. And that actually happens to be what we do. We choose sort of a high disk I ops machine, a high disk I ops file system, and sort of cache the download there. And so it only needs to happen once sort of per model, you could say.
Aarash Heydari [00:10:08]: But yeah, this sort of is a natural start time problem and one that we had to make progress on. And that's actually all I had. Happy to take some questions or talk about perplexity more or I know we're sort of low on time, so happy to also keep the show moving.
Demetrios [00:10:28]: No, we're definitely asking questions. You're not getting away that easy, man. So we've already got some coming through in the chat, and I can only imagine that there's going to be a lot more coming now that we've opened the floodgates. How does this config translate to SBOMs? Or, there's probably... ML BOMs? Is there a way to pronounce this that I am not aware of? The SBOMs?
Aarash Heydari [00:10:57]: S b o M. I actually don't know those acronyms.
Demetrios [00:11:02]: I don't know those acronyms either. Whoever. Yeah, it might be Gonzalez.
Aarash Heydari [00:11:06]: Like, open model configuration. There's like Onnx and all these other sort of. Sort of neutral agnostic model formats for storing weights. I wonder if that's what he's referring.
Demetrios [00:11:23]: Follow. There's more acronyms in here also. Do you follow SPDX for ML Gen AI also?
Aarash Heydari [00:11:29]: Don't know what that is, actually.
Demetrios [00:11:32]: Gonzalo, where are you coming up with these acronyms, man? Neither of us have any idea what they are. I really want to know now, though. Or are you just making it up? Are you just trolling us just to have so we don't know? Gonzalo, if you want to enlighten us on those feel free software bill of.
Aarash Heydari [00:12:04]: I mean, like, the idea is just that there's a simple configuration file where the AI team just specifies their minimal intent, which is things like, okay, how many replicas and what are these super obvious ML specific parameters, like the context window of the model and the batch size that you should use or something like that. And so we just give them that. And then there's this template engine which spits out actual kubernetes, which is deployed to our Kubernetes cluster. It's sort of the high level.
Demetrios [00:12:34]: Okay, so Gonzalo is saying it definitely is not making up SPD.
Aarash Heydari [00:12:41]: Yeah, no, I definitely believe that acronym. I just wasn't aware.
Demetrios [00:12:46]: I wasn't sure. I wouldn't put it past them trying to fool me with something that sounds like something that I would do. Okay, there is another question coming here through. Actually, this one kind of falls in line with you talked about Kserve. This one's asking about seldon core, which is. I haven't stayed super recent with the project, but the last I knew about it, Seldon core and Kserve had a lot of overlap and similarities.
Aarash Heydari [00:13:23]: I'm actually familiar with Selden core either. Yeah, I do enjoy just using Kubernetes directly. I just sort of understand the basic primitives that it has and I just want to operate at that level. And it's possible that having frameworks on top of it could make your life easier. But the problem, I find is that they might be opinionated, sort of in ways that you may not have wanted or that may lock you up in some way is sort of my only.
Demetrios [00:13:59]: Down the line for that. So basically you're saying that by doing it completely vanilla, you don't have to worry about any decisions that are being made for you, and you'll take the brunt of the lift. You are okay. Doing the heavy lifting because you would rather have the control.
Aarash Heydari [00:14:24]: Yeah. I think if you're comfortable on Kubernetes, you just want to talk directly. The Kubernetes language.
Demetrios [00:14:32]: Yeah. Okay. That makes complete sense. Yeah.
Aarash Heydari [00:14:41]: Time the first token. And actually you can measure that if you go to labs. Perplexity AI and you pick a model and ask a question, maybe I can change the sharing. I wonder if that's possible. Show us. So here is our latest in house model. Let's see.
Demetrios [00:15:09]: And this is a fine tuned model. What do you mean by in house model?
Aarash Heydari [00:15:12]: Yeah, so it's a fine tuned model and the eight x seven B would tell you that it's fine tuned from mixed roll. Eight x seven B. We have an announcement coming that is going to announce it in more detail and give some document evalves and stuff like that. So you're getting a little bit of a sneak peek. I'm just trying to think like when. Well, maybe interesting to use the online model. Like when was Steve Jobs born? Interesting question. And so here the time to first token was a bit slower because there's actually like 1 second that goes into doing a web search to make sure we find the correct answer.
Aarash Heydari [00:15:58]: Out of curiosity, if I make the same question to the offline version of the same model, it seems to just know the answer because the model just knows. Maybe. But yeah, so this first token, pretty freaking fast. 0.2 seconds. Wow. So you can play with this. And of course smaller model will be faster. I don't think we can get that much faster than that.
Aarash Heydari [00:16:24]: That was a bit slower. We'll show you that all up here. And like I showed you on that slide earlier, I think there's a pretty big consensus that our API is one of the fastest of the third party APIs providing open source models.
Demetrios [00:16:42]: Yeah, that's for sure. I think that anyone who has played around with it knows. It's like not only in the way that you feel it, but also in the actual graphs where you see. Oh yeah, that's what I've been feeling. So there was one question that came through here about how do you make sure to cite things or how do you make sure things get cited? I can't find it now.
Aarash Heydari [00:17:09]: Right. It's sort of more obvious with our web app, which maybe I can do one more short demo.
Demetrios [00:17:18]: As many as you want, man, you came in way under.
Aarash Heydari [00:17:22]: Um, let's see. You're all good. So like, you know, when is the next Mlops community conference? Yes, we're doing that web search. I'll just skip. Based on what I found on the Internet, it sort of asked me a follow up, but I just wanted to give me an answer. Okay, interesting. When I asked this question yesterday, it told me about today, but now today has already gone about this one. And that of course is linked here.
Demetrios [00:17:58]: Oh, that's interesting because it's the mlops world. It's not the mlops community, which are friends of ours, but it's not exactly us. But that's.
Aarash Heydari [00:18:12]: Cool. Well, anyway, yeah, so we sort of show the sources right up front and center here in the API. We're actually about to release public availability of getting citations in the API response. So if you're an API user, look at your email inbox or look at our announcements like next week.
Demetrios [00:18:34]: So then that totally changes things on the way that you build rags.
Aarash Heydari [00:18:42]: Yeah, I think the hope is that people that use our API as a platform just want to have that confidence of the same confidence that they have when they use our product web app of like, okay, but why are you saying this and giving them that? Citations will be really nice for that.
Demetrios [00:19:05]: Wow, dude. I think the last thing that I have to ask about going back to just Kubernetes and how there's so many places that I want to go with this, but one thing that I know people often ask is, yeah, I know Kubernetes, but what do I need to know to add ML to Kubernetes? Like how is serving or dealing with or deploying models, especially now in this day and age, like large language models, different than just like deploying normal software on Kubernetes?
Aarash Heydari [00:19:48]: Yeah, and I was trying to get at that with my last two slides, but maybe I rushed through it, but it's really just like, there's GPUs now and so, okay, how do you make GPUs behave in Kubernetes? You have to use a special thing called the Nvidia device plugin and then the scheduler knows about how many GPUs are on which nodes and pods have to claim the GPUs. It's like, I want two GPUs, I want four. Mean, that's like one thing, not too mean at the level of what you're deploying. It's actually quite simple. And then it's just a matter of having your inference server. And for that, the obvious choice is to use open source. There's plenty of open source here. So we're close friends with Nvidia.
Aarash Heydari [00:20:34]: They have Nvidia TRT, LLM, like there's also vllm and there's many other ones that are sort of up and coming or already here. So you just sort of put those pieces together and it's not as different as you think. It's just a web server. It just happens to load the model weights into GPU memory first. Interesting. Yeah.
Demetrios [00:21:02]: And do you feel like a lot of the speed that you're seeing is because there's no custom hardware that you're using in the background or any of that? It's all on the software layers. Right.
Aarash Heydari [00:21:16]: We are on those Nvidia H 100 GPUs and that is basically the cutting edge. And that's maybe one of the reasons why our API is so fast.
Demetrios [00:21:29]: Yeah, but I mean, I guess that's kind of table stakes these days for a lot of these API providers, right? Like Mistral, I know, has probably got a ton of those. Maybe they're saturating them all, building new models and so they don't have any availability for their API. But I would imagine there's a lot of tweaks that you're doing on the software.
Aarash Heydari [00:21:55]: Have a, we would say better optimized inference server than the average one you can find. And as for why other providers like Mistral seem to be bit slow, there's some things at the configuration level of like you can sort of trade off between latency and throughput. Basically you can sort of tell the inference server to be more aggressive about just serve requests as fast as possible. It's okay if you can't batch them well enough, but when you do that, latency of an individual request goes down until you have a huge amount of volume and then the throughput ends up being worse. And so it could be that mixed drill is overly optimizing for throughput or something like that. Or there might just be more optimizations they can do in their inference server. Those are sort of my two obvious takes.
Demetrios [00:22:49]: Basically from the get go, it's been almost like in your blood to say we're going to roll our own for everything. And really what is going to put us above the rest is that we can do these things. I imagine you're not using any Vllm in the background. You're just saying I'm going to go straight into the veins with kubernetes and try and go as fast as I can. And everything that I can do, I'm going to be doing myself.
Aarash Heydari [00:23:27]: Basically. I think we have a high amount of self reliance. I think there is a bit of sitting on the shoulders of open source, obviously, like models like mixed roll were just sort of made available to everybody and Vidi has an awesome open source community. But I think the attitude for us is like if you need something, you should probably build the thing you need and build it with an 80 20 to like okay, what do you actually need? And what you end up with is something that might be more optimized than the thing you pull off the shelf. This is a bit of a balance, but I would say overall we believe in do it yourself correctly and you will be rewarded with having the fastest API.
Demetrios [00:24:13]: It's been working out for you. So I like it. There is one last question and then we're going to roll to the next talk. It was on Ishwara's question, how you do inline citations. How do you attribute individual sentences to the source?
Aarash Heydari [00:24:33]: Not sure the level of detail I can give, but basically when we ask the AI model the question within the context window that we feed to the AI, we have sort of the sort of information that we pulled from the Internet, basically the rag system, the web search in the context window, and just sort of ask the AI model. Hey, make sure you cite things.
Demetrios [00:24:58]: That's the very simplified version I can imagine. Yeah, it's just as easy as that, folks. Just tell the model. Make sure you sign and tell us.
Aarash Heydari [00:25:10]: To not lie and not hallucinate.
Demetrios [00:25:12]: Yeah, don't lie and cite things. And before you sign off, can you tell us how many requests the inference server handles in real time?
Aarash Heydari [00:25:24]: Well, we're in the sort of two to three digit, like you could say 100,000 per day sort of range of requests coming through at this point. It's sort of been changing rapidly, but that's like ballpark of where we are. One thing I forgot to mention is that we're definitely hiring. If you go to blog perplexity AI careers, let me blog.
Demetrios [00:25:54]: You're in the chat, right? Yeah, just drop it in the chat.
Aarash Heydari [00:25:57]: I think that'll work. But yeah, we got a lot of job postings there, so if you're on the market, check that out. If you have an awesome background, we'd love to talk to you.
Demetrios [00:26:08]: Excellent, dude. Well, thanks for doing this. We're going to keep it moving. You were very kind with your time. I appreciate you coming on here and teaching us all about what y'all are doing with Kubernetes.
Aarash Heydari [00:26:20]: Been an honor. Thank you, Demetrius, talk to you later.