Serving LLMs in Production: Performance, Cost & Scale // CAST AI Roundtable
Speakers

Ioana is a Senior Product Manager at CAST AI leading the AI Enabler product, an AI Gateway platform for cost-effective LLM infrastructure deployment. She brings 12 years of experience building B2C and B2B products reaching over 10 million users. Outside of work, she enjoys assembling puzzles and LEGOs and watching motorsports.

Igor is a founding Machine Learning Engineer at CAST AI’s AI Enabler, where he focuses on optimizing inference and training at scale. With a strong background in Natural Language Processing (NLP) and Recommender Systems, Igor has been tackling the challenges of large-scale model optimization long before transformers became mainstream. Prior to CAST AI, he worked at industry leaders like Bloomreach and Infobip, where he contributed to the development and deployment of large-scale AI and personalization systems from the early days of the field.

I'm a tech entrepreneur and I spent the last decade founding companies that drive societal change.
I am now building Deep Matter, a startup still in stealth mode...
I was most recently building Telepath, the world's most developer-friendly machine learning platform. Throughout my previous projects, I had learned that building machine learning powered applications is hard - especially hard when you don't have a background in data science. I believe that this is choking innovation, especially in industries that can't support large data teams.
For example, I previously co-founded Call Time AI, where we used Artificial Intelligence to assemble and study the largest database of political contributions. The company powered progressive campaigns from school board to the Presidency. As of October 2020, we helped Democrats raise tens of millions of dollars. In April of 2021, we sold Call Time to Political Data Inc. Our success, in large part, is due to our ability to productionize machine learning.
I believe that knowledge is unbounded, and that everything that is not forbidden by laws of nature is achievable, given the right knowledge. This holds immense promise for the future of intelligence and therefore for the future of well-being. I believe that the process of mining knowledge should be done honestly and responsibly, and that wielding it should be done with care. I co-founded Telepath to give more tools to more people to access more knowledge.
I'm fascinated by the relationship between technology, science and history. I graduated from UC Berkeley with degrees in Astrophysics and Classics and have published several papers on those topics. I was previously a researcher at the Getty Villa where I wrote about Ancient Greek math and at the Weizmann Institute, where I researched supernovae.
I currently live in New York City. I enjoy advising startups, thinking about how they can make for an excellent vehicle for addressing the Israeli-Palestinian conflict, and hearing from random folks who stumble on my LinkedIn profile. Reach out, friend!
SUMMARY
Experimenting with LLMs is easy. Running them reliably and cost-effectively in production is where things break.
Most AI teams never make it past demos and proofs of concept. A smaller group is pushing real workloads to production—and running into very real challenges around infrastructure efficiency, runaway cloud costs, and reliability at scale.
This session is for engineers and platform teams moving beyond experimentation and building AI systems that actually hold up in production.
TRANSCRIPT
Adam Becker [00:00:00]: is, I think, good for both of our projects. Because as you get more and more models, and as people are building more and more AI tools and plugging AI into their applications, you're just going to see— I mean, I'd be very curious to get your thoughts on this, because right now I suspect that when people start out, they're not really thinking about serving. You have to make it to some level of maturity: you made it past your initial prototype or proof of concept, and people need to use the thing before you really care about it. Is that right? How do you think about— by the way, we're just diving right in. If you guys are joining us, we had a fascinating conversation even in the green room, and we're just opening the door to let everybody in. If you haven't yet joined us, this is the MLOps community.
Adam Becker [00:01:04]: This is going to be our first panel for the year 2026. Find us at mlops.community. If you're not yet a member of the Slack workspace, we have probably one of the best Slack workspaces in all of tech. Honestly, just join; everybody that's building anything meaningful in machine learning, ops, and AI is in that Slack workspace. So join that, subscribe to the newsletter, the podcast, all of that stuff. Wherever you find your podcasts, you'll find us there. Okay, I was just starting to ask you guys the question then. Today we're going to be focusing on serving, and it feels like it's just a Grand Canyon of considerations there.
Adam Becker [00:01:42]: There's so much going on there. And for most people, and perhaps maybe for a lot of people that are tuning in, they might never have even conceptualized the difficulty and the complexity of this terrain of serving because they might still be in that pre-serving camp. They're just in a— they're not yet wearing those glasses to show them, oh my God, actually, once I actually make it to production, when people— once people actually start to use this thing, it's a whole can of worms for how to actually serve these things. So what have you seen Well, first of all, is that true based on your experience? Is that 100% definitely?
Igor Šušić [00:02:19]: Yeah, that's totally true. Like, first you need to learn to crawl, then you can actually walk, right? So I think Ioana can actually share some great stories that we've seen with our customers, uh, exactly struggling with this.
Adam Becker [00:02:33]: So if that's the case, then maybe the two of you can introduce yourselves real quick. Who are you and what do you do? You're both from Cast AI. What is the lens through which you guys are looking at this increasingly complex landscape? Just so that we know how to best direct questions when people ask them in the chat. Folks, write down questions in the chat if you have any. Nice, ceremonious. It does all this funky stuff. We will know how to orient and direct the right question to the right person. Ioana, Igor, first of all, thank you guys for joining today.
Adam Becker [00:03:14]: Do you want to start? Give us just a brief introduction to each of you. Ioana, if you want to start.
Ioana Apetrei [00:03:20]: For sure. It's nice to be here. So I'm Ioana. I'm a product manager. I have a background in fintech and entertainment, and today marks the second year to date since I've been with Cast AI.
Adam Becker [00:03:34]: Congratulations. The balloons should have come for you. Maybe that's what it was.
Ioana Apetrei [00:03:37]: That's, you know, exactly, celebrations.
Igor Šušić [00:03:42]: Yeah, on my side, I'm just a little bit over a year with Cast AI currently. Staff machine learning engineer there, working on our AI platform product. Before that, I worked with AI when it was not that popular, so before GPT: recommendation systems and your usual text classification stuff, NLP, before it was super huge.
Adam Becker [00:04:08]: Nice. Before it was cool, before it was a household name. At some point it got crazy. I mean, everybody just started saying things like DeepSeek. I'm like, how do you even know? All of a sudden, just the most technical terms. Okay. Ioana, do you want to share with us? Maybe you have a presentation to share, and afterwards Igor does too. Maybe just as a way to bridge us over to that presentation.
Adam Becker [00:04:29]: You can tell us a little bit about it. Are you seeing these challenges that customers are having? Is that the primary thing that you're engaging with, figuring out how to go from the crawling to the running?
Ioana Apetrei [00:04:44]: 100%. So last year, what we noticed is that the majority of the customers we talked to were just experimenting, right? You have an idea, you think AI is going to help you solve it, right? Or you identify a problem and then you try to solve it with AI. But you don't start by thinking about the infrastructure, and you don't start by thinking about scaling, right? When you're building a car, you're not thinking first about how to build the factory that builds the car, right? You're thinking about the dynamics of that car and the mechanics and everything. It's the same here. So that's definitely something that we noticed, and I think in 2026 we'll see more projects moving from experimentation and POC to finding product-market fit and scaling, and then starting to hit a lot of challenges and blockers in infrastructure.
Adam Becker [00:05:40]: Awesome. Okay, so I already have a lot of questions about exactly what that pathway looks like. I might hold off until you do your presentation. We'll get off the stage and give you the space. I can see your screen right now, and we'll be back in 15 minutes. Does that work?
Ioana Apetrei [00:06:02]: Awesome.
Adam Becker [00:06:03]: See you soon.
Ioana Apetrei [00:06:04]: See you soon. Okay, so I'm going to talk a bit about the lessons that we learned building our product and talking to our customers over the years. And then Igor is going to take you through the nitty-gritty technical details of everything, which I know he is excited to get started with. So, some time ago, we read the papers showing that you don't need the biggest, baddest model of the bunch for every task, right? We saw that you can save 80-90% in cost by picking the right model for the right job. Small models you use for low-complexity tasks, while larger models you use for planning and for reasoning and so on. So we did the normal thing. We set out to build an intelligent LLM router, something that identifies the prompt complexity and routes it to the best available model.
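A minimal sketch of that routing idea, with hypothetical model names and a crude heuristic standing in for the learned complexity classifier a real router would use:

```python
# Hypothetical sketch of complexity-based LLM routing (illustrative only).
# Model names and thresholds are placeholders, not CAST AI's actual router.

def estimate_complexity(prompt: str) -> float:
    """Crude stand-in for a trained complexity classifier."""
    signals = ["analyze", "plan", "multi-step", "prove", "reason"]
    score = min(len(prompt) / 4000, 1.0)          # longer prompts tend to be harder
    score += 0.2 * sum(word in prompt.lower() for word in signals)
    return min(score, 1.0)

def route(prompt: str) -> str:
    """Pick the cheapest model expected to handle the prompt."""
    score = estimate_complexity(prompt)
    if score < 0.3:
        return "small-8b-model"       # cheap, possibly fine-tuned, for simple tasks
    if score < 0.7:
        return "mid-70b-model"        # balanced cost and quality
    return "frontier-model-api"       # planning and heavy reasoning

print(route("Classify this support ticket as billing or technical."))
```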
Ioana Apetrei [00:07:08]: Sounds good, right? So we built it, we tested it with real users, and that helped us identify two problems. The first one is that small models need fine-tuning. More often than not, the off-the-shelf models weren't accurate enough for production use cases. So while we did see cost savings of more than 90%, there was a tax to pay in accuracy. In production, we immediately hit this wall. So if you want to use small models, more often than not, you need to fine-tune them. And secondly, these open-source models need to be hosted somewhere. Like I said in the beginning, the majority of the teams that we were talking to last year were still experimenting, right? They had an idea and they were trying to get to product-market fit.
Ioana Apetrei [00:08:13]: And they weren't thinking about building complex pipelines or infrastructure, which makes sense. So they went for the convenient option, they went for OpenAI APIs, and they started building from there. But getting these customers to move to open-source models from the convenience of a model API was challenging, because all of a sudden we were telling them that now you need a Kubernetes cluster, you need to orchestrate your GPUs and figure out autoscaling and vLLM parameter tuning and all this stuff, right? So it was kind of a steep learning curve. So we stepped back. We understood we had these problems. We took a step back and we asked: we already have a successful product in Cast AI that automates and optimizes Kubernetes clusters. Why not build something that lets customers deploy open-source models directly in their Kubernetes clusters? We can handle the GPU orchestration, the auto-scaling, the vLLM parameter tuning. Our users can focus on their AI applications.
Ioana Apetrei [00:09:33]: So that became our product. That became AI Enabler. But then we learned the next lesson. We hit the next challenge, which is that AI engineers often rely on, or have to rely on, their DevOps teams and colleagues to create the clusters, to connect them, to manage them. And what this means in reality is AI teams get stuck waiting for infrastructure access. No cluster, no model access, no progress. So we decided to put our money where our mouth is, and we started hosting the models ourselves, dogfooding our own platform. So we focus on optimizing the token throughput, the GPU orchestration and utilization, and so on, and our users can just call our endpoints and pay for what they use, right? When they're ready, they can naturally progress towards managed GPUs and self-hosting, with more control over the model and the infrastructure.
Ioana Apetrei [00:10:40]: So that journey taught us something important: there's no single right answer. Different situations need different approaches, which brings me to the first real lesson. The biggest mistake we see teams make is assuming there's one right way to deploy LLMs. We're either an API shop and we only use proprietary model APIs, or we self-host everything. But that kind of binary thinking can cost money and can slow you down. How we see it in reality is that this is a spectrum, and where you land depends on your specific situation and your specific workloads. Let's have a quick look at some of the deployment options that you typically have. First, there are the model APIs.
Ioana Apetrei [00:11:41]: The model APIs make sense when you get started, when you're experimenting. Like I said, you have an idea, you have some hypotheses, and you want to validate them. You don't want to lose time building the underlying infrastructure just to get there and see that your hypothesis was invalidated. So an easy way to get started is model APIs. You pay per token, there's no infrastructure to think about, and you can use the models out of the box. It also works well for small to medium scale, especially if you have workloads that are input-heavy, right? Your next option is to use managed GPUs. So this is kind of the middle tier. Your AI code runs in a container on some managed GPUs.
Ioana Apetrei [00:12:37]: Here you're not paying per token, but per compute usage. You're still not paying for the full GPU, only for the compute usage of your container. This means that you can run your own models, you have control over your inference, over your server configuration, and you control the application layer. We handle the GPU orchestration underneath: all the provisioning, the scaling, the capacity planning, and so on. And this type of option works well for teams that have custom or fine-tuned models, so they cannot use the model APIs, but maybe don't have enough usage to justify a full GPU. Or they have spiky or unpredictable traffic, or they do heavy pre- and post-processing. Okay, so that's the managed GPU. And thirdly, we have self-hosting.
Ioana Apetrei [00:13:52]: You see how we kind of started with the self-hosting, so we started from the back and worked our way to the front. So when does self-hosting make sense? First of all, this is when your model runs completely in your own VPC, right? You start to pay for the actual GPU. You reserve the GPU, and you have to pay for it regardless of how much usage you have. You can use tooling to make your models run more efficiently, of course: orchestration, auto-scaling, optimization, hibernation. But the infrastructure in the end is yours. This works out very well for teams that have strict data privacy rules, that are handling or processing sensitive information. They need to have maximum control over the model, the model weights, the data, and so on.
Ioana Apetrei [00:14:55]: Or for output-heavy use cases, and especially for teams that are already using Kubernetes, because that just makes the transition so much smoother. So what's the right answer here? The answer that we see in reality, after talking to lots of customers, is that the reality is hybrid. You choose your deployment option based on your specific use case needs, right? So I'm not talking about organization needs. It's use-case based. So you will see that most mature teams end up running a mix. They use APIs for some workloads, especially those that can utilize out-of-the-box models with no customization, managed endpoints for others, and self-hosted for the stuff that really demands it. And the key here is to design for optionality and to design for movement.
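One common way to keep that movement cheap is to standardize on an OpenAI-compatible client and treat the endpoint as configuration; vLLM and many managed endpoints expose this API. A minimal sketch, with the URLs and model names as placeholders:

```python
# Sketch: the same client code talks to a hosted API or a self-hosted vLLM
# server; only configuration changes. URLs and model names are placeholders.
import os
from openai import OpenAI

client = OpenAI(
    base_url=os.getenv("LLM_BASE_URL", "https://api.openai.com/v1"),
    # e.g. "http://my-vllm-service:8000/v1" for a self-hosted vLLM deployment
    api_key=os.getenv("LLM_API_KEY", "not-needed-for-self-hosted"),
)

resp = client.chat.completions.create(
    model=os.getenv("LLM_MODEL", "gpt-4o-mini"),  # or a self-hosted model name
    messages=[{"role": "user", "content": "Summarize this ticket in one line."}],
)
print(resp.choices[0].message.content)
```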
Ioana Apetrei [00:16:05]: Your needs will change, so can you move between these tiers without rewriting your application? Because if migration requires a rewrite, you'll stay where you are, either overpaying or underserving. Lesson 2 that we learned through our journey is that self-hosting is harder than it looks. It looks hard, but it's actually even harder than that. So tell me if this sounds familiar to any of you. Let's say that you've decided self-hosting makes sense for your use case. You need to run a 70-billion-parameter model. You need real-time latency, and that means serious hardware.
Ioana Apetrei [00:16:51]: So you need at least one node running 8 H100s to serve this in production. So you go to your cloud console to provision, but you can't find it because there's no quota for it, or at least there's no quota for your region or at all. Okay, you request a quota increase, you fill out the form, you fill out the justification, it gets rejected. You explain why you need it, you provide an adequate explanation. A week later, it finally gets approved. Yay, we got the, we got the GPU quota. Now you go back to your, to your cloud console and you try to provision your node. But surprise, there's insufficient capacity in that region.
Ioana Apetrei [00:17:37]: So now you have permission to use the GPU, but it doesn't exist. And this is the GPU market reality right now. Great, but let's say you do get the capacity. Let's skip past that. You got the capacity. Maybe you have a tool in place like Cast AI that can expand your clusters across regions and clouds to provision the GPU. Cool. Now what? Now you have an expensive node that costs around $20K a month.
Ioana Apetrei [00:18:14]: But here's the question that nobody asked so far, or maybe it wasn't really considered: do you have enough usage to justify this? Because if you're paying $20,000 a month for a node that runs at 50% utilization, you're actually paying $10K for the inference and $10K for the idle GPU. There are ways to tackle this, of course, like time sharing, which can solve the utilization problem depending on your usage and your memory needs. So instead of dedicating a GPU node to one workload, you can share the capacity across multiple workloads, maybe teams or use cases and so on. So when one workload is idle, another one can use the capacity. The GPU stays busy and the effective cost per token goes down. And that's not theoretical. One of our education tech customers actually migrated from AWS SageMaker to a time-shared setup in Cast AI, and the result was 40% cheaper with the same or better latency. This allowed them to expand faster.
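The arithmetic behind that point is worth making explicit. A back-of-the-envelope sketch, with every number purely illustrative:

```python
# Illustrative cost math: effective cost per million tokens as a function of
# node price and utilization. All numbers are made up for the example.

node_cost_per_month = 20_000          # e.g. one 8xH100 node, USD
utilization = 0.50                    # fraction of time serving real traffic
tokens_per_second_at_full_load = 2_500
seconds_per_month = 30 * 24 * 3600

tokens_served = tokens_per_second_at_full_load * seconds_per_month * utilization
cost_per_million_tokens = node_cost_per_month / (tokens_served / 1e6)

print(f"Idle spend: ${node_cost_per_month * (1 - utilization):,.0f}/month")
print(f"Effective cost: ${cost_per_million_tokens:.2f} per 1M tokens")
# Doubling utilization (e.g. by time-sharing across workloads) halves the
# effective cost per token without touching the hardware bill.
```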
Ioana Apetrei [00:19:27]: So it's not just about the cost savings, but what those cost savings unlock you to do. But the important thing here is that there's no one-size-fits-all solution. Different workloads have different trade-offs and you always have to balance these trade-offs. You have a latency-sensitive customer-facing app? Well, maybe you want dedicated capacity to guarantee response times. So you might be running at least one on-demand node, and maybe you can burst into spot capacity to handle the traffic spikes. Or maybe you have an internal batch processing pipeline, in which case time sharing makes perfect sense for you. Or maybe you just have spiky, unpredictable traffic, where managed endpoints with instant scale-up, where you don't have to wait for a GPU node to come online, are a better fit. But the mistake here is assuming that you can have one architecture that works for everything, 'cause it doesn't.
Ioana Apetrei [00:20:39]: What we learned is that you need to evaluate each workload individually. What's the latency requirement? That's very important and it's going to drive a lot of important decisions. What's the traffic pattern? What's the cost sensitivity? And what are the data restrictions? And my colleague Igor is going to explain in a bit why this matters when it comes to infrastructure decisions. But underneath all of this, once you've increased your quota and you have the GPU capacity and everything, you're not done, right? There's still real tech complexity to be tackled here. If you're hosting LLMs, vLLM is powerful, but it's not quite plug and play. You still need to figure out your batch sizes, your KV cache configuration, your tensor parallelism settings. The defaults work for experimentation, but for production settings you need iteration, and you need expertise.
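For context, these are the kinds of knobs being referred to. A minimal vLLM sketch; the parameter names follow recent vLLM releases (verify against the version you run), and the values are illustrative starting points rather than recommendations:

```python
# Sketch of the vLLM knobs that typically need tuning for production.
# Values are illustrative, not recommendations; check the parameter names
# against the vLLM version you actually deploy.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",  # example model
    tensor_parallel_size=8,         # split weights across the node's 8 GPUs
    gpu_memory_utilization=0.90,    # fraction of VRAM vLLM may claim (KV cache included)
    max_model_len=8192,             # max context length; directly sizes the KV cache
    max_num_seqs=128,               # upper bound on concurrently batched requests
)

outputs = llm.generate(
    ["Summarize: the customer reports intermittent 502 errors..."],
    SamplingParams(max_tokens=128, temperature=0.2),
)
print(outputs[0].outputs[0].text)
```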
Ioana Apetrei [00:21:51]: We do believe that Kubernetes is the right operating system for AI, but auto-scaling GPU workloads isn't like auto-scaling web servers. Here you have long cold start times, you have expensive idle resources, and getting the scaling right requires deep domain knowledge. The difference between a default deployment and a well-tuned setup can be 2-3x in effective capacity.
Igor Šušić [00:22:25]: Cool.
Ioana Apetrei [00:22:26]: Here are my key takeaways. First, every path leads to infrastructure. It's not a matter of if I will have to handle the infrastructure layer, but when I will have to make these decisions. Secondly, you need to match your deployment options to your workload needs. Like I said, there's no single right answer. It really depends on your latency, your SLOs, your performance, your usage, your cost sensitivity, and so on. And third, when you need to be agile, solve your core problem first.
Ioana Apetrei [00:23:08]: Buy or outsource the rest. Because if GPU orchestration isn't your competitive advantage, then it becomes a distraction. Your job is to build great AI applications, not to become infrastructure experts, right? And just one more thing before I wrap this up and give the mic to my colleague Igor. We talked about the friction of getting started: DevOps bottlenecks, GPU quota hell, weeks of waiting before you can experiment with open-source models. So if that resonates with you, we're opening up our inference endpoints. There's a free tier, no credit card, no commitment. Test the open-source models, see what performance looks like, decide if it fits for you. We're going to start with a coding use case, allowing you to connect Cast AI in OpenCode, for example, or any other tool that you're using, and execute coding tasks for free.
Ioana Apetrei [00:24:11]: We're going to roll out gradually, so there's a waitlist. Sign up if you want early access. I'm going to share the link in a bit in our chat, and I'm happy to bump anyone here to the front.
Adam Becker [00:24:27]: All right, Ioana, thank you very much. Brilliant. I'm gonna take the screen share off, and thank you very much for that. That was excellent. I think a key takeaway in my mind is the spectrum component. That is, you shouldn't think about it as "is my company going to be self-hosting?" It's about a specific use case. What exactly are the needs of this specific use case, and how do we build a flexible and resilient and expansive enough system that can incorporate all of the different needs? Does that takeaway resonate?
Igor Šušić [00:25:05]: Right.
Adam Becker [00:25:06]: Okay.
Ioana Apetrei [00:25:06]: Yes. And making sure that you can be flexible enough so that you can move between them and you don't lock yourself into an expensive option, for example.
Adam Becker [00:25:16]: Wonderful. Okay, Ioana, thank you very much. Igor, are you ready to jam?
Igor Šušić [00:25:20]: Always, man.
Adam Becker [00:25:22]: Okay, so your screen is up. I'll be back in 15 minutes too, and we'll take questions, uh, both for you and for Ioana that come from the audience. So folks, put them in the chat, and I'll see you soon.
Igor Šušić [00:25:36]: See you shortly, Adam and Ioana. Hi everybody. So, okay, the title of the whole talk is a bit weird. We have 15 minutes. For somebody it will maybe be boring, for somebody exciting. The goal of these 15 minutes is to introduce you to how you should think about these workloads, how you should think about approaching them and talking with your stakeholders, or maybe you are the stakeholder, and how you should actually be making decisions. After this, I hope you will have questions that pop into your mind and you will be thinking: I never actually thought about that. Now I need to think about that.
Igor Šušić [00:26:20]: And that unlocks something for me. So I would say that most production wins will actually come from the profiling, the measuring, the benchmarking that you actually do, and they won't really come from these exotic techniques. We'll mention the exotic techniques, and they are actually needed for some things, but not for everything. Your regular continuous batching, your quantization, and just understanding your workload and knowing your bottleneck will probably get you 80% of the way, right? And then it will depend on your use case whether you actually need that additional 20%, which is exponentially harder. So the first thing that you need to think about is which type of workload you actually have, right? Ask yourself that question. So here on the y-axis, you can see the output tokens. On the x-axis, you can see the input tokens. Notice that I don't have the actual numbers written anywhere.
Igor Šušić [00:27:18]: That's because the numbers, or even the order of magnitude, really depend on the company, on the specific client, on the specific use case. But in general, you need to think in the following way. If you have a generative workload, what that usually means is you have a bunch of output tokens that you need to generate. That's very expensive. You know that simply by looking at all these APIs that are provided to you, because they are usually charging more for the output token. And we will talk in a bit about why that is, right? You have your summarization workload, which is like: I give you a bunch of stuff to process and you generate me only a few tokens. This is super cheap.
Igor Šušić [00:28:07]: This is like heaven for the GPUs, right? For example, your coding use case, when you put a huge context into the LLM, that's great for GPU utilization. The rest of it will depend on how many tokens you want to generate. Do you want to generate the whole file or something a bit smaller? Then you have the chat-like workload. That's how everything actually started back in 2022, 2023. You would have your 100 or 200 tokens on the input, and you would be fine with an answer that was 50 to 100 tokens. Today, with the technology, the algorithm improvements, and the kernels that we have, if you have this use case, I'm pretty sure you will be able to handle it, or we are able to currently handle it really, really seamlessly. And then you have the average use case. That's how things are looking right now: you are not too heavy on the generative side, you are not too heavy on the summarization workload, but it balances out in between.
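A tiny sketch of how you might bucket logged traffic along that input/output-token axis; the thresholds here are arbitrary and would be workload-specific in practice:

```python
# Bucket logged requests by input/output token counts to see which quadrant
# of the workload chart you actually live in. Thresholds are arbitrary.
from collections import Counter

def classify(input_tokens: int, output_tokens: int) -> str:
    heavy_in, heavy_out = input_tokens > 2_000, output_tokens > 1_000
    if heavy_in and not heavy_out:
        return "summarization-like (prefill-heavy)"
    if heavy_out and not heavy_in:
        return "generative (decode-heavy)"
    if not heavy_in and not heavy_out:
        return "chat-like"
    return "mixed / average"

requests = [(4500, 120), (300, 250), (3500, 2200), (180, 90)]
print(Counter(classify(i, o) for i, o in requests))
```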
Igor Šušić [00:29:19]: This graph will actually influence all the decisions later. This is probably the single most important thing to understand about your workload. One more thing I would like to mention here is that one specific deployment, usually does not fit all of the use cases. If you think about deploying one specific model to cover all of these, depending on your workload type, if you mix everything together, it usually doesn't work out really well. After you understand your workload, you need to think about your metrics. I split them in three tiers. This is not any specific or regular scientific tiering methodology. It's just what I think, how I think about the metrics itself.
Igor Šušić [00:30:13]: So usually, you know, time to first token, inter-token latency, or maybe end-to-end request latency, depending on your workload type. Those are the ones that your customers or the users of your system actually feel. So if you want to summarize a 10-page-long Google Doc, right, you actually care about time to first token. You don't really want to wait 15, 20 seconds for the first token. And then once it starts streaming, you will be fine. You can improve on inter-token latency, but that time to first token is what actually keeps your customer. Or maybe you have some use case where agents are talking to one another, and you don't really care which phase is taking how long, but you care in general about the end-to-end request latency. You just want your agents to be as fast as they can possibly be.
Igor Šušić [00:31:11]: Then after you've figured out what's okay, what you expect, you will use that in tier 2. You will balance that with the throughput. You will actually check your GPU utilization to see what's happening there, because you want that GPU utilization high, because you will connect that with tier 3. From tier 1, you will decide on goodput. So goodput is basically just kind of an umbrella term for multiple metrics from tier 1. It says: okay, now I have my time to first token, I have inter-token latency, and I need to define what is acceptable for my customers. So, you know, for each request, let's say time to first token needs to be under 500 milliseconds and inter-token latency needs to be under 50 milliseconds. You define your goodput like that and you track all the requests that are actually under that limit.
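A minimal sketch of that goodput definition, reusing the example thresholds of 500 ms time to first token and 50 ms inter-token latency:

```python
# Goodput as described: the share of requests that meet *all* latency SLOs,
# using the example thresholds (500 ms TTFT, 50 ms inter-token latency).
from dataclasses import dataclass

@dataclass
class RequestMetrics:
    ttft_ms: float        # time to first token
    itl_ms: float         # mean inter-token latency

def goodput(requests: list[RequestMetrics],
            ttft_slo_ms: float = 500.0,
            itl_slo_ms: float = 50.0) -> float:
    """Fraction of requests satisfying every latency SLO."""
    ok = sum(r.ttft_ms <= ttft_slo_ms and r.itl_ms <= itl_slo_ms for r in requests)
    return ok / len(requests) if requests else 0.0

sample = [RequestMetrics(320, 38), RequestMetrics(810, 35), RequestMetrics(450, 62)]
print(f"goodput = {goodput(sample):.0%}")   # only the first request passes both SLOs
```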
Igor Šušić [00:32:10]: And you also need to understand your concurrency limit. Concurrency is very important. Why? Because with these generative workloads, once somebody submits a request, it's usually not fire and forget. You actually receive a request, then the GPU is doing some crunching, depending on how much work it needs to do. You need to understand how many of these requests you can actually handle in parallel at any point in time, because that will be your limit. But all these metrics are still also defined, or confined, by the throughput. You don't want to chase 1,000 requests per second of throughput if your goodput is very bad. And then in the end, of course, you need to look at the cost and the utilization.
Igor Šušić [00:32:59]: You want to bring your utilization of the GPU to the maximum, and you want that cost per token to be as low as possible. And we will talk about how to reach that. So when you are talking with your stakeholders about this, you need to tell them that, you know, they can't have it all. You can't have low cost, high throughput, and low latency all at the same time. You can imagine it like— here in this graph I'm kind of missing, let's say, the accuracy standpoint. It's not that big of a deal, but when you include methods such as quantization, accuracy is kind of important. But imagine it like a ball that you are stretching between these three corners, right? So here I have just simple examples. Please ask questions so we can answer on some specifics later.
Igor Šušić [00:33:53]: But for example, if you want bigger batches, in that case your latency will become worse, but your throughput will become better, because now you batch more requests together, you have higher GPU utilization, and your cost because of that goes down. All right, if you just add more GPUs, obviously you have more hardware, so your throughput will be better even if you do nothing, probably. Your latency will also become better just for the sake of it, because you have a bigger system, but now you are paying for more GPUs, right? Quantization is a tricky one. Here I have everything marked as better. It's not really true. It really depends on the quantization type, on the hardware that you use, and on the kernels that you run, but we can talk about it later. So I'll give you one example. Usually when you do quantization now, for example, you can do bitsandbytes quantization.
Igor Šušić [00:34:50]: You will have, let's say, 4-bit weights and 16-bit activations or something like that. Then what happens is your model is actually smaller on the memory side. It means that you need to move less memory on your GPU when you are doing inference, which is good. And now you can fit it in a smaller GPU. But when you are actually doing computation, you have this process of dequantization. So from the 4 bits, you need to go back to the 16 bits, you do your computation in 16 bits, and then you go back to the 4 bits again. So that part is still equally compute-heavy. There is no compute optimization there, right? So it really depends on the combination of the hardware and the technique.
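The memory side of that trade-off is easy to estimate. A rough sketch for weight memory only (KV cache, activations, and runtime overhead come on top), with approximate numbers:

```python
# Rough weight-memory estimate at different precisions. Ignores KV cache,
# activations, and runtime overhead, so treat the results as lower bounds.

def weight_memory_gb(n_params_billion: float, bits_per_weight: float) -> float:
    return n_params_billion * 1e9 * bits_per_weight / 8 / 1e9

for bits, label in [(16, "FP16/BF16"), (8, "INT8/FP8"), (4, "W4A16 (e.g. GPTQ/AWQ)")]:
    print(f"70B model @ {label:<22}: ~{weight_memory_gb(70, bits):.0f} GB of weights")
# ~140 GB at 16-bit vs ~35 GB at 4-bit: the 4-bit model fits in far less HBM
# and moves far fewer bytes per decode step, even though the matmuls still run
# at 16-bit after dequantization.
```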
Igor Šušić [00:35:45]: And of course, you can just go, for example, to a cheaper GPU, from an A100 to an L4; you will have less high-bandwidth memory. So obviously, because the GPU itself is cheaper, the cost is going down and the throughput is worse. And now I notice that my latency is actually in green, but it says worse, so it should be worse, right? So here you have two reds and one green, let's say. So you need to think about the trade-offs. You need to connect them with the metrics and the type of workload that you have. This is what you need to discuss with the stakeholders, and then after that you can actually move forward. Right. So when you are thinking about starting to serve an LLM, some options will come naturally to you when you are trying to research this stuff, either using Perplexity or OpenAI or ChatGPT, right? Whatever you're actually using these days to research. Claude, maybe.
Igor Šušić [00:36:51]: vLLM— like, all three of these are inference engines. Inference engines are just a specific piece of software that's written to efficiently execute the forward pass in the neural network, right? In this case, transformers specifically. So vLLM is the most popular out of them all. It probably has the biggest community around it. It's probably the easiest to start with in terms of documentation, but it has its nuances, right? And we will see those later. TensorRT-LLM is a bit harder. That's from NVIDIA. It has a specific format that you need to convert your model into, and then you use that specific format to actually do the inference.
Igor Šušić [00:37:41]: It's super optimized for NVIDIA GPUs, so you can sometimes get better performance than using, for example, vLLM. But there is also an option where you can kind of plug into vLLM as a backend. So one more option there. SGLang was neck and neck at the beginning with vLLM. It still has its own advantages, let's say. What I would say is it depends on the use case. SGLang, for example, excels in the use case where you have a long session, where you actually have a context that you are tracking, something like a long chat session; that's favorable for the edge AI. But these are not enough.
Igor Šušić [00:38:36]: And today you have multiple projects which are actually doing some high-level orchestration on top of the inference. So for example, AIBrix; I think it's a ByteDance project actually, and now it's under the vLLM umbrella. It does the GenAI orchestration. Dynamo from NVIDIA. llm-d is again tied to the vLLM community, but I think it's a Red Hat project; don't take my word for it. KServe, that one existed even before the LLMs.
Igor Šušić [00:39:14]: And then LMCache, which is a bit different from all of the above. All of the above are doing some kind of orchestration for the inference engines, while LMCache is actually extending your cache to cheaper storage, right? But that is also a crucial part. So let's talk about why these projects actually exist and why we need them. There is a concept in inference which is called prefill and decode. This is how we actually split one request. We split it into two phases: one phase is the prefill phase, the other one is the decode phase.
Igor Šušić [00:39:58]: So the prefill phase is compute-heavy. That means that once you actually send your data to do a forward pass, so all the stuff that you consider your context length before you generate the first token, that's the text that fills in the prefill phase, right? And for the GPU to work on that, it needs to do quite a lot of number crunching. But once you generate the first token, you need to generate every other token after that, right, because it's autoregressive; you need to have the full context before that, and that's the decode phase, which is super slow. The decode phase is super slow and doesn't fully utilize your GPU in terms of compute, because what is usually happening is that it's memory-bound, right? So these graphs here are from the paper DistServe, which explained disaggregated serving. And disaggregated serving, I would say, falls in that additional 20%, right? So when I mentioned at the beginning that your regular stuff will take you 80% of the way, this would be that 20%, depending on the size of your workloads and also, of course, the hardware that you have available. So on the left you can see the graph. On this graph, we have an input length of 128 tokens, and you have a batch size.
Igor Šušić [00:41:32]: What is compared here is multiple requests batched together. The blue line represents requests that are batched together but have one prefill request: you have all the decoding ones and then one prefill in there. The other line has only decode. You can see what's happening on the graph, you can see how it slows down. And the difference gets even bigger when you increase the input length. So what is the solution for that? Basically, the solution is disaggregated serving.
Igor Šušić [00:42:14]: Disaggregated serving basically means: okay, we know we have prefill, we know we have decode. What we need to do is split them apart. So what we will do is, let's say, if we take vLLM, we'll deploy one instance of vLLM on a specific node. For that node, we will select the hardware that fits the use case perfectly. We will benchmark that part and we will just send the prefill requests there. Then once the prefill is done on that node, we will basically copy the KV cache and move it to the decode node. The decode node can again have a different type of GPU that's specialized, for example, exactly for decoding.
Igor Šušić [00:42:55]: And there you specialize just to decrease inter-token latency, right? So your prefill instance is specializing in, or optimizing for, the time to first token. Your decoding instance is optimizing for the inter-token latency. And the beautiful thing is that you can actually optimize these two things in isolation, right? So you can choose different GPUs, different nodes, etc. Be careful with this, because if you just try to deploy something like that and you don't really have enough traffic for it, or you don't really have the hardware for it, you can shoot yourself in the foot, basically. You can see performance degradation of even 30%, depending on what your current use case is. So don't forget about the hardware. We are talking about different inference engines; they are running on something, right? And that something is hardware.
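To see why decode is memory-bound, it helps to do the bandwidth arithmetic. A rough sketch that ignores KV-cache traffic and assumes the whole weight set is streamed from HBM once per generated token at batch size 1; the bandwidth figure is approximate and real numbers depend heavily on hardware and batching:

```python
# Rough upper bound on single-stream decode speed from memory bandwidth alone:
# at batch size 1, every generated token streams the full weight set from HBM.
# Ignores KV-cache traffic and kernel overheads; illustrative only.

hbm_bandwidth_gb_s = 3_350        # e.g. H100 SXM, approximate
weight_bytes_gb = 35              # 70B model quantized to ~4-bit weights

tokens_per_second_bound = hbm_bandwidth_gb_s / weight_bytes_gb
inter_token_latency_ms = 1000 / tokens_per_second_bound

print(f"~{tokens_per_second_bound:.0f} tok/s upper bound per request")
print(f"~{inter_token_latency_ms:.1f} ms inter-token latency floor")
# Batching more decode requests together reuses each weight load across
# requests, which is why throughput-oriented serving leans on big batches.
```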
Igor Šušić [00:43:54]: Today we all know NVIDIA currently has supremacy. So look at the specs of your hardware. Understand what it means when you have integer 8 tensor cores or integer 4 tensor cores, and what your model actually requires, right? So open the config file on Hugging Face and look at the model. Look at what quantization is used on the model. What cores do you actually need? What's the size of the model? Is it a mixture of experts? All of that counts, actually. Look at that stuff, understand your GPU, understand how fast it can move memory back and forth. What's your bandwidth? How many FLOPs can you actually do? And then, you know, that's just the pre-selection, so that you know you chose the right GPU for your workload; you still need to benchmark it after, because usually no amount of simulation or theoretical framework will get you exactly to the numbers that you will have in production. Because we cannot account for, I don't know, networking latency in some cases, or how the physical boxes are actually set up, etc. Attention: understand the attention of your model, understand what's happening there.
Igor Šušić [00:45:16]: Here I have a few attention types: multi-head, grouped-query, and multi-query attention. I won't go too deeply. The point here is: the queries, what is incoming to your model, are the blue boxes. The keys and values, these yellow and red boxes, are what you are saving in your key-value cache. So we can see that with the different attention types, you are saving different amounts of these blocks, which means that depending on whatever attention is in your model, it will use a specific amount of KV cache for, let's say, one full-context-length request. You need to understand that. You need to calculate that number.
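A sketch of that KV-cache arithmetic using the standard formula (a factor of 2 for keys and values, times layers, KV heads, head dimension, and bytes per element); the example config is roughly Llama-2-7B-like and the numbers are approximate:

```python
# KV-cache size per token: 2 (K and V) x layers x KV heads x head_dim x bytes.
# Example config is roughly Llama-2-7B (multi-head attention, FP16 cache).

def kv_cache_bytes_per_token(layers: int, kv_heads: int,
                             head_dim: int, dtype_bytes: int = 2) -> int:
    return 2 * layers * kv_heads * head_dim * dtype_bytes

per_token = kv_cache_bytes_per_token(layers=32, kv_heads=32, head_dim=128)
context_len = 4096
print(f"{per_token / 1024:.0f} KiB per token, "
      f"{per_token * context_len / 1e9:.2f} GB per full-context request")
# Grouped-query attention (fewer KV heads) shrinks this directly: with 8 KV
# heads instead of 32, the same context needs a quarter of the cache.
```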
Igor Šušić [00:46:00]: You need to see the trade-off, like how much KV cache you actually need to have successful inference. And we can talk later about what that number is and how you can decide on it. What we've also learned so far from serving these models, and I want to share this with you, is that quantization really does work. So here, this has nothing to do with quantization per se; this is just different GPU cards from NVIDIA. But you can see that throughout the years, NVIDIA is shipping more and more specialized hardware for specific operations. For example, the A100 has integer 8 tensor cores, right? Or FP16. But if you go to the newest one, it will have floating point 4.
Igor Šušić [00:46:52]: So they optimize for the quantization. They kind of enable you to run 4-bit quantization even faster, right? The Hopper architecture, the H100, only had FP8. And at the end, I would also want to mention, I think I'm a bit over my time, but this is the last slide. At the end, I would also want to mention that kernels are probably the most important piece. So when you understand all of the pieces above, let's say you take your quantization technique and you quantize the model. Now you run it and there's not much improvement. Why? Well, simply because it's not executed efficiently on the hardware. And in some inference engines, these things happen automatically.
Igor Šušić [00:47:46]: They have a way to figure it out through the config of the model. They understand, for the specific model, how to run it. But in most cases you can actually select by hand, manually, what you want to run. So, for example, I know Marlin is a really optimized kernel that fuses some of the operations together for 4-bit weights and 16-bit activations. It gives you a huge increase in throughput when you compare it with the baseline. Don't just ignore these things. You kind of want to go deep on it to figure out your deployment. And thanks, that's everything from my side for now.
Igor Šušić [00:48:32]: I hope I unlocked those questions for you. Uh, Adam, I'm, I'm ready for the chat.
Adam Becker [00:48:39]: Nice, Igor, thank you very much. That was excellent. And as I suspected, I feel like I could keep you here for many, many hours if you'd let me. Well, as I let questions come into the chat, maybe I can take a stab at a couple of them. One takeaway that I had from the last couple of slides that you shared is that people shouldn't be scared to just pop open the model, understand what's happening under the hood, look at how attention is built and architected, look at the sizes, look at the dtype, look at all of that stuff, and then make informed decisions about what infrastructure and hardware to use, what compression techniques, what different things you can do to further optimize serving. First of all, is that true? Is that the right takeaway? Second of all, a lot of people, if they came into this through playing with the model APIs, might not feel the confidence that they could do this properly. Maybe people are application engineers, they play around with OpenAI APIs, but now you want them to understand KV cache. How high of a jump is that? And what does it take for them to overcome what might feel like a lack of expertise?
Igor Šušić [00:50:17]: Well, you know, for that I don't have a super great answer. Time is probably your best friend. If you have time and you can just dig deep, do that, right? But keep in mind, depending on your project, you cannot just go running around trying to learn every new thing and then hope it works, because you will miss something along the way. So do it in stages, right? Even if you decide to go with a company like ours to actually handle some of the serving, in the meantime we are actually talking to you. We want you to understand your workloads. You will learn things. At some point, you will be free to self-host, right? So it's a process, it doesn't happen overnight, right?
Ioana Apetrei [00:51:02]: Yeah, exactly. That's why we started with the self-hosting option. We talked to a lot of users and then we realized, oh, okay, wait, there's a big barrier here that we need to break first and then work our way up to this level of complexity.
Igor Šušić [00:51:21]: Yeah, I'll give— sorry, Adam, I'll give you one more regarding this. I think the problem with companies in this space is that you need to fight not to become a consultancy, because people have so many questions, and it's not a great business model.
Adam Becker [00:51:39]: Well, you could see that I'm trying to drag you into consulting mode right now. When I started, honestly, in data engineering, that was my strategy, because I knew very little when I started, but there were a lot of startups building a lot of very cool tools, and so I would just ask them a million questions, you know, like, how do you think I should order that? Because otherwise there was very little access to learning so much of this stuff. And people that are building tools often have expertise, and they can share that expertise because they want to get their tool across. And so it was like a win-win, although at some point you end up getting a lot of books and stuff, and that's fine. Do you feel like what you guys are building at Cast AI is almost going to be a substitute for that type of expertise? Or will it simply enable people that already have that expertise to do more and get more bang for their buck?
Igor Šušić [00:52:34]: You know, I think it's both, right?
Ioana Apetrei [00:52:38]: Same here.
Igor Šušić [00:52:39]: Usually you can't sit on two stools, but in this case it seems like it's both, because, you know, for power users you just help them get rid of their problems: I don't want to manage these things, I know how they work, but I want to spend my time actually designing my use case, designing my application, making the agentic flow actually work and deliver results. And then at some point I can think of moving away freely, right? In that whole process, as I said, you are usually not alone, especially if you go with the route where we serve it, for example, in your VPC. It's almost like you have a team next to you in your company that actually does that for you, and you can always go there and ask, right? So it really depends on you and how much you actually want to learn.
Ioana Apetrei [00:53:39]: Yeah, honestly, between me and Igor, I think both of us have been involved in almost every customer conversation so far. So if you reach out to us, these are the first faces that you're going to see and you're going to talk to, because I think it's important to understand where our customers are coming from, what's their level of expertise, so that ultimately we can build a product for them to self-serve. We're not there yet in totality because we still see different use cases. Like I said, it's a spectrum. We're trying to identify the patterns and then, you know, our job as a product team is to build a product to solve that problem, to productionize this.
Adam Becker [00:54:19]: Nice. Okay, we have another couple of questions, maybe 2 or 3 questions in the chat, and I will attend to those real quick. If one of you has the gradual tab open, make sure that it's on mute, because I do hear some echo. So it might be coming from that. Okay, so first, Samid Ahmed Khan is asking: I understand that there are different quantization techniques that work differently, but if we talk about a W4A16 model, that model has to go back to 16-bit weights. Won't the computation take similar compute power as W16A16? What benefit would it give except storage?
Igor Šušić [00:55:02]: I mean, storage was the first reason why you would actually do quantization like that, right? It's easy to think now, like the Rubin, I think, is coming out or it actually came out, and it has like a terabyte of high-bandwidth memory. But 2 or 3 years ago, when these methods were actually created, your A100 was the best choice, right? And for these bigger models at the time, 400 billion parameters or more, the memory is a big problem. So you want to reduce the memory, because there is one more thing. Remember the decode phase: in the decode phase, what you are doing is moving that model from the high-bandwidth memory to, let's say, the much speedier L1 cache, or the memory that's really close to your computation, and depending on the NVIDIA hardware it's called differently. But you are moving those bits every time in the decode phase. And for every new token, you do it all over again, right? So that's the crux, that's the goal.
Igor Šušić [00:56:15]: But the computation, the computation itself, if you have activations in the 16 bits, you cannot multiply 16 bits by 4 bits, right? You— the precision and the output that you would get after that, it wouldn't be stable. You would get gibberish. So, uh, it destroys your— like, yeah, I, I hope I explained it well.
Adam Becker [00:56:39]: Samir, if you have a follow-up, put that in the chat. Okay, so Siddharth Tapar: Igor shared some inference optimization ideas. How do those apply to specialized LMs, like large embedding or ranking models? Any additional ideas that apply to such models? Which of these techniques are useful? Also, which ones apply to training optimization as well?
Igor Šušić [00:57:03]: Yeah, so let's take an embedding model or a re-ranker, right? It all depends on the architecture of that model. If you have some encoder in there, if you have an attention mechanism in there, these ideas still apply to it. The model is called differently because usually the last few layers are different, or bits of the architecture serve a different purpose, but the crux of it, the transformer, is always the same. So you can reuse these concepts across different models.
Adam Becker [00:57:44]: We've got 2 more minutes, unless folks in the audience have another question. I want to ask two more, one for them and one for me. Okay, so I'll start with the question for them. Have you seen some blind spots that people have when they're still using model APIs, but they're doing something about their system, their architecture, or the way they're envisioning interacting with these APIs that ends up making it even more difficult for them when they make that transition to self-hosting? Anything they could do today? Insofar as they're not yet self-hosting, what can they do today so that that transition can be made even easier?
Igor Šušić [00:58:34]: Is that for me or for the audience?
Adam Becker [00:58:37]: Oh, though I'm, I'm asking on behalf of them.
Igor Šušić [00:58:41]: Listen, in that case, I would say pay more attention to your usage, right? How many tokens you're actually doing on the input, how many on the output, what's your usual, let's say, session size or context length. All these things are important. Keep track of that, because you are taking for granted what some platform is actually doing for you. And when you try to go self-hosting, you will see it's not that easy, right? Server chat use case, 200 tokens in, 200 tokens out, great, superb, works on one request. Try to scale it out.
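Concretely, the model APIs already report this per request, so logging it is cheap. A minimal sketch using the OpenAI-compatible `usage` fields; the model name and log destination are placeholders, and the same code works against vLLM's OpenAI-compatible server:

```python
# Sketch: log per-request token usage from an OpenAI-compatible API so you
# know your input/output profile before ever self-hosting. Model name and
# log file are placeholders.
import json, time
from openai import OpenAI

client = OpenAI()  # or OpenAI(base_url="http://my-vllm:8000/v1", api_key="...")

def chat_and_log(messages, model="gpt-4o-mini", logfile="token_usage.jsonl"):
    start = time.time()
    resp = client.chat.completions.create(model=model, messages=messages)
    record = {
        "model": model,
        "prompt_tokens": resp.usage.prompt_tokens,
        "completion_tokens": resp.usage.completion_tokens,
        "latency_s": round(time.time() - start, 3),
    }
    with open(logfile, "a") as f:
        f.write(json.dumps(record) + "\n")
    return resp.choices[0].message.content

print(chat_and_log([{"role": "user", "content": "Two-line summary of HTTP/3?"}]))
```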
Ioana Apetrei [00:59:18]: Wonderful. From my side, I would say set your expectations right. When you're transitioning from a large 400+ billion parameter model to a 20 billion parameter model, set your expectations right. Don't expect the same level of accuracy right out of the box. You're going to have to put in a bit of work to, you know, compare the results.
Adam Becker [00:59:42]: Nice. Thank you. Okay, can I steal a couple more minutes just for a final— sure. Okay, sure. So one from Muhammad Abdallah: how do you know, when serving many requests using my app, that I'm moving from tier 1 to tier 2 and so on, considering the latency of responses? So I wonder if the tier 1, tier 2 is referring to the different metric categories you came up with, right? Mohammed, please clarify. And you said, almost, that you could start by paying attention to the first one, then the second one, because the first one was closer to the user. Is there some phase transition by which you should now start to pay attention to the next one, or should it always be a continuous process?
Igor Šušić [01:00:32]: So ideally a continuous process, always, right? The tiers are there to tell you what's the first thing you look at, right? So the first thing you look at is: this is my workload, I want to improve my time to first token, right? But at some point, because you will be constrained by your hardware or what you have, you need to see— I'll give you an example, right? You take an A100. On the A100, you put some smaller model, let's say 7 billion parameters, right? And you can get 1,000 requests per second. Okay, 1,000 requests per second is a silly number, but let's say 50 requests per second, and you can maintain that, great, right? But maybe with the 50 requests per second your time to first token is like 600 milliseconds and you want to go down a bit. You really cannot serve 50 requests per second at that point. You need to find the lower number that saturates your GPU, to make both your hardware and your customer happy, right? This is the trade-off that you need to look at. So I would say, when starting, simply try to use the product yourself, see what's usable for you, and based on that, or based on the feedback of customers, define the numbers that you need to see on the output in terms of latency, right? And then for that latency, try to find the other numbers: the concurrency, the throughput that you can sustain on some hardware or budget that you have.
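A rough sketch of that kind of sweep: drive an OpenAI-compatible endpoint (vLLM exposes one) at increasing concurrency and watch where time to first token blows past your SLO. The endpoint, model name, and prompt are placeholders, and a real benchmark would use many more requests and report percentiles:

```python
# Sketch: measure time-to-first-token at different concurrency levels against
# an OpenAI-compatible streaming endpoint. Endpoint and model are placeholders.
import asyncio, time
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://my-vllm:8000/v1", api_key="dummy")

async def ttft_seconds(prompt: str, model: str = "my-model") -> float:
    start = time.perf_counter()
    stream = await client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=128,
        stream=True,
    )
    async for _ in stream:            # first streamed chunk ~= first token
        return time.perf_counter() - start

async def sweep():
    for concurrency in (1, 8, 16, 32, 64):
        results = await asyncio.gather(
            *[ttft_seconds("Explain KV caching in two sentences.")
              for _ in range(concurrency)])
        print(f"concurrency={concurrency:>3}  worst TTFT={max(results)*1000:.0f} ms")

asyncio.run(sweep())
```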
Adam Becker [01:02:11]: Yeah, you wanna— did you want to add to that? Is that how you've seen people kind of like interface with the problem?
Ioana Apetrei [01:02:18]: Yeah, I would say the first thing that we ask and try to define whenever we have a customer discovery session, talking about their customer problems, what they want to do, and setting up their models, is: define the SLOs. What do you actually need, right? And then all the decisions stem from that. So make sure that you understand your SLOs, you connect them with the metrics like Igor said, and then you adjust the infra and kind of automate based on that. But everything starts with the SLOs. And if you can come to us with a goodput definition, like what does goodput mean for you, we're going to become best friends.
Adam Becker [01:03:01]: Yeah, I suspect that this is, again, an invitation to understand what our users and customers need. There's really no amount of AI that can do some hocus-pocus unless you understand what your users and customers need. You'll just continue to be led astray by fancy technology and science and engineering unless you understand that. I think that's a good takeaway, and maybe last one here. Nitin is asking, insofar as anybody is interested in going deeper into serving and inference, is there some— you know what, let's make it more interesting. What do you imagine the future of serving and inference looking like? Are there specific areas that you think are likely to grow and become more mature, more interesting? Like, if anyone were to want to do research into any one of these domains in serving, where would you orient them?
Igor Šušić [01:04:05]: Well, for now I think one of the most exciting parts is mixture of experts. The models are getting sparser and sparser, so you have fewer and fewer experts activating on each request, right, to improve your inference. That's one part. And the second part is you have so many companies trying to tackle kernel optimizations at this point in time. So you have all these different projects that are kind of: hey, we will compile, or we will actually generate, the best code for you to run on this specific high-performance chip. That's the open field. So many competitors there. And I think that field is maturing.
Igor Šušić [01:04:49]: I hope it will mature soon because it will make our lives as machine learning engineers much easier.
Adam Becker [01:04:57]: Awesome. Nice. If anybody wants to follow up with you guys and get to know you more and connect, are you on LinkedIn, Twitter? Where are you? Where can people find you?
Igor Šušić [01:05:08]: LinkedIn. I'm hiding.
Adam Becker [01:05:09]: I'm only on LinkedIn. Nice. If you want to drop your LinkedIn below, that would be cool too. And you could put it in the chat. Ioana and Igor, thank you very much for coming. This was a true education.
Ioana Apetrei [01:05:22]: Thank you very much.
Adam Becker [01:05:27]: Thank you. And everybody else, thank you for, for tuning in. Uh, we may— people asked whether the slides will be available offline after the fact. If you're saying yes, then, then that's the answer. Yeah, why not? Okay, so that's the answer for you. Uh, Ioana and Igor, thank you very much and best of luck with all of your work. And everybody else, thank you for tuning in, and we'll see you next time.
Ioana Apetrei [01:05:50]: Thank you.
Igor Šušić [01:05:51]: Thank you.
Ioana Apetrei [01:05:52]: Bye-bye.
