$360k Question - Understanding LLM Economics
Nikunj is the co-founder and CEO of TrueFoundry, a platform empowering ML developers to deploy and optimize Language Models. Prior to this role, he served as a Tech Lead for Conversational AI at Meta, where he spearheaded the development of proactive virtual assistants. His team also put Meta's first deep learning model on-device. Nikunj also led the Machine Learning team at Reflektion, where he built an AI platform to enhance search and recommendations for over 600 million users across numerous eCommerce websites.
During his time at UC Berkeley, Nikunj pursued a Master's degree and published papers in Graph & Optimization Theory. Additionally, he developed ArchEx, a software solution utilized by United Technologies Corporation for synthesizing Aircraft Electric Power Systems. Nikunj completed his undergraduate studies at IIT Kharagpur. Furthermore, he successfully exited his first startup, EntHire, through its acquisition by Info Edge, one of the leading HRTech companies globally.
At the moment Demetrios is immersing himself in Machine Learning by interviewing experts from around the world in the weekly MLOps.community meetups. Demetrios is constantly learning and engaging in new activities to get uncomfortable and learn from his mistakes. He tries to bring creativity into every aspect of his life, whether that be analyzing the best paths forward, overcoming obstacles, or building lego houses with his daughter.
Most of us are using LLMs, and some of us are getting to the point where LLMs are going to production. The honeymoon phase is going to be over soon, and practical realities like cost and maintainability are going to become mainstream concerns.
However, the cost of running LLMs is not well understood and often not put in perspective. In this talk, we will dive deep into what types of costs are involved in building LLM-based apps. How do these costs compare when you run RAG vs. fine-tuning, and what happens when you use open-source vs. commercial LLMs?
Spoiler: if you wanted to summarize the entire Wikipedia to half its size using GPT-4 with the 8k context window, it would cost a whopping $360K! If you used the 32k context window, your cost would be $720K!
Introduction
Where's Nikunj? Yeah, there he is. What's up, dude? There he is. He's looking very focused. I don't know if he can hear us. I think he's trying to share his screen. Can you hear us? He must have muted. Yeah, he can hear us. All right, there we go. Hi! What's up, man? Good. How are you?
Hey, hi. Nice to meet you. How are you? Very good, very good, dude. So I'm guessing you're trying to share your screen right now, because I don't see it yet. Yes, I am trying to share my screen. And there we go, I see it. And I may have to give you the winning talk title; anytime you start to throw around big numbers in a talk title, you've got my attention.
So, the $360,000 question. That is where we're at right now. I love that. Nice, dude. Well, I'm gonna kick it off to you. I'll be back in like 20, 25 minutes when you're done, and then I'll ask you some questions. Does that sound all right? Sure. So, am I... what time do we start presenting?
Start presenting right now, baby. You're on stage. You're up and running. All good? Yes. Perfect. Awesome. All right? Tell me if you need a minute; I could play us a song. No, no, no, I don't think I need a minute. Just let me know when you're putting me live. Okay, you already are live. We've been live. Perfect.
Awesome. Okay, you're fully live. People are loving this. The good thing is that, I don't know if you know anything about me and my technical follies over the past couple of months, but whatever technical difficulties you have, if you have any, I can one-up you in spades. So don't worry about anything.
You're all good right now. I see your screen, I see the slides on the left too, and I'm gonna get off the stage, man. Perfect. Thanks a lot, Demetrios. See ya. All right. Thanks a lot, Demetrios, for getting me in and nicely putting me on the spot right now. Welcome, everyone. I am Nikunj Bajaj, co-founder and CEO at TrueFoundry.
And today, as Demetrios mentioned, we're going to talk about this large number, the $360,000 question. The goal for this talk is basically twofold. First, we are going to talk about what this $360,000 number is: what the hell does it mean, where is it coming from, and what does it mean in the context of LLMs?
And secondly, we are going to put this number in perspective. The overarching goal here is that all of us have started to use LLMs, and there are a lot of interesting types of pricing that we see, in terms of GPU usage, in terms of API calls, in terms of how many tokens you're consuming.
In my conversations with customers, I've realized that this pricing is not all that clear in people's heads. So I decided I'm going to take this time in the talk to demystify and put all the pricing of LLMs in perspective. And I do want to call out at the beginning that this is actually not going to be a presentation on LLMs.
You will notice why I mention that: this is going to be practically a math presentation, and we are going to be dealing with a lot of numbers. We are going to be doing a lot of additions and multiplications, so bear with me on that. Cool. Before we get started, let's define a sample problem statement that we are going to solve throughout this presentation.
And given that we are talking about NLP and LLMs, Wikipedia is our great friend. So the problem statement that we want to take up is: we want to summarize the entire Wikipedia to half its size. It's a very dummy problem statement, please don't try this at home, but it'll hopefully help put things in perspective.
So we are looking at about 6 million articles in Wikipedia, where each article has on average 750 words. I have taken some artistic liberties when I came up with these numbers because I have approximated them, so please don't hold me accountable to the last decimal point here. Now, 750 words roughly translates to a thousand tokens, and the token is essentially the currency of pricing when it comes to LLMs, right?
And our goal is to reduce this to 50% fewer words: the same 6 million articles, but now we want only 500 tokens per article. So that's the problem we are trying to solve, and we are going to use our favorite LLMs to solve it. Just a quick disclaimer: while I'm focusing only on the cost aspect of running LLMs in this presentation, which LLM works best for you is completely for you to decide, because there are so many other factors, like the quality of the response, the latency, the privacy angle to it.
So you want to weigh in all those other factors and decide, but in this case we are going to focus just on the economics around it. All right, so what's our favorite model? Our favorite model is GPT-4, GPT-4 with the 8K context length.
Let's talk about how we are going to use GPT-4 to summarize our Wikipedia to half its size. There are two main levers of pricing when we talk about GPT-4. Number one, you pay for the number of tokens that you send in your prompt. Number two, you pay for the number of tokens that it sends you in its response.
As simple as that. Now, they charge about $30 for every million tokens that you send in the prompt, and $60 for every million tokens that it gives you in the response. In our case, we talked about Wikipedia articles at 1,000 tokens per article, so basically 1 million tokens would be 1,000 such articles.
So if you summarize 1,000 articles, you go from 1 million tokens to roughly 500K tokens. And overall we have 6,000 such blocks of articles, right? So effectively, here's the pricing: you have 6,000 blocks, and you're paying $30 per million tokens for the prompt, so that comes out to $180,000. And for your response, we again have 6,000 blocks, multiplied by a factor of one half because we are reducing the size to half, at $60 per million tokens, so that's another $180,000. That's your number: $360,000 to take Wikipedia and reduce it to half its size. That's what I meant by the $360,000 question.
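For reference, here is a minimal back-of-the-envelope sketch in Python of that arithmetic, using the per-million-token prices quoted in the talk (GPT-4 8K list prices at the time, not current pricing) and the 6-million-article, 1,000-tokens-in / 500-tokens-out assumption:

```python
# Summarizing all of Wikipedia to half its size with GPT-4 (8K context).
# Prices and corpus sizes are the ones quoted in the talk, not current list prices.

ARTICLES = 6_000_000            # approximate number of Wikipedia articles
TOKENS_IN_PER_ARTICLE = 1_000   # ~750 words per article is roughly 1,000 tokens
TOKENS_OUT_PER_ARTICLE = 500    # summarize to half the size

PROMPT_PRICE_PER_M = 30.0       # $ per 1M prompt tokens (GPT-4 8K)
RESPONSE_PRICE_PER_M = 60.0     # $ per 1M response tokens (GPT-4 8K)

prompt_tokens_m = ARTICLES * TOKENS_IN_PER_ARTICLE / 1e6     # 6,000 million tokens in
response_tokens_m = ARTICLES * TOKENS_OUT_PER_ARTICLE / 1e6  # 3,000 million tokens out

prompt_cost = prompt_tokens_m * PROMPT_PRICE_PER_M           # $180,000
response_cost = response_tokens_m * RESPONSE_PRICE_PER_M     # $180,000
print(prompt_cost, response_cost, prompt_cost + response_cost)  # 180000.0 180000.0 360000.0
```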
And now let's go ahead and try to put this number in perspective. First off, if we choose the bigger brother of GPT-4, the one with the 32K context length, their prompt pricing and response pricing are just 2x those of the 8K context length, which would effectively bring our cost up to $720,000.
And if you look at the neighbor, Anthropic, and pick one of their best models, we see that there is some disparity in how they price their prompt tokens compared to their response tokens, but roughly the pricing model is similar: $11 per million prompt tokens, $32 per million response tokens, and the cost comes out to be $162,000.
Okay, now let's come back to OpenAI and talk about the instruction-tuned models from the GPT-3 series. They have four models: Ada, Babbage, Curie, and Da Vinci, and we will talk about the two higher-end models, Da Vinci and Curie, in this talk.
Now, Da Vinci is priced very similarly, except they charge the same for both processing your prompts and producing your response. So you're paying about $20 per million tokens, and that would bring our pricing to about $180,000 if you were to use Da Vinci. The cost of running Curie is significantly lower, actually an order of magnitude lower: you're paying just $2 per million tokens for your prompt and response, and your cost becomes $18,000.
So roughly, here's the math. We started with $360K for GPT-4 and $720K for GPT-4 with the 32K context window, and then we talked about Claude, Da Vinci, and Curie. Okay, I think so far this part is clear. Now let's make this entire arithmetic a bit more complex and talk about some of the self-hosted models.
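To keep the API-priced models straight, here is the same formula applied to each model just mentioned. The per-million prices are the ones quoted in the talk (mid-2023 list prices), so treat them as illustrative rather than current:

```python
# The same Wikipedia job, priced across the API models discussed in the talk.
# (prompt $/1M tokens, response $/1M tokens) as quoted on the slides.
PRICES = {
    "GPT-4 8K":  (30, 60),
    "GPT-4 32K": (60, 120),   # 2x the 8K pricing
    "Claude":    (11, 32),    # Anthropic's larger model
    "Davinci":   (20, 20),
    "Curie":     (2, 2),
}

PROMPT_M, RESPONSE_M = 6_000, 3_000  # millions of tokens in / out for the whole corpus

for model, (p_in, p_out) in PRICES.items():
    cost = PROMPT_M * p_in + RESPONSE_M * p_out
    print(f"{model:10s} ${cost:,.0f}")
# GPT-4 8K   $360,000
# GPT-4 32K  $720,000
# Claude     $162,000
# Davinci    $180,000
# Curie      $18,000
```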
Now, the way you price self-hosted models is actually very different from how you would price API calls, because now the unit of your pricing becomes the cost of your GPUs, the shovel of the new gold rush, basically, as they call it. And in this case, I'm taking a node with eight A100 GPUs.
I'm sure most people in this talk are very familiar with this kind of node. I have put together the pricing of the spot instance, but realistically you might be using the on-demand instance, and there is very variable pricing across different GPU vendors. So again, just for ease of calculation, I've taken the pricing of the spot instance here, which is $10 per hour for this eight-GPU machine.
Okay. Now how do we figure out how much it would cost to summarize our Wikipedia to half its size? The way this works is based on how fast the model, which is roughly a 7-billion-parameter model, is able to process its inputs and produce its outputs. The unit is the number of tokens per second, or the number of tokens per hour, that it's able to process, and you basically multiply by the cost of the GPUs for that many hours.
So in this example, when you process the input you're able to send in all the input tokens in parallel; this is essentially what we call a FLOP-bound process. And you're able to process the inputs a lot faster than you're able to produce the outputs, which happen in sequence.
So the cost of processing your prompts is going to be $350; you can basically feed the entire Wikipedia to this model for about $350. And you can get half of Wikipedia out of this model for about $1,750. That brings our cost to about $2,100. Okay, and I'm not happy with just this much math.
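The talk only shows the resulting dollar figures, so here is a hedged sketch of the underlying formula: cost = (tokens / throughput) x node price per hour. The throughput numbers below are assumptions backed out of the $350 / $1,750 figures (roughly 48K prompt tokens per second and 4.8K generated tokens per second on the eight-GPU node); they are not measured benchmarks.

```python
# Self-hosted ~7B model on an 8-GPU node at a $10/hour spot price.
# Throughputs are assumptions chosen to roughly reproduce the talk's numbers.
NODE_PRICE_PER_HOUR = 10.0

PROMPT_TOKENS = 6_000_000 * 1_000    # 6B tokens fed in
OUTPUT_TOKENS = 6_000_000 * 500      # 3B tokens generated

PROMPT_TOK_PER_SEC = 48_000   # parallel, FLOP-bound prefill (assumed)
OUTPUT_TOK_PER_SEC = 4_800    # sequential decoding (assumed)

def gpu_cost(tokens, tok_per_sec):
    hours = tokens / tok_per_sec / 3600
    return hours * NODE_PRICE_PER_HOUR

prompt_cost = gpu_cost(PROMPT_TOKENS, PROMPT_TOK_PER_SEC)   # about $350
output_cost = gpu_cost(OUTPUT_TOKENS, OUTPUT_TOK_PER_SEC)   # about $1,740
print(round(prompt_cost), round(output_cost), round(prompt_cost + output_cost))  # ~$2,100 total
```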
I want to throw some more math into the equation. So what I'm going to do next is bring fine-tuning into the mix of what we have been talking about so far. Let's take a couple of examples of how fine-tuning changes the pricing of summarizing our Wikipedia. Let's start with the highest-end model available that's open to fine-tuning from OpenAI, which is Da Vinci.
Now, you would notice that in this case the pricing changes quite a bit. The cost of processing your prompt tokens is now $120 per million tokens, in contrast to the $20 per million tokens that we saw on the previous slide for the same Da Vinci model, and the same price applies for producing the responses: $120. I have also put together the pricing for the actual fine-tuning using Wikipedia. So if you took a Wikipedia-sized dataset and fine-tuned the model with it, you would roughly be paying this much money. Okay, so now let's get back to some more calculations.
So the cost of processing your input with a fine-tuned Da Vinci model, basically feeding the entire Wikipedia to it, is $720,000, and the cost of producing the output is going to be $360,000. If you add in the cost of fine-tuning, that goes over $1.25 million. So, $1.25 million to fine-tune on Wikipedia and summarize it to half its size. That's the API cost that you would be paying.
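As a check on the fine-tuned Da Vinci figure: the $120-per-million rate applied to the same token counts gives the $720K and $360K above. The talk does not state the training-token rate explicitly; the $30 per million used below is an assumption (OpenAI's then-listed Davinci fine-tuning price) that happens to reproduce the "over $1.25 million" total.

```python
# Fine-tuned Davinci on the Wikipedia summarization job (prices as quoted / assumed).
PROMPT_M, RESPONSE_M = 6_000, 3_000   # millions of tokens in / out
TRAIN_M = 6_000                       # assume one pass over the full corpus for fine-tuning

usage_price = 120   # $ per 1M tokens, prompt and response alike (fine-tuned Davinci)
train_price = 30    # $ per 1M training tokens (assumption, not stated in the talk)

inference_cost = PROMPT_M * usage_price + RESPONSE_M * usage_price   # 720,000 + 360,000
training_cost = TRAIN_M * train_price                                # 180,000
print(inference_cost, training_cost, inference_cost + training_cost)  # 1,080,000  180,000  1,260,000
```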
But Da Vinci is a very high-end model. If you go by the default fine-tuned model with OpenAI, that's Curie. And as I mentioned, Curie is one-tenth the cost of Da Vinci, so on average every single number that you see on the slide is slashed by a factor of 10.
So $12 per million prompt tokens, $12 per million response tokens, $18,000 for the fine-tuning, and this number finally becomes $126,000. All right, and now let's compare this pricing of fine-tuning a Curie model to what we had for our self-hosted model, the 7-billion-parameter model that we were hosting. The good news about this model is that whether you're using a fine-tuned version of it or a pre-trained version of it, the cost of processing the inputs and producing the outputs does not change.
So the input cost remains at $350 and the output cost remains at $1,750, but you will be paying for the cost of fine-tuning, and the speed of training is actually different from the speed of processing the inputs and producing the outputs. Roughly, this comes out to be about $1,400, which brings our net pricing to about $3,500 here.
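Putting the two fine-tuning paths side by side with the figures just quoted (Curie at one-tenth of Davinci, and the self-hosted 7B node costs), a quick sketch:

```python
# Fine-tuning cost comparison, using the figures quoted in the talk.

# Fine-tuned Curie: one-tenth of fine-tuned Davinci across the board.
curie_total = 6_000 * 12 + 3_000 * 12 + 18_000        # prompts + responses + fine-tuning
print(curie_total)                                     # 126,000

# Self-hosted ~7B model: inference cost is unchanged by fine-tuning,
# you only add the training run (~$1,400 of GPU time per the talk).
selfhosted_base = 350 + 1_750                          # prompt + output processing
selfhosted_finetuned = selfhosted_base + 1_400
print(selfhosted_base, selfhosted_finetuned)           # 2,100  3,500
```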
Okay, now let's put all of these numbers we have seen in perspective. If there is one table that you want to capture from this presentation, it's probably this one, which summarizes all the pricing together. There are a few key takeaways that I want us to focus on. When you're invoking the OpenAI models, they're roughly 7x more expensive when you use the fine-tuned versions of them, compared to only 1.66x more expensive when you're fine-tuning an open-source model. So at least one factor that you want to consider in your choice of large language model is: would you need to fine-tune, or are you okay with, let's say, a few-shot or zero-shot learning type of technique?
If you want to fine-tune, maybe it would make sense to consider using some of the open-source models. Again, there are a lot of considerations, but that's at least one to keep in mind. Secondly, the cost increases 2x if you increase the length of your context window: with the 32K context you're paying 2x more compared to the 8K context window.
Lastly, as all of us can imagine, with the much larger models, in terms of number of parameters, you end up paying a lot more compared to when you work with smaller models. So if you have a use case that you can work off a smaller model, that's a lot nicer.
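A quick sanity check on those three takeaways, using the totals already computed above (the ratios are derived from the talk's own figures):

```python
# Ratios behind the three takeaways.
print(126_000 / 18_000)    # fine-tuned Curie vs. base Curie          -> 7.0x
print(3_500 / 2_100)       # fine-tuned self-hosted 7B vs. base 7B    -> ~1.67x
print(720_000 / 360_000)   # GPT-4 32K context vs. 8K context         -> 2.0x
```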
Now, this next slide is the only one in the presentation where I'm going to deviate from my focus on just talking about cost, because I wanted to put some things in perspective. Here I'm not presenting our work; I'm presenting one of the other works from the community, done by a company called Moveworks. And what they did was not only talk about the pricing aspect, they actually talk about the quality aspect as well.
So they trained a 7-billion-parameter model internally, fine-tuned it, and compared the performance of this model across a bunch of different use cases; they had 14 use cases across which they evaluated, and I've only put a sample up here. And they realized that you are able to get comparable, or in some cases even better, performance than GPT-4 few-shot learning. You notice that the performance here is 0.93 compared to 0.95, but in three other tasks they were actually beating GPT-4, and in this case as well the performance is pretty comparable, right?
So the message that they give in this blog, and I encourage everyone to read it, is that with a lower number of parameters, if you're fine-tuning, you could actually get decent performance on quite a few tasks. And that's what we believe at TrueFoundry: eventually, open-source LLMs and commercial LLMs are going to coexist.
We actually don't think that one will completely trump the other; they're going to coexist, probably even within an application. Within an application you will have multiple stages, and some of those problems will make more sense to solve via the open-source LLMs and some others via the commercial LLMs.
And one simple mental model for thinking about this is: the easier or more specific tasks, where you have some more fine-tuning data, the task is simpler, and it doesn't require a lot of complex reasoning, you can offload to your cheaper open-source large language models. And the more ambiguous, abstract-reasoning type of tasks, the ones picking up context all the way from word number one to word number 39,000 of your context, you probably want the mega LLMs to take up, and you pay for that, basically. And this is where TrueFoundry comes in. Either you could be using the OpenAI APIs or other commercial APIs, and we help a little bit when you're doing that part, or the second thing is you might want to focus on open-source LLMs,
and we help a little bit on that side of your journey as well. So let's talk about the very simple thing we do when it comes to the OpenAI APIs: we basically reduce the number of tokens that you are sending to the OpenAI APIs, thereby saving you real dollars. Let's see what the intuition behind this is.
So on one of our previous slides, we noticed that you're essentially paying more than half your cost just for processing your context, or prompt tokens. Let me take you back to that slide. If you look at the total cost, you would notice that in almost every single row, half or even more than half of your cost is coming from your input cost, your prompt-token processing cost.
Now, the key to the kingdom here is that not all the words that you're sending as part of your context are actually necessary for your LLM. LLMs are actually great at working with incomplete sentences: shorten the words, remove words that are not necessary, rephrase the sentences, whatever.
You could actually cut a lot of the context and still get practically the same response, right? Call it a lossless compression. And that's one of the APIs that we are building: you basically invoke our compression API and save roughly 30% of the cost that you're incurring with the OpenAI models.
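To put that 30% in dollar terms on the running Wikipedia example: trimming prompt tokens only touches the input side of the bill, so on the GPT-4 8K numbers it looks roughly like this (a sketch using the talk's figures and an assumed 30% reduction, not a guarantee of any API's behavior):

```python
# Rough savings from sending ~30% fewer prompt tokens to GPT-4 (8K) on the Wikipedia job.
input_cost = 180_000     # prompt-token cost from earlier
output_cost = 180_000    # response-token cost (unchanged by prompt compression)
compression = 0.30       # fraction of prompt tokens removed (assumed)

savings = input_cost * compression
new_total = input_cost * (1 - compression) + output_cost
print(savings, new_total)   # 54,000 saved; total drops from 360,000 to 306,000 (~15% overall)
```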
That's one. Now, let's say you want to work with open-source models. We have a much more elaborate offering there, where we are building a model catalog of a bunch of these open-source models that are highly optimized for inferencing and fine-tuning. And the way we present this is as a drop-in replacement for your Hugging Face or OpenAI APIs.
So if you are already using, let's say, one of the OpenAI models and you wanted to try out how Dolly 7B or MPT would perform for a particular task, it's probably just changing one word in your code base and things will work out of the box. And lastly, the way we are orchestrating this entire setup from an infrastructure standpoint is that we are running it on Kubernetes clusters running on top of spot instances.
This runs within your own infra, so you're able to fully leverage your cloud credits, your cloud budget, anything that comes from your cloud directly, as opposed to trying to figure out a whole new budget for running your LLMs.
Let's dive a little bit deeper into the compression API. I'm giving a very quick example here where we have a context, and this context is about Beyoncé, and I'm asking a question: what was the name of Beyoncé's first solo album? We get an answer, Dangerously in Love. I don't know if this is the correct answer, but this is what we get out of the context.
And the number of tokens consumed for this task is 649, and the humongous cost that you're paying for this operation is $0. However, if you were to use our compression API for the same task, you provide the exact same context and the exact same question, you get the exact same answer, but the number of tokens consumed is 465, which is roughly 30% lower than what you were passing in the previous one.
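For the Beyoncé example, the reduction works out as follows; the per-call dollar figures are only illustrative, assuming GPT-4 8K prompt pricing for those tokens rather than whatever model the demo actually used:

```python
# Token reduction in the Beyoncé example, with an illustrative per-call prompt cost.
original_tokens, compressed_tokens = 649, 465
reduction = 1 - compressed_tokens / original_tokens
print(f"{reduction:.1%}")                            # ~28.4%, i.e. "roughly 30%"

PROMPT_PRICE_PER_M = 30.0                            # assumed GPT-4 8K prompt price, $/1M tokens
print(original_tokens / 1e6 * PROMPT_PRICE_PER_M,    # ~ $0.019 per call
      compressed_tokens / 1e6 * PROMPT_PRICE_PER_M)  # ~ $0.014 per call
```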
And when you are running your open-source models, there are a few things that we are working on, and we would love to build the platform together; if some folks are actually using these open-source LLMs, we'd love to build this thing together with you all. The idea is that we have a model catalog of all these open-source models that are ready to be deployed or ready to be fine-tuned.
You can very quickly build and deploy your no-code apps. And one of the other realizations here is that when you're working with LLMs, it's actually not just the LLMs themselves, but the whole ecosystem around them. Because yes, you want to deploy your models, but you may also want to run some of your data processing jobs.
You may want to have some other Python functions, like your pre-processing functions. You may want to deploy a vector database. You probably want your data to be read really fast to improve performance. So there are a lot of pieces that go into actually building a full-blown LLM app.
And the part that we are trying to solve is building the whole ecosystem around it, so that you as a developer can have a lot of these problems solved in one place. If any of this sounds interesting, I encourage you all to sign up using this link, and I promise to provide a $200 GPU credit on our platform. We'd love to work with you all.
Thank you so much for listening.
$200. Oh, man. All right, that's one way to do it. I like it. I like it a lot. So now I've got to ask you a few questions. I mean, there are some incredible questions coming through in the chat and I love it. I think this is so valuable, man. This is one of those things: if you look at the report that we just put out, and I'll throw that on screen in case there are people that have not downloaded this report yet.
Go check it out. We just did it. Warning: I am not Gartner, and I am not at all to be seen as someone who writes these reports on a regular basis, but I did spend the last three months doing it. And coming back to your talk, the biggest thing that people were mentioning as a challenge, or a reason that they're not using LLMs, was cost.
And how they could not figure out what the ROI was; they could not figure out how much it would actually cost to do this. So I think that is so valuable, man. Thank you for putting this together. Appreciate it. Appreciate it, Demetrios. And I don't know, I might have scared people away from using LLMs by giving this talk, if cost was the biggest concern for them.
Oh man. Yeah. So actually there is something else that I was going to ask about. When it comes to the algorithms you're using, basically, I want to know: have you played around with FrugalGPT? Have you seen that one? I have, I have seen FrugalGPT. I went through some of the optimizations that they have proposed in the paper.
And some of the things that we are incorporating in our compression API actually have a lot of overlap with what they put in there. Yeah, that's cool. I mean, it looks like great minds think alike. So let's get to some of these questions in the chat.
So Gui Bosch is saying that they want you to be their personal finance advisor, and yeah, I'll second that one. That's awesome. If you could do this with my finances, that would be great. Next question: do LLMs work as well on text where you remove stop words and lemmatize the words? Oh man, that's a big word.
You're amazing me right now with your words. So, is the LLM going to perform as well? I'll share what we experimented with. The short answer to this question is no, they would not. If you just directly went ahead and removed stop words and lemmatized the words without any context, the performance actually dropped.
So we actually ran some extensive experiments on Q&A datasets and summarization datasets to figure out how we can ensure the performance does not drop. And when we did this naive approach, the performance did end up dropping. We tried a bunch of other things; for example, we tried some few-shot learning approaches to give human-labeled summaries, which were a lot lower in the number of tokens.
And that performed much better, but still not quite there. So eventually, where we got to much better performance was actually fine-tuning one of the LLMs on this specific task, so that it learns to not lose context but still reduce the number of tokens. That's how we had to get to solving this problem.
And to be completely honest, this solution, even with all the stuff that I'm talking about, is still not generically applicable to all the LLM use cases. For example, if you were passing code as context and you tried to summarize the code, well, it's going to suck at it. So there are some specific use cases where this performs quite well and some other use cases where it is still not performing very well.
Yeah, that makes complete sense. And you mentioned fine-tuning in your talk. Is that fine-tuning with QLoRA? Oh, the fine-tuning that we are talking about here is actually full fine-tuning, to be honest. But if you were to fine-tune with low-rank adaptations and things like that, it would be a lot cheaper and a lot faster than the numbers that I put together on the slide.
Yeah, I feel you. All right. Do you feel like the 7B self-hosted model is capable enough? The answer to that question always depends on the task that we are talking about. I don't think that the 7-billion or the 13-billion models themselves are generically capable; they're nowhere close to being as capable as their larger commercial counterparts.
But what we have seen is that because commercial LLMs are so generic and work well across all the use cases, we actually try to throw these LLMs at every problem that we are trying to solve. And what I've realized is that for simpler problems, it probably makes more sense to have these smaller models fine-tuned to that specific use case.
For example, if you just want to figure out which of four classes something belongs to, you would still probably, in your workflow today, throw a prompt at a big model and say, okay, figure out which class this thing belongs to. But for problems like those, you could actually get practically the same performance using the smaller LLMs.
Dude, awesome. Thank you so much for this, and it is such a big question. I mean, it is something that everyone is thinking about and wondering. And I know that one person in the report mentioned how just because you add AI capabilities to your product doesn't necessarily mean that you get to charge more for those capabilities.
And so the cost of running these new AI capabilities, you basically have to take out of your margin. So yeah, that's something to think about: you really want to know how much you're going to be losing off the margin if it is something big. And I actually just noticed the other day that on the free Notion you can't do the AI stuff; it doesn't have the AI capabilities. And I realized that was probably because it costs them quite a lot of money to be running that AI stuff. For sure. Yeah, absolutely. So dude, Nikunj, thank you so much, and if anyone wants to check out more of what you are doing, I am going to just show them that they can go to the solutions tab right here and find out all kinds of cool stuff about what you're doing at TrueFoundry.
Check out that virtual booth, man, that's looking good, and you've got some cool things for people to go off of. So feel free to look at that, have a peek. And now I'm seeing myself inside of a picture of myself; this is going to get a little bit confusing in a minute. So, all right, man, I'll let you go. For any more questions: OSH is asking you one, so I think you're going to go into the chat right now. Otherwise, you're also in the MLOps community Slack, so just tag Ko and let him know that you've got questions. All right. Thanks, man. All right. Thank you so much, Demetrios. Thank you so much, everyone, for listening.
Take care. Bye.