MLOps Community

Cost Optimization and Performance

Posted Apr 27, 2023 | Views 977
# LLM in Production
# LLM
# Cost Optimization
# Cost Performance
# Rungalileo.io
# Snorkel.ai
# Wandb.ai
# Tecton.ai
# Petuum.com
# mckinsey.com/quantumblack
# Wallaroo.ai
# Union.ai
# Redis.com
# Alphasignal.ai
# Bigbraindaily.com
# Turningpost.com
SPEAKERS
Lina Weichbrodt
Freelance Machine Learning Development + Consulting @ Pragmatic Machine Learning Consulting

Lina is a pragmatic freelancer and machine learning consultant who likes to solve business problems end to end and make machine learning, or a simple, fast heuristic, work in the real world.

In her spare time, Lina likes to exchange ideas with other people about how they can implement best practices in machine learning. Talk to her on the MLOps Community Slack: shorturl.at/swxIN.

She works with LLMs at her current client, a stealth startup. Lina will be moderating the panel.

Luis Ceze
CEO and Co-founder @ OctoML

Luis Ceze is Co-Founder and CEO of OctoML, which enables businesses to seamlessly deploy ML models to production, making the most of their hardware. OctoML is backed by Tiger Global, Addition, Amplify Partners, and Madrona Venture Group. Ceze is the Lazowska Professor in the Paul G. Allen School of Computer Science and Engineering at the University of Washington, where he has taught for 15 years.

Luis co-directs the Systems and Architectures for Machine Learning lab (sampl.ai), which co-authored Apache TVM, a leading open-source ML stack for performance and portability that is used in widely deployed AI applications.

Luis is also co-director of the Molecular Information Systems Lab (misl.bio), which led pioneering research at the intersection of computing and biology for IT applications such as DNA data storage. His research has been featured prominently in media outlets including The New York Times, Popular Science, MIT Technology Review, and The Wall Street Journal. Ceze is a Venture Partner at Madrona Venture Group and leads their technical advisory board.

Jared Zoneraich
Founder @ PromptLayer

Co-Founder of PromptLayer, enabling data-driven prompt engineering. Compulsive builder. Jersey native, with a brief stint in California (UC Berkeley '20) and now residing in NYC.

Daniel Campos
Research Scientist @ Snowflake

Hailing from Mexico, Daniel started his NLP journey with a BS in CS from RPI. He then worked at Microsoft on ranking at Bing with LLMs (back when they had two commas) and helped build out popular datasets like MS MARCO and TREC Deep Learning. While at Microsoft he earned his MS in Computational Linguistics from the University of Washington, focusing on curriculum learning for language models. Most recently, he has been pursuing his Ph.D. at the University of Illinois Urbana-Champaign, focusing on efficient inference for LLMs and robust dense retrieval. During his Ph.D. he has worked for companies like Neural Magic, Walmart, Qualtrics, and Mendel.AI, and he now works on bringing LLMs to search at Neeva.

Mario Kostelac
Staff Machine Learning Engineer @ Intercom

Currently building AI-powered products at Intercom in a small, highly effective team. I roam between practical research and engineering, but lean more towards engineering and the challenges around running reliable, safe, and predictable ML systems. You can imagine how fun that is in the LLM era :).

Generally interested in the intersection of product and tech, and in building differentiation by solving hard challenges (technical or non-technical).

Software engineer turned machine learning engineer five years ago.

Demetrios Brinkmann
Chief Happiness Engineer @ MLOps Community

At the moment Demetrios is immersing himself in Machine Learning by interviewing experts from around the world in the weekly MLOps.community meetups. Demetrios is constantly learning and engaging in new activities to get uncomfortable and learn from his mistakes. He tries to bring creativity into every aspect of his life, whether that be analyzing the best paths forward, overcoming obstacles, or building lego houses with his daughter.

SUMMARY

In this panel discussion, the topic of the cost of running large language models (LLMs) is explored, along with potential solutions. The benefits of bringing LLMs in-house, such as latency optimization and greater control, are also discussed. The panelists explore methods such as structured pruning and knowledge distillation for optimizing LLMs. OctoML's platform is mentioned as a tool for the automatic deployment of custom models and for selecting the most appropriate hardware for them. Overall, the discussion provides insights into the challenges of managing LLMs and potential strategies for overcoming them.

TRANSCRIPT

Amazing to have you all here. We're gonna talk about cost optimization and performance. I'm also an ML engineer, and I have an amazing panel for you today. I have two people who run large language models in production. We have Daniel, who is a research scientist at Neeva, an ad-free private search solution. I tried it, it's quite nice, and they're using large language models in search for all kinds of tasks, like summarization, semantic retrieval, and generation. So I'd be super interested to hear what you learned by applying them at that scale. Then we have Mario, a staff engineer at Intercom, who put out one of those early GPT-4-powered products, a chatbot. And then we have two people from the tooling space. I'm really excited to have Jared here, co-founder of PromptLayer. Hey, Jared. It's a platform to manage your production LLM apps with reliable prompt engineering and monitoring of your prompts: versioning, cost, latency, and you can track historic usage and performance. So I'd be excited to talk more about evaluation. And last but not least, we have Luis, who wears many hats. He is a professor of computer science at the University of Washington and also co-founder and CEO of OctoML, where they offer easy and efficient deployment. The part that was especially interesting to me is that you can upload any computer vision or NLP model and it will be optimized, it will tell you which instance to run it on, and soon you can also run your models there. So I'd be very interested in how you help users make their models run more cost-efficiently and faster.
OK. So maybe let's start with costs. We know that running large language models is generally quite expensive, much more expensive than normal back-end APIs. So, Mario and Daniel, since you run them in practice: was cost an issue? How did you approach the cost angle of running in production?

Yeah, so cost is an issue, especially at scale, and I think that's where we started out. We started off with some of these foundational APIs as a way to figure out: is this a product that's useful? Does this help our ranking signals, do our users like this? And once we actually started getting traction, both the cost of running it and how much we could run it on really become a bottleneck. If you think about summarization systems, potentially running one in production in real time for users is very possible; the API is kind of slow and it can get expensive, but it works. But what if you want to run summaries on, say, the entirety of a web index, which is on the order of billions of documents? At that point it stops being cost-feasible, so you have to move in-house to some type of smaller system. At the same time, it's a question of how widely you want to cast the net with these models, how many people you want to have access to them, and what their expectation for turnaround time is.
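To make the scale argument concrete, here is a rough back-of-the-envelope cost model in Python. Every number in it (tokens per document, price per 1,000 tokens, document count) is an illustrative assumption rather than a figure quoted on the panel.

```python
# Rough cost model for summarizing a whole web index via a hosted LLM API.
# Every number below is an illustrative assumption, not a quote from any provider.

PRICE_PER_1K_TOKENS = 0.002     # assumed API price in USD per 1,000 tokens
TOKENS_PER_DOC = 1_500          # assumed prompt + completion tokens per document
NUM_DOCS = 1_000_000_000        # "on the order of billions of documents"

def summarization_cost(num_docs: int = NUM_DOCS) -> float:
    """Total API cost in USD to summarize num_docs documents once."""
    return num_docs * TOKENS_PER_DOC / 1_000 * PRICE_PER_1K_TOKENS

if __name__ == "__main__":
    # Roughly 3 million USD under these assumptions, which is why batch workloads
    # over a whole index tend to move to smaller in-house models.
    print(f"${summarization_cost():,.0f}")
```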
Yeah. Do you have a guideline for when to make that move, like do you start with a working prototype? And maybe you could also speak on this: I imagine you're running it in-house now, but it's still expensive, right? Do you have a ballpark compared to other services? How do you evaluate that, and how far do you bring the cost down?

So on the evaluation side, initially, using these foundational models is about having rapid iteration: the ability to revise things via the prompt and get them in front of users to actually figure out what they think. And once users are saying, oh, this is great, this is the use case I'm working on, then we tend to focus in and figure out how we can distill and extract this into a smaller model. Within these smaller models, the process we think about is less around the dollar cost and more around where we can run this. For us, the golden threshold has been: can we make this model run on a single A100 GPU, sorry, a single A10 GPU, or on some kind of CPU-only setup? Because when you move from the A100 to an A10, say you're serving on AWS, you go from paying about 40 bucks an hour, because you have to get eight of them at once, to picking up an A10 instance for about 30 cents. At that price point you can do pretty much anything. And if you can even move it to CPU, the game changes further, and you can go to billions of items without really thinking about it. So for a lot of our work, like our semantic search, these are small query encoders, 30-million to 140-million-parameter models, and in that case we can run the query encoder on CPU, so it sits in the same system that's actually doing the retrieval, which just simplifies production.
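As an illustration of the kind of small query encoder Daniel describes, here is a minimal sketch that serves a compact open-source sentence encoder on CPU, next to the retrieval code. The model choice (a roughly 22M-parameter MiniLM checkpoint) and the mean pooling are assumptions for the example, not Neeva's actual stack.

```python
# Minimal sketch: a small query encoder served on CPU, colocated with retrieval.
# Model choice and pooling are illustrative assumptions, not the panelist's setup.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "sentence-transformers/all-MiniLM-L6-v2"  # small open-source encoder

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)
model.eval()

@torch.no_grad()
def encode_query(query: str) -> torch.Tensor:
    """Embed a search query on CPU; cheap enough to live in the retrieval service itself."""
    inputs = tokenizer(query, return_tensors="pt", truncation=True, max_length=128)
    hidden = model(**inputs).last_hidden_state          # (1, seq_len, dim)
    mask = inputs["attention_mask"].unsqueeze(-1)       # (1, seq_len, 1)
    return (hidden * mask).sum(1) / mask.sum(1)         # mean pooling over real tokens

embedding = encode_query("how do I reduce llm serving costs?")
print(embedding.shape)  # torch.Size([1, 384]) for this model
```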
It's really a question of figuring out what scale you're going to, what latency and throughput you're looking for, and what the size is. And one of the things we saw is that the foundational APIs are actually really slow, and partially that's a good thing, because it changes user perception. Before, people's perception via Google search was that anything over 100 milliseconds is terrible; now people are used to waiting three, four, five seconds. And we've been able to see that if we're calling Cohere, OpenAI or, what's it called, Anthropic, they're on the order of three seconds for a response for a single batch, while we can get a model that is 20 billion parameters to respond in 400 milliseconds. That net improvement in speed is huge just from bringing it in-house, and that's simply because you don't have to go out over the internet and through their orchestration; you deal with your own. The other thing is that when you bring this in-house, you really become the master of your own fate: any latency optimization is owned by you, you don't have to worry about API rate limiting, and you don't have to worry about their entire system going down. That was actually one of the things that really pushed Neeva to bring these things in-house, because, sure, OpenAI has 99% uptime, but that's 99% for everyone, and that was not what we were experiencing at all. When we're trying to deliver this product and a use case doesn't work because their overall system is down and we can't do anything but sit on our hands, that really gave us the impetus to bring it in-house.

Yeah, we experienced the same issues. They were regularly unavailable, sometimes for 15 minutes at a time, so that was a problem. OK, anyone else on cost?
Can I just jump right in here to say: I love what was just said, because this is exactly our point of view. Often you don't need the highest-end hardware; you can live with hardware that actually hits your latency and throughput requirements. Even though there seems to be a strong shortage of high-end silicon, many people don't realize that they probably don't need high-end silicon to deploy their models. And basically what we offer is exactly what you talked about, but in a general way, in a fully automated platform for broader users to deploy their own custom model or an open-source model on a platform that they control, automatically choosing the silicon for you and turning models into high-performance, portable artifacts and code that you can run on different hardware. I just need to say, we should definitely follow up after this, because our view of the value here is 100% aligned. But in panels we need to disagree with each other, so somebody should disagree with him to make this interesting, right?

So you can disagree about the most promising ways to go about cost reduction. If you each had to name your top two methods, because you can do so much, right? If you start reading on the internet about the top things to do to reduce costs, you get a list of like 20 items. So maybe you can disagree on that. Let each of you give me your top two.

OK, since I was going to say two: first of all, pick a model that does the thing that you need and optimize it as much as you can, because then you get better performance on a given target. And second, choose the right silicon for it, the one that has the lowest cost possible and still hits your performance requirements.
So you're also saying a different model type, like a smaller type of model?

Exactly. Pick the right model, optimize it as much as you can, hopefully automatically, and then pick the right hardware for it.

Yeah, my two things would be structured pruning and knowledge distillation. You take the large model and you actually just remove portions of it until it fits in whatever systems you have. A direct example of this: we're serving a Flan-T5 11-billion-parameter model that physically doesn't fit on an A10 GPU until you remove at least 60% of the weights, just because of the sheer size of the model. Once you've done that, you can make the model fit on the smaller GPU, but then you have to use some form of knowledge distillation so the model doesn't suffer a big decrease in quality. Generally speaking, those two things together provide really big performance improvements. And if you get really crafty with your distillation... I used to work at Neural Magic, which focused on distillation for CPUs and everything else, and we were able to get 30, 40, 50x improvements on text classification tasks basically just by pushing knowledge distillation as far as we could, with the biggest teacher model and the smallest student model. That just works extremely well.
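For readers who want the mechanics, here is a minimal sketch of the logit-distillation idea Daniel mentions for classification tasks: a small student is trained against a mix of hard labels and the softened outputs of a large teacher. The temperature, loss weighting, and dummy tensors are generic assumptions, not the exact Neural Magic or Neeva recipe.

```python
# Minimal knowledge-distillation step for a text classifier (teacher -> student).
# Architectures, temperature, and loss weights are illustrative assumptions.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature: float = 2.0, alpha: float = 0.5):
    """Blend hard-label cross-entropy with KL to the teacher's softened distribution."""
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)  # standard scaling so gradient magnitudes stay comparable
    return alpha * hard + (1 - alpha) * soft

# Dummy batch: 8 examples, 3 classes. In practice teacher_logits come from a frozen
# large model and student_logits from the small model being trained.
teacher_logits = torch.randn(8, 3)
student_logits = torch.randn(8, 3, requires_grad=True)
labels = torch.randint(0, 3, (8,))

loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()  # gradients flow only into the student
print(float(loss))
```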
The downside, and I guess I'll argue against myself in the other corner here, is that there's a nonzero cost to actually moving all of this in-house. The beautiful part about these foundational models is that the feedback cycle is super fast. Say you're doing something with the prompt for a search engine and you want to rephrase how your system responds: you just change the prompt, it runs, and if it's good, it's good to go in production. If you want to do the same thing in-house, say you want the system to take a more neutral tone, you have to go through a whole iterative process: build datasets, distill them into something smaller, take the model, compress it into something smaller, and figure out how it moves to production. So your iteration time goes from being on the order of minutes to potentially days or weeks. And that's the trade-off: how much do you want to explore and how quickly do you want to iterate on products with customers, versus how much do you want to optimize and control your own fate?

If you're running in-house, what size of team do you need just for the ops side of running, I don't know, a few models, let's say two?

I guess it depends...

It depends on the type of model, probably, but maybe you can give listeners a ballpark on the amount of work you had to do.

I would say that for bringing these models in-house and making them work, there were roughly four of us working on everything from data to modeling. I'm the compression guy, that's what gets me up in the mornings, but there are other folks working on the modeling side. Overall at Neeva, I think there are probably around 10 of us working directly with models.
OK. And Mario, Jared, thoughts on the cost side?

I can maybe give a few thoughts on how you can control costs before you actually move anything in-house. We don't train or fine-tune our LLMs ourselves, so right now we are pretty much on the consumer side, just calling APIs. And it's tricky, it's expensive, and I think even Davinci can be a very expensive model depending on how you use it. Most of the time we first do a feasibility exploration of whether what we want is possible at all in some really isolated case, like: can we summarize one conversation, for example? To assess that capability we take the best model first, see whether it's possible, and then try to pare the model down to the smallest one that still gives us results that are OK. Like Daniel mentioned, it often feels like producing games 25 years ago, where everything is too slow and you are totally limited by how much you can process, and that cuts both ways, from a cost and a latency perspective. We just can't summarize all the conversations, because we'd go bankrupt straight away; there are too many of them. But one thing that's interesting for cost reduction is a basic engineering idea, which is: just call it less. That's not always possible, of course. I think in Daniel's case it's a bit more tricky if you're doing semantic search. But the systems we build are usually a complex set of components that talk to each other, multiple components are implemented with LLMs, and sometimes you can put a really simple classifier in front to figure out whether it even makes sense to call the LLM at all. That's one technique that I think can work quite well, because not every task is that hard; sometimes you can find a really simple tool that can tell you whether the more complex, smarter tool has any chance of finding an answer. So I think that's one technique that helps a lot.
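A minimal sketch of the "call it less" gate Mario describes: a cheap classifier decides whether a request is worth sending to the expensive model at all. The training examples, the threshold, and the call_llm function are hypothetical placeholders, not Intercom's implementation.

```python
# Cheap gate in front of an expensive LLM call: only call the big model when a
# simple classifier thinks the request is answerable. Training data, threshold,
# and call_llm() are hypothetical placeholders for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny illustrative training set: 1 = worth sending to the LLM, 0 = handle cheaply.
texts = ["summarize this conversation", "refund status for order 123",
         "why is my bill higher this month", "asdkjhasd", "hi", "thanks, bye"]
labels = [1, 1, 1, 0, 0, 0]

gate = make_pipeline(TfidfVectorizer(), LogisticRegression())
gate.fit(texts, labels)

def call_llm(message: str) -> str:          # hypothetical expensive API call
    return f"<LLM answer for: {message}>"

def answer(message: str, threshold: float = 0.5) -> str:
    p_useful = gate.predict_proba([message])[0][1]   # probability the LLM can help
    if p_useful < threshold:
        return "Routing to a canned reply or a human instead."  # skip the LLM entirely
    return call_llm(message)

print(answer("thanks, bye"))
print(answer("summarize this conversation please"))
```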
Yeah, and I'd just echo Mario's first point there. You could do all of this, bring it in-house and so on, but if you're using these large foundation models from OpenAI or whoever, the first step is just knowing what's going on: building version one and understanding how the cost works. And it's a bigger step than you'd expect, just understanding which prompts in your system, which parts, are expensive, and knowing where to go from there. That's pretty important, actually. So just echoing that.

Yeah. So there are some great tricks you can employ when starting off, and then maybe later, when you scale, you bring it in-house. OK, let's talk about latency. It's, uh...
OK, I just have one more thing. One thing that I think was implicit, but I don't know if it was said explicitly: there are plenty of open-source, ready-to-go models that can do the job. I think Daniel mentioned Flan-T5, for example; you can fine-tune them for your users. And people are realizing that you do not need to pay the OpenAI or Cohere tax if there's a use case that you know an open-source model can do well, and you can run it much more efficiently on a platform that you control, which you might not even need to build on your own. That already gives you very significant room for cost savings, right? So just know your use case. There's this analogy: you use a CPU because it's general, but the CPU is expensive, so you use a GPU to offload the computation because that's more efficient. I think there's a direct analogy here, where you have these general, expensive models like GPT-4 that you have to pay to use, but for the things you know you want to do and can specialize, you use an open model, and you're probably going to pay a lot less, probably by orders of magnitude.

Just to make it interesting and push back a little bit, I'd actually say that, the way I see it, it depends where you are in the stages of development. If you're building version one, if you're trying to ship something, and honestly most people working with LLMs today are trying to ship version one, I'd say don't bother with open-source models; get something working on GPT or something first.

Yeah, but just like, hey, if you're writing code, you write it for the CPU, you make sure it works, and then you offload to the efficient compute unit that does what you need. Similar thing here: you start simple with a model that you know is more expensive, and as you understand what you actually need from it, I'm fairly confident you can find an open-source, cheaper model that does what you need, right? Question answering, summarization, and stuff like that.
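One way to keep the "start on a hosted model, swap in open source later" path cheap is to hide the model behind a single function from day one. This is a sketch under assumptions: the hosted call is a hypothetical placeholder, and the local backend uses an open Flan-T5 checkpoint purely as an example.

```python
# One interface, two backends: prototype against a hosted API, then swap in an
# open-source model once the use case is pinned down. The hosted call is a
# hypothetical placeholder; the local backend uses Flan-T5 only as an example.
from transformers import pipeline

_local_model = pipeline("text2text-generation", model="google/flan-t5-base")

def hosted_complete(prompt: str) -> str:
    raise NotImplementedError("placeholder for your hosted LLM API call")

def complete(prompt: str, backend: str = "local") -> str:
    if backend == "hosted":
        return hosted_complete(prompt)
    out = _local_model(prompt, max_new_tokens=64)
    return out[0]["generated_text"]

print(complete("Summarize: the panel discussed ways to cut LLM serving costs."))
```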
I think another part that's important to consider in cost reduction is staffing a team. The story I have here: I've always worked in the search world, at big search engines, so there has always been a lot of money behind this, there are teams and everything else. My brother used to work at a fintech, and at some point they were looking for something that could classify businesses. When they poked around and figured out what it would cost to have their own system, you have to hire people, at least one ML engineer and probably an infra engineer, and you have to run it, their bottom-barrel cost would have been on the order of, I don't know, 500 or 600 grand a year just for the team. And they could make it 99% accurate, like what Luis was talking about, but in reality they just needed something that was 70%. If they just plugged an API into their system, they never had to hire an NLP team, they never had an ML team, and the system was beautiful, because someone could just go in and tune the prompt until it worked well enough. Just by using this large hosted model they saved money, because they didn't have to hire expensive people like me. And in their case it doesn't hurt, because they were only pushing 10,000 examples through a day, so the cost is not that high. Versus on the search-engine side, or when you're dealing with all the interactions Intercom has, if you wanted to do all the summaries, that cost calculus changes.
So clearly, if you, if you use it like the X O plus type of thing with the human in the loop, uh then using it as the API will be always more cost efficient. It's a good point. Ok. So, um moving on to latency, this is another huge problem. So just getting uh or maybe I'm a bit um you know, spoiled from normal classification models, but just moving into the L M space and seeing that just one token takes zero point Ok. So, um moving on to latency, this is another huge problem. So just getting uh or maybe I'm a bit um you know, spoiled from normal classification models, but just moving into the L M space and seeing that just one token takes zero point seven seconds on P 90 was uh shocking and then you normally you need like a whole sentence. Um So that could take several seconds. And how do you deal with that if you have a user on the other hand, who is like you very impatient things after one second, the application is broken. seven seconds on P 90 was uh shocking and then you normally you need like a whole sentence. Um So that could take several seconds. And how do you deal with that if you have a user on the other hand, who is like you very impatient things after one second, the application is broken. Um What are typical? How do you go through the life cycle? When do you start, you start out maybe not looking at speed. What types of speedups can be? Um a are possible to expect. Do you do the tradeoff between model size and, and, and speed or like how are you thinking about this? And maybe also throw your best tips like if you had can do one or two things uh on speed. Uh What do you do if you don't have a lot of time? Um What are typical? How do you go through the life cycle? When do you start, you start out maybe not looking at speed. What types of speedups can be? Um a are possible to expect. Do you do the tradeoff between model size and, and, and speed or like how are you thinking about this? And maybe also throw your best tips like if you had can do one or two things uh on speed. Uh What do you do if you don't have a lot of time? Literally, Literally, yeah, I I can walk through directly that we didn't even, we talked about is our initial summarization model. And uh initially use a large language model, you get the summaries, you're basically looking at about three seconds per item and we're doing a batch of 10. So we're basically spent 30 seconds. yeah, I I can walk through directly that we didn't even, we talked about is our initial summarization model. And uh initially use a large language model, you get the summaries, you're basically looking at about three seconds per item and we're doing a batch of 10. So we're basically spent 30 seconds. You, we did this and figured out that like, oh, it turns out this is actually useful and very good it imprompt and then brought this into a smaller model in the house just going into a T five large model. But serving that on like a bat size of 10 on an A 10 still took like eight seconds naively. So it kind of it went about all these ways. So first off, there's a great library called faster transformers, NVIDIA kind of supports it. They don't fully support it. So like you got to fiddle with it a little bit, but just moving from You, we did this and figured out that like, oh, it turns out this is actually useful and very good it imprompt and then brought this into a smaller model in the house just going into a T five large model. But serving that on like a bat size of 10 on an A 10 still took like eight seconds naively. 
So it kind of it went about all these ways. So first off, there's a great library called faster transformers, NVIDIA kind of supports it. They don't fully support it. So like you got to fiddle with it a little bit, but just moving from uh native pie Tort serving to faster transformers basically brought us down from like eight seconds to like 1.8 seconds. So huge gains just on changing how it served. It worked on the GP U had a funky build process, but it worked super well. And then from there, we threw in some notion of structured pruning and asymmetrical pruning on the encoder decoder side just because like the encoder uh native pie Tort serving to faster transformers basically brought us down from like eight seconds to like 1.8 seconds. So huge gains just on changing how it served. It worked on the GP U had a funky build process, but it worked super well. And then from there, we threw in some notion of structured pruning and asymmetrical pruning on the encoder decoder side just because like the encoder uh produces this contextual representation on a sequence sequence model. And you want the best contextual representation possible and it also only runs once. So the cost is pretty much doesn't matter versus the decoder runs again and again and again. So we heavily compress the decoder. So we basically took this model that started off with 24 layers on either side and brought it down to uh produces this contextual representation on a sequence sequence model. And you want the best contextual representation possible and it also only runs once. So the cost is pretty much doesn't matter versus the decoder runs again and again and again. So we heavily compress the decoder. So we basically took this model that started off with 24 layers on either side and brought it down to four layers on one side and four layers on the decoder. And that meant that like by the time that we were done with faster transformers. This compressed model, we got down to roughly like 300 milliseconds for a batch of 10, which that allowed us to basically say, hey, it now will cost us $20,000 to create a summary for everything in our index. And we can actually do that because we'd like decrease these costs versus when it was 10 seconds, 30 seconds, I think just not trackable. four layers on one side and four layers on the decoder. And that meant that like by the time that we were done with faster transformers. This compressed model, we got down to roughly like 300 milliseconds for a batch of 10, which that allowed us to basically say, hey, it now will cost us $20,000 to create a summary for everything in our index. And we can actually do that because we'd like decrease these costs versus when it was 10 seconds, 30 seconds, I think just not trackable. Yeah. Yeah, I just, I just want to underscore making sure to use a very best kernels, the very best binaries for the given harder that you're running it on because that's if you can do without changing your model at all or any batching parameters, any other system parameters start with that and see how far you can go. You know, if you and the problem is that this is dependent on the harder target, very specific in the harder target. So the more you can Yeah. Yeah, I just, I just want to underscore making sure to use a very best kernels, the very best binaries for the given harder that you're running it on because that's if you can do without changing your model at all or any batching parameters, any other system parameters start with that and see how far you can go. 
You know, if you and the problem is that this is dependent on the harder target, very specific in the harder target. So the more you can um automate that, the better I'll put, I'll put a plug in for what T M L does. We can help you search automatically for what's the right library or compiler using tensor I T or TV M or uh you know, D N N for NVIDIA or the equivalent for other hardware targets. And then after that, you know, change the batch size, sure, you might pay a little bit more harder utilization. But if latency really matters to, you reduce the batch size, pay a little bit more So, um automate that, the better I'll put, I'll put a plug in for what T M L does. We can help you search automatically for what's the right library or compiler using tensor I T or TV M or uh you know, D N N for NVIDIA or the equivalent for other hardware targets. And then after that, you know, change the batch size, sure, you might pay a little bit more harder utilization. But if latency really matters to, you reduce the batch size, pay a little bit more So, but I'm really happy to, but I'm really happy to, sorry, go on again. Were you? sorry, go on again. Were you? I'm really happy to hear like what Daniel and they are talking about because like, if you're in our world, we just talk, like talking to some external api your hands are so tight, so effectively, you have like two ways to uh improve the latency. One is by shared capacity. Um If they, if they can give it to you, uh and then you, you know, you can play a little bit more with like true latency I'm really happy to hear like what Daniel and they are talking about because like, if you're in our world, we just talk, like talking to some external api your hands are so tight, so effectively, you have like two ways to uh improve the latency. One is by shared capacity. Um If they, if they can give it to you, uh and then you, you know, you can play a little bit more with like true latency um curve and like where you want to sit there. And the second one is maybe like what I said, like if, if you're OK with not finding an answer and if you can detect it quickly, maybe you just don't call it and return the answer straight away. Like there are U X paradigms, like I think it's becoming quite for like um curve and like where you want to sit there. And the second one is maybe like what I said, like if, if you're OK with not finding an answer and if you can detect it quickly, maybe you just don't call it and return the answer straight away. Like there are U X paradigms, like I think it's becoming quite for like chat applications, you know, just piping the tokens directly from L L M to the user. So user is, you know, see straight away that something is happening or sometimes you just can't do that streaming effectively. Yeah, like streaming like directly from L L M to um to, to the end user experience. Like sometimes you want to check what's, you know the answer when it's fully uh generated. And in these cases, chat applications, you know, just piping the tokens directly from L L M to the user. So user is, you know, see straight away that something is happening or sometimes you just can't do that streaming effectively. Yeah, like streaming like directly from L L M to um to, to the end user experience. Like sometimes you want to check what's, you know the answer when it's fully uh generated. And in these cases, like you have to wait for it. 
I'm really happy to hear what Daniel and the others are talking about, because if you're in our world, where you're just talking to some external API, your hands are so tied. Effectively you have two ways to improve latency. One is capacity: if they can give it to you, then you can play a little more with the latency curve and where you want to sit on it. And the second one is maybe what I said earlier: if you're OK with not finding an answer, and you can detect that quickly, maybe you just don't call the model and return an answer straight away. There are also UX paradigms — I think it's becoming quite standard for chat applications — of just piping the tokens directly from the LLM to the user, so the user sees straight away that something is happening. But sometimes you just can't stream directly from the LLM to the end-user experience; sometimes you want to check the answer once it's fully generated, and in those cases you have to wait for it. For example, for GPT-4 on the shared pool of capacity, the latency I'm seeing is something like 100 milliseconds per token — for English it's usually about one word per token, though they changed the tokenizer for GPT-4 — which is extremely high. So I'm really happy to hear what Daniel and the others are talking about, because I hope we'll be able to actually move some of these things, once we understand which use cases are actually the most important for us and which make sense to optimize and bring in house.

There are some tricks you can still do when you call an API. You can work with the prompt, because the main latency is actually in the output, not in the size of the input, so you can tell it to be precise, or to give back a certain number of paragraphs — to be a bit shorter in the output. We can clearly see from the panel that if you want to be really fast, you need to move it in house. But let's say you're at this first stage, where you still need to understand the cost and the speed — there's a small set of tricks, like semantic caching and what Mario mentioned, that you can do.
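As one example of that kind of trick, here is a minimal semantic-cache sketch: reuse a previous answer when a new prompt is close enough in embedding space. `embed` and `call_llm` are placeholders for whatever embedding model and LLM API you use, and the similarity threshold is a made-up starting point to tune.

```python
# Minimal semantic cache: skip the LLM call when a similar prompt was answered before.
import numpy as np

class SemanticCache:
    def __init__(self, embed, call_llm, threshold=0.92):
        self.embed, self.call_llm, self.threshold = embed, call_llm, threshold
        self.keys, self.values = [], []

    def query(self, prompt: str) -> str:
        v = np.asarray(self.embed(prompt), dtype=np.float32)
        v /= np.linalg.norm(v) + 1e-9
        if self.keys:
            sims = np.stack(self.keys) @ v          # cosine similarity (keys are unit norm)
            best = int(np.argmax(sims))
            if sims[best] >= self.threshold:
                return self.values[best]            # cache hit: no API latency
        answer = self.call_llm(prompt)              # cache miss: pay the full call
        self.keys.append(v)
        self.values.append(answer)
        return answer
```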
Maybe Jared — you also have a dashboard where users can track the latency they observe. Are there insights you can share? What do you observe in terms of your users with respect to latency?

Yeah, I guess the interesting thing here is: what's the low-hanging fruit to help with latency? You mentioned one, which is obviously how long the response is, but the other one, which is also very obvious, is falling back to different types of models. If you can detect how badly you need GPT-4 versus 3.5 — if there's a use case where you can fall back to 3.5, a faster model, or even something else (although that might be the fastest one they have now), or maybe Claude or something like that — there is becoming, in fact there already is, this whole axis of which engine do I use for this problem, as opposed to just saying we're using GPT-4 for everything, or we're using our in-house model for everything. I think these are all tools with pros and cons, and one of the pros and cons of each is latency.
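That "which engine for which problem" axis can start as a very small routing layer. The sketch below is a hedged illustration: the engine names, the stand-in callables, and the decision rule are all placeholders; in practice the rule might be a heuristic on task type, prompt length, or user tier, or a small classifier.

```python
# Hedged sketch of a model router: use a cheaper, faster engine unless the
# request actually needs the strongest model. Names and rules are illustrative.
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class Engine:
    name: str
    call: Callable[[str], str]   # prompt -> completion

def needs_strong_model(prompt: str) -> bool:
    # Placeholder heuristic: long or explicitly analytical prompts go to the
    # stronger (slower, pricier) engine. Replace with your own rule.
    return len(prompt) > 2000 or "step by step" in prompt.lower()

def route(prompt: str, engines: Dict[str, Engine]) -> str:
    engine = engines["strong"] if needs_strong_model(prompt) else engines["fast"]
    return engine.call(prompt)

# Usage with stand-in callables in place of real API clients:
engines = {
    "fast": Engine("fast-model", lambda p: f"[fast answer to: {p[:30]}...]"),
    "strong": Engine("strong-model", lambda p: f"[strong answer to: {p[:30]}...]"),
}
print(route("Summarize this paragraph in one sentence.", engines))
```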
Yeah, I can't resist bringing up one thought again; it was mentioned very briefly: streaming the output. I'm more on the user side these days, interacting with these bots, and they buffer the entire answer before they show it to me. A simple thing that just helps human patience is to say it's being typed, or to stream it to the user, because then you see that something is happening, right? These bots wait for the entire output before showing it to me. If you just said, "hey, the bot is typing" — just let the GPU work hard for you — that would already go a long way. Don't change the latency; just manage human expectations. That will probably go a long way as well.

That's such a good point — the actual UI experience. There are some drawbacks in terms of safety of the response: if you stream it, you can't check the response first, so that could be a concern for some types of applications. But I like your second idea, the typing indicator — when you see someone typing in a chat application, like a human would, that helps. It's a good idea.

OK, I'd just add real quick: I think there are a lot of UX ways to solve this problem. That's one thing. And one other addition: these latency figures for models like GPT-4 — the latencies of the models being offered — are not constants either. We've been looking at our high-level user data, and the latencies of things like GPT-4 are very spiky; it's very laggy some days and not others. So if you really care about optimizing, that's probably another thing to look at.

Yeah, that's definitely an issue. They don't offer any SLA yet, so I guess that's hopefully soon to come.

We'll be adding a latency page to help with that soon, so stay tuned on that.

Something we talked about earlier is that batch size is hugely important, both for this streaming component and also because, when you're doing this greedy token-by-token decoding, if there's any difference in length within a batch, everything basically defaults to the longest sequence in the batch. There was some literature out of Washington — I think in machine translation they found that something like 70% of tokens in batches were effectively useless tokens, because decoding was driven by the longest padded sequence. So especially if you have high variability in your outputs, doing anything but a small batch size can be bad: if you have a short response, you don't get to enjoy it, because all your responses are limited by your longest output.
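One common mitigation — offered here as a hedged sketch rather than the panelist's recipe — is to bucket requests by length before batching, so each batch's longest sequence is close to its shortest. Input length is used below as a rough proxy for output length, which only helps when the two are correlated, as they often are in summarization.

```python
# Sketch of length bucketing to reduce padding waste in batched decoding.
from transformers import AutoTokenizer

def length_bucketed_batches(texts, tokenizer, batch_size=10):
    """Yield batches of texts with similar tokenized lengths, plus their
    original indices so outputs can be restored to the input order."""
    lengths = [len(tokenizer(t)["input_ids"]) for t in texts]
    order = sorted(range(len(texts)), key=lengths.__getitem__)
    for i in range(0, len(order), batch_size):
        idx = order[i:i + batch_size]
        yield [texts[j] for j in idx], idx

# Usage with an illustrative tokenizer:
tok = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")
docs = ["short doc", "a much longer document " * 50, "medium length document " * 5]
for batch, idx in length_bucketed_batches(docs, tok, batch_size=2):
    print(idx, [len(t) for t in batch])
```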
Makes sense. Do you control for that in your prompt somehow — how do you handle controlling the output size?

I'd say on our side we've done a bunch of optimization there, even, as you mentioned, making the GPT-3 output more concise, which we can then distill from. But it's really just fiddling around with it. One of the things we saw was that with this asymmetric pruning we did, when you move to larger batch sizes, all those gains went away. When the batch was small you really could highly optimize, because you're not waiting on the decoding; with a large batch size you're just dominated by the longest sequence, so all your compression is effectively broken.

Interesting — that makes total sense. OK, let's also talk a little bit about monitoring. I'm very curious what you've found useful: human-based evaluation, model-based evaluation, any other techniques?

I can jump in here with a high-level overview — and I know I've talked to Mario personally about this, so I'm sure he has things to add too. What we've seen our users doing breaks down into three big categories. The first category, which I think is the real source of truth, is the end user giving you a thumbs up or thumbs down — or you can use behavioral signals, maybe treating clicking refresh or closing out as a thumbs down, a negative signal on how the prompt is doing. That's one category. Another category is the whole MTurk thing: hiring out to click workers. A very boring way to do it, in my opinion. The third category, which I think is really where the future is, is synthetic evaluation: how do you actually use an LLM to rate how good the output is?
And the one interesting thing I'll add here — a thesis we're developing at PromptLayer based on our conversations — is that via negativa is the right way to do this. Instead of trying to understand which prompts you have and how good they are, it's better to reverse the problem and ask: how can I evaluate what's failing in prod, which users are getting bad results? "Is the chatbot being rude?" is the most trivial example. But that's the way we see it.

Yeah, to be direct on that: we're calling this critic modeling, and it works extremely well, especially with the most recent models. We found that on a seven-point scale, GPT-4 is off by an average of one label. I used to do a bunch of human evaluation, and that's pretty much the best you're going to get from MTurk-trained judges — and it works across fluency, accuracy, and other dimensions. The one caveat is that there are huge positional biases: if you try to compare two things, the outputs don't work as well, so it's best to focus on scoring individual outputs and then potentially use some weak-supervision signal to figure out the pairwise comparisons.

Can you give us a concrete example of what you mean by critic? What is the prompt you're giving the critic — how does it work?

You write something like: "Hey, engine, your task is to evaluate how good a summary is. You are evaluating in terms of accuracy. Something that is completely accurate has these characteristics and is a seven; something that is slightly missing something is a six..." — all your descriptions of the labels — and give it to the model. The model will respond pretty damn well. But I will say: don't try to say "I have these two summaries, which is better?", because there's position bias — depending on what you put first or second, the model will favor what's on the left — and at the same time the outputs aren't as usable. It's kind of like the old Costco thing: you never ask someone why they prefer something, because if you ask them to justify it, they won't actually give you their preference; they'll give you an answer they can justify. So don't ever ask the model to justify.
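A minimal sketch of that kind of critic call is below. The rubric wording, the scale anchors, and the `call_llm` helper are all placeholders rather than the panelists' actual prompt; the structure just follows the advice above — score one output at a time on an absolute scale, and don't ask the model to justify.

```python
# Hedged sketch of a single-output critic prompt (LLM-as-judge on a 1-7 scale).
import re

RUBRIC = (
    "You are evaluating the ACCURACY of a summary on a 1-7 scale.\n"
    "7: every claim is fully supported by the source document.\n"
    "6: accurate overall but slightly missing a secondary detail.\n"
    "4: a mix of supported and unsupported claims.\n"
    "1: mostly fabricated or contradicts the source.\n"
    "Respond with only the integer score. Do not explain your answer."
)

def critic_score(call_llm, document: str, summary: str) -> int:
    """Score one summary in isolation; pairwise comparisons invite position bias."""
    prompt = f"{RUBRIC}\n\nSource document:\n{document}\n\nSummary:\n{summary}\n\nScore:"
    reply = call_llm(prompt)                      # call_llm: prompt -> text
    match = re.search(r"[1-7]", reply)
    if match is None:
        raise ValueError(f"Could not parse a score from: {reply!r}")
    return int(match.group())
```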
Yeah. OK, interesting. It's also known in human decision making, by the way: if you give someone a verbal statement of choices, people just forget — very similar to what we observe in the model — and they tend to pick the first option. There's very interesting behavioral research on this in decision design. OK, anyone else on monitoring and performance evaluation?

Yeah. So, as I said, we've done a bunch of stuff in house, and we've found, similar to what Jared mentioned, that the tail latencies can get very out of hand. You can see massive spikes, and they can be caused by weird corner cases you never thought of. We were seeing massive tail latencies occasionally, and it turned out we were running a model over a sliding window and occasionally it hit super long documents — and the solution was, what do you do? Just truncate after, I don't know, 10,000 tokens. The only way you see this is to look at what the outliers are, and in most cases, any time we've seen an outlier, the solution is super simple. It's just: oh, we didn't know about this weird behavior, here's the easy fix, and we no longer have that problem.
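That kind of outlier-driven debugging can be as simple as logging per-request latency and regularly pulling the worst cases for manual inspection. A minimal sketch, with a made-up log-record format:

```python
# Sketch: surface tail-latency outliers from a request log for manual review.
# The log record fields (request_id, latency_s, n_input_tokens) are illustrative.
import statistics

def worst_requests(log_records, k=10):
    """Return the k slowest requests plus p50/p95/p99 latency for context."""
    latencies = sorted(r["latency_s"] for r in log_records)
    quantiles = statistics.quantiles(latencies, n=100)
    p50, p95, p99 = quantiles[49], quantiles[94], quantiles[98]
    worst = sorted(log_records, key=lambda r: r["latency_s"], reverse=True)[:k]
    return {"p50": p50, "p95": p95, "p99": p99, "worst": worst}

# Usage: eyeball the worst requests regularly; fixes are often one-liners,
# e.g. truncating pathological inputs before they reach the model.
records = [{"request_id": i, "latency_s": 0.3 + (5.0 if i % 97 == 0 else 0.0),
            "n_input_tokens": 800} for i in range(500)]
report = worst_requests(records)
print(report["p99"], [r["request_id"] for r in report["worst"]])
```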
I have a worse name for it, but I call this "terrible experience debugging", which I regularly did when I was in recommender systems. I also monitor the really bad experiences, and I pick one every week to deep-dive on the root cause. It's very similar to what you're saying about monitoring the bad experiences — like the cases with the long latency — and then you find groups of problems that you can fix one by one. Very good. OK, cool. I think we're out of time.

I just have one more thing, if you don't mind — putting in a plug here: come try the profiler. With a single line of code you can get your model running across different hardware and see the latency profile for the different options; it's actually running the model and profiling it for you. We'd love to hear from you, so yes, please try it out.

Can I jump in for 30 more seconds? I'd be interested to hear from others, probably in chat, how they deal with labeling becoming too expensive as well — if you need GPT-4 for labeling, then it's expensive again. Do people have smart sampling techniques, or maybe pre-filtering up front? Are there solutions out there for that? It seems like such a nascent field to me that pretty much all the tooling is still being built. So yeah, I'd be happy to hear in chat what people use for things like that.

Excellent. Yes. Snorkel.