MLOps Community

Want High Performing LLMs? Hint: It is All About Your Data

Posted Apr 18, 2023 | Views 1.3K
# LLM in Production
# ML Workflow
# Rungalileo.io
# Snorkel.ai
# Wandb.ai
# Tecton.ai
# Petuum.com
# mckinsey.com/quantumblack
# Wallaroo.ai
# Union.ai
# Redis.com
# Alphasignal.ai
# Bigbraindaily.com
# Turningpost.com
speakers
Vikram Chatterji
Co-founder and CEO @ Galileo

Vikram is the co-founder and CEO of Galileo, the first data-centric platform for model debugging. Vikram previously led Product Management at Google AI where he painfully realized the criticality of good quality data for good quality model outcomes, as well as the highly manual nature of ML data debugging.

Demetrios Brinkmann
Chief Happiness Engineer @ MLOps Community

At the moment Demetrios is immersing himself in Machine Learning by interviewing experts from around the world in the weekly MLOps.community meetups. Demetrios is constantly learning and engaging in new activities to get uncomfortable and learn from his mistakes. He tries to bring creativity into every aspect of his life, whether that be analyzing the best paths forward, overcoming obstacles, or building lego houses with his daughter.

SUMMARY

Building LLMs that work well in production, at scale, can be a slow, iterative, costly, and unpredictable process. While new LLMs emerge each day, much as we saw in the Transformers era, models are becoming increasingly commoditized; the differentiator and key ingredient for high-performing models will be the data you feed them.

This talk focuses on the criticality of ensuring data scientists work with high-quality data across the ML workflow, the importance of pre-training, and the common gotchas to avoid in the process.

TRANSCRIPT

Intro

Hey, yo. We got Vikram joining us. So what's up dude? Hey. Good to see you again. You weren't ready for that intro on there. I love it. You weren't expecting cows, were you? But I loved it. It's the only way. So it's been a while since we chatted. I love what you all are doing at Galileo, and we're gonna give a huge shout out to Galileo if anyone wants to try it.

We're gonna drop all kinds of links. You're gonna show us all about what you all have been working on. And I gotta say thank you so much because you all decided to sponsor this event. And that is huge. It means a ton. It keeps us in business. It means that I can do these things and I can dedicate more time to it.

So thank you so much Vikram. Thank you so much. And there are Galileo socks as well. No way. All right. I didn't realize that. I'm gonna be giving them away then to anybody in the chat. Lou just wrote in the chat that that last video was moving, and that is how you get yourself some Galileo socks.

All right. Let's do it. So Vikram, I'm gonna share your screen right now, man, jump into it. I'll see you in like 25 minutes, and for anyone that is watching along, feel free to ask questions in the chat and we'll get to 'em at the end. Sounds good. Thanks, Demetrios. And I'll try to keep about five minutes at the end for any questions.

Just a quick introduction about me. I'm the co-founder and CEO of Galileo. The company itself, and the product, is completely geared towards your data and how you can figure out what the right data is to work with. Just before this, we were talking about the terrible experience of debugging, whether you're working with recommendation systems, with NLP models, or now with LLMs.

The whole reason we started the company was that before starting Galileo, I was heading up the product management team at Google AI. While building out models there, we faced this problem on a daily basis: literally days and days of just working with Excel sheets and Python scripts to debug the model and figure out where the failure points reside.

We used to call this data detective work. So there are a lot of scars from that, and that's what I wanted to talk about here, with all the chatter about LLMs now. To what Alex was mentioning before, it seems like this is the new phrase that we've all anchored around. It used to be foundation models literally a second ago, and transfer-learning-based transformers before that.

Want High Performing LLMs?

So it's interesting how fast the entire industry's moving. But what's also interesting is that the first principles still stay true, and that's what I wanted to discuss. The first principle is that, ever since machine learning came about, for those of us like me who've been in this space for the last decade and more, it's always been about the data.

That doesn't change. What I think changes is what that means in the context of LLMs, in the context of generative AI, for people who are trying to work with GPT or any of the other wonderful open source models that are out there. Unless someone's been living under a rock, we all know that AI in general is having quite the mainstream moment right now, powered by this huge leap in model development.

Being able to train models on a lot more data, on huge corpora, and a lot of other pieces have gone into making this happen today. Almost every day you see things like Y Combinator having a good one third, or one fourth, of its entire class be generative AI apps.

We're seeing other accelerators as well, and hearing about how this could be transformative for the enterprise. We're also starting to see a lot of worry and trepidation in the market: how do you navigate generative AI for customer service? Is it a bubble? I think Dan's talking about this. Overall, it's just a really interesting time in the space.

And the interesting thing is that all of these different kinds of headlines have come in just the last three or four months, right? Prior to December, a lot of this wasn't even in the zeitgeist, which is ridiculous. So of course we are all having this huge moment now where AI has gone mainstream, and everyone in the street just kind of knows about GPT.

However, the piece I really wanted to flag is that for those of us who've been in the industry for a long time, which, from the chat and all the other talks that I've seen so far, seems to be a lot of us, we've seen this movie play out before.

It's just that now there's so much more hype, and someone's done their marketing game really well, because everyone's talking about it. But if you rewind all the way back to 2012 or so, I think it was the 30th of September, 2012, there's this CNN called AlexNet, which came out and won the ImageNet challenge.

People were talking a lot about that. There was an article in The Economist at the time, which in my head is kind of when you reach the mainstream, where they talked about how suddenly people started to pay attention, not just within the small AI community but across the technology industry as a whole, to what's going on here: how can you classify these images in such an interesting way?

This had to be a huge breakthrough. And that was of course on the heels of a lot of interesting developments in the GPU space, in compute and storage, et cetera. All of that is what led to that happening at that point in time. You fast forward from there just a couple more years and you get to the huge transformer paper, "Attention Is All You Need," which came out of a sister team of mine at Google, and this was huge.

Right. And that's what led to our teams at Google creating BERT. And then, because BERT was open source and anybody could build on top of it, it led to these models increasingly becoming commoditized entities. We saw this huge explosion, with DistilBERT coming about, RoBERTa coming about, SpanBERT, and so many others.

We saw companies like PayPal coming up with what they call PayPal BERT. If you're thinking about GPT, what's called BloombergGPT is very similar, right? PayPal basically said: BERT is great, it's trained on the entire Wikipedia corpus, but that's not good enough for us. We're going to fine-tune this and train it specifically on PayPal's data, and that's what we need for our use cases to actually work out.

So, awesome: I'm going to pick up BERT off the shelf, it's super commoditized, but I'm going to fine-tune it on this specific data that I have, and that's going to work for me, my markets, my users, and my use cases. Similarly, stock news sentiment analysis with FinBERT came about. This entire explosion of models essentially led to the commoditization of these models as entities.

And so my team at Google was also just picking up BERT off the shelf, and that was the job done. We would experiment with maybe a couple of variations of it, but for the most part it was, again, what we called data detective work, where we would figure out what's the right data for us to train with and, once the model's in production, check again whether there are any data failure points for the model, et cetera.

So it always comes down to those first principles, and that's what happened there. This was, I think, back in 2017 to 2020 or 2021. In my head that was the first wave of large language models, because back in 2018 BERT was a very large language model. I guess now we're dealing with larger language models, and that's the only difference.

Because of that, it's much more powerful, and because of that, the stakes in terms of what can go wrong are also much, much higher. That's why we have to be very careful. Right after this, we started to see this huge wave of what data scientists kind of already knew and were talking about.

But I think Andrew Ng coined this really interesting phrase, data-centric AI. It was interesting because when we talked to more and more data scientists, they started to really gravitate towards this phrase, because they had already realized that their models and tools had become commoditized.

Good quality data, and knowing what data to train with, not quantities of data, became the barrier as well as the moat. Some companies just did not have enough data; what do you do there? For companies that did have proprietary data, that just becomes the moat for their business, a massive differentiator for themselves.

So the whole industry started gravitating towards this idea: great, these models are commoditized, so I'm going to focus more on the data flywheel that I can create here. If I don't have enough data, can I synthetically generate some of it? Can I spend a lot of money on labeling the data?

What do I do after that, when the labels are incorrect? How do I fix it? All of this started to come under this umbrella of data-centric AI, which Andrew Ng coined a couple of years ago. And this has also been at the heart of what Galileo does. It's essentially a very data-centric platform for building great machine learning models across the ML workflow.

Whether you're starting before you even label your data, or right after you've labeled it and you're training and iterating over and over again, we are in cahoots with the model, and the product automatically tells you what the model failure points are, so you're not spending weeks and weeks in data detective work.

That's the enemy for us. And then once your model is in production, again, we're telling you what's the right data to pick up and train with next. So it's very much in line with this entire movement. However, back to the present again: what we're seeing now is a new wave of models emerging.

Super exciting times, of course; I think all of us are feeling that. And we're again seeing, at a much faster pace than in the last iteration of this a couple of years ago, that the models are becoming commoditized way faster. GPT feels like it came about just a little while ago, and then LLaMA came about.

Google has its own, with many applications built on top of that to further market these different models, and then there's a whole host of different open source models coming up right on top of that. With this commoditization, what we've started to see is this going mainstream. And so, again in The Economist, about 10 years after that first quote that came up in 2012, there's this quote about how foundation models can do these magical things and that there are huge breakthroughs.

This was in June 2022, so The Economist had not yet caught on to this new wave being called large language models; they were calling it foundation models, similar to what Alex was mentioning before, where you can call it either.

I feel like it's good for the industry to gravitate towards one term, but at the heart they're kind of the same: they are the basis of what you can build on top of. So it's really interesting how we've gone through this cyclical process since the first iteration, where an interesting ML breakthrough happens.

People talk about it, a lot of attention comes from the mainstream, and then it gets commoditized very quickly. And now, again, we are getting into this phase where people are realizing that these models are great, but a lot of them are closed source.

So now, how do I use this for my own use case? How do I figure out what to train my models with? A lot of different open source models have been coming about recently, and some of this has been mentioned in a few talks before, but it's interesting: you have Databricks coming up with Dolly, for instance, and you have Alpaca, which came out of Stanford.

It's just super interesting to see. They've been trained with much smaller amounts of data, and much, much smaller amounts of money as well. They're basically fine-tuning the larger model on data that's coming from a large language model. BloombergGPT, again, is very similar to a couple of years ago: PayPal BERT came about then, and now you have BloombergGPT coming about, where they've trained it on their own financial data.

The model itself leaves a lot to be desired, and this is just the beginning. I'm sure this is rev one, and after this they're just going to keep iterating on it over and over again with better data, making sure that the prompts are better as well. Another example which I came across was IGEL, which is an instruction-tuned German LLM family.

So it's interesting how, for every single use case, folks are trying to find data so that they can train their own models and create open source models that others can then hone themselves, and there's going to be another proliferation, very similar to what we saw with the BERT era of doing things.

We're seeing that again with the GPT era of doing things. The difference is that the BERT era started with open source for the most part, whereas now we are seeing a bifurcation between the closed source APIs and the open source models. I feel like both of those make for healthy competition, and things will happen in the market.

But there's just going to be this huge proliferation of open source models and, on top of that, a lot of customers and companies, especially in the enterprise, trying to curate their own models for themselves so that they can have a differentiation and have that moat be built out.

Again, what we are noticing here is that, yet again, good quality data is emerging as the big blocker and the big moat for the industry. This is by Greg Brockman, an interesting quote which really hit home for me. For those who don't know, he's the CTO of OpenAI.

He was a co-founder of OpenAI, and he was the CTO of Stripe before, so he's not unfamiliar at all with this problem. We've been mentioning this before: 80 to 90% of what a data scientist does today is basically just staring at the data, trying to figure things out, trying to fix it, and then iterating on it over and over again, right across the entire ML workflow.

It doesn't matter if you're just starting out with the whole corpus of data, or iterating on the training cycles, or on the model production side of things: you're just constantly iterating with data. From a lot of our customers and partners today at Galileo, we've heard this referred to as the most soul-sucking part of the job, yet the most critical part of the job.

And this whole notion that the manual inspection of data has got to have the highest value-to-prestige ratio of any activity is true. It's a really, really hard problem to solve, and it's typically very, very manual, which is exactly the narrative we are trying to flip.

It doesn't have to be manual. We can automate a lot of this. We can make this much, much easier. And the same principles hold true even today. The other piece here is that when you're building LLMs, data becomes extremely critical.

I think more and more people who are in the weeds of building out ML apps, whether they're using predictive models or generative models, what have you, very quickly realize that the data is a very important piece. And now with these generative models, which I guess we're calling LLMs, it becomes important for a couple of reasons. So I'll touch upon that first: why it becomes really important.

Then I'll talk about which aspects of the entire flow, when you're creating these apps, are where we should all, as a community, really be thinking more about how we can invest, where research can push the envelope, and where tools can be built to help developers turbocharge their workflows and give them superpowers.

So the first part is why this is really important. As we all know, model hallucination is a very real thing. I noticed that in the prompt injection piece that we were doing, where very quickly we were running into these model hallucinations, and that happens all the time.

It reminds me of how fake news starts to spread across social media. You just don't know what to believe anymore. The model is kind of similar: here is an answer, and it's very confident about the answer, but how do you know whether to believe it or not?

But with the right kind of data and the right kind of prompting, those hallucinations can be reduced. It's just a matter of being very, very cognizant about that. So this is really important. The super popular case study around this is Google's AI chatbot, Bard, which made that factual error in its first-ever demo.

There've been tweets by Sam Altman and others about ChatGPT as well, along the lines of: do not necessarily believe everything it says. Bing, in the new release of its chatbot, also said to take these results with a grain of salt. So the mitigations around hallucination have started to emerge as legal text in these applications.

But as app builders ourselves, we can fix this by trying to be better about the inputs that are going into the LLM itself, which I'll talk about just after this. The other piece is that data becomes very critical when you're fine-tuning your LLMs to your use cases. Again, Stanford Alpaca was a great example of this.

We've seen a whole host of these, to be honest. Koala came out of Berkeley recently. It's a really interesting paper if you've not read it: it's about a dialogue model for academic research, specifically trained, I believe, on just a bunch of academic research papers. There's a Portuguese fine-tuned instruction version of LLaMA.

There's a Chinese instruction-following version of that as well. So the power of just putting a model out there so that you can fine-tune it is exactly this: it leads to a huge explosion of very individualized models for anybody else in the world to use, and that's where innovation can begin.

And now you can start to focus on your data on top of that, so you can fine-tune your LLMs. All of these folks basically had some corpus of data on which they could fine-tune their model. The third reason is to ensure the model responses are predictable and of the highest quality, because at the end of the day, especially when it comes to enterprise use cases, what we keep hearing from customers at Galileo and our partners is: I can't launch a model unless I have trust in it.

And this whole notion that, what is it, 10 or 20% of models reach production: it's not really about the lack of tooling along the way. In fact, I would argue there's too much tooling along the way to build great models. It's more about how you make sure that the model is of very high performance, and that's the biggest bottleneck.

How do you get there? Again, for that, data has become the biggest bottleneck to getting the model to that high performance, where a data science team can feel like: I can truly put this out there in production, and once it's in production, I have the wherewithal to make sure that it's always performing really well.

There are a couple of things we can do to be better at more data-centric LLM development, going back to Andrew Ng's coinage of the term data-centric and applying that here to LLMs. There are many different ways to do this. The way we've been thinking about this at Galileo has been that it always comes down to the inputs, right?

Earlier, when you had foundation models or transformers, the inputs used to be the model, the code, and the data. What we noticed was that the code was minimal and the model was commoditized, so the data became the hundred-thousand-pound gorilla in the room that you had to fix. Now what you're noticing is similar: the model is commoditized and the code is minimal, but the big inputs now are the data and also the prompts.

Right. My argument here has always been that we need to figure out how to balance both. You do need to fine-tune your models with a lot of data, but that can be difficult, because the practical reality is that sometimes you just don't have enough data.

And sometimes, if you have a lot of data, you don't know whether it's of high quality or not. So you can also just use prompting. Whether you go that extra mile to make sure that the model output is good depends on the use cases that you're working with.

However, it's really important to make sure that you're taking control of your inputs. What I mean by that: if you think of the prompts themselves, it's very critical to optimize for a few things and keep them top of mind. One is the structure of the prompt itself. Something which is really popular these days is just zero-shot prompting, and prompting the model that way.

But there's a whole host of research being done around how we can be better about the structure of the prompts themselves. This is somewhere a lot of the premier research labs that we talk to are active, and half of our team at Galileo is ML research that's constantly pushing the envelope on exactly this kind of stuff.

We keep thinking about this: what can we do that is better? One paper which we read recently, which is really, really interesting and which came out of my previous sister team at Google Brain, was on this idea of chain-of-thought prompting. It's a very simple concept.

It just means that, in your prompt template, instead of giving the example answer for a question like this one, where you're doing almost a math calculation, as just "Hey, the answer is 11," you reframe the answer to actually show the model the chain of thought that you used to come to the answer, and the model outputs can be much, much better.

So that's a slightly different structure for your prompts. It's a really interesting paper where they tested this out against a whole host of very large language models, and they noticed that, compared to standard prompting, it does much, much better, by an order of magnitude.

So this is an example of where I'd urge all of us to be a bit more cognizant of the structure that we are using in the prompts themselves, to look out for better research, and maybe to share across the community what the right kinds of structures are for these prompts, so that we can all benefit from it holistically.
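To make the structural difference concrete, here is a minimal sketch of a standard few-shot prompt next to a chain-of-thought one. The talk only says the example is a math calculation whose answer is 11; the wording of the word problem, the `Exemplar` class, and the `build_prompt` helper are all illustrative assumptions, not anything from the paper's code or from Galileo.

```python
from dataclasses import dataclass


@dataclass
class Exemplar:
    question: str
    reasoning: str  # worked steps, only shown in chain-of-thought prompts
    answer: str


def build_prompt(exemplars, question, chain_of_thought=True):
    """Assemble a few-shot prompt, optionally showing the reasoning in each example."""
    parts = []
    for ex in exemplars:
        if chain_of_thought:
            shown = f"{ex.reasoning} The answer is {ex.answer}."
        else:
            shown = f"The answer is {ex.answer}."
        parts.append(f"Q: {ex.question}\nA: {shown}")
    # The new question goes last, with the answer left open for the model.
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)


# Hypothetical exemplar: a word problem whose answer is 11.
exemplar = Exemplar(
    question=(
        "Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
        "Each can has 3 tennis balls. How many tennis balls does he have now?"
    ),
    reasoning="Roger started with 5 balls. 2 cans of 3 balls each is 6 balls. 5 + 6 = 11.",
    answer="11",
)

new_question = "A baker has 4 trays of 6 rolls and sells 9 rolls. How many rolls are left?"
standard_prompt = build_prompt([exemplar], new_question, chain_of_thought=False)
cot_prompt = build_prompt([exemplar], new_question, chain_of_thought=True)
```

The only thing that changes between the two prompts is what the example answer contains, which is the point being made: same model, same question, different prompt structure.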

The other piece is context and retrieval. Of course, the more context you can give the model, the better: the more examples you can give it and the more retrieval you can get, the better it is. Pinecone is an example. This is certainly not the official logo of Pinecone, but those who use retrieval mechanisms during prompting will know that it really helps in augmenting the outputs of the model.

But again, when you are using a vector database, or some kind of repo of data, for your examples and for giving context to the model, the question becomes: is that the right kind of context? If you're using a vector database, is there other context that you could use, which is maybe similar or close by in the embedding space, that you could pick up?

How can you be better about that? Being better about the context that you're providing, tweaking it, and maybe even fine-tuning it over the course of many different prompts is super critical. And the third one is just the instructions for these prompts: being able to give the right kind of examples and tweak them over and over again is super critical.
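As a rough illustration of picking context that is "close by in the embedding space", the sketch below ranks stored snippets by cosine similarity to a query vector and places the top matches into the prompt. It is a toy in plain Python: the three-dimensional vectors, the snippets, and the helper names are made up for the example, and a real setup would use an embedding model and a vector database rather than hand-written lists.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_vec, store, k=2):
    """Return the k (score, text) pairs most similar to the query vector."""
    scored = [(cosine(query_vec, vec), text) for text, vec in store]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


def build_context_prompt(question, retrieved):
    """Put the retrieved snippets ahead of the question as context."""
    context = "\n".join(f"- {text}" for _, text in retrieved)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


# Toy store of (text, embedding) pairs; the vectors are invented for illustration.
store = [
    ("Refunds are issued within 5 days.", [0.9, 0.1, 0.0]),
    ("Our office is in Berlin.", [0.0, 0.2, 0.9]),
    ("Refund requests need an order ID.", [0.8, 0.3, 0.1]),
]
query_vec = [1.0, 0.2, 0.0]  # stands in for the embedding of the question

retrieved = top_k(query_vec, store, k=2)
prompt = build_context_prompt("How do refunds work?", retrieved)
```

Logging the similarity scores next to each prompt is one simple way to start asking whether the retrieved context was the right context.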

So there's a lot there in just the prompts themselves, which is perhaps why there's this new term, prompt engineering, that's coming about. The whole reason for that is that there's a lot here in terms of what you could do to improve the output of the models and make it a little more predictable before you put it out there in the wild.

Apart from that, you can of course just fine-tune with labeled data. Number one, it's extremely expensive for the most part to use a labeling tool, and also very time-intensive sometimes. But at the same time, when you're fine-tuning, it's extremely critical to identify whether you are seeing any performance regressions caused by the fine-tuning of the model.

And especially now with LLMs, this is much more exacerbated, because these models have been trained with so much data, and the corpus is so much larger, that when you fine-tune with a certain kind of data, the model performance that you see is not going to be that great.

So you have to keep doing it over and over again to get to the point where you have a model that works really well for your use case, whether it's finance or the contact center or what have you. The second piece is, of course, incorrect ground truth. No matter what products you use, no matter if you have in-house labelers, if you have humans in the loop, if you're doing reinforcement learning, if you're doing RLHF, there will always be incorrect ground truth information, and this becomes a hair-on-fire issue.

How can you automate this? Are there ways to do that? How can you get faster at figuring out what the incorrect ground truth is? A lot of this is, again, very similar to what used to happen in the world of transformers, and still does for predictive ML. Those same first principles still apply.
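One simple way to automate a first pass over incorrect ground truth is to flag examples where a trained model confidently disagrees with the assigned label and send only those for review. The sketch below is a generic heuristic under that assumption; it is not Galileo's method, and the function name, the 0.8 threshold, and the toy probabilities are hypothetical.

```python
def flag_suspect_labels(probs, labels, min_confidence=0.8):
    """Return indices where the model confidently predicts a class other than the label.

    probs:  per-example class probabilities, e.g. [[0.9, 0.1], ...]
    labels: the assigned (possibly wrong) class index for each example
    """
    suspects = []
    for i, (p, y) in enumerate(zip(probs, labels)):
        predicted = max(range(len(p)), key=p.__getitem__)
        if predicted != y and p[predicted] >= min_confidence:
            suspects.append(i)
    return suspects


# Toy predictions for a two-class task (values invented for illustration).
probs = [
    [0.90, 0.10],  # agrees with label 0
    [0.15, 0.85],  # labeled 0, model is confident it is 1: flagged
    [0.55, 0.45],  # disagrees with label 1, but not confidently
    [0.05, 0.95],  # agrees with label 1
]
labels = [0, 0, 1, 1]

suspects = flag_suspect_labels(probs, labels)
```

A flagged example is only a candidate: the model itself may be wrong, so the output is a shortlist for a human to inspect rather than a list of corrections.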

And the last piece is maintaining data freshness and repeated fine-tuning. Again, the model's in production, but that's not the end of the day. You're not done and dusted; you can't just move on to the next thing. You have to keep monitoring it. You have to keep figuring out whether you need to fine-tune again.

What's the right data to fine-tune it with? That's again something that exists today with the world of foundation models as well, and with transformer models. But it is, again, super important as we are building LLM apps across the enterprise. And there's a whole lot more that you have to do on the fine-tuning side too.
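A small sketch of how the regression check mentioned for fine-tuning could look in practice: score the model before and after each fine-tuning round on the same evaluation set, broken down by data slice, and flag slices that got worse. The slice names, the 0.02 tolerance, and the records are assumptions made up for the example.

```python
from collections import defaultdict


def accuracy_by_slice(records):
    """records: iterable of (slice_name, is_correct) -> {slice_name: accuracy}."""
    totals = defaultdict(lambda: [0, 0])  # slice -> [correct, seen]
    for name, is_correct in records:
        totals[name][0] += int(is_correct)
        totals[name][1] += 1
    return {name: correct / seen for name, (correct, seen) in totals.items()}


def find_regressions(before, after, tolerance=0.02):
    """Slices whose accuracy dropped by more than `tolerance` after fine-tuning."""
    return {
        name: (before[name], after[name])
        for name in before
        if name in after and before[name] - after[name] > tolerance
    }


# Toy evaluation results on the same examples, before and after fine-tuning.
before = accuracy_by_slice(
    [("finance", True), ("finance", True), ("finance", True), ("finance", False),
     ("general", True), ("general", True), ("general", True), ("general", True)]
)
after = accuracy_by_slice(
    [("finance", True), ("finance", True), ("finance", True), ("finance", True),
     ("general", True), ("general", False), ("general", True), ("general", False)]
)

regressions = find_regressions(before, after)
```

In this toy run the fine-tuned model improves on the target slice but degrades on the general one, which is the kind of trade-off that an aggregate score alone would hide.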

So, the TL;DR here is: when you think about being data-centric, so to speak, in the world of LLMs, it's not just the fine-tuning with your actual labeled data, but also looking at the prompts, analyzing them from multiple angles, and making sure that you're good on both fronts. I know I'm at time, maybe even beyond; sorry, Demetrios, if I am.

But essentially, LLMs, we all know, are ushering in this new wave of baseline open source and closed source models, and the enterprise is benefiting from this. We all are, so these are very exciting times. I would say we should just get our data ready, have our first-principles hat on, and make sure that we are optimizing the different prompts that we are using as well as the data that we're constantly fine-tuning with.

So I'll stop here. I don't know if there are questions at all, but I'm happy to take any. There were a few questions that came through. Let me find them real fast, because I thought it was really nice.

So let us see. There was one that I wanted to find and now I can't find it. Of course, of course. I'm also available in Slack by the way, so if anybody has any other questions, feel free to reach out to me there too. There we go. There we go. So, uh, we can continue this conversation in the community conferences channel, but I think there was, um, uh, where there was some, there was some really good ones.

And of course, in the chat itself, I struggle. So in the meanwhile, if anyone has any questions, throw 'em in.

So we had some people singing the praises of Galileo. Then, okay, there was one person talking about how, when they fine-tuned BERT in the past and saw the results, they never thought we would get to the stage that we're at right now. So, oh man, there was a good one, but now I can't find it.

Okay. Anyway, for those people who have questions, go into Slack. Yes, please. Because I guess we're not gonna get it here. I've failed you as the moderator. I had one job, man. I had one job. No worries. You can see how good of a moderator I've been. Okay.

We got one coming in, now that it's been a minute. Have you seen regression in prompt performance when updating models? How do you manage that?

You do. You do. And it's a lot of experimentation, to be honest, right? Because you're constantly changing the kind of models you're using, and you have to test and evaluate those prompts.

I think the way to manage this, again, gets back to first principles. You have to have the right kind of metrics at your disposal. You have to have the right kind of tools to be able to check: which model did I use here? What was the response from the prompt, and for what input?

And then be able to make decisions around that. And that is something that we've been hearing again and again from industry practitioners: it's been really hard for them to manage this entire comparison between different prompts, because it gets into the hundreds very quickly, and then the thousands, and then you have multiple prompts, multiple models.

Totally. The matrix is just huge. And so it is partly a software engineering challenge, I think, which is exciting for the MLOps community. But I also think it's a little bit of an ML research challenge, and that's why we keep talking to the research community, because we need better metrics, in my opinion, to evaluate the prompts from a model perspective, to know which one is doing well and which one is not.

And honestly, right now, I've been asking the different research agencies and my former teams at Google, and there is very little out there in terms of what metrics you can use to evaluate these prompts across the board.
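The bookkeeping described above (which model, which prompt, what input, what response) can be sketched in a few lines of Python. This is only an illustration, not something shown in the talk: the `PromptRun` record, the `find_regressions` helper, the model and prompt names, and the 0.05 tolerance are all hypothetical, and the token-overlap F1 score is a simple stand-in for whatever metric a team actually trusts.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptRun:
    """One logged call: which model, which prompt, what input, what came back."""
    model: str
    prompt_id: str
    input_text: str
    response: str
    reference: str  # expected answer used for scoring (assumed to be available)


def token_f1(response: str, reference: str) -> float:
    """Token-overlap F1 between response and reference (stand-in metric)."""
    resp, ref = response.lower().split(), reference.lower().split()
    if not resp or not ref:
        return 0.0
    overlap = sum((Counter(resp) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(resp), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def mean_scores(runs, metric=token_f1):
    """Average the metric for every (model, prompt_id) cell of the matrix."""
    buckets = defaultdict(list)
    for run in runs:
        buckets[(run.model, run.prompt_id)].append(metric(run.response, run.reference))
    return {key: sum(vals) / len(vals) for key, vals in buckets.items()}


def find_regressions(runs, old_model, new_model, tolerance=0.05, metric=token_f1):
    """Return prompts whose mean score dropped by more than `tolerance`."""
    scores = mean_scores(runs, metric)
    regressions = {}
    for (model, prompt_id), old_score in scores.items():
        if model != old_model:
            continue
        new_score = scores.get((new_model, prompt_id))
        if new_score is not None and old_score - new_score > tolerance:
            regressions[prompt_id] = (old_score, new_score)
    return regressions


# Hypothetical log: two prompts evaluated on two models with the same input.
reference = "the cat sat on the mat"
runs = [
    PromptRun("model-a", "summarize-v1", "doc 1", "the cat sat on the mat", reference),
    PromptRun("model-b", "summarize-v1", "doc 1", "a dog ran", reference),
    PromptRun("model-a", "summarize-v2", "doc 1", "the cat sat", reference),
    PromptRun("model-b", "summarize-v2", "doc 1", "the cat sat", reference),
]

print(find_regressions(runs, "model-a", "model-b"))
```

With this toy log, only `summarize-v1` is flagged when moving from `model-a` to `model-b`. Keeping the metric as a parameter means the same records can be re-scored when a better metric comes along.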

So people are kind of defaulting to the usual BLEU scores and other things like that. But I think something new needs to come about and help there. For now, software engineering tools are the best bet.

Dude, well, the big question that I have is: do you feel like prompting is here to stay? Like, are we not gonna just move past prompting eventually?

Move past it? There's a big debate there, Demetrios. I think you opened up a can of worms. Literally, amongst the people that we talk to, there are two different camps. There are some people who say that that's all you need.

You don't need to fine-tune models anymore. That, like, you know, the parties have arrived, throw any data into this. We just need to do inference with prompts. Some folks say prompts are stupid; you just always have to fine-tune your model. I believe the reality of today is that you need to do a bit of both.

It depends on your use case. If you're building a very high-intensity app, where the downside of getting something wrong is really high, you need to do a lot of fine-tuning to make sure that you can get to the 99th percentile or the 99.5th percentile. But I think the here and now is that prompting is very, very important.

It is a real part of the workflow, so we can't ignore it. I do think it's gonna stay for at least a while, until models just get that good, which is probably gonna happen. Again, the market is changing so fast. We have to kind of index on the here and now, and I think at least for the next one to two years, it's gonna be a thing. But it's just gonna get better and better, and tools are gonna get better and better for helping people do prompting faster, and maybe it'll be automatic at some point.

Here's the prompt you should use. Yeah.
