MLOps Community

You Can’t Improve What You Don’t Measure — Evaluating Quality and Improving LLM Products at Scale

Posted Mar 11, 2024 | Views 209
# Evaluation
# LLM
# Slack
SPEAKERS
Austin Bell
Staff Software Engineer, Machine Learning @ Slack

Austin Bell is a Staff Software Engineer at Slack focused on building out text-based ML and generative AI products. Previously, Austin was Head of Engineering at Fennel, an ESG modeling and mobile investment app, and held various roles leading teams developing data science solutions and ML products at the Analysis Group, an economic consulting group specializing in the legal, economics, and healthcare industries. Austin began his career as an economist focusing on competition and healthcare economics.

Austin has also led initiatives at Columbia University leveraging machine learning techniques to better diagnose, evaluate, and analyze skin disorders, and has collaborated closely with the Haitian government and leading non-profit clinics in Haiti on research predicting patient success in HIV treatment and on modeling the spread of COVID-19.

Austin graduated with a Bachelor’s Degree in Economics and obtained a Master’s Degree in Computer Science and Machine Learning from Columbia University.

Demetrios Brinkmann
Chief Happiness Engineer @ MLOps Community

At the moment, Demetrios is immersing himself in machine learning by interviewing experts from around the world in the weekly MLOps.community meetups. Demetrios is constantly learning and engaging in new activities to get uncomfortable and learn from his mistakes. He tries to bring creativity into every aspect of his life, whether that is analyzing the best paths forward, overcoming obstacles, or building Lego houses with his daughter.

SUMMARY

How do you know that your prompt change or different pre-processing technique improved your LLM output? And if you don’t know, how can you deploy product changes with confidence to ensure that you’re consistently improving your generative products?

TRANSCRIPT

You Can’t Improve What You Don’t Measure — Evaluating Quality and Improving LLM Products at Scale

AI in Production

Demetrios 00:00:05: We've got Austin coming up to talk about what you don't measure, you don't improve. Or I may be paraphrasing. Is that what it is, Austin?

Austin Bell 00:00:14: Yeah. Overall goal is to essentially say if you're not measuring the quality of your LLM outputs, it's tough to continue to improve or know that you are actually improving them.

Demetrios 00:00:25: There we go. So, for those who do not know, the interesting piece about what Austin is doing, he's working on Slack, or at Slack, I guess, would be the proper preposition. And the cool thing here is that I use this every day. We do have a very large community presence in Slack. So if you're not in it, here's the moment where I show you a QR code to join our MLOps Community in Slack. So all this cool stuff that Austin is going to talk to us about right now over the next ten minutes will be us putting it into practice, and we're going to go try and break it. Austin, whatever you are measuring, we're going to go try and mess that up for you just to make your life a lot more fun. So I'm going to share your screen right now, and I'll jump back in ten minutes, man.

Austin Bell 00:01:21: Perfect.

Demetrios 00:01:24: All right, see you in a bit.

Austin Bell 00:01:26: Great. Thank you for that introduction, and a tough presentation to follow. A brief introduction about myself: my name is Austin. I'm a staff software engineer at Slack on the machine learning modeling team, helping build out some of the generative AI products that we're building here. Some of what I'm going to be talking about today is how we evaluate the quality of a lot of our generative AI products and how we use these sort of automated evaluation techniques to continue to improve our LLM products at scale. And so to start, I just wanted to give a brief motivation of why this is an actual problem that we care about. The first being we all improve our generative products through a variety of different approaches, whether it's through prompt engineering, using different pre- or post-processing techniques, or leveraging different machine learning models.

Austin Bell 00:02:22: And we want to be able to evaluate the impact that these different approaches have on the outputs from our generative-based products. The difficulty comes from the fact that each user experience is very random and unique, but that unique experience is actually what we want from AI. You can't just develop ten examples, evaluate on those, and say you're good to go. This leads us into a catch-22, where without being able to properly evaluate the quality of our changes or of our generative-based products, it's very difficult to drive continual product improvements at scale. So the rest of this presentation is going to go through a variety of topics: first, very briefly discussing why it's difficult to evaluate a lot of these LLMs, then how we are thinking about LLM evaluation, and then bringing it all together to highlight what our LLM development lifecycle looks like at a 10,000-foot level. To set the stage, our focus is primarily across two generative-based products that we're building. The first is around summarization of user messages. So think about the ability to summarize an entire channel, or a period of a channel, or a thread. The second is enabling natural language in search.

Austin Bell 00:03:53: So having a question-answer product where you can actually ask a question in your Slack search bar and receive an LLM-based response. And so why is this evaluation difficult? I think it comes down to a variety of different reasons. The first is largely subjectivity. What I think of as a good summary may be very different from what you think of as a good summary. Where I may want a very short, concise, couple-of-bullet-point summary for a particular channel or thread, another user may instead want a very comprehensive summary that prevents them from having to go and read all of the underlying messages. Both of these could be good summaries, or good responses in a question-answer system, and the evaluation should largely be driven by the context and the user needs for that particular product. Outside of this subjectivity, there are a lot of more objective measures that we should be evaluating on and that are needed to determine what is a good summary or a good question-and-answer response.

Austin Bell 00:05:05: For example, ensuring that your summary or response is actually accurate; that it is coherent, with good language and good flow, and not making grammatical errors; or that it's actually relevant to the initial user query. In a question-and-answer product, for example, you want to ensure that your answer is actually responding to the user's underlying question. And so how do we think about this? We take these larger concepts and start to break them down into much smaller and more tractable problems, some of which you may already be familiar with, such as tackling hallucinations. LLMs tend to make stuff up rather than saying that they don't know. But we also try to tackle a variety of Slack-specific needs. We want our LLMs to be able to integrate with the rest of the Slack ecosystem. Ensuring that they output correctly formatted user IDs, channel IDs, or even message IDs allows us to integrate the outputs of these summaries or responses into the rest of Slack. And so the goal here is to take these really specific and small problems and start to develop a set of automated quality metrics that we can then run on each individual summary or question answer to generate either a composite or a broken-down quality score for a variety of our generative-based products.
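
To make that concrete, here is a minimal, hypothetical sketch of one such automated format metric, checking that user and channel references in a summary use Slack's mention syntax (e.g. <@U024BE7LH>, <#C024BE7LR|general>) rather than bare handles. The regexes and scoring are illustrative assumptions, not Slack's actual implementation:

import re

# Hypothetical format checks: reward summaries whose user/channel references use
# Slack's mention syntax so they can be rendered inside the rest of the Slack UI.
USER_MENTION = re.compile(r"<@U[A-Z0-9]+>")                 # e.g. <@U024BE7LH>
CHANNEL_MENTION = re.compile(r"<#C[A-Z0-9]+(?:\|[^>]*)?>")  # e.g. <#C024BE7LR|general>
BARE_HANDLE = re.compile(r"(?<!<)@[A-Za-z0-9._-]+")         # "@alice" outside mention syntax

def format_quality(summary: str) -> dict:
    """Per-summary format metrics; format_score is in [0, 1], higher is better."""
    well_formed = len(USER_MENTION.findall(summary)) + len(CHANNEL_MENTION.findall(summary))
    malformed = len(BARE_HANDLE.findall(summary))
    total = well_formed + malformed
    return {
        "well_formed_mentions": well_formed,
        "malformed_mentions": malformed,
        # A summary with no mentions at all is treated as perfectly formatted.
        "format_score": 1.0 if total == 0 else well_formed / total,
    }

print(format_quality("Ask <@U024BE7LH> in <#C024BE7LR|general>, not @alice."))
# {'well_formed_mentions': 2, 'malformed_mentions': 1, 'format_score': 0.666...}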

Austin Bell 00:06:34: And so, diving into what this could look like from a hallucination standpoint, I think a lot of the following presentations will go into depth about this. But just to highlight it quickly, a lot of research has shown that being more specific about your evaluation measures will lead to better evaluation outcomes. Essentially, what that means is: instead of focusing on capturing hallucinations as a whole on your generative outputs, how can we break this down into a variety of subcategories that are particularly relevant to our needs and to each product's needs? Highlighting a few of these: one is focusing on what we call extrinsic hallucinations. Oftentimes we don't want the LLM to generate text outside of the provided context. We don't want it to go into its underlying knowledge base, and we leverage machine learning models to capture whether or not it's doing this. Another is whether the LLM is providing incorrect citations. In a question-and-answer system, you may actually be providing the right answer, but citing the wrong message with the wrong reference, preventing a user from going down their own rabbit hole. And so we tackle these hallucination questions through a variety of different approaches. One very common one is leveraging LLMs as evaluators.
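
As an illustration of the LLM-as-evaluator idea, a minimal judge sketch might look like the following. The prompt wording, the GROUNDED/HALLUCINATED labels, and the model name are placeholder assumptions rather than Slack's actual evaluator:

from openai import OpenAI  # assumes the OpenAI Python SDK; any chat-capable model/provider would do

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = """You are grading a Slack summary for extrinsic hallucination.

Context messages:
{context}

Summary:
{summary}

Does the summary contain any claim that is NOT supported by the context?
Answer with exactly one word: GROUNDED or HALLUCINATED."""

def flags_extrinsic_hallucination(context: str, summary: str, model: str = "gpt-4o-mini") -> bool:
    """Return True if the judge model thinks the summary goes beyond the provided context."""
    response = client.chat.completions.create(
        model=model,  # placeholder model name
        temperature=0,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(context=context, summary=summary)}],
    )
    verdict = response.choices[0].message.content.strip().upper()
    return verdict.startswith("HALLUCINATED")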

Austin Bell 00:08:00: These big LLMs are kind of huge, though, and maybe you don't have the compute or scale to be able to evaluate at the scale that you want, but there are a variety of sampling techniques that allow you to get a good score from them. We also use natural language inference modeling to be able to answer these questions at a much higher scale. And so once we can evaluate and generate these quality metrics for each individual summary or response, the question then becomes: at what part of our Slack system do we actually want to run these evaluations? We have identified around three different areas where we run these quality metrics. The first is what we call golden sets, which are a small sample of Slack messages where we can actually see the underlying data and the resulting summaries. That set is very, very small and allows for very quick prototyping. The next is a much larger and more representative validation set, typically comprising around 100 to 500 samples, where we can no longer see the underlying data, but we can rely on our automated quality metrics to understand whether or not we're actually driving continual improvement. And then the last is something that I think we're all familiar with, which is leveraging A/B testing and using quality metrics to understand whether or not the experiment is actually driving continued product improvement. What this does is give rise to a vertical evaluation process that allows us to ensure that at each stage gate we are driving continual product improvements and that we're moving forward with confidence.
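
For the natural language inference approach, a sketch along these lines is possible with an off-the-shelf NLI model; the specific model, label names, and threshold are assumptions, and a production system would batch and sample rather than score one sentence at a time:

from transformers import pipeline  # Hugging Face Transformers; the model choice below is illustrative

# An off-the-shelf NLI model stands in for whatever entailment model a team actually uses.
nli = pipeline("text-classification", model="roberta-large-mnli")

def claim_is_grounded(context: str, claim: str, threshold: float = 0.5) -> bool:
    """Treat a summary sentence as grounded if the context entails it."""
    output = nli({"text": context, "text_pair": claim})  # premise / hypothesis pair
    result = output[0] if isinstance(output, list) else output  # return shape varies by version
    return result["label"] == "ENTAILMENT" and result["score"] >= threshold

def extrinsic_hallucination_rate(context: str, summary_sentences: list[str]) -> float:
    """Fraction of summary sentences that the provided context does not entail."""
    if not summary_sentences:
        return 0.0
    ungrounded = sum(not claim_is_grounded(context, s) for s in summary_sentences)
    return ungrounded / len(summary_sentences)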

Austin Bell 00:09:35: And having this stage gate and prototyping capability allows us to fail fast, prototype a variety of different approaches very quickly, and only push forward the ones that we actually think will drive value according to our quality metrics. So what does this look like in practice? For example, we recently introduced extractive summarization as a preprocessing technique for LLM summarization when the context sizes are way too large, and leveraging our integrated automated quality metrics, we were able to demonstrate no significant reduction in quality while also showing significant improvements in the format capabilities of a lot of our LLMs and an overall improved user experience. The goal is to take this continual approach and repeat it with every single generative product improvement that we try to implement. I guess I'll stop there with just a few takeaways: that evaluation is very key to being able to ensure that you're driving incremental improvements, and that standardizing your development cycle, ensuring that you can prototype and fail fast, will allow you to continue building better and better products. Thank you.
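
Putting the pieces together, here is a minimal sketch of that stage-gate comparison, assuming per-sample metrics have already been computed on a validation set; the metric names, weights, and tolerance are invented for illustration:

from statistics import mean

# Hypothetical stage gate: a candidate change (e.g. adding extractive summarization as a
# preprocessing step) is only promoted past the validation set if its composite quality
# score does not regress relative to the current baseline.
METRIC_WEIGHTS = {"grounded": 0.5, "citation_accuracy": 0.3, "format_score": 0.2}

def composite_score(per_sample_metrics: list[dict]) -> float:
    """Weighted average of per-sample quality metrics, each assumed to be in [0, 1]."""
    return mean(
        sum(METRIC_WEIGHTS[name] * sample[name] for name in METRIC_WEIGHTS)
        for sample in per_sample_metrics
    )

def passes_stage_gate(baseline: list[dict], candidate: list[dict], tolerance: float = 0.01) -> bool:
    """Promote the candidate only if it scores at least as well as the baseline (within tolerance)."""
    return composite_score(candidate) >= composite_score(baseline) - tolerance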

Demetrios 00:11:10: Oh, dude, you're preaching to the choir. That is so good to see, because, as a lot of people on here know, I've been pushing our evaluation survey that we're doing. Did you see this yet, Austin? I don't know if you saw this.

Austin Bell 00:11:28: I did not.

Demetrios 00:11:28: Can I show you?

Austin Bell 00:11:29: I'll be very interested. Yes.

Demetrios 00:11:31: All right, man. Well, here you go. We've got a QR code, and I will drop the link in the chat for anyone that wants to do it. Take five minutes. Fill out this survey on how you are evaluating your system, not just your models. Right. It's so much more than just, like, evaluating the output of the model. You got to think in systems, not in models.

Demetrios 00:11:54: And that's pretty clear from what we've been seeing. There's some preliminary data that if I could, I will show you some of the stuff that people have been saying. I just have to share my screen and hopefully share the right screen, because, yeah, I tend to not do that and then later regret it. So here we go. I'm going to kick off your shared screen, and I'm going to do the fun part of trying to get my shared screen up. Let's see if this works real fast and I can share it.

Austin Bell 00:12:35: I think, just to echo your point, this end to end system evaluation is becoming a huge priority over just model evaluation to understand whether or not your generative products are actually driving value.

Demetrios 00:12:49: Yeah, dude, that's it. That is so true. You can't take one model and be like, oh, wow, look at how well this did on the benchmarks, 100%, and then expect that to actually matter when it's plugged into your system. So I think I've got this up. Here's some of the preliminary data, and I'll go through it with you real fast. And then next up, we've got Philip, who is going to be coming on and talking to us about going from proprietary to open-source models. And I don't think I can share because of some permission issue. The funny part is, here's the irony of it all, you speakers.

Demetrios 00:13:40: I make you sit through stuff about how you've got to get your permissions right, and then I now cannot share my screen because of permissions issues. I don't think it's working. Oh, what a bummer. So I'll figure that out, and I'll do it. Austin, sorry, I'm not going to be able to do it with you, but I think you get the idea. I'll work this out in the meantime. But I would love it if you filled out this evaluation survey, too, Austin. Some of the cool things that I can tell you about are that we're asking people about the data sets that they're using, how you're coming up with these data sets, what kind of benchmarks you're using, what kind of metrics you're tracking, all of that.

Demetrios 00:14:24: So it's very conducive to what you're talking about.

Austin Bell 00:14:31: That's awesome. We'll definitely fill it out. And super interested in hearing the results. Such a new field that we're all learning and so excited to hear what everybody else is saying.

Demetrios 00:14:43: Cool, man. Well, we're going to keep it rocking, and I'm going to kick you off the stage and keep on going. There are some questions that came through in the chat for you. So if you happen to meander over to the chat, we'll let you answer those in there. And for now, we're going to keep on rolling, folks. Close.
