MLOps Community

Evaluating Evaluations

Posted Aug 15, 2024 | Views 91
# Evaluations
# LLMs
# AI
# Notion
Linus Lee
Research Engineer @ Notion

Linus is a Research Engineer at Notion prototyping new interfaces for augmenting our collaborative work and creativity with AI. He has spent the last few years experimenting with AI-augmented tools for thinking, like a canvas for exploring the latent space of neural networks and writing tools where ideas connect themselves. Before Notion, Linus spent a year as an independent researcher, during which he was Betaworks's first Researcher in Residence.

SUMMARY

Since our first LLM product a year and a half ago, Notion's AI team has learned a lot about evaluating LLM-based systems through the full life cycle of a feature, from ideation and prototyping to production and iteration, from single-shot text completion models to agents. In this talk, rather than focus on the nitty-gritty details of specific evaluation metrics or tools, I'll share the biggest, most transferable lessons we learned about evaluating frontier AI products, and the role evals play in Notion's AI team, on our journey to serving tens of billions of tokens weekly today.

TRANSCRIPT

Linus Lee [00:00:10]: Excellent.

AIQCON Male Host [00:00:10]: So I really enjoyed Linus's bio. He said he spent a few years thinking about creative tools using AI for writing. And obviously a lot of folks are using Notion. Who uses Notion? A lot of folks, right? Our volunteer document was a Notion doc. So, Evaluating Evaluations. Thank you, Linus. Take it away.

Linus Lee [00:00:29]: Cool. Excited to talk to all of you about evaluations. Before I get started, if I press this, will it go forward? Let me do two things. One is I'll start my timer. There we go. So with that, I want to talk about evaluations. My name's Linus. I work on AI at a company called Notion.

Linus Lee [00:00:52]: We build a collaborative, connected workspace for your knowledge work. Before that, I spent some time doing independent research in model interpretability, interfaces, and how to build reading and writing tools with AI. At Notion, over the last year and a half, we've built all different kinds of AI products. We started out with something we call AI Writer, which lets you call ChatGPT-style conversational writing assistance inside your doc. We have something like that for Notion databases, which lets you do that with structured data. We more recently launched a thing called Q&A, which gives you the power to ask questions that retrieve over information in your Notion workspace. If you haven't tried it, you should really try it. And then lastly.

Linus Lee [00:01:30]: A couple weeks ago, we launched a thing called AI Connectors, which lets you use Notion AI to ask questions about other data sources your team has, like Slack and Drive. And over the course of building these tools, we've built up the muscle to do evaluations at all different stages of the pipeline, from a model that came out of research all the way down to production. I'm not going to talk about each of these in detail, although you can ask questions about them at the end if you're curious. What I do want to talk about is, in the process of ending up with this collection of tests, what we've learned about doing evaluations as a team that's building an AI product, lessons that I think might be transferable to what you want to do inside your teams. So if you ask what evaluations are for inside a product team, I think we have to start with research and then sort of walk backwards through three perspectives that I think are effective ways of thinking about evaluation. The first is evals as a coordination mechanism between teams. Second, evaluations as a way to communicate within a team about what's important. And the third, within this AI community, is thinking of evals as a reflection of our values.

Linus Lee [00:02:38]: So let's talk about evals as a coordination mechanism. When academia first started working on deep learning models, they had this problem that there were a lot of papers and people doing research on how to build better models, but they couldn't compare research project A with paper B because there were a lot of things that were different: a different model architecture on a different dataset, evaluated on a different metric. They needed a way to solve this coordination problem, to say, if we want to evaluate an architecture, hold everything else constant and just vary this one thing, so that we can really tell whose architecture is better. And so evals are solving a coordination problem. If you look at academic datasets and evaluations like ImageNet for computer vision, SST and SNLI for natural language understanding, or more recently things like MMLU, the point of all of these datasets is not necessarily to pick out the best model in general. It's more to say, here's a specific task that we care about on a domain of information that we care about, and let's use it to coordinate our research work so that we can all agree on the standards that we're measuring our work by, and on what's important and what's not in a particular research project. So the point of academic evaluations is to define and work around a clear, well-scoped research goal.

Linus Lee [00:03:51]: Of course, these are human-labeled datasets, and there are problems and flaws with that. More recently, looking at generative models that are more capable, there are also statistical evaluation methods that people have been using, things like perplexity and FID and other metrics along those lines. Instead of using human-labeled datasets, these reference the model's output against some real-world distribution: a distribution of text or images, or of embeddings or other kinds of data points. But all of these academic benchmarks have one thing in common, which is that you assume the evaluations are the thing you want to optimize towards. You assume the evaluation is truth. Of course, these are all about evaluating model quality. There are also lots of academic benchmarks for evaluating other things. Instead of holding everything else constant and evaluating the model, we can hold other things constant.

Linus Lee [00:04:41]: You can hold the model constant and evaluate whether a dataset is better or worse for a particular task, or we can evaluate whether a piece of infrastructure or a software framework is better or worse, using benchmarks like MLPerf. You can use evals to tell which algorithm is better, which optimizer is better, or which reinforcement learning algorithms are better. And in all of these cases, academic benchmarks exist under this implicit social contract that evals are your source of truth. If you actually look inside a lot of these datasets, there are mislabeled examples and there are weird, very hard-to-call edge cases. But that doesn't matter, because in academia, evals are a way to coordinate work between groups, and we're not necessarily trying to build the best model for a particular real-world use case. When we're building AI products like we are at Notion, there's a different way to think about evals that I think is a bit more effective, and that's to use evals to communicate within a team about what's important for our users. In product teams, evals are how we communicate, we being ML researchers, engineers, product engineers, and product designers, about the goals and requirements for a particular user and a particular use case of an AI product. So let's take an example.

Linus Lee [00:05:52]: I spent a while at Notion looking at summarization as a problem we wanted to solve for people. The naive way to go about this: maybe you say, hey, we need to summarize a bunch of Notion pages. And initially I might have said, okay, no problem, there are a bunch of models that are trained for the task of summarizing documents, let's take one of those. And then a user or product manager might say, it's missing some detail in the meetings, it's also adding unnecessary detail, the outputs feel a little bit too journalistic. And then we're like, okay, that's fine, the model's probably doing that because it's trained on news datasets and it's trained on English, and we can fine-tune it on some of this other data to augment it.

Linus Lee [00:06:27]: And then the user might say, okay, this is better, but it didn't really work for my meeting docs, it only worked for my web bookmarks. And we can go back and forth, and there's this communication problem between what the users want and what the research models that we started out using were trained on. So there's this gap between the more academic, general-purpose benchmarks that we see these models published with, and the specific use cases and tasks that we care about as a team building a product. If you're looking at a summarization model, you might see the models published with benchmark scores on datasets like CNN/Daily Mail, which is about news and has the very specific style of summarization that news articles prefer. There are datasets like arXiv and PubMed, which are for evaluating summarization performance on longer documents, but they also obviously are a very specific kind of writing. At Notion, we really care about a very different kind of document: we care about summarizing meetings, where the first five minutes of a meeting may not matter at all because people are catching up about what's happening in their lives.

Linus Lee [00:07:31]: But who said what really matters, numbers really matter, decisions really matter. If we are summarizing notifications, then it might be okay to drop some unimportant information, and brevity is what's really important. If we're summarizing status reports, then numbers are really important and progress updates are really important, but maybe the overall structure is less important. So there's a gap between what the users want, our understanding of what the users want, what we want in a product, and what academic evaluations say. And as people building products, I think we really need to recognize this gap and then work towards evaluations and evaluation datasets in our teams that speak more to what the users want and less in the jargon of academia. With this insight, the process might look more like: okay, we start out with "we need to summarize Notion pages," and instead of jumping straight to a model, we could instead say, no problem: what kind of Notion pages are we summarizing? What makes a good summary? Do people want bullet points? Do people want all the detail? What's the ideal length of a summary? What languages do we want to support? And then we could have a productive conversation about the specific shape of the task that we want our model to perform, which may be different for different use cases and different kinds of products.

Linus Lee [00:08:36]: It's not just summarization, it's summarizing meeting notes, or it's summarizing notifications. And based on that, we can build an evaluation dataset that accurately reflects those expectations and find score metrics that let us preserve our insight about the problem that we want to solve for our users. So in effect, this is the rough process that I've found the Notion AI team follows when we're trying to go from an idea to a deployed AI product, organized around evaluations. First, we start out by defining a use case; for summarization, maybe we want to summarize meeting notes. Then, based on our initial intuition about what people want to do with the product, we gather some initial inputs: just engineers sitting around a table generating a bunch of examples, 10, 20, 50 examples, of what kinds of documents we think people will be summarizing, or what kinds of questions we think people will be asking about their documents.

Linus Lee [00:09:25]: And based on that initial pool, which is probably wrong but a good place to start, we launch a version of our product internally. Notion runs a lot of our operations on Notion, so we have a pool of 600-700 people we can initially test with. Then we have this iteration loop where we deploy an early version of the product internally and see how people use it. We have various methods of collecting failures that happen with the product, and those failures critically inform our understanding of the task that we're trying to do. For example, if we're trying to summarize things, we might realize there are lots of kinds of documents that people want to summarize that we didn't initially expect, or maybe we expected a different language distribution than the one we were testing with. So in this loop, we iterate on collecting data, iterate on the system, iterate on architecture and prompts, until we feel like we have solved the current version of our understanding of the task; then we deploy it internally, collect failures, and repeat this loop until at some point we look at our evaluations and say: one, our evaluations accurately reflect what the user wants the model to do, and two, we're scoring well enough on our evaluations to make us comfortable releasing the product into the wild. And in this loop, I want to talk more specifically about where we've gotten the most wins in this process, which is how we collect failures.
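To make the shape of that loop concrete, here is a minimal sketch of what a seed evaluation set and a regression check could look like. Everything here is hypothetical and simplified, not Notion's actual tooling: `EvalCase`, `keyword_coverage`, and the `summarize` callable stand in for whatever dataset format, scoring metric, and pipeline a team actually uses. The point is only that each case encodes an expectation about the task, and failures collected from internal use get added back as new cases.

```python
from dataclasses import dataclass

@dataclass
class EvalCase:
    doc: str                 # input document, e.g. meeting notes
    must_include: list[str]  # facts a good summary has to preserve
    notes: str = ""          # why this case exists (often a past failure)

def keyword_coverage(summary: str, case: EvalCase) -> float:
    """Crude deterministic score: fraction of required facts that survive."""
    hits = sum(1 for fact in case.must_include if fact.lower() in summary.lower())
    return hits / max(len(case.must_include), 1)

def run_regression(cases: list[EvalCase], summarize) -> float:
    """Run the current pipeline over the whole set and report a mean score."""
    scores = [keyword_coverage(summarize(c.doc), c) for c in cases]
    return sum(scores) / max(len(scores), 1)

# Seed the set by hand (10-50 examples), then grow it with real failures
# collected from internal dogfooding.
seed_cases = [
    EvalCase(
        doc="Standup 6/3: Priya shipped the importer; launch moved to Friday.",
        must_include=["importer", "Friday"],
        notes="decisions and dates must survive summarization",
    ),
]
```

In practice the scorer and dataset format would be richer than this, but the loop is the same: run the set on every candidate change, and promote every interesting production failure into a new case.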

Linus Lee [00:10:45]: When we deploy a version of one of our AI models internally or to an alpha group, there are a few different places where we look to collect failures and learn more about the task that we want our models to do. We have continuous evaluations and logs, which I think is probably the highest-leverage thing that we've built in-house: deterministic evaluations that run on every single inference in production, and language-model-based ones, LLM-as-a-judge evaluations, that run on some small sample of all the inferences that we serve. These tell us if there are any obvious failures and whether error rates are going up or down; they're the earliest warning signs that something might be going wrong and you might want to take a look. We also obviously lean a lot on user interviews and just understanding our customers, or even our internal users outside of the AI team but in the rest of the company: what are you trying to use Notion AI for that it's not doing correctly? What did you not like about the summary, or about this kind of question? What are some things you want to be able to do that the model doesn't currently seem to support? These give us a higher-fidelity understanding of what the task looks like. We also look at thumbs-up, thumbs-down feedback from production like everybody else, but these are relatively low in number even at our scale, so they're better for discovering really edge-casey, tail-of-the-distribution use cases and problems than as the first source we run to for understanding a model.
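As a rough illustration of that split between cheap deterministic checks on every inference and sampled LLM-as-a-judge scoring, here is a minimal sketch. The specific checks, the 1% sample rate, and the `judge_queue` are hypothetical placeholders, not a description of Notion's actual system.

```python
import random

JUDGE_SAMPLE_RATE = 0.01  # send ~1% of production inferences to an LLM judge

def deterministic_checks(prompt: str, output: str) -> list[str]:
    """Return failure tags; an empty list means no obvious problem."""
    failures = []
    if not output.strip():
        failures.append("empty_output")
    if len(output) > 4 * max(len(prompt), 1):
        failures.append("suspiciously_long_output")
    if output.strip().lower().startswith(("i'm sorry", "as an ai")):
        failures.append("refusal_or_boilerplate")
    return failures

def on_inference(prompt: str, output: str, judge_queue: list) -> list[str]:
    """Run on every inference: tag obvious failures, sample some for judging."""
    failures = deterministic_checks(prompt, output)
    # Emit these failure tags to metrics; rising rates are the early warning sign.
    if random.random() < JUDGE_SAMPLE_RATE:
        # Scored offline later by an LLM-as-a-judge prompt.
        judge_queue.append({"prompt": prompt, "output": output})
    return failures
```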

Linus Lee [00:12:08]: And then lastly, we do some in-house adversarial testing, by which we mean we have a bunch of people try to break the model and get it to do weird things that we didn't expect it to do. All of this gives us failure cases that we can discover and then take back to the first step and say, okay, we know these are cases we want to support and we're not good at them yet; let's iterate on the problems, or iterate on the pipeline, to make them better. In building this pipeline, there are a couple of insights that I think we've gathered as a team. The first is, especially if the product you're building is the kind of thing that a lot of the rest of your team would use, it really pays off to deploy something that might not be quite up to par early, internally or to a small group of users, and use them early enough to refine your understanding of the task. The point of an evaluation really is to make sure that whatever you're measuring about your model is an accurate, detailed, and full representation of the task you want the model to perform. So the better you understand the task that you want the model to do, the better it is for your product. Second, we started to really view defining the evaluation dataset and defining the behaviors you want to eval for as part of the spec of our product.

Linus Lee [00:13:18]: When you build a traditional software product, you might say, here are all the things that can go wrong, here are all the error screens, and here are the ways that we recover from those. When you're defining behavior for a language model, I think there's an equivalent way to say: here are all of the edge cases that we might encounter, like weirdly long documents, weirdly formatted documents, or weird kinds of questions that we don't know how to answer, and here are the ways that we want our model to behave in those situations. Defining all of those edge-case behaviors is part of how we view specifying what a product should feel like to use. And then thirdly, more on the engineering side, we found it very, very valuable to build a comprehensive logging system that can take any single failure in the wild and fully replay it, fully reproducing the failure mode. Within our team, we have a logging system that lets us do something like this: if you come up to me and say, hey, I tried to use Notion AI, I asked it a question about holidays, and the answer was wrong in this specific way, I can go back to our logs, find all of the specific inferences that we sent to one of our model providers, fully reconstruct the entire pipeline, play it back, add that to our evaluation dataset, and say, okay, this is a case we're now going to use as part of our evals to make sure we never regress on it again. This logging system, with a deep level of observability and the ability to debug from production failures, has been really instrumental in working on our AI quality. So in product, unlike in academia, evals are not your source of truth.
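For the kind of replayable log described above, here is a minimal sketch of what such a record could capture. The field names and the `replay` and `promote_to_eval_set` helpers are hypothetical illustrations, assuming a generic `call_model` wrapper around a model provider; the idea is simply that every inference stores enough context to be reproduced exactly and frozen into the eval set.

```python
import json
import uuid
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

@dataclass
class InferenceLog:
    log_id: str
    timestamp: str
    model: str               # provider + model version actually used
    prompt_messages: list    # the exact messages sent to the provider
    params: dict             # temperature, max tokens, etc.
    retrieved_context: list  # chunks fed into the prompt, if any
    output: str

def record_inference(model, prompt_messages, params, retrieved_context, output) -> InferenceLog:
    """Capture everything needed to reproduce this inference later."""
    return InferenceLog(
        log_id=str(uuid.uuid4()),
        timestamp=datetime.now(timezone.utc).isoformat(),
        model=model,
        prompt_messages=prompt_messages,
        params=params,
        retrieved_context=retrieved_context,
        output=output,
    )

def replay(log: InferenceLog, call_model):
    """Re-run the exact same request against the current pipeline."""
    return call_model(log.model, log.prompt_messages, **log.params)

def promote_to_eval_set(log: InferenceLog, path="eval_cases.jsonl"):
    """Freeze a production failure as a regression case so it never silently returns."""
    with open(path, "a") as f:
        f.write(json.dumps(asdict(log)) + "\n")
```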

Linus Lee [00:14:44]: Instead, your users are your source of truth, and you should talk to them all the time and try to understand what they want and what your task looks like in the real world. And the people that work with your users and the people that work with your ML systems use evals as a way to communicate about what's important, what behaviors you should expect, and what the edge cases that matter are. I think we're starting to see some of this get picked up in academia as well. If you look at benchmarks like SWE-bench, which measures agent performance on coding tasks, or the Elo ratings on Chatbot Arena, these are much closer to the product experiences that people want to put in their AI products, and less about iterating on a specific, fixed benchmark dataset. And I think we're starting to see momentum pick up around these more product-oriented evaluations, even in academic contexts. Which leads us to a broader view: I think as a community, we're not just using evals to say things about our products and which models edge each other out on specific tasks. As our models become more generally capable, and as building them becomes more of a community and industry effort, I think we're also using evals as a reflection of our values, of what we value about intelligence and about ourselves. So I'm stealing a page from one of my favorite talks of all time, Bryan Cantrill's "Platform as a Reflection of Values," where he outlines how programming languages, operating systems, and platforms are a way for their creators to reflect what they think is important in their software systems. I think in a similar way, when we build AI evaluations as a community, we're using evals to communicate to each other something about what we value in AI systems, whether it's knowledge, with benchmarks like MMLU, or problem solving, with programming benchmarks or math problems.

Linus Lee [00:16:33]: Maybe we value multimodality, because we have evals about multimodal question answering. Maybe we value personality, and that's something that people want to evaluate. Maybe we value generalization beyond what the model has learned, which is where benchmarks like ARC-AGI come from. In all of these cases, we're using an evaluation as a community to say, here's a thing that we value about intelligence, or about creativity, or about whatever we want the model to do, and if the model performs well on these evaluations, that means the model is intelligent by some definition that we have. And perhaps we're not just saying this about our models; maybe this is a way of saying the same thing about ourselves. Increasingly, the models are becoming more capable, more likely to take over or augment a lot of our jobs in the workplace. So when we say these are the evaluations that matter, I think that's not just a way for us to say these are the evaluations that matter for our product, but also a way to say: in the jobs that we care about, in the tasks that we care about, this is what really makes an effective, productive collaborator or an effective, productive team, things like multimodality and knowledge and creativity that we've started to measure.

Linus Lee [00:17:40]: And so as a community, I think evaluations are expressions of values in addition to being a communication method. These are really the words and the standards that we'll use to will our future AI systems into existence. And if you are not evaluating something, that probably means either you don't value it, or, if you do value it, maybe you should write an evaluation for it to make sure we continue to pay attention to it and improve it. So with that, thank you very much.
