MLOps Community

Measuring Quality With Open-ended AI Output

Posted Mar 15, 2024 | Views 287
# AI Applications
# Open-ended Output
# Tome
SPEAKERS
Sam Stone
Head of Product @ Tome

Sam is currently the head of product at Tome, a startup using AI to improve storytelling for work. Previously, he was a Senior Director of Product at Opendoor, where he oversaw AI, analytics, and operations products. Before that, he was a co-founder of Ansaro, a startup using data science to improve hiring decisions. Sam holds degrees in math and international relations from Stanford and an MBA from Harvard.

Adam Becker
IRL @ MLOps Community

I'm a tech entrepreneur and I spent the last decade founding companies that drive societal change.

I am now building Deep Matter, a startup still in stealth mode...

I was most recently building Telepath, the world's most developer-friendly machine learning platform. Throughout my previous projects, I had learned that building machine learning powered applications is hard - especially hard when you don't have a background in data science. I believe that this is choking innovation, especially in industries that can't support large data teams.

For example, I previously co-founded Call Time AI, where we used Artificial Intelligence to assemble and study the largest database of political contributions. The company powered progressive campaigns from school board to the Presidency. As of October 2020, we helped Democrats raise tens of millions of dollars. In April of 2021, we sold Call Time to Political Data Inc. Our success, in large part, is due to our ability to productionize machine learning.

I believe that knowledge is unbounded, and that everything that is not forbidden by laws of nature is achievable, given the right knowledge. This holds immense promise for the future of intelligence and therefore for the future of well-being. I believe that the process of mining knowledge should be done honestly and responsibly, and that wielding it should be done with care. I co-founded Telepath to give more tools to more people to access more knowledge.

I'm fascinated by the relationship between technology, science and history. I graduated from UC Berkeley with degrees in Astrophysics and Classics and have published several papers on those topics. I was previously a researcher at the Getty Villa where I wrote about Ancient Greek math and at the Weizmann Institute, where I researched supernovae.

I currently live in New York City. I enjoy advising startups, thinking about how they can make for an excellent vehicle for addressing the Israeli-Palestinian conflict, and hearing from random folks who stumble on my LinkedIn profile. Reach out, friend!

SUMMARY

AI applications supporting open-ended output, often multi-modal, are becoming increasingly popular, for work and personal purposes. This talk will focus on how developers of such apps can understand output quality from a user perspective, with an eye toward quality measures that feed directly into product improvements. We'll cover topics including user-generated success signals, "edit distance" and why it matters, modality attribution, and when to backtest - and when to skip it.

TRANSCRIPT

Measuring Quality With Open-ended AI Output

AI in Production

Slides: https://drive.google.com/file/d/123C2gbDpfCBFgiYipoG-dTUjcQOaEVd1/view?usp=drive_link

Adam Becker [00:00:04]: We have a very appropriate next speaker.

Adam Becker [00:00:08]: Let's see.

Adam Becker [00:00:08]: Sam, are you around? Hi there, Sam. Does it feel like, do you feel the thematic consistency?

Sam Stone [00:00:16]: I very much do.

Adam Becker [00:00:18]: Nice. It feels to me like we've just spoken about almost a bottom-up kind of approach to evaluation: you go in component by component and make sure that each of the legs of the RAG or the application is well evaluated and rigorously tested. But then there is also just the question of how that ends up impacting the user at the end, and how we actually make sure that what the user is receiving is something meaningful, which may or may not involve breaking everything down into different components. Is that what we're going to talk about today?

Sam Stone [00:00:54]: That's exactly right. It's going to be very user focused.

Adam Becker [00:00:58]: Okay, awesome. So the stage is yours. I'll be back in a little bit.

Sam Stone [00:01:04]: Okay, terrific. Thanks so much, Adam. Hi, everybody. I'm Sam. I'm the head of product at Tome, and I'm excited to speak with you all today about measuring quality with open-ended AI output. So quality can mean many different things, and it tends to be pretty specific to the user and the application. Before generative AI, it was a lot more straightforward. We had measures like recall, precision, R-squared.

Sam Stone [00:01:40]: But now that models are outputting complex things, like text, images, videos, audio, and combinations of these, defining quality is harder. In the previous talk, there was a lot of focus on factuality. Some users will care a lot about relevance or creativity or coherence or consistency. I'm not actually going to talk about any of these things. I'm going to talk about one pattern that I think is relevant to all software environments, and then two approaches that I think are useful for figuring out the right quality dimensions for your environment. I'll spend a minute talking about latency, then I'll talk about how, once you already have a product that is launched, you can work backwards from user satisfaction to assess quality. And I'll end by talking about how, before you launch either a product or a feature, you can assess quality without live user feedback. Before we get into latency, I want to make sure everyone is familiar with Tome, because I'm going to use Tome to illustrate examples throughout my talk.

Sam Stone [00:02:58]: So Tome is a presentation tool. You're actually looking at a Tome right now. What Tome does is allow people to tell stories for work, mainly presentations, by co-creating and editing with AI. Most of our input looks like text from users, generally of the format "make me a presentation about something," and the output will either be a presentation or a component of a presentation, like a page or a segment of a page. That output will include text, images, drawings, charts, tables, and this concept of layout: how you put it all together and make it look beautiful. So it's a very multimodal and very complex and open-ended type of application. The first topic I want to touch on is latency.

Sam Stone [00:03:56]: I think this is a really under-indexed dimension of generative AI quality, and I think a lot of developers approach the problem with this belief that if I'm going to save you, as a user, something like 10 hours, you must be willing to tolerate a ten-second response time. These are order-of-magnitude estimates. It would be illogical not to wait that long to get that kind of savings. And I think the reality is, actually, this isn't how users really behave. I think there are a couple of reasons why something like a ten-second response time could actually be make-or-break, in a bad way, for a generative AI application. First, users don't necessarily trust an application to save them a lot of time, and this is especially true if they're new users. And most generative AI applications today have mainly new users. Second, if an application takes multiple seconds or double-digit seconds to respond, users are likely to start multitasking.

Sam Stone [00:05:06]: They're going to open up some other app, and if that happens, there's a good chance that you might just never get the user back. And finally, users know that they will have to iterate. They know that AI for real work probably isn't good enough on the first shot. And so they're expecting to use a workflow maybe a few times, maybe many times. And if you have an unbounded number of times that you have to go through a workflow to get it just right, and it's 10 seconds for each of those, that can be a very intimidating proposition. So let's get into quality and how to understand that from the point of view of user satisfaction. I really like the framework of working backwards here, where we start with the thing that we want our application to produce, which is some indication of user satisfaction.

Sam Stone [00:06:03]: For Tome, as a presentation tool, the clearest indication of satisfaction is that someone actually shares their presentation. We then ask the question: how can our app capture some digital indication of satisfaction? There are actually many ways to share a Tome. I could pull up my screen and show it to my colleague live, and that's very hard to capture digitally. But then there are some digital paths that are pretty common. So we realized that people will send URLs, and so we want to log viewers, by IP address, who are not the author of the presentation. We also realized that a lot of sharing happens because people want to export their Tome to a format like PDF, so we created a function like that with good logging. And then the last step is to ask: what manual work was required to achieve that satisfaction? How many edits did it take for a user to go from the AI output to the point at which they were indicating satisfaction, in this case actually sharing their Tome?
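
A minimal sketch of how those digital share signals might be captured, assuming a simple in-app event log; the event names and fields (`doc_id`, `viewer_id`, `author_id`) are illustrative assumptions, not Tome's actual schema.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Hypothetical event log for share signals; field names are illustrative.
@dataclass
class ShareSignalLog:
    events: list = field(default_factory=list)

    def log_view(self, doc_id: str, author_id: str, viewer_id: str) -> None:
        """Log a view only when the viewer is not the author of the presentation."""
        if viewer_id != author_id:
            self.events.append({
                "type": "non_author_view",
                "doc_id": doc_id,
                "viewer_id": viewer_id,
                "ts": datetime.now(timezone.utc).isoformat(),
            })

    def log_export(self, doc_id: str, fmt: str = "pdf") -> None:
        """Log an export (e.g. to PDF), another proxy for 'the user shared this'."""
        self.events.append({
            "type": "export",
            "doc_id": doc_id,
            "format": fmt,
            "ts": datetime.now(timezone.utc).isoformat(),
        })

    def satisfaction_signals(self, doc_id: str) -> int:
        """Count share-like signals recorded for one document."""
        return sum(1 for e in self.events if e["doc_id"] == doc_id)
```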

Sam Stone [00:07:12]: Now, just counting up the number of edits can give us a sense of, week over week, how well we are doing, and maybe where we are relatively good and where we are relatively weak. But I think it's really the type of human edits that is most insightful. And so at Tome, as a presentation tool, we see there are different content types that our users are editing. Sometimes they're really focusing on editing the text, sometimes it's the images, sometimes it's the tables or the drawings. That gives us an indication of where they're voting with their actions and telling us they think our AI is relatively weak, and thus where we most need to improve it. The last topic that I want to talk about is how you might assess or estimate quality before you've actually launched a feature. It's great to be able to get real user feedback, but most developers want to have some basis for saying that a new candidate model or a new feature is better than the status quo before they deploy it to their general user base.
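
A sketch of the aggregation this implies: counting the edits users make between the AI output and the shared version, broken out by content type. The edit-event shape here is an assumption for illustration.

```python
from collections import Counter

# Hypothetical edit events captured between AI generation and the moment of sharing;
# "content_type" is one of text / image / table / drawing / layout.
edits = [
    {"doc_id": "d1", "content_type": "text"},
    {"doc_id": "d1", "content_type": "text"},
    {"doc_id": "d1", "content_type": "image"},
    {"doc_id": "d2", "content_type": "layout"},
]

def edits_by_content_type(edit_events):
    """Where are users 'voting with their actions'? Higher counts suggest the AI
    output for that content type needed the most manual fixing."""
    return Counter(e["content_type"] for e in edit_events)

print(edits_by_content_type(edits))  # Counter({'text': 2, 'image': 1, 'layout': 1})
```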

Sam Stone [00:08:31]: So I think the first thing to consider here is deterministic evaluations: kind of just old-school software, things that are relatively easy, and not dependent on AI, to classify about output. We've already talked about latency. You have functional errors: are you getting a 400 or a 500 error? Length. You can do some keyword-based assessment. And then we get into topics and sentiment, which is starting to get more towards AI, but you can still use open source packages that are not dependent on the newest foundation models. And I think this kind of suite of deterministic eval approaches is valuable mainly because you can run it quickly on large sample sizes. If you get summary statistics that move dramatically between a status quo model and a new model that's a candidate for deployment, that's probably an indication that something is worth investigating, and maybe something is wrong. But the deterministic evals probably won't tell you that much about why the new candidate model or feature is better. So that brings us to human evals, which are at the opposite end of the spectrum.
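
A minimal sketch of a deterministic eval pass along these lines: latency, HTTP error codes, output length, and banned-keyword checks, summarized as pass rates over a batch. Thresholds and field names are illustrative assumptions.

```python
# Illustrative deterministic checks over a batch of generations.
# Each record is assumed to carry the raw output plus request metadata.
BANNED_KEYWORDS = {"lorem ipsum", "as an ai language model"}

def deterministic_checks(record, max_latency_s=10.0, min_chars=50, max_chars=5000):
    text = record["output_text"].lower()
    return {
        "latency_ok": record["latency_s"] <= max_latency_s,
        "no_http_error": record["status_code"] < 400,  # catches 4xx and 5xx
        "length_ok": min_chars <= len(text) <= max_chars,
        "keywords_ok": not any(k in text for k in BANNED_KEYWORDS),
    }

def summarize(records):
    """Pass rate per check. Big swings between status quo and candidate are a
    signal to investigate, not a verdict on which model is better."""
    results = [deterministic_checks(r) for r in records]
    return {k: sum(r[k] for r in results) / len(results) for k in results[0]}
```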

Sam Stone [00:09:49]: So human evals obviously take time. They're costly, even if you're doing them in-house, and so it's harder to do them on large samples. But they're much better, conditional on evaluators who have some amount of standardization and training, at assessing things like factuality, taste, structure, and bias. At Tome, we lean really heavily on the human eval side, despite the cost, despite the fact that it's hard to do at really high N. And we've generally found that even just doing 20 to 100 human evals will often clearly illuminate whether a new feature or new model is better than the status quo. I want to go into an actual example of what this looks like at Tome, because we're doing something that is pretty basic, but I think it's really powerful. We're doing this in a spreadsheet. We will have our candidate model and our status quo model each generate, call it, 20 to 100 artifacts with the same inputs. We will randomize and blind those outputs, and then we'll get a couple of people, sometimes as few as five to ten, from our team to go through those outputs using links in a spreadsheet and rate them on just a couple of categories.
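
A sketch of the randomize-and-blind step, assuming each output is a shareable link: pair status quo and candidate outputs on the same prompts, shuffle away the model identity, and write rows a rater can score in a spreadsheet. The CSV columns are an assumed layout, not Tome's actual sheet.

```python
import csv
import random
import uuid

def build_blind_eval_sheet(prompts, status_quo_outputs, candidate_outputs,
                           path="blind_eval.csv", seed=0):
    """Write a blinded rating sheet; the returned key maps row ids back to models."""
    rng = random.Random(seed)
    rows, key = [], {}
    for prompt, sq, cand in zip(prompts, status_quo_outputs, candidate_outputs):
        pair = [("status_quo", sq), ("candidate", cand)]
        rng.shuffle(pair)  # blind the model identity within each pair
        for model, output in pair:
            row_id = uuid.uuid4().hex[:8]
            key[row_id] = model
            rows.append({"row_id": row_id, "prompt": prompt, "output_link": output,
                         "text_quality": "", "image_quality": "", "layout_quality": ""})
    rng.shuffle(rows)  # also randomize row order across the whole sheet
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=rows[0].keys())
        writer.writeheader()
        writer.writerows(rows)
    return key  # keep this out of raters' hands until the scores are in
```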

Sam Stone [00:11:22]: So here we have text quality, image quality, and page layout quality. That's obviously specific to our application environment and what our users care about. But I think this general framework, and the simplicity and speed and ease with which it can be deployed, makes it pretty good for a wide range of use cases. It's not engineering-dependent: you can have a product manager or even a designer run a process like this. And the vast majority of the time it's very clarifying, very fast. The last approach to pre-launch quality assessment is kind of in the middle between deterministic and human evals, and that's actually using LLMs to evaluate foundation model output. I think it's potentially the best of both worlds.

Sam Stone [00:12:16]: You can run it at high N, you can iterate on it quickly, and you can potentially get nuance. But there is a caveat: if there is a root issue with your generative model, it's very easy for that issue to affect the evaluator model. One way to mitigate this is to use different models. Maybe you're using OpenAI for generation; use Anthropic or Llama for evaluation. But even this doesn't necessarily solve it, because a lot of those models have been trained on similar, highly overlapping data sets, and maybe the root cause is the data set. So I think this is an interesting area, but what I've found in practice is that human evals have tended to be really fast and really effective at getting us where we want to be.
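
A sketch of LLM-as-evaluator with that mitigation in mind: `call_judge_model` is a placeholder to be wired to an evaluator from a different model family than the generator, and the rubric and JSON response format are illustrative assumptions.

```python
import json

JUDGE_PROMPT = """You are rating a presentation page generated by an AI tool.
Score 1-5 on: text_quality, image_quality, layout_quality.
Respond with JSON only, e.g. {{"text_quality": 4, "image_quality": 3, "layout_quality": 5}}.

PROMPT GIVEN TO THE GENERATOR:
{prompt}

GENERATED OUTPUT:
{output}
"""

def call_judge_model(prompt_text: str) -> str:
    """Placeholder: wire this to an evaluator API from a different provider
    than the one used for generation, per the cross-model mitigation above."""
    raise NotImplementedError

def judge_output(prompt: str, output: str) -> dict:
    raw = call_judge_model(JUDGE_PROMPT.format(prompt=prompt, output=output))
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        # Judge models sometimes wrap JSON in prose; treat that as a failed rating.
        return {"error": "unparseable judge response", "raw": raw}
```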

Sam Stone [00:13:13]: So, to recap, my advice for all of you who are working with kind of complex generative AI output, especially when it's multimodal, number one, keep your latency low. Number two, figure out that user satisfaction signal that is powerful and work backwards from that. How do you capture that in your application? And then how do you measure the progress users are making towards it? And then third, combine different evaluation approaches. Use humans, use deterministic software, and maybe if you're really adventurous, use LLMs themselves. So that's it for my presentation, and I'm happy to answer questions.

Adam Becker [00:13:57]: Sam, thank you very much. If we have any questions, I'll send you the link to the chat, because we've got to go on to the next one. But it feels to me like there was just so much product thinking and product wisdom in everything that you spoke about. It's even just something like: you could save users 10 hours, but they won't spend the 10 seconds on it. This reminds me a little bit of that adage about software developers: I'm willing to spend 7 hours debugging, but not ten minutes reading the documentation. It's the kind of thing that makes no sense until you see users actually continuing to do this exact same thing. And I felt transfixed by that table. I know we ran out of time, but I could have just stared at that.

Adam Becker [00:14:45]: Both elegant, simple, but straight to the point. So, Sam, thank you very much for sharing it with us.

