Hunting for Quality Signals: Is My AI Product Succeeding?
After being burned one too many times by unexpected model performance in mission-critical production scenarios, Gordon co-founded Kolena to fix fundamental problems with ML testing practices across the industry. Kolena is doing this by building an ML testing and evaluation platform that tells you what you need to know before your model hits the real world.
Prior to Kolena, Gordon designed, implemented, and deployed computer vision products for defense and security as Head of Product at Synapse (acq. by Palantir) and at Palantir.
Kolena CPO and Co-Founder, Gordon Hart, will discuss AI Quality Standards, how to establish gold standards, and how to create a framework to rigorously evaluate AI systems for performance.
Gordon Hart [00:00:10]: All right, hello everybody, and thank you for coming here. My talk is Hunting for Quality Signals. We'll see if we can get it up on these big screens. It's going to be a little bit of a pace shift moving away from the panels and towards the presentations, which we'll have all afternoon. And in this talk, I'm going to go through a lightweight way that you can identify failure modes in GenAI generations and measure data quality with lightweight classifiers. So a little bit about me: I'm the co-founder and CPO of Kolena. We're a company that's obviously focused on AI quality, and in the past three years since founding Kolena, I've been thinking exclusively about AI and ML quality and what that means and how you implement it across different domains and industries, different data modalities and different task types. There's a picture of me in 2018 installing an automated threat detection system in an international airport, and I can tell you this is a product that could have used a lot more emphasis on quality, and this is quality everywhere along the process.
Gordon Hart [00:01:09]: A lot of things have changed since then, but a lot of things haven't. And one of those is that you need to focus on everything from the base layer at data quality all the way up to monitoring quality in production. We're going to learn a little bit about how to do that here. So how can I tell if a generative AI product is succeeding in the wild? Let's look at this use case of an image generation service where a user might come to us with a query like this. In a hut with a fire, an old gray haired man talks with his son, and our system might respond with a generation like this, which I would say is a pretty good embodiment of that prompt. But we know that it doesn't always go that way. Sometimes you get a generation like this, which not only is it a cartoon instead of a realistic photo, but it also has a watermark of a stock photo burned into it, which is clearly not something that the user wanted. This is a failure mode that we want to be able to identify, mitigate, and ideally remove from our product entirely.
Gordon Hart [00:02:02]: Another use case here would be a personal finance assistant, so we might have a chat-style interface where a user comes to us with questions like this: which stock should I buy? And they could be excused for thinking that this is a question that we could answer. If our system responds with specific stock picks, buy ABC, DEF and XYZ, this is another failure mode, because we could be liable for providing bad financial advice. What we want to do is be able to identify these failure modes such that we can stop them before we show them to our users. So the quality signal approach is a way to train lightweight classifiers to identify these sorts of failure modes and mitigate them. In this particular scenario, this image generation service, we could use quality signals to define all of these different failure modes, such as accidentally rendering a stock photo, rendering AI hands, or messed up physiology. And we could use these signals to flag, hey, this is probably a bad generation. We probably want to either regenerate it or bail out and apologize to the user because we couldn't comply with their request. Similarly, for this chat setup, we can get quality signals that are looking for all different sorts of failure modes, both adversarial usage on behalf of the user, or bad generations from our model.
Gordon Hart [00:03:14]: And with this we could flag, hey, this is probably financial advice that we don't want to be providing, and we could similarly bail out. So AI quality throughout the process requires understanding the data that is flowing through your system. This is an interesting thing with GenAI, because not only is our AI system now consuming unstructured data, but it's also producing it, and we want to focus on that data that's being produced the same way that we would the data that we're consuming. So in this talk, we're going to introduce this concept of quality signals. We're going to run through how to create them efficiently, and we're going to learn ways that we can leverage these signals for quality throughout the development process. The TLDR here is that quality signals are lightweight classifiers that you can use throughout your process to identify failure modes of GenAI, or add meaningful metadata to data that you're either collecting or generating. Signals are defined interactively in minutes and trained in milliseconds, and we're going to go through what that process looks like in this talk, and then you can use signals upstream, downstream, in deployment, in development, anywhere along the way to ensure that you're building a quality system. So let's focus on this case study of identifying the stock photo watermark failure mode in generated images.
Gordon Hart [00:04:28]: We're going to look for this in the Pick-a-Pic v2 dataset, where we've got 84k user prompts like "tiny cute kitten Pixar style", and the corresponding generated image like this one. And quality signals are needed in an application like this because there's a huge set of possible failure modes. We can generate an image of the wrong style, we can generate garbled physiology and limbs coming out of nowhere. We can accidentally watermark a photo, and having a way to measure these different failure modes is critical. Measuring that quality can be challenging. A recurring theme throughout today, I think, will be that good evaluation metrics beyond general benchmarks like MMLU are kind of challenging to get for lots of generative applications. And human evaluation at the small scale is kind of informal, and at the large scale is very slow and very expensive.
Gordon Hart [00:05:18]: Relying on human evaluation from your users in production, you know, a thumbs up or thumbs down signal, can be pretty unreliable. Users might not always provide the right signal to us, and furthermore, this deferral of quality to our users in production can be a bad thing for our application. Maybe we don't have the ability or the willingness to rely on our users to tell us if our product is succeeding or not. And then in generative AI applications, latency is often a massive concern as well. The model itself can be slow enough. It might take many seconds to produce its results, and we typically won't have the budget to introduce really expensive quality measurements that we can use for things like guardrails when we're actually running our product in production. So how can we detect failures like this automatically? Ideally, we have a function like this where we push in a new image and we say, hey, the probability that this image is watermarked, that it has this failure mode, is 0.99. We can use that to flag it and then take action.
Gordon Hart [00:06:14]: So we're going to look at a couple of different techniques and plot them out on this accuracy versus effort scale, where the first technique would be something like training a custom classifier, collecting a bunch of data, treating this like a normal ML problem, going through the traditional MLOps process, and expending a lot of effort to get a very accurate classifier. The next way that we could approach this is with something like zero-shot classification. You can use, for example, multimodal embeddings extracted from a CLIP-style model to search for things like "stock photo watermark", and get, for very low effort, a classifier that kind of works, but is pretty low accuracy. And I want to show really quickly what that might look like. So we're here on Kolena and I can go to natural language search and type something like "stock photo watermark". And this will search through this 84k image dataset for images that are similar to this stock photo watermark search vector. And you see, some of these images are great. The top images do have this failure mode, definitely, and we've retrieved those well.
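Editor's sketch: the zero-shot search described above boils down to ranking image embeddings by cosine similarity to a text query embedding. The snippet below is a minimal illustration under loud assumptions: the vectors are random stand-ins rather than real CLIP-style embeddings, and `cosine_top_k` is a hypothetical helper, not part of any product mentioned in the talk.

```python
import numpy as np

def cosine_top_k(query, embeddings, k=5):
    """Return (indices, scores) of the k rows of `embeddings` most similar to `query`."""
    q = query / np.linalg.norm(query)
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    scores = e @ q                     # cosine similarity of every image to the query
    top = np.argsort(-scores)[:k]      # highest similarity first
    return top, scores[top]

rng = np.random.default_rng(0)
image_embeddings = rng.normal(size=(1000, 64))  # stand-in image vectors (assumption)
text_query = rng.normal(size=64)                # stand-in for the embedded search text

# Plant three images that sit very close to the query vector.
image_embeddings[:3] = text_query + 0.1 * rng.normal(size=(3, 64))

top_idx, top_scores = cosine_top_k(text_query, image_embeddings, k=5)
print(top_idx, np.round(top_scores, 3))
```

In this toy setup the planted images rank first and the remaining slots are filled by whatever unrelated vectors happen to score highest, which mirrors the mixed first page of search results seen in the demo.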
Gordon Hart [00:07:16]: But also the accuracy is not great. On this first page of search results we have some images like these which are not stock photo watermarks. They're not this failure mode. So this is a very low effort, low accuracy way for us to attack this problem. So what we're going to talk about in this talk is a few-shot learning technique, the quality signals, where for pretty much the same effort as zero-shot classification, you can get pretty much the same accuracy as something that went through a proper data collection and training process. I should also mention here that similar to this purple dot would be running these images through a large multimodal model, for example, and asking, hey, does this have a stock photo watermark? That'll be quite accurate and quite low effort for you to put together, but the latency will be off the charts. If you had a third axis on this chart, the latency would be all the way over here, whereas quality signals are very fast and able to run online or in batch. So the way that we're going to train this quality signal starts with extracting embeddings.
Gordon Hart [00:08:13]: Embeddings, or often vectors, are compact representations of real world concepts that we've extracted from our inputs. In this example here, you might see this photo of a burger has a watermark over it and this little watermark banner. And so its embedding should represent these things like burger, photorealistic, beer, and also that watermark concept. There's lots of ways to get good embeddings. In this case, we're going to use an open source SigLIP model, which is a CLIP derivative with a different loss function. So you can use open source pre-trained models or proprietary APIs. You can use self-supervised learning techniques to train an embedding extractor on your own data without applying any additional labeling. Or you can take your current models and take one of their internal representations and use that as embeddings that are probably pretty well modeled for your space.
Gordon Hart [00:09:00]: Embeddings tend to capture really subtle concepts and features, but then we often use them in pretty unsubtle ways. This vector similarity comparison is one way that embeddings are commonly used. This is what we just saw with that search feature. And this can get you kind of a rough idea, but it's very hard to come up with the exact correct vector to use to search. And therefore, vector similarity comparison can often be pretty crude. And then dimensionality reduction techniques, to take this, say, 256-dimension embedding and flatten it to two or three dimensions so that we can visualize it spatially, necessarily eliminate a lot of that subtlety, because we can't visualize and explore high dimensional spaces. Using these embeddings instead as inputs to a classifier is a much more effective way to leverage them for tasks that require subtlety. So step two here, when we have embeddings, is to go through our dataset and select some positive and negative examples. We want to define exactly what we're looking for with this quality signal, which requires us selecting some positive examples that have this feature and negative examples that don't have this feature.
Gordon Hart [00:10:03]: If we can select similar images that only differ in the presence of the feature or not, that's even better. Like in this example, we could select this burger on top as a positive example and this burger on the bottom as a negative example. What we're learning here is that we don't mean burgers with this quality signal. We don't mean a nice sunny day. What we're actually looking for is this watermark that is present in this top embedding, but not present in this bottom embedding. So finally, once we've labeled a handful of these images with positive and negative, we can train a small classifier. Here I'm talking tiny. With few-shot learning, we want to learn the boundary in embedding space that is separating these positive examples from these negative examples. The keyword here really is small.
Gordon Hart [00:10:44]: I'm not talking deep learning here. This can be a single hidden layer multilayer perceptron model with order of hundreds of parameters. And this is why we can train it with so few training examples. The requirements here really are tiny. You just need a handful of examples for data. The training process takes milliseconds on a commodity CPU. So I hesitate even to use the term training because training comes with connotations of lots of time, lots of expensive data, lots of expensive hardware, and staring at those loss functions for quite some time. So let's go through this process and see what it looks like with a handful of examples.
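Editor's sketch: a minimal version of the few-shot step described above, using scikit-learn's `MLPClassifier` with a single small hidden layer. Everything specific here is an assumption for illustration: the embeddings are synthetic 32-dimensional vectors with a planted "watermark" direction, and the hidden size, `alpha`, and solver are illustrative choices, not the settings used in the talk.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
DIM = 32
direction = np.zeros(DIM)
direction[0] = 1.0  # hypothetical "watermark" direction in embedding space

def make_embeddings(n_pos, n_neg):
    """Synthetic stand-ins: positives are shifted along `direction`."""
    labels = np.array([1] * n_pos + [0] * n_neg)
    noise = rng.normal(size=(n_pos + n_neg, DIM))
    return noise + 6.0 * labels[:, None] * direction, labels

X_train, y_train = make_embeddings(8, 8)      # eight positives, eight negatives
X_test, y_test = make_embeddings(200, 200)    # held-out synthetic data

# One hidden layer with four units; strong L2 regularization for tiny data.
signal = MLPClassifier(hidden_layer_sizes=(4,), solver="lbfgs",
                       alpha=1.0, max_iter=500, random_state=0)
signal.fit(X_train, y_train)

n_params = sum(w.size for w in signal.coefs_) + sum(b.size for b in signal.intercepts_)
accuracy = signal.score(X_test, y_test)
probabilities = signal.predict_proba(X_test)[:, 1]  # P(failure mode) per embedding
print(n_params, round(accuracy, 3))
```

With a 32-dimensional input and four hidden units the model has 137 learnable parameters, which is in the "order of hundreds" range mentioned in the talk; real embeddings are larger, so the count scales with the embedding dimension.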
Gordon Hart [00:11:18]: We can start with one positive example, start with one negative example, which is an image I selected at random, and it just doesn't have this watermark on it. As we go through and select, we can see we've selected one positive and one negative, and we can train the classifier. And in this case, we start with a very low F1 score. And we'll see, as we continue to label, this F1 score will go up and up. An F1 score of one means that we're perfect at this task. As we continue to select these positive and negative examples, we see our performance improves, and by the time we've selected just five positive and five negative examples, we already have a classifier with an F1 score of 50%.
Gordon Hart [00:11:56]: This is pretty good. You can use that for pretty real world tasks that maybe don't require super high metrics. But as we continue to select images, we're narrowing down further and further on the specific watermark concept that we're trying to isolate here. By the time we've selected just eight positive examples and eight negative examples, we've trained a classifier here with very good performance. This has 98% precision, almost 90% F1 score. It has a negligible false positive rate with just three false positives in the entire 84,000 image dataset. And it has an accuracy that is almost perfect. So this classifier now is quite good.
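Editor's sketch: for readers less familiar with the metrics quoted above, this is how precision, recall, F1, and false positive rate fall out of the confusion counts. The labels below are a small made-up example, not the numbers from the talk, and `signal_metrics` is a hypothetical helper name.

```python
def signal_metrics(y_true, y_pred):
    """Compute precision, recall, F1, and false positive rate for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    false_positive_rate = fp / (fp + tn) if fp + tn else 0.0
    return {"precision": precision, "recall": recall,
            "f1": f1, "false_positive_rate": false_positive_rate}

# Made-up example: 10 true failures among 100 images.
y_true = [1] * 10 + [0] * 90
y_pred = [1] * 8 + [0] * 2 + [1] * 2 + [0] * 88  # 8 hits, 2 misses, 2 false alarms

metrics = signal_metrics(y_true, y_pred)
print(metrics)
```

Because failures are rare in a large dataset, the false positive rate can be tiny even when precision is imperfect, which is why it helps to look at precision and F1 alongside accuracy.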
Gordon Hart [00:12:34]: It's got quite high precision. We can use this with pretty high confidence for this task of identifying this failure mode. So let's put it to use. We have this signal that can generate the probability that an embedding contains a watermark. And we can use this throughout the AI development lifecycle, whether that's in the data analysis and preparation phase, whether it's in the model development phase or the model deployment phase. We can use this to analyze past production data to learn if this failure mode is new or if it's been in our product for a long time. In the case that we're training our own model, we can use this to clean our training dataset by removing examples that have watermarks in them. That's the easiest way to get rid of that behavior.
Gordon Hart [00:13:14]: We can use this during the development process as an automated offline evaluation metric, where for any new model we can now say, hey, does this generate stock photos? And if this metric is higher than zero, then it might have this failure mode. You can run this live as a guardrail to redact any generation that's flagged as having this failure. You can create monitoring dashboards that show you how much you're failing through time, and you can create alerts that are alerting you when failures happen too frequently. Maybe you've rolled out a new model that you didn't realize had this failure. Maybe your users have started prompting your system in a different way, and now you're seeing this failure a lot. It's good to know when that's happening in production. And then for non-generative examples, you can define quality signals that look for features that, say, characterize an edge case of your model. If you're doing an object detection system, you can look for a specific object or a specific configuration and define a signal for that.
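Editor's sketch: one way the guardrail and alerting ideas above could look in code. This is an illustration under assumptions, not the speaker's implementation: `guarded_generate` and `FailureRateMonitor` are hypothetical names, the 0.5 threshold and 5% alert rate are arbitrary, and the "generator" is a canned list of outputs carrying a precomputed signal probability.

```python
from collections import deque

def guarded_generate(generate, score, threshold=0.5, max_attempts=3):
    """Regenerate until the signal score is below threshold; None means bail out."""
    for _ in range(max_attempts):
        output = generate()
        if score(output) < threshold:
            return output
    return None  # caller apologizes to the user instead of showing a bad result

class FailureRateMonitor:
    """Track the share of flagged generations over a sliding window."""
    def __init__(self, window=100, alert_rate=0.05):
        self.flags = deque(maxlen=window)
        self.alert_rate = alert_rate

    def record(self, flagged):
        self.flags.append(bool(flagged))

    def rate(self):
        return sum(self.flags) / len(self.flags) if self.flags else 0.0

    def should_alert(self):
        return self.rate() > self.alert_rate

# Canned outputs: the first is flagged as watermarked, the second is clean.
canned = iter([{"id": 1, "p_watermark": 0.99}, {"id": 2, "p_watermark": 0.10}])
result = guarded_generate(lambda: next(canned), lambda o: o["p_watermark"])

monitor = FailureRateMonitor(window=10, alert_rate=0.05)
for flagged in [False] * 9 + [True]:
    monitor.record(flagged)
print(result, monitor.rate(), monitor.should_alert())
```

The retry limit matters because each regeneration adds the full model latency; the signal check itself is cheap by comparison.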
Gordon Hart [00:14:07]: And you can then use that signal to mine through your unlabeled data to find similar examples of that edge case and use those for training and testing. So these quality signals can be used throughout the pipeline for more than just this GenAI failure mode identification. We can take anything that we can extract embeddings from. Here I've got images and chats, but of course, anything that you can extract embeddings from could be an audio file, could be a video, could be a longer document. You feed this into an embeddings extractor, regardless of what type of extractor that is, and then you can define a ton of quality signals on top of this extracted embedding that give you a very full bill of metadata that you can use to really understand automatically what you have inside that data. The process of extracting the embedding can be expensive. You only run that once, though, and then the process of running inference on your quality signals is very, very cheap. It's a negligible marginal cost. So you can define as many of these as you want to look for as many different things as you want, and transform this unstructured data, this generated image maybe, into a bunch of different features that you can then use to understand and analyze your dataset.
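Editor's sketch: the "extract once, run many cheap signals" pattern described above. To keep it tiny, each signal is modeled as a single linear layer plus a sigmoid, which is a simplification of the small classifiers in the talk; the signal names, weights, and the three-dimensional embedding are all made up for illustration.

```python
import numpy as np

def run_signals(embedding, weights, biases, names):
    """Apply every signal head to one precomputed embedding in a single matmul."""
    logits = weights @ embedding + biases
    probabilities = 1.0 / (1.0 + np.exp(-logits))  # sigmoid per signal
    return dict(zip(names, probabilities.tolist()))

# Hypothetical signals, one row of weights per signal.
names = ["watermark", "cartoon_style", "garbled_hands"]
weights = np.array([[2.0, 0.0, 0.0],
                    [0.0, 0.0, 3.0],
                    [0.0, 1.0, 0.0]])
biases = np.array([0.0, 0.0, -1.0])

embedding = np.array([1.0, 0.0, -1.0])  # stand-in for an expensive extracted embedding
metadata = run_signals(embedding, weights, biases, names)
print(metadata)
```

The embedding extraction is the expensive part and happens once per item; adding another signal only adds one more row to the weight matrix.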
Gordon Hart [00:15:11]: So, in summary, quality signals are few-shot classifiers that we can use to add application-specific metadata by processing pre-extracted embeddings. These are data-efficient and compute-efficient ways to better leverage these embeddings that we may already have. If we have a RAG system, we probably already have embeddings extracted for all of the data that we want to look at. We can use this to transform unstructured data into structured data that is much, much easier for us to analyze with tooling. And this approach is applicable to both generative use cases, where we might want to run this for both model inputs and model outputs, as well as non-generative use cases, where we can run this on any of these unstructured model inputs. This works on any data modality that we can extract embeddings for. So the magic here really isn't in this approach, but it's making this approach easy and accessible. As AI engineers, we probably know that you can do pretty much anything with stuff that's off the shelf these days, but we definitely don't have the bandwidth to do everything.
Gordon Hart [00:16:08]: So it's not doing the thing once, but it's having the right tooling and processes to do this repeatedly and efficiently. That'll get us to quality. If we have these tools, as engineers, we can go back to focusing on our true objectives, on improving our models, on making our generated images better, and not on side quests like identifying these specific failure modes in our data. So that's where Kolena comes into play. We provide a platform that gives you tooling for all of these different things: exploring your data, adding additional metadata, running this quality signal process, evaluating models, and, yeah, thanks, everybody, for listening.
AIQCON Male Host [00:16:49]: Thank you, Gordon. We have still some time to do some questions. I just want to take a few minutes to see if anyone in the audience has any questions for this wealth of knowledge right here, Gordon, about AI quality model evaluation.
Gordon Hart [00:17:01]: Oh, thank you for volunteering. [Audience member: Hey, do you host these, like, tiny models?] We store them so you can then go download them back, but you can run them locally in Python and that's typically how you'd use it.
AIQCON Male Host [00:17:17]: All right. Hey, I know you.
Q1 [00:17:19]: I know you do. Hey, Gordon, one question I had was for, do you build a single classifier for detecting all the quality signals, or do you just, do you have separate small models to detect each one of them?
Gordon Hart [00:17:36]: Yeah. So in that case, you would go through that process of selecting a handful of positive and a handful of negative examples for each of those different signals. The process takes a couple of minutes and then you have that trained classifier. And so for any new feature that you want to look for, you just go through and select a couple positives, select a couple negative, and then you have that signal that you can use to build out this full bill of metadata.
AIQCON Male Host [00:17:59]: Great question. I'll get to you. And then I see your hand up over there.
Gordon Hart [00:18:02]: Thank you.
Q2 [00:18:05]: So I imagine some people are wondering how you go from randomly initialized weights and biases to a meaningful classifier, even if it's tiny, with eight examples.
Gordon Hart [00:18:17]: I can show you the notebook that generated these numbers if you want. Really, what we're looking at here is just we're talking 100 parameters to learn, and so it can be learned pretty efficiently. I don't know what to say beyond that. Yeah, in this particular case, those numbers are shared with a single hidden layer of 100 parameters that can be learned with a handful of examples.
Q3 [00:18:42]: Hi, Gordon. Thank you for the wonderful talk. I meant to ask, let's say you've recognized all the failure modes. I'm curious, how would one go about automating the process of recognizing failure modes other than just getting human feedback, let's say, or user feedback?
Gordon Hart [00:19:02]: That's a great question, and I think that's going to be one of the big and enduring questions of GenAI applications. One way that I would do it is automated edge case identification in that embedding space. If you find things that are off on the side spatially and that have a lot of different features maybe, or are separated from the main body of your data, that's a good way for you to identify that. And it's not like everything that's far away from the rest of your data, when you're spatially exploring your embeddings, is necessarily going to be a failure mode. But if you can quickly flip through the top ten clusters of edge cases that you have, you should maybe see: oh wait, hey, this one's a failure mode. I might want to be able to track that, and I probably want to remove that in the next version of my model. So asking your users is a great way to get signal, and there's really nothing you can do but look at your data in the most efficient way possible.
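Editor's sketch: a very simple stand-in for the edge case identification idea above. The answer mentions flipping through clusters of edge cases; this sketch instead ranks individual items by distance from the dataset centroid, which is a cruder technique chosen for brevity. The data is synthetic and `rank_by_centroid_distance` is a hypothetical helper.

```python
import numpy as np

def rank_by_centroid_distance(embeddings, top_n=10):
    """Return indices and distances of the items farthest from the dataset centroid."""
    centroid = embeddings.mean(axis=0)
    distances = np.linalg.norm(embeddings - centroid, axis=1)
    order = np.argsort(-distances)[:top_n]  # farthest first
    return order, distances[order]

rng = np.random.default_rng(0)
main_body = rng.normal(size=(200, 8))             # typical generations (synthetic)
edge_cases = rng.normal(loc=6.0, size=(5, 8))     # a small group far from the rest
embeddings = np.vstack([main_body, edge_cases])

candidates, candidate_distances = rank_by_centroid_distance(embeddings, top_n=5)
print(sorted(candidates.tolist()))
```

As the answer notes, distance alone does not make something a failure mode; the ranked list is only a shortlist for a person to review, after which a confirmed failure can be turned into its own quality signal.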
AIQCON Male Host [00:19:55]: Awesome. We have time for a few more questions. I want to get some people in the back. Any questions back here? Good. I see a hand.
Q4 [00:20:06]: Hey, a lot of us who have not trained models before are entering the space. So what would be your suggestion on, like, how do we learn about training different small models? What is the starting point? What are some types of architectures or models we should look at to just test it out? Basically, show the notebook and then give some references so we can read it.
Gordon Hart [00:20:23]: For something as simple as this, scikit-learn has a lot of great different classification strategies. We're looking at a multilayer perceptron, which is a very simple, kind of old-fashioned machine learning architecture. But you can also do something like regular logistic regression, which works pretty well for a task like this because you're just trying to separate one region from another region in this embedding space. So I would look at some tutorials that are going through different ways that you can train binary classifiers, small binary classifiers. There's lots of different models and approaches out there.
AIQCON Male Host [00:20:57]: Great question.
Q2 [00:21:02]: Hi, thank you for the talk. My question is, this seems to be simple enough to be used by non-technical folks like PMs and QA. So why do we need AI engineers to do this, like, to identify failure modes and do all of this work? Can it be done by non-technical folks?
Gordon Hart [00:21:21]: That's a very great question and definitely non technical folks. The devil is really in the UX details here. And if you make it easy enough for non technical people to say, hey, I don't like this thing that I see in this image, and you give them an interface they can actually use to explore data. Selecting positive and selecting negative examples is very, very easy and can take minutes. And then the training can be entirely abstracted away. So this is a very kind of numbers focused, machine learning, engineer focused thing. But this approach works really well for non technical people, QA analysts, product managers, people who do get their hands dirty with the data, but don't necessarily want to be programming or training models.
Q1 [00:22:01]: Thank you, Gordon. That was wonderful. I was just. It's a follow up to that previous question. Is it possible to generalize this so that, you know, anybody can use it for failure detection in any products and also to, you know, maybe data quality platform is it?
Gordon Hart [00:22:24]: I would love to continue talking with you about this afterwards. I'm going to try to give a quick answer, but that's a general question that can be hard to answer succinctly. I do think that having a UI to actually look at your production data or your training data or data that you're collecting, one that people who are not your ML engineers are using regularly, is very, very important for quality processes. You have a lot of people, whether they're SMEs (subject matter experts), or data labelers, or quality assurance people or product managers, you have a lot of people that know what your system should and shouldn't be doing. And in a lot of processes, teams kind of keep the data at arm's length away from those people. You definitely don't want to be doing that. You want them to be interacting with it as much as possible. And techniques like this that allow them to apply ML approaches to understand that data are even better because they get to leverage a lot of the technology without really having to understand it.
AIQCON Male Host [00:23:19]: Great question. We got time for a couple more questions.
Gordon Hart [00:23:24]: [Audience member: Yeah, thank you for the talk. I have a question about the setup. Wouldn't this be highly dependent on the quality of your embedding model?] Absolutely. This definitely depends on the quality of your embedding model. And we used an open source, pre-trained embedding model here, which is actually quite good for lots of images, even domain-specific images. I would recommend first looking at something like CLIP or SigLIP, because those embeddings are quite good. But there are ways to fine-tune embeddings extraction models on your own data without applying any labeling, self-supervised learning techniques that will improve the quality of embeddings. Or if you have your own models, just use embeddings from an internal layer of that model.
Gordon Hart [00:24:05]: One of the nice things about this approach is that you're labeling just your images or just your data points. And if you drop in better embeddings at some point in the future, the retraining process is effortless, can happen automatically, and therefore you get a better classifier at pretty much the only thing you've done is swap out your embeddings extraction model. Okay, thank you.
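Editor's sketch: why swapping the embedding model is cheap when labels are stored per item rather than per vector. To stay dependency-light, a nearest-centroid rule stands in for the small trained classifier from the talk; that substitution, the item IDs, and the toy vectors are all assumptions for illustration.

```python
import numpy as np

def train_centroid_signal(labels, embeddings):
    """labels: {item_id: 0 or 1}; embeddings: {item_id: vector}.
    Returns a scoring function; a positive score means closer to the positives."""
    pos = np.mean([embeddings[i] for i, y in labels.items() if y == 1], axis=0)
    neg = np.mean([embeddings[i] for i, y in labels.items() if y == 0], axis=0)

    def score(vector):
        return float(np.linalg.norm(vector - neg) - np.linalg.norm(vector - pos))

    return score

# Labels are attached to items, so they survive a change of embedding model.
labels = {"img_a": 1, "img_b": 1, "img_c": 0, "img_d": 0}

old_embeddings = {  # hypothetical 2-d vectors from the first extractor
    "img_a": np.array([1.0, 0.0]), "img_b": np.array([0.9, 0.1]),
    "img_c": np.array([0.0, 1.0]), "img_d": np.array([0.1, 0.9]),
}
new_embeddings = {  # hypothetical 3-d vectors from a replacement extractor
    "img_a": np.array([1.0, 0.0, 0.5]), "img_b": np.array([0.8, 0.1, 0.6]),
    "img_c": np.array([0.0, 1.0, -0.5]), "img_d": np.array([0.1, 0.9, -0.4]),
}

signal_v1 = train_centroid_signal(labels, old_embeddings)
signal_v2 = train_centroid_signal(labels, new_embeddings)  # same labels, no relabeling
print(signal_v1(np.array([1.0, 0.0])), signal_v2(np.array([0.0, 1.0, -0.5])))
```

The same labeled item IDs drive both training runs, so the only manual work, selecting positives and negatives, is never repeated.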
AIQCON Male Host [00:24:26]: Great question. All right, I feel like Vanna White.
Q5 [00:24:33]: So thanks for the great talk. I wonder. So you showed several different failure modes for different modalities, and even within the image modality, there are several failure modes. I wonder, like, is it the same difficulty, or as simple, to build good performance classifiers for all of these failure modes, or did you find some of them more difficult?
Gordon Hart [00:24:56]: Depending on how obviously a concept that you're trying to isolate is defined in embedding space, it might take more or fewer examples. This, I would say, was on the easier side, where really just ten, eight images for positive and eleven for negative allowed us to get to almost perfect performance. That might be on the easier side, but you can do it for harder concepts as long as they're represented in that embedding space. And that's, I think, where things are more likely to fail, to break down, than in the labeling of your data with positive or negative. Not every failure mode will be represented in an embedding space. For example, the big H word, hallucinations, on everybody's mind right now: factual consistency is probably not going to be represented spatially in your embeddings. And you probably can't use an approach like this for hallucination detection. You can use it for things like style classification, whether or not you responded with the right level of verbosity,
Gordon Hart [00:25:56]: what kind of question your user's asking you, whether or not you gave some specific failure modes like financial advice or medical advice. Those sorts of things will be represented quite well in that embedding space, likely, but things like factual consistency, you know, did Dumbo the elephant swim across the English Channel? Your embedding is probably not going to encode the answer to that question.
AIQCON Male Host [00:26:16]: Great question. We have time for one more question before our next speaker. Any more questions? Oh, you're going to make me walk for it.
Gordon Hart [00:26:25]: Coming. Wow, I did a bad job here. I was planning to talk and leave. No time for questions.
Q2 [00:26:38]: Hey, great conversation. How does this fit into data governance processes at a large enterprise? How are you thinking about bringing that into that process?
Gordon Hart [00:26:46]: So governance can be a really multifaceted problem. I think having the ability to take large amounts of unstructured data, like a lot of images or documents that you're collecting, and extract features like this that say, hey, this is this kind of document, or they're asking these kinds of questions, or we responded with this tone, or this contained a lot of, you know, this particular failure mode. Extracting these kinds of signals in aggregate on data, and that means having some efficient way to define them and extract them, can give you a lot of additional signal that helps you govern your data, because you know what your data is a lot more.