Stopping Hallucinations From Hurting Your LLMs
Atindriyo is the founder and CTO of Galileo, a San Francisco-based machine learning company. Prior to Galileo, he spent 10+ years building large-scale ML platforms at Uber and Apple, where he was an Engineering Leader on Uber's Michelangelo ML platform and co-architect of Michelangelo's Feature Store (the world's first feature store), which drove ML across the company.
This talk is all about quantifying hallucinations, the most critical evaluation metric for LLMs. We'll dive deeper into what hallucinations mean in modern-day LLM workflows and how they affect model outcomes (and downstream consumers of LLMs). We will also discuss novel and efficient metrics and methods to detect hallucinations early on in the process, so we can help prevent disinformation and poor or biased outcomes from large language models, thereby increasing trust in your LLM systems.
Introduction
My next guest I'm so pumped about: this is the CTO of Galileo. He has over 10 years of experience at some massive companies like Uber and Apple. He was on the Michelangelo team at Uber, and I think he was part of the team that created maybe the first feature store ever. We'll have to bring him on here to verify that.
So without further ado, let's hear from Atindriyo. Hello, how's it going, Lily? Nice to be here, and thank you for having me. I'm doing great. How are you? It's nice to see you. You're looking a little blurry, so maybe we'll give you a moment. Okay, I think it's looking a little bit better. How are you doing?
Thank you so much for joining us. No, likewise, thank you for having me. This is very exciting. Yes, and I'm very excited to share some new work on hallucinations that we've done over at Galileo. Yeah, I've been really looking forward to this talk. So let's let you get started, and here are your slides. Super, awesome.
Thank you so much. So yeah, first of all, thank you for having me here and thank you for joining this Lightning Talk. There's a lot to talk about and we only have 10 minutes, so I'll just get right to it. Just a bit about me: I think Lily already mentioned, I'm one of the founders of Galileo, which is an ML data intelligence company.
Prior to Galileo I spent most of my career building machine learning platforms and ML systems, spanning the past 10 years. I worked on Siri at Apple for many years, and as a staff engineer I led a lot of the core components of the Michelangelo platform at Uber, where I primarily focused on data quality and data diagnostics for machine learning.
We also created the first feature store a few years ago, as Lily mentioned. So I'll get right to what we're going to talk about today. It's all about LLMs, and in particular one of the key data quality issues that plagues a lot of these LLMs and can potentially be a very big deterrent to productionizing practical LLM systems in the coming months and years. Today it's termed hallucinations, but I want to dive a little deeper into what they are, what they mean, and how you can detect and avoid them in your systems. There's no question that today's LLMs, especially GPT-3 and above, are extremely impressive in their responses.
But if you really squint, they're wrong much more often than you'd think. These errors include factual mistakes, and misleading text that may look linguistically correct but is often wrong. And this is what the term hallucinations typically refers to.
What leads to these hallucinations? It's typically more of a data problem, as you'll see. When you train these foundational models, there are a lot of issues around overfitting the model to the data. There are hidden class imbalances, and insufficient data of a certain type that's missing from the training and eval datasets.
Often validation datasets have less coverage than their training baselines. Encoding mistakes, as well as poor-quality prompting: all of this is a drop in the ocean of the reasons why an LLM could hallucinate. And the outcomes of hallucinations are basically what you see on the right.
I don't think that needs much of an introduction. So now, how do you tell whether a model is hallucinating or not? Or rather, can the LLMs themselves tell us whether they're hallucinating? To answer this question, we need to look a little deeper into the model itself. These LLMs are essentially next-token prediction machines: at each moment they're choosing the next best token, or the next best word, to spit out from a collection of tokens, with a probability distribution assigned to each of them.
Now, this is a very 20,000-foot view of what a sequential model typically looks like. You can double-click into the transformer box, and there are many variations of it, but for the purposes of this talk I think we can do with this simple view. The key thing to remember is that there are token-level probabilities and distributions that tell us a lot about the model's outcome, or what the LLM thinks about its outcome, and they're one of the key indicators of hallucination.
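To make that concrete, here's a minimal sketch (not from the talk; it assumes the Hugging Face transformers library, with GPT-2 as a stand-in model) of pulling per-token probabilities out of a causal language model:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

text = "The capital of France is Paris"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, seq_len, vocab_size)

# Probability the model assigned to each observed token,
# conditioned on the tokens before it.
probs = torch.softmax(logits[0, :-1], dim=-1)
token_ids = inputs["input_ids"][0, 1:]
token_probs = probs[torch.arange(len(token_ids)), token_ids]

for tok, p in zip(tokenizer.convert_ids_to_tokens(token_ids.tolist()), token_probs):
    print(f"{tok!r}: {p.item():.4f}")
```

Low probabilities on content-bearing tokens are exactly the kind of signal the rest of the talk builds on.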
We'll get into that a little bit in the coming slides. But first, I want to base this presentation on a lot of the LLM experiments on hallucinations that we've done, some very promising results that we've seen, and how we're baking it into the Galileo product.
I'll touch on that a bit towards the end, but first we wanted to define a problem statement for our experiments. That essentially came down to quantifying hallucinations through a metric which can be computed automatically for all kinds of LLM responses. And beyond that, it's important to point out subtext-level hallucination, which means that often these models will spit out blobs of text, and there's subtext within them, maybe a sentence or two, which is hallucinated.
So this metric needs to be very granular. We went over many methods internally, but the summary of the method we landed on is that we wanted to use open-ended text generation, where we would curate a set of inputs and LLM outputs as our dataset.
We did a lot of exploration on standardized datasets. We would then examine the token-level probabilities for each of these completions, channel them to a neutral third-party model to get some extra signal, and then eventually determine if the base completion (the output of the base model) was hallucinated or not.
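To make the granularity requirement concrete, here's a rough sketch (not from the talk; `token_logprob_fn` is a hypothetical callable standing in for whatever model or API supplies log-probabilities) of scoring a completion sentence by sentence rather than as one blob:

```python
from typing import Callable

def sentence_level_scores(completion: str,
                          token_logprob_fn: Callable[[str], list[float]]) -> list[tuple[str, float]]:
    """Score each sentence of a completion separately, so a single hallucinated
    sentence inside an otherwise correct answer can still be flagged."""
    # Naive sentence split; a real pipeline would use a proper sentence segmenter.
    sentences = [s.strip() for s in completion.split(".") if s.strip()]
    scored = []
    for sent in sentences:
        logprobs = token_logprob_fn(sent)
        # Use the minimum token log-probability as the sentence-level signal:
        # one highly "surprising" token is enough to make the sentence suspect.
        scored.append((sent, min(logprobs)))
    return scored

# Sentences with the lowest scores are the first candidates for review.
```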
So that's the summary of the methodology. There are some assumptions we've made here, and one of the key ones is that all these models are state of the art, and by state of the art I mean GPT-3.5 and above. This is a very important assumption, because a lot of these models, especially GPT-3 and above, are trained on a wide range of knowledge and are capable of answering many kinds of questions which would otherwise stump the older, pre-GPT-3 models.
So this new definition of hallucination that we're creating cannot be based on some of the older research from two years ago, which involved analyzing mistakes those models made, because the newer models are pretty much immune to them.
The goal for the metric is to focus on accuracy, because it needs to be as accurate as possible in detecting hallucinations, as well as diversity, because open-ended text-generation models can be used for a wide variety of tasks, as opposed to some of the more limited model architectures.
So the hallucination metric has to work well across different kinds of tasks. Now, when we started the experiments, we essentially employed a two-model paradigm. There's a completion model, which is the model of interest, the actual model that spits out the output, and then there's the probability model, which spits out the token probabilities.
In an ideal world they would be the same model, but more often than not, especially with OpenAI models and some of these other proprietary models, there are situations where you get completions but you don't get probabilities. Not all models give you enough information, especially when you consume them through APIs.
So we had to do a bunch of experiments figuring out which combinations of probability and completion models gave us the highest bang for the buck in terms of detecting hallucinations.
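One way to picture the completion-model / probability-model split is the sketch below (not from the talk; GPT-2 via Hugging Face transformers is just a stand-in for a locally hosted probability model, and the completion would come from whichever API produced it):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

prob_tok = AutoTokenizer.from_pretrained("gpt2")
prob_model = AutoModelForCausalLM.from_pretrained("gpt2")

def score_completion(prompt: str, completion: str) -> list[float]:
    """Per-token probabilities that the probability model assigns to `completion`."""
    ids = prob_tok(prompt + completion, return_tensors="pt")["input_ids"]
    n_prompt = prob_tok(prompt, return_tensors="pt")["input_ids"].shape[1]
    with torch.no_grad():
        logits = prob_model(ids).logits
    probs = torch.softmax(logits[0, :-1], dim=-1)
    targets = ids[0, 1:]
    token_probs = probs[torch.arange(len(targets)), targets]
    # Keep only tokens belonging to the completion (the boundary is approximate,
    # since prompt and prompt+completion may tokenize slightly differently).
    return token_probs[n_prompt - 1:].tolist()

# `completion` would be the output of the (possibly API-only) completion model.
print(score_completion("The Eiffel Tower is located in ", "Berlin."))
```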
Talking a little bit about the datasets we experimented on, there was a whole host of datasets that we explored. In general, this area is very new, so we had to create some datasets of our own. But amongst the existing ones, some of the top candidates, which gave us a decent bang for the buck, were the following. There's the SelfCheckGPT WikiBio dataset, which is essentially a dataset of Wikipedia biographies.
There's the Self-Instruct human evaluation dataset, which is a more open-ended text generation dataset. And the one which gave us the most promise, and which we found most challenging for some of these newer models, was the Open Assistant dataset, which is for an open-source, ChatGPT-like assistant.
So these were the datasets we conducted a good chunk of our experiments on. Then we got into the metrics and the baselines that we wanted to create for these datasets. For the baselines, we essentially used three key metrics. One was log probs, which is essentially the log of the probability of each token that appears in the completion.
There's PPL-5, which is a metric published in one of the papers I've referenced in this presentation, and it essentially measures the entropy of the model's probability distributions, particularly over the top five tokens. And then finally there's pseudo entropy, which is a metric that Galileo created; it's a heuristic on top of Shannon entropy.
Again, we look at the top five token responses from the LLM output. And then across these metrics we evaluated average versus minimum, because a lot of these metrics are at a token level, but the output of the LLM overall is a blob of text.
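As a rough illustration (the exact PPL-5 and pseudo-entropy formulas aren't spelled out in the talk, so treat this as an approximation of the idea, not Galileo's implementation), here is how token-level scores might be aggregated into a response-level number:

```python
import math

def mean_and_min_logprob(token_probs: list[float]) -> tuple[float, float]:
    """Average vs. minimum log-probability over all tokens in the completion."""
    logps = [math.log(p) for p in token_probs]
    return sum(logps) / len(logps), min(logps)

def top5_entropy(per_step_top_probs: list[list[float]]) -> float:
    """Average Shannon entropy over the top-5 candidate tokens at each generation step."""
    entropies = []
    for dist in per_step_top_probs:
        top5 = sorted(dist, reverse=True)[:5]
        z = sum(top5)
        top5 = [p / z for p in top5]  # renormalise the truncated distribution
        entropies.append(-sum(p * math.log(p) for p in top5 if p > 0))
    return sum(entropies) / len(entropies)

# token_probs: probability of each generated token; per-step lists: top candidate distributions.
print(mean_and_min_logprob([0.91, 0.40, 0.07, 0.85]))
print(top5_entropy([[0.91, 0.03, 0.02, 0.02, 0.01], [0.40, 0.30, 0.15, 0.10, 0.05]]))
```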
So we've done a good number of comprehensive experiments around this; almost think of it as a cross join of all of these. There's another set of baselines that we explored, which we term multi-model baselines. The intuition driving these is that a third-party model can give us extra information, or key signals, about hallucinations because it's not biased by its own data.
Here we looked at, for example, a few API-based baselines, which were all using GPT-3.5, or ChatGPT. In the first one, ChatGPT, we used the GPT-3.5 model to essentially write a question and an answer, and then used the same model as a grader for the answer.
We saw mediocre results from this particular method. ChatGPT Token was the second baseline we established, and it was an improvement over just looking at log probs, because often what we noticed in a lot of these LLM responses is that the model is highly uncertain about some of its initial tokens.
That drives the min log prob to a much lower value, almost unnecessarily. So in order to avoid those kinds of phrase-level uncertainty biases, we used ChatGPT Token.
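The adjustment is only described at a high level here, so the following is a hedged sketch of the idea: ignore the first few tokens, where the model is often uncertain for benign phrasing reasons, before taking the minimum log-probability. The skip count is an arbitrary choice for illustration.

```python
import math

def min_logprob_skip_prefix(token_probs: list[float], skip: int = 3) -> float:
    """Minimum log-probability over the completion, ignoring the first `skip` tokens."""
    usable = token_probs[skip:] or token_probs  # fall back if the response is very short
    return min(math.log(p) for p in usable)

# The uncertain opening tokens (0.05, 0.10) no longer dominate the score.
print(min_logprob_skip_prefix([0.05, 0.10, 0.60, 0.92, 0.88, 0.95]))
```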
Then we created a new method called ChatGPT Agreement, or ChatGPT Friend, which gave us really good results. Here we basically leveraged the 3.5 model to do reruns on the base model and get multiple outputs, and then see if there's enough agreement between the completion and the reruns. The intuition there, again, is that if a model is hallucinating, then across multiple runs it would likely give vastly different responses, as opposed to something it's very sure about.
We got some very interesting results using this particular method. There are some other multi-model baselines that we've also explored, which typically include self-check mechanisms where you do reruns and then compute BERT and BLEU scores to measure the distances between all pairs of completions.
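A simplified sketch of that rerun-agreement idea follows (not Galileo's implementation: `generate` is a hypothetical callable wrapping whichever model or API you use, and plain token overlap stands in for the BLEU / BERTScore comparisons mentioned above):

```python
from typing import Callable

def rerun_agreement(prompt: str,
                    original: str,
                    generate: Callable[[str], str],
                    n_reruns: int = 3) -> float:
    """Average lexical agreement between the original completion and fresh reruns."""
    def overlap(a: str, b: str) -> float:
        ta, tb = set(a.lower().split()), set(b.lower().split())
        return len(ta & tb) / max(len(ta | tb), 1)

    reruns = [generate(prompt) for _ in range(n_reruns)]
    scores = [overlap(original, r) for r in reruns]
    # Low agreement across reruns suggests the original answer may be hallucinated.
    return sum(scores) / len(scores)
```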
Just to highlight, the experiments we've done involve all of these baselines. Now, going a little deeper into the results, I just want to quickly highlight the most promising ones we saw. (Host: one minute left.) So the first one gave us 63% precision when we used DaVinci as both the completion and probability model.
And then using pseudo entropy, we got 69% accuracy in detecting hallucinations. I've pasted some charts here for you to go over offline, but here's what we found. These are outputs of state-of-the-art OpenAI models: quotations from Shakespeare's The Tempest which were completely made up, non-existent books, as well as URLs that do not exist on the web.
And this is a drop in the ocean of the kinds of things that we saw. So we've baked hallucination, as well as some of our other well-known metrics such as Data Error Potential, into our system. And we have two new tools. One is to help you create and manage prompts; it's called the Prompt Inspector, and it allows you to select the best prompts.
And finally, if you're fine-tuning a foundational model, we use Data Error Potential as well as our new hallucination metric to show you noise in your training data, as well as hallucinations in your evaluation data and the model's outputs. So, just cutting to the chase.
These are the two products that will be launched. Just a quick shout out to the team that has made this possible; they're incredible people. And finally, we're launching LLM Studio very soon. Just go to rungalileo.io/llm-studio, and I can't wait for you to use the tool. Thank you.
Awesome, thank you so much. And definitely share any links in the chat as well. To answer your question, San Keith: all of these talks will be recorded, and we're collecting the slides, so we'll make sure to get those into your hands. Thank you so much, this was wonderful. Thank you so much.
Pleasure. Of course.