Shrinking the Generation-Verification Gap with Weak Verifiers
SPEAKERS



Arthur Coleman is the CEO at Online Matters. Additionally, Arthur Coleman has had 3 past jobs, including VP Product and Analytics at 4INFO.
SUMMARY
Language models are getting better at reasoning but their ability to verify their own outputs still lags behind. This paper tackles that challenge head-on by introducing Weaver, a framework that combines multiple weak verifiers into a single, stronger verifier without relying heavily on labeled data.
Weaver uses weak supervision to estimate verifier reliability, normalize inconsistent outputs, and filter low-quality signals, resulting in a unified score that better reflects true response quality. In practice, this approach significantly boosts reasoning and math task performance, rivaling models several times larger, such as achieving o3-mini-level accuracy using only Llama 3.3 70B as the generator.
TRANSCRIPT
Arthur Coleman [00:00:00]: The confusion, and we'll get going, and I think we'll only be a couple minutes late. We should be okay. Today we are covering a very interesting paper, and I'll explain in a minute why I think so, called Shrinking the Generation-Verification Gap with Weak Verifiers. And we're honored to have the paper's lead author, Jon Saad-Falcon, with us to do the presentation. Jon is a PhD candidate in Computer Science at Stanford, but more importantly he's a member of the technical staff of the Stanford Scaling Intelligence Lab. And I'm going to stop sharing for a moment and go over to this screen and show you this, because I think it's important. This is an interesting group that's doing some very interesting work at Stanford and covering a lot of interesting areas. A group that, if you're in the Bay Area, you should probably plug into, because a lot of the work they're doing is for practitioners and things that we're very much interested in.
Arthur Coleman [00:01:09]: So I'm not plugging them in any particular sense, but I found them and it was like, oh, I need to do more with these people, because they're doing some really leading-edge stuff that I can make use of every day. Okay, back to the presentation.
Bauke Brenninkmeijer [00:01:25]: Okay.
Arthur Coleman [00:01:25]: And Anna Yoon and I are your hosts for the day. Adam Becker could not be with us; he had a conflict that he had to attend to. So it'll just be Anna and I hosting. I'm basically going to be sort of guiding. I'm on a single computer again.
Arthur Coleman [00:01:47]: Normally I'm in with multiple windows, multiple screens. It's very hard for me to lead the Q&A this way, so Anna will be leading the Q&A. And as a reminder, and I should put it in the chat. Anna, did you put it in the chat? We should put in the link to the Google Doc that we use to take questions. So the way it works is there's a Google Doc and you put your questions in there with your name, and it's like a queue: first in, first out.
Arthur Coleman [00:02:18]: So we will turn to you and say, hey John, please ask your question. We appreciate it if you turn on your video so that Jon can see who's talking to him. And I realize in many cases we're in our bathrobes and, you know, I look like Albert Einstein in the morning, so if you don't want to be on video, that's fine. But it helps if we are. So that'll be how it works.
Arthur Coleman [00:02:44]: If the link isn't in there now, as soon as I'm done talking, I'll put the link into the chat. There's also a Miro board, which you will see, and I'll put that link in the chat as well. The guiding principle of the reading group, and those of you who've been here before have heard this and can tune out, is that these sessions belong to you. The more that you participate, the more we are all going to get out of it. We've had some really intense, interactive discussions where people come away going, this is the best. This is just great.
Arthur Coleman [00:03:14]: So Jon will present today, probably for 35 minutes, 40 minutes maybe, and then we'll have 20 to 25 minutes for questions. There are no dumb questions, okay? This is a no-judgment zone, please. We're all here to learn, so feel free to ask whatever it is. I'm usually the guy asking the dumb question, so it's easy for me to say that. Your questions again go into the Google Doc and the link is there, but I'll put it in the chat in a moment, and as I mentioned, you will ask the questions. We will recognize you, you will ask your question, and Jon will answer it. Lastly, one more thing: don't forget to fill out the post-event survey.
Arthur Coleman [00:03:56]: It's very important because we try to serve you. We want to make sure that we're giving you events that are good uses of your time and that we format them correctly, which is important, so that you get the most out of them. Now let me take a moment and say why. Jon is actually my invite, and that's unusual, but I had a pleasant surprise when he was one of the keynotes at the PyTorch conference a few weeks ago. I heard his presentation and immediately went, this is a problem that's been bugging me for years, and here's a solution. And as I was prepping this meeting, I realized why this idea of weak verifiers is important in our world: often in B2B, like when you're doing an internal AI, there aren't a lot of data sets. Like, if you're trying to train an AI to work with Workday and provide information to employees in a 5,000-person company, and you want to train it to respond correctly, that's not a lot of training data. And so how do you train an AI when you only have access to small data sets?
Arthur Coleman [00:05:01]: And that's the problem that I faced over the years that got my attention. But as I thought about it, there's something even more important here. Jon's and his team's approach implements in AI a process that we as a species have actually vetted over thousands and maybe millions of years. So let's take some examples in research today, like the scientific method. Someone in pharma developing a new drug will run a test on 25 or 30 people, and it looks like that's a good drug. But you need to have independent verification. You have to have peer review. That is the idea of weak verification.
Arthur Coleman [00:05:40]: You have multiple small samples that ultimately verify a hypothesis and prove a point. The second example is language. Children learn language; a child is like an empty, untrained large language model. And how does it get trained? Well, it meets lots and lots of people. It hears things. There's a book called The Scientist in the Crib which, if you haven't read it and you're a parent, is worth reading.
Arthur Coleman [00:06:02]: Children experiment; they throw things out as experiments. They get weak verification from one person, they go to another person, they hear it there. Okay, now I've got two points, three points. Weak verification is a method that nature, and we, have already evolved, so it's a successful method. What Jon and his team have discovered, and it's not a small discovery, is that you can use the same method for chaining LLMs. So I wanted to put that in context, because I thought those were sort of interesting ways of looking at the problem. So, Jon, I've done my thing; I turn it over to you. I hope I've given you good context and set you up for success.
Arthur Coleman [00:06:39]: I'll let you take your machine.
Jon Saad-Falcon [00:06:41]: Well, thank you so much for the very warm introduction. Let me share my screen real quick.
Bauke Brenninkmeijer [00:06:51]: Perfect.
Jon Saad-Falcon [00:06:53]: But yeah, no, I think the introduction covered most of what I wanted to cover, so just a quick refresher. I'm Jon. I'm a third-year PhD student at Stanford, advised by Azalia Mirhoseini and Chris Ré. Also, yeah, a member of the Scaling Intelligence Lab as well as Hazy Research, for those familiar, and broadly focused on and interested in ML, foundation models, and ML systems. As of late I've been really interested in verification because, as Arthur pointed out, it's one of the key bottlenecks of a lot of different areas. Not just different problems in reasoning and different problems in mathematics, but also just in general how we actually deploy these language models and deploy these AI systems towards pharmaceuticals, towards finance, towards medicine, towards no shortage of things.
Jon Saad-Falcon [00:07:41]: Verification tends to be the bottleneck. And so this was a project near and dear to my heart. We'll be presenting it at NeurIPS very soon. But yeah, thank you all for coming. So, yeah, without further ado, I'll get right into it. So I don't think it's any surprise that we've seen a lot of different applications for scaling inference compute as of late. This started in late 2024 with the Scaling LLM Test-Time Compute Optimally paper from Snell et al. at Berkeley. But since then there's been no shortage of different approaches for scaling test-time compute.
Jon Saad-Falcon [00:08:14]: So most famously there's the o1 and the o3 series from OpenAI. There's also all the different Claude models from Anthropic that are deploying inference-time compute for a variety of different tasks. But there's no shortage of different systems for spending more compute FLOPs at inference and getting better results across different tasks. For this paper we wanted to examine one of the most robust and simple versions of test-time compute, which is just best-of-N, in which you generate N different solutions and then you select the best answer amongst those N candidates. Within this regime, an interesting phenomenon begins to occur which is common amongst all kinds of inference-time compute techniques, but shows up most clearly in the simplest version, and that is this idea of a generation-verification gap. In this setting, language models are often capable of generating the correct response, but they fail to identify it. And so you get this increasing gap between a model's generation ability, or its pass@k, and its verification ability, or its selection@k, its ability to actually choose amongst the different k responses.
Jon Saad-Falcon [00:09:19]: And so this is true across different tasks. This is true across different models, as you can see in these graphs on the right from the large language monkeys paper. But what we wanted to see was to what extent can we close this generation verification gap in a repeated sampling regime, AKA best of N. So we want to see to what extent we can use weak verifiers. And weak verifiers are attractive for a couple different reasons. One, they're a function that can score up responses imperfectly but still provide some positive signal. And so while they're imperfect, they're very cheap to scale and they're very cheap to train. And so there's lots of different examples of weak verifiers available today off the shelf.
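To make the pass@k versus selection@k distinction above concrete, here is a minimal illustrative sketch (not code from the paper; the toy data and the random verifier scores are stand-ins) that computes a model's generation ability and its verifier-based selection ability on the same batch of samples:

```python
import random

def pass_at_k(correct_flags):
    """Generation ability: did ANY of the k sampled responses solve the problem?"""
    return any(correct_flags)

def selection_at_k(correct_flags, verifier_scores):
    """Verification ability: is the single response the verifier ranks highest actually correct?"""
    best_idx = max(range(len(verifier_scores)), key=lambda i: verifier_scores[i])
    return correct_flags[best_idx]

# Toy data: for each problem, k sampled responses as (is_correct, weak_verifier_score) pairs.
# In practice is_correct comes from ground truth and the score from a reward model or LLM judge.
random.seed(0)
problems = [[(random.random() < 0.3, random.random()) for _ in range(16)] for _ in range(200)]

gen = sum(pass_at_k([c for c, _ in p]) for p in problems) / len(problems)
ver = sum(selection_at_k([c for c, _ in p], [s for _, s in p]) for p in problems) / len(problems)
print(f"pass@16 (generation): {gen:.2f}   selection@16 (verification): {ver:.2f}")
```

With a purely random scorer the selection accuracy sits far below pass@k; any off-the-shelf weak verifier can stand in for that scorer, and Weaver's goal is to push the second number toward the first.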
Jon Saad-Falcon [00:09:57]: Some examples include reward models off of RewardBench as well as LM judges off of, like, Chatbot Arena, which is kind of the standard for where people tend to look for LM rankings. What we found was, when we were basically creating ensembles of these verifiers, so an ensemble in this case is a bunch of different reward models and a bunch of different LM judges called at once, the weighted combinations of these verifiers did very, very well. Like, they were able to outperform naive combinations or just naive weightings of these different verifiers. Now, the problem with using a weighted verifier ensemble, of course, is that you need annotations, or you need labeled data points, to be able to actually learn the weights of these different models. Otherwise, you wouldn't be able to actually learn any sort of weighting and you'd have to use just a naive combination of their scores. However, getting annotations and actually labeling these data samples is very expensive, both in terms of time and money. We wanted to see if there's other ways to learn this weighting in an unsupervised fashion.
Jon Saad-Falcon [00:11:06]: We wanted to explore a branch of statistical techniques called weak supervision. These techniques are most useful for aggregating multiple weak and noisy labeling sources. The example that they often provide for weak supervision is predicting the weather. There's lots of different things that you can use to predict the weather tomorrow or a week from now. You could use different humidity signatures, you could use the weather from yesterday. You could use the sunlight from today. You could use the time of year. And there's lots of different signals that could be combined, but they're ultimately not definitive, in the sense that you can't just rely on one of them or some simple combination of them.
Jon Saad-Falcon [00:11:41]: You need to learn some sort of weighting or some sort of aggregation strategy for combining them. That's where weak supervision really shines. It allows you to do this aggregation without actually having huge amounts of labeled data. And so, by combining weak supervision with these weak verifiers, we decided to propose Weaver, which combines these different weak verifiers by learning the optimal weights in an unsupervised fashion. And this enables us to actually combine inconsistent input formats as well as different verifier qualities, all while using only a small set of labeled examples, usually at the scale of 5 to 10 data points. And so this transforms a set of weak verifiers into a single, much stronger verifier that can improve success rates by 27.8% on reasoning and math tasks, which is what we examined in this paper. So how does Weaver work? In three main stages: scoring, weighting, and selecting. For the first stage, we collect all of the different outputs for a given problem.
Jon Saad-Falcon [00:12:44]: For example, the problem could be, what's the capital of France? We generate a bunch of possible solutions for that problem. Once we generate all these different responses, we then ask the verifiers to go through these different responses and score them. Once we have these different scores, we can then normalize them to a common scale, for example, from 0.0 to 1.0, just so they're all normalized to the same range, and then filter out any low-quality verifiers. These low-quality verifiers tend to be verifiers that only output a single score or tend to be very, very skewed or biased in their actual distribution. Now, once we have this consolidated set of verifiers, we apply weak supervision to estimate the different verifier accuracies using the minimal labeled data. And this minimal data, again, is usually 5 to 10 data points. But from these 5 to 10 data points, we can calibrate the whole ML aggregation strategy such that we know which verifiers are correlated with each other, which verifiers are not correlated with each other, what the respective accuracies of these different verifiers are, and what they are good for in terms of which tasks and which topics. And so we can extrapolate a lot from that very small set of data points, which allows us to generalize to a much larger set. And finally, once we have this refined set of verifiers and their aggregation strategy, we then combine the weighted scores to choose the highest-confidence response.
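As a rough sketch of that scoring-weighting-selecting flow: the real Weaver estimates verifier accuracies with a weak-supervision latent variable model, but the simplified stand-in below just uses the handful of labeled responses directly, and all array shapes, thresholds, and data are illustrative rather than the paper's.

```python
import numpy as np

def normalize(scores):
    """Min-max normalize one verifier's scores to [0, 1] so every verifier shares a scale."""
    lo, hi = scores.min(), scores.max()
    return np.full_like(scores, 0.5) if hi == lo else (scores - lo) / (hi - lo)

def filter_verifiers(scores):
    """Drop degenerate verifiers, e.g. ones whose scores are (nearly) a single constant value."""
    keep = [v for v in range(scores.shape[0]) if scores[v].std() > 0.05]
    return scores[keep]

def estimate_weights(scores, labeled_idx, labels):
    """Crude stand-in for the weak-supervision step: estimate each verifier's accuracy on a
    handful of labeled responses and weight it by how far it sits above chance."""
    acc = np.array([((scores[v, labeled_idx] > 0.5) == labels).mean() for v in range(scores.shape[0])])
    w = np.clip(acc - 0.5, 1e-3, None)   # verifiers at or below chance get ~zero weight
    return w / w.sum()

def select_best(scores, w):
    """Weighted ensemble score per candidate response; keep the single highest-confidence one."""
    return int(np.argmax(w @ scores))

# Toy example: 5 weak verifiers scoring 100 candidate responses to one problem.
rng = np.random.default_rng(0)
raw = rng.random((5, 100))                         # rows = verifiers, columns = responses
S = filter_verifiers(np.stack([normalize(r) for r in raw]))
labeled_idx = np.arange(8)                         # roughly the 5-10 labeled points from the talk
labels = rng.integers(0, 2, size=8).astype(bool)   # their ground-truth correctness
weights = estimate_weights(S, labeled_idx, labels)
print("index of the selected response:", select_best(S, weights))
```

The design point carried over from the talk is that only the weighting step touches labels, and only a handful of them; everything else runs purely on the verifiers' own scores.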
Jon Saad-Falcon [00:14:07]: So what's nice about our approach is, while you might have 100 different responses for a given problem, you only need to pick one and just be confident in that single response. So what's really exciting about Weaver is it allows us to beat a lot of these frontier language models and shrink the generation-verification gap between these language models and their actual oracle performance. And so the first thing that I wanted to highlight was the performance of Weaver compared to majority voting. And so majority voting means just take 100 samples and then find the most common response, in terms of what that solution is, and pick that solution. Additionally, we tried just picking the highest-scoring reward model, in this case whatever is the highest-scoring reward model on RewardBench, and using that to select our optimal response. What we found was that Weaver was able to outperform these two pretty strong baselines by 14% or more, depending on the task, and do so pretty handily. Additionally, we find that Weaver was able to help us close the gap with some of these frontier foundation models such as GPT-4o, Sonnet, and Llama 4 Maverick, and sometimes even exceed them. As you can see, with GPT-4o and Sonnet we were actually able to move past them.
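For reference, the majority-voting baseline described above is just the mode over the extracted final answers; a minimal sketch (answer extraction is assumed to have happened upstream):

```python
from collections import Counter

def majority_vote(final_answers):
    """Pick the most common final answer among the sampled responses (ties broken arbitrarily)."""
    return Counter(final_answers).most_common(1)[0][0]

print(majority_vote(["42", "41", "42", "7", "42"]))  # -> "42"
```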
Jon Saad-Falcon [00:15:21]: Excitingly, we were also able to rival the state-of-the-art reasoning model at that time, which was o3-mini. That's super cool to see, especially because Weaver is only using open-source language models like Llama and Qwen. Additionally, we're able to close the gap with the actual frontier performance, the theoretical maximum of our performance, which is oracle verification, in which, if we have 100 samples, we'd always be able to pick the correct response every single time. Beyond our default setting for Weaver, we wanted to explore a couple other settings that could help us scale compute towards verification. The first one we wanted to explore was scaling the sample count, scaling the number of generations. Then we wanted to scale the model size, so the actual model sizes used for generation and verification. Afterwards we wanted to explore scaling the verifier count, so scaling the number of models we actually use for verification. Finally, we wanted to explore just scaling raw inference compute.
Jon Saad-Falcon [00:16:16]: So just spending more FLOPs, more compute operations, for generation and verification across the board. So we found for scaling generations, it continued to boost Weaver performance up to 256 samples and sometimes even beyond. And this was true across a couple different settings. So this was true for GPQA Diamond. For those unfamiliar, GPQA Diamond is a benchmark meant to measure PhD-level reasoning in scientific topics like physics, biochemistry, organic chemistry, inorganic chemistry, biology, and astrophysics, so a variety of different topics. As well as MATH-500, which is college- and high-school-level mathematics. And finally MMLU-Pro, which is just general reasoning across a variety of different topics, so mathematics, business, computer science, engineering, and more. Additionally, we found that Weaver can help us reduce the gap between different model classes, and this is one of the most exciting results.
Jon Saad-Falcon [00:17:13]: This basically shows that when we're trying to close the gap between the performance of Llama 8B and Llama 70B, so over a 7x difference in parameter count, Weaver is able to help us close that gap, to get Llama 8B to rival the performance of Llama 70B in this setting. Additionally, for Llama 70B, by applying Weaver we're able to decrease the gap with o3-mini, which at the time represented the frontier language model performance for these four tasks that we examined. Afterwards, we also studied how Weaver can benefit from adding additional verifiers. And so the crucial thing is that we want to add verifiers in a way such that adding additional verifiers continues to benefit performance, right? We don't add verifiers if they cause some sort of negative signal or some sort of adversarial signal. And so we ordered the verifiers in order of decreasing accuracy. And what we found was, as we were adding verifiers in this manner, there seemed to be a sweet spot in the number of verifiers for these different tasks. And usually that sweet spot was around four to six verifiers. And so we found that was kind of interesting.
Jon Saad-Falcon [00:18:25]: So if you wanted to deploy a cheaper version of Weaver off the shelf and just apply some ensemble of these different verifiers, it would make sense to just pick the three to six that are useful for your task and ensemble them out of the box to improve performance. Additionally, we wanted to study how Weaver can improve the accuracy-compute Pareto frontier. And so for this Pareto frontier, we have the inference compute we spend per query on the x-axis, and then we have the success rate, or the accuracy rate, on the y-axis. And what we found was Weaver can help us push out the accuracy curve rather substantially. Depending on the task, it can range between 10 and 15% in terms of overall improvement on these different tasks. And additionally, we found that by applying a distilled version of Weaver, which I'm about to talk about, we can push the Pareto frontier even further, further up and to the left, so becoming not only more accurate but also more Pareto-optimal and more efficient in how we apply inference compute. Weaver distillation is an application of SFT in which we take our ensemble of different scores from Weaver and then apply it, or treat it, as a set of labeled data for training a distilled verifier.
Jon Saad-Falcon [00:19:44]: This distilled verifier that we explored for the study can range anywhere from 400 million parameters to 4B parameters. But overall, it's very small compared to the generator model, compared to the, to the language model verifiers that we use for this study. It's much, much smaller, like 10x smaller. But despite the fact that this verifier is much, much smaller, it tends to be pretty robust, particularly for, like, the settings in which it was trained in. And so what we find is that when we apply this verifier towards these settings, we're able to preserve 97.4% of the accuracy gains, sometimes, sometimes even more, While only using 0.03% of the compute we would have originally used with the full Weaver ensemble. This is really powerful because it shows that you can not only distill Weaver down into a very small model, but you can also deploy this in an online setting. So you actually don't need to have the entire ensemble of verifiers, which can be expensive if you're running a bunch of different models concurrently or successively. You can instead deploy this verifier on the fly in a model that can actually fit on your phone.
Jon Saad-Falcon [00:20:52]: Like, it's very, very small. So that's super exciting to see. Looking forward, we're really excited to kind of complete the generation-verification loop. RL is a very hot topic today, and so we want to see how we could apply Weaver towards RL and use it as a source of signal for training and post-training different models. And we also want to see how we can explore different compute-aware optimizations. So how we can, on a more query-by-query basis, on a more fine-grained basis, explore the number of verifiers that we actually need to apply, such that we're not applying the same set of verifiers and the same generation budget per query. And finally, we want to see if there's any off-policy ways we can apply RL, so not just using the same models, but taking different verifiers and different generators and using them to train other models that aren't necessarily in the same model family.
Jon Saad-Falcon [00:21:45]: Um, but yeah, thank you so much for, for coming to the presentation. If you have any more questions, please let me know and I'm happy to, happy to discuss.
Arthur Coleman [00:21:54]: Jon, one of the things I want to check: we have some time at the end to go over your appendices in the paper. Anything there that people should pay attention to specifically that you would point out to them?
Jon Saad-Falcon [00:22:07]: Yeah, let me, I can pull up the paper and cite a couple pages, cite a couple tables. Let's see. Yeah, I guess for those who are interested in particularly in like industry applications of the approach, we have some tables further down. Let's see. Actually, it might be a little higher up. Yeah, so yeah, we have some tables in which we explore training more like lightweight versions of Weaver. So for those familiar with statistical models, logistic regression tends to be a very powerful way of just aggregating these verifiers. So especially if you have a small labeled set at the scale of tens or hundreds of examples.
Jon Saad-Falcon [00:23:02]: It can be useful to train an XGBoost model or a logistic regression model off of these samples. It tends to be pretty powerful. So I'd recommend training that if you're looking at more industry settings for aggregating different verifiers. But that was the only thing I wanted to highlight. Yeah.
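As a hedged illustration of that lightweight aggregation, the scikit-learn sketch below fits a logistic regression over per-response verifier scores on a small labeled set and then uses its predicted probability as the combined verifier score; the shapes and synthetic data are placeholders, not the paper's setup.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# X: one row per (query, response) pair, one column per weak verifier's normalized score.
# y: 1 if the response was judged correct, 0 otherwise. Tens to hundreds of rows suffice.
X_labeled = rng.random((200, 6))                  # 200 labeled responses, 6 verifiers
y_labeled = (X_labeled.mean(axis=1) + 0.2 * rng.standard_normal(200) > 0.5).astype(int)

agg = LogisticRegression(max_iter=1000).fit(X_labeled, y_labeled)

# At inference: score every sampled response for a new query and keep the best one.
candidate_scores = rng.random((100, 6))           # 100 candidate responses for one query
combined = agg.predict_proba(candidate_scores)[:, 1]
print("best candidate index:", int(np.argmax(combined)))
```

The learned coefficients play the same role as the verifier weights discussed earlier, just estimated with plain supervised learning instead of weak supervision.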
Arthur Coleman [00:23:23]: Okay Anna, we'll turn it over to you to. I have some questions in there but I'll let you drive. I hope people will put their questions ahead of mine. Mine are the least important.
Anna Yoon [00:23:35]: Awesome. Thanks for the walkthrough, Jon, first of all. Yeah, quick question to the audience: how many of you are building your own LLM, either in the academic setting or in industry? We'd love to learn.
Arthur Coleman [00:23:48]: Yeah, use the thumbs up reaction to show us. We'll count those.
Charlene [00:23:56]: Okay.
Anna Yoon [00:23:56]: We have one, two folks already. That is already amazing. Honestly, that's 10% of the audience already. So as we mentioned before, there is this link that we shared. It's the first message in the chat thread, so feel free to drop your name along with your questions, and we'll select and highlight you.
Arthur Coleman [00:24:20]: The reason I asked that question is a lot of what Jon talks about is applicable to people who are building large language models. But Jon, what I want to do as we go forward, given the breakdown in the audience especially, is focus on people who are using large language models and how they do verification. Say they're doing RAG; how do they do verification in that case?
Jon Saad-Falcon [00:24:48]: Yeah, I guess, happy to elaborate on that. So for people who are applying language models towards RAG, or towards any sort of industry setting in which it's part of a more compound system, I think verification can be a really powerful tool to incorporate into these workflows. I've worked on RAG before and, coincidentally, also trained verifiers towards RAG. I had another paper called ARES; I'm happy to highlight that as well. So I'd say adding a verification layer can be a really powerful tool in any of these systems.
Jon Saad-Falcon [00:25:25]: I'd say the main things that I'd recommend for building these systems are being very selective in how you pick the reward models and the LM judges, and being very prescriptive in the way that you actually build the rubrics, the grading rubrics, for determining whether to keep a generation or whether to keep the model looping on a given problem. Just adding very prescriptive guardrails or unit tests in this direction. Verification and evaluation is one of the themes of my PhD. I've done a few different papers on this, so happy to share many of those materials if it's useful. There's ARES, the one I mentioned, and then another paper called LMUnit.
Anna Yoon [00:26:07]: Okay. Some of the other questions that we have in the doc are, first of all, how well does Weaver work as a post-training signal, specifically RL and SFT?
Jon Saad-Falcon [00:26:18]: Yeah, so Weaver works pretty well for RL using just vanilla GRPO and DPO, which are kind of the standards right now for RL. It tends to improve performance of the generator models anywhere between 10 and 20%, which is really exciting to see considering the fact it doesn't use any annotations. We're excited to keep exploring it for newer generations of models, just because it tends to be easier to train these newer generations of models. But we have a new paper coming out soon focused specifically on this direction. So excited to share it.
Anna Yoon [00:26:53]: Very exciting. One person from the audience, Bauke. Correct me if I'm saying your name wrong, but yeah, the mic is yours. You can go ahead and ask your question to Jon.
Arthur Coleman [00:27:04]: Don't forget to turn on your camera if you can.
Bauke Brenninkmeijer [00:27:07]: I think I'm here. Right.
Arthur Coleman [00:27:10]: How are you doing, man? Good to see you again.
Bauke Brenninkmeijer [00:27:12]: Yeah, you too. It's been a while. To not waste everyone's time, my question is, I'm trying to just kind of play back to you how I'm understanding how to apply this in a practical setting. So yeah, see if I get it right and correct me where I'm wrong. So the understanding I have is: you label with this weak verifier aggregate to essentially get the initial predictions, you label some with a human, you can then train a logistic regression, as you just mentioned, to kind of assess the weighting that you would need between those verifiers. You can then extend that and predict more data with that weighted aggregate to get a larger data set, and then you can distill a smaller model on those results.
Bauke Brenninkmeijer [00:27:56]: Are those roughly the steps that you would recommend if you want to use this?
Jon Saad-Falcon [00:28:03]: Yeah, no, I think you summarized it well. Yeah. And then the main levers you kind of like pull towards making this even better is just like either more models in terms of like more verifiers you use or just like more like annotated samples.
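To sketch the last step in that recipe, distilling the aggregate into one small model, here is one hedged way to do it with Hugging Face Transformers: pseudo-label (query, response) pairs with the ensemble's combined score and fine-tune a small cross-encoder to regress onto that score. The model name, example data, and hyperparameters are illustrative placeholders, not the paper's exact setup.

```python
import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL = "distilbert-base-uncased"   # stand-in backbone; the talk mentions 400M-4B verifiers
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=1)

# Pseudo-labeled pairs: the label is the combined ensemble score in [0, 1] (illustrative values).
pairs = [
    ("What is the capital of France?", "Paris.", 0.97),
    ("What is the capital of France?", "Lyon.", 0.04),
    # ... thousands of ensemble-scored (query, response) pairs in practice
]

def collate(batch):
    queries, responses, scores = zip(*batch)
    enc = tok(list(queries), list(responses), truncation=True, padding=True,
              max_length=512, return_tensors="pt")
    enc["labels"] = torch.tensor(scores, dtype=torch.float32)
    return enc

loader = DataLoader(pairs, batch_size=8, shuffle=True, collate_fn=collate)
opt = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
for epoch in range(1):
    for batch in loader:
        labels = batch.pop("labels")
        logits = model(**batch).logits.squeeze(-1)            # one regression logit per pair
        loss = torch.nn.functional.mse_loss(logits, labels)   # regress onto the ensemble score
        opt.zero_grad()
        loss.backward()
        opt.step()
```

At inference the distilled model scores each candidate response on its own, so the full verifier ensemble never has to run in the serving path.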
Bauke Brenninkmeijer [00:28:17]: Yeah. Okay, perfect. Thanks. That was it.
Anna Yoon [00:28:24]: Awesome. Someone else in the audience just shared a question, or I guess an ask, to recommend some papers from you, Jon. Yeah, they didn't tag their name, so I don't know how to identify the person, but the question basically goes: I'd love to learn more about applying verification to RAG applications, just what we talked about. Could you recommend some papers?
Jon Saad-Falcon [00:28:50]: Yeah, let me move from this screen. So, a little bit of self-promotion, but I'd recommend this paper. This is towards automated evaluations of RAG systems. It's called ARES. I can put it in the... actually.
Arthur Coleman [00:29:06]: Yeah, please do.
Jon Saad-Falcon [00:29:07]: Yeah, let's see. Yeah, so there we go. Yeah, so there's this one. I'd recommend a couple other ones that I like. See. Sorry, let me, let me stop sharing for a sec. Just. I can pull it up.
Jon Saad-Falcon [00:29:30]: This is another one from our lab, also focused on this direction of RAG evaluation and LM evaluation, and then a couple others that I like from other labs. Let's see. Yeah, so there's Arize AI.
Arthur Coleman [00:30:09]: While you're doing that, Jon, let me follow up on Bauke's question. I keep going because this is set up here. So many of our attendees are small-company startups. We're seeing more and more of those.
Jon Saad-Falcon [00:30:25]: Two people, three people.
Arthur Coleman [00:30:27]: It has to do with the way software development is about to happen in the future. Now I have to build an MVP, right? I'm sitting here, you know, starting up my company, and let's say I have a game-like application. This is a real case for me. I'm not going to tell you what my business is, but it's very real, and it's a consumer application, and I have to build the prototype of the game. The game is well known. If you go out on YouTube or TikTok, you'll see people who play the game. But I don't have a huge data set.
Arthur Coleman [00:31:02]: This is not something that someone actually puts a data set together for that I can go and just download and use. All I have is my ability to go play the game with my friends or to do it with a group of people, etc. But if I don't get it right early on, with limited verification, because if the game is not fun, if it doesn't work the way it would if people were playing it, the business will fail. I'll never get to the scale to train it. How would you go about, to Bauke's question, the very specific steps that I would use? And it's not just my problem. I'm giving a problem that I'm facing, but it's really a generic problem for anyone doing an MVP with an AI built in, with limited data.
Arthur Coleman [00:31:46]: How would you recommend we go about that process?
Jon Saad-Falcon [00:31:50]: Yeah, so I guess could you elaborate on the question a little more? So is it the idea of like, like how would you build the verification system from scratch in the setting or.
Arthur Coleman [00:32:00]: Is it like how would I build it from scratch? Exactly, yeah.
Jon Saad-Falcon [00:32:04]: So the way I would approach it is, I think I would start with just seeing what are the failure modes of the system, just the most naive, first-approximation version of the system. And I'd very heavily emphasize just looking through the data and looking through the traces to basically see what those failure modes are. There's been many projects where I've just kind of ignored looking at the data or looking at the traces for too long. And then when you actually look through these traces, you kind of see very obviously what's going on. So basically being able to quantify and kind of qualify these different error cases is really important. Once you have these different error cases in mind, then what I'd recommend is seeing what verifiers make the most sense for them. So for example, if there's certain hallucinations or certain kinds of just making up of information, then I'd recommend looking towards better retrieval systems or better verifiers for cross-checking between references and those responses.
Jon Saad-Falcon [00:33:07]: And so something like an LLM judge would make more sense in that case, versus something where it's like, oh, it's messing up with mathematics or it's messing up with reasoning, and so in that case a reward model would make more sense. Once you have that intuition, then it would make sense to scale up the actual verifiers that you use for those cases, and then say, like, five or.
Arthur Coleman [00:33:27]: Seven to five or seven.
Jon Saad-Falcon [00:33:29]: Yeah, anywhere. I'd say anywhere between like, like it realistically probably like three to five. Actually like once you have those, then it should be enough. If you want to like really go, then yeah, you could do like, you could do even more. You could do seven or ten. But yeah.
Anna Yoon [00:33:45]: Well, we got a new question in our doc. The question goes, what kind of scoring is used? Binary scale. What about subjective answers?
Jon Saad-Falcon [00:33:58]: So yeah, this is like a float score. So these reward models give a score in the range of like 0 to 1 once we normalize it. And so we use those float scores to basically help us select for the LM judges. It's binary as you said. So it's like a single, like true or false.
Anna Yoon [00:34:19]: Perfect. Another one here. How well does this generalize to newer reasoning models?
Jon Saad-Falcon [00:34:28]: This tends to work pretty well on the newer reasoning models. What's nice is this approach is very agnostic in terms of what's the actual generator and what's the actual verifier that's used. And so it tends to perform pretty well. From more recent experiments in which we've tried Kimi and GW models, it tends to increase performance by 10 to 20% as well. Yeah, that's great.
Anna Yoon [00:34:51]: I think someone else is typing on your question, but in the meantime, I also have a personal question. So I see a great use case for Weaver to be used in the offline evaluation stage. I do a lot of work in the online experimentation well as well. How well do you think the same concept can be delineated to that later stage?
Arthur Coleman [00:35:14]: Question. Excellent question.
Jon Saad-Falcon [00:35:15]: Got it. Could you say it one more time? Sorry.
Anna Yoon [00:35:17]: How.
Bauke Brenninkmeijer [00:35:18]: How?
Anna Yoon [00:35:18]: Well, like, how do you see Weaver being applicable to the online experimentation? So post offline ufiles stage.
Bauke Brenninkmeijer [00:35:27]: Got it.
Jon Saad-Falcon [00:35:27]: Got it. That's. That's a great question. Yeah. So in the online settings in which like, you're actually like applying it towards these applications, I'd say the Weaver distilled version is probably the most useful for these cases because it can be run very cheaply on the fly. I think it's especially useful once you know what your sources of signal are to then just distill it down into a smaller model because then you can just deploy it on the fly. To put this in perspective, a 400 million parameter model has a latency in the tens of milliseconds. And it's also very possible to run it on a very small GPU or deployment setting.
Jon Saad-Falcon [00:36:07]: So it's very, very expedient in terms of like its, its ability to be applied. Yeah.
Anna Yoon [00:36:13]: Awesome. That's great to hear. I'll look into that. Bao K had another question. Oh, thanks for I guess, moving it over from the chat for us. But yeah, the. The mic is over to you.
Charlene [00:36:25]: Yeah.
Bauke Brenninkmeijer [00:36:26]: So if Stephen wants to take over, it is his question, so I'm happy to give him the mic in my.
Anna Yoon [00:36:38]: Steven. Okay. I guess I can ask and repeat the question on behalf of Steven, but Stephen was asking the repo arise AI that you shared, are they using the verifier approach anywhere in the repo?
Jon Saad-Falcon [00:36:58]: Yeah, yeah. So they're using Aries and then like a couple like rag judges, like off the shelf. So I'd check it out. If you're just Trying to get familiar with different verifiers that you could use. But I think they implemented ares. So one of our papers. Perfect.
Anna Yoon [00:37:16]: Thanks for the answer. Another one here was the same prompt used for evaluation for all the weak verifiers.
Jon Saad-Falcon [00:37:24]: Yes. It was always the same prompt for all of the different reward models and also the same prompt for all the different LM judges. We did a couple ablation studies in which we tried different. Basically using the same model, but then just using different prompts to see if they could squeeze out more value. And it didn't tend to help that much for the reward models or the judges. It seemed like once you had a good prompt that it was more useful to get other models with different weights, different value, and different biases versus just using the same model repeatedly.
Anna Yoon [00:37:59]: Perfect. I love to see a stream of questions coming in in the doc. And Mina. Yeah. Feel free to ask your question to John.
Amina [00:38:10]: Okay.
Charlene [00:38:11]: Wow.
Bauke Brenninkmeijer [00:38:11]: Hi.
Amina [00:38:13]: Hello. Nice to meet you. It's great to be here. I have a simple question of. I mean, I'm pretty new to this idea in general of using a bird, but right now one of the tasks I'm doing is to create synthetic data to train a text classification. And in that case, there's not really a one truth, but we just want to make sure the synthetic data actually fits all the criteria that we want, whether for this classification task, whether it falls into both category A and B, but not in ca, C and D. But we want a lot of variation of that. So I was kind of curious whether you'll be able to capture that, and if not, what are the tweaks that I need to make to take the most advantage of it.
Amina [00:39:02]: Thank you.
Jon Saad-Falcon [00:39:03]: Yeah. For your setting, I think it could be really useful to think about what's the synthetic approach for generating the data. I'm not sure which setting you're looking at, but one of the big values of LMS today is you can use it very cheaply, Scale a bunch of synthetic data, and then train classifiers on it. Usually very simple classifiers off that synthetic data. That's the first thing I'd recommend. Additionally, I'd also just recommend trying to take Weaver off the shelf, taking some reward models off the shelf, taking some elemjudges off the shelf and just seeing how useful they are for your task and just being very clear about quantifying what works and what isn't, just seeing how your macro F1 looks, seeing how your accuracy looks, seeing which classes are performing better or worse. Um, and yeah, just like setting up Your evaluation harness, like very precisely, because that'll give you the most signal in terms of like, if things are getting better, things are getting worse.
Anna Yoon [00:40:01]: Awesome. Thank you. One more question from my side. Actually I care a lot about latency because I deploy my features and they are live. So typically what is like the compute time ratio between the generation step and verification step?
Jon Saad-Falcon [00:40:19]: Yeah, so for the generation step it's usually the most intensive one just because like for our study we did like 10 to 100 generations. But you can use less, it just depends what you want to do. The latency tends to be at the scale of like seconds. So it'd be like a dozen or two seconds depending on the size of the model. Once you have those generations though, the verifier can like then go through it very quickly, particularly if it's like the distilled one. The distilled one could like verify over them in like a couple seconds. I understand that can be difficult for like user facing settings. So that's why I'd recommend like either less verifiers or like less generations or just the distilled verifier.
Arthur Coleman [00:41:00]: Anna, just so I understand, you're using the verification in line in real time. So there's a generation of an answer you verify and if it looks like it's hallucination, you go back. Exactly. How are you using verification in that case?
Anna Yoon [00:41:15]: Gotcha.
Arthur Coleman [00:41:16]: No, that's a question to you. How are you using is it in real time? Is that what I'm hearing you say?
Anna Yoon [00:41:21]: Yeah, that's what I'm considering now because we're building this chatbot and I mean even generation stuff, it's just a one round trip between us and then some of the AI like providers, API, like OpenAI ChatGPT and honestly like hallucinations and just really getting the response quality up there is a hard work. And so that's why I'm exploring a lot of these different methods that exist in the literature and seeing like how applicable they can actually be for this life setting.
Arthur Coleman [00:41:55]: So what do you do? I mean this is fascinating. You just taught me something. Thank you for that. But now I'm going to pose back to you how then if the answer comes back and your verifier says, nope, this is a bad answer, what do you do then?
Anna Yoon [00:42:09]: Then I guess you should refire the query but then have that information in the context.
Arthur Coleman [00:42:15]: Interesting, thank you.
Anna Yoon [00:42:17]: That is our current setup at least, but we were also thinking about maybe spin up a standalone verifier agent or something that takes a final pass before showing the final message to the user. But our biggest concern is that latency. By adding that additional agentic step in our entire process, like how long does it delay for the user to get their response?
Arthur Coleman [00:42:49]: Interesting.
Anna Yoon [00:42:53]: Let's check. Yeah, I guess with the remaining time. John, would you mind, like I have.
Arthur Coleman [00:42:59]: One question you didn't get to Anna, so if there's no one else I will ask it. This is on the cost side. We've been talking about verification accuracy, but I want to talk about cost because that's a big issue again for small companies. When you talk about how you're getting your cost down, I realize it's about CPU time, but if you're implement, tell me how you've implemented your verification mechanic to maximize those savings. You talk about doing it, but you don't tell me how you do it.
Jon Saad-Falcon [00:43:32]: Yeah, so at least for our paper we wanted to just like study flops because in general like it's just more empirical to measure like flops versus cost because they can vary by company. But it should like be like a direct translation towards like actual cost. The way that we wanted to, to limit cost both in terms of flops and money was to just try to drive down the actual utilization of like closed source LMS and just, yeah, closed source alarms and just more expensive forms of compute in general. So that's why we did like for example, like the distillation approach. We wanted to see like, you know, how much we can distill down the utilization of all of these different reward models and LM judges into a much more like simpler single, single model versus like an ensemble of models. It's also why we used open source LMS versus like closed source lms. So these studies would have been at the scale of like tens of thousands of dollars, maybe even like more if we'd been doing this all with like closed source LMS. But because we were doing it with like Llama 70B which could run on like consumer GPUs, it drove down our cost to only like a few thousand dollars.
Jon Saad-Falcon [00:44:37]: Which is kind of exciting to see that like is really useful for, for like practitioners because it just means you can run these, these open source models which are much cheaper, much more like controllable, much more customizable and use them either on a company's GPUs or on some of these near cloud and inference providers. I think the difference between running llama 70b and running GPT 5 is 5x. I need to see what the latest prices are. But yeah, so to Answer your question. This would drive down total inference costs in the order of magnitude of 2x to 5x. I haven't done the math, but that's the back of the envelope calculation.
Arthur Coleman [00:45:28]: Back to you, Ana.
Anna Yoon [00:45:32]: Yeah, I guess until the next question comes through. I mean Q and A is my favorite part of this type of session. But until the next question, do you think we can go through some of the highlights in the appendix that you mentioned before?
Jon Saad-Falcon [00:45:51]: Sure. There isn't too much that I think would be like exciting for practitioners but I can highlight some of the more interesting ones. So for this, for the appendix we tried to do a bunch of different studies in which we wanted to basically push the limits of what was possible with Weaver. And so first thing we wanted to do was just highlight the scalability of the approach. And so we have some nice graphs which we show just like how the approach scales across a variety of different generation budgets. What we find is the budget tends to increase performance rather substantially and continues to benefit performance as you go beyond like 1000 samples. Excitingly also if you were to push out the budget you can also see these like these scaling laws for performance. So being able to actually like predict how like spending more compute would continue to benefit performance.
Jon Saad-Falcon [00:46:50]: What we find is we're able to like model it with like a Bernoulli fit. And so it's interesting to see that it's actually like possible to model the efficacy of putting testnet compute in these settings. But there wasn't too much, too much else from the appendix to share.
Arthur Coleman [00:47:07]: Charlene actually Charlene, come on line and the this is the kind of thing we love to do here. Just like Anna was sharing her experience. Tell us what you're doing and, and what your findings have been. Again if you want to come off and go on video that'd be great.
Charlene [00:47:27]: Hey sure. I'm a ML engineer working in conversational AI and I guess yeah I think I'm just interested and I think we actually just using off the shelf LM at the moment but we are turning soon to I guess looking to maybe think about hosting our own. And I guess in doing that I'm just here to learn more about yeah how to make that reduce latency, increase accuracy, all this stuff and just try to learn from you guys doing a lot of the cutting edge research. So yeah, I think in the interim I guess the LLM is very expensive to host but I think for people doing rag I think it's in order to avoid all the expensive I guess engineering work to align the context from the retriever, the generator. I think this is a nice approach, especially for the synthetic generation.
Arthur Coleman [00:48:36]: Are you seeing savings, Charlene, in what you're trying to do? You said you're doing something similar asynchronously.
Charlene [00:48:46]: Sorry, that was the follow on about the online evaluators. So I think a lot of people. Yeah. In order to, I don't know, like, kind of capture the error state in the system, we use another LLM, smaller one to asynchronously evaluate the response. And then after we quantify the error, as John suggests before. Yeah, I think Anna mentioned that to do it before we send it, but it's too slow that way, so.
Bauke Brenninkmeijer [00:49:20]: Yeah.
Charlene [00:49:21]: Yeah, but it's expensive because you're still using the LLM.
Anna Yoon [00:49:27]: Yep. I guess it's always a trade off between quality and all the other guardrail metrics that you want to defend, like latency, your cost.
Arthur Coleman [00:49:47]: All right, well, if we have no other questions, John, do you have anything else you want to add? We, you know, I. We normally go right to the limit and we're pretty close, so. Oops, we're having people come in even now. Anything else you would want to share from your experience that you learned that's not in the paper that you would warn us about?
Jon Saad-Falcon [00:50:09]: Yeah, I'd say verification is often the bottleneck of a lot of LM applications. I feel like it's something that people forget to institute first. I'd highly recommend that for whatever AI systems you're building, you be very opinionated about what is the right evaluation framework and what's the right verification framework to make sure it has the right guardrails for what you're trying to do. And sometimes that's LM judges, and sometimes that's reward models. Other times it's like unit tests. Other times it's rubrics. So just seeing, like, what's the right fit for your AI application is like the most crucial thing. And doing that first rather than doing that last can really help you build applications that are actually, like, robust.
Jon Saad-Falcon [00:50:55]: I think people tend to get. Tend to obsess over the methodology for, like, for an AI application before they actually, like, set the guardrails at the evaluation. And so that can lead to a lot of annoying downstream effects. And so we just highly recommend. Yeah, just being very thoughtful about how you build these things now.
Arthur Coleman [00:51:17]: Well, if there are no other questions, then we'll let everyone go early, I guess. And John, want to thank you very much. This has been very insightful. Really appreciate you taking it in a. What do you call it pragmatic approach and focusing on, you know, business applications as compared to the bigger LLM creation applications. And the papers are in the chat. Let me remind everybody again, you're going to get a post, post event survey. Please fill it out so that we can, you know, improve our performance for you as a, as a group.
Arthur Coleman [00:51:59]: And with that, I'll let everybody go. Be well.
Jon Saad-Falcon [00:52:03]: Thank you so much. It's been a pleasure. Thank you. Thanks Anna and Arthur as well.
Bauke Brenninkmeijer [00:52:12]: And thanks everybody for joining. See you next time.

