Making AI Reliable is the Greatest Challenge of the 2020s
SPEAKERS

Alon is a product leader with a fintech and adtech background, ex-Google, ex-Microsoft. He co-founded and sold a software company to Thomson Reuters for $30M and grew an AI consulting practice from zero to over $1B in four years. A 20-year AI veteran, he has won three medals in model-building competitions. In a prior life, he was a top-performing hedge fund portfolio manager.
Alon lives near NYC with his wife and two daughters. He is an avid reader, runner, and tennis player, an amateur piano player, and a retired chess player.

At the moment, Demetrios is immersing himself in machine learning by interviewing experts from around the world in the weekly MLOps.community meetups. Demetrios is constantly learning and engaging in new activities to get uncomfortable and learn from his mistakes. He tries to bring creativity into every aspect of his life, whether that is analyzing the best paths forward, overcoming obstacles, or building Lego houses with his daughter.
SUMMARY
Demetrios talks with Alon Bochman, CEO of RagMetrics, about testing in machine learning systems. Alon stresses the value of empirical evaluation over influencer advice, highlights the need for evolving benchmarks, and shares how to effectively involve subject matter experts without technical barriers. They also discuss using LLMs as judges and measuring their alignment with human evaluators.
TRANSCRIPT
Alon Bochman [00:00:00]: So it's Alon Bochman. I'm the CEO of RagMetrics. Our website is RagMetrics AI, and I don't drink coffee, Demetrios. Sugar is my vice.
Demetrios [00:00:15]: We talk deeply about how to bring the subject matter experts into this eval life cycle, if you can call it that. We also talked deeply about evals, as you could guess, since we were on the topic. He is doing some great stuff at RagMetrics on LLM as a judge. Let's get into it. And a huge shout-out to the RagMetrics team for sponsoring this episode. Dude, I just want to jump right into it because you said something when we were just talking before I hit record, and it was all around testing each piece of the entire system. And I am always fascinated by that because there are so many different pieces that you can be testing and you can be looking at. And I know that a lot of folks are trying to go about this.
Demetrios [00:01:15]: You had mentioned, Hey, do you want to try and use big chunks or smaller chunks? Do you want to try and use one agent or 10 agents? So there's all this complexity that comes into the field, and especially when we start having more and more agents involved. How are you looking at that? How do you see that, and how do you think about testing all of that?
Alon Bochman [00:01:41]: Yeah, absolutely. First, let me just talk about the approach a little bit, because I think a lot of engineers out there, you know, we're all new to this thing, and it's natural when you're new to something to just take somebody's word for it. So we read what the big labs have to say about how to prompt our models, or we read what the vector database vendors have to say about how to set up vectors, or there's maybe a YouTube influencer that has an idea. And if they're popular, if they're charismatic, then we listen to what they have to say, and that's all fine. We should listen to everybody.
Alon Bochman [00:02:26]: But you probably didn't get into engineering just to take other people's word for it. You probably got into it because you enjoyed building. You want to see it for yourself. And this is an amazing opportunity to do that. Because let me tell you, Demetrios, nobody knows what's going to work for your task. Definitely not me, but also definitely not OpenAI or Anthropic. They don't know. Weaviate doesn't know.
Alon Bochman [00:02:51]: Nobody knows; we're inventing this as we go. And that is awesome, because if you get into the mindset of "I'm going to follow what the data tells me," then you will come up with some amazing solutions. And that is how every magical product that I know of gets built. I haven't spoken with all the teams that are building magical products, but every time there's a magical product, there's a data-driven process behind it where the team building it is not taking anybody's word for it. They're actually trying out the different models, trying out different prompts, trying out different configurations of an agent. And sure, you can't try everything, and you can't just shut the world out. You have to learn, you have to build on the shoulders of giants, but you have to build.
Alon Bochman [00:03:44]: And it's an awesome privilege to build. It would be super boring if we had it all figured out; we'd probably all move on to some other place. So this is the good part.
Demetrios [00:03:56]: Yeah. There would be a lot less hype influencers if it was all figured out already.
Alon Bochman [00:04:00]: That's right. You know, let's take a typical RAG application that you're building. If you are following this empirical philosophy, you'd probably set up something that is pretty common, right? Maybe you'd set up the easiest one. You take the most popular model, take the most popular vector database, maybe you have one prompt, one task. And then you think through what you want out of that step in the process. What outcome are you looking for, regardless of what technology you use? Think about what it means to generate value from this. Who are your users? What are your requirements? And then, in my opinion, you set up an evaluation. All an evaluation is, is a couple of examples: if a user asks me this, this is what I want out, and this is what I don't want out. And it doesn't have to be a massive amount; it could be 30 or 40 of them. It's totally possible to write out an eval in a couple of hours.
Alon Bochman [00:05:07]: It's not a major project, but it is a major journey of self-discovery, because you're thinking through what you want. It's a way of taking what you want from your head and putting it on a piece of paper. It's not unlike writing a requirements document, but it's requirements for AI. And then you have a benchmark, and you can actually benchmark. Different models can give you really different results. You'd be surprised. And OpenAI and Anthropic and Mistral have no clue which of their models is going to be the best one for your task.
Alon Bochman [00:05:44]: I assure you they don't know. Nobody will know until you run this benchmark on your task. And the same thing goes for all the other pieces of the pipeline.
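As a concrete illustration of the workflow Alon describes here (write down a few dozen input/expected pairs, then benchmark candidate models against them), a minimal sketch might look like the following. The model names, eval cases, and keyword-based grader are illustrative assumptions, not RagMetrics' actual method; it assumes the `openai` Python client.

```python
# Minimal sketch: a hand-written eval set and a loop that benchmarks candidate models.
# Model names and cases are illustrative; the keyword check is a deliberately crude
# grader standing in for whatever scoring you actually trust.
from openai import OpenAI

client = OpenAI()

# A few hand-written cases: what a user asks, and what a good answer must contain.
EVAL_SET = [
    {"input": "What's our refund window for annual plans?", "must_contain": ["30 days"]},
    {"input": "Summarize the Q3 earnings release in two sentences.", "must_contain": ["revenue", "EPS"]},
    # ... grow this to 30-40 cases; it only takes a couple of hours
]

CANDIDATE_MODELS = ["gpt-4o-mini", "gpt-4o"]  # illustrative

def answer(model: str, question: str) -> str:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": question}],
    )
    return resp.choices[0].message.content or ""

for model in CANDIDATE_MODELS:
    passed = sum(
        all(k.lower() in answer(model, case["input"]).lower() for k in case["must_contain"])
        for case in EVAL_SET
    )
    print(f"{model}: {passed}/{len(EVAL_SET)} cases passed")
```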
Demetrios [00:05:55]: Yeah, the funny piece is you have that trial and error, and you have to test it out and see, in a way, how few examples you can give to get a bit of a signal and at least know directionally that we're going the right way. But then later, to get that last 10%, it is just mayhem. And that's why I think it's fascinating, tracking all of the changes across the whole life cycle. It reminds me a little bit of Jupyter notebooks, and how you have final.ipynb, right, whatever. And here, in a way, that exploration is happening just like the data exploration you were doing back in the day. So now you're trying to track it, and you want to know each little thing that's changed, and you want to see, against the eval set, how are we doing here?
Alon Bochman [00:07:05]: Right. I think it's not unlike building code, right? You just start with a happy path, probably. Maybe you're starting with a demo, and it's famously easy to make a demo with AI, right? Because it says some crazy stuff. You try enough things, it's going to say something amazing. And if you just show that, and don't show any of the other stuff, you're going to look fantastic. It's not hard to do. So you make a demo, maybe you get a little bit of buy-in, a little bit of support, maybe some resources. And then you have the next step, which some people call hill climbing, but it's just, instead of one happy path, maybe you come up with 10 or 20 happy paths. So whatever it is that your chatbot or copilot or AI step is supposed to handle, just think of a couple of different examples.
Alon Bochman [00:07:53]: So it's not just one path. And that's usually also fairly easy. It's not as easy as the demo, but it's really not hard. You're just thinking through a couple of different examples. And then, like you said, it gets bigger. Maybe you have a second demo where you get a bit more buy-in. Eventually you let users in. And when you let users in.
Alon Bochman [00:08:15]: It's going to hit the fan. It's going to hit the fan in the most amazing, beautiful way, because they will put stuff in there that you just would not imagine in a million years. You know the story: when testing software became a popular thing, there was a guy who said, I don't need automated testing tools. He would just bang his fingers on the keyboard, like seven keys at once, and it would break a lot of software that way. I'm talking about 20, 30 years ago. That was a really good way to test software. You just bang the keyboard.
Alon Bochman [00:08:46]: Well, that's what users do. They will really bang up your bot in a way that you just never imagined. And it gives you the scope to make it a lot better. And like you said, the challenge is that it's very easy for an engineer to focus on the edge case they're fixing and break a whole bunch of other ones, because you don't know that you broke them.
Alon Bochman [00:09:13]: You just changed the prompt. The prompt used to work for 20 cases. You're focused on case number 21. You're working really, really hard on it. It's not easy to shoehorn AI into doing what you want on case 21, and you finally get case 21 working. Yay. Oh, my God. Cases one through five just broke.
Alon Bochman [00:09:30]: You have no idea until the next user comes in and tries them. So, to me, the solution for that is to build up that eval. Every time you run an eval, don't just run it on case number 21; run it on all the cases. Then if you broke something, you will at least know right away. Knowing the problem is the first step to fixing the problem.
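One hedged way to wire this "run every case on every change" habit into an existing workflow is to treat the eval set as a test suite. The `answer()` helper and cases below are hypothetical placeholders for your own pipeline:

```python
# Sketch: every prompt or config change re-runs ALL eval cases, so fixing case 21
# can't silently break cases 1-5. Assumes pytest; `answer` is a hypothetical wrapper
# around your chatbot / RAG pipeline.
import pytest
from myapp import answer  # hypothetical entry point to your application

EVAL_SET = [
    ("What's our refund window?", "30 days"),
    ("Which HR manual version applies?", "2024"),
    # ... every existing case, including the newest one
]

@pytest.mark.parametrize("question,expected_fragment", EVAL_SET)
def test_eval_case(question, expected_fragment):
    # A broken case fails the suite immediately, before the next user finds it.
    assert expected_fragment.lower() in answer(question).lower()
```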
Demetrios [00:09:56]: How many evals are enough?
Alon Bochman [00:10:02]: That is a bit like asking how long is a piece of string. It depends on what you're building. If you're building an entertainment application, a poetry generator, probably fewer than if you're building, I don't know, a government service. It depends on the stakes. It depends on the breadth of functionality. In general, maybe the short answer is that it's a dialogue with your users, right? You want as few evals as necessary to keep your users happy. And the longer the thing runs, the more feedback you get from users. So it tends to expand.
Alon Bochman [00:10:47]: It's like scope creep, right? Users ask for more stuff, so the functionality expands, and as you add functions, you add evals. But another way to think about it, in my opinion (sorry, long answer for a short question): let's say you're comfortable with automated testing for regular software. Let's park AI for the moment. And somebody asks you, how many tests do you need in your test harness? You'd probably say, well, I have some intuition, because I have some idea about how complicated the code is and how many different code paths it needs to go through. But I'm also not so literal, and not such a novice, that I have to get 100% coverage.
Alon Bochman [00:11:36]: It's not about checking the box. Some paths are more important than others. And so as a good software engineer, I'm going to make the trade-off. I have an idea of what's most likely to break, what the critical path is that's right on the edge of working, where if I mess up a little bit, it's going to break. There are kind of like veins going through the body. I can trace them because I built the thing, and if I can only write 3, 4, 5, 10 tests, I know which critical cases I need to cover.
Alon Bochman [00:12:12]: I think it's the same kind of intuition when you're building evals for an AI system.
Demetrios [00:12:16]: Until you give it to the users and they teach you.
Alon Bochman [00:12:21]: That's right. And they realize, oh no, what if I ask it to play Chopsticks?
Demetrios [00:12:28]: Well, I guess another way of phrasing the question is, can you ever have too many evals?
Alon Bochman [00:12:38]: I think you can. First of all, evals can be redundant in the same way that automated tests of non-AI systems can be redundant. You can have a false sense of security if you've got a thousand tests but they're all really testing the same ten paths through your code. You're spending a lot of extra time that you don't need to in order to get through your test harness, and that high count should not make you extra confident, because you're not actually testing breadth. In the same way, you can have redundant evals that are not going deep enough and testing the edge cases that your users care about. So that's one way of having too many. Another way of thinking about it:
Alon Bochman [00:13:23]: Evals are, in my opinion, more interesting than regular tests. In the same way that you use RAG because you can't jam everything into the prompt, right? If we could jam everything into the prompt, we wouldn't need RAG, but it's expensive, and there's needle-in-a-haystack and all that stuff, so we use RAG to expand the memory or the mind of the model. In the same way, I think evals expand the knowledge base or the area of responsibility of the model. And just like with any kind of learning process, you want a healthy intake that takes in new cases, and you also want to be able to forget things. You need a cleanup. Maybe some evals are no longer relevant. The facts have changed, the law has changed, your users have changed, the application has changed. Right.
Alon Bochman [00:14:14]: It's a new year. Crazy things are happening. Some things that we believed are no longer true. We need to update our view of the world. So to the extent that evals are a shorthand for your view of the world, you need to have a healthy digestion system. Stuff comes in, stuff goes out.
Demetrios [00:14:34]: Is that just every quarter or year or half year, you go through the evals and check them?
Alon Bochman [00:14:43]: So you definitely need a cleanup stage, and the frequency of that stage depends on how fast-moving your industry is. If you're doing constitutional law, it's probably going to be really slow-moving. If you're doing, I don't know, news, then yeah, you're going to need to clean it up a little faster. You don't want to evaluate current events based on stuff that happened 10 years ago unnecessarily. So it's basically an update speed, and it depends on how volatile and fresh your knowledge needs to be, in my opinion.
Demetrios [00:15:25]: I like that visual too, of the trash collector, the garbage collector, that comes through. You also need to remember to have the trash collector for your evals. Otherwise you could potentially be spending money where you don't need to.
Alon Bochman [00:15:39]: Here's another way to think about it. Say you're planning out a product and you want to communicate with the team that's going to build it for you, so you write a requirements document. There are many ways to go wrong when you're writing a requirements document. You could go too detailed. Like, if you write a requirements document that is super, super detailed.
Alon Bochman [00:16:03]: First of all, it's kind of obnoxious for the engineers working on it, because you're specifying things that they probably know better than you. If you empower them, they could probably make better decisions than if you just spell out how the code should be organized and stuff like that. You can also go the other way, but that's usually not the problem. Usually the problem is it's not detailed enough. And then there's a lot of misunderstanding: the engineer thought you meant one thing, you meant something else, you used shorthand, they misinterpreted it, and they build something that doesn't work the way the customer wants.
Alon Bochman [00:16:38]: That kind of error, the not-detailed-enough error, has been a lot more frequent in my career than the too-detailed error. And I think that when you're communicating with AI through evals, the overwhelming risk is not having enough. I think it is possible to have too many, but it's not an 80/20 split, it's more like 99/1. 99% of the teams that I know don't have enough evals.
Demetrios [00:17:07]: Let's talk about handcrafted evals versus the evals that you just ask an LLM to generate for you. How do you view those and the differences there?
Alon Bochman [00:17:19]: First of all, it's a matter of cost, right? So let's put that in the context of what stage of the project you're in. We talked about some of those stages earlier. In the beginning, you just have the one happy path. You've got a demo, and then you've got maybe a little plant: it's a couple of weeks old, it's got a couple of leaves, and then it becomes this whole tree and it gets bigger and bigger. So in the beginning, when you're just working on the happy path, you probably don't need a ton of evals, and it's probably fine.
Alon Bochman [00:17:53]: When you're starting your evals, it's probably fine to have them synthetically generated. You just want to move quickly. You don't want to wait for other people, and it's not so critical that each one of them be letter-perfect. You're just trying to grow the surface area quickly so that you get feedback faster and can adjust faster. It's a learning exercise.
Alon Bochman [00:18:15]: So the faster the better. Over time, as you get exposed to more users, synthetic data, in my opinion, loses some value, because it becomes more important to cover the edge cases, there are more of them, and you have access to user data. When you have access to user data, it's usually better than synthetic. Now, you could still get some help, in my opinion, just processing the user data into something useful. It's a slightly different thing; I don't know what the right label for it is. You're not asking a model to generate from scratch, which is what I think of as synthetic evals, but you could be asking the model to help you scan through this massive ocean of stuff that the model has said to people.
Alon Bochman [00:19:01]: And I want to. I want to boil it down into the ingredients. Like, you know, take the soup, and I want to get the ingredients out of the soup.
Demetrios [00:19:10]: Yeah.
Alon Bochman [00:19:10]: So give me the carrot, give me the celery, give me the Brussels sprout. What I mean is I want a few examples that are different from each other, orthogonal, and together they collectively give me the same flavor as the soup. AI can actually help you with that. It can help you boil down this massive log file into themes and groups and then reduce them so that you have a few examples to work with. But it's different than generating it synthetically.
Alon Bochman [00:19:43]: Yeah.
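A rough sketch of this "boil the soup down to its ingredients" step, using plain TF-IDF plus k-means for brevity (a proper embedding model would slot into the same place); the queries and cluster count are illustrative:

```python
# Cluster a pile of real user queries and keep one representative per cluster
# as a seed eval case.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin_min

user_queries = [
    "what's the formula for EPS",
    "formula for earnings per share?",
    "how do I pull trailing 12 month revenue",
    "pull last 5 quarters of sales into excel",
    # ... in practice, thousands of logged queries
]

n_clusters = 3  # tune to how many "ingredients" you want out of the soup
X = TfidfVectorizer().fit_transform(user_queries)
km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X)

# The query nearest each cluster centroid becomes that cluster's representative example.
closest, _ = pairwise_distances_argmin_min(km.cluster_centers_, X)
seed_evals = [user_queries[i] for i in closest]
print(seed_evals)
```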
Demetrios [00:19:43]: And probably also, I've never tried this, but it makes me think about it as you're saying it: recognizing where we have fewer evals, where we're a little skimpy on the evals, and asking, can you just help us augment that? Because we have a few users that have gone down this path, but we don't have a very robust eval set in that regard. So let's augment it with a few examples from the LLM.
Alon Bochman [00:20:14]: Yeah, absolutely. AI can be a sparring partner when it comes to growing your evals, exactly like you said. Maybe you noticed a couple of failures in a particular area your application is supposed to handle, but there have only been a few user inputs. You're anxious about it because you know how the thing is built, and you feel like there's going to be more, and I'm just not ready. And so AI could be a really nice, safe sparring partner that could elaborate on those examples, come up with harder ones, and set up this sort of dialectic: help me ask some questions that are going to be hard for me to answer.
Demetrios [00:20:57]: Well, it brings up this topic that we were also talking about before we hit record, which was how you can loop in the subject matter experts, especially the non-technical stakeholders, to craft better evals and create a much more robust system.
Alon Bochman [00:21:16]: Totally. So let me just set up my understanding of the status quo; this is how I think things work today. Most teams don't test. In my experience, they just don't test. So if they even just tune into this conversation and listen to two minutes of it, thank you. Just thank you for that.
Alon Bochman [00:21:36]: I appreciate it. Hopefully they get a ton of value from just setting up that first eval with 30 examples. Amazing. That would be a win. But the teams that do test get into these ruts, where on the one hand, maybe they generate the data synthetically. Usually the people that know the synthetic tools are the engineers. They don't have the domain expertise. And so if they show the synthetic data to the domain experts, the domain expert looks at it and it looks like a fifth grader did it.
Alon Bochman [00:22:06]: To them it looks like you don't even know what you're talking about. I can't understand what this is. You're asking me questions about the footer, and, like, what? And so it makes engagement hard for the domain experts. On the other hand, if an engineering team asks the domain experts, hey, can you help us with an eval, the answer is, what the heck's an eval? That is not the word they use. So that hurts engagement. If you are able to get an hour with them and you show them what an eval is, show them the spreadsheet, then it's a blank-page problem, because nobody wants to start with a blank page and the columns are a little unclear. It's very difficult to get domain experts to start writing these evals. So the way we think about it is that it's more about creating a learning loop, a feedback loop.
Alon Bochman [00:23:00]: And the feedback loop should be that you have an LLM judge that evaluates what the application does. That's the typical LLM judge. But then you also have a human being that potentially evaluates what the application does. And that human being can either be the user or the domain expert; it's up to them how much they engage. But it's really, really important to have the LLM judge scores alongside the human scores. So let's say the user asked X, the model said Y, and then the user said, Y is not correct, I actually want Y prime.
Alon Bochman [00:23:42]: And the domain expert said, Y is not correct, I actually want Y double prime. So it's really important to have Y prime and Y double prime on the same screen, so you can compare what the model said, what the user wanted, and what the domain expert wanted, and they can review each other. That's what makes the feedback loop really powerful, in my opinion. What does it mean to review each other? The domain expert sees what the LLM judge said and gives feedback to make the LLM judge better. Hey, LLM judge, you keep focusing on the footer, but really you should be focusing on what's between the header and the footer, because that's the part of the document that we lawyers care about. Or, hey, LLM judge, you keep picking on my spelling, but just ignore my spelling and focus on the math. Whatever is the right thing for your application.
Alon Bochman [00:24:36]: And so the feedback from the domain expert to the LLM judge can make the LLM judge a better stand-in for the domain expert the next time around. And that stand-in quality has a statistical definition: you can measure the correlation between LLM judge decisions and domain expert decisions. The higher that correlation goes, the happier everybody gets. The domain expert trusts the LLM judge more, the LLM judge is a lot more scalable, and good things happen. At the same time, domain experts need the LLM judge, because sometimes the LLM judge can notice things that as human beings we just don't have the bandwidth for. Maybe there is an error in the footer. Do you actually look at the footer? Maybe it's got the wrong year in there.
Alon Bochman [00:25:24]: I don't know, you have all these websites that say copyright 2022. It looks terrible, because nobody looks at the footer. So my point is, sometimes the LLM judge can notice things that as human beings, oh my God, there are not enough years until the heat death of the universe for us to look at all of that. So the LLM judge actually makes the domain expert better at their job too, because it notices that, and they say, you know what, yeah, you should keep that. So that feedback loop makes the LLM judge better and makes the domain expert better. And the fact that they review each other improves the chance that your domain expert will actually engage, because they're getting feedback now. Instead of a blank page, they give a thumbs up or thumbs down and explain why. And the net effect, in my opinion, is that this is really a way of pulling domain expertise into your application, and there are not a lot of other ways for it to get there.
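The "stand-in quality" Alon mentions can be measured quite simply once judge and human verdicts sit side by side. A minimal sketch, with made-up labels and scikit-learn's Cohen's kappa as one chance-corrected option:

```python
# Measure how well the LLM judge stands in for the domain expert on the same outputs.
from sklearn.metrics import cohen_kappa_score

# 1 = acceptable, 0 = not acceptable, recorded for the same set of application outputs.
llm_judge_verdicts     = [1, 1, 0, 1, 0, 1, 1, 0]
domain_expert_verdicts = [1, 0, 0, 1, 0, 1, 1, 1]

raw_agreement = sum(
    j == h for j, h in zip(llm_judge_verdicts, domain_expert_verdicts)
) / len(llm_judge_verdicts)
kappa = cohen_kappa_score(llm_judge_verdicts, domain_expert_verdicts)  # corrects for chance agreement

print(f"raw agreement: {raw_agreement:.0%}, Cohen's kappa: {kappa:.2f}")
```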
Alon Bochman [00:26:26]: If you think about it, all of these AI labs are solving for general intelligence. They want to make the best model. So they use general benchmarks like MMLU and theoretical math and physics. And when we're building stuff for customers, they couldn't care less about how well it does on theoretical math. It just doesn't matter. You need to know, for the last app you built, whether it's extracting the right bits from an earnings release, or whether it's being rude to a customer or not, or whether it's using the up-to-date HR manual or not. That's really what you care about. And there's no benchmark for that.
Alon Bochman [00:27:10]: You need to make it yourself. That's why it's your software, that's why you get to charge money, that's why people come to you. So when you're building one of those, the value you build over time is going to be proportional to how much domain expertise you can bring into your bot. And that domain expertise is going to come from your domain experts, your application, your knowledge, your inside-out view. And it's going to be different from mine. Even if we're in the same niche, we might both be writing financial parsers, but you've got a different view, a contrarian finance view, and that's why you have alpha. So when I'm building mine, I can't just copy yours.
Alon Bochman [00:27:57]: First of all, I don't know yours; it's private to you. Secondly, my customers are looking for something different. There's a reason why your company exists and why my company exists. We have different business philosophies; that's why we're different companies. And so it's a beautiful thing.
Demetrios [00:28:15]: So this spurs a whole load of questions. I want to start with: when the expert is having that exchange with the LLM judge, are you updating the prompt every time? How are you making sure that the judge is able to solidify that and do it every time?
Alon Bochman [00:28:37]: Yeah, there are a couple of different update mechanisms. Let me go through the easiest one. The easiest one is just fine-tuning. It's dead simple. Every time a domain expert reviews what an LLM judge said and says, actually, you're wrong about this part, it should be that part, you can construct a better LLM judge response by taking the domain expert's view and the original and combining them into an improved version: what I wish the LLM judge response would have been, now that I know what the domain expert thought about it. And that becomes an input-output pair. If you've collected even 50 to 100 of those input-output pairs, you can fine-tune the model you use for the LLM judge, and it gets better.
Alon Bochman [00:29:29]: It gets better and better and better. That's the simplest update mechanism. You can, of course, also update the prompt of the LLM judge. I think that's a little bit quicker than fine-tuning, but it can get brittle over time, because the bigger the prompt gets, the slower the judge gets, and eventually you end up with this sea of edge cases and everybody gets scared to update that prompt. So I think it's really good for the early stages when you just have a few use cases, but over time you probably want to switch to fine-tuning.
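A sketch of the simplest update loop described above: each expert correction becomes a training pair for the judge. The record fields are illustrative, and the JSONL chat format shown is the OpenAI-style fine-tuning format; adjust for whatever provider you actually fine-tune with:

```python
# Turn domain-expert corrections of the judge into fine-tuning pairs.
import json

corrections = [
    {
        "judge_input": "Question: ...\nAnswer to grade: ...\nCriteria: accuracy, tone",
        "improved_judgment": "Score 2/5. The body misstates the filing deadline; "
                             "footer issues are out of scope for this criterion.",
    },
    # ... collect roughly 50-100 of these before fine-tuning
]

with open("judge_finetune.jsonl", "w") as f:
    for c in corrections:
        record = {"messages": [
            {"role": "system", "content": "You are an evaluation judge for our application."},
            {"role": "user", "content": c["judge_input"]},
            {"role": "assistant", "content": c["improved_judgment"]},  # what we wish the judge had said
        ]}
        f.write(json.dumps(record) + "\n")
```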
Demetrios [00:30:06]: Fascinating. Now, the other question that came up right away was when you have different subject matter experts, how do you normalize all of the feedback?
Alon Bochman [00:30:19]: Oh, that's a great one. I'm so glad you asked, because I had this problem. I was head of AI for this company called FactSet. Let me just give you this example. FactSet sells financial data, like Bloomberg. You've heard of Bloomberg?
Demetrios [00:30:33]: Oh, yeah.
Alon Bochman [00:30:33]: Imagine you're a financial analyst. You're building an Excel spreadsheet, and it's a financial model of a company. You don't want to hard-code the sales and the income; you just want to pull in the last five quarters.
Demetrios [00:30:45]: Yep.
Alon Bochman [00:30:46]: And this way you can just hit refresh and it pulls in the latest numbers. So there's an Excel formula you'll put into some cell to pull in the revenue, and a different one to pull in the EPS, and a different one to pull in the expenses. And after building this Excel language for 40 years, you end up with half a million formulas. Half a million formulas. And absolutely nobody knows them. Nobody can even remember them. Nobody can remember what they're called. So then users start calling you and asking, what's the formula for EPS? The manual is giant and out of date.
Alon Bochman [00:31:19]: They can't keep up, and more and more users call. Something like two-thirds of the calls are from users just asking, what's the formula for X? It's a super annoying thing for you to answer, because it's really just a lookup, but there's a huge volume of users asking, because the manuals can't keep up with half a million formulas. And it's not so easy to answer, because there are nuances. If I ask you, what's the formula for cash, I might mean cash in the bank, or I might mean cash and marketable securities, like short-term debt. And it's arguable which one.
Alon Bochman [00:32:02]: I mean, cash is kind of a gray-area word. Some analysts interpret it one way, some interpret it the other way. So when we wanted to build a copilot for this, we were really, really excited about it. And the domain experts told us, you know nothing about finance, so how are you going to build a copilot? We're like, we got this. So we built it, and we very quickly got up to about 70% accuracy. And we were feeling like rock stars, Demetrios. Rock stars. And then.
Alon Bochman [00:32:34]: But the domain experts basically said, look, we're not going to let your bot talk to actual people, because it could mess up. We cannot have that. So we're going to have a human in the loop, and the human is going to look at what your bot says and either accept it or reject it, and you'll know. And they kept rejecting the cash question. We kept getting it wrong. And when we looked at the data, we realized it was like an edit war in Wikipedia, where you have two people who are not talking to each other, but they're each sure that they're right about what cash means, and they keep overriding each other's edits.
Alon Bochman [00:33:09]: So no matter how we update the prompt, it's never right, and the bit keeps flipping. Oh, cash means this. Oh, no, cash means that. My mother, my sister, my mother's sister. This is why we cannot get above 70% accuracy. And I realized that 70% is the threshold at which people agree with each other, and 30% is actually the gray area in the system. And this is an amazing opportunity, because the agents who were actually answering these questions for people didn't know that there was a 30% error.
Alon Bochman [00:33:47]: They thought it was all black and white. It's like an Excel formula; there's a right one and there's a wrong one. But it turns out that we had a unique ability to organize this knowledge base better than it was before the copilot started. So to come back to machine learning terms and the nitty-gritty, we did a cluster analysis where we embedded the questions and identified clusters of semantically similar questions that had opposite answers. And that's how we were able to find this. All of these people are asking: where's cash? What's the formula for cash? Tell me cash, I want cash.
Alon Bochman [00:34:29]: Or sometimes they don't use the word cash at all, but a close synonym for it. And all of those questions get grouped together purely through semantic clustering with the embedding model. And then we basically say, okay, the clusters that have consistent answers, I don't care about those. Those are good. Solved. Check. I want the clusters where it's a clusterfuck, right? The clusters where the answers are all different.
Alon Bochman [00:34:54]: The questions seem identical and the answers seem all over the place, right? How could that be? And the beautiful part is we could identify those automatically. We didn't know the right answer automatically, but we could identify disagreements between domain experts automatically. And this was hugely valuable, because they didn't have a process for finding those before. And then it's just a matter of getting the two sides engaged. How do you solve a Wikipedia edit war? You get the two people in a room, they start to talk to each other, hopefully one convinces the other, and then the edit war is settled.
Alon Bochman [00:35:32]: So we kind of did that. There's a little bit of plumbing, a little bit of organizational design, a little bit of ML programming. And the knowledge base got much, much better. We were able to push that gray area from 30 to 25 to 20 to 15 to 5 percent. And that was great not just for the humans dealing with all these user questions, because now they can give consistent answers, but also for the copilot, because now the copilot could draw on that cleaned-up knowledge base as well.
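A rough reconstruction of the clustering trick from this story: group semantically similar questions, then flag the clusters whose recorded answers disagree. The data, TF-IDF features, and "distinct answers" disagreement check are all simplifications of what a production system would use:

```python
# Find groups of near-identical questions whose recorded answers conflict.
from collections import defaultdict
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import AgglomerativeClustering

qa_log = [
    ("what's the formula for cash", "FF_CASH"),
    ("formula for cash?", "FF_CASH_ST"),   # a second expert answered differently
    ("cash formula please", "FF_CASH"),
    ("formula for EPS", "FF_EPS"),
    ("what's the EPS formula", "FF_EPS"),
]

questions = [q for q, _ in qa_log]
X = TfidfVectorizer().fit_transform(questions).toarray()
labels = AgglomerativeClustering(n_clusters=2).fit_predict(X)

answers_by_cluster = defaultdict(set)
for label, (_, answer) in zip(labels, qa_log):
    answers_by_cluster[label].add(answer)

# Clusters with more than one distinct answer are the gray areas worth escalating to experts.
for label, answers in answers_by_cluster.items():
    if len(answers) > 1:
        members = [q for q, lab in zip(questions, labels) if lab == label]
        print(f"disagreement in cluster {label}: answers={answers}, questions={members}")
```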
Demetrios [00:36:03]: But there are going to be scenarios where you do not have a black and white answer.
Alon Bochman [00:36:08]: That's right.
Demetrios [00:36:09]: And so for those, I've heard folks say to steer clear of them with your AI.
Alon Bochman [00:36:15]: It really depends. Look, I think AI is not going to be any more black and white than people are, right? We can't expect AI to give us black-and-white answers to questions that are just part of the human condition, where there's no right answer. Just because there's a new technology doesn't mean that suddenly there's a right answer where there wasn't one before. But we can generate a boatload of value if we go to the areas of human knowledge where most people agree and there's value in giving the answer that most people agree on. If it's very expensive to get to that answer today, because you need very experienced, expensive people to do it, and you're able to get 80% of the accuracy, the area of agreement, with 20% or 10% of the cost.
Alon Bochman [00:37:12]: That's a huge value. We don't have to all agree on what poetry we like. We can just focus on, whatever, extracting data from financials, all the use cases we know of for AI copilots. And you might ask, how does a company compete in this kind of area? How does an engineer compete, but also how does an organization compete? There's usually some secret sauce to every business, right? In the beginning that secret sauce is in the head of the founder, in the heads of the partners, the senior, let's say, 3 to 5% of the people. And if there's a way to scale that, to apply it, not to publish that secret sauce but to apply it at scale, so that the bot applies the same level of nuance and the same special sauce that the senior partner does, that's a huge value unlock for the organization. Like in my example, there were probably a couple of people who had a really good, nuanced understanding of cash.
Alon Bochman [00:38:21]: It's just that the value that was in their heads was not communicated to all the callers. So you basically hit the lottery: depending on whether you got somebody who thought it was one thing or the other, you wouldn't even know you were only getting part of the value. But as a company, we all wanted to figure out what the best answer is and give it to everybody, even if the best answer includes some options. Right.
Alon Bochman [00:38:45]: And this kind of process is a way not just to make the AI more accurate, but to help domain experts, like, kind of settle that value pyramid.
Demetrios [00:38:55]: Man. When you talk about the nuances in someone's head and then being able to translate that to reality, for lack of a better term, the thing that instantly came to my mind was, wow, I could do that with folks that I get documents from. And sometimes one thing that I do not like doing is getting on calls with folks. If it could have been an email or if it could have been a document, I would prefer for it to be a document first and then I'm happy to get on a call afterwards once we have established what it is that we are talking about, right? And what the problems are, et cetera, et cetera. What I find though is sometimes you can be working with someone and the document is amazing. And it's like, wow, this is incredible. There's so much here. It's very clear.
Demetrios [00:39:54]: It's really well put together, it flows, and you can just comment on it, and whatever you can't figure out in the comments, that's when you get on the call. Other times I'm like, holy crap, this document is a mess. And it's not a mess because I make a lot of documents that are just brain dumps and those are a mess; it's a mess because it is so verbose, or it is so AI-generated, that it hurts my eyes to even read it. And right there, it's not clear what this document is for, what we are even doing here, where the meat and potatoes of this whole document is. And so I was thinking, I wonder how I can start taking the documents that I really appreciate, using those as my eval set, and then saying, all right, anytime anyone brings me a document, just run it through an LLM and say, how can I make this better so that it aligns with these documents, or so that it's more in Demetrios's style, a hundred percent.
Alon Bochman [00:41:03]: You know, Demetrios, hats off, you built an amazing community. There is a special sauce to building a community. I don't know how to do it. So right now, in order to get that special sauce out there, you've got to put your own personal touch on it. And I can tell your style, because I read some of your LinkedIn posts and I've been to some of your conferences. It's distinctive. But you and your community would benefit a lot if, in addition to the times you're able to be there in person, you could figure out a way to scale your style to all the touch points, because you only have 24 hours a day and you've got to sleep like I do. So if there's a way to scale your style, your community would grow that much faster and get that much more value.
Alon Bochman [00:41:55]: It's not about replacing you. It's about taking the thing that's unique, the thing that made you guys grow, and making it accessible to people, even if they can't be at the show.
Demetrios [00:42:08]: Yeah. There's something that I wanted to mention before we move on, around the whole subject matter expert piece, because you had said this before and it's worth repeating: traditionally, when engineers are building SaaS products, it's okay if they are not subject matter experts and don't bring in the subject matter expert, because at the end of the day, the users will suffer through it. And if it kind of helps and it kind of gets the job done, that's great. It's useful. However, now, if you're trying to create a copilot and you're doing it without the subject matter expert, you're probably not going to get very far, because it's not going to be that valuable. So the crucial piece here, again, hammering on this, is how you can bring in that subject matter expert and make it easy for them, so they don't have the blank-page problem, they don't have to learn how to code, and they're not firing off Python scripts or having to really update prompts to make it better.
Demetrios [00:43:16]: So that is something that I really wanted to hit on, because it resonates with me a ton.
Alon Bochman [00:43:22]: Yeah, thanks, Demetrios. If you imagine you're building, I don't know, SAP, or you're building salesforce.com, maybe it's an inventory management system or an ERP or whatever it is, it's basically a couple of forms. And all you really have to do as the engineer is make sure that the data goes where it needs to go. It's pretty black and white. You just have to not lose the data, and the form has to render.
Alon Bochman [00:43:50]: It's pretty clear when you've messed up. And you will, on purpose, avoid any kind of situation where you need domain expertise. You'll avoid those because you're going to try to stay with the objective piece. And in the AI world, it's kind of flipped on its head, because the value is in delivering domain expertise more broadly, cheaply, scalably. So that's the goal. And whether that domain expertise is finance, legal, architecture, whatever it is, AI is reaching into all these places. That's really the new part. The new part is not being able to draw a GUI from scratch.
Alon Bochman [00:44:34]: Yeah, it's kind of exciting that AI can draw a GUI from scratch, but we know what it's like to draw GUIs. We've seen that before. The exciting part, the economic value, the new part, the unlock, is that you deliver legal expertise that used to cost 500 bucks an hour for five bucks an hour, or architectural expertise, or medical, or whatever it is. Stuff that people really, really need and that used to be accessible only to the few is now way cheaper and accessible to the many. That's the unlock. In order to get that unlock, you need a different skill set. That's our challenge. We're so new, and it's the same challenge with every new technology. We got here because we're into the AI, both you and me. We're into the gears.
Alon Bochman [00:45:21]: I love to see those gears going. I just want to jam my hand in there and fix it. I love playing with it. I love getting my hands dirty. But the value from those gears is domain expertise that we don't really have. So in a sense, the most successful features, the most successful products, are going to be the ones where we can make the gears invisible. Everything we're doing will be invisible, and the user will just have the experience of talking to an experienced lawyer, or an experienced engineer, or an experienced podcaster.
Alon Bochman [00:45:58]: Right, or community builder. That's the magic. And in order to get that magic, we have to get these domain experts right in there with us. And my humble suggestion is that evals are the most friendly medium for getting domain experts into your workflow. They're much more friendly than asking them to write stuff from scratch or to edit prompts, because they're very flexible and they can grow with your needs.
Demetrios [00:46:33]: It's so funny you say that, because on the last podcast that I did, an hour ago, before we were on here, the guy was talking about how he had this idea of some fancy AI that he was going to add to his product. And after talking to all of his current user base, he realized, wow, people just want to save some money on their legal fees, so maybe we should try to figure out if we can use AI for that. And they ended up implementing it, and it's been a huge success. So I really like that: taking something that you're used to paying a whole lot of money for, and maybe it's not a hundred percent, but at least trying to figure out a way to get expertise for much cheaper and how you can fit that into your product with the help of the subject matter experts. So now I want to jump on the topic of the LLM as a judge, because we kind of danced around it with the subject matter experts and how they can help the judge. What I am fascinated by is how many different ways you can do this LLM as a judge.
Demetrios [00:47:54]: And I've seen some papers come out and say, no, just one LLM isn't enough, you need all 12 of them, so it's like a jury. I thought it was hilarious. It's like some researchers are hanging around in the dorm room having their pizza and beers and saying, well, what if we did not one LLM judge call but two LLM judge calls? And then somebody else is like, what if we did three of them? And so you just keep stacking more LLM judge calls on top of it, and you wonder, is there a place where that will kind of top out? So I know that you've been doing a lot here. Give me the download on it.
Alon Bochman [00:48:39]: Yeah, that's an ICLR paper for sure, right? Just n plus one LLM judges. Okay, so before we talk about all the different ways of doing it, the how, I propose a metric for deciding. And the metric I propose is the human agreement rate. Because, okay, we're swapping tips here. You've read a paper, and so have we.
Alon Bochman [00:49:10]: And there are lots of ways to do it, and we're swapping techniques. Anytime you're debating, is technique A better or technique B better, should I have two or three or five judges, any question you have about which approach is better as an LLM judge, there has to be a way to answer it that's more than whichever one of us has the bigger audience, or is yelling more, or is higher paid, or whatever. There should be a metric. So the metric I propose is: whichever LLM judge approach reaches the highest human agreement rate. And not with a human off the street, but with the humans that matter for you. Maybe it's your users, maybe it's your domain experts; we can define them in advance. And we're basically trying to optimize for that.
Alon Bochman [00:49:56]: And we're ML people, we optimize, we understand what optimize means. So we start with a simple approach, right? We start with a simple LLM judge, probably pick a cheap model, pick a simple prompt, and we measure the baseline human agreement rate. We have our eval set, we evaluate it with people, we evaluate it with the LLM, we measure the agreement, and it's whatever, 70%. And then we can try things. The jury could be good, because different models have different strengths. For use cases where, let's say, some of the questions are heavy math and some are heavy creative writing, maybe Claude is better at creative writing and maybe, I don't know, DeepSeek is better at math. And so if we jury them together, maybe we get a higher human agreement rate, or maybe they'll just fight with each other, we'll have less consistency, and the human agreement rate will be lower. My point is the empirical approach over the theoretical.
Alon Bochman [00:50:57]: Try it with your data and with your eval, instead of taking my word for it, your word for it, or the paper's. The person who wrote the paper does not know what's going to give the highest human agreement rate for your task. And if you can unlock 80, 90, 95% human agreement rate with your domain experts, you'd be amazed how happy they get, because it's a lot of work off their plate. Whenever the LLM judge expresses confidence, which is going to be 80, 90% of the cases, if there's high human agreement, then the domain expert is going to say, I don't want to look at any of those, just give me the exceptions. So immediately their work drops by a couple of orders of magnitude.
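A tiny sketch of "pick the judge setup empirically": score each candidate configuration by its agreement with your human labels on the same eval set and keep the winner. The verdicts are stubbed here; in practice they come from actually running each judge configuration:

```python
# Compare judge configurations by human agreement rate on the same eval set.
human_labels = [1, 0, 1, 1, 0, 1, 0, 1]

def majority_vote(votes_per_case):
    # e.g. [[1, 1, 0], [0, 0, 1]] -> [1, 0]
    return [1 if sum(v) * 2 > len(v) else 0 for v in votes_per_case]

judge_runs = {
    "single_cheap_model": [1, 0, 1, 0, 0, 1, 1, 1],
    "jury_of_three_majority": majority_vote(
        [[1, 1, 0], [0, 0, 0], [1, 1, 1], [1, 0, 1], [0, 0, 1], [1, 1, 1], [0, 1, 0], [1, 1, 0]]
    ),
}

def agreement(judge, human):
    return sum(j == h for j, h in zip(judge, human)) / len(human)

scores = {name: agreement(verdicts, human_labels) for name, verdicts in judge_runs.items()}
print(scores, "-> use", max(scores, key=scores.get))
```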
Demetrios [00:51:41]: That's awesome. So then would you say you always have to have those human-curated eval sets first? Like, you can't just raw dog it with some LLM as a judge right from the start?
Alon Bochman [00:51:59]: Yeah, you bring up a good point. There's a bit of a chicken and egg, right? Because, like we said earlier, if you don't have anything and you start by coming to your human domain experts and saying, here's a blank spreadsheet, teach me law, basically, that's what you're saying, they're going to be like, get the hell out of here. I went to school for this, I spent years on this, I have clients, and I don't have.
Demetrios [00:52:29]: I can't teach this, I'm too important for this.
Alon Bochman [00:52:31]: Right, right. So I think in the beginning, yes, you probably come up with the eval yourself, and you are the human that the LLM judge should be agreeing with. Because if the LLM judge does not agree with you on the first, let's say, 20 or 30 examples in your eval set, that's really, really easy to fix. It's a very tight loop: it's just me, myself, and I, and I test a couple of things. If the LLM is doing things that I totally don't expect, I can just fix it. I don't need to be the world's greatest domain expert to deal with these really simple few cases, so I'm going to start there. And then once the LLM judge behaves in a way that you trust on a small group of cases, which should not take long, we're talking about maybe days of work, less than a week.
Alon Bochman [00:53:25]: So if you have some agreement on a very narrow base of knowledge, you can say, hey, domain expert, I basically got it as far as I can get it. I'm not a law expert, but it doesn't say totally crazy things that I don't understand, so I trust it. And I can show you some things that it says. If you trust it, we can just let it go. If you want to get in the loop, you can. And I think they will want to. If you've cleared the bar that you can clear yourself, they're much more likely to engage at the next level, in my opinion.
Demetrios [00:54:03]: I like how you bring up this idea that we need a baseline of what we're comparing it to. So if we're using an LLM as a judge, what are we comparing it to? Let's start with something from me, and then we'll bring in the subject matter expert when it's needed. But we'll get the basic stuff out of the way so that I'm not wasting some famous or highly paid expert's time.
Alon Bochman [00:54:28]: Yeah, and there are a couple more hacks there. You're trying to accelerate learning, right? Imagine that you're trying to teach somebody law without taking them through law school. Okay, you could start with just common sense. There are also probably some datasets out there that are available where you're just teaching the basics. And then there might be a couple of examples where maybe your law firm has a different view on things, or maybe your accounting firm has decided it's okay to take a chance and be aggressive in this one niche because they have a variant view.
Demetrios [00:55:04]: Yeah.
Alon Bochman [00:55:04]: And because you work in the firm, you can just build it in as an example, showing that the knowledge is plastic and can fit what the firm thinks, not just what the standard benchmark thinks. And you can build one or two of those to show your domain expert what can be accomplished, and they will put in more.
Demetrios [00:55:26]: When you're grading the LLM judge responses, I imagine there's different vectors that you can grade it on. How do you look at that?
Alon Bochman [00:55:37]: Right. We call those criteria in our product, and there are hundreds. And just like every organization has a secret sauce, every evaluation will have different criteria for what success means. So I think it's a really juicy, challenging area for a lot of people. I think there's a lot of value in, first of all, having a library of criteria that's available off the shelf. And I don't think it's enough to have five or ten. I think you need a couple hundred to cover all the different use cases.
Alon Bochman [00:56:12]: Because they can be very, very different. And then, beyond the standard library, you'll want to create some that are special for your task. So you're not just creating the eval, the input and the output; the criteria, the prompt for the criteria, and the definitions for the criteria are also going to evolve as your system evolves. There's a lot of value in breaking things down. Usually people will start with really broad criteria like accuracy. Accuracy is kind of a catch-all, but it means different things to different people. And the way I've seen it evolve, maybe you've seen something different, is that typically an engineering team will define accuracy in a non-domain-specific way, and the domain experts will shit all over it: well, this got a 5 in accuracy, but it's totally missing that this guy's going to get sued.
Alon Bochman [00:57:12]: Or it got a one in accuracy, but there are all these good things about it and it should get partial credit. They shit all over it, and that's great, because they're engaged, you have feedback, and the opportunity there is to break down that one accuracy criterion into probably three or four that you can observe from the pattern of feedback you get. That's a very valuable activity. Maybe they care about naming things correctly; that's one kind of accuracy. Maybe they care about the length, the level of detail; that's a totally different axis. Maybe they care about the refusal rate, maybe they care about the tone, or how persistent you are, how many times you've tried.
Alon Bochman [00:57:56]: So these are all different dimensions, and I find the best way to elicit them is to show people obviously wrong things; then they jump in and correct it. And that feedback loop is a really good flywheel to get people to think more about what they want.
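A minimal sketch of what breaking "accuracy" into named criteria can look like for an LLM judge. The criteria, prompt wording, and model name are illustrative assumptions (it uses the `openai` client with a JSON response); the point is that the rubric is explicit data your domain experts can push back on:

```python
# Ask the judge to score each named criterion separately and return structured JSON.
import json
from openai import OpenAI

client = OpenAI()

CRITERIA = {
    "naming": "Are entities, formulas, and terms named correctly?",
    "detail": "Is the level of detail appropriate (not too terse, not padded)?",
    "tone": "Is the tone appropriate for a customer-facing answer?",
}

def judge(question: str, answer: str) -> dict:
    rubric = "\n".join(f"- {name}: {desc}" for name, desc in CRITERIA.items())
    prompt = (
        "Score the answer on each criterion from 1 to 5 and return JSON with one key "
        "per criterion, e.g. {\"naming\": 4, \"detail\": 3, \"tone\": 5}.\n\n"
        f"Criteria:\n{rubric}\n\nQuestion: {question}\n\nAnswer: {answer}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(resp.choices[0].message.content)

print(judge("What's the formula for EPS?", "Use the EPS code; it's probably fine."))
```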
Demetrios [00:58:14]: Yeah, it reminds me a lot of a talk that Linus Lee gave at one of the conferences we had last year. He was really banging on about how in music you can see sound waves in different ways: if you're a producer, you have an equalizer that you can play with in the production, in the DAW, and in filmmaking, or in Lightroom for photographers, you have these histograms, and they're so rich, and you can change the photo's output by playing around with them. And his whole thing was, how can we bring that to AI output? Is there a way that we can now start creating more of a visual aspect of this output? And it makes me think about what you're saying here. There are all these different vectors, all these different criteria, as you call them. So is there a way that we can visualize this in different ways so that it is more engaging for folks to look at?
Alon Bochman [00:59:32]: Yeah, a hundred percent. I think of it like an evolutionary process. We start with really big blocks that are really easy to put in place, and we get more refined and more sophisticated over time, and it kind of never stops. So it starts with accuracy, and then it evolves into maybe three or four criteria. It can eventually evolve into a checklist. Checklists are super useful in a lot of medical scenarios, a lot of high-risk scenarios, where you know that you need to meet all these different criteria to be successful. It's hard to remember them all. It's hard for even the best domain experts to remember them all.
Alon Bochman [01:00:08]: It's a perfect job for an LLM judge, because they're tireless. They'll be super rigid and check all the things you asked them to check, even if they've been doing it all day, and humans are not like that. So it's a great way to scale a rubric, which is a set of criteria, and scaling it makes things fairer, cheaper, and more valuable. And as an organization, you want to have a feedback mechanism about the rubric, not just about the eval, for the very same reasons. We're finding that social media companies are revising their views about what appropriate moderation is.
Demetrios [01:00:57]: Yeah.
Alon Bochman [01:00:58]: There used to be a rubric for how to grade a social media post, and it was implemented one way, and for a long time every post was graded on that rubric. There were people grading it. And now we're in a different era, and it's being graded in a different way. The rubric changes.
Demetrios [01:01:15]: Yeah.
Alon Bochman [01:01:16]: So my point is that for a commercial application, an LLM application, an AI application, you want to have that flexibility as well. Your rubric could change.