MLOps Community

Product Metrics are LLM Evals // Raza Habib CEO of Humanloop

Posted Jun 03, 2025 | Views 195
# Generative AI
# LLMs
# Humanloop

SPEAKERS

Raza Habib
CEO and Co-founder @ Humanloop

Raza is the CEO and Co-founder at Humanloop. He was inspired to work on AI as “the most transformative technology in our lifetimes” after studying under Prof David Mackay while doing Physics at Cambridge. Before Humanloop, Raza was the founding engineer of Monolith AI – applying AI to mechanical engineering, and has built speech systems at Google AI. He has a Ph.D. in Machine Learning from UCL.

Demetrios Brinkmann
Chief Happiness Engineer @ MLOps Community

At the moment Demetrios is immersing himself in Machine Learning by interviewing experts from around the world in the weekly MLOps.community meetups. Demetrios is constantly learning and engaging in new activities to get uncomfortable and learn from his mistakes. He tries to bring creativity into every aspect of his life, whether that be analyzing the best paths forward, overcoming obstacles, or building lego houses with his daughter.


SUMMARY

Raza Habib, the CEO of LLM eval platform Humanloop, talks to us about how to make your AI products more accurate and reliable by shortening the feedback loop of your evals: quickly iterating on prompts and testing what works, along with some of his favorite quotes from Dario at Anthropic.


TRANSCRIPT

Demetrios [00:00:00]: They don't know what they're getting blessed with right now: how you are looking at product metrics and combining those with evals.

Raza Habib [00:00:14]: So I think that there's no real difference between product metrics and evals. Or at least like the best evals are the same as product metrics. What do I mean by that? Yeah, so I think a lot of what you're trying to solve with evals is you're trying to get a measure of the quality of your AI system. And what's changed with generative AI and large language models is a lot more of the use cases are very subjective to measure performance on. If you are writing an email or summarizing a document or answering a question, there isn't one right answer anymore. And so it's hard to say what even is the correct answer. Take a concrete example of summarization: meeting summarization is so contextual as to whether or not it's a good summary. If you have a sales call, you want a different type of summary.

Raza Habib [00:00:56]: Different salespeople will care about different things, like, did you pick up key information? So there's no one correct answer anymore. So in some sense, the best eval, the eval you wish you could have in practice would be the end user outcome. Like, did the user get the outcome that they wanted? Were they able to achieve their goal? Did they use the text? Did they like it? Did they, depending on the use case, did the agent achieve the goal it was supposed to do for the person in the way that they expected? Was the customer support message actually helpful? And what evals are really trying to do is give you a proxy for that during development, because you can't have that necessarily before you put stuff into production. And so hopefully those two things are very correlated. And product metrics and evals are not that different. And a lot of the evals that people are creating with LLM-as-judge or like automated evals are often run as production metrics as well. Right. So on every utterance, they're actually scoring it with either a classifier or a piece of code or an LLM as judge, and then monitoring that for how things are drifting over time or to be alerted if something changes.
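To make that pattern concrete, here is a minimal sketch (not Humanloop's API; the `call_llm` helper, the 1 to 5 rubric, and the threshold are illustrative assumptions) of scoring every production utterance with an LLM-as-judge and watching a rolling average for drift:

```python
# Minimal sketch: score every production response with an LLM-as-judge
# and keep a rolling average so you can alert on drift.
# `call_llm` is a hypothetical helper standing in for whatever model client you use.
from collections import deque

JUDGE_PROMPT = """You are grading an AI assistant's answer.
Question: {question}
Answer: {answer}
Reply with a single integer from 1 (unusable) to 5 (excellent)."""

def call_llm(prompt: str) -> str:
    """Placeholder for a real model call (OpenAI, Anthropic, etc.)."""
    raise NotImplementedError

def judge(question: str, answer: str) -> int:
    raw = call_llm(JUDGE_PROMPT.format(question=question, answer=answer))
    return int(raw.strip())

recent_scores: deque[int] = deque(maxlen=500)  # rolling window of recent utterances

def on_production_utterance(question: str, answer: str, alert_threshold: float = 3.5) -> None:
    score = judge(question, answer)
    recent_scores.append(score)
    rolling_avg = sum(recent_scores) / len(recent_scores)
    if rolling_avg < alert_threshold:
        print(f"ALERT: rolling judge score dropped to {rolling_avg:.2f}")
```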

Demetrios [00:02:01]: And do you see this as like the happy path in whatever the product is? There's a next step that I'm trying to take. And if that gets accomplished after I have some interaction with AI, then that's a win. And you can easily see that's black and white. Like, I took that step, I made the next best action or I didn't.

Raza Habib [00:02:24]: Yeah. So in practice we see three very common types of in-production evals as default end-user metrics, and then ones that are more bespoke or use case specific. So the most basic ones that everyone does, you see them in lots of apps. The simplest thing is just end user feedback: thumbs up, thumbs down, scores, things like that. They're somewhat useful, but you don't actually get as much of it as you would want and it biases towards the extremes.

Demetrios [00:02:52]: I always laugh at those because it's like the people that are giving the thumbs up and thumbs down are just a very small subset of your user base.

Raza Habib [00:02:59]: But it's still useful data. So something we'll often see people do in Humanloop is look at the examples that got a bad rating because they want to try and understand really where are the failure modes, where is this system going wrong? Then you get corrections. So if people are generating text or getting an answer to something, they're editing a legal document or something like that and being given suggested edits, then they probably edit the text a little bit before they use it. And so having that correction can be a very useful signal, both how much did they correct it, but also as a stored data point of what the right answer was in that case. And then finally what you said of actions, like what is the natural next action and does the user take that: do they send the message, do they copy the text, whatever is contextual in your application. Those we see people use a lot as a baseline. And then things that we see on top are people will often build custom LLM-as-judge or code evaluators that are very use case specific, and those might be monitoring tone of voice, whether or not the stuff is on brand, or things like this that may be a little bit more subjective to monitor. And they might want that only in development, or they might want it in development and production as well.
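As an illustration, here is a minimal sketch of how those three baseline signals might be logged as structured feedback events tied back to a generation; the schema and the `log_feedback` sink are invented for the example, not a real SDK:

```python
# Minimal sketch of the three baseline production signals discussed above:
# explicit ratings, user corrections, and "did they take the next action".
from dataclasses import dataclass, asdict
from typing import Literal, Optional
import json, time

@dataclass
class FeedbackEvent:
    generation_id: str                       # ties feedback back to a logged output
    kind: Literal["rating", "correction", "action"]
    rating: Optional[str] = None             # e.g. "thumbs_up" / "thumbs_down"
    corrected_text: Optional[str] = None     # the user's edit, a free labelled datapoint
    action: Optional[str] = None             # e.g. "sent_message", "copied_text"
    timestamp: float = 0.0

def log_feedback(event: FeedbackEvent) -> None:
    event.timestamp = event.timestamp or time.time()
    print(json.dumps(asdict(event)))         # stand-in for sending to your eval store

# Usage: a thumbs-down, the correction the user actually made, and the next action.
log_feedback(FeedbackEvent("gen_123", "rating", rating="thumbs_down"))
log_feedback(FeedbackEvent("gen_123", "correction", corrected_text="Revised summary..."))
log_feedback(FeedbackEvent("gen_123", "action", action="sent_message"))
```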

Demetrios [00:04:14]: Yeah, because when you get into the enterprise, they can't just yolo things into production.

Raza Habib [00:04:19]: Yeah, exactly right. So when we built the first version of Humanloop, the initial focus was all around helping people to measure how well the system was working in production and use that to guide improvements. And that's still a core piece of the platform. But as we went upmarket and we started to sell more to mid market and enterprise companies, then yeah, you had this problem that you have to be able to be confident that it works well in development before you're willing to roll it out. Even to a beta group or to all your users. And so you need some way of actually building a rigorous set of testing metrics in dev that hopefully are predictive of what will happen when you actually deploy to production.

Demetrios [00:04:56]: So basically you went into Humanloop with that idea. And I like that you talk about the evolution and you also see how things change. I talked to you, what, like three, four years ago when Humanloop was starting?

Raza Habib [00:05:08]: Just over two years ago. Yeah, it was just a couple of months before ChatGPT came out, actually, I think, when we recorded that episode. Classic.

Demetrios [00:05:17]: Great timing.

Raza Habib [00:05:18]: And yeah, that is roughly how I split the world now, before and after ChatGPT.

Demetrios [00:05:24]: For some people it's Christ, for others, exactly, it's ChatGPT. Yeah.

Raza Habib [00:05:28]: BC. It was about three months BC.

Demetrios [00:05:30]: Yeah. So what have you changed? How have you like seen the product develop and grow and your thinking around AI in production in general?

Raza Habib [00:05:44]: Yeah. So, you know, when we started, LLMs were still very new. They're mostly being used just by startups. The applications were much more limited, so people were building like these single prompt based apps. It was, you know, the early use cases were copywriting or various forms of writing assistant, user comes to the website, describes what they want, spits out a blog post or helps them brainstorm. And the models were kind of just about good enough for that, but for more complicated use cases like question answering or agents, people were trying and building demos and even two and a half years ago you had the first agent demos, but the models weren't smart enough to make that reliable. And so the first version of Humanloop was really focused on helping people who are building this first generation of LLM applications to understand how well they were working in production. Because if you can't measure it, it's very difficult to improve it.

Raza Habib [00:06:32]: Or know how to make it better over time. And subjectivity and stochasticity was the challenge. Right. And it's still the challenge. But if you think about copywriting or writing assistance, that's an extreme point: evaluation is very subjective. And for a lot of engineers it was the first time that they were building software where every time you run it you get different results. Right. For people in the machine learning community this is relatively normal, but I think that if you are not from a machine learning background, then that was actually quite a new experience.

Raza Habib [00:07:01]: And so the first version of Humanloop was very simple. It allowed you to monitor what was happening in production. So basically every time someone was interacting with your system, to get the inputs and the outputs, gather end user feedback, and match that against different prompts. So you could kind of do AB testing or comparisons of prompts in production. What's changed since then is as the models have gotten smarter, the complexity of the use cases has gone up enormously. So we've gone from simple writing apps to people automatically negotiating legal contracts. Right. So people like Filevine and Entre are our customers.

Raza Habib [00:07:33]: Or we've got people building language learning assistants at Duolingo. Or at Gusto, they're building complex agents that can actually go in and do tasks that accountants were previously doing in payroll software. So we've gone a long way from, hey, just generate me a blog, to actually go do something complicated for me.

Demetrios [00:07:51]: Multi step.

Raza Habib [00:07:51]: Yeah. And so we've had to build out a lot, but still the fundamental problem is the same, which is that I'm trying to take this unreliable base thing, the LLM, and build a reliable, trustworthy application on top of it. But the stakes have gone up. Right, as we started to work with enterprises, they care a lot more about avoiding mistakes, about having guardrails in place. The complexity of the applications has gone up, so tracing and observability actually becomes way more important. The way that people build has changed as well. The original Humanloop was just this in-production gathering of feedback.

Raza Habib [00:08:28]: Since then, we built that out into a full tracing and observability system. So if you have an agent that's going off and doing things, it'll let you see the full path of what that system did. So a user asks a question, then every turn of the conversation or every step that the agent takes, you have that recorded and you can augment that with feedback or evaluation data. So you have this very rich data set that allows you to do governance but also improve your system. Like what is and isn't working and how do I guide that? And then coupled to that, I think what's become the core of the application actually is an evals tool. And this is really used primarily in development. But what it allows people to do is to write down essentially the criteria that the system needs to pass to be good and have a data set of test cases that they can then iterate against. So LLM development is very iterative. You build the first version of the pipeline as quickly as possible, get something that works end to end, and then you're looking to go from something that works end to end to something that's good enough to trust in production.

Raza Habib [00:09:26]: And you're iterating over prompts, you're iterating over tool definitions, over pre-processing and post-processing code. You're trying out different models, you're trying out different data sources for the context. If you don't have a way to know, as you make those changes, am I making things better or worse, then it's very difficult to actually make progress and not just spin your wheels. And so what evals do is they allow you to get quantitative feedback on every change that you make so that you can actually tell, am I improving things? And that's really helpful during development to actually get to a system that works well. And I think we spoke alongside Filevine at the recent AIOps conference that you guys had organized, and they were able to take a system from just over 50% accuracy on doing information extraction to 90% plus. But it would have been impossible without this constant feedback loop.
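As an illustration of that iteration loop, here is a minimal sketch, with a made-up test set and a toy code evaluator, of running two candidate pipeline versions over the same cases and comparing aggregate scores; none of these names come from Humanloop:

```python
# Minimal sketch of the iteration loop described above: run each candidate
# version of the pipeline over a fixed test set and compare aggregate scores,
# so every change gets quantitative feedback.
from typing import Callable

test_cases = [
    {"input": "Summarise: the meeting covered Q3 pipeline...", "expected_topic": "Q3 pipeline"},
    {"input": "Summarise: customer asked about renewal pricing...", "expected_topic": "renewal pricing"},
]

def evaluator(output: str, case: dict) -> float:
    # Simple code evaluator: did the summary keep the key topic?
    return 1.0 if case["expected_topic"].lower() in output.lower() else 0.0

def run_eval(pipeline: Callable[[str], str]) -> float:
    scores = [evaluator(pipeline(c["input"]), c) for c in test_cases]
    return sum(scores) / len(scores)

def pipeline_v1(text: str) -> str:
    return "Summary: " + text[:20]                        # stand-in for version 1

def pipeline_v2(text: str) -> str:
    return "Summary: " + text.split("Summarise: ")[-1]    # the candidate change

print(f"v1: {run_eval(pipeline_v1):.2f}  v2: {run_eval(pipeline_v2):.2f}")
```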

Demetrios [00:10:16]: Yeah, evals were the most important thing.

Raza Habib [00:10:20]: Yeah. At least for them. And I think it's one part amongst many, but it's kind of an essential baseline. If you don't have good evals, I don't think you're going to be able to get very far on the other pieces. So it's foundational.

Demetrios [00:10:34]: Stay with me here.

Raza Habib [00:10:36]: Yeah.

Demetrios [00:10:38]: Have you heard the parable of the scorpion and the frog?

Raza Habib [00:10:43]: I have, but maybe we should remind the audience.

Demetrios [00:10:47]: So. So there's this parable of a scorpion and a frog, and the scorpion's trying to get across the river, and so it sees a frog and it says, hey, will you give me a lift? I need to get across the river, so can you take me? And the frog is like, yeah, but you're a scorpion and you got a tail. And if you sting me, then I'm gonna drown and I'm gonna die. And the scorpion said, don't be ridiculous. Why would I sting you if I'm on your back? And then you drowned, and I also drowned. And so the frog thinks about it for a second and goes, oh, okay, I'll buy into that one. Yeah, get on. And so the frog's taking the scorpion across the river and the scorpion can't help itself.

Demetrios [00:11:33]: It just, boom, at a certain point, it stings the frog and they both go down and they drown. And the reason I'm saying this...

Raza Habib [00:11:41]: As they're drowning, the frog says, oh, yeah, why did you sting me?

Demetrios [00:11:44]: Yeah.

Raza Habib [00:11:45]: And he says, it's just in my nature.

Demetrios [00:11:46]: That's it. That's the key part. Yeah, it's in my nature. Right. And the reason I think about that story is because a lot of times I wonder if, with these LLMs, we're caging them and trying to put all of these guardrails around them and all of this different stuff. We're doing all this wild engineering, like evals and everything that we can muster up, when these hallucinations are just part of the nature of the LLM, and we don't like them, but we have to figure out a way to contain them.

Raza Habib [00:12:24]: Yes. I feel like there's maybe multiple things going on here and there's a few things to talk about. Let's put a pin in the hallucination question. I really want to come back to that, but I think there's an extent to which sometimes what people are doing is working around the limitations in the current generation of models. Right. So some of what people are doing is, oh, the system is not reliable enough or the models aren't smart enough. So we have to figure out a way around that, and that's a stopgap solution. But I think even if you had a model that was arbitrarily smart, let's say you just have an API to Einstein or whatever, what Dario calls his data center of geniuses.

Raza Habib [00:13:02]: I don't think the need to try and iterate on your system to achieve your goal goes away. You still have to figure out what is the context the model needs to be able to do this goal. Is it doing the goal in the way that I expect? If I change things over time, is that still true? So OpenAI releases a new model, and I want to try it out. What is the impact of that change on my metrics? Maybe the accuracy of a system goes up, but people don't like the tone of voice as much or something like this. So it's not necessarily about mitigating limitations of the system so much as, even if you had the smartest model in the world, it's always going to be stochastic by nature of these systems. And we're not trying to get rid of that stochasticity. That's often a benefit. And the models are always going to have slightly different characters.

Raza Habib [00:13:51]: Because I think, again, sort of sticking to lines from Dario, since I think he has some good lines. He says we don't so much build these systems as grow them. There's a certain unpredictability in the character of different models, and that's a feature, not a bug, because it means we get lots of different personalities or styles or skills or variations. That variance is something that I think people like and can make use of. I don't think that takes away the need for evaluation. If you're a software engineer, you're writing unit tests, you're writing integration tests. Why would it be any different with AI? It's just that the types of tests you need to write are different. And in fact, if you come from a machine learning background, you were always building a test set of cases and measuring metrics on them.

Raza Habib [00:14:31]: It's just that the types of metrics and how you do it are changing. And then on the point of hallucination, when I speak to people about this colloquially, I find it very hard to get out of people a coherent definition of what they mean by hallucination as opposed to just the model being wrong. When you talk about hallucination, you're implying an intent to the model. And so most people, I don't think, have a clear idea in their head of the distinction between those two things. Maybe academics have defined a good distinction. But when people say the model hallucinated when it answered a question, what's the difference between that and it just got the answer wrong?

Demetrios [00:15:05]: Yeah, I guess a hallucination is the wrong answer at the end of the day. So it makes sense that they're one and the same. But the idea there then is mitigating the wrong answers, not mitigating the hallucinations. In a way, it's like, let's just figure out how to be more consistent with right answers. And we can do that by, like what you're saying, having this overview of how the system is changing over time.

Raza Habib [00:15:39]: I feel like when people use the word hallucination, it implies something about their mental model of, like, what the LLM is doing. And I'm not sure exactly what that mental model is that they're thinking of, but it feels at least conceptually muddy to me. Yeah, like, I don't think people are clear on what they mean. And I think it would just be clearer to say, like, the LLM is being factually inaccurate.

Demetrios [00:16:01]: Yeah, it's getting things wrong. Yeah, I like that on the inside of what you've built, you've got this system running. Right. And you're recognizing that there's changes that are happening to the system continuously. But then there's the production environment and there's logs and traces that are coming back and it goes to these product metrics. Like, you're getting product metrics back, you're getting information back. So how are you using that to update your system and how are you making it easy for that loop to just continuously be happening and ideally, like, get better each time?

Raza Habib [00:16:41]: Yeah. So maybe this is a good point for me to kind of explain the different pillars of what's in Humanloop.

Demetrios [00:16:45]: Yeah.

Raza Habib [00:16:46]: Because I think understanding what we've put into the LLM evals platform and why we chose those pillars kind of answers this question. So there's really three core pillars. We've talked about two of them. One is the in-production tracing and observability. The other is having data sets and evaluators that you can run during development. And then the final pillar that we haven't talked about, but is kind of the answer to this question, is tools for managing prompts, iterating on them, collaborating between the technical people and the non-technical people who have the domain knowledge, maybe, to be able to change a prompt or to look at an output and sort of score it in some way. So some concrete examples here: we mentioned Filevine earlier.

Raza Habib [00:17:26]: At Filevine, it's lawyers who are doing a lot of the prompt engineering. At Duolingo, it's language learning experts. At Fundrise, they have their kind of real estate experts and product managers be involved in writing the prompt. So at a lot of our companies, the people who are either looking at the evals and scoring them (at Gusto, it's customer support staff) or the people who are writing prompts are not the engineers themselves. Part of the platform as well is separating out the configuration, which is the prompt and the tool definitions, from the code, so that you can iterate on the prompts independently but still have them versioned and tracked in the way that you would have with traditional software. So those are the components. And then now we can kind of answer the question of, okay, well, I'm monitoring all of this stuff in production, how do I actually use that to make things better? And usually the cycle is that someone will be looking at an eval or they'll be looking at logs from production and they'll have some hypothesis about why the system isn't working the way it's supposed to be working or something that's wrong. So they'll spot an inaccuracy in a RAG system or they'll spot the tone of voice is wrong or something like that, and they'll hypothesize...

Demetrios [00:18:32]: Is this person usually technical or non-technical?

Raza Habib [00:18:34]: We actually see it with both. So it's often PMs. So PMs sort of bridge that gap. Or it might be the engineers themselves looking at the eval logs. They will then say, okay, I think what's going on is that the model is not getting the context it needs. My guess is that the problem is that actually the RAG system is not retrieving enough. We should go and improve the recall of the RAG system. Well, they can then go make that change within Humanloop.

Raza Habib [00:18:57]: They can actually update the configuration of the tool, or they can go and change a prompt, or they can trigger a fine-tune of a model, and then they can run the eval and see, did I actually make things better on my metrics or not. And then there's an ability to deploy that change from there. For a lot of the changes you might be wanting to make, which are changes to prompts, tool definitions, or trying to run some kind of fine-tuning job, you can often do that without having to go into code. And so that allows the non-technical people to iterate very quickly and frees the engineers to do their own thing, rather than fielding requests from the PMs and the non-technical domain experts. And something we see a lot otherwise is a PM wants to make a change to a prompt or wants to try something. And that might involve a full app redeployment. Maybe they have to open a ticket in a system like Jira or Linear to be like, please update this prompt. Then an engineer will go edit that in code.

Raza Habib [00:19:51]: Then they'll do the app redeployment. It's like, oh, actually that didn't make the change I was expecting. Or if they don't have good evals in place, then people are nervous to make changes because they're worried about regressions.

Demetrios [00:20:01]: I like how you're bridging the gap between all these different personas. So it's as technical as you want to be, or as subject-matter-expert-friendly as you need.

Raza Habib [00:20:09]: There are really three personas that we interact with a lot. And maybe this goes as well to what we've seen the most successful teams do when they're building products. Nice segue, by the way. So you have your product engineers, or the actual software engineers, who are building the system. They're making the API calls to the LLMs. They're actually, you know, building all the pipelines. But oftentimes they're not the right people to be able to define what good looks like. And they're not necessarily the right people to write the prompts. And the people who are usually best at that are the product managers.

Raza Habib [00:20:43]: Writing a spec is not that different to writing evals, actually. There's some definition of what good looks like. And oftentimes there might be a third persona, that may or may not be the PM themselves, which is the person who has the domain expertise to kind of define what the goal is and what good looks like. So sometimes that's just the product manager, but sometimes it's a separate person.

Demetrios [00:21:03]: And that's where the lawyer comes in and says, wait a minute, you think this is okay?

Raza Habib [00:21:08]: And so the way those people get used is either they're being used just to provide feedback. So sometimes what people will do is, as part of the eval loop, they'll map the system over a data set of test cases. They'll have the lawyer or the domain expert, whatever the use case is, come in, look at each output and score it as pass/fail with a critique. So hey, this one is good, it's fine; this one is bad because of such and such. And then that binary data of which ones were good enough, and the reasons, then become what informs the next round of either prompt engineering or updates. And sometimes the lawyer will go do the prompt engineering themselves, sometimes the engineers or PMs will do it. But that feedback cycle is really useful, and often it also gets used to build the LLM-as-judge, because that human feedback is invaluable but very expensive and hard to scale. And oftentimes what you can do is take this combination of pass/fail feedback and critiques and then essentially kind of distill that into a prompted LLM that you can now use at scale.

Raza Habib [00:22:07]: And as long as it correlates highly with the human judgment, you've now got a way to do more scalable evaluation that can also be done in production.
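A minimal sketch of that distillation step, assuming a hypothetical `call_llm` helper and invented example data: the collected pass/fail verdicts and critiques are folded into the judge prompt as guidance, and before relying on it at scale you would check its verdicts against held-out human labels, as described above.

```python
# Minimal sketch of "distilling" human pass/fail critiques into an LLM-as-judge:
# the collected critiques become few-shot guidance in the judge prompt.
human_reviews = [
    {"output": "The clause limits liability to $1M.", "verdict": "pass",
     "critique": "Accurately reflects the contract."},
    {"output": "The contract has no liability cap.", "verdict": "fail",
     "critique": "Wrong: section 7 caps liability at $1M."},
]

def build_judge_prompt(candidate_output: str) -> str:
    examples = "\n".join(
        f"- Output: {r['output']}\n  Verdict: {r['verdict']} ({r['critique']})"
        for r in human_reviews
    )
    return (
        "You are reviewing contract summaries. Use these expert judgments as guidance:\n"
        f"{examples}\n\n"
        f"Now judge this output, replying only 'pass' or 'fail':\n{candidate_output}"
    )

def call_llm(prompt: str) -> str:
    raise NotImplementedError  # plug in your model client here

def llm_judge(candidate_output: str) -> bool:
    return call_llm(build_judge_prompt(candidate_output)).strip().lower() == "pass"
```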

Demetrios [00:22:17]: So evals are being done primarily by folks that are PMs. People talk about how data scientists can fill that role really well, to create these evals and to also help explore in the prompt exploration space. And it's very akin to what a data scientist is used to doing.

Raza Habib [00:22:40]: It's very helpful to have someone on the team who has a background in stats or machine learning or data science, just for the rigor of understanding how to do evaluation well: understanding the distinction between held-out data versus not held-out data and what kind of metrics you might want to calculate. And if you're dealing with stochastic systems, you probably want to run them multiple times and get averages, and stuff that will be obvious and intuitive to anyone from a stats background. But someone who's not done that style of thinking might not even know to think about that. And so the typical makeup that we see is usually there's the normal number of engineers you would have on any product team, maybe one data scientist, and then the PMs and the domain experts, and they're collaborating together. Right. So each of those people is bringing something to the table. The data scientists usually don't have the relevant domain knowledge to be good at doing the prompt iteration. And so what you really want is to be able to give them tools that allow the non-technical people to come in and iterate without being blocked by either of these other groups, and they can share information between them.
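A small sketch of the statistical hygiene he mentions: because the system is stochastic, run the eval several times and report the mean and spread rather than a single number. The `pipeline`, `evaluator`, and `cases` arguments are the same kind of illustrative pieces as in the earlier sketch, not part of any real API.

```python
# Minimal sketch: repeat the eval run and report mean and standard deviation,
# so you can tell a real improvement apart from run-to-run noise.
import statistics
from typing import Callable

def eval_with_repeats(pipeline: Callable[[str], str],
                      evaluator: Callable[[str, dict], float],
                      cases: list[dict],
                      repeats: int = 5) -> tuple[float, float]:
    per_run_means = []
    for _ in range(repeats):
        scores = [evaluator(pipeline(c["input"]), c) for c in cases]
        per_run_means.append(sum(scores) / len(scores))
    return statistics.mean(per_run_means), statistics.stdev(per_run_means)

# Usage (with the illustrative pieces from the earlier sketch):
# mean, spread = eval_with_repeats(pipeline_v2, evaluator, test_cases)
# A change is only a real improvement if the means differ by more than the noise.
```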

Raza Habib [00:23:43]: The way this works in Humanloop is that when you want to run an evaluation, you map your application or your LLM system over a data set. It generates all the outputs and then it scores them. And one of the ways it can score them is with human feedback. So it creates a queue of jobs for the domain experts to come in and actually provide that feedback.

Demetrios [00:24:01]: Nice. And then do you also map it to the product metrics?

Raza Habib [00:24:05]: So not during development, but during production. What happens is that people add some logging calls or we have integrations with OpenTelemetry so they can send data to Humanloop of what is the agent doing. And then once those traces land on Humanloop, people can set up evaluators to be in monitoring mode. And so what that means is that every time a trace comes through, the evaluator, say an LLM-as-judge evaluator, will look at the output, score it, and basically you're building this data set of traces augmented with product metrics or augmented with end user feedback. And so when a PM comes into the app, a typical workflow might be, okay, show me the cases where the LLM thought this was bad or show me the cases where a user gave me bad feedback. I'm going to look at those and try and determine whether I need to go change a prompt or fine tune a model or update my RAG system. I'm going to use that to guide where my failures are.
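A minimal sketch of a monitoring-mode evaluator, with an invented trace shape and a toy stand-in for the judge: every incoming trace gets scored, and a PM can later pull up only the cases the judge or the end user flagged.

```python
# Minimal sketch of a "monitoring mode" evaluator over incoming traces.
traces: list[dict] = []

def judge_pass(output: str) -> bool:
    """Toy stand-in for an LLM-as-judge call like the one sketched earlier."""
    return "sorry, i can't" not in output.lower()   # illustrative heuristic only

def on_trace(trace: dict) -> None:
    trace["judge_pass"] = judge_pass(trace["output"])
    traces.append(trace)

def failing_cases() -> list[dict]:
    """What a PM might pull up: traces the judge or the end user flagged."""
    return [t for t in traces
            if not t["judge_pass"] or t.get("user_rating") == "thumbs_down"]

# Usage
on_trace({"output": "Sorry, I can't help with that.", "user_rating": None})
on_trace({"output": "Your refund was processed.", "user_rating": "thumbs_up"})
print(len(failing_cases()))   # -> 1
```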

Demetrios [00:24:54]: And now you said that you can update the RAG system from within Humanloop.

Raza Habib [00:24:59]: So there's some amount that people can do within Humanloop and some that might require busting out into more powerful tools. So within Humanloop people are often specifying configuration. So they're specifying chunk sizes, what metric is being used to actually do the retrieval, whether it's a traditional IR method or they're doing something with semantic similarity. So they have a tool definition within Humanloop, and that's the high-level configuration for the retriever, and they can edit that. And the way it works is we're kind of like a CMS for prompt and tool definitions. So in production, the way it works is the production application basically speaks to Humanloop and says, hey, what's the prompt that's currently deployed? Or what's the tool configuration that's currently deployed? Humanloop sends that back, or it gets cached locally, and then the retriever gets called. So you're managing the configuration, which is the prompt and tool definitions, from within Humanloop, and that allows you to iterate quickly and make changes very fast on the application.
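A minimal sketch of that configuration-as-CMS pattern, with invented names and values (this is not Humanloop's SDK): the application fetches whichever prompt and tool configuration is currently deployed and caches it briefly, so a prompt change does not require a code redeploy.

```python
# Minimal sketch: fetch the currently deployed prompt/tool config from a
# config service and cache it locally for a short time.
import time

_cache: dict = {"config": None, "fetched_at": 0.0}
CACHE_TTL_SECONDS = 60

def fetch_deployed_config(name: str) -> dict:
    """Placeholder for an HTTP call to whatever stores your prompt/tool config."""
    return {"prompt_template": "Summarise for a {role}: {document}",
            "model": "gpt-4o",                                  # illustrative values only
            "retriever": {"chunk_size": 512, "top_k": 5}}

def get_config(name: str = "meeting-summariser") -> dict:
    if _cache["config"] is None or time.time() - _cache["fetched_at"] > CACHE_TTL_SECONDS:
        _cache["config"] = fetch_deployed_config(name)
        _cache["fetched_at"] = time.time()
    return _cache["config"]

print(get_config()["retriever"])
```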

Demetrios [00:25:56]: And so that configuration, how are you tracking it over time or tracking changes? Is it just in GitHub?

Raza Habib [00:26:03]: So we have sort of a version control system built into Humanloop that tracks it and then we also have serialized versions of all of these, we call them files. So every tool and prompt in Humanloop has its file type and there's a prompt file format or tool file format that people also check into Git. So people can have like a sync with Git if they want and still be using Git for versioning. But we're also keeping track of the history of changes. We've got audit logs, you can see who changed what on Humanloop natively. So if people are doing it through the UI as a non-technical person, you have the same rigor as you would have if an engineer was committing code.

Demetrios [00:26:36]: Let's say I update a prompt or I start seeing something is going a little bit weird and so then I'm going to change out a model. What is the CI/CD process?

Raza Habib [00:26:47]: So typically the workflow would be someone goes into Humanloop, they would make the change in our editor, save a version of that prompt and they trigger an evaluation. Sometimes these evals are actually triggered as part of just typical CI/CD. So someone is running CI/CD and a GitHub Action is going to call Humanloop to run the eval report, and then they get back the metrics and the diff, so they can set thresholds, like we have to be above certain metrics before something's going to be committed to prod, and then it goes to prod. So it's actually very much just the normal CI/CD people are used to. The only difference is that one of the GitHub Actions or one of the hooks in your CI/CD system will be calling an eval report on Humanloop for that change.
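A minimal sketch of that CI/CD gate, not tied to any real SDK: one step in the pipeline runs the eval report and fails the build when the aggregate score is under the agreed threshold.

```python
# Minimal sketch of a CI gate: run the eval report, fail the build if the
# aggregate score is below the threshold the team agreed on.
import sys

EVAL_THRESHOLD = 0.85    # "we have to be above certain metrics before it goes to prod"

def run_eval_report() -> float:
    """Placeholder: run the eval suite over the test set and return the aggregate score."""
    return 0.91

def ci_eval_gate() -> None:
    score = run_eval_report()
    print(f"eval score: {score:.2f} (threshold {EVAL_THRESHOLD})")
    if score < EVAL_THRESHOLD:
        sys.exit(1)       # a non-zero exit fails the GitHub Action / CI job

if __name__ == "__main__":
    ci_eval_gate()
```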

Demetrios [00:27:30]: Okay, what have you been seeing as far as use cases and actual AI that is making people money? It's always easy and fun to talk about all these evals and everything that you can be doing. But at the end of the day, if you're not looking at what the real potential is... where's the business? Where's the business impact, where are the real use cases? And so how have you been seeing folks succeed in this department?

Raza Habib [00:28:00]: Yeah, absolutely. So we see more customer-facing use cases. So there's kind of two buckets where I think people are either saving money or making money with GenAI. So there's internal process automation: people are building customer service support agents. Right. Whereas previously they might have had humans take a certain amount of time, they can improve their deflection rates or increase the success with things like customer support.

Raza Habib [00:28:23]: That's a very obvious use case that at scale I know is saving companies tens of millions of dollars, and I can't name names, but at least one company I know is saving on the order of $30 million a year from what they had before. But then what we spend a lot more time on is customer-facing applications, so less internal process automation. And the reason why we naturally end up spending time on that, or our customers are building that, is that in an internal app people will tolerate a certain amount of bugginess; it doesn't need to be 100% accurate. But as soon as you're putting something in front of customers, then the reliability and confidence that it's going to perform the way you expect goes up way higher, and the need for evals and tracing and observability becomes an essential thing. So we mentioned earlier Filevine, which is a legal tech company, and the reason I'll talk about them is because we had a case study together that's also shared.

Demetrios [00:29:14]: Shout out to Brianna.

Raza Habib [00:29:16]: Shout out to Brianna from Filevine. But they've released something like six new AI products, in their contract lifecycle management product and tools for lawyers, across the last year and a half, that are generating I think tens of millions of dollars of net new revenue for them and growing very quickly. So legal tech is one vertical, and there are other companies, Harvey, Entre, Ironclad, that are doing amazing things with LLMs in the legal space. So those are applications that I think are generating significant revenue. Coding tools is a vertical where we know there's a lot of revenue being generated: Cursor, Windsurf, now you've got Claude Code from Anthropic, and OpenAI has their own tools natively. GitHub Copilot was the OG.

Raza Habib [00:30:02]: Right. So that's another set of vertical applications that we know people are getting a ton of value from.

Demetrios [00:30:06]: Can we pause right there for GitHub Copilot? Because talk about opening up a whole market and then just losing it. It's so incredible how much work they did to create this whole new field and then let Cursor come and eat their lunch.

Raza Habib [00:30:24]: Yeah, you know, it's interesting. We'll see. I think the game is still very early. I think it's too early to call how it will play out. There's not that much differentiation even now between these different coding tools. Developers are very fickle. They swap models and they swap which tool they're using very quickly.

Demetrios [00:30:40]: True.

Raza Habib [00:30:41]: So I wouldn't call it yet.

Demetrios [00:30:43]: Yeah. But with GitHub Copilot, I feel like... it's true that you never know. It could come out of nowhere or rise like the phoenix from the ashes.

Raza Habib [00:30:56]: Yeah. I mean, I'm interested to see what the model providers themselves start to produce.

Demetrios [00:31:00]: Well, there were talks about Windsurf being...

Raza Habib [00:31:03]: Bought by... there are rumors that OpenAI is going...

Demetrios [00:31:04]: To buy Windsurf, or it already bought them, who knows?

Raza Habib [00:31:07]: And then you've got Claude Code, you've got OpenAI's Codex. So, you know, they're already releasing their own tools. They're very clearly entering the game of competing directly on coding tools.

Demetrios [00:31:17]: Yeah. And Copilot has some kind of partnership that I saw with one of the... I think it might be Anthropic.

Raza Habib [00:31:27]: Interesting, because early on they had a very close partnership with Microsoft, because of the Microsoft and OpenAI relationship. And the early GitHub Copilot models were the early Codex models from OpenAI, sort of chucked over the wall, like, you know.

Demetrios [00:31:40]: All right, so continue. I cut you off.

Raza Habib [00:31:42]: So there's coding. In terms of vertical applications, legal tech is definitely a domain where people are making a ton of money. Coding is one where people are making a ton of money. Another pocket that we see is edtech. So we have a bunch of companies who are building tools for teachers or for students, or AI tutoring. One of our customers is Macmillan, who are building an AI tutor, and Duolingo for language learning.

Demetrios [00:32:03]: Love Duolingo.

Raza Habib [00:32:04]: And a bunch of.

Demetrios [00:32:05]: And a bunch of others. It helps if you've ever tried to... You think you've got like a level 7 or level 10, and then you go to the country and you're like, yo hablo español, ¿tú hablas español?

Raza Habib [00:32:18]: But I think, I think that's going to get a lot better with improved AI tools. Right? You can actually practice conversations.

Demetrios [00:32:23]: Exactly. Because I need to learn German. I've been living in Germany for too long to not be able to speak German.

Raza Habib [00:32:28]: But you don't speak German.

Demetrios [00:32:29]: Yeah, exactly.

Raza Habib [00:32:29]: I always assumed that because you were in Germany, you were German.

Demetrios [00:32:32]: I was Deutsch. No, I am very not Deutsch. And it's starting to show now. My daughter's six and she speaks German incredibly. And so I'm missing out on a lot, half the conversation, because she'll speak to my wife in German. And so, Duolingo, I'm going to start using it again and see if, see if I can...

Raza Habib [00:32:52]: Get fluent in German.

Demetrios [00:32:53]: And because of Humanloop, I'm going to say, Raza, you saved me. Now you're the reason I speak German.

Raza Habib [00:32:58]: I don't know if we can take too much credit. I think the good folks at these companies are doing the hard work. We're hopefully facilitating. So yeah, those are, I think, three concrete examples. I could probably name more. I think those are still applications that are maybe one generation back from the state of the art, in the sense that there's been a progression of complexity of things that people are building with LLMs. Right at the beginning, everyone was building these writing assistants. The only thing that really had traction in the beginning was marketing automation.

Demetrios [00:33:27]: Copy AI.

Raza Habib [00:33:29]: Copy AI.

Demetrios [00:33:31]: Jarvis, that was the one. They raised a ton of money.

Raza Habib [00:33:35]: There is a ton of money.

Demetrios [00:33:36]: I don't know.

Raza Habib [00:33:36]: I don't know what's happened to it.

Demetrios [00:33:37]: And then, well, OpenAI, when ChatGPT came out, it was like, oh, we don't need Copy AI anymore.

Raza Habib [00:33:43]: But interestingly, some of these companies have managed to keep reinventing themselves. Like rytr, I think, is doing really well. And they started off in a very similar vein. But you know, what they did differently was that they were always focused on private models rather than building on top of the large public LLMs. So they seem to be going fine. But yes, you had kind of copywriting, and then I think you started to get RAG systems and people realizing, hey, we can actually use this for customer support or for factual question answering in various ways. And we're only now seeing more agentic applications really get deployed.

Raza Habib [00:34:18]: And I think the first version of agents is still very much skeuomorphic: I'm going to go do what I used to do with a human, with an agent. But we're starting to see some of our customers build things that go one step further and are imagining things that wouldn't have been possible before LLMs. So they're still agentic applications, but they're not, I'm going to go replace a human; they're, I'm going to go do something that couldn't be done before. And a concrete example of this: one of our customers, a company called Windmill, and you know, the problem they're solving is that performance management in large companies sucks. You get a review from a customer, like, you know... not a review from a customer, you get feedback from your manager in a cycle, like once every six months.

Raza Habib [00:34:58]: The manager doesn't really understand, like, what's going on. It's not a great experience. But what you can do if you have an army of LLM agents is they can go out and they can speak to everyone in your company all the time. They can proactively ask them questions, they know who everyone's colleagues are, and they can then aggregate that information back up and share it in both directions. The managers can have a much better idea of what's happening on the ground. And also you can get real-time or continuous feedback from both your peers and your manager on how you're doing, mediated by these LLMs that can effectively have conversations with everyone. And it's still very early, but that's an example of an application that I think wouldn't have been possible. You couldn't imagine doing that before LLMs.

Raza Habib [00:35:42]: So I'm quite excited about people pushing agents in that direction.

Demetrios [00:35:45]: That's a fun thread to pull on because you can go in the direction of, now that we have the ability to talk to people at scale, what can we do?

Raza Habib [00:35:57]: Yeah, absolutely.

Demetrios [00:35:58]: I imagine they're talking to people in various formats. It's maybe that you get on a call and there's...

Raza Habib [00:36:06]: I think in this case it's in Slack, but you could imagine it being a voice agent. You could, you know... it'll just send you a Slack message with a few questions, and it'll say, hey, we spoke to the colleagues who work most closely with you and they think you're doing a great job, here's what you're doing well, but we got this theme that came up across four different people where you might want to improve. And that wouldn't have been possible before.

Demetrios [00:36:28]: I take feedback horribly. So I would not like that tool at all.

Raza Habib [00:36:35]: Yeah, I think you can imagine more pro-human, pro-social and less pro-social versions. I think this is actually taking a very productive approach and just really trying to help people communicate better.

Demetrios [00:36:45]: Yeah. It reminds me of this idea that friend of the Pod Fausto had and he was saying, you know, what about when everyone joins the community, they get on a call with an avatar and they're asking questions. The avatar is asking you questions on why are you joining this community? What is interesting for you? Are you looking for a job? Are you looking just to learn more? Are you looking for other people in the field? Because we kind of know the buckets that people will fall into when they join the community. And it's one of like these seven things.

Raza Habib [00:37:20]: And so to what extent can you not do that just with a sort of form when they join?

Demetrios [00:37:24]: Because people don't fill out forms. That's the problem. Okay, nine out of ten people just go, yeah, fuck that. Like, I'm not going to take five minutes and fill this out. Even though if they did and we knew more information and we knew, like, wow, five out of 10 people who join are looking to meet others in the field. We have this program where we set people up, a one-on-one matching program with other people in the community. And so we can suggest that to them and make sure that they know that we have this program. But if you never get the data, you don't know.

Demetrios [00:38:03]: And so I can suggest it, but maybe 5 out of 10 times it falls flat.

Raza Habib [00:38:07]: Yeah. And one of the things that I think is interesting that you can do with this tech, that you couldn't do with just a form, is ask follow-up questions, and ask follow-up questions that are specific to the person. You know, adaptive forms, I don't think, were really possible before.

Demetrios [00:38:23]: So again, it goes back to being able to talk to people in a much deeper way at scale.

Raza Habib [00:38:27]: Yeah, but building those systems, there's so much added complexity. Right. So the Windmill PMs are iterating on, you know, they've got tens or maybe more prompts at this stage, like hundreds of versions of those. Each of those needs to be benchmarked. It's a complicated system. And being able to separate out iterating on the prompts from having that in your code base, where they might be scattered all over, is a very big win.

Demetrios [00:38:51]: I do appreciate that you extract it out of the code base so you're not going through lines and lines of code or just, like, searching for that one prompt.

Raza Habib [00:39:02]: You want to have a very clear picture of exactly what context are you sending to the model. So this is one of the beefs I also have with a lot of the LLM frameworks, like agent frameworks or things like that. And I won't name any specific names, but people will know if they're using them. They all do this to a certain extent.

Demetrios [00:39:18]: Yeah, many people have already talked about this on the Internet.

Raza Habib [00:39:23]: But oftentimes they're hard-coding prompts within the framework. And so you as the end user don't have a clear view of exactly what you're sending the model. Prompt engineering and building with LLMs is not complicated. All you need to really understand is exactly what context the model received, so that you can reason about whether or not it's possible for the model to answer or achieve the task, and then what things you need to change or give it more information on for it to be able to either adjust the tone or get closer to reliability, like edge cases I might not have thought of. Much like communicating with a colleague. But if there's a bunch of hidden stuff that the model's also being told...

Demetrios [00:39:59]: All this random body language.

Raza Habib [00:40:01]: So I think just making it really easy for people to see exactly what is the context that's going to the model, which is the tracing and observability, and also having the prompts separated and centralized so that you don't have them scattered around the code base.

Demetrios [00:40:14]: How often are you seeing folks use variables in their prompts?

Raza Habib [00:40:20]: Oh like all the time.

Demetrios [00:40:21]: That's a common pattern now, right?

Raza Habib [00:40:23]: Yeah. So we talk about prompt templates. When we talk about a prompt in Humanloop, we really mean the choice of the model, the hyperparameters, and then the prompt template, where that template is going to have, in production, variables that are coming either from the user or are going to be filled in from a retrieval system. So you have a template with gaps. You know, people use maybe Mustache-style handlebars or Jinja templating or something like this to specify the prompt template.
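A minimal sketch of such a template with gaps, rendered at request time from the user's input and retrieved context; it assumes the jinja2 package and the variable names are invented for the example:

```python
# Minimal sketch: a prompt template with gaps, filled in at request time from
# user input and retrieved context (Jinja-style, as mentioned above).
from jinja2 import Template

PROMPT_TEMPLATE = Template(
    "You are a helpful assistant for {{ company }} customers.\n"
    "Use only the context below to answer.\n"
    "Context:\n{{ retrieved_chunks }}\n\n"
    "Question: {{ user_question }}"
)

prompt = PROMPT_TEMPLATE.render(
    company="Acme",
    retrieved_chunks="...chunks from the retriever...",
    user_question="How do I reset my password?",
)
print(prompt)
```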

Demetrios [00:40:49]: Because it brings up the other piece that we were talking about earlier, on recommender systems.

Raza Habib [00:40:56]: Yeah, so the context here was we were saying that the closest analogy to building with very subjective use cases before LLMs was probably recommender systems. A lot of machine learning has correct answers, but recommender systems don't: what is the best movie to recommend, or clothing item to recommend? There's not one correct answer. And in that way it's very analogous to building LLM systems, where it's subjective as to what good is.

Demetrios [00:41:22]: Then you were saying, well, now when I want to go for a movie...

Raza Habib [00:41:26]: I 100% go to ChatGPT now.

Demetrios [00:41:29]: Yeah, I know, and I think you're an outlier.

Raza Habib [00:41:31]: I probably am.

Demetrios [00:41:33]: We'll ask the chat and we'll see what people say. Like, listeners can let us know.

Raza Habib [00:41:39]: But if you're not doing this, you should be. So let me make my pitch quickly, which is just that if you're a heavy user of one of these LLM systems, ChatGPT, Claude, whatever it might be, it's learning a lot about you. It knows a lot about your preferences through the memory now. You can also describe to it what you want in a way that is very hard to do with a traditional recsys that just has, you know, how similar are you to these other people? So, you know, you can do weird things.

Raza Habib [00:42:06]: You can be like, hey, I'm in the mood for a political thriller that's like slightly dark.

Demetrios [00:42:11]: In the style of Guy Ritchie.

Raza Habib [00:42:12]: Yeah, in the style of Guy Ritchie. I recently watched a movie that I really liked and I'm like, I'm in the mood for something similar. Like, what else is out there that's like this? Or I really like Christopher Nolan's films; which other directors also play on themes of time travel or whatever it might be? You can't get that from a recsys.

Demetrios [00:42:29]: No. Yeah, I was thinking like, oh, here's the last five films that I really enjoyed.

Raza Habib [00:42:35]: You can do this too.

Demetrios [00:42:35]: Give me. Yeah, give me five more. Or at least give me two, because I'm probably not going to watch all five right now.

Raza Habib [00:42:42]: Yeah. And I don't think this is a replacement for traditional recsys. But it is interesting to see that you can do embeddings that are conditioned on descriptions or text or things like that. So I think these two worlds will collide a bit.

Demetrios [00:42:58]: I think you're an outlier because of the cognitive load that takes. And even though for you, you're sitting there and you're like, no, this is really cool because I can be really creative and you can get a much better recommendation than just scrolling through the horizontal bars on Netflix. But then I was looking at it and I was saying, yeah, but, man, clicking is so much easier than trying to articulate what movie I want. And really, I think that's what it comes down to: if I go on Netflix and I just scroll around for a while and it knows what I have watched already, and it's already bringing me different films, then it's much easier to just go, yeah, all right, this looks good. However, I don't think I am that person either, because usually I'll get my movie recommendations from folks on TikTok. I'll see different people on TikTok break down, like, here's the top 10 of [insert favorite type of movie].

Demetrios [00:44:02]: But for the most part, the recommender systems where you're just clicking or scrolling. If we're going to a recommender system, like, TikTok's got the best one. Right. And so if you extrapolate it out, maybe for a movie. Yeah. But if it was for a social network like TikTok, I'm in the mood for something funny. It has to do with, like, cats and this and that. And it's going to be much more effort to get something where you can just get those signals and scroll.

Raza Habib [00:44:31]: Yeah, I think that's probably true. This isn't something I've given a great deal of thought to, although something that occurs to me just off the top of my head is that one of the hard problems with recommender systems is that you're very often in a small data regime. Right. Like, a new person joins and you know very little about them. And being able to go from a description of that person, or looking some stuff up about them, and then using that to find which people are like them, might be a very interesting way to overcome those cold starts.

Demetrios [00:44:59]: That's a cool way of plugging it in. Yeah.

Raza Habib [00:45:01]: But it's not something I'm personally expert on, so I will avoid having too strong opinions.

Demetrios [00:45:06]: Well, speaking of things you are expert on, as you look forward, what's around the corner for you.

Raza Habib [00:45:11]: So there's really two things.

Demetrios [00:45:12]: Yeah.

Raza Habib [00:45:13]: And one of them we're launching actually, you know, in the next week or two. So maybe by the time this podcast comes out, it'll be, depending on the editing, at least in beta and sort of being used by people. So one of them is we're trying to give much better support for people who are prototyping and building agents. And agent evaluations as well are just much harder than evaluating simpler systems because...

Demetrios [00:45:34]: Of the steps involved.

Raza Habib [00:45:36]: Yeah. Because there's this branching tree of so many things that the agent can do. So having good coverage over what the agent might do is hard. Also, oftentimes to evaluate an agent or even run one, they're interacting with users or other systems, and you have to be able to mock those systems as part of your eval. So if there is a user who's asking questions as part of the agent, if one of the tools the agent has is, like, get information from the user, then to run a good eval you need to be able to mock the user. Or if it's calling third-party APIs, you need to have that integration into your eval system. So evaluating agents is just more complicated. And so we're launching a new agent builder as well as much better tooling around how do I evaluate complicated agents. So today there's, I think, decent support for this in Humanloop, but this will just simplify both the DX and UX and I think be pretty strong there.
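A minimal sketch of the mocking he describes, with invented tool names and a toy agent loop: the "ask the user" tool and a third-party lookup are replaced by scripted fakes so the agent can be evaluated deterministically.

```python
# Minimal sketch: replace the agent's user-facing and third-party tools with
# scripted fakes so an eval run is repeatable.
from typing import Callable

def make_mock_user(scripted_replies: list[str]) -> Callable[[str], str]:
    replies = iter(scripted_replies)
    def ask_user(question: str) -> str:
        return next(replies, "I have no more information.")
    return ask_user

def mock_crm_lookup(customer_id: str) -> dict:
    return {"customer_id": customer_id, "plan": "enterprise", "open_tickets": 2}

def run_agent(task: str, tools: dict[str, Callable]) -> str:
    # Stand-in for a real agent loop: here it just exercises the mocked tools.
    reply = tools["ask_user"]("Which customer is this about?")
    record = tools["crm_lookup"]("cust_42")
    return f"Resolved '{task}' for {record['plan']} customer after reply: {reply}"

tools = {"ask_user": make_mock_user(["It's about invoice #1001"]),
         "crm_lookup": mock_crm_lookup}
print(run_agent("billing question", tools))
```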

Raza Habib [00:46:26]: And then the one I'm very excited about is auto optimization. We're trying to come up with a better name for this, and maybe you can help me out with names. We haven't released this yet, auto optimization, but the premise is people are spending so much time in Humanloop today tweaking prompts, adjusting tool definitions, iterating on these different components. And actually we've done a bunch of research, now in prototype, on having AI systems do a lot of this iteration for you. And because we have access to your evaluation scores and your data sets, we can run that. So once you've set up basically your data set and evals, we can have the AI propose changes to prompts, update parts of the system, rerun the eval, look at the eval report and then, like a human data scientist would, sort of think about, oh, what is the cause of the failure? Make an adjustment and rerun, until the score improves to a certain point. That works better today than... we've benchmarked it against DSPy, we've benchmarked it...

Demetrios [00:47:28]: Against... it has a bit of a DSPy...

Raza Habib [00:47:31]: It's got a DSPy vibe to it, but we're looking at entire pipelines. So not just prompts or...

Demetrios [00:47:38]: Those parts, because they do the auto prompt optimization, but you're saying the entire pipeline. What if it's the model that you need to change out? And so let's test five different models here and see if that gives us any extra lift.

Raza Habib [00:47:51]: And there's quite a lot of complexity to getting it to work well. And then we've benchmarked it against kind of the best things we can find in the academic literature. And you know, we're currently outperforming. So we're going to sort of be running that in beta, starting very soon actually, with a small group of early customers. And then I'm hoping in a few months or a couple of months' time to kind of launch that to GA. And so something that, you know, our top users are spending 10 hours a week or something on, this iteration, now I think we can probably just automate all of that.
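A minimal sketch of what such an auto-optimization loop might look like in outline, not Humanloop's implementation; `propose_revision` and `evaluate_prompt` are hypothetical placeholders: propose a change, rerun the evals, keep the change only if the score improves, and stop at a target.

```python
# Minimal sketch of a propose-evaluate-keep loop driven by eval scores.
def propose_revision(prompt: str, failure_notes: str) -> str:
    raise NotImplementedError  # an LLM call that rewrites the prompt given the failures

def evaluate_prompt(prompt: str) -> tuple[float, str]:
    raise NotImplementedError  # run the eval suite, return (score, notes on failures)

def auto_optimize(prompt: str, target: float = 0.9, max_iters: int = 20) -> str:
    best_prompt = prompt
    best_score, best_notes = evaluate_prompt(best_prompt)
    for _ in range(max_iters):
        if best_score >= target:
            break
        candidate = propose_revision(best_prompt, best_notes)
        score, notes = evaluate_prompt(candidate)
        if score > best_score:                 # keep only changes the evals confirm
            best_prompt, best_score, best_notes = candidate, score, notes
    return best_prompt
```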

Demetrios [00:48:20]: Wow. All of it.

Raza Habib [00:48:22]: A large fraction.

Demetrios [00:48:24]: So it's basically like you're going to more of a declarative way of working. Just say, give me your evals and I'll give you what you need.

Raza Habib [00:48:34]: Give me your evals and that kind of first version of the system that works, 80% accurate or whatever. And a lot of the iteration that you were doing to get from there to the system that's 90% plus, all those hours that Brianna's team at Filevine spent in Humanloop with expensive lawyers, we can maybe do a really good first cut of that, or at least automate a significant fraction of that work. So those two are the sort of new things that we're most excited about.

Demetrios [00:49:02]: Lawyers are expensive, engineers are expensive, PMs aren't cheap.

Raza Habib [00:49:07]: And so especially good ones, especially in AI.

Demetrios [00:49:10]: If you're thinking about, hey, these 10 hours are now opened up for the lawyer to be creating more...

Raza Habib [00:49:18]: Evals or just doing anything else.

Demetrios [00:49:21]: Lawyer work, whatever lawyers do. Yeah.

Raza Habib [00:49:25]: Yeah. So I think that's going to be quite transformative. And I think that we have to build the foundations first. You can't actually, like, someone couldn't show up today and be like, hey, we want to build the optimization system for AI products. Because the first thing they'd have to do is they'd be like, okay, well, what we need is the ability to build evals and run them over data sets and put that in the closed loop of the optimization. I mean, people are doing it. Anthropic has some experimental APIs around prompt improvement. So if you go into the Anthropic console or if you go into their API, they'll take a prompt and they'll rewrite that prompt for you, sort of using best practices for Claude.

Raza Habib [00:50:02]: But it doesn't tell you anything about whether it actually got better for your use case. And I think having this eval suite coupled to the auto optimization gives us a unique advantage over anyone else who's building in the space.

Demetrios [00:50:15]: It sounds a bit magical. And so I want to see it. When you release it, come and show us.

Raza Habib [00:50:21]: I would love to come back. And I mean, what we're going to try and do, the reason why we're launching it in beta first with a small number of customers, is exactly this: we know there will be skepticism because the promise is so big. And so what we would like to do is launch it with concrete case studies to kind of help people get over that initial skepticism. But I know, from having used it in kind of our research work and having a very good intuitive sense of what people are trying to do, I have very high confidence that this will work.

Demetrios [00:50:52]: And so are there things that, whether it is creating a more fine tuned prompt or just having more prompts that you can pull from or having more eval data points, does that add to the success rate or is it like whatever, give us that baseline and we'll add on to it with synthetic data with some evals that we get and we don't even need you around.

Raza Habib [00:51:24]: No, we still need you. So the taste of the end user is not something that we can ever replace. And they express what they want, right, you said, moving to a declarative approach, in two ways: in the definition of the evals and in the feedback they give us. So one of the things that we can do is, much the same way that the human teams today ask the domain expert for feedback in terms of, score these outputs, that's something that we can ask for. And then the auto evals that are already there also guide it.

Raza Habib [00:51:55]: So we cannot... there's no free lunch. We're not magically able to build a better system than you without your input. But what we're changing is that all we need from you, as you say, it's declarative, is for you to express what you want clearly through decisions on feedback. So yes, this one, not that one: preference data, definitions of evaluation, so minimum criteria of what good looks like, and a first version of the system from which we can infer your goal, and then from that we can run.

Demetrios [00:52:25]: So when you tell people about Humanloop, how do you describe it?

Raza Habib [00:52:30]: So we describe Humanloop as the LLM eval platform for enterprises, and we're focused on evals because that tends to be the most foundational piece. It's not the only part of Humanloop. We're really trying to give you all the tools you need to do prompt management, to do observability, to do evaluation, to get a reliable AI system. And I think we're quickly becoming the kind of industry standard for evals. So that's the bit we lead with, because I'd rather be 100% clear and 80% accurate.

