
Eval Driven Development: Best Practices and Pitfalls When Building with AI // Raza Habib & Brianna Connelly // AI in Production 2025

Posted Mar 13, 2025 | Views 268
# AI Development
# RAG
# LLM
# HumanLoop

SPEAKERS

Raza Habib
CEO and Co-founder @ Humanloop

Raza is the CEO and Co-founder at Humanloop. He was inspired to work on AI as “the most transformative technology in our lifetimes” after studying under Prof. David MacKay while doing Physics at Cambridge. Before Humanloop, Raza was the founding engineer of Monolith AI, applying AI to mechanical engineering, and has built speech systems at Google AI. He has a Ph.D. in Machine Learning from UCL.

Brianna Connelly
VP of Data Science @ Filevine

SUMMARY

Learn how the best companies use evals to guide AI development and build a virtuous cycle of product improvement! We’ll cover the fundamentals of how to use evaluation-driven development to build reliable applications with Large Language Models (LLMs). Building an AI application, RAG system, or agent involves many design choices: you have to choose between models, design prompts, expose tools to the model, and build knowledge bases for RAG. If you don’t have a good evaluation in place, it’s likely you’ll waste a lot of time making changes without actually improving performance. Post-deployment, evaluation is essential to ensure that changes don’t introduce regressions to your product. Using real-world case studies from AI teams at companies like Gusto, Filevine, and Fundrise, who are building production-grade agents and LLM applications, we’ll cover how to design your evaluators and use them as part of an iterative development process. At the end of the session you should understand the pitfalls and best practices of constructing evaluators in practice, as well as the process of evaluation-driven AI development.


TRANSCRIPT

Click here for the Presentation Slides

Demetrios [00:00:05]: Next up, we've got my good friend Raza and Brianna coming on the stage. I'm so excited for this chat. Let me bring them on the stage so we can say hi. What's up?

Raza Habib [00:00:18]: Hey Demetrios, good to see you. And hello to the MLOps community.

Demetrios [00:00:22]: Yes. So you all have got a talk for us all about eval-driven development. I'm very excited for this because, as I like to say amongst friends, we all kind of know how limited benchmarks are, and hopefully you can tell us a better way to do it.

Raza Habib [00:00:45]: Absolutely.

Demetrios [00:00:48]: I see you've got everything set up here, so I'll let you guys get rocking and we'll keep it moving. I'll be back in 20 minutes with questions.

Raza Habib [00:00:56]: Sounds good. Well, hello to everyone who's online, and thanks very much, Demetrios. So Brianna and I are going to be chatting today a little bit about eval-driven development for AI. I'm the CEO and co-founder of Humanloop, and I'll tell you a little bit about myself in a moment. Brianna is the head of data science at Filevine. I'm going to try and give you a little bit of the lessons that we've learned as the eval platform for enterprises at Humanloop, and then Brianna will jump in midway and take over and talk about doing it in practice, actually building a real-world product at enormous scale. So maybe just a little bit about me to start with, to give the context: why should you be listening to my opinions about how to build AI products at all? I'm the CEO and co-founder of Humanloop. We're the LLM eval platform for enterprises, helping companies who are trying to build reliable AI agents or products to get them into production and to make them good enough and reliable enough for their customers at scale.

Raza Habib [00:01:51]: And before that I did a PhD in probabilistic deep learning, and I worked at Google and Monolith for a while. But critically, over the last two years I've been working really closely with some of the leading companies who are building AI agents and products on the frontier. Filevine being one of them, but also Duolingo and Gusto and Vanta and Vendor and Macmillan and many others. And so what I'm trying to do with the talk today is to summarize some of the lessons that we've learned from working really closely with those customers about what the best teams do to build these amazing AI products, and to share some of those best practices. That's hopefully the core of it. What I would like you to take away is why evals are so important, how to do them well in practice, and also what the lessons of the best teams are. It's really not that complicated.

Raza Habib [00:02:38]: Something that we've seen universally at Humanloop is that all of the best AI teams do three things. The first one, unsurprising given the title of the talk, is that they put evaluation at the center of development, and most of what I talk about will be focused on why that's so important. But there are two other things they do that I think are really critical as well, and Brianna will speak to concrete examples in a moment. One is they spend a lot of time looking at their data. For people who are coming from a machine learning or data science background, that won't be such a crazy idea. It's almost cliche at this point.

Raza Habib [00:03:11]: But what's happened with LLMs and GenAI is that there are a lot of developers coming to AI for the first time from a more traditional software engineering background, and they maybe underestimate how much alpha and benefit there is in spending time looking at the traces of your AI agents, the logs, what's happened, how people are interacting with the system, and just building a qualitative sense of where the failure cases are and where things are strong. So anything you can do to reduce the friction and provide observability tools that work well for your team tends to be extremely high value. And the last thing, which I think is particularly true for GenAI, even more so than any machine learning that came before, is that the role of domain experts throughout the process has become even more important. That's partly because the way these LLM apps are built relies very heavily on prompt engineering and is often very subjective to evaluate. Having the right domain experts as part of the process is often the difference between building a product that's okay and building one that users really, really love. And we've seen pretty consistently that the teams who do these three things tend to produce better AI products and get more success than the ones that don't. Okay, so why are evals so important? Why does Humanloop focus on this so much? Well, fundamentally, LLM agents and applications aren't just the model, right? They're complicated systems. You have the core LLM, but then you also have, if you're building RAG systems, the data source and the way you chunk the documents; if you're building agents, you have the choice of all the different tools; you have the actual prompt template; and maybe there's some control flow that's governing how the agent works.

Raza Habib [00:04:42]: It's not crazy complicated, but there are enough design decisions that if you don't have a good way of measuring performance, it becomes very hard to actually make the system robust or to improve it over time. You're trying to choose: which of the models should I use? Do I use OpenAI? Do I use Claude? Do I use something open source? What should the prompt template be? What's the right way to construct the context? What are my tool definitions? And because every time you run these systems you get different answers, and because it's subjective to evaluate, without good measurement in place, without good evaluation, it's very easy as a team to spin your wheels and not make any progress on actually improving these things. The other reason evals are so important is that with GenAI development, even more so than normal software or machine learning development, the speed of iteration is really what determines your success. When you're building software in general, you will often have a very clearly defined spec and you'll work against that. Whereas with GenAI applications, the teams who do best build an end-to-end version, an MVP version of their product, very quickly, and then they iterate on all of those components that we saw in that diagram before: they iterate on the prompt templates, on the tool definitions, on how they do their chunking for RAG.

Raza Habib [00:05:54]: And after each iteration you need to get quantitative feedback on: did I actually improve this or not? If you're going to actually improve the system, giving you the right tools to iterate quickly and get feedback in a development loop that involves evals is the core of what we've been trying to help teams do and what we think is critical to success. And so we call it eval-driven development, and evaluation really shows up in three different places when building these LLM products. Initially, when you're first starting off, it's probably just intuitive, interactive evaluation: the first thing you're going to do is create a prompt template, create a first version of the system, and just validate, as the domain expert looking at outputs, is this good enough? Does it work? But you very quickly want to move from that vibe-based analysis to something much more concrete, which is offline, batch-based evaluation, where you have a dataset and a systematic set of evaluators, and for every major change you're making to the system, you want to be able to get quantitative feedback. Then finally, once you're in production, you still want to be monitoring the same types of metrics that you set up during development. The process we think about with eval-driven development is that you make a change to the system with some hypothesis about how you're going to improve it, and then you get quantitative feedback from these evaluators on whether it worked.
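
To make the offline, batch-based step concrete, here is a minimal sketch in plain Python of running every change against the same dataset and comparing aggregate scores. The `generate` function, evaluator list, and dataset shape are hypothetical placeholders for illustration, not Humanloop's API.

```python
# Minimal sketch of offline batch evaluation: score one version of the
# system over a fixed dataset so successive changes can be compared.

def run_batch_eval(generate, evaluators, dataset):
    """generate(inputs) -> output; evaluators is a list of (name, fn) pairs,
    where fn(output, example) returns a numeric or boolean score."""
    scores = {name: [] for name, _ in evaluators}
    for example in dataset:
        output = generate(example["inputs"])
        for name, evaluator in evaluators:
            scores[name].append(evaluator(output, example))
    # Aggregate per evaluator so two prompt/model versions can be compared.
    return {name: sum(vals) / len(vals) for name, vals in scores.items()}

# e.g. report_v1 = run_batch_eval(generate_v1, evaluators, dataset)
#      report_v2 = run_batch_eval(generate_v2, evaluators, dataset)
```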

Raza Habib [00:07:17]: And you repeat that process very quickly until you've got the system to be sufficiently good to deploy in production, and then post-deployment you continue monitoring it. Evals, fundamentally, are analogous to unit and integration tests, but for stochastic software. Similarly to traditional testing, where you'd have unit tests that run very frequently and are cheap to run, and then integration and end-to-end tests that are more complex and maybe not run quite as often, there are roughly three levels of evaluation for LLM systems. The simplest is code-based evaluators that check things like valid JSON or length; they're cheap to run and very fast. Then there are LLM-based judges that can be run frequently and at scale, but maybe are not quite as good as a human. And the equivalent of end-to-end tests would be human feedback. The teams we work with, almost all of them, have all three types of evaluators.
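
As a concrete illustration of the cheapest tier, the code-based checks mentioned above (valid JSON, length) can be a few lines of Python; this is a generic sketch, not tied to any particular eval framework.

```python
import json

def is_valid_json(output: str, example=None) -> bool:
    """Code-based evaluator: does the model output parse as JSON?"""
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False

def within_length(output: str, example=None, max_chars: int = 2000) -> bool:
    """Code-based evaluator: is the output within a length budget?"""
    return len(output) <= max_chars
```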

Raza Habib [00:08:13]: They have simple code-based assertions, they gather human feedback, and then they use that to align LLMs as judges to actually scale that evaluation in production. I have some examples here, but I'm going to jump past them to summarize the lessons that we've learned, so that I can hand over to Brianna to talk about what this looks like in practice. If I were to summarize the best practices we've seen from teams building with AI, the most important one, the number one at the top, would be to have good evals in place from the start and iterate against them. The next would be to look at your data a lot and make it easy to do so. A lot of teams log everything, but it's not necessarily accessible to their PMs, their data scientists, or the engineers who are actually building the product, so make those logs very close to where prompt engineering is happening, to where the actual development work is happening. Build the first version of the system as quickly as possible and then iterate alongside evals. Make sure that you have simple code assertions first, and then layer on LLM-as-judge and human feedback afterwards. And as far as possible, you want to be using the human feedback to align those LLMs as judges rather than relying on the human feedback alone, because that's much more scalable. If you're interested in learning more about this, we have a lot of blog posts on the Humanloop website, with more detailed guides on evaluation.
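
One simple way to check that an LLM judge is aligned with human feedback before trusting it at scale is to measure agreement on a human-labelled sample. The sketch below is illustrative; the log fields and `llm_judge` function are assumptions, not a specific product feature.

```python
# Sketch: compare LLM-judge verdicts against human labels on a sample of logs.

def judge_agreement(labelled_logs, llm_judge):
    """labelled_logs: [{'output': ..., 'human_label': 'good' | 'bad'}, ...]
    llm_judge(output) -> 'good' | 'bad' (hypothetical judge function)."""
    matches = sum(
        1 for log in labelled_logs
        if llm_judge(log["output"]) == log["human_label"]
    )
    return matches / len(labelled_logs)

# If agreement is low, refine the judge prompt or its rubric and re-check
# before using the judge as a stand-in for human review in production.
```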

Raza Habib [00:09:34]: But now I want to hand over to Brianna, who's actually been doing this in practice, to talk about how Filevine's been building their AI products and how they use Humanloop and evals to make them reliable. Over to you, Brianna. Awesome.

Brianna Connelly [00:09:48]: Hey MLOps, I've been a member for the last few years and I'm really excited to be talking to you. So I'm going to share what Filevine is doing as a leader in the AI and legal space. Filevine is case management software where we work with our clients, law firms, from the beginning of a case, when the firm does intake, to the settlement or the verdict on a case. Everything that happens in between, Filevine holds: calendars, notes, emails, text messages, all the documents, all the evidence. Everything that's happening, we sit on top of as a platform, so we have an enormous unstructured data set that we work with. We were very early to adopt AI in our workflows, and that's for a variety of reasons. Our clients really expect us to be able to amplify their workflow and provide this wonderful experience so that they can get to the itemized details and the pieces that are going to really make their case strong. Remember, law firms' clients are people who are sometimes going through the worst times of their lives when they come to work with a law firm.

Brianna Connelly [00:10:53]: So they want to feel comfortable and content; they want to know that their case is managed and handled. And then, Raza, can you advance, or do you want me to? Thank you. Okay, so really, one of our original AI products was AI Fields. This is still living and breathing; it's one of our most used AI features. This product sits on top of that vast document data store, and what we do there is we process and leverage generative AI and prompts to do extraction and orchestration on top of that huge unstructured store. With Humanloop, we're actually processing over 1.5 million chat requests.

Brianna Connelly [00:11:33]: So the times that users come and try to extract data or interact with these documents: over 360,000 documents went through this process in the last month, and we processed through Humanloop 25 billion tokens of input and output. We allow clients to leverage and build on top of this robust prompt library, and this is where evals become non-negotiable. We are doing such a volume for our clients that we cannot react to whack-a-mole feedback. We have to know that every single change we make is going to enhance the feature and that it doesn't cause harm for our clients. Right. The other thing, and if you could advance again for me.

Brianna Connelly [00:12:17]: Thank you. The other thing here at Filevine is we have a huge collaboration point with domain experts. As Raza said, it's not just about the prompt, right? It's about that whole workflow leading up to the process, and then after the process, of running a prompt or running a gen AI process. And so we have non-technical subject matter experts: existing attorneys, former attorneys, paralegals, users, domain experts that are all working through this iteration process. They start with an idea, and ultimately we work with technical stakeholders who have the ability to use the API features of Humanloop on the back end to create data sets and then start to validate that these workflows are working. And to Raza's earlier point, and I'll show you, we can actually go through experiments with different variations of each one of those configuration points so we can know what's going to be the best result for our client. Whether it's speed, latency, or cost, we can make all of those decisions comprehensively.

Brianna Connelly [00:13:22]: And we also allow our external Filevine clients to interact. They are able to do user-defined prompts, so they need to be supported in that. Right. Our law firms don't have a lot of time. A lot of these law firms don't have the ability to spin up an AI group to focus on generative AI and best practices. They're working cases, they have large caseloads, they don't have time to go read industry best practices on prompting or prompting techniques. They want to be able to get into Filevine and interact with their data in this chat, generative process without having to know how they're doing it or what they need to invoke or use. And then ultimately something like AI Fields, which I talked about earlier, becomes the backbone for other products.

Brianna Connelly [00:14:06]: So not only can you extract a lot of detail from your document or your case, now you want to be able to leverage that. You want to draft a demand to an insurance company. You want to draft a motion. You want to move that data along so that it can automate your workflow. And so really we have several extraction workflows and chat workflows. We have more than this, but this is a nice example. On the top, we're inputting document data, we're running a production prompt that we ran evals on, we have a great benchmark or standard on that, and then we have the output data.

Brianna Connelly [00:14:42]: That could be it: somebody could want to leverage and use that, like a medical chronology from thousands of pages of a medical history. And then they can also pass it along to a draft, in terms of something they're creating, whether it's a demand or another document. And then we also have a chat workflow where you can interact with all of your content and documents and then continue to move that along for automation. So it's really important that our foundation is strong, and evals are the only way we cement that and make sure that's true. Otherwise, as you all know, whether you've been in a data background or not: garbage in, garbage out. If we didn't do a good job with the initial extractions or the initial data setup, everything downstream is going to be a total mess, it'll erode trust with clients, and ultimately they're going to have to redo it.

Brianna Connelly [00:15:30]: Next slide, please. Thank you. All right, so this is an example. I have some screenshots from, pardon me, the Humanloop UI. This is where we built a classification model. As you all know, language models are said to be really great at classifying, right? And generally I believe this, but we had to go prove it with evals.

Brianna Connelly [00:15:50]: So what you're seeing here is all of those purple icons are different configurations of what we fed for token size. In this case, we fed samples of documents from 4k to 100k tokens, and we did this on 6,000 documents. What we ended up doing is creating a data set that had those different ETL or prep processes: we're going to feed it 4k tokens, we're going to feed it 100k tokens. We did that because sometimes lawyers like to put a lot in one PDF, and we needed to understand how many documents we're going to be able to identify and classify within that.
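
As a rough sketch of how those token-budget variants of a document set might be prepared, the snippet below truncates each document to a fixed number of tokens. The use of `tiktoken` and the specific budgets are assumptions for illustration, not Filevine's actual pipeline.

```python
import tiktoken  # assumed tokenizer; any encoder with encode/decode works

enc = tiktoken.get_encoding("cl100k_base")

def truncate_to_budget(text: str, max_tokens: int) -> str:
    """Keep only the first max_tokens tokens of a document."""
    tokens = enc.encode(text)
    return enc.decode(tokens[:max_tokens])

# Build one dataset variant per token budget from the same documents, so each
# configuration is scored against the same validation labels.
BUDGETS = [4_000, 8_000, 16_000, 32_000, 64_000, 100_000]

def build_variants(documents):
    return {b: [truncate_to_budget(doc, b) for doc in documents] for b in BUDGETS}
```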

Brianna Connelly [00:16:30]: And where there's kind of an elbow point of "we don't need to process more than X tokens." So we set that up. We also, of course, if you've been in data science or data, you have to have a validation set, right? How did you get the right answer and know that it's true? And if you can go back, yeah, thank you. Then really what we're able to do is run evaluations and know how well the model is performing. So on the bottom left, what you're seeing is all of our experiments, and we can actually tweak each one of those pieces that Raza was talking about.

Brianna Connelly [00:17:03]: The inputs: we can tweak the prompt, we can know exactly what changes we made on the prompt, like for instance a confidence score generation metric, and then what impact that had on the prompt output itself. And then we're using those little orange icons, which are the LLM judge evaluators. Those are really directional; it's not as tight as precision/recall, but it's a really great indicator of how well it's generally performing. And then we are also, within Humanloop, able to use Python to grab a true precision/recall metric. The big "so what" here is that when we have evals in place and we have the tools to use them, for both non-technical and technical stakeholders, we can iterate really quickly. So what began as "hey, we need to classify about 60 categories from documents," we were able to do very quickly.
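
The "true precision/recall" step can be as simple as scoring predicted categories against the validation set with scikit-learn; this is a generic sketch, with illustrative labels rather than real Filevine data.

```python
from sklearn.metrics import precision_recall_fscore_support

# y_true: ground-truth document categories from the validation set.
# y_pred: categories produced by the classification prompt for the same documents.
y_true = ["medical_record", "police_report", "invoice", "medical_record"]
y_pred = ["medical_record", "police_report", "medical_record", "medical_record"]

precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0
)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```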

Brianna Connelly [00:17:59]: Our precision and recall are in the high 80s, and for most documents it's actually in the 90s, which is really great. And then we were able to jump that to 160 categories with the same high 80s, and mostly in the 90s for some of those classes. We were able to scale that up really quickly and validate it, so that now we can release a document classification product that other automations and workflows can be built on with high confidence. In the future we can continue to add classes to this and know if we're causing harm. So that's really the power of doing this. You can advance to the next slide and I'll kind of wrap up here, but this is the power of being able to go through eval processes. We use Humanloop because it brings together the subject matter experts and the technical folks, and it creates a safe space for us to really work with our data and, as Raza said, look at it, work with it, and enhance it. When we started in the early days in 2023, when the recent generation of all the new generative AI models came out, we started like a lot of people do, and I believe a lot of people are still in this pattern: we started with hard-coded prompts in code, very hard to change, update, and know what changes had impacts. Subject matter experts were working out of Docs or other familiar places because they're not familiar with GitLab or GitHub.

Brianna Connelly [00:19:24]: It was very vibe check, right? That initial step for stakeholders and subject matter experts was like, "I tried a few, this feels good, but I don't know what it's like at scale." And then it was hard to respond to and diagnose client feedback; it was a lot of digging and pulling teeth to get that information. And ultimately, without something like Humanloop, you have to build every API pipeline to every language model. So if you want to test out a different foundational model or a different company or your own model, it became incredibly cumbersome to justify the cost to go build just to test. Now, since we've been using Humanloop for well over a year and have been having a great time with it, all of our prompts are version controlled. We have dev, staging, and deployment environments, which dovetail very nicely with all of our engineering rigor.

Brianna Connelly [00:20:20]: We have evaluation sets, so for every prompt that we release, we can actually see what it is enhancing. Can we test out new models? What's the precision and recall on this? We do have to continue to go with that foundational validation set. We have searchable logs, which is awesome, so you can actually go see exactly what's happening for specific clients, what they saw, and what the issues were. And then through clicks, literally clicks, I can go click a different model, run it, and see how well it's performing. So it's been a game changer, and it allows us to move quickly, have high confidence in what we're making, and ultimately our clients experience that. Right.

Brianna Connelly [00:21:01]: We're able to just get to it a lot faster. And, you know, earlier it was said that you don't want to do harm. This allows you to really be intentional and methodical with your enhancements and your releases so that you're not kind of crossing your fingers. And yes, as was said, you have to create your own benchmarking and your own utility. You can try to rely on external benchmarks, but we all know they're not going to be an actual applicable fit when somebody goes to use it. So that's how we're using it, and thanks, all.

Demetrios [00:21:36]: Excellent. Well, the chat has, I think, unanimously decided that was fantastic, and so there's a bunch of questions that are going to be coming through. Can we just keep this slide up? Because this before and after is really cool. And Brianna, thank you for coming on here and talking about the journey with Humanloop and how you're doing it. I want to know: did you have certain statistics that you were able to take to leadership that say, hey, this is awesome, and this is why?

Brianna Connelly [00:22:20]: Oh, yeah, for sure. I mean, we got a huge boost from that, like circa 2023, to this new workflow. Basically within months of adopting Humanloop, we went from prompts having like 50% completeness, because it's really hard to get all the details without knowing all that config setup too, to most prompts being in the high 80s and 90s. And these are highly, highly detailed prompts; it's not like summaries. So yes, we definitely were.

Brianna Connelly [00:22:53]: And then clients, of course, experience the goodness of all that as well. So yeah, we were easily able to justify why we did all this.

Demetrios [00:23:02]: Nice. Did you ever get to a point where it was like, hey, there are product metrics that are starting to show also? Like you're saying, if clients are feeling that goodness, is it that they're spending more time, or are there certain key metrics in the product that you can point to also?

Brianna Connelly [00:23:20]: Yeah, for sure. I mean, usage month over month continues to grow, with people just leveraging that one product or feature that I showed, where they're using it as kind of their extraction hub. And then we see more engagement in the downstream products that connect to that. So now you can create documents on top of those data inputs. We definitely see more usage in those existing features, and then other downstream dependencies get more and more folks using them and leveraging them, because it just becomes easier when you trust the foundation.

Raza Habib [00:23:53]: Yeah, I'd maybe just add there, Demetrios, that we spent a lot of time today talking about the process during development, but actually a lot of the evaluation continues post-production as well. You can be getting those product metrics, like usage metrics, back into Humanloop as well to monitor the impact of your changes. You can also A/B test things in production, not just in development.

Brianna Connelly [00:24:14]: Yeah, 100%.

Demetrios [00:24:16]: Go ahead, Brianna. Sorry.

Brianna Connelly [00:24:18]: Yeah, we can actually look at troublemakers in terms of, oh, that got a poor response. You can do thumbs up, thumbs down in Humanloop, so it's easy to peel off the things that didn't work like we thought they would and ask, what do we need to do here? So it's really easy to identify that.

Demetrios [00:24:35]: Well, the reason this is so fascinating for me is because a lot of times I hear folks asking how to justify the spend. AI can get really expensive, and if you're just throwing it out there and you don't really understand whether it's making your end user's life better, a la Meta AI, right, it's just kind of like, why is this here? Some product manager was forced to put AI into the product, and now I never want to use it and it's just getting in the way. But it feels like you're able to not only attach the evals, but attach the product stuff too. Is that a safe assumption?

Brianna Connelly [00:25:21]: 100%. Yeah. We really early on SKU'd this whole product, so it wasn't this ambiguous "I hope they use it," right? We sold it, and people did buy it, because there's a huge automation and value add and efficiency add for law firms. And so, yes, we originally did that, and then we continue to layer experience onto it so that people can kind of walk through the entire lifecycle of a case with that efficiency behind them.

Brianna Connelly [00:25:49]: So we intentionally build that way so that you're not stuck where, you know, it might be automated over here, but now you're manually doing this drafting piece, right? You're able to connect those two elements. And so we continue to do that, because our goal isn't to replace lawyers or replace legal staff. Our goal is to supplement them and make them an X factor of themselves so they can be and do all the great things that they need to do. They have an incredibly hard job, so let's automate the things that we can for them.

Demetrios [00:26:20]: Excellent. All right, there's a question from Stefan in the chat. How granular should prompts be? How do you define it? Like one action or task per prompt or something else?

Brianna Connelly [00:26:32]: Yeah, great question. I would say, honestly, crappy answer, but it depends. Really, how we build is: what am I looking for, right? So in the case of an extraction prompt, like medical chronologies, where we get thousands of pages of a medical history and we want to build a full, comprehensive chronology of somebody's medical history with dates, visits, physicians, we say, this is what the right answer is. We start with that in mind. It may change as we develop, but we have to be intentional about it.

Brianna Connelly [00:27:04]: So you can start as granular as you want, but ultimately, what is it you're trying to do? And then you can kind of go back and forth through the eval to see if you're hitting the mark. So if you have too many tasks and it's just not doing well, you can easily see that, split it up, and start to isolate and identify how you can work through it and get the accuracy on the entire process. So it depends, but start with the end in mind, and then you can really easily iterate to make sure you're hitting it.

Demetrios [00:27:33]: A lot of times I will hear folks saying that, especially if they're relying on one provider and they have that working well, but there's some black magic that goes on with the provider and they update something behind the scenes, and then all of a sudden you have to go and re-look at your prompts. Are you seeing that? Has that been something where, without any warning, it's just like, hey, we're getting alerts, maybe we have to re-architect our prompts?

Brianna Connelly [00:28:05]: Yeah. So obviously we try to be intentional about updating prompts and not so reactive, responding to changes or issues, but yes, of course we see it. We have the LLM judge and evaluator in place to catch that, so we're not getting inundated with client complaints. What we also try to do is strive for templating and making things consistent, so we have kind of prompts-as-code, and we're working more and more towards that, instead of having to go figure out what we need to change language on or update verbiage on across all those production prompts. So we try to be as concise and templatized as possible, so if you update in one place, it's going to be received in other places immediately.

Brianna Connelly [00:28:52]: Humanloop lets you do that through snippets and tools, so you can just update it once and it'll be in 50 prompts if it was included in them. So yeah, we try to be very proactive, but when we do have to be reactive, we can at least really easily isolate what happened and update it quickly.
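
The snippet idea is essentially shared template fragments: edit the fragment once, and every prompt that includes it picks up the change on the next render. A minimal sketch of that pattern in plain Python (not the Humanloop feature itself), with made-up snippet and prompt names:

```python
# Shared snippets: update once, and every prompt template that references
# them picks up the change when rendered.
SNIPPETS = {
    "citation_rules": "Cite the source document and page for every extracted fact.",
    "tone": "Respond in plain, professional language suitable for legal staff.",
}

PROMPTS = {
    "chronology": "Build a medical chronology from the documents.\n{citation_rules}\n{tone}",
    "demand_draft": "Draft a demand letter from the extracted facts.\n{citation_rules}\n{tone}",
}

def render(prompt_name: str) -> str:
    return PROMPTS[prompt_name].format(**SNIPPETS)
```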

Demetrios [00:29:10]: Excellent. And what kind of metrics are you using in your eval framework for specific terminology in regulated industries?

Brianna Connelly [00:29:24]: Yeah, that's a great question. So I think from a data security perspective, we're honoring SOC 2 Type 2 and HIPAA; we have an enormous amount of due diligence that we have to do, so we honor all of that on that side. And then in terms of what we're doing from a standards and practice-area perspective, really a lot of this is so manually done that we're competing with spreadsheets, we're competing with PDFs, Adobe, and Ctrl+F. And so that's where we're really starting to create those benchmarks. This is a new workflow for a lot of folks; we're kind of establishing and creating this. But we do have to honor and maintain all of the requirements, obviously, that our clients trust us with, since we have such sensitive data. So I hope that answered the question.

Brianna Connelly [00:30:10]: Let me know if it didn't, or ping me if you want more.

Demetrios [00:30:13]: Yes, there are a few more questions here. Raza, are there any issues with getting stuck in local optima with the eval approach? If so, how can you combat this? Start with diverse prompts?

Raza Habib [00:30:26]: Yeah, I guess there are two things I would say to that. One is that the evals typically co-evolve alongside the product and the prompt development. Usually the way we see people working is they'll build the MVP of the system, run it over a dataset of test cases, and then look at it and notice common failure modes, places the models are strong, places the models are weak. Sometimes they'll be surprised that the models can do things they didn't know they could in terms of capabilities. And so it's usually not that the evals get written all at once and then people just iterate against them. Usually they have an initial set of evals and then they're adding new evals over time, or they'll notice data points in production that are examples of edge cases, and they'll add those to an eval to make sure the model stays good at them.

Raza Habib [00:31:12]: And so the reason I think they don't end up in local optima is that they're actually evolving the evaluation metrics alongside the prompts. And then the second thing I would say is that typically people aren't starting with a diverse set of prompts. But what we're starting to do, and what I think we're going to be launching in the near future, is actually auto-optimizing some of these prompts for you. You can set up your first version, but once you have the evals and logs in place, we will actually do the search over the space of possible prompts for you and proactively suggest how to improve it. That's going to be launching in beta soon, and if people are interested, just watch our Twitter and LinkedIn. We'll start taking signups for that in the next couple of weeks, in fact.

Demetrios [00:31:54]: That sounds incredible. That reminds me of DSPy-type stuff. And how do you get out?

Raza Habib [00:32:00]: Conceptually very similar.

Demetrios [00:32:01]: Yeah, yeah. It's like prompting, and there's folks that are saying it in the chat: it's very much like we're trying to figure out the best way to prompt, and it's not really a science, it's not really clear. I guess it is very much exploratory right now.

Raza Habib [00:32:21]: There are two parts to it, I would say. One is things that look like tricks or hacks, or kind of working around the deficiencies of the models: how do I actually communicate with the model? And I think that is best optimized away, or will just be improved by the providers themselves. But then there's the second part of clearly specifying what it is you actually want the system to do, and that's why domain experts end up being really important. And I think that second form of prompt engineering, of having someone who is a domain expert articulate what good looks like and exactly what you're trying to achieve, I don't think that's going to go away, and I think that's where the role of prompt engineering will stay and tend towards.

Demetrios [00:32:59]: Excellent. Well, this has been very cool. I appreciate both of you coming on here, and as always, if anyone wants to continue the conversation, you are both on LinkedIn. And really, thanks again. Huge thanks, Raza, for Humanloop sponsoring this and making it happen. It warms my heart. And you can't get away from me; I'm going to see you in San Francisco soon enough.

Raza Habib [00:33:27]: Yeah, see you in a couple of weeks. Thanks very much, Demetrios, and thanks to the community.
