Advancing the Cost-Quality Frontier in Agentic AI // Krista Opsahl-Ong // Agents in Production 2025
SPEAKER

Krista Opsahl-Ong is a Research Engineer at Databricks, where she focuses on bringing agentic AI systems into production for large-scale enterprise workflows. Prior to Databricks, she was a PhD candidate in AI at Stanford, researching compound AI systems, automatic prompt optimization, and enterprise agents. Krista is also an active open-source contributor to DSPy, a framework for building and optimizing agentic LLM systems.
SUMMARY
Enterprises love the promise of AI agents, but most projects stall in an endless loop of manual prompt tweaks, ambiguous evaluations, and ballooning inference costs. Agent Bricks, Databricks' new agent builder platform, solves these pain points by turning a simple task description and data source into a production-ready, domain-specific agent that is optimized for both cost and quality. In this talk, we'll give an overview of what Agent Bricks is, how you can get started with it, and discuss some of the research that is powering it.
TRANSCRIPT
Krista [00:00:00]: Hi everyone, I'm Krista. I'm a research engineer at Databricks working on the Mosaic research team, and today I want to talk with everyone about how we're going about productionizing agents for various enterprises at Databricks, and specifically some of the work that we're doing to push the cost-quality frontier. What I'm hoping you'll be able to learn by the end of this talk, even though we only have 10 minutes together, is, first, an overview of where enterprises are applying AI agents in production today: what are the common use cases? Second, what are some of the common challenges with building and then productionizing agents for these types of use cases? And finally, of course, how to go about building and optimizing agents for both cost and quality. I'll talk about a platform called Agent Bricks that we've been developing to help do this. So, starting with agents in the enterprise today, there are a few common use cases that we see come up a lot across enterprises.
Krista [00:01:10]: The first one sounds deceptively simple, but there are a number of challenges here and a lot of impact to be had: document understanding. You can think of a few different use cases. In one, maybe a company has tens of thousands of invoices that they receive monthly. Normally they need to go through and manually extract relevant information from these invoices to put it into some sort of software like SAP. As you can imagine, this takes many, many hours to do, and it would be great if we could automate the process. Or similarly, imagine an e-commerce platform that has millions of product specs buried in PDFs and other unstructured documents. How can we extract relevant information about those products so that the platform can surface it on its website for customers?
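To make that extraction pattern concrete, here is a minimal sketch of what LLM-based invoice extraction can look like, assuming an OpenAI-compatible endpoint. The schema, model name, and helper function are illustrative only, not Agent Bricks' actual API.

```python
# Minimal sketch of LLM-based invoice extraction (illustrative only;
# not Agent Bricks' actual API). Assumes an OpenAI-compatible endpoint.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SCHEMA_HINT = """Return JSON with keys:
vendor (string), invoice_number (string), date (YYYY-MM-DD),
total_amount (number), currency (string)."""

def extract_invoice_fields(invoice_text: str) -> dict:
    """Pull structured fields out of raw invoice text."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # hypothetical choice; any capable model works
        response_format={"type": "json_object"},
        messages=[
            {"role": "system",
             "content": "You extract fields from invoices. " + SCHEMA_HINT},
            {"role": "user", "content": invoice_text},
        ],
    )
    return json.loads(response.choices[0].message.content)

# e.g. feed each of the tens of thousands of monthly invoices through this
# before loading the extracted rows into downstream software such as SAP.
```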
Krista [00:02:07]: Another common use case is knowledge assistants, and there are a few ways this comes up. You can imagine an internal use case where a firm wants to create some sort of deep research tool that's powered by many of their own internal documents, to help the productivity of the analysts on their team. Or you can imagine an external-facing use case where a company wants to build a chatbot to deploy on their website, and there are very specific brand guidelines the chatbot needs to follow that we somehow need to optimize for. And then of course there are many, many other GenAI use cases people are very excited about, covering everything that AI agents can do. Some of these might be a company wanting to fine-tune a custom LLM for a certain task, like generating good titles for articles, for example. There are so many more use cases here. And then naturally you might want to chain these different components together too, so this is something that also comes up.
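A minimal sketch of the knowledge-assistant pattern: retrieve relevant internal docs, then answer grounded in them. The keyword retriever below is a toy stand-in for a real vector index, and the model name is illustrative.

```python
# Toy sketch of a knowledge assistant: retrieve internal docs, then answer
# grounded in them. The keyword retriever stands in for a real vector index.
from openai import OpenAI

client = OpenAI()

DOCS = [
    "Our refund window is 30 days from delivery.",
    "Support is available weekdays 9am-5pm ET.",
]

def retrieve(query: str, k: int = 2) -> list[str]:
    """Rank docs by naive keyword overlap (stand-in for embedding search)."""
    words = set(query.lower().split())
    scored = sorted(DOCS, key=lambda d: -len(words & set(d.lower().split())))
    return scored[:k]

def answer(query: str) -> str:
    context = "\n".join(retrieve(query))
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[
            {"role": "system",
             "content": "Answer only from the provided context and follow "
                        "brand guidelines.\n" + context},
            {"role": "user", "content": query},
        ],
    )
    return response.choices[0].message.content

print(answer("How long do I have to request a refund?"))
```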
Krista [00:03:05]: All right, so what are some of the common challenges in actually getting these types of agents to production? Unsurprisingly, there are many. One of these challenges is that there are many design decisions that need to be made when you're trying to create a very performant agent, so this ends up involving lots of manual setup and trial and error: trying out different prompts, manually testing different types of architectures, plugging different tools in and out, plugging different types of models in and out, and seeing what works best. This can be pretty time consuming. The other piece here is that naturally we want to be able to optimize our system for both cost and quality. Different types of tasks or use cases will prioritize these differently, but there's always a trade-off between the two. Here, this is a graph that shows cost on the X axis, on a log scale, and quality on the Y axis.
Krista [00:04:08]: And this is from some benchmarks that we did on information extraction with out of the box models with a decently optimized prompt. And here you can see that in order to get decent quality we really need to pay up. Like it becomes more expensive and that's not always feasible for people when cost isn't a factor. And especially for this information extraction use case when you might have really high volumes that you want to do inference over. And so in order to sort of push this curve up into the left here, you need a lot of the latest research. And so not every company can afford to have a full in house research team working on every single problem. So the question is sort of how do we, how do we do this? And then finally, evals are naturally always a challenge. So many of these tasks can be ambiguous.
Krista [00:05:01]: A lot of, a lot of times we don't have labels for what we're trying to do. So this can be hard. And you ultimately need some way to measure quality in order to see if something is production ready and optimize it. So this can be a blocker for folks. All right, so now getting into how we can actually go about building AI agents with this agent bricks platform that I mentioned earlier. So Our goal here when we were working on this platform was to enable enterprises to very easily create and optimize AI agents for both cost and quality. And I'll talk about what this means in a second. So basically the product here is again this platform called Agent Bricks.
Krista [00:05:43]: We have a brick for all the use cases that I mentioned earlier. So information extraction, Knowledge Assistant, this multi agent supervisor that can chain together different tools like MCP servers, genie, which a text to SQL application, or any of these other bricks to create a multi agent system. Then finally custom LLM which is basically tuning an LLM for a particular custom use case. Again, we don't have too much time here, so I'll give a high level overview of how this works and if folks have questions, please reach out afterwards. But there are a few steps to building an agent with agentbricks. The first is that as input we take in some high level specifications. This is a task, whatever task you choose to set up. Maybe it's Knowledge Assistant, maybe it's custom LLM, but this is something the user will choose.
Krista [00:06:41]: They'll provide a description of the agent that they want to build, just a high level natural language description saying what they want. Then finally we'll ask for some basic inputs. For example, if you're building a chatbot and you want it to use custom docs internally, then we'll ask for those. So here's an example of what that looks like for a Knowledge Assistant brick, where here you have a description that you can enter and then you can configure the knowledge sources that you want to use within your like custom conversational agent. All right, so then the first step on the agent brick side that will automatically do for you is create a set of evals. So here's another example from the Knowledge Assistant. But we've created custom LLM judges for sort of all the things people generally care about when it comes to question answering. So here we have completeness, groundedness, relevance, safety and we can generate lots of responses for you and score these using these lm so you can sort of see if the judge is aligned with your expectations.
Krista [00:07:49]: The second step is where sort of a lot of the magic happens. And this is in optimize. So here the way this looks in the UI is just a button saying optimize your agent. And here we'll basically sweep a number of different methods to optimize weights and prompts of your system, many different system configurations, models, etc. With the core goal of improving on cost and quality for your task. And so this is just a handful of some of the methods that might be behind the scenes here. But again we have automatic prompt optimization, fine tuning RL methods like Tao tool choices, model choices, etc. So the idea is that this can kind of push your cost quality curve up into the left as we mentioned earlier.
Krista [00:08:38]: And in practice this is what we see. So again using this benchmark from information extraction that I showed earlier, we see this blue line prior to optimization and then this red line post optimization and you can see that we're able to achieve the same performance for about a 10th of the cost or about the 16th year and then a similar cost if you're very cost sensitive and want the cheapest option possible. Same cost for about 20% increase in quality. All right, so that is it for for my talk. If you're interested in learning more. Definitely. Check out these links here. We have some more docs that go into detail and then this demo that we did a summit recently and also happy to answer any questions offline or in the slack if folks have follow up.
Krista [00:09:28]: Thank you.
Skylar [00:09:31]: Awesome. Yeah, this is great, awesome to see, sort of pushing that Pareto frontier on cost and quality.
Skylar [00:09:44]: I find that that's like often a challenge of like you know, big models, capable but expensive. So this is exactly. Yeah. So I'll just like I think we have a couple minutes so I'll just say again if folks are watching and you have questions for Krista, feel free to drop them in the chat and I'll pass them along. We're getting some hand claps in the chat so.
Krista [00:10:07]: Oh wow, thank you.
Skylar [00:10:13]: Not seeing any questions come through so maybe I'll just kind of take one off the dome. So curious to hear a little bit more about sort of you have this process where like these optimization techniques are applied. Can you say a bit more about like what kinds of optimizations are applied there?
Krista [00:10:33]: Yeah, sure. So there's, there's a lot of things going on behind the scenes and it kind of depends on what our research finds is most useful for the type of task or use case that you're looking to use. But some of the ones here that I can like talk more about. There's automatic prompt optimization which basically applies methods for instead of sort of manual prompt engineering as we typically do today, basically systematically proposing and evaluating another different prompts that could work well for your system to optimize performance. There's also RL methods. So recently our research team released a method called tao, which is basically a label free method. So if you don't provide any labels here, that's totally fine and we can handle that. That basically uses RL to continue adapting your program to continuously improve it over time based on any new queries you might get.
Krista [00:11:30]: Those are a few of the types of methods, but we're always like iterating and that's basically like my whole job is to work on the things behind the scenes here. So we're like making sure all the best and latest research is powering the bricks.
Skylar [00:11:46]: That's awesome. Is just curious because I think there is a relationship between these things. Is any of that powered by like DSP in the background?
Krista [00:11:54]: Yes. DSPI, which I worked on heavily. My PhD is also is definitely sprinkled in here for sure.
Skylar [00:12:03]: Awesome. Awesome. Big fan of DSPy. Cool.
Krista [00:12:06]: Same. I'm glad to hear.
Skylar [00:12:08]: I often actually, you know, if you worked on it, maybe you can answer this question. I often get people asking me, is it DSPY or DSPy?
Krista [00:12:17]: It is DSPy, but as long as you're using it and enjoying it, that's all that matters. So.
Skylar [00:12:24]: Yep. Awesome. Well, thank you so much for your time. This is. This is amazing. So again folks, go check out agent bricks. Awesome platform. Do less work, get better results.
Krista [00:12:36]: Exactly. All right, thanks. Bye.
