
Operationalizing AI Agents: From Experimentation to Production // Databricks Roundtable

Posted Mar 30, 2026
# MLflow
# Databricks
# AI Agents
# GenAI

Speakers

Samraj Moorjani
Software Engineer @ Databricks

Samraj Moorjani is a software engineer working on the Agent Quality team. Previously, Samraj worked at Meta on ads/product classification research and AppLovin on MLOps. Samraj graduated with a BS+MS in Computer Science from UIUC, advised by Professor Hari Sundaram, where he worked on controllable natural language generation, producing appealing and interpretable science to combat the spread of misinformation. He also worked with Professor Wen-mei Hwu on speeding up LLM inference with extreme sparsification.

Apurva Misra
AI Consultant @ Sentick

Apurva Misra is an AI Consultant at Sentick, focusing on assisting startups with their AI strategy and building solutions. She leverages her extensive experience in machine learning and a Master's degree from the University of Waterloo, where her research bridged driving and machine learning, to offer valuable insights. Apurva's keen interest in the startup world fuels her passion for helping emerging companies incorporate AI effectively. In her free time, she is learning Spanish, and she also enjoys exploring hidden gem eateries, always eager to hear about new favourite spots!

Ben Epstein
Co-Founder & CTO @ GrottoAI

Ben was the machine learning lead for Splice Machine, leading the development of their MLOps platform and Feature Store. He is now the Co-founder and CTO at GrottoAI, focused on supercharging multifamily teams and reducing vacancy loss with AI-powered guidance for leasing and renewals. Ben also works as an adjunct professor at Washington University in St. Louis, teaching concepts in cloud computing and big data analytics.

Adam Becker
IRL @ MLOps Community

I'm a tech entrepreneur and I spent the last decade founding companies that drive societal change.

I am now building Deep Matter, a startup still in stealth mode...

I was most recently building Telepath, the world's most developer-friendly machine learning platform. Throughout my previous projects, I had learned that building machine learning powered applications is hard - especially hard when you don't have a background in data science. I believe that this is choking innovation, especially in industries that can't support large data teams.

For example, I previously co-founded Call Time AI, where we used Artificial Intelligence to assemble and study the largest database of political contributions. The company powered progressive campaigns from school board to the Presidency. As of October 2020, we helped Democrats raise tens of millions of dollars. In April of 2021, we sold Call Time to Political Data Inc. Our success, in large part, is due to our ability to productionize machine learning.

I believe that knowledge is unbounded, and that everything that is not forbidden by laws of nature is achievable, given the right knowledge. This holds immense promise for the future of intelligence and therefore for the future of well-being. I believe that the process of mining knowledge should be done honestly and responsibly, and that wielding it should be done with care. I co-founded Telepath to give more tools to more people to access more knowledge.

I'm fascinated by the relationship between technology, science and history. I graduated from UC Berkeley with degrees in Astrophysics and Classics and have published several papers on those topics. I was previously a researcher at the Getty Villa where I wrote about Ancient Greek math and at the Weizmann Institute, where I researched supernovae.

I currently live in New York City. I enjoy advising startups, thinking about how they can make for an excellent vehicle for addressing the Israeli-Palestinian conflict, and hearing from random folks who stumble on my LinkedIn profile. Reach out, friend!


SUMMARY

This panel discusses the real-world challenges of deploying AI agents at scale. The conversation explores technical and operational barriers that slow production adoption, including reliability, cost, governance, and security.

The panelists also examine how LLMOps, AIOps, and AgentOps differ from traditional MLOps, and why new approaches are required for generative and agent-based systems.

Finally, experts define success criteria for GenAI frameworks, with a focus on robust evaluation, observability, and continuous monitoring across development and staging environments.


TRANSCRIPT

Adam Becker: [00:00:00] So, and the reason that I wanted to zoom out for a little bit, uh, as we're going live, by the way, if you're joining us, very good to have you. It's that I like, I understand that we're gonna dive really deep into how to actually productionize agents, put them in production, do it safely. There's a lot of challenges and a lot of failure modes, and I get that.

Adam Becker: And this is the purpose of this conversation. And so I'm very glad to have my panelists here today. So thank you guys for joining. We're gonna do a round of introductions in a minute, but I do wanna just plant a flag here, which is, I don't know what you guys are feeling and I'd like to get your sort of sentiment here, but from my point of view, I am, my mind is blown every single day by these agents and by our ability to move.

Adam Becker: So, I mean, just the, I feel like software engineering is just, well, we've had it for decades, and the way [00:01:00] I program now just looks completely different from two years ago, a year ago. I just don't understand. It's like it blows my mind and I'm amazed by this every single day, uh, and what it is that I'm coding is changing and how I'm coding is changing.

Adam Becker: And every single thing about this is just blowing my mind. Uh, and we're just, we're able to move so fast, and I just wanna at least share that sentiment because you guys might be feeling it too. And I understand that when you're actually doing it in production, yes, we're gonna have challenges and yes, it's, it might be embarrassing if it goes wrong or whatever, but every single day now, I tell my partner, I'm like, look at what this agent did right now.

Adam Becker: Can you believe she said this right? Like, I just, and at some point she's like, Adam, I'm done with you and your agents. I mean, just gimme a break. You're just so excited about them and I am very excited about them. Uh, but. For this hour, we're going to face reality, uh, and we're gonna see how to actually make sure that they're not doing us damage, despite all of the excitement.

Adam Becker: So thank you very much for [00:02:00] joining today. I am Adam Becker and I'm your host for today. I'm gonna be the moderator of this panel for the MLOps community. Uh, so today with me we have Samraj, Ben, and Apurva. Thank you all. How about we start? Maybe you can give just like, uh, a sentence introduction to you, your name, your role, your title, and perhaps what aspect of agent deployment has been, uh, keeping you most busy recently.

Adam Becker: Uh, Ben, do you wanna start?

Ben Epstein: Yeah, sure thing. Uh, thanks everybody for, for coming on. Um, my name is Ben. I am, uh, also a volunteer at the MLOps community, but I am the co-founder and CTO of a company called Grotto. Uh, we're a vertical AI company in the multifamily space. So active users and buyers of all of the agentic platforms and systems and GPU, uh, providers that you could think of, um, constantly evaluating them and seeing what's best for our team.

Ben Epstein: I would say the biggest impact that agents have had for the [00:03:00] company, besides just, like, making the thing that we build possible (like, it wouldn't be possible without them), but besides that, uh, is the internal agents that we have at our company that exist in our Slack channels, um, that have given

Ben Epstein: Pretty much everybody at the company, from engineering to operations, to marketing, access to essentially all of the data and the context across the system. And if you think about what was happening in, like, even 2025, 2024, where a data person or a product person or a marketing person would have to go ask an engineer to do a query against BigQuery to get you some data, like, that doesn't exist for us.

Ben Epstein: It's kind of crazy. Like all of our data, we have backups, our agents have access to those backups, and people can authenticate through our internal Google account into those backups. And we'll see like 9,000 plus message threads with our agents where they're just analyzing the data, producing spreadsheets, like giving those spreadsheets to customers.

Ben Epstein: Um, I mean, it's, it's unbelievable. Like we have a team of [00:04:00] six doing what feels like it used to take 40. Like, it's nuts.

Adam Becker: 40 and probably years to build.

Ben Epstein: Yeah. I mean, it's crazy. Like, Slack used to give you the ability to see the breakdown of who was tagged the most that week. And I want, I can't find that feature anymore.

Ben Epstein: 'cause Slack changes every three days, but like, I would love to see it, 'cause I'm pretty sure our agents are tagged probably eight to 10 x more than any individual person,

Adam Becker: the, the most prolific employee at this point. Uh, nice. Apurva, do you wanna go next?

Apurva Misra: Yeah, sure. Uh, hi, my name is Apurva Misra. Um, I'm a founder of a company called Sentick.

Apurva Misra: We actually do a lot of consulting work with startups and smaller companies, um, building their AI strategy and building solutions for them. And we are also in the education space. We are doing a lot of, like, workshops and trainings for companies. Um, so, uh, currently I'm working, uh, with the companies.

Apurva Misra: Actually most of our work is focused on internal workflows, like Ben was saying, [00:05:00] like, um, helping in, like, marketing; like, AI is applicable everywhere, um, in all of the different domains that a company has. So, um, a lot of work is with, like, QBR deck generation, like automating the existing workflows, which were not efficient and were taking a long time.

Apurva Misra: Um, so a lot of it is like talking to these different, uh, people in different domains, uh, learning what they're currently doing and figuring out where we can sprinkle the AI magic and make it faster. Um, yeah.

Adam Becker: Nice. Awesome. Uh, Samraj.

Samraj Moorjani: Cool. Hey, thanks for having me on. Um, my name is Samraj Moorjani. I am an engineer at Databricks working on the MLflow team.

Samraj Moorjani: Uh, and you might know MLflow for the more traditional side of ML, but, uh, we're working more recently on providing an end-to-end GenAI platform for your observability, evaluation, and governance needs. Um, yeah. And so one of the, uh, one of the big aspects that I work on, and, and something that I keep on having to rack my [00:06:00] brain about, is, uh, quality of agents, both for, um, helping our customers, um, build high-quality agents and deploy them into production, as well as for ourselves, building our own high-quality agents to use internally.

Samraj Moorjani: Um, so yeah, that's me.

Adam Becker: Nice. Awesome. So just to set the stage then, so that everybody's on, on the same page. Samraj, you are working on the tool MLflow. People might know it from the classical era of machine learning and MLOps. Uh, and so you get to see a lot of users and what their needs are, and then you're trying to build some tools to help them.

Adam Becker: And you guys have your own needs and you're obviously, you know, dogfooding your own tools and, and, and are building towards your own needs. Apurva, you're seeing in high depth, uh, a few different companies. You're doing consulting for a lot of companies, mostly smaller ones, and mostly ones where it might just be the first time that they're actually getting, uh, uh, started with AI.

Adam Becker: Uh, and then Ben, you are perhaps one of those [00:07:00] startups and then you're deeply involved in building a lot of agents, uh, and then seeing how well they run. So we got the entire sort of like the, the entire spectrum here from heavy users to tool builders. So this is excellent 'cause we're gonna get to see sort of like all of the perspectives, or at least how all these different perspectives are, uh, how the needs and challenges are manifesting across all these different stakeholders.

Adam Becker: So that's, uh, that's awesome to see. I wanna start us out by asking the following question. So somebody's coming to you; either Ben, that might be, you know, an internal user, or Apurva, it might be a new client, or Samraj, it might be a new user, and they're saying, I want an agent and I want that agent to do X.

Adam Becker: What is that X? Ben, you already started to give us a little bit of a sense, but I wanna make sure that it's as vivid as possible so that we're all on the same page. What are, what are companies dreaming their agents be doing? Uh, whoever wants to start. Uh, Ben, if you wanna go for it.

Ben Epstein: Yeah, sure. I mean, for us it [00:08:00] is an extension of what I just said.

Ben Epstein: It's all of the things that, um, you could imagine typically marketing or customer success wanted to do but didn't have access to because it required, like, engineer resources, for bad reasons. You know, there's good things, like we need to upgrade, um, this system, or we need to modify the database because a user's data, I don't know, like, was assigned to the wrong thing.

Ben Epstein: Like you don't want an agent to be able to do that. At least we don't want an agent to be able to do that. Some companies might, and so great, like it has to go through an engineer, but like, uh, you know, for us, a user was promoted. We don't want that to be like a part of our UI necessarily, but a user was promoted.

Ben Epstein: We have to modify their email. Like, yeah, the MCP server requires human approval. Like there's no reason that, uh, an engineer, for us, is required. So great: we have MCP servers, we have [00:09:00] easy ways to deploy them. With FastMCP, we deploy those MCP servers. They are, like, write tools, so they require a human click to approve.

Ben Epstein: We tell our internal agents that they exist. We configure them through, like, our internal Google auth, and then they're like, okay, go agent, do this thing. It comes back. It's like, you have to click this link to approve it, clicks it, and the thing is done. And no engineer was harmed in the making of that server call.

Ben Epstein: Things like that. So it's, um, simple write tools that we are comfortable giving non-engineers access to make those writes, that are never DB queries for us. They're always very, uh, contained, isolated functions that, uh, call, like, internal functions that make DB calls, but, like, you can't, there's no, like, SQL injection risk or things like that.

Ben Epstein: Um, and then a bunch of read access that used to be very difficult or annoying to set up. Um, those for us are by far the biggest. And then we have things like for marketing, we want to test out different versions of our website. So our internal agent has the ability to [00:10:00] host our website separately on its internal domain, like exposed through CloudFlare, and then our marketing team can go and modify and like make changes, test SEO, see how it comes up with different search queries, and then they can port that over to a pull request that actually modifies our site.

Ben Epstein: So it's giving them access to do, those are all real examples that we do today, um, that would've required engineering.
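To make the pattern Ben describes concrete, here is a minimal sketch of a human-approved write tool using the real FastMCP library. The tool name, the pending-approval store, and the approval URL are hypothetical stand-ins, and the internal endpoint that a human actually clicks is left out.

```python
# Sketch: an MCP "write tool" that stages a change and demands a human click.
# FastMCP is real (pip install fastmcp); everything else here is illustrative.
import uuid

from fastmcp import FastMCP

mcp = FastMCP("internal-write-tools")

# Hypothetical in-memory store of writes awaiting human approval.
pending_approvals: dict[str, dict] = {}

@mcp.tool()
def update_user_email(user_id: str, new_email: str) -> str:
    """Stage an email change; nothing is written until a human approves."""
    token = uuid.uuid4().hex
    pending_approvals[token] = {"user_id": user_id, "new_email": new_email}
    # The agent relays this link back into Slack; a (hypothetical) internal
    # endpoint behind Google auth performs the contained DB write on click.
    return f"Staged. A human must approve: https://internal.example.com/approve/{token}"

if __name__ == "__main__":
    mcp.run()
```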

Adam Becker: I'm curious then, is it, is it fair to say that it, that like those needs emerge from both camps? That is the engineers themselves are saying, why are you giving this to me? I don't need to do this.

Adam Becker: Just do it. And then maybe the non-engineers are saying, why do I need to go to this engineer? Can't I do it myself? Are you seeing them kind of bubble up from both, both sides?

Ben Epstein: Um, I would say it's, I mean, given how small we are, I, it's definitely more the latter of, I mean, it's both, it's, it's just a cultural shift.

Ben Epstein: Like, I, I pushed everybody at our company to, any time they think they want to ping me or my team to do a thing, I say, like, first, just, like, go see if the [00:11:00] agent can do it. Also, I'm sure a lot of companies have this, but, like, our agent at this point, it constantly is updating its own memory based on interactions, and, like, it's gotten to a point where it just knows to ping me.

Ben Epstein: Like, like it just messages me in threads if it thinks it can't do a thing or it doesn't know how to do a thing. So it's sort of, at this point, is, is self-enforcing. So somebody on our team will go ask the agent, they'll go back and forth a bunch. The agent will either do it great or it won't, and it'll literally type out and send a message like, @Ben, can you jump into this thread?

Ben Epstein: Like how do we, like, how do we do this?

Adam Becker: That's fine.

Ben Epstein: Um, and that's sort of how it's been going and, and, and self-improving.

Adam Becker: There's like a Ben MCP. It is just, there's,

Ben Epstein: it's actually just a Ben. It's just a Ben handle on Slack.

Adam Becker: Uh, fun. Okay. Uh, Apurva, what, what, what kinds of, uh, use cases are you seeing, and what are kind of, like, the, the grander of the aspirations that people have when they come?

Apurva Misra: I'm, I'm just wondering, did it, like, learn over time that it has to tag Ben?

Ben Epstein: Yeah, it did. Like, so we, [00:12:00] one of its internal, you know, like when, when you set up one of those systems, you often will give it, like, a soul doc, like a SOUL.md, and one of those soul doc instructions in ours, not unique, is just, like: every time you have an interaction, go, like, think about what you could update in your memory about it.

Ben Epstein: And it just has started naturally tagging me with questions. Like, that was not, like, explicitly written. It just sort of picked up on that. Um, I'd like it to stop doing that, but, like, it does, it does do that. Um,

Apurva Misra: it's interesting what they pick up on, you know? Um,

Ben Epstein: yeah, totally.

Apurva Misra: So the kind of use cases that I'm getting are more from the business perspective.

Apurva Misra: So the kind of, like, people that come to me are, like, business owners or founders, and they have heard about AI. Um, they have tried exploring it in their company and they're trying to find some expert who can build something for them. So, for example, a recent company: uh, um, the founder, he had a SaaS company.

Apurva Misra: But the users are non-technical. So it's a SaaS company, he's technical, uh, the owner. Um, but the users are non-technical, and they were, like, struggling to provide support to them, 'cause the questions would be, like, [00:13:00] oh, where is the dropdown menu? Or where is this button? Like, they are not that experienced with using computer software.

Apurva Misra: So they wanted to build an assistant which can do stuff for them. All the complex tasks that they have, like job assignment, um, or, like, email template creation, whatever, inside the SaaS, uh, software, they want to automate all of that with an assistant whom the person can go talk to and who can do all of that stuff.

Apurva Misra: So, uh, so a lot of my work is, like, narrowing down the scope, and not, like, building what they want, which is, like, an end-to-end solution which can, like, do everything for them. So I have to, like, help them, like, narrow down the scope and, like, build, like, smaller agents that we can, like, put together later on based on, like, how they're working.

Apurva Misra: Are they working well? Can we put them together and put, like, an orchestrator on top? So, um, yeah, that's, like, the kind of work that I get. Like, a lot of it is, like, more about, like, communication, building the POC, um, trying to narrow down their scope, and then actually building the solution, and, like, then going on to, like, evaluation and stuff.

Adam Becker: So it's kind of like they, they come in and they want, [00:14:00] it's almost like the, they're not yet sure how to scope the agent. And so they imagine that one agent can do everything and then think they

Apurva Misra: sort of have figured out the problem and they think, I, I'm gonna find this expert, I'm gonna solve that problem.

Apurva Misra: Um, yes, in those problems, most of the time AI is the way to go. Sometimes AI is not even the way to go. Um, but even, even for the problem, like, um, what they have heard is, like, agents work perfectly all the time and, like, you can just start with them. It's better to, like, narrow it down. Um, reduce the scope, build something smaller to begin with, see how it's doing, uh, how the employees are working with the agent, uh, and then, like, go with that.

Apurva Misra: Because a lot of it is also, uh, the behavior shift that Ben was talking about. Like the employees are getting used to like using AI tools and stuff. They know about hallucinations. Um, every time, like every time the agent was answering something wrong, they were saying it's hallucination, but the issue was their documentation.

Apurva Misra: Mm-hmm. It was just not [00:15:00] up to date. With startups, that happens a lot, right? Like you have documents, uh, written about your interface, and the interface changes so quickly. Like you, the button has gone from there or the dropdown menu has gone. So a lot of it is like them understanding what context management is, what context is, like, how like your documentation needs to be updated for the agent to work well.

Apurva Misra: Mm-hmm. So, um, it's always easy to start with something smaller so that the employees can get used to it and then, like, explore more.

Adam Becker: Yeah. Uh, Samraj and Ben, have you guys seen similar things then, where it's like you're, you, you have high aspirations for what an agent can do and then you start to say, okay, well, let's start a little bit more reasonably.

Adam Becker: Um, and if so, what is driving that pull towards the reasonableness? 'Cause you might not have an Apurva there, right? Who's telling you, actually, da da, there's some other pressure from the environment that then forces the better constraints. Is that how you're seeing things, and what, what does it look like there on that?

Samraj Moorjani: Yeah, definitely. I think, I think the quintessential [00:16:00] example is like you can try and replace an entire role or a person with an agent and, and you know, I think a lot of customers we've seen have built out prototypes in, in hackathons or, or for demos. And they notice like, oh, okay, it works pretty well while we're testing it, but as soon as they kind of release it more broadly.

Samraj Moorjani: It flops, like, really badly, because people use it in ways that the developer never expected. It's obviously non-deterministic. It will hallucinate. You run into these issues with, like, um, out-of-date documentation, or, um, even issues with just, um, prompt, um, engineering. So I think a lot of the, um, uh, the reality check really comes when you see these issues in quality, and it kind of, um, pushes people towards these, the kinds of things like Apurva was talking about, where you do have to, um, think meaningfully about design.

Samraj Moorjani: You also [00:17:00] have to, uh, a lot of times, work on aligning with your stakeholders on what the agent should actually behave like and what it should do. Um, a lot of times the customers we're working with, they have separate teams developing the agents and separate teams which have the domain experts that actually know what the agents are supposed to do.

Samraj Moorjani: Hmm. And so getting these teams to actually talk to each other is one of the biggest challenges because if you just have a developer going and blindly creating an agent for a medical chat bot, and they're like, okay, it looks good enough, um, someone's gonna come back and say, no, this agent's, uh, you know, doing something completely wrong.

Adam Becker: Yeah. So you're saying the first reality check in your mind is often just the quality degradation, or so, like, they expect it to perform a certain way and it's subpar. I wanna just plant a flag here because I'm, I'm very curious about you guys' experience. Do people [00:18:00] actually monitor quality of agents? Is that actually, Apurva, like, do you demand that everybody does it?

Adam Becker: What does that actually look like? Like, Ben, are you, is that just part of, like, your best practice? You, you don't even roll out an agent unless it has quality checks? Like, what does it mean? And, and is that something that we talk about, or is it something that people actually are, is it like second nature for, for engineers?

Adam Becker: Like, help me make sense of this.

Apurva Misra: I think it

Ben Epstein: go.

Apurva Misra: Uh, at least when I'm building the solution for them, um, we try to have an observability layer on top, um, just to make it, because, like, what, um, Samraj was saying, like, especially in my case, like, it's their first AI solution; if it doesn't work, they will not trust the technology later on with, like, any of their other, like, needs.

Apurva Misra: So it, that's why you need to like narrow down the scope and make sure like it's working and at least like either make it work as well as they need it to, or they have their expectations or like drop [00:19:00] down the expectations and level it to like how the technology can like work. Hmm. And then like have an observability layer on top and like start pushing them towards like evaluation and stuff.

Apurva Misra: Um, because. It needs to be reproducible in the sense like if they say, okay, this didn't work, or the output on that day was bad, like we should be able to trace and like figure out like where, which agent decision or which tool call? Like where was it messed up? Mm-hmm. So, um, observability there is like important,

Ben Epstein: I think it depends, uh, on what you're doing. Like, I think, I mean, we have unbelievably strict, uh, observability for all of the LLM calls and pipelines that go through our product, for sure. You can trace every decision back to a call that's flushed into a lakehouse, and, like, that's great. But also the agents that [00:20:00] we run in our, in our product are way, way narrower in scope than the agents that we have running live in our Slack channels, because the tolerance for error is so much higher.

Ben Epstein: Like, when I write with Claude Code, if it doesn't respond for seven minutes and then just makes some stuff up, like, it's annoying, but it's fine. Like, my life is still 10x more productive than it used to be, even if occasionally, or even if frequently, it makes mistakes. It's always making mistakes. That's okay.

Ben Epstein: The agents in our Slack channel are making mistakes too, like, all the time. We have a small, and obviously this is different for large enterprises, but we have a small team, and everyone, uh, has high levels of ownership and high levels of trust in the company, and they're responsible for the outputs of those models and for fact checking and for validating.

Ben Epstein: And, like, anytime there's a mistake, they're like, okay, system, update your memory, go fix the thing. But we can [00:21:00] iterate on that. So, like, no, we have very little observability for our internal agents, because our internal agents live in our Slack channel with read-only access to stuff. Like, they can't go and make real consequential decisions to the business.

Ben Epstein: They can't send the email. Like we still send the email. And, and that's how we kind of manage that. Um, but our, our tolerance for error in our product is, is zero. And so it's a totally different game and those systems are way more narrowly scoped. Um, so I think it's just super dependent on like,

Adam Becker: Yeah. And then, then what does it look like?

Adam Becker: So narrow scope. And then what, so then you, you enforce a layer of visibility and you guarantee that a person is there to watch, to measure, to evaluate the quality of each of these. Like how are you actually, and then even organizationally, how does it, how does it happen

Ben Epstein: for the ones in the product?

Adam Becker: Yeah.

Ben Epstein: We go through sufficiently rigorous testing before changing any [00:22:00] of the LLM systems in our product, mm-hmm, uh, uh, such that, um, we have confidence to ship it. The agents that are in our, in our product are treated incredibly similarly to just traditional ML models, right?

Ben Epstein: Like, every PR that touches any prompt or touches any downstream thing of the prompt runs through our evals. And those evals have a very, very low tolerance for failure. CI fails, you can't merge your PR. And then, on top, like, so that's, that's preemptive, we then obviously have Sentry alerts for any time things like, um, schema coercion is failing too frequently, or, um, like, changes in calls.

Ben Epstein: Like, like you would maybe back in the day call this data drift, but now you would actually call it model drift, because the models literally are changing, 'cause we're using providers who change the models behind, you know, um, non-static API endpoints. So when those are changing frequently, um, uh, we're getting Sentry alerts.

Ben Epstein: And then even on top of that, we [00:23:00] have, um, you know, our, our, our agents in our product make a bunch of decisions, um, throughout the day. And we have our internal agents at night go and read the decisions from our product agents and send us Slack messages of, like, here's what I think your system did wrong in production yesterday.

Ben Epstein: And then we, so, so all of those are pieces that happen every single day for us.
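As a concrete illustration of the CI gate Ben describes, here is a minimal pytest sketch: a small eval set runs on every PR that touches a prompt, and the build fails below a pass-rate bar. The cases, the threshold, and run_agent are all hypothetical stand-ins for a real harness.

```python
# Sketch: a prompt-change eval gate that runs in CI and blocks the merge.
EVAL_CASES = [
    {"input": "What's the pet policy?", "must_contain": "pet"},
    {"input": "Schedule a tour for Saturday", "must_contain": "tour"},
]

def run_agent(prompt: str) -> str:
    # Stand-in for the production agent entry point under test.
    canned = {
        "What's the pet policy?": "Our pet policy allows cats and small dogs.",
        "Schedule a tour for Saturday": "Your tour is booked for Saturday at 10am.",
    }
    return canned[prompt]

def test_eval_pass_rate():
    passed = sum(
        case["must_contain"].lower() in run_agent(case["input"]).lower()
        for case in EVAL_CASES
    )
    pass_rate = passed / len(EVAL_CASES)
    # Very low tolerance for failure: if this assert fails, CI goes red
    # and the PR cannot merge.
    assert pass_rate >= 0.95, f"Eval pass rate {pass_rate:.0%} below threshold"
```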

Adam Becker: Okay. So that, so that might be a good, a good place to kind of pivot a little bit, because what I'm hearing is we need to have high levels of trust. We need to do a variety of different things to establish that trust.

Adam Becker: Some of these things are similar playbooks to what we've done in the traditional MLOps, classical ML era, even pre gen ai. Some of it might be different. What are some principles then, if you guys have seen it all sort of emerge in the last couple of years, that can allow us to establish that high level of trust?

Adam Becker: We're talking about observability. What else do we have? [00:24:00]

Samraj Moorjani: Yeah, I think, um, you know, a lot of the, the success stories we've seen are built around this concept of eval-driven development. And actually, if you take a look at GenAI, I'm gonna make a claim: it's actually not all that different than classical software or traditional ML.

Samraj Moorjani: Um, and, and our playbook is actually, again, not that different, right? If you think about, maybe, like, take for example vibe checking, right? Um, or, or vibe testing, or whatever it's called. In classical software, that's the equivalent of clicking around and praying that your software works, right? And we have a playbook for software.

Samraj Moorjani: It's unit tests, integration tests, production telemetry. So it only makes sense that for GenAI applications, you have the same exact guarantees to ensure that you trust what you're putting out in production. What that looks like in practice is, like, um, unit tests are kind of like your evaluations. You still have production telemetry over, um, your GenAI [00:25:00] application, but you also have additional components that help you gauge the quality.

Samraj Moorjani: Um, so, you know, these things are, are actually pretty similar at the end of the day. Um, and even for our internal use cases, we're still using a lot of the traditional ML techniques, like having train/test sets, A/B testing, k-fold validation. Like, none of this stuff has really gone away. It's just applying to a new domain with slightly different challenges of, you know, your outputs are non-deterministic.

Samraj Moorjani: Users will use your agents in ways that you never would've expected. Um, and at the end of the day, quality is very subjective and like I mentioned, requires, often requires domain expertise. Um,

Adam Becker: okay.

Ben Epstein: So I think also, I think it's also broadly what Apurva said before, which, which is the thing that hasn't changed, which is just: simplify the problem.

Ben Epstein: Like, when you, um, I think a lot of the, the, like, the demo-to-failed-production-deployment comes [00:26:00] from people using some agent kit that is super abstracted and builds the whole thing in one shot. Um, and when I give talks about that exact topic, really the whole thesis of the talk is, like, stop delaying the product work to the end.

Ben Epstein: Like, all that the initial burst of, I think, LLM and agent development did was, it, it sold a false promise that you could just no longer do product thinking. So, don't do product thinking, go put a prompt into production. It fails. Like, why did it fail? You go back, and then you end up just now being in Figma designing user journeys; like, had you just designed the user journeys upfront, you would've realized that you didn't need one broad prompt with, like, 76 different conditionals.

Ben Epstein: You could just break that problem down into normal problems, most of which end up being like some abstraction over a classification or regression task, and then you could just use LLMs for those classification and regression tasks [00:27:00] and now you once again have test sets that you can validate against.

Ben Epstein: That's obviously not every problem, but it is, like, a lot of problems. LLM systems are state machines; like, they're all just decisions to change state. If you are just testing transitions between states really robustly, you sort of know when things fail, because you know when the state change was wrong. So, so, like, if you just do that upfront, like Apurva was saying, like, it, it, it's not everything, but it's a lot of it.

Ben Epstein: Like, that'll definitely help. It, it's, like, what works for us, like, really, really well.

Adam Becker: Apurva, did you wanna respond to that?

Apurva Misra: I, I think everything is covered, mostly: like, unit tests, integration tests, LLM-as-judge. If, like, you cannot do equal-to, like, assertions, LLM-as-judge is the way to go there. Um, otherwise, like, try to break down the problem such that it leads to a binary output that you can test.

Apurva Misra: Uh, mm-hmm. Otherwise, if it's subjective, it has to be an LLM as a judge, um, [00:28:00] to evaluate it. Yeah. Yeah.
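A small sketch of that rule of thumb: prefer a deterministic, binary assertion wherever the output allows it, and reach for an LLM judge only for subjective criteria. The judge model name and rubric below are assumptions; the client call is the standard OpenAI chat API.

```python
# Sketch: binary check first, LLM-as-judge fallback for subjective criteria.
from openai import OpenAI

client = OpenAI()

def check_binary(output: str) -> bool:
    # Deterministic assertion: cheap, reproducible, no judge required.
    return output.strip().lower() in {"approved", "rejected"}

def check_with_judge(output: str, rubric: str) -> bool:
    # Subjective criterion: ask a judge model for a yes/no verdict.
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed judge model
        temperature=0,
        messages=[{
            "role": "user",
            "content": f"Rubric: {rubric}\n\nOutput:\n{output}\n\nAnswer only YES or NO.",
        }],
    )
    return resp.choices[0].message.content.strip().upper().startswith("YES")
```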

Samraj Moorjani: Actually, I love that, that you said that, because I think one of the most powerful things that has come out lately, uh, you know, I notice a lot with my own coding and, and Claude Code, is having that feedback loop, having verifiable goals.

Samraj Moorjani: That can help tell your agent or your language model, like, have I actually accomplished what I set out to do? So with every task that I do these days, I will always give it some way to verify its own work, whether it's unit tests, whether it's, um, you know, uh, using Playwright to actually click around and, and see, like, is this the expected behavior in my UI. Even for building out agents, I will have, um, judges run at the end of every change, uh, so that there is a, a feedback signal coming back to my agent, and it can continuously iterate and hill climb on that feedback signal.

Samraj Moorjani: So it's, it's really, really cool how much progress we're able to make [00:29:00] with those feedback loops. Uh, and I think it all stems back to that, that eval-driven development point.

Ben Epstein: Yeah, that, that's such a valid point. Like, my team has a pretty strict rule set now where every PR goes through essentially a Ralph loop of, like: PR, agent, subagent code review, modify those changes, subagent code review.

Ben Epstein: Like, until there are no, like, you know, Claude-assigned high-priority issues, before any person is taking a look at that code. We just Ralph loop it, essentially, until the subagent decides that it's good, and it catches so many of the problems.
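A rough sketch of the loop Ben is describing, under the assumption that you can invoke a reviewer agent and a fixer agent programmatically; both functions here are hypothetical placeholders, not a real Claude API.

```python
# Sketch: loop an automated reviewer over a PR until it reports no
# high-severity issues, before any human looks at the code.
def run_review_agent(diff: str) -> list[str]:
    """Placeholder: return the high-severity issues the reviewer flags."""
    raise NotImplementedError  # wire in your subagent code-review call

def apply_fixes(diff: str, issues: list[str]) -> str:
    """Placeholder: ask the coding agent to address the flagged issues."""
    raise NotImplementedError  # wire in your coding-agent call

def review_loop(diff: str, max_rounds: int = 5) -> str:
    for _ in range(max_rounds):
        issues = run_review_agent(diff)
        if not issues:
            return diff  # clean: now a person takes a look
        diff = apply_fixes(diff, issues)
    raise RuntimeError("Still failing review after max rounds; escalate to a human")
```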

Apurva Misra: Coding, yeah. It's so much easier to, like, evaluate in the coding space versus, like, the other spaces.

Apurva Misra: So, for example, there is, uh, there was this, uh, client project that I was working on in which we had to do an automated QBR deck generation. So, like, every, like, two weeks, what the [00:30:00] agent does is, like, it goes and checks Gong for, like, a history of calls with that customer. Um, it goes, uh, it looks at the last QBR deck, looks at, like, uh, what the salesperson, um, is talking about on Slack, and, like, basically, like, gets information from the different sources and figures out if the QBR deck needs to be, uh, uh, updated.

Apurva Misra: So, like, it would also look at, like, support tickets. Like, if there is a new feature that went out, um, uh, it would add this to the QBR deck. So, like, so it finally generates a QBR deck, and now you have to evaluate: like, is this, like, up to the mark? Can this, uh, person take it to the customer or, or not? So, um, one thing you can do is you can put, like, an LLM as a judge.

Apurva Misra: The other thing is, like, it's hard to, like, align this judge with what the employee wants, you know; you need to align with the domain expert. Um, so you can also, like, break this problem down into, like, verifiable, uh, uh, [00:31:00] verifiable, what should I say, verifiable output. Like, for example, it should have, like, the initial slide in which we are talking about this com, uh, company, and also the client at the end.

Apurva Misra: It should have, um, like, the necessary contact information of the company. Then in the middle, it should talk about, there should be a slide about the product features, there should be a slide about the... So these could be, like, just binary checks, you know: like, what's the title? This. What's the title? This. What's the title?

Apurva Misra: That, so there could be a template that you check against. Um, then, uh, you can also check for like, were the graphs included or not. Um, maybe this company needs like, some sort of like, um, metric system. Like, oh, these were the support tickets that your, uh, company provided and we resolved so many of them. So there should be a graph showing this.

Apurva Misra: So you can always, like, if there is a problem that you're working on and the LLM or the agent is giving you an output out of that, maybe you can, like, you can try to break that down into verifiable outputs. And if not, then, yeah, you have to go for, like, an LLM as a judge.
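To illustrate the template-style checks Apurva describes, here is a sketch using the real python-pptx library; the required slide titles and the deck filename are hypothetical.

```python
# Sketch: deterministic, binary template checks on a generated QBR deck.
from pptx import Presentation

REQUIRED_TITLES = {"Company Overview", "Product Features", "Support Metrics", "Contacts"}

def missing_slides(path: str) -> set[str]:
    """Return the required slide titles that the deck is missing."""
    deck = Presentation(path)
    titles = {
        slide.shapes.title.text.strip()
        for slide in deck.slides
        if slide.shapes.title is not None
    }
    return REQUIRED_TITLES - titles

missing = missing_slides("qbr_acme.pptx")  # hypothetical generated deck
assert not missing, f"Deck failed the template check; missing slides: {missing}"
```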

Ben Epstein: I do think on your, on your [00:32:00] note of, uh, code is the easiest thing to verify.

Ben Epstein: I'm gonna push back, because the thing that we find, I mean, it's really, it's true on, like, maybe simple, like, simple systems. What we find is that, um, even with Opus 4.6, like the, like the latest models, the sneakiest things that dig into our code, and that they often fail to catch themselves, are, like, asynchronous race conditions and, like, potential memory leaks.

Ben Epstein: It works, tests pass, things look good. Um, but we have seen a lot of problems that we've caught by hand, um, that you would never catch in CI, because you're not gonna run, like, 150 concurrent streams in CI, or even really as a part of, like, the integration test for that PR. Like, it's just too much.

Ben Epstein: Um, but then you see those problems in prod. Um, I, I literally, right before this, was working through, through a problem, uh, in a PR that Claude opened that would've [00:33:00] caused, like, a very problematic race condition for us. So, so, it, there are parts of it that are verifiable, but there are very sneaky issues that it, that, um, that it introduces. I know that's, like, super tangential, but it's just, it's very top of mind,

Ben Epstein: 'cause I was just, I was just working on it.

Apurva Misra: That, that sort of like, I've experienced this as well, like even if you like add in as many evaluations and you have like checked the solution and it's working for you and then you put it out in production and then the users start using it, you'll find like a bunch of bugs that you need to fix.

Apurva Misra: Um, that, that, that is always the case.

Adam Becker: Yeah, it's, it's interesting because it's, uh, do you guys remember in like the more classical era of supervised machine learning people were always talking about, oh yeah, data scientists are spending 80% of their time cleaning data, right? Like just doing data processing, data, pre-processing, uh, just trying to build the data set in the first place.

Adam Becker: And I feel like, just based on what you're saying, I've always thought that, well, that time, maybe some of [00:34:00] it is gone to waste, but a lot of it is product thinking. That is, you have to just struggle with the framing of the problem because you need to start with the dataset. And you need to have a target variable that you're trying to predict.

Adam Becker: And, and so we've almost had to do the, we've almost, like, front-loaded the work on evaluation in order to get started with machine learning in the first place. But what is happening right now is we can get started, and then we realize, oh, we never had that y that we were trying to predict in the first place.

Adam Becker: And that's where all the product thinking goes into. And so now we have to do this after the fact.

Ben Epstein: And so that, yeah, that's what I was saying before, where it's like, you one-prompt your product into production without doing the product thinking from, like, a user flow perspective. It's the same exact thing with machine learning.

Ben Epstein: Like, if you didn't actually go and explore the data. I mean, the thing that we struggle with in our team with LLMs, the one thing we haven't been able to crack, is our data analysis. Like, we do really, really rigorous data analysis, because the patterns in the data that we work [00:35:00] with are extremely hard to tease out; like, sufficiently hard that the only people that work on it in our team are PhDs in statistics.

Ben Epstein: Like, I can't tease out the, the signal that they're finding, and the LLMs are unbelievably good at, uh, confidently giving you incorrect narratives, and, like, especially with data and charts and graphs. Um, and so, like, we, our, our core, like, data unit is, like, a leasing call or a leasing tour, like, with a prospect.

Ben Epstein: And we just make everybody at the company, when they first join, like, listen to 150 leasing calls, because they're actually just very interesting. And it's hard to get any context as to what we do and why what we do works without listening to those calls. Um, 'cause the LLMs will just massively simplify it and, and will sort of ruin the analysis before you even get started.

Samraj Moorjani: Yeah, I think we do a, a really similar thing as well. It's like, we know it's painful, but you need to go and sit down and, and look at, um, [00:36:00] essentially what your agent is doing. Like, look through examples, see what it's saying to customers. Um, there's, um, we've been working with our research team a lot on this problem that Apurva alluded to, which is: how do I actually make sure that my evaluator, my, maybe my LLM-as-a-judge, is actually aligned with what my domain experts want, right?

Samraj Moorjani: My LLM judge can look at something and say, this is good or bad, but my colleague might disagree, and I trust my colleague more than I trust my LLM judge. So the question then becomes, how do I actually align this, this judge with the domain experts who actually know the area, right? And part of this workshop that we've been leading is literally: just sit down, look at a, like, a significant handful of, of examples, and make sure that your team is aligned on what the behavior should be.

Samraj Moorjani: You know, make sure they're aligned on, um, you know, the agent is doing the right thing [00:37:00] here or it's not, and, and come up with evaluation criteria, like a rubric, based on that exercise, that you can then go and use further down the line in your evaluations as you're iterating on your agent. Um, but yeah, I think this, this point of actually developing that product sense, it, it's just something you can't replace when you're, when you're developing your agents.
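One way to operationalize that alignment exercise, sketched under the assumption that you have expert pass/fail labels from the sit-down session and some llm_judge wrapper around your judge prompt; both are placeholders here.

```python
# Sketch: measure judge-expert agreement before trusting the judge in CI.
judge_verdicts = {  # stand-in for real judge calls over each example
    "agent answer 1": True,
    "agent answer 2": True,
    "agent answer 3": False,
}

def llm_judge(text: str) -> bool:
    # Placeholder: swap in the real judge model call with your rubric.
    return judge_verdicts[text]

# Expert labels gathered in the review session (hypothetical data).
labeled = [("agent answer 1", True), ("agent answer 2", False), ("agent answer 3", False)]

agreement = sum(llm_judge(text) == label for text, label in labeled) / len(labeled)
print(f"Judge-expert agreement: {agreement:.0%}")
# Only hill-climb against this judge once agreement clears a bar you trust,
# e.g. 90%, and revisit the rubric when it drifts.
```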

Adam Becker: Samraj, are you, are you using that exercise in order to give better prompts to those LLM-as-judges, or, in order to, and, and yeah, do you, like, modify the prompts of the judges, or do you just use that as feedback after the fact, like, here's a few more examples of what successful handling of the existing problems looks like?

Samraj Moorjani: Yeah. We actually use it for the judges, and I think it can also be encoded back into the agent as well. So there's, there's a ton of ways to use that feedback and loops to both improve [00:38:00] agents and judges. But for us, like one of the key things is making sure you also trust your evaluations just as much as you trust your agent.

Samraj Moorjani: Because if you have, um, you know, CI that you can't really trust, you're not gonna feel safe deploying to production. Right? And the same goes for evaluations. So the point is to make sure that you can trust your judges and then use those judges to continuously hill climb performance of your agents so that you trust your agents.

Apurva Misra: I think that's, like, reiterating on this, like: observe the output in production, and you might have to change the evaluations, update the LLM-as-judge, update the system. Maybe there's drift. Maybe their expectations have changed over time, or the data has changed. So

Samraj Moorjani: yeah.

Ben Epstein: Yeah, I think that's really right. We spent so much, we spend, I always was, like, a, a big TDD person, but we do it so much more aggressively now, and we use CI a lot, like, unit tests still, but our integration tests [00:39:00] are, like, substantially more robust than I think I've ever had in a company, because, like, our system is complex.

Ben Epstein: Like, every system is complex, and unit tests are great, but, like, we, we actually are just fully running, like, actual, like, full simulations on every pull request. And, like, yes, that is more expensive, and, like, yes, that does take longer, but, like, otherwise, how, like, how, how am I gonna let LLMs touch, like, the core parts of our code if, if I'm not at least gonna run, like, a full system?

Ben Epstein: Like a full walkthrough of, of the product. Um, so we, we get, I think that makes a lot of sense.

Adam Becker: Are you, are you talking about integration tests of, like, the agent behavior, like, including the agent behavior, or just the output of the agents having helped you with coding? Right.

Ben Epstein: No, like when an agent writes code and opens up a pull request in our code base, the level of testing that we run is like more robust and more significant than I've ever done in the past.

Ben Epstein: Even at the expense of, like, a little bit longer CI than I typically like, just because it, [00:40:00] it overall allows us to move much, much faster with a lot of confidence. Like, we are at a point now, I think, when our CI goes green, we're like, we're, we're really confident that things are, that things are working well.

Ben Epstein: Like everything's being cleaned up. We're really actually interacting with the database. Like we're really doing a lot in ci, um, to make sure that we're, that we're comfortable. We run it in the exact same environment that our actual app runs. Like we, we go pretty heavy on it,

Adam Becker: So, okay. So, so those are then, I, I think, two things we're talking about.

Adam Becker: Nice. Uh, this, this is how happy Ben is to talk about TDD. So there's, like, two aspects here. The first is, the traditional software testing is now done more robustly, and that's separate even from the evaluation of the agent system.

Ben Epstein: I think it, no, I think it's a part of it. Like, I think the part of saying like, I mean, it's hard to separate those two things, at least for us.

Ben Epstein: Like the, the system is the agentic [00:41:00] system. Like the pro, the software is the agentic system. So like running the integration test is running the age agentic system. Like when I say we've run, um, like full simulations of the application, we literally, you know, one of our core products is like a system that guides phone calls.

Ben Epstein: Um, so we literally mock phone calls: like, we generate audio, and we mock the phone call, and we stream audio through a mock Twilio stream, which actually goes through our production speech-to-text system and actually gets predictions and actually makes decisions. We do that whole thing many times as a part of CI, like, every single time.

Ben Epstein: Um, and even, like, Apurva was saying, I think the weirdest thing now is that the providers will change the underlying models, like, pretty substantially, without telling you, or, like, change API calls. And this happened to us pretty substantially when thinking models became a big deal. Like, we run a lot of real-time applications.

Ben Epstein: And so thinking, for us, in those contexts is bad. We want it off. We have no interest in [00:42:00] it. Sometimes we would instruct in our prompts, you know (before thinking models, everyone was still doing thinking; like, that wasn't new), we would, we would just say, like, add a thinking tag and, like, put your thoughts out, and then close the thinking, and then give the response, and you parse that that way.

Ben Epstein: Um, but, um, what we found without knowing was that with the new models that were thinking models, if you, if you tell a thinking model to, like, think in tags in an API call, it goes crazy. It'll think pretty much indefinitely until it hits the output token window, like the, the output, um, uh, token limit, which was, like, not a thing that they used to do. They used to, like, be able to think briefly. And in one of our prompts, we just had, like, think briefly before answering this, and we got an alert, like, the calls that used to take, like, 600 milliseconds are now taking, like, 45 seconds, because the LLM was just spitting out text in thought tokens, again, in thought, uh, uh, tags.

Ben Epstein: Um, and we caught that like before it went out because it was just a part of [00:43:00] like our, our CI integration tests.
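The kind of integration-test guard that catches that regression can be sketched like this: time a real call against the production prompt and fail CI when latency or output size blows past the budget. call_llm and the thresholds are hypothetical stand-ins.

```python
# Sketch: a CI guard against runaway "thinking" in a real-time call path.
import time

def call_llm(prompt: str) -> str:
    # Stand-in for the real provider call used in production.
    time.sleep(0.1)  # simulate a fast, non-thinking response
    return "pet_policy_question"

def test_realtime_latency_budget():
    start = time.monotonic()
    output = call_llm("Classify this utterance: 'Can I bring my dog?'")
    elapsed = time.monotonic() - start
    # A call that used to take ~600ms suddenly taking 45s means the model
    # started thinking indefinitely; catch that before it ships.
    assert elapsed < 2.0, f"Call took {elapsed:.1f}s, over the 2s budget"
    assert len(output) < 2_000, "Output suspiciously long; runaway thinking?"
```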

Adam Becker: Very cool. So, I, I just, we have, I think, about 15 minutes, and there's a ton of questions already in the chat. Before we go into the questions, I wanna see: so, Samraj, you were talking about eval-driven development, then you're talking about TDD and how all of this fits together, the CI, and, and I wonder, if people wanna get started thinking even more rigorously or robustly.

Adam Becker: About how to get up and running. Any recommendations, any resources, anything you can send them to? Either a blog, like, is eval-driven development, like, the framework you guys use when designing MLflow?

Samraj Moorjani: Yeah, definitely. I mean, I'm, I'm very biased, so I will plug all the MLflow blogs and all the MLflow resources, but I think Anthropic also had a really good blog on eval-driven development.

Samraj Moorjani: I think it's called Demystifying Evals, or, or something along those lines. So I definitely point people along, um, that way. But, um, I think, of, of [00:44:00] all of the teams that we've helped, and, um, you know, big company, small company, doesn't matter, the ones that we've seen the most success from are the ones that have had this, this, uh, mindset of starting out with evaluations and really thinking through, um,

Samraj Moorjani: With quality at the, the top of their mind as they're developing. Because at the end of the day, building an agent is really easy, but building a high-quality agent is, is still a challenge that many people are facing today. Um, and then, I, there are additional challenges on top of this. There's, you know, the question of observability; there's a question of governance of agents.

Samraj Moorjani: So there's a lot of, a lot of different things that are blocking people in production, but I think the most important thing when you're starting off is that quality aspect.

Adam Becker: Hmm. Okay. So, uh, and if you guys have any other ideas for more, uh, resources, drop 'em in the chat. I wanna start taking questions 'cause we do have a lot and we don't have too much time left.

Adam Becker: [00:45:00] Um, Apurva, the first one is for you. What tech stack do you generally recommend, uh, for building agents?

Apurva Misra: Um, it depends on the company that I'm working with, but, um, I, I have been using Pydantic AI a lot, and Logfire, so it helps with the observability piece. Um, what else? Um, I, I've used BAML as well quite a lot.

Apurva Misra: I, I don't know how many of you know about it, but BAML is pretty good too. Um, and it's, it's really nice in the IDE itself. Like, you can, like, see how the prompt changes, what tokens are going in, and change the model, like, test it out right there. So, uh, yeah, play around with that.

Adam Becker: Okay, cool. So I hope that that helps.

Adam Becker: Uh, Vidia, that was your question. Uh, Rishi is asking, are there tools you guys use to help increase observability and reliability for agents? Uh, Apurva, you touched on that a little bit. Samraj, I have a feeling I might know where you're coming from.

Samraj Moorjani: Um, [00:46:00] yeah. So, I work on MLflow, so obviously, again, biased, but we provide a lot of the tools for an end-to-end GenAI platform, including that observability piece.

Samraj Moorjani: And, and the thing that a lot of people, um, think about with observability in GenAI, and, um, I think I might have mentioned this before, is traces. It's essentially a way to see the end-to-end execution of your agent: really pop open that hood and see what's going on, every tool call, every LLM call, and attributes and metadata about each.

Samraj Moorjani: Um, it's not something that's specific to MLflow; I think every GenAI observability platform will have some form of tracing integrations. Um, and most of them, including ours, are really, really easy. It's just a line at the top of your, um, at the top of your code. And if you need more specific, uh, or, like, more customized use cases,

Samraj Moorjani: I've honestly never seen a case where we haven't been able to trace an agent. [00:47:00] So the, the observability component is super, super important. Um, you know, you can't debug, you can't look into negative customer feedback without it, and it's really cheap to do, so I'd highly recommend it. MLflow has support for, I would say, like, 30 to 40 integrations now, and it keeps on growing.

Samraj Moorjani: Um, yeah.
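The "line at the top of your code" looks roughly like this with MLflow's tracing APIs (the OpenAI autolog integration and the trace decorator are real in recent MLflow versions; the model name and the function below are illustrative).

```python
# Sketch: MLflow tracing for a GenAI app in a couple of lines.
import mlflow
from openai import OpenAI

mlflow.openai.autolog()  # capture every OpenAI call as a trace span

@mlflow.trace  # wrap your own agent steps in spans too
def answer(question: str) -> str:
    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model
        messages=[{"role": "user", "content": question}],
    )
    return resp.choices[0].message.content

print(answer("What does tracing capture?"))  # trace visible in the MLflow UI
```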

Adam Becker: Cool. Nice. Um, another question here: MoveOn is asking, uh, what architecture are you using for deterministic pre-execution controls in agent workflows? So, not just prompt-level guardrails. And how do you prove those controls were actually enforced per action?

Adam Becker: That's a question for anybody. Uh, whoever wants to take it.

Adam Becker: Yeah.

Apurva Misra: So it depends on the kind of project. Like, you can have checks for stuff after, uh, the agent has run. Um, in some cases, like, there was a project which was about governance, in which, like, we had to check the policies; we were updating the documents based on policy updates in, like, different countries.

Apurva Misra: So we had a check after each LLM call. [00:48:00] So, um, like, each LLM, like, cycle, like, that was happening, we had a check on, like, what, what happened, and, like, we were able to attribute every change in the policy document and why that happened. It depends, like, how many, like, depends on the use case, the kind of use case: what checks, how many, like, what's the latency?

Apurva Misra: Like, is this a real time need? So, um, this wasn't, so we had like a lot of checks in that application.

Ben Epstein: Can you repeat the question?

Adam Becker: Yeah, for sure. It's: what are you using for deterministic pre-execution controls in agent workflows? So deterministic pre-execution controls, not just prompt-level.

Ben Epstein: Yeah. The "pre-execution" part is what's throwing me off on that question. I'm not sure what exactly that means.

Adam Becker: Yeah. Uh, Apurva, if you had a certain reading of it, please.

Apurva Misra: Oh, I was just going for deterministic as in: how do you make sure that your system is working the way you expected it to? [00:49:00]

Adam Becker: MoveOn, if you want to unpack it further, go ahead; until then, we'll move on to the next one.

Ben Epstein: Yeah. I think there's one thing worth calling out: if you're building LLM-only systems, you can't guarantee anything is deterministic unless you run that model yourself and you control batch sizes and you control input concurrency.

Ben Epstein: Thinking Machines has done the research; we understand why you can't actually get true determinism on those hosted APIs, because of the way batches are managed. If you run that model yourself and set a seed and set the temperature to zero, sure, you can get determinism.

Ben Epstein: But if you can't do that, you can't guarantee it, and you need to be building systems where you're comfortable with the level of non-determinism that happens. It's quite low: if you're using a really low temperature and a seed on one of the major providers, you can get it pretty low.

Ben Epstein: But at the end of the day, you have to be comfortable working with a probabilistic system.
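(A minimal sketch of the best-effort version on a hosted API, assuming the OpenAI Python client, whose chat completions accept a seed parameter; as Ben says, this reduces variance but does not guarantee determinism.)

```python
from openai import OpenAI

client = OpenAI()

def ask(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",          # illustrative model name
        messages=[{"role": "user", "content": prompt}],
        temperature=0,                # greedy-ish decoding
        seed=42,                      # best-effort reproducibility
    )
    # If system_fingerprint changes between calls, the backend config
    # changed, and identical outputs are not expected even with a seed.
    print(response.system_fingerprint)
    return response.choices[0].message.content
```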

Adam Becker: [00:50:00] So, how about this. A couple of people are mentioning things we haven't yet gotten to chat about, and honestly there's so much we didn't get to talk about, so maybe we can keep those in mind as we answer the following few questions.

Adam Becker: They're mentioning LangGraph, et cetera, but also sandboxing, memory, metrics, things like that. So perhaps keep those in mind when we're answering the next few questions. CD is asking: which framework do you use these days? There are new ones every day. How do you secure your systems?

Adam Becker: So OAuth, for example, between agents, memory, and tools for observability. I think we touched on some of that already, but if there's anything here that you'd like to unpack.

Ben Epstein: I like to keep systems pretty low level. I think it's a privilege of, one, being an engineer first, but also working with a really small company; my startup is quite small.

Ben Epstein: We only have seven people, and it's just because we're all able to move very [00:51:00] fast and leverage this stuff in a way that larger enterprises maybe have to take more time to introduce. But we keep things very low level, and that gives us a lot of control. So we don't use LangChain, or actually LangGraph.

Ben Epstein: Our entire system is powered by BAML, because BAML, for me, in my opinion, is the correct fundamental abstraction layer over prompting. It turns prompts into functions. Those functions are guaranteed type-safe, because BAML has to compile, and it can't compile without being type-safe. Those types compile through your entire code base, across TypeScript, Python, and JavaScript; those Python models turn into our database migrations, so it keeps the entire system 100% in sync.

Ben Epstein: We literally cannot fall out of sync unless somebody can somehow bypass our GitHub rules; it's not possible. And then, because it's at the level of just making asynchronous calls to an LLM and getting guaranteed [00:52:00] typed outputs, or quality validation errors that you know how to handle, because you know how to write software, you sort of have control of everything at that point.

Ben Epstein: So OAuth becomes the normal way that you do OAuth, which is just the way we've been doing it forever. And for agent-to-agent communication, we just leverage the standard MCP protocols for identity-aware server-side calls. If you don't use the highest-level abstraction systems and you keep everything pretty low, your code base actually just looks like regular software.

Ben Epstein: You wouldn't really know that our code base has a bunch of AI going on, because it's all in BAML files, and the Python and TypeScript code just looks like code. So for us, that's how we do it. And then observability is just: all right, all these LLM calls are functionally API calls, and you log your API calls, and they're logged in a structured way with traces and tags and metrics.

Ben Epstein: So you can just go look at them in Sentry. It's very [00:53:00] kind of boring software when you get to control the lowest levels of it.
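(A minimal sketch of the "prompts as typed functions" idea Ben describes, assuming a hypothetical BAML function ClassifyTicket has been defined and BAML's Python client has been generated; exact import paths depend on your codegen settings.)

```python
# BAML compiles prompt definitions into a generated, typed client.
# Suppose a .baml file defines a Ticket class and a function
# ClassifyTicket(text: string) -> Ticket backed by an LLM prompt.
from baml_client import b              # generated by BAML's codegen
from baml_client.types import Ticket   # generated Pydantic model

def route(text: str) -> Ticket:
    # The LLM call is just a typed function call: the response either
    # parses into a Ticket or raises a validation error you can handle.
    ticket = b.ClassifyTicket(text)
    return ticket
```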

Adam Becker: Before we move on to the next question, do you feel like that's true for you as well? Apurva, to what extent do you rely on higher-level abstractions and these types of frameworks, versus getting as close to the metal as possible and trying to render your software as traditional software?

Apurva Misra: I have been using BAML as well. Whatever Ben said, I'm all in.

Samraj Moorjani: Yeah, I'm also with that. I think Ben covered most of the points pretty well.

Adam Becker: So I want to group a few questions together here. Nikolai is asking about the observability layer.

Adam Becker: Is that another agent? What does the observability layer actually look like? And I'll pack into that as well: [00:54:00] what system are you using to create and orchestrate the judges?

Samraj Moorjani: Yeah, the observability layer is actually really cheap. It's essentially just a way to sniff what's happening at each step in your agent.

Samraj Moorjani: It basically costs nothing to implement, and I would recommend that any use case serious about going to production has it. Ben and Apurva, you guys have probably used this for all of your agent use cases as well. As to the second part of the question, this was about judges, right?

Adam Becker: Mm-hmm. So let me add to that: what system are you using to create and orchestrate the judges? And then, as Zeal is asking, those judges are probabilistic; how can you trust that? Don't you need something more deterministic to create the evaluation?

Samraj Moorjani: Yeah, yeah. I think at the, you know, [00:55:00] at the simplest level, judges are basically just prompts, right?

Samraj Moorjani: Or they're just classifiers. And often we recommend that people start off with binary classifiers that can say: hey, is this good or not, yes or no? It's usually a stronger quality signal. As for the system you actually use to create and orchestrate these: MLflow provides judges, and other frameworks like Langfuse, Arize, and Braintrust will also provide judges.

Samraj Moorjani: So it's just a matter of picking the right ones for your use case, whatever works for you. And I think the question of how you actually trust your judges is a big one that a lot of frameworks have been thinking about. At the end of the day, any out-of-the-box judge you pick is probably not going to be something that works for all of your use cases, right?

Samraj Moorjani: It might work 80% of the time, but that remaining 20% is something you have [00:56:00] to align with your stakeholders on. It's something you have to tune; you have to work on your judges to get that full trust and confidence. One of the examples I mentioned before was that workshop we ran, which really helped people align those judges with their expectations.

Samraj Moorjani: And I think MLflow and a couple of other libraries have been providing tools to help you do this alignment automatically: take in user feedback and apply it directly to your judges.
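(A minimal, framework-agnostic sketch of the binary judge pattern Samraj describes, assuming an OpenAI-style client; the rubric and model name are illustrative.)

```python
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are grading an AI assistant's answer.
Question: {question}
Answer: {answer}
Is the answer factually correct and complete? Reply with exactly one word: PASS or FAIL."""

def judge(question: str, answer: str) -> bool:
    response = client.chat.completions.create(
        model="gpt-4o",  # a strong model used only as the judge
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(question=question, answer=answer)}],
        temperature=0,
    )
    # A binary verdict is a stronger, easier-to-align signal than a free-form score.
    return response.choices[0].message.content.strip().upper().startswith("PASS")
```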

Ben Epstein: Yeah, I think the only way to trust your judges is this, and it doesn't matter how many judges you have in the stack:

Ben Epstein: at some point you need to be making a decision that is classifiable. And as long as you're making a decision that's classifiable, whether it's binary classification or multi-label, it doesn't matter: if you're classifying something, you can build a dataset of evaluations, right?

Ben Epstein: You can build a ground-truth set. And if you build a ground-truth set at some level of [00:57:00] statistical significance, you can build trust. It depends how much trust you need: if you need an enormous amount of trust, you build 10,000 samples and that's your evaluation set; if you need less trust, maybe you only need a hundred samples.

Ben Epstein: But at some point you need a judge to make a quantitative evaluation, not a qualitative one, right? If you're having them always just spit out summaries of summaries of summaries, yeah, you can't trust them. But you can always build a ground-truth data set, and so you can always get to trusting them at some point.
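(A minimal sketch of the trust-building step Ben describes: measure a judge's agreement against a human-labeled ground-truth set. The labeled data is hypothetical, and the judge function from the sketch above is reused.)

```python
# Hypothetical labeled set: (question, answer, human_verdict) triples.
ground_truth = [
    ("What is 2 + 2?", "4", True),
    ("What is the capital of France?", "Berlin", False),
    # ...in practice, hundreds to thousands of labeled examples
]

def judge_agreement(judge_fn) -> float:
    # Fraction of cases where the probabilistic judge matches human labels.
    hits = sum(judge_fn(q, a) == label for q, a, label in ground_truth)
    return hits / len(ground_truth)

# High agreement on a statistically meaningful sample is what lets you
# trust a probabilistic judge for one specific, well-scoped task.
print(f"judge/human agreement: {judge_agreement(judge):.1%}")
```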

Apurva Misra: I feel like, if you're using LLMs inside the agent, why can't you use an LLM for judging as well? You're already using it for making decisions, doing the tool calls, deciding: am I supposed to send this answer to the next agent or not? So why not here? If you align it really well, it should work.

Ben Epstein: Yeah, I agree, as long as you have ground-truth sets. I mean, for us it's easier because our systems are largely real time and we're [00:58:00] cost conscious, so we don't use Claude as our agent; we use Claude as our judge, which is nice. It builds some concept of separation of concerns.

Ben Epstein: I don't know how real that is. I don't know that anybody's really measured that; maybe they have, on arXiv. But yeah, we use the really expensive, max-effort, long-thinking models as our judges, and we have valid ground-truth data sets that we all align are right. At this point we have something like 800,000 samples, and we've spot-checked at least 10,000 of them.

Ben Epstein: And that's a lot. We have a lot of confidence that it is aligned to the thing we care about for each very specific, individual task.

Adam Becker: We are just about out of time. I want to pose one last question, and one of you can jump in to respond. It's again about trust, but in this case the risks [00:59:00] involved with letting any one of these agents interact with your data, or interact with your database.

Adam Becker: There's a few questions around that. How do you handle prompt-injection risks when agents interact with external tools or APIs, or user-provided data, in production? Is there anything you can tell us about gotchas to avoid when giving agents access to the DB? Or do you feel the current DBs are good enough for agentic access?

Adam Becker: Can you share anything about that?

Ben Epstein: No, yeah, I definitely would not give my agents write access to my database. Definitely not, no shot. They have read-only access, and, by the way, they have read-only access to a backup that is not on the same server as the production database. So I don't care if they take it down; they can do whatever they want.

Ben Epstein: They're in a sandboxed environment that can shut down and be recreated in an hour. So no, my agents have almost no write access, and to the extent that they do, it's unbelievably [01:00:00] limited. I'm sure other people are on different ends of that spectrum, but we serve really big enterprise customers.

Ben Epstein: I'm absolutely not risking anything like that. It's not worth it.
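(A minimal sketch of the guardrail Ben describes: a tool layer that only permits single read-only statements and connects to a replica rather than production. The connection details are hypothetical, using sqlite3 as a stand-in driver.)

```python
import sqlite3  # stand-in for your real driver; the pattern is the same

READONLY_DB = "replica.db"  # a backup/replica, not the production server

def run_readonly_query(sql: str) -> list[tuple]:
    # Deterministic pre-execution control: reject anything that is not a
    # single SELECT before it ever reaches the database.
    statement = sql.strip().rstrip(";")
    if ";" in statement or not statement.lower().startswith("select"):
        raise PermissionError("agents get single-statement, read-only access")
    # Second layer of defense: open the connection itself in read-only mode.
    conn = sqlite3.connect(f"file:{READONLY_DB}?mode=ro", uri=True)
    try:
        return conn.execute(statement).fetchall()
    finally:
        conn.close()
```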

Adam Becker: Thank you, everybody, for joining. I feel like we could keep going for hours, because there's so much we left unpacked. Maybe we should do this again; I think we probably should. Samraj, thank you very much. Apurva, it was a pleasure. Ben, thank you very much, as always.

Adam Becker: If you want to linger in the chat for just another couple of minutes, maybe drop some links: I know Oscar has asked you, Ben, to share a link, and a couple of other people have asked for links for the MLflow stuff. And I think I saw a few more things for you, about Slack and QBR.

Adam Becker: So I will leave you to it. Thank you, everybody, for tuning in. I hope this was fun [01:01:00] and useful; it certainly was for me. I have a bunch of things to go and research. Thanks, everybody.

Apurva Misra: Thank you.

Ben Epstein: Thanks for having us.
