Everything Hard About Building AI Agents Today
SPEAKERS

Shreya Shankar is a Ph.D. student in data management for machine learning.

Willem Pienaar, CTO of Cleric, is a builder with a focus on LLM agents, MLOps, and open-source tooling. He is the creator of Feast, an open-source feature store, and contributed to the creation of both the feature store and MLOps categories.
Before starting Cleric, Willem led the open-source engineering team at Tecton and established the ML platform team at Gojek, where he built high-scale ML systems for the Southeast Asian Decacorn.

At the moment Demetrios is immersing himself in machine learning by interviewing experts from around the world in the weekly MLOps.community meetups. Demetrios is constantly learning and engaging in new activities to get uncomfortable and learn from his mistakes. He tries to bring creativity into every aspect of his life, whether that be analyzing the best paths forward, overcoming obstacles, or building Lego houses with his daughter.
SUMMARY
Willem Pienaar and Shreya Shankar discuss the challenge of evaluating agents in production where "ground truth" is ambiguous and subjective user feedback isn't enough to improve performance.
The discussion breaks down the three "gulfs" of human-AI interaction—Specification, Generalization, and Comprehension—and their impact on agent success.
Willem and Shreya cover the necessity of moving the human "out of the loop" for feedback, creating faster learning cycles through implicit signals rather than direct, manual review. The conversation details practical evaluation techniques, including analyzing task failures with heat maps and the trade-offs of using simulated environments for testing.
Willem and Shreya address the reality of a "performance ceiling" for AI and the importance of categorizing problems your agent can, can learn to, or will likely never be able to solve.
TRANSCRIPT
Willem Pienaar [00:00:00]: Spend another minute or two.
Demetrios [00:00:01]: Yeah. Just in case, make sure we exhaust every last brain cell of mine.
Willem Pienaar [00:00:06]: And that shouldn't be hard.
Demetrios [00:00:15]: So we should probably kick it off with. When you go to production, something fails and you're trying to figure out why it's failing, especially in your AI system. It's not the easiest thing, it's not trivial. And so I know both of you have seen some of this, you have thoughts, let's kick it off with that.
Willem Pienaar [00:00:33]: So we're focused on building an agent that root causes alerts in production. So an alert fires in your PagerDuty or Slack. We have an agent that can diagnose that by looking at your production systems, your observability stack, planning, executing tasks, calling APIs and then reasoning about that until it distills down all the information into a root cause or at least a set of findings. And so I can tee up a few challenges that we've run into. So one of them is ground truth. Like, it's a lack of ground truth in the production environment. Unlike code or writing, there's not just this corpus of, like, web information that you could just download and then train on. So you need to figure out whether the agent even successfully solved the problem.
Willem Pienaar [00:01:16]: Like, how do you know that? So you need to find ways, either user feedback, but sometimes the users don't know. Like if you go to an engineer and say, you know, is this the root cause? Oftentimes they'll say, this looks good, but I'm not sure if it's real. And so verification is also then a secondary problem, effectively. The thing that we've learned is you need to, as much as possible, get the human out of the loop, not just from doing the work, but also the review process and the labeling and feedback process. Because otherwise you succeed or fail, but you're still dependent on them, blocked on them, to improve your agent or your AI system. Ultimately what you want to get to is a loop, like a learning loop, where a failure in production gets incorporated back into your eval system. And you want this to be very fast.
Willem Pienaar [00:02:00]: This can't be an order of days or weeks or months. It needs to be ideally hours or minutes if you can. Yeah. So we can dive into some of those areas.
Shreya Shankar [00:02:09]: I'm Shreya. So I do two kinds of areas of research. I'm a PhD student, by the way. One is on data processing with LLMs. So if you have lots and lots of unstructured data like PDFs, transcripts, text data, how can you extract semantic insights and aggregate them and make sense of them? Turns out that building pipelines to do this has all of the same challenges as building AI agents. It's just AI agents for data analysis. So happy to chat more about what it means to build evals in this, what it means to incorporate tool use, how you interplay methods like retrieval with having LLMs directly look at the data itself. And of course, what I think makes data processing a very interesting kind of petri dish for understanding LLMs and how humans and systems all interact is that when you're doing data processing, you don't know all the data that you're trying to make sense of.
Shreya Shankar [00:03:09]: The AI is kind of telling you what's in the data. So verification is so hard here. You're not only verifying the transformation, the insights that are extracted, but also that they were extracted correctly, that they exist in the data, that you didn't miss anything. But how do you know if the LLM missed an extraction if you don't know what's in your data? So there are a lot of really interesting challenges there. And then separately, because I feel like we have this pretty rich lens on how all of these problems work, I'm teaching a course on AI evals in general. How do you build AI applications that you feel confident in deploying? How do you build evals around them? When you mentioned your question, what you do when something fails in production, that's a horror story if you haven't even thought that out before you deploy it to production. You need to have some blocking in place, some metrics you're measuring. It can't just be, somebody said it failed, and then your hands are up in the air, no way to start.
Shreya Shankar [00:04:11]: You shouldn't have gotten there in the first place. So really, how do you go from zero to being able to debug?
Demetrios [00:04:17]: It feels like there's a bit of overlap here with the fuzziness. You don't understand what's in the data, and so you can't really tell if what is coming out is correct. And then also the same with on your side, you're like, we don't really understand if the root cause has been fixed or not. And so that fuzziness is in a way, like, how do you bridge that gap? How do you do it besides just. Well, I think it looks good.
Willem Pienaar [00:04:49]: Yeah. One really reductionist way to look at it is that if we're in the production environment, like in a cloud infrastructure, observability is a lot of information: time series, there's logs, there's code, there's chatter in Slack. And what you're really trying to do, the first step, is information retrieval. It's search, and you're taking very sparse information that's spread out everywhere and creating dense, like, gems, findings out of that that are contextual to the problem at hand. And I think what we've tried to do as much as possible, and I'm kind of curious to hear Shreya's take on whether this is possible in her use case, I actually want to see the difference between the use cases, is we flatten the dependency on the agentic parts of the workflow. Essentially, if you build agentic steps on top of agentic steps and a base layer is wrong, then everything above that is wrong. In our case, sometimes it still works out because the agent can go on a trajectory that is wrong and still stumble upon a finding that is good and bring that back.
Willem Pienaar [00:05:48]: But I'm curious, in her case, in your case, does that work, or is it just then a catastrophic failure to the user?
Shreya Shankar [00:05:55]: Most of the use cases we're focused on are more batch ETL style queries. So we're building this system called DocETL, where you can think of it as writing MapReduce pipelines over your data, but the LLM executes the map. The map is not a code function, it's a prompt that is executed per document. The LLM also executes a reduce operation. We do a group by and send all the documents to an LLM and it does some aggregation there. In this sense there's not really retrieval that immediately comes to mind. Retrieval can be thought of as an optimization if you can understand your MapReduce pipeline.
Shreya Shankar [00:06:34]: Your problem expressed as MapReduce is to map out insights that are relevant and then aggregate and summarize how they are relevant to the query. You can imagine some of these maps can be retrieval-like executions. You can use embeddings to filter before you use an LLM to extract the meaty insights from each document. And in that sense I would think of retrieval as an optimization. Putting my database hat on, of course.
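A minimal sketch of the LLM-executed map/reduce idea described above, using hypothetical helper names rather than DocETL's actual API: the map prompt runs once per document, and the reduce prompt aggregates the per-document outputs.

```python
# Hypothetical sketch (not DocETL's real API): each "operator" is a prompt, not code.

def call_llm(prompt: str) -> str:
    """Placeholder: wire this to any chat-completion API before running."""
    raise NotImplementedError

def llm_map(documents: list[str], map_prompt: str) -> list[str]:
    # The map prompt is executed once per document.
    return [call_llm(f"{map_prompt}\n\nDocument:\n{doc}") for doc in documents]

def llm_reduce(extracted: list[str], reduce_prompt: str) -> str:
    # The reduce prompt aggregates all per-document map outputs into one answer.
    joined = "\n---\n".join(extracted)
    return call_llm(f"{reduce_prompt}\n\nExtracted items:\n{joined}")

if __name__ == "__main__":
    # "Map out insights, then aggregate them," expressed as two prompts.
    docs = ["transcript one ...", "transcript two ..."]
    insights = llm_map(docs, "Extract the key complaints mentioned in this transcript.")
    report = llm_reduce(insights, "Group these complaints by theme and summarize each theme.")
    print(report)
```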
Willem Pienaar [00:07:07]: So just curious, at what layer does the LLM come into play? Is it planning the MapReduce operation, or is it also within the operation itself?
Shreya Shankar [00:07:14]: It is within the operation. There is a layer on top that goes from a completely natural language query to a pipeline of map, reduce, filter. Think like Spark, but every operation is...
Willem Pienaar [00:07:25]: LLM executed. And that is the natural language that comes from the user? Yes.
Shreya Shankar [00:07:29]: And in a sense, I find, we've done a few user studies on this, nobody really wants to go from NL to pipeline, because you want to think about your workflow as kind of a data flow pipeline. And anyways, you're writing prompts for map, reduce, filter, cluster, whatever data operation it is that you want. So it is low code, not no code. Where people really struggle is after they've implemented. So say you implement your application, your pipeline, in DocETL for your use case, which you can do, it'll just be really slow. It's: how do you go from initial outputs to improving the pipeline? How do you even know that the agent has failed because of an upstream issue? And then when you know something has failed, how do you encode that observation back into the prompt? You kind of have vibes that are like, oh, this is not quite right.
Shreya Shankar [00:08:31]: But to specify that, what we call bridging the gulf of specification, that is the hardest part that we see every user struggle with, especially non-technical users.
Demetrios [00:08:41]: So if I'm understanding that correctly, it's basically like I see something's wrong, but I don't exactly know how to fix it. And I'm trying to fix it by tweaking some prompts or I'm changing out a model, or I'm doing a little bit of everything that I think could make a difference.
Shreya Shankar [00:08:59]: Yep.
Demetrios [00:09:00]: And it may or may not work.
Shreya Shankar [00:09:02]: It may or may not. There's that, and there's, especially in data processing, and I think that's probably also true for your case, just such a long tail of failure modes when it comes to AI. They're failing in all these bespoke ways, and keeping track of all of that and synthesizing them to, okay, what are the three concrete instructions I need to give the LLM? LLMs are great if you have very detailed prompts. They are horrible with ambiguous prompts. Well, great for like 90% of it with ambiguous prompts.
Shreya Shankar [00:09:28]: But to get that last 10%, you're out here giving examples, giving very detailed instructions. Like, maybe that's just me, I don't know.
Willem Pienaar [00:09:37]: I mean, the edge cases is a really good one to get into, because I think this is where a lot of teams think, well, it's easy, I'll just build an agent, just fork this open source thing and it'll just work, and you get to some level of performance. But getting to very high performance is really, really hard, depending on the use case. I'm interested in the HCI perspective as well, because I'm curious, what are the control surfaces that your users have? What do you get? Because if you think about, like, Midjourney, sometimes the most frustrating thing is it's just text and you just get back a random thing every single time. So it's like at the casino, just like the slot machine. Right. And it can be so frustrating. Yeah, but if you have inpainting, then suddenly you can take an image and improve it.
Willem Pienaar [00:10:14]: Or Even the new ChatGPT image gen. Right. You can specify using a specific what you want. Totally. And that's a different level of aci. And I mean chat and all interfaces are just inferior, I guess to something more structured. So I'm curious, like what are those interfaces today? And where do you see that going over time?
Shreya Shankar [00:10:31]: Yeah. Okay, so now I have to go into history lesson. If you ask the HCI question, this.
Demetrios [00:10:36]: Is the perfect room for that. That is great.
Shreya Shankar [00:10:39]: So Don Norman, who's written a lot about design, really is the first one who came up with these gulfs. Gulfs of execution and evaluation are the one he came up with. And any product, not AI product, any interface, requires users to kind of figure out how to use it. Right. They have some latent intent in their head. Like say it's something as simple as booking a flight. You booked a flight from Germany to here. You need to figure that out in your head.
Shreya Shankar [00:11:10]: Look at the Google flights interface and kind of map that out and execute it and then see what happened and then make sense of it. And this kind of loop was very hard to do in the 80s and 90s. We solved it. But what AI brings that's very new is now this gulf of I'm going to call it specification because it's just easier for me to understand in that way. It's broken down to two things. One is how do you communicate to the AI? And then how do you generalize from the AI to your actual data or tasks? Because you can have a really great specification, right? You have a great mid journey prompt and that gets sent to the AI. But there's the gap between the intent that's well specified and the AI. And it gets wrong and then you're like, oh my gosh, did I do something wrong? So I think the first thing that we have to do as a community, and this is something that we're talking about in research circles too, is just recognizing that these are to gulfs.
Shreya Shankar [00:12:07]: Both need bridging, but the tools in which we have to do that need to be different. So for Gulf of specification, at least in the dock ETL stack, we are building tools to help people do prompt engineering, to mark examples of bad outputs and automatically suggest prompt improvements. The goal is just to get a very complete spec and we're going to make no promises that that spec is going to run as intended because we all know LLMs make mistakes. Our idea is separate tooling like task decomposition, agentic workflows. Better models are going to bridge that generalization gap.
Demetrios [00:12:44]: The gulf that you're talking about and how we've had so many years of humans interacting with computers that we figured out a way. It wasn't that hard for me to book the flights. The interface. Exactly. The interface has been polished, but now when you throw natural language into it, it is so much harder because A we're not used to it and B it is very fuzzy. Again, going back to that fuzziness of when I say a word, it may mean one thing, you may interpret it as another thing and the LLM can just. It's that going to the casino and you're playing slot machines.
Willem Pienaar [00:13:19]: And I think this is partly not just the natural language element of it, but it is having a model of what the system is doing. And if you can't figure that out, then you're just stuck at the starting line. So if you take a cursor and you're doing some software development, if you understand that it's just doing rag over your code, then you have a better mental model. And you know, okay, I can introduce these files, I can index that or if you know that the tabs that you have open are weighted higher. So having that model affects how you prompt the agents.
Shreya Shankar [00:13:49]: Absolutely. And the history too. Like I know that I have something in my keyboard or I did some edit. Yeah, yeah. I don't know why it feels like you have to know those things in order to steer. It's very difficult.
Willem Pienaar [00:14:02]: And I think that's a mistake a lot of people make is they. It's not that they want to infantilize, but they want to abstract this from users and just think, okay, this is just like a black box, don't worry about it. But it makes it much harder to use their products. And so another thing that we were experimenting with is sometimes giving the user a little bit less and giving them more UX affordances that allow us to get more feedback from them. But these are orthogonal so it may make the product slightly harder to use. So for example, we might give you. These are the findings, the key findings. But in like a mid journey Style we will give you buttons that say expand for more information or search further using this finding.
Willem Pienaar [00:14:39]: And so if we give you something that's very useful, if you click on that and expand, we know, okay, this is actually good. If you keep ignoring something that we give you, then we know this is bad. And so there's some implicit feedback that we get back. And I'm not sure if you're incorporating anything like that or.
Shreya Shankar [00:14:53]: Yeah, we're very big on like hierarchies of feedback. So the easiest thing to do is to get binary click or don't click yes or no no and then to be able to kind of drill down on that with open ended feedback. One of the things that we did that was quite successful to help people write prompts for Doc ETL pipelines was always have an open ended feedback box anywhere where you can kind of highlight the document or output that you think is bad and just stream of thought why it's like a little bit off. Oh nice color code that or tag that that lives in a database. And anytime you invoke the AI assistant or you. We also have a prompt improvement feature which can read pretty much all of your feedback and suggest targeted improvements.
Willem Pienaar [00:15:38]: So the prompts are visible to the user?
Shreya Shankar [00:15:39]: Yes, the prompts are visible. I don't think we're at the state yet where there's anything better than writing prompts for steering, especially for data processing. I think if you have a better scoped task, it's possible that they don't have to write the prompt like in your task, you're very specific, you're searching people's logs, helping them do root cause analysis. But say you are using ETL to write those pipelines, you absolutely have to write the prompt.
Willem Pienaar [00:16:08]: I guess there's always just a trade off where how much control you give the user. So we give them like a control surface like text input where they could inject some guidance either globally or contextually. So on a specific category or class of alert we can inject some guidance. So maybe there's like an SLO burn rate alert. Then you can attach something contextual that they would say. Always check datadog metrics, always check the latest Slack conversations, always check the system because it's always complicit or involved in some way to the failure. And sometimes you need the users to give you that guidance. There's so much context that's just latent in their heads that they need to somehow encode in your product.
Demetrios [00:16:47]: What it reminds me of is when you download a new app on your phone, and you have these moments going back to. There's that gap of I don't know how this app works. And so you kind of swipe around and you press buttons and you figure out, oh, okay, cool, I think I know what's going on here. And so having that. But now we're working with text and we're working with prompts, and so being able to really figure out what's the interface that's going to best interact with the human and the text. And so that expand button is huge. I really like that because then it gives a signal. It kind of just gives you a little bit, and then you get to know.
Demetrios [00:17:29]: All right.
Willem Pienaar [00:17:30]: But the thing is, you can't give them all the information. Then. Then you give them, like, a summary, just enough to, like, get them to kind of. You want more. Yeah, exactly. It's a teaser.
Shreya Shankar [00:17:40]: Yeah. And many times it's actually not even useful. People don't click on it. But that's very good signal anyway.
Willem Pienaar [00:17:45]: Good signal.
Demetrios [00:17:46]: Yeah, yeah, yeah, that's helpful. And same with the, like highlighting the text and being able to just stream whatever you want.
Shreya Shankar [00:17:52]: Yeah.
Demetrios [00:17:53]: And what I'm thinking about is when you have that, like, highlighting the text and then streaming of consciousness, how are you then incorporating that back into the system so that it learns and gets better from it?
Shreya Shankar [00:18:04]: And that's. That is why I think you need AI assistance there, because users cannot remember all of the tail, the long tail of failure modes. That is why we have a database of feedbacks. And when users need to improve prompts or want to do something new, we can suggest a prompt to them, because we already know things that they care about because we read their feedback and we always provide a suggestion or diffs to their prompts. And then they can click Accept or.
Demetrios [00:18:31]: They can click Reject, or it's reincorporating it into the prompt and also the eval set or.
Shreya Shankar [00:18:37]: Yeah, eval sets. Right now, we don't have a good workflow for it. We're still playing around with what we think is best, like generating synthetic data or having users. Currently, users would have to bring their own data. Right now.
Demetrios [00:18:50]: It's fascinating to be like, all right, cool, there's insights. There's thousands of insights here. Because of all this toying around with the output that I've had, I have left my. My precious human time has been taken to leave insights here. Now what do we do with them to make sure that they are incorporated into the next version of whatever I build? And so having an assistant and Being able to suggest, well, you might not want to do that because, remember, you said you cared about this and whatever prompt, that didn't work out well.
Shreya Shankar [00:19:23]: And it's not even that we find that I think in any we. It's not just data processing that has this problem. It's also code generation. It's also like bit journey or image generation. But when you're starting out with a session, you almost always do some exploratory analysis. There's a term for this in HCI called epistemic artifacts, and it comes from how artists use tools. Like, if they are given new paints or a new medium, they're going to play around with that before they paint their thing. And all of the interfaces that I think we build in this new.
Shreya Shankar [00:20:00]: Jenny, I like arena, for lack of a better term, need to have the ability to quickly create epistemic artifacts. Like when you're in cursor, you want to try something out and you want to be. If it doesn't look good or if it doesn't work, you want to be able to toss it.
Demetrios [00:20:16]: Yeah.
Shreya Shankar [00:20:16]: And keep moving forward.
Willem Pienaar [00:20:18]: I think that's one of the big failures today. Yes.
Shreya Shankar [00:20:21]: One of the biggest failures.
Willem Pienaar [00:20:22]: Yeah. Some of the costs are just too high to experiment and people just kind of back out of the playground or there's often just not a playground available.
Demetrios [00:20:31]: It really makes me think that this is a UI UX problem. And it's very much in the product of how do we make it as easy as possible for people to not have these. Oh, I know a little black magic on cursor, because I understand it's a rag and it's the tabs. And if I copied and pasted something, then it does better. That is something that should not be like, gatekept.
Willem Pienaar [00:20:58]: Right. Right.
Demetrios [00:20:59]: So how do you, in your product design something that is very much keeping that away from, like, oh, you have to know if you know. Great.
Shreya Shankar [00:21:11]: It's very difficult. What really muddles these kinds of IDEs or interfaces is that there's so many entry points into bridging the gulfs. Right. There's the gulf of the specification where you need to extract, externalize your intent as fully as possible. And there's the gulf of generalization, which is you need to make sure your prompt works, like, regardless of rag, regardless of, like, whatever hyperparameters that were selected. And right now, humans are specified or humans are relied on to give those hints for both the gulfs. Like, you have to know how rag works so you can give the appropriate hints to Bridge the generalization gulf. Like, that's crazy.
Willem Pienaar [00:21:49]: Yeah, yeah. Well, one of the things that we intentionally made a decision to fork off, of which we originally started with like a slack, like an agent that's in your slack. A teammate, essentially. And we actually started on the help desk and we were filling questions from engineers. So one engineer be like a platform team supporting another engineer coming in with like a question and getting in between engineers was very hard because there's a lot of, like, chit chat. There's a lot of back and forth. Often the questions are. They need immediate answers.
Willem Pienaar [00:22:19]: They're something that somebody spent a whole day on. And this is a very synchronous engagement where with the alert flow, it's a lot more asynchronous. There's not necessarily. It's a system generating alert. You can investigate that on your own time. And so if you take the cursors and the devons of the world, it's kind of similar, Right. With cursor, you're in the loop. It's the most important thing of your day that you're trying to solve with cursor.
Willem Pienaar [00:22:42]: With Devin, it's different because you're saying, code this thing for me. But it's like a side task you give to an intern, basically. And I think in our case, we're also trying to take the grunt work away from engineers that they're not immediately trying to solve. So it's more like an ambient background agent that's just doing all this work for you. And if you check in, you're like, well, okay, it solved like 20 alerts for me. I don't have to go and look at those. How do you think about this, like, dichotomy or like the difference between these two worlds? Because I think these use cases are actually different.
Shreya Shankar [00:23:12]: They're very different. And you can't rely on the human, like, hints to work because you're not the premier ID for human attention.
Willem Pienaar [00:23:22]: Exactly, yeah.
Demetrios [00:23:23]: And also going back to this, like, how have you thought about. I don't want to have it so that people need to know this secret sauce for cleric to work. Right. Or like, some people have a better experience because they know these things and they are understanding of how AI works or just RAG systems or whatever.
Willem Pienaar [00:23:46]: You want to avoid that at all if you can. Right. So engineers often come to us and say, wait, so you're going to be better at solving these problems than me? And they spend years at these companies. And we don't claim that. We just see there's so much low hanging fruit in terms of automation that we could automate away for you with these agents and then just lay up or tee up all the key things you need to make the decision. So we want to lean into their domain expertise. They are the experts and we just want to make it easier for them. So what they should be assessing is the findings and metrics and the logs and the dependency graphs and all those things that they already know.
Willem Pienaar [00:24:23]: Well, we don't want them to have to understand the internals of our product. I think that's a failure. But also because we're not a synchronous flow. You're basically looking like the AI is leading itself to an answer and it'll bail if it can't find something and continue up to a certain point if it is on the right path. But for the most part you're just producing artifacts that they can understand and intuit already.
Demetrios [00:24:43]: The stuff that takes a long time is just maybe switching from one tool to the next tool, gathering the data, trying to put two and two together. And then once you have a picture, you can start to really use your expertise. But all of that before you get to the place where you have that picture. That's where you're saying we can automate the shit.
Willem Pienaar [00:25:06]: Exactly. Often engineers are just like dreading dropping into a console or a terminal and keep cuddling. And it's the same thing every single time. And there are of course black swan events and like really tough problems that maybe even an AI can't even solve for you. But there's so much gunk and like base mechanical rote work that engineers have to do. And remember they have a full time job in a lot of cases to write software to actually make the business successful. It's not just debugging and routine investigations in the background. Right.
Demetrios [00:25:35]: I really appreciate like these. I keep forgetting the word that you're using. I'm using like the, the valley. But you're using golf. Golf.
Shreya Shankar [00:25:43]: You know, just knowing that there's a gap doesn't matter what the term is. Recognizing that is helpful.
Demetrios [00:25:49]: Yeah, there is such a gap. And now I'm going to start thinking about like all the places that there's gaps. And so maybe there's other places that you've been thinking about because you said there was two gaps and one we went over, I think heavily. What was the other one again?
Shreya Shankar [00:26:03]: Well, I think now there's three gaps before AI too. Three being specifying, then generalizing from your specification to your actual task or data. Then third is comprehension. Understanding what the hell happened. Like how do you tame the, how do you even look at the long tail of AI outputs? How do you look at your data? Did it do it right? All validation falls in that comprehension and we can go down a deep dark rabbit hole. But it's a really big cycle, right? Like after you've comprehended, then you need to specify again and then that specification needs to generalize and just bridging these three gulfs is. So I think every IDE is going to have this problem. Every product that does something moderately complex is going to have the problem.
Willem Pienaar [00:26:57]: I'd love to also get into the edge cases if we can. One of the things, we were speaking to Adam Jacobs and he was also talking about this problem in DevOps and in a lot of spaces. We know that there's like a model collapse effect model quote, unquote, you know, whatever your AI system is, there's no guarantee that you can just keep adding evals and scenarios and improve your system to, you know, get to 100%. At a certain point you may like solve one problem and then, you know, another problem rears itself.
Demetrios [00:27:22]: It's whack a mole.
Willem Pienaar [00:27:23]: It's whack a mole. And so I don't know if you've got any techniques or you know, experience from your products that you've been building.
Shreya Shankar [00:27:29]: Yeah, I think it's about saturation. It is not 100%. It's about building up the minimal case of set of evals and then to the point where you're adding more, trying new things and nothing changes like that. You're done. You can't do any better. Like you gotta wait for a new GPT model. I asked this question like every single HCI talk that I go to because I think there's so much work and you know, trying to steer, trying to make models or agents better, but there's a ceiling. And I don't think we've figured out yet what defines having hit the ceiling.
Shreya Shankar [00:28:04]: I'm really curious if you have heuristics for your use case on that.
Willem Pienaar [00:28:08]: We didn't really know where the ceiling is. We know that if you can set the right expectation with the user. So what we do is we, we have a lot of data on where, which types of alerts we can attack and, and actually 12. That's good. So we start with something like we ask them to, or our customers to export all of their alerts over the last two weeks and then we try and identify ones that we've solved either in our evals or in other companies or for other teams where we are very confident. So there's almost three buckets. The first bucket's the ones you're very confident you can solve if you deploy us. And the second bucket is the ones that you need to learn in production.
Willem Pienaar [00:28:49]: You're pretty confident you can learn those, but you don't know at what point it gets to the third bucket which is like, you can never solve these. And it's very customer dependent. But from a go to market standpoint, the first bucket is the only one that really matters. If that's big enough, then they're like, okay, it's valuable to have you in prod and that's what gives you the right to stay in their environment. And then the second one is the one you want to expand and really like prove your worth and try and.
Demetrios [00:29:14]: Find where the third one, where's that line where you cross over to the third.
Shreya Shankar [00:29:18]: But I think you're unique in thinking about it this way because many people don't even know what the first bucket is or like have a characterization of it. It's like anything goes with AI, right? You could ask it anything. It'll give you an answer, something.
Willem Pienaar [00:29:32]: It's honestly the worst thing that you can say because a customer will come to us and say, okay, if I deploy you, what can you do for me? And if you say, well, we'll figure it out, then they're like, okay, but that's not good enough. I need to know exactly what you can solve.
Demetrios [00:29:47]: Just put us into production for a few weeks and I'll tell you exactly what we do for you.
Willem Pienaar [00:29:52]: Exactly. So that's what we focus so heavily on really nailing that first class in our evals and then getting a flywheel going, this like learning loop to get the second bucket, the like learned set of alerts really high.
Shreya Shankar [00:30:06]: I, I like this framing. I think I'm going to parrot it to people. They will is doing this. But I, I have seen the problem is just like not knowing, not even having a reasonable idea of what the ceiling is and not knowing where you are right now. And it's, it really is not about numbers. It is about vibes. I'm very pro vibes but about having like some confidence band around like numbers per vibe and not like Overall we've hit 97% or whatever. Every time I read some case study or like some.
Shreya Shankar [00:30:39]: I also do a little bit of consulting or some client might say like, oh, we're at 94% accuracy. And I'm like, I What does it even mean?
Demetrios [00:30:46]: Yeah, prove that.
Shreya Shankar [00:30:47]: Yeah. Like, what I want to know is what are the three to five vibes that you like, really are trying to nail? Or like, if you have well defined accuracy metrics like time to closing. Or like, and then give me your confidence. Bance on a sample. Like, and, and then just make sure that we're kind of in those.
Demetrios [00:31:07]: I really like this idea of, hey, there's that third band that we're trying to figure out where the ceiling is. We don't know. We don't know exactly where it is. Have you found it? It's very logarithmic on the amount of time and you get to this point of diminishing returns and you start to be like, you know what? Might just have to give it up on this one.
Shreya Shankar [00:31:27]: Yeah. All the time. But that's fine. Like AI can still be impactful if it's not 100% reading your mind all the time. What makes us good at using AI is knowing when we can use it.
Demetrios [00:31:38]: Yeah.
Shreya Shankar [00:31:40]: As you said, as long as that.
Demetrios [00:31:41]: Class is big enough, there's plenty of work to automate.
Willem Pienaar [00:31:45]: Yeah. As long as you're not wasting the engineer's time. If this is a productivity focused product and you can be quiet if you are unsure about something, then it's okay. Because then if you do prompt them, it just needs to be valuable. It's like having infinite amount of insurance, but they won't come to you if they don't have anything good to say or ask.
Demetrios [00:32:00]: Yeah. That goes back to what I think about a ton is just like how disrespectful LLMs are of my time. It's like, I don't need a five page report when only one of those sentences was actually valuable to me.
Willem Pienaar [00:32:19]: And I think a lot of proxies there are very in your face AI. So it's like AI, this is stars and glitz everywhere and that's very like ubiquitous. But I think what people would really want is just like work being done for them in the background. Right. Your to do list is getting cleared, whether it's JIRA or linear or whatever.
Demetrios [00:32:37]: But are you trying to help folks with their prompting too? Or you abstract all that away and you just give them the alerts?
Willem Pienaar [00:32:46]: No, we give them the findings on an alert. We're a root cause. So we don't give them access directly to tune the prompts, but we do give them control surfaces so they can inject some guidance and it can be contextual as well. But we don't. They don't have full control over the agents. Yeah. To indicate.
Demetrios [00:33:04]: And is that, is that because of. Going back to this. It's like what you were saying earlier, where they don't need to know the black magic of how to work with AI, they just need to see because it's a, it's a totally different Persona. You don't want them having to dig through the prompts now to figure out if that's the correct way to go about it or if there's a better way.
Willem Pienaar [00:33:22]: We actually did do that at the start and we realized that what happens is every customer would then create like idiosyncratic instructions that doesn't necessarily generalize. And so that's one of the we. You kind of want to build muscles that, you know, benefit everyone. And it's like kind of like a compounding effect. And what we realized is that if, if we had that control on our side, it's harder at the start because the users have a poorer mental model, but it's better for us over time. And so to this point of like model collapse, we found that you can get to a good plateau or baseline that everyone benefits from. And the, like general models like the gpd, the drop and the sonnets, they, they help a little bit. But because these environments are so idiosyncratic and there's no public data set, like I can't just ask you to export your whole company's like infra.
Willem Pienaar [00:34:14]: It's nothing. So we, we contextualize what the agent can do, but we do it centrally. So based on performance metrics, we will say, oh, your agent has a certain set of skills that's different from another agent, but we can sub select those. So maybe if you're running Datadog and Prometheus and you're running on Kubernetes and it's all golang, we will not have like an agent's skill set that's for Python, or these skill packs that are specific to technologies that, you know, don't apply to you. So we'll try and like delete or garbage collect memories or instructions as much as we can to simplify what the agent can do. So we've got a bunch of these techniques that, you know, if you look at our base agent, it's. Or our base products the same. But we do contextually, like modify that slightly.
Willem Pienaar [00:35:03]: So the performance numbers for each customer can be higher. But that's also risky because then like the measurements don't always like on Apple to Apple. Right? Yeah.
Shreya Shankar [00:35:14]: The other thing about exposing prompts, if you're building an application is often when something is a little bit off, the first thing that people do is, like, go and, like, do some mucking around.
Demetrios [00:35:25]: In the prompt and it makes it worse.
Shreya Shankar [00:35:27]: You've already gone through years of doing that. Right. Like, they just saw the prompt for the first day and then, like, it doesn't make sense. Right. It's like exposing, in a way, I really think prompts are like code. So it's like exposing your code base to the user and it's like, no.
Demetrios [00:35:44]: You don't need to see how the sausage is made.
Shreya Shankar [00:35:46]: I think it's like, not a question of, like, proprietary secret sauce or whatever, but it's just, like, don't invite them to do something that's, like, bad for them.
Willem Pienaar [00:35:56]: And also, if you give them that surface, then you can't really take it away. Then you'll feel like, well, I've just spent this time doing this thing.
Shreya Shankar [00:36:02]: Yeah.
Demetrios [00:36:03]: And I can see a world where they spend so much time and it ends up being like, wow, this actually was a waste of time. And this tool that was supposed to save me time is now taking me more time because I'm tuning these prompts and I'm trying to make the system work the best that it can. And so they're going under the hood, wasting time in that regard.
Willem Pienaar [00:36:23]: That's. That's true. But sometimes users feel higher, like, sense of attachment and affinity to products that they've customized.
Demetrios [00:36:29]: Yeah.
Willem Pienaar [00:36:29]: If you change the colors, you put in dark mode, all those things. Before you know it, like, you'd like this product because it's yours. Right. So there's a balance. But I wouldn't expose the prompts necessarily. Maybe just some color sliders.
Demetrios [00:36:40]: So much easier. Yeah. And actually, you know, when you were talking about the different agent attributes, it reminds me of when you're playing any kind of, like, sports game and you have the sports player and it has that circle and what they're good at and what they're bad at. That's what I want to see with the different clerics that. All right, you have this agent and it's good to go lang. But Python. No, the skills in Python. But it doesn't need it.
Demetrios [00:37:05]: That's the good part.
Willem Pienaar [00:37:08]: Yeah. You should see our latest marketing. We have some things coming out soon that is in that vein.
Demetrios [00:37:12]: Yeah.
Willem Pienaar [00:37:13]: We should get you on the, like brand.
Demetrios [00:37:15]: I love it. I love it. But anyway, what else were we going to talk about? I remember there was. There was more items on the doc.
Willem Pienaar [00:37:21]: Well, I was also kind of curious from Shreya's point of view. Like, if you see fetters in production from let's say users got something they're stuck with, do they give you like a data dump or how does that work and what's it like end to end cycle for you? Like, how quickly can you go from like failure back into prod or some new version?
Shreya Shankar [00:37:38]: We don't have big like AI pipelines. Like we were building scaffolding for people to write AI pipelines. So our it's very software engineering esque. Like people will say that there is a bug, like there's an infinite loop here and I'm like, okay, I'll fix it. And it's like typescript. I don't have anything great there.
Demetrios [00:37:57]: It's on them to figure out that whole loop.
Shreya Shankar [00:38:01]: Research project, the open source research project, like everything comes through like discord or like GitHub issues or whatnot. For clients that I work with, with consulting, because they're actually companies, I find that they're actually stuck in just even like being able to detect whether something is wrong. Like, it's not even like a question of like their users complaining. Like they're so early stage where they're just like, help, did we like get this right? Like, can I, should I deploy that? But maybe that's just the people who sign up for consulting or people who are just don't have the comfort to even get there. But I think every single, every single person that I talk to, you got to have some metrics. Whether or not they correlate very strongly with what users think and say, that's fine. But like having something there to look at is a first step. And the other thing that it indirectly gives you is like if you've already instrumented your code, it's much easier to add new evals.
Shreya Shankar [00:39:07]: But people think about like, oh, like having to add evals to their. It's a huge thing because yeah, it is. Like adding observability and instrumentation is so hard.
Willem Pienaar [00:39:16]: I think there's like, there's like production failures and then there's two parts. There's one is how do you assess what happened? And so we started with traces. So in the individual run you can see what the agent did, like which tools it calls what the reasoning was, the prompts and all those things. And then the next step for us is how do you convert that into like a new scenario? Do you want to test? The problem that we had is that it's so manual to do the trace reviews and so you Drop into these traces. It's super low level. And so we built like a summarization or like a post processing process that's completely. Well, it's like part AI, part like deterministic, but it collapses and condenses and clusters all of these things together along many dimensions. So we'll see for the major groups of.
Willem Pienaar [00:40:01]: And we focus heavily on the tasks that for most of failures happen. They'd say one task is analyze the logs in Datadog, or the next one is look at the conversations in the alerts channel over the last couple of hours or days. There's specific tasks that frequently reoccur and then we try and cluster those and then we look at metrics or did the agent successfully call these APIs? Even from an engineering standpoint, are there API failures? Did it go into loops? Did it get distracted? Was it efficient in finding or solving these tasks? And like many of these metrics, you don't need humans for feedback. It's completely like you can just parse the information and then from that we build these heat maps. And then the heat maps, the rows are essentially tasks and like what the agent did and the columns are metrics. And then you can see these, like it just lights up like, okay, the agent really just sucks at this one thing. Like it sucks at querying metrics.
Shreya Shankar [00:40:57]: At Datadog, you're not the only person who's doing the heat heat maps.
Willem Pienaar [00:41:00]: And it works. When you see it, you're like, well, how did we not do this before?
Shreya Shankar [00:41:03]: Yeah, you're like the third person who's told me that.
Willem Pienaar [00:41:07]: Yeah. And then it's easy to write an eval because you're like, okay, I just need an eval that can, you know, help us matrix. And then the next problem we ran into was creating these evals sometimes took like a whole week because you need live prod infra. And so then we built like a. This is like a deeper topic, but like a simulation environment that modeled the production environment. So I was asking if you can get data dumps from your users, because in our case we couldn't. And so we had to like innovate on like the eval layer. The simulation layer.
Shreya Shankar [00:41:36]: That's very interesting. People do send their pipelines and some data to us, so it's very easy to debug for us. But I think the simulation idea is super interesting.
Demetrios [00:41:45]: Yeah, I really like that.
Shreya Shankar [00:41:47]: But it also feels very bespoke.
Willem Pienaar [00:41:51]: You mean for the use case or. Yes, yes.
Shreya Shankar [00:41:53]: And not even just like datadog, simulation, datadog, log retrieval, simulation is different from, I don't know, whatever other agents that you guys have. You probably have to build an environment per agent to some extent or build some spec per agent.
Willem Pienaar [00:42:12]: Sure. What we do is a little bit more akin to what like the SUI agent does with the SWE kit. And so you, you can have like cloth. There's a lot of similarity in the observability layer. So there's metrics, but there's like 20 different metric systems. But the idea of like, like a line graph or seeing metrics is not different. It's time series, right?
Shreya Shankar [00:42:32]: Yeah.
Willem Pienaar [00:42:33]: So you want your agent to operate at like the layer above that system. Of course there's idiosyncrasies of the technology, but if you abstract that away, then there's transferability between those. If you're good at like datadog logs, you're probably good at open search logs.
Shreya Shankar [00:42:47]: And so you don't have to change much.
Willem Pienaar [00:42:48]: You don't have to change much.
Shreya Shankar [00:42:49]: That's nice.
Willem Pienaar [00:42:50]: And so often you can get dropped into. You can just plug in an MTP for a new logging system. As long as the integration works, it'll have good performance. It's not going to be 10 out of 10. There may be some unique things. Yeah.
Shreya Shankar [00:43:00]: Having the simulation also makes it better. A lot of I tell a lot of people, like, if you want to test the reliability of something, just run it on a bunch of different logs, but slightly varying terminology, I don't know, whatever it is, and just make sure that your answer is the same. And a lot of times that's not true. And you can do some sort of anomaly detection on those outputs to figure out, okay, what are the common failures that it gives you. And now if you have this environment, that becomes very easy.
Willem Pienaar [00:43:31]: So what we did originally was we'd spin up actual environments, like actual GCP projects, actual clusters.
Shreya Shankar [00:43:37]: That's so much.
Willem Pienaar [00:43:38]: Everybody is doing this at the moment. That's kind of in the space. I'm not sure if anybody's doing the simulation approach, but what's not surprising, but it's obvious in hindsight, is like these systems are good at like repairing themselves. Like Kubernetes, like wants to bring applications up. All the development is in the software. Software is to like make sure that it works, not broken, and keeping a system broken in a state that is consistent because tests need to be deterministic. Otherwise you run your agent once, run it again and fails. But if the whole world has changed in that last five minutes because the time series are different, the logs are different.
Demetrios [00:44:11]: You're screwed. Yeah, it's worthless.
Willem Pienaar [00:44:12]: Is it this environment that's changed or is it my agent? And then you're like just doubting. And then you need a system to monitor this, the environment as well. So that was very hard. And sometimes you need to like backfill these environments with data and you accidentally load the wrong data. And then now you need to like delete your accounts and like reprovision them. It's just so slow. And so we also then went a different route of like APIs that are like mocked by LLMs. But then this also creates non determinism.
Willem Pienaar [00:44:37]: Right. So that's also a very, very challenging direction to go. And so that's where we land on the simulation approach. We create these fakes. The downside with the simulation approach is that it's not a perfect simulacrum or replica of this like datadog. Right. We can't have every API because then we're like building Datadog effectively. Right.
Willem Pienaar [00:44:54]: But you just need to get it to like a. There's like an 8020 rule where it's good enough that the agent gets fooled by it. And some of these latest models, the agent actually realizes in the simulation.
Demetrios [00:45:04]: No.
Willem Pienaar [00:45:04]: So there's a screenshot on LinkedIn that my co founder shared where it figured out it's in a simulation and it's like, that's crazy. It bailed out of an investigation because it's like this looks like a simulation environment or I can't remember the right word, but something like that. Or like a testing environment.
Demetrios [00:45:19]: Wow.
Willem Pienaar [00:45:20]: Then you see if like change the, the pod names and you have to change the logs to make it more realistic. But that's pretty cool though.
Shreya Shankar [00:45:27]: What if you like tell in the system prompt the agent up front, like, this is a simulation, but you're supposed to.
Demetrios [00:45:32]: It won't work.
Willem Pienaar [00:45:33]: That's what we're doing.
Demetrios [00:45:33]: That's what we're doing.
Willem Pienaar [00:45:34]: But I'm not sure if that's going to backfire, but we are doing that right now.
Demetrios [00:45:37]: It'll be like, yeah, I'm good. There's no production.
Willem Pienaar [00:45:41]: You basically have to tell Neo you're in the dojo. Yeah, but you're going to go into the streets. You're training now, but just train properly as if you're going to go into the streets.
Shreya Shankar [00:45:50]: That's really trippy.
Demetrios [00:45:51]: And it's cool how quickly the iteration loop then becomes. And so you're able to figure things out. And when you learn it once does then it get replicated throughout all of the different agents.
Willem Pienaar [00:46:03]: Yeah, if you, if you. Then it depends. So, okay, this is like part of the problem. The other part is like, how do you improve the agent to actually fix the problem? Right. So then you have to think about, okay, if you need causality that's like spanning multiple services, do you add a knowledge graph or service graph or need like a learning system or there's other components to the agent that you have to expand or introduce. So I'm getting to all of those things. But. But yeah, that's.
Willem Pienaar [00:46:28]: That's where the work really is.
Demetrios [00:46:35]: I don't know if you can see that dog right there. The dog lamp.
Willem Pienaar [00:46:39]: Yeah.
Demetrios [00:46:40]: Is legit.
Shreya Shankar [00:46:42]: There's like two other little dogs there.
Demetrios [00:46:45]: Cuties. Yeah, they are happy dogs.
Shreya Shankar [00:46:51]: Is English your first language?
Demetrios [00:46:54]: English is my first language. Does it not seem like it? All right, now we're cutting this shit.