MLOps Community

Insights from Cleric: Building an Autonomous AI SRE

Posted Feb 11, 2025 | Views 80
# AI SRE
# Knowledge Graphs
# Cleric AI
Speakers
Willem Pienaar
Co-Founder & CTO @ Cleric

Willem Pienaar, CTO of Cleric, is a builder with a focus on LLM agents, MLOps, and open-source tooling. He is the creator of Feast, an open-source feature store, and contributed to the creation of both the feature store and MLOps categories.

Before starting Cleric, Willem led the open-source engineering team at Tecton and established the ML platform team at Gojek, where he built high-scale ML systems for the Southeast Asian Decacorn.

Demetrios Brinkmann
Chief Happiness Engineer @ MLOps Community

At the moment Demetrios is immersing himself in Machine Learning by interviewing experts from around the world in the weekly MLOps.community meetups. Demetrios is constantly learning and engaging in new activities to get uncomfortable and learn from his mistakes. He tries to bring creativity into every aspect of his life, whether that be analyzing the best paths forward, overcoming obstacles, or building lego houses with his daughter.

SUMMARY

In this MLOps Community Podcast episode, Willem Pienaar, CTO of Cleric, breaks down how they built an autonomous AI SRE that helps engineering teams diagnose production issues. We explore how Cleric builds knowledge graphs for system understanding, and uses existing tools/systems during investigations. We also get into some gnarly challenges around memory, tool integration, and evaluation frameworks, and some lessons learned from deploying to engineering teams.

TRANSCRIPT

Willem Pienaar [00:00:00]: Willem Pienaar, CTO of Cleric. We're building an AI SRE. We're based in San Francisco. Black coffee is the way to go. And if you want to join a team of veterans in AI and infrastructure working on a really tough problem, yeah, come and chat to us.

Demetrios [00:00:18]: Boom. Welcome back to the MLOps Community podcast. I'm your host, Demetrios. Today we are talking with my good friend Willem. Some of you may know him as the CTO of Cleric AI, doing some pretty novel stuff with the AI SRE, which we dive into very deep in this next hour. We talk all about how he's using knowledge graphs to triage root-cause issues with their AI agent solution. And others of you may know Willem because he is also the same guy that built Feast, the open-source feature store. That's where I got to know him back four, five years ago.

Demetrios [00:01:02]: And since then I've been following what he is doing very closely. And it's safe to say this guy never disappoints. Let's get into the conversation right now. Let's start by prefacing this conversation with: we are recording two days before Christmas. So when it comes out, this sweater that I'm wearing is not going to be okay, but today it is totally in bounds for me to be able to wear it.

Willem Pienaar [00:01:38]: Unfortunately, I don't have a cool sweater like you and I'm in sunny San Francisco, but I guess it's got the fog. Yeah, it's Christmas vibe, dude.

Demetrios [00:01:49]: I found out three, four days ago that if you have theanine, this magic pill, with caffeine, it like minimizes the jitters. So I have taken that as an excuse.

Willem Pienaar [00:02:05]: L-theanine or which?

Demetrios [00:02:06]: Yeah, you've heard of it? Yeah, yeah, dude, I've just been abusing my caffeine intake and pounding these pills with it. It's amazing. I am so much more productive. So that's my 2025 secret for everyone.

Willem Pienaar [00:02:21]: Add on a bit of magnesium for better sleep, or actual sleep.

Demetrios [00:02:27]: All right, man, enough of that. You've been building Cleric. You've been coming on occasionally to the different conferences that we've had and sharing your learnings. But recently you put out a blog post, and I want to go super deep on this blog post on what an AI SRE is, just because it feels like SREs are very close to the MLOps world, and AI agents are very much what we've been talking about a lot, as you were presenting at the Agents in Production conference. The first thing that we should start with is just what a hard problem this is, and why is it hard?

Willem Pienaar [00:03:09]: We can dive into those areas, and I think we're going to get into that at length in this conversation. Maybe just to set the stage: everyone is building agents, agents are all the hype right now, but every use case is different. Right? You've got agents in law, you've got agents for writing blog posts, you've got agents for social media. One of the tricky things about our space is this: the two main things an engineer does are that they create software, and then they deploy it into a production environment where it runs and operates and actually has to have an impact on the real world. That second world, the operational environment, is quite different from the development environment. The development environment has tests, it has an IDE, it has tight feedback cycles. Often it has ground truth.

Willem Pienaar [00:03:52]: Right. So you can make a change and see if your tests pass. There are permissionless datasets that are out there. So you can go to GitHub and you can find millions of issues that people are creating, and PRs that are the solutions to those issues. But consider the production environment of an enterprise company: where do you find the datasets that represent all the problems they've had and all the solutions? It's not just lying out there. Right. You can get some root causes and things that people have posted as blog posts. But this is an unsupervised problem for the most part.

Willem Pienaar [00:04:25]: It's a very complicated problem. I guess we can get into those details in a bit, but that's really what makes this so challenging. It's complex, sprawling, dynamic systems.

Demetrios [00:04:36]: Yeah. The complexity of the systems does not help. And I also think with the rise of the coding copilots, does that not also make things more complex? Because you're running stuff in a production environment that maybe you know how it got created, maybe you don't.

Willem Pienaar [00:04:57]: Massively. And I think even at our scale, a small startup, it's become a topic internally: how much do we delegate to AI? Because we are also outsourcing and delegating to our own agents internally that produce code. So I think all teams are trying to get to the boundaries of understanding and confidence. You're building these modular components like Lego blocks, with internals you're unsure about, but you're shipping into production and seeing how that succeeds and fails because it gives you so much velocity. So the ROI is there, but the understanding is one of the things you lose over time. And I think at scale, where the incentives aren't aligned, where you have many different teams and they're all being pressured to ship more, belts are being tightened so there's not a lot of headcount and they have to do more, the production environment is really... People are putting their fingers in the dam wall, but eventually it's going to break. It's unstable at a lot of companies.

Willem Pienaar [00:05:50]: Yeah. So coding is going to make or AI generated coding is really going to make this a much more complex system to deal with. So the dynamics between these components that interrelate where there's much less understanding is going to explode. Yeah, we're already seeing that, dude.

Demetrios [00:06:08]: There's so many different pieces on the complex systems that I want to dive into. But the first one that stood out to me and has continued to replay in my mind is this knowledge graph that you presented at the conference and then subsequently in your blog post and you made the point of saying this is a knowledge graph that we created on a production environment. But it's not like it's a gigantic Kubernetes cluster. It was a fairly small Kubernetes cluster and all of the different relations from that and all the Slack messages and all the GitHub issues and everything that is involved in that Kubernetes cluster you've mapped out. And that's just for one Kubernetes cluster. So I can't imagine across a whole entire organization like an enterprise size, how complex this gets.

Willem Pienaar [00:07:01]: Yeah. So if you consider that specific cluster or graph I showed you, it was the OpenTelemetry reference architecture. It's like a demo stack, it's like an e-commerce store. It's got about 12, 13 services, roughly in that range. I've only shown you literally like 10% of the relations, maybe even less. And it's only at the infrastructure layer. Right.

Willem Pienaar [00:07:21]: So it's not even talking about buckets and cloud infra. Nothing about nodes, nothing about application internals. Right. So if you consider one cloud project, like a GCP project or AWS project, there's a whole tree: there are the networks, the regions, down to the Kubernetes clusters. Within a cluster there are the nodes; within the nodes, the pods; within the pods, potentially multiple containers; and within each of those, many processes. Each process has code with variables, and all of it creates this tree structure.

Willem Pienaar [00:07:52]: But then between those nodes in the tree you can also have interrelations. Right. Like a piece of code here could be referencing an IP address, but that IP address is provisioned by some cloud service somewhere, and it's also connected to some other systems. And you can't not use that information, right? Because if a problem lands on your lap, you have to causally walk that graph upstream to find the root cause. In the security space this is a pretty well-studied problem, and there are traditional techniques people have been using to extract this from cloud environments, but LLMs really unlock a new level of understanding there. They're extremely good at extracting these relationships, taking really unstructured data. It can be conversations that you and I have, it can be Kubernetes objects, it can be all of these, the whole spectrum from unstructured to structured.

Willem Pienaar [00:08:42]: You can extract structured information. So you can build these graphs. The challenge really is twofold. So, you know, you need to use this graph to get to a root cause, but it's fuzzy, right? As soon as you extract that information, you build that graph, it's out of date almost instantly because systems change so quickly, right? So somebody's deploying something, an IP address gets rolled, pod names change. And so you need it to be able to make efficient decisions with your agent, right? So just to anchor this, our agent is essentially a diagnostic agent right now. So it helps teams quickly root cause a problem. So if you've got an alert that fires or if an engineer presents an issue to the agent, it quickly navigates this graph and its awareness of your production environment to find the root cause. If it didn't have the graph, it could still do it through first principles.

Willem Pienaar [00:09:36]: It could still say, looking at everything that's available, I'll try this, I'll try that. But the graph allows it to very efficiently get to the root cause. That fuzziness is one of the challenges, the fact that it's out of date so quickly, but it's so important to still have it regardless.
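
To make the shape of that graph concrete, here is a minimal sketch, not Cleric's implementation, of how infrastructure entities plus both deterministic and LLM-extracted "fuzzy" relations could be stored. It uses networkx; the entity names, the confidence value, and the observed_at staleness timestamp are illustrative assumptions.

```python
import time
import networkx as nx

# Nodes are infrastructure entities, edges are typed relations.
graph = nx.MultiDiGraph()

# Deterministic layer: facts read straight from the cloud / Kubernetes APIs.
graph.add_node("cluster/prod-us1", kind="cluster")
graph.add_node("pod/checkout-7d9f", kind="pod", namespace="shop")
graph.add_node("bucket/order-exports", kind="bucket")
graph.add_edge("cluster/prod-us1", "pod/checkout-7d9f", relation="schedules")

# Fuzzy layer: a relation an LLM extracted from config, code, or Slack.
# It carries a confidence and a timestamp because it goes stale quickly.
graph.add_edge(
    "pod/checkout-7d9f",
    "bucket/order-exports",
    relation="writes_to",
    confidence=0.7,
    observed_at=time.time(),
)

# During an investigation the agent walks upstream from the failing entity.
for upstream, _, data in graph.in_edges("pod/checkout-7d9f", data=True):
    print(f"{upstream} --{data['relation']}--> pod/checkout-7d9f")
```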

Demetrios [00:09:54]: There's a few things that you mentioned about how with the vision or the understanding of the graph, you can escalate up issues that may have been looked at in isolation as not that big of a deal. And so can you explain how that works a little bit?

Willem Pienaar [00:10:13]: So if you draw a box around the production environment, there are two kinds of issues, right? There are those that you have alerts for and are aware of, so you tell us, okay, my alert fired, here's a problem, go and look at it. The other is we scan the environment and we identify problems. The graph is built in two ways. One is a background job where it's just looking through your infrastructure, finding new things and updating itself continuously. And the other is when the agent's doing an investigation and it sees new information and it just throws that back into the graph, because it's got the information, it might as well just use it to update the graph.

Willem Pienaar [00:10:49]: But in this background scanning process, it might uncover things that it didn't realize were a problem, and then it sees, okay, this is actually a problem. For example, it could process your metrics, or it could look at the configuration of your objects in Kubernetes, or maybe it finds a bucket and, while updating that node with the new state of the bucket, it sees it's exposed publicly. So then it could surface this to an engineer and say your data is being exposed publicly, or you've misconfigured this pod and the memory is growing for this application and in about an hour or two this is going to crash. Yeah. So there's a massive opportunity for LLMs to be used as reasoning engines that can infer and predict an imminent failure so you can prevent it. You get a proactive kind of alerting. That is of course quite inefficient today if you just slap an LLM or a vision model onto a metrics graph or onto your objects in your cloud infrastructure, but there's massive low-hanging fruit there where you distill a lot of those inferencing capabilities into more fine-tuned or more purpose-built models for each one of these tasks.
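
The background-scan idea above can be sketched as a cheap pass over a snapshot of the environment that only surfaces findings that look actionable. This is a hedged illustration; list_buckets, pod_memory_series, and notify_engineer are hypothetical helpers, not Cleric's tooling.

```python
# Hypothetical helpers: list_buckets() -> [{"name": ..., "public_access": bool}],
# pod_memory_series() -> {pod_name: [memory_samples]}, notify_engineer(text).

def scan_for_findings(list_buckets, pod_memory_series, notify_engineer):
    findings = []

    # Check 1: a bucket whose updated state shows it is publicly exposed.
    for bucket in list_buckets():
        if bucket.get("public_access"):
            findings.append(f"Bucket {bucket['name']} is publicly exposed")

    # Check 2: a pod whose memory keeps growing and may crash soon.
    for pod, samples in pod_memory_series().items():
        if len(samples) >= 3 and samples[-1] > samples[0] * 1.5:
            findings.append(f"Memory for {pod} grew >50% this window; likely to crash")

    for finding in findings:
        notify_engineer(finding)
    return findings
```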

Demetrios [00:12:02]: But how does the scanning work? Because I know that you also mentioned the agents will go until they run out of credit or something, or until they hit their spend limit, when they're trying to root-cause some kind of a problem. But I can imagine that you're not just continuously scanning. Or are you kicking off scans every X amount of seconds or minutes or days? Yeah.

Willem Pienaar [00:12:30]: So there are different parts to this. If we do background scanning, graph building, we try and use more efficient models. So because of the volume of data, you don't use expensive models that are used for like, you know, very accurate reasoning.

Demetrios [00:12:46]: Yeah.

Willem Pienaar [00:12:46]: And so the costs are lower, and you set like a daily budget on that and then you run up to the budget. This is not something that's constantly running and processing large amounts of information. Think about it like a human. Right. You wouldn't process all the logs and all the information in your cloud infrastructure, you just get a lay of the land. What are the most recent deployments, what are the most recent conversations people are having in Slack? Just get like a play by play, so that when an issue comes up, you can quickly jump into action. You've got fast thinking, you can make the right decisions quickly. But in an investigation, we set a cap.

Willem Pienaar [00:13:21]: We say per investigation, let's say make it 10 cents or make it a dollar or whatever. And then we tell the agent, this is how much you've been assigned. Use it as best you can. Go find information that you need through your tools and then allow the human to say, okay, go a bit further, or if I stop here, I'll take over.

Demetrios [00:13:42]: Wow.

Willem Pienaar [00:13:42]: And so we bring the human in the loop as soon as the agent has something valuable to present to them. So if the agent goes off on a quest and it finds almost nothing, it can present that to the humans, say, no, nothing, or say, okay, couldn't find anything or just remain quiet. Depends on how you've configured it. But it'll always stop at that budget limit.
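
As an illustration of the per-investigation cap and the "stay quiet unless there is something valuable" behaviour described here, a loop like the following sketch could work; the agent object, its next_step() method, and the budget and confidence numbers are assumptions for the example.

```python
def investigate(agent, budget_usd=1.00, min_confidence=0.7):
    """Run agent steps until the assigned budget is exhausted."""
    spent = 0.0
    findings = []
    while spent < budget_usd:
        # Hypothetical API: each step returns its cost and an optional finding,
        # where a finding is a dict like {"summary": ..., "confidence": 0.8}.
        cost, finding = agent.next_step()
        spent += cost
        if finding is not None:
            findings.append(finding)

    # Only surface results worth an engineer's attention; otherwise stay quiet.
    confident = [f for f in findings if f["confidence"] >= min_confidence]
    return confident or None  # None means "couldn't find anything" / silence
```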

Demetrios [00:14:01]: Yeah. The benefit of it not finding anything also is that it will narrow down where the human has to go in search. So now the human doesn't have to go and look through all this crap that the AI agent just looked through, because ideally, if the agent didn't catch anything, it's hopefully not there. And so the human can go and look in other places first. And if they exhaust all their options, they can go back and try and see where the agent was looking and see if that's where the problem is.

Willem Pienaar [00:14:33]: I think this comes back to the fundamental problem here, and maybe we glossed over some of it, like the tools that solve the problem of operations, of being on call. No amount of Datadogs or dashboards or kubectl commands will free your senior engineers up from getting into the production environment. So what we're trying to get to is end-to-end resolution. When we find a problem, can the agent go all the way? Multiple steps, which today require engineers' reasoning and judgment, looking at different tools, understanding tribal knowledge, understanding why systems have been deployed. We want to get the agents there, but you can't start there, because this is an unsupervised problem. You can't just start changing things in production. Nobody would do that right now.

Willem Pienaar [00:15:23]: If you scale that back from resolution, meaning change, like code-level change, Terraform things in your repos; if you walk it back from that, it's understanding what the problem is. And if you walk it back further from that, it's search space reduction: triangulating the problem into a specific area, maybe not saying the line of code, but saying, here's the service or here's the cluster. And that's already very compelling to a human. Or you can say it's not these 400 other cloud clusters or providers or services, it's probably in this one. And that is extremely useful to an engineer today. So search space reduction is one of the things that we are very reliable at and where we've started. And we start in a kind of collaborative mode, so we quickly reduce the search space, we tell you what we checked and what we didn't.

Willem Pienaar [00:16:10]: And then as an engineer you can say, okay, here's some more context, go a bit further and try this piece of information. And in that steering and collaboration, we learn from engineers, they teach us, and we get better and better over time on this road to resolution.

Demetrios [00:16:25]: Yeah, I know you mentioned memory and I want to get into that in a sec. But keeping on the theme of money and cost and the agents having more or less a budget that they can go expend and try and find what they're looking for, do you see that agents will get stuck in recursive loops and then use their whole budget and not really get much of anything? Or is that something that was fairly common six or ten months ago, but now you've found ways to counterbalance that problem?

Willem Pienaar [00:17:02]: This problem space is one where small additions or improvements to your product make a big difference over time because they compound. We've learned a lot from the coding agents like SWE-agent and others. So one of the things they found was that when the agent succeeds, it succeeds very quickly; when it fails, it fails very slowly. So typically you can even use that as a proxy: if the agent has run for 3, 4, 5, 6, 7 minutes, it's probably wrong, even if you don't score it at all. And if it came to a conclusion quickly, like at 30 seconds, it's probably going to be right. Our agents sometimes do chase their tails. So we have a confidence score and we have a critique at the end that assesses the agent. So we try and not just spam the human.

Willem Pienaar [00:17:47]: Ultimately it's about attention and saving them time. So if you keep throwing bad findings and bad information at them, they'll just rip you out of their production environment because it's going to be noisy, right? That's the last thing they want. So yes, depending on the use case, the agent can go into a recursive loop or it can go in a direction that it shouldn't. So for us, a really effective mechanism to manage that is understanding where we're good and where we're bad. For each issue or event that comes in, we do an enrichment and then we build the full context of that issue. And then we look at: have we seen this in the past? Have we solved similar issues, how have we solved them in the past, and have we had positive feedback? And so if we check the right historical context, we get a good idea of our confidence on something before presenting that information to you with the ultimate set of findings.

Willem Pienaar [00:18:36]: But yeah, sometimes it does go awry.
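
The confidence scoring itself is part of Cleric's IP, but a toy combination of the ingredients mentioned here, an end-of-run critique, similarity to past issues with engineer feedback, and the "fast success" duration proxy, might look like this; the weights and field names are invented for illustration.

```python
def confidence(critique_score, similar_past_issues, duration_seconds):
    # critique_score: 0..1 from a critic pass that reviews the diagnosis.
    # similar_past_issues: [{"similarity": 0..1, "feedback": +1 or -1}, ...]
    history = 0.5  # neutral prior when we have no comparable history
    if similar_past_issues:
        weighted = sum(i["similarity"] * i["feedback"] for i in similar_past_issues)
        history = 0.5 + 0.5 * max(-1.0, min(1.0, weighted / len(similar_past_issues)))

    # Fast conclusions tend to be right; long investigations tend to be wrong.
    speed = 1.0 if duration_seconds < 60 else max(0.2, 60 / duration_seconds)

    return 0.5 * critique_score + 0.3 * history + 0.2 * speed
```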

Demetrios [00:18:40]: I'm trying to think: is the knowledge graph something that you are creating once, getting an idea of the lay of the land, and then stuff doesn't really get updated until there's an incident and you go and explore more? And what kind of knowledge graphs are you using? Are you using many different knowledge graphs? Is it just one big one? How does that even look in practice?

Willem Pienaar [00:19:05]: We originally started with one big knowledge graph. The thing with these knowledge graphs is that often the fastest way to build them is deterministic methods. So you can run kubectl and just walk the cluster with traditional techniques; there's no AI or LLM involved. But then you want to layer on top of that the fuzzy relationships, where you see this container has a reference to something over there, or this config map mentions something that I've seen somewhere else. And so what we've gone towards is a more layered approach. We have multiple graph layers, where some of them have higher confidence and durability and can be updated quickly, or perhaps using different techniques, and then you layer the more fuzzy layers on top of that.

Willem Pienaar [00:19:51]: So you could use an LLM to kind of canvas the landscape between clusters, or from a Kubernetes cluster to maybe the application layer or to the layers below. But using smaller micro graphs has been easier for us from a data management perspective.
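
For the deterministic base layer, a no-LLM cluster walk of the kind described above can be sketched with plain kubectl output; recording pod-to-node placement is just one example of what such a layer might hold.

```python
import json
import subprocess

def kubectl_pod_layer(namespace="default"):
    """Deterministic graph edges from `kubectl get pods -o json` (no LLM involved)."""
    out = subprocess.run(
        ["kubectl", "get", "pods", "-n", namespace, "-o", "json"],
        capture_output=True, text=True, check=True,
    ).stdout
    edges = []
    for pod in json.loads(out)["items"]:
        name = pod["metadata"]["name"]
        node = pod["spec"].get("nodeName")
        if node:
            edges.append((f"node/{node}", f"pod/{name}", "runs"))
    return edges
```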

Demetrios [00:20:08]: What are other data points that you're then mapping out for the knowledge graph that can be helpful later on when the AI SRE is trying to triage different problems?

Willem Pienaar [00:20:22]: In most teams there's an 80/20, like Pareto, distribution of value.

Demetrios [00:20:29]: Yeah.

Willem Pienaar [00:20:30]: So some of the key factors are often found in the same systems. I think it was Meta that had some internal survey where they found that 50 or 60% of their production issues were just due to config or code changes, anything that disrupted their prod environment. So if you're just looking at what people are deploying, if you're following the humans, you're probably going to find a lot of the problems. So monitoring Slack, monitoring deployments is one of the most effective things to do. Looking at releases or changes that people are scheduling and understanding those events, and having an assessment of that.

Willem Pienaar [00:21:06]: And then in the resolution path, or the way to build the resolution: looking at runbooks, looking at how people have solved problems in the past. Often what happens is a Slack thread gets created. Right. So the Slack thread is like a contextual container for how you go from a problem, which somebody creates a thread for, to a solution. And summarizing these Slack threads is extremely useful. You can basically say, this engineer came in with this problem, this was the discussion, and this was the final conclusion. And there's often a PR attached to that.

Willem Pienaar [00:21:40]: So you can condense that down to almost like guidance or like a runbook. Yeah. And attaching that to novel scenarios is useful because it shows you how this team does things. And they often contain tribal knowledge. Right. So: this is how we solve problems at our company, we connect to our VPNs like this, we access our systems like this.

Willem Pienaar [00:21:59]: These are the key systems. Right. The most important systems in your production environment will be referenced by engineers constantly. Yeah. Often through shorthand notations. And if you speak to engineers at most companies, those will be the two bigger problems. Right.

Willem Pienaar [00:22:16]: One is you don't understand our systems and our processes and our context. And the second one is you don't know how to integrate with or access these, because they're custom and bespoke and homegrown. And so those are the two challenges that we face as, like, an agent. Basically, we're like a new engineer on the team, and you need to be taught by this engineering team. If you're not taught, then you're never going to succeed. I hope that answers your question.
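
A hedged sketch of the Slack-thread condensation described a moment ago, turning an incident thread (and any linked PR) into runbook-style guidance; llm_complete is a hypothetical LLM call and the prompt wording is illustrative.

```python
def summarize_thread(messages, llm_complete, linked_pr=None):
    # messages: [{"author": ..., "text": ...}, ...] pulled from a Slack thread.
    transcript = "\n".join(f"{m['author']}: {m['text']}" for m in messages)
    prompt = (
        "Summarize this incident thread as a short runbook entry. Include the\n"
        "symptom, the root cause, the fix, and any team-specific shorthand or\n"
        "tribal knowledge (system nicknames, VPN/access steps).\n\n"
        f"Thread:\n{transcript}\n\nLinked PR: {linked_pr or 'none'}"
    )
    return llm_complete(prompt)
```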

Demetrios [00:22:42]: Yeah. And how do you overcome that? You just are creating some kind of a glossary with these shorthand things that are fairly common within the organization or what?

Willem Pienaar [00:22:56]: Yeah. So there are multiple layers to this, and I think this is quite an evolving space. Thankfully, LLMs are pretty adaptive and forgiving in this regard, so we could experiment with different ways to summarize at different levels of granularity. So we've looked at, okay, can you just take a massive amount of information and shove that into the context window, give it in a relatively raw form? And that works, but it's quite expensive. Or you show it a more condensed form and you say, this is just the tip of the iceberg; for any one of these topics, you can do a deeper query using this tool and get more information. Yeah, and it's not always easy to know which one is best, because it depends on the issue at hand. Right.

Willem Pienaar [00:23:38]: Because sometimes a key factor, the needle in the haystack, is buried one level deeper and the agent can't see it, because it has to call the tool to get to it. So we typically err on the side of spending more money and just having the agents see it, and then optimizing costs and latency over time. For us, it's really about being valuable out of the gate. Engineers should find us valuable, and in that value the collaboration starts, and then it creates a virtuous cycle where they feed us more information, they give us more information, they get more value because we take more grunt work off their plate. And it's like training a new person on your team. If you see that, oh, this person is taking on more and more tasks, yeah, I'll just give them more information, I'll give them more scope.

Demetrios [00:24:21]: Yeah, I want to go into a little bit of the ideas that you're talking about there, like how you can interact with the agent. But I feel like the gravitational pull towards asking you about memory and how you're doing that is too strong, so we've got to go down that route first. And specifically, are you just caching these answers? Are you caching, like, successful runs? How do you go about knowing that something was successful, and then where do you store it? How do agents get access to that so they know, oh, we've seen this before? Yeah, cool. Boom. It feels like that is quite complex. In theory, you would be like, yeah, of course we're just going to store these successful runs.

Demetrios [00:25:13]: But then when you break it down and you say, all right, what does success mean and where are we going to store it and who's going to have access to that and how are we going to label that as successful? Like I was thinking, how do you even go about labeling this kind of shit? Because is it you sitting there clicking and human-annotating stuff, or are you throwing it to another LLM to say, yay, success? What does it look like? Break that whole thing down for me. Because memory feels quite complex when you really look at it.

Willem Pienaar [00:25:46]: A big part of this is also the UX challenge, because people don't want to just sit there and label. I think people, especially engineers, are really tired of slop code; they're just being thrown this slop and then they have to review it, when they want to create. And I think that's what we're trying to do: free them up from support. But in doing so, you don't want to get them to constantly review your work with no benefit. So that's the key thing. There has to be interaction where there's implicit feedback and they get value out of that. And so I'm getting to your point about memory. Effectively there are three types of memory.

Willem Pienaar [00:26:23]: There's the knowledge graph, which captures the system state and the relations between things. Then there's episodic and procedural memory. So the procedural memory is like how to ride a bicycle. You've got your brakes here, you've got your pedals here. It's like the guide, it's almost like the runbook. But the runbook doesn't describe for this specific issue that we had on this date, what did we do? The instance of that is the episode or the episodic memory. And both of those need to be captured. Right.

Willem Pienaar [00:26:55]: So when we start, we are indexing your environment, getting all these relations and things. And then we also look at, okay, are there things that we can extract from this world where we've got procedures. And then finally, as we experience things, or as we understand the experiences of others within this environment, we can store those as well. We have really spent a lot of time, and most companies care about this a lot, securing data. So we are deployed in your production environment and we only have read-only access, so our agent cannot make changes, we can only make suggestions. So all your data...

Demetrios [00:27:31]: You want to change that eventually, right? Later we'll talk about how you want to get to a different state, but continue.

Willem Pienaar [00:27:38]: Yeah, yeah, we want to get to closed-loop resolution, but that's a longer path. So we're storing all of these memories, and I think the most valuable ones are the episodes. Right? Those are the instances: this happened, and I solved it in this way. We had a Black Friday sale, the cluster fell over, we scaled it up, and later we saw it was working, and we did that two or three times. And we think that's a good pattern, scaling is effective. But that's all captured in the environment of the customer. Our primary means of feedback is monitoring system health post-change.
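
To illustrate the three memory types, here are toy data shapes for procedural and episodic memory alongside a store kept inside the customer's environment; the field names are assumptions, not Cleric's schema.

```python
from dataclasses import dataclass, field

@dataclass
class ProceduralMemory:
    """'How to ride the bicycle': runbook-like guidance, not tied to one incident."""
    title: str
    steps: list[str]

@dataclass
class EpisodicMemory:
    """A specific incident on a specific date and what was done about it."""
    issue: str
    date: str
    actions_taken: list[str]
    outcome: str            # e.g. "scaled up, recovered"
    feedback: int = 0       # +1 suggestion approved, -1 rejected, 0 unknown

@dataclass
class MemoryStore:
    """Kept in the customer's environment; the knowledge graph lives alongside it."""
    procedures: list[ProceduralMemory] = field(default_factory=list)
    episodes: list[EpisodicMemory] = field(default_factory=list)
```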

Demetrios [00:28:22]: Oh, nice.

Willem Pienaar [00:28:23]: We can look at the system and see that this change has been effective. And we can look at the code of the environment, whether it's the application code or the infrastructure code, basically as a masking problem: do we see, or can we predict, the change the human will make in order to solve this problem? And if they do then make that change, especially if it was our recommendation, then we see that they've actually green-lit what we've done. Right. They've actually approved our suggestion.
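
A rough sketch of that masking-style feedback signal: compare the change the agent suggested with the change the human actually shipped. The plain diff similarity below is a crude stand-in for whatever Cleric really measures.

```python
import difflib

def change_agreement(suggested_patch: str, human_patch: str) -> float:
    """Roughly 1.0 when the human's change matches the agent's suggestion."""
    return difflib.SequenceMatcher(None, suggested_patch, human_patch).ratio()

# Usage idea: a high ratio on a change the agent recommended counts as an
# implicit "green light"; a low ratio is weak evidence the suggestion was off.
```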

Demetrios [00:28:50]: Yeah.

Willem Pienaar [00:28:51]: That is not a super rich data source, because the change that they make may be slightly different, or we may not have access to those systems. A more effective way is interaction. So if we present findings and say, here are five findings and here's our diagnosis, and you say this is dumb, try something else, then we know that was bad. So we get a lot of negative examples. Right. So this is bad. And so it's a little bit lopsided. But then when you eventually say, okay, I'm going to approve this and I'm going to blast this out to the engineering team, or I'm going to update my PagerDuty notes, or I want you to generate a pull request from this information.

Willem Pienaar [00:29:29]: Then suddenly we've got positive feedback on that. In the user experience it's really an implicit source of information, that interaction with the engineer, and it gets attached to these memories. But ultimately, at the end of the day, it's still a very sparse dataset. For these memories you may not have true labels. And so for us, a massive investment has been our evaluation bench, which is external from customers, where we train our agents and we do a lot of really handcrafted labeling, where even a smaller dataset gets the agent to a much, much higher degree of accuracy. So you want a bit of both, right? You want the real production use cases with engineering feedback, which does present good information. But the eval bench is ultimately the firm foundation that gives you that coverage at the moment.
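
A minimal sketch of an eval-bench loop in that spirit: handcrafted scenarios with a known root cause, scored against the agent's diagnosis. run_agent and the scenario format are assumptions for illustration.

```python
SCENARIOS = [
    {"alert": "checkout 5xx spike", "root_cause": "bad config rollout on checkout"},
    {"alert": "orders queue backlog", "root_cause": "db connection pool exhausted"},
]

def evaluate(run_agent, scenarios=SCENARIOS):
    """Fraction of handcrafted scenarios where the diagnosis names the root cause."""
    hits = 0
    for s in scenarios:
        diagnosis = run_agent(s["alert"])
        if s["root_cause"].lower() in diagnosis.lower():
            hits += 1
    return hits / len(scenarios)
```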

Demetrios [00:30:20]: But it feels like the evals have to be specific to customers, don't they? And it also feels like each deployment of each agent has to be a bit bespoke and custom per agent. Or am I mistaken in that one?

Willem Pienaar [00:30:35]: The patterns are very similar, so the agents are pretty generalized. The agents get contextual information per customer, so localized, customer-specific procedures and memories and all those things get injected. But those are layered on the base, which is developed inside of our product. Right. Like in the mothership, or actually it's called the Temple of Cleric. So we distribute new versions of Cleric, and our prompts, our logic, our reasoning, generalized memories or approaches to solving problems are imbued in a divine way into the Cleric at its center.

Willem Pienaar [00:31:16]: It's a layering challenge. Right, because you do want to have cross-cutting benefits to all customers and accuracy driven by the eval benchmark, but also customization on their processes and customer-specific approaches.

Demetrios [00:31:32]: All right, so there are a few other things that are fascinating to me when it comes to the UI and the UX of how you're doing things, specifically how you are very keen on not giving engineers more alerts unless it absolutely needs to happen. And I think that's something that I've been hearing since 2018, and it was all about alert fatigue, and how when you have complex systems and you set up all of this monitoring and observability, you inevitably are just getting pinged continuously because something is out of whack. And so the ways that you made sure to do this, and I thought this was fascinating, are: A, have a confidence score, so be able to say, look, we think that this is like this and we're giving it 75% confidence that this is going to happen or this could be bad or whatever it may be. And then B, if it is under a certain confidence score, you just don't even tell anyone and you try and figure out whether it actually is a problem. And I'm guessing you continue working or you just forget about it. Explain that whole user experience and how you came about that.

Willem Pienaar [00:32:54]: Yeah, we realized, because this is a trust-building exercise, we can't just respond with whatever we find. And sometimes the agents just can't, especially during the onboarding phase; they don't have the necessary access and they don't have the context. Right. And so at least at the start, when you're training the agent, you don't want it to just spam you with these raw ideas. And so the confidence score was one that I think a lot of teams are actually trying to build into their products as agent builders. It's extremely hard in this case because it's such an unsupervised problem.

Willem Pienaar [00:33:30]: I'm trying not to get into the raw details, because there's a lot of effort we've put into that. Building this confidence score is a big part of our IP: how do we measure our own success, come up with sources of information?

Demetrios [00:33:44]: A divine name for the IP or something. It's not your IP, it's your... what was it when Moses was up on the hill and he got the revelation? Yeah, this is not your IP, this is your revelations that you had.

Willem Pienaar [00:33:56]: Yeah. But the high level is basically that it's really driven by this data flywheel, it's really driven by experience, and that's also how an engineer does things. But those can again be layered: from the base layers of the product, but also experiences in this company. So we do use an LLM for self-assessment, but it's also driven and grounded by existing experiences. So we inject a lot of those experiences, and whether those were positive or negative outcomes. And as an engineer, you can set the threshold, so you can say only extremely high-relevance findings or diagnoses should be shown. And you can set the conciseness and specificity, so you can say, I just want one sentence, or just give me a word, or give me all the raw information.

Willem Pienaar [00:34:52]: So what we do today is we're very asynchronous. An alert fires, we'll go on a quest, we'll find whatever information we can and come back. If we're confident, we'll respond. If not, we'll just be quiet. But then you can engage with us in a synchronous way. So it starts async and then you can kick the ball back and forth in a synchronous way. And in the synchronous mode, it's very interactive and lower latency.

Willem Pienaar [00:35:18]: We will almost always respond. If you ask us a question, we'll respond. So then the confidence score is less important, because then the user is refining that answer, saying, go back, try this, go back, try this. But for us, the key thing is we have to come back with good initial findings, and that's why the confidence score is so important. But again, it's really driven by experiences. Just to reiterate why this is such a complex problem to solve: you can't just take a production environment and say, okay, I'm going to spin this up in a Docker container and reproduce it at a specific point in time. At many companies, you can't even do a load test across services.

Willem Pienaar [00:35:56]: It's so complex. All the different teams, they're all interrelated. You can do this for a small startup with one application running on Heroku or Vercel, but doing this at scale is virtually impossible at most companies. So you don't have that ground truth. You can't say with a hundred percent certainty whether you're right or wrong, and that's just the state we're in right now. Despite that, the confidence score has been a very powerful technique to at least eliminate most false positives, or, when we know that we don't have anything of substance, to just be quiet.
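
The "respond only when confident, otherwise stay quiet" behaviour, with the engineer-set threshold and conciseness settings mentioned earlier, might be gated roughly like this; the threshold value and verbosity options are illustrative.

```python
def maybe_notify(finding, confidence, threshold=0.75, verbosity="one_sentence"):
    if confidence < threshold:
        return None  # remain quiet rather than add noise
    if verbosity == "one_sentence":
        return finding["summary"]
    return finding  # full raw findings for engineers who want everything
```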

Demetrios [00:36:30]: But how do you know if you got enough information when you were doing the scan or you were doing the search to go back to the human and give that information? And also how do you know that you are fully understanding what the human is asking for when you're doing that back and forth?

Willem Pienaar [00:36:55]: Honestly, this is one of the key parts that's very challenging. A human will say the checkout service is down, and you need to know that they are probably, maybe based on who the engineer is, talking about production, or, if they've been talking about developing a new feature, probably talking about the dev environment. And if you go down the wrong path, then you can spend some money and a lot of time investigating something that's useless. So what we do is, even on the initial message that comes in, we will ask a clarifying question if we are not sure about what you're asking, if you've not been specific enough. And most agent builders, even Cognition's Devin, do this: initially they'll say, okay, do you mean X, Y and Z? Okay, this is my plan. Okay, I'm going to go do it now. So there is a sense of confidence built into these products from a UX layer, and that's where we are right now. With ChatGPT or with Claude you can sometimes say something very inaccurate or vague and it can probably guess the right answer, because the cost is not multi-step, right? It's very cheap, you can just quickly fix your text. But for us, we have to short-circuit that and make sure that you're specific enough in your initial instructions, and then over time loosen that a bit as we understand a bit more about what your teams are doing, what things are, what you're up to. Then you can be more vague, but for now it requires a bit more specificity and guidance.

Demetrios [00:38:21]: Speaking of the multi-turns and spending money for things, or trying to not waste money going down the wrong tree branch or rabbit hole: how do you think about pricing for agents? Is it all consumption-based? Are you looking at what the price of an SRE would be and saying, oh, we'll price a percentage of that because we're saving you time? What in your mind is the right way to think about pricing?

Willem Pienaar [00:38:56]: We're trying to build a product that engineers love to use. And so we want it to be a toothbrush. We want it to be something that you reach for instead of your observability platform instead of going into the console. So for us, usage is very important. So we don't want to have procurement stand in the way necessarily. But the reality is there are costs and this is a business and we want to add value and money is how you show us that we are valuable. So the original idea with agents was that there would be this augmentation of engineering teams and that you could charge some order of magnitude less but at a fraction of engineering headcount or employee headcount by augmenting teams. I think the jury's still out on that.

Willem Pienaar [00:39:38]: I think most agent builders today are pricing to get into production environments, or into these systems that they need to use to solve problems, to get close to their persona. And if you look at what Devin did, I think they also started at 10k per year or some pricing, and I think it's now like $500 a month. But it's mostly a consumption-based model, so you get some committed amount of compute hours that is effectively giving you time to use the product. For us, we are also orienting around that model. Because we're not GA, our pricing is a little bit in flux, and we're working with our initial customers to figure out what they think is reasonable, what they think is fair. But I think we're going to land on something that's mostly similar to the Devin model, where it's usage-based. We don't want engineers to think about, okay, if there's an investigation, it's going to cost me X.

Willem Pienaar [00:40:35]: They should just be able to run it and see whether this is valuable or not, and increase usage. But it will be something like a tiered amount of compute that you can use. So maybe you get 5,000 investigations a month or something in that order.

Demetrios [00:40:51]: Okay, nice. Yeah, because that's what instantly came to my mind was you want folks to just reach for this and use it as much as possible. But if you are on a usage based pricing then inevitably you're going to hit that friction where it's. Yeah, I want to use it, but it's going to cost me.

Willem Pienaar [00:41:14]: Yeah, yeah. So you do want to have a committed amount set aside up front. And we're also exploring having a free tier or a free band. Maybe for the first X you can just kick the tires and try it out, and as you get to higher limits, then the paid tiers kick in.

Demetrios [00:41:34]: So we haven't even talked about tool usage, but that's another piece that feels like it is so complex, because you're using an array of different tools. And how do you tap into each of these tools? If you're looking at logs, are you syncing directly with the Datadogs of the world? How do you see tool usage for this? And what have been some specifically hard challenges to overcome in that arena?

Willem Pienaar [00:42:11]: Again, this kind of goes back to why this is so challenging. One of the key things that we've seen is that agents solve problems very differently from humans, but they need a lot of the things humans need. They need the same tools. If you're storing all of your data in Datadog, we may not be able to find all the information we need to solve a problem by just looking at your actual application running on your cloud infra. So we need to go to Datadog, so we need access there, and so engineering teams give us that access. If you've then constructed a bunch of dashboards and metrics, and that's how you've laid out your runbooks and your processes to debug issues, we need to do things like look at multiple charts or graphs and infer across those in the time ranges that an issue happened: what are the anomalies that happen across multiple services?

Willem Pienaar [00:42:58]: So if two of them are spiking in CPU, they're interrelated, so we should look at the relations between them. But these are extremely hard problems for LLMs to solve. Even vision models, they're not purpose-built for that. So when it comes to tool usage, LLMs or foundation models are good at certain types of information, especially semantic ones: code, config, logs. They're slightly less good at traces, but still pretty decent. But they really suck at metrics, they really suck at time series. So how useful it's going to be really depends on your observability stack. Because as a human, we just sit back and look at a bunch of dashboards and we can pattern-match instantly, you can see that these are spikes, but an LLM sees something different.

Willem Pienaar [00:43:50]: So what we'll find is that over time these observability tools will probably become less and less human-centric and may even become redundant. You may see completely different means of diagnosing problems. And I think the Honeycomb approach, the trace-based approach with these high-cardinality events, is probably the thing that I'd put my money on as the dominant pattern that I actually see winning.

Demetrios [00:44:17]: Can you explain that real fast? I don't know what that is.

Willem Pienaar [00:44:21]: So basically what they do, or what Charity Majors and some others have been promoting for years, is logging out traces, but with rich events attached to them. So you can follow a request through your whole application stack, and you can log out a complete object payload at multiple steps along the way and store that in a system where you can query all the information. So you've got the point in time, you've got the whole tree of the trace as well, and then at each point you can see the individual attributes and fields, so you get a lot more detail. Versus if you're looking at a time series, you basically see, okay, CPU is going up, CPU goes down, and what can you glean from that? It's like witchcraft trying to find the root cause. Right. But the Datadogs of the world have been making a lot of money selling consumption, and selling that witchcraft to engineers, for years.

Willem Pienaar [00:45:22]: And so there's a real incentive to keep this status quo going. But I think as agents become more dominant, we'll see them gravitate to the most valuable sources of information. And then if you give your agent more and more scope, you'll see Datadog is rarely involved in these root causings, so why are we still paying for them? So I'm not sure what it's going to look like in the next two or three years, but it's going to be interesting how things play out as agents become the go-to for diagnosing and solving problems.
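
As a concrete example of the trace-based, high-cardinality approach, here is a small OpenTelemetry sketch that attaches rich, queryable attributes to one span per request; exporter setup is omitted and the attribute names are illustrative (requires the opentelemetry-api package).

```python
from opentelemetry import trace

tracer = trace.get_tracer("checkout-service")

def place_order(order):
    # One wide event per request: instead of pre-aggregated metrics, every
    # high-cardinality detail rides along on the span and can be queried later.
    with tracer.start_as_current_span("checkout.place_order") as span:
        span.set_attribute("user.id", order["user_id"])
        span.set_attribute("cart.value_usd", order["total"])
        span.set_attribute("payment.provider", order["payment_provider"])
        span.set_attribute("deploy.version", order.get("deploy_version", "unknown"))
        # ... do the actual work inside the span ...
```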

Demetrios [00:45:52]: Yeah, I hadn't even thought about that. How for human usage it's like maybe datadog is set up wonderfully because we look at it and it gives us everything we need and we can root cause it very quickly by pattern matching. But if that turns out to be one of the harder things for agents to do, instead of making an agent better at understanding metrics, maybe you just give it different data and so that it can root cause it without those metrics and it will shift away from reading the information from those services.

Willem Pienaar [00:46:29]: Yeah, if you look at chess and the AIs, the Stockfishes of the world, that's just one AI that plays against grandmasters. Even the top players have learned from the AI, so they know that a pawn push on the side is extremely powerful, or a rook lift is very powerful. So now the top players in the world adopt these techniques they learned from the AIs, but that's also because there's always a human in the loop; we still want to see people playing people. But if you just leave it up to the AIs, the way they play the game is completely different. They see things that we don't. And I know I didn't fully address your question at the start, but these tools are grounding actions for us. So the observability stack is one of them. But ultimately we build a complete abstraction over the production environment.

Willem Pienaar [00:47:20]: So the agent uses these tools and learns how to use these tools and knows which tools are the most effective. But we also build a transferability layer, so you can shift the agent from the real production environment into the eval stack, and it doesn't even know that it's running in an eval stack. It's now suddenly just looking at fake services, fake Kubernetes clusters, fake Datadogs, fake scenarios, a fake world. So these tools are an incredibly important abstraction. It's one of the key abstractions that the agent needs. And honestly, memory management and tools are the two big things that agent teams should be focusing on right now, I'd say.
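
A sketch of that tool abstraction and transferability layer: the agent talks to an interface, and the same agent code can be backed by a real production tool or a fake eval-stack one without knowing which. The class names and the search_logs call are illustrative, not a real Datadog client API.

```python
from typing import Protocol

class LogBackend(Protocol):
    def query(self, service: str, window_minutes: int) -> list[str]: ...

class DatadogLogs:
    """Production backend wrapping some real log client (illustrative only)."""
    def __init__(self, client):
        self.client = client
    def query(self, service, window_minutes):
        return self.client.search_logs(service, window_minutes)

class FakeLogs:
    """Eval-stack backend serving canned logs from a chaos scenario."""
    def __init__(self, scenario_logs):
        self.scenario_logs = scenario_logs
    def query(self, service, window_minutes):
        return self.scenario_logs.get(service, [])

def recent_errors(logs: LogBackend, service: str):
    # The agent-side code is identical in production and in the eval bench.
    return [line for line in logs.query(service, window_minutes=30) if "ERROR" in line]
```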

Demetrios [00:48:00]: Wait, why do you switch it to this fake world?

Willem Pienaar [00:48:04]: Because that's where you've got full control. That's where you can introduce your own scenarios, your own chaos, and stretch your agent. But if you do so in a way where the tools are different, the worlds are different, the experiences are different, there's less transferability when you then take it into the production environment, and suddenly it's going to fall flat. So you want a real facsimile of the production environment in your eval tool or your eval bench.

Demetrios [00:48:31]: And are you doing any type of chaos engineering to just see how the agents perform?

Willem Pienaar [00:48:39]: Yes, that's pretty much what our eval stack is. It's chaos. We produce a world in which we reproduce chaos. And then we say, given this problem, what's up, what's the underlying cause? And we see how close we can get to the actual cause. Yeah.

Demetrios [00:48:53]: Perfect opportunity for an incredible name, like Lucifer. This is the seventh layer of hell. I don't know, something along those lines.

Willem Pienaar [00:49:08]: Yeah, we've got some ideas; the blog post will have some more layers on this idea. So, TBD. I think one thing to note is that this is a very deep space. So if you look at self-driving cars, lives are on the line, so people care a lot and you have to hit a much higher bar than a human driving a car. It's very similar in this space. Right. These production environments are sacred. They are important to these companies. Right? They are.

Willem Pienaar [00:49:37]: If they go down, or if there's a data breach or anything like that, their businesses are on the line; CTOs really care. The bar that we have to hit is very high, and so we take security very seriously. But the whole product that we're building requires a lot of care, and there's a lot of complexity that goes into that. So I think it's extremely compelling as an engineer to work in this space, because there are so many compelling problems to solve: the knowledge graph building, the confidence scoring, how do you do evaluation, how do you learn from these environments and build that back into your core product, the tooling layers, the chaos benches, all these things. And how do you do that in a reliable, repeatable way? I think that's the other big challenge: if you're on AWS or GCP, using this stack or a different stack, if you're going from e-commerce to gaming to social media, how generalized is your agent? Can you just stamp it out, or can you only solve one class of problem? And so that's one of the things that we're really leaning into right now: the repeatability of the product and scaling this out to more and more enterprises. But yeah, I'd say it's an extremely complex problem to solve. And even though we're valuable today, true end-to-end resolution may be multiple years away. Just like with self-driving cars, it took years to get to a point where we've got Waymos on the roads.

Demetrios [00:50:56]: Yeah, that's what I wanted to ask you about, the true resolution. That just scares me to think about, first of all, and I don't have anything running in production, let alone a multimillion-dollar system. So I can only imagine that you would encounter a lot of resistance when you bring that up to engineers.

Willem Pienaar [00:51:22]: Surprisingly, no. There's definitely hesitation, but the hesitation is mostly based on the uncertainty: what exactly can you do? And if you show them that we literally can't change things, we don't have the access, the API keys are read-only, or we're constrained to these environments, and if you introduce change through the processes that they already have, so pull requests, and there are guardrails in place, then they're very open to those ideas. I think a big part of this is that engineers really hate infra and support, so they yearn for something that can help free them from that. But it's a progressive trust-building exercise. We've spoken to quite a lot of enterprises, and almost all of them have different classes of sensitivity. You have your big-fish customers, for example, where you don't want us to touch their critical systems. But then you've got your internal Airflow deployments and your CI/CD, your GitLab deployment.

Willem Pienaar [00:52:21]: If that thing falls over, we can scale it up or we could try and make a change; there's zero customer impact. And so those are the areas where we're really helping teams today: the lower-severity or low-risk places where you can make changes. And if you're crushing those changes, over time engineers will introduce you to the more high-value places. But yes, right now we're steering clear of the critical systems, because we don't want to make a change that is dangerous.

Demetrios [00:52:50]: Yeah. And it just feels like it's too loaded. So even if you are doing everything right, because it is so high maintenance, you don't want to stick yourself in there just yet. Let the engineers bring you in when they're ready and when you feel like it's ready. I can see that for sure.

Willem Pienaar [00:53:11]: Yeah. Also behaviorally, engineers are conscious of the tools they reach for and the processes they use in a wartime scenario. When it's a relaxed environment, they're willing to try AI, experiment with it and adopt it. But if it's a critical situation, they don't want to introduce an AI and add more chaos into the mix. They want something that reduces the uncertainty.

Demetrios [00:53:35]: Yeah. That reminds me of one of the major things I notice whenever I'm working with agents or building systems that involve AI: the prompts can be the biggest hangup. Obviously I'm not building a product that relies on agents most of the time, so I don't have the drive to see it through. But a lot of times I will fiddle with prompts for so long that I get angry, because I feel like I should just do the thing that I am trying to do and not get AI to do it.

Willem Pienaar [00:54:28]: I don't really have an answer for you. That's just the nature of the beast.

Demetrios [00:54:31]: Yes, exactly.

Willem Pienaar [00:54:33]: I do want to just double-click and say everybody has that problem. Everybody struggles with that. You don't know if you're one prompt change away or 20, and they're very good at making it seem like you're getting closer and closer, but you may not be. We found success in building frameworks to do evaluations, so that we can at least extract, either from production environments or from evals, the samples, the ground truth, that give us confidence we're getting to the answer. Otherwise you can just go forever, right? Just tweaking things and never getting there.

Demetrios [00:55:07]: That's it, and that's frustrating because, yeah, sometimes you take one step forward and two steps back and you're like, oh my God.

Willem Pienaar [00:55:16]: It's quite hard with content creation. I think it's harder in your space.

Demetrios [00:55:20]: I have all but stopped using it for content creation, that's for sure. Maybe to help me fill up a blank page and get directionally correct, but for the most part, yeah, I don't like the way it writes. Even if I prompt it to the maximum, it doesn't feel like it gives me deep insights. Yeah, I stopped that.

Willem Pienaar [00:55:42]: You're still on GPT-3.5, right?

