Towards Faultless AI Agents
Willem Pienaar, CTO of Cleric, is a builder with a focus on LLM agents, MLOps, and open-source tooling. He is the creator of Feast, an open-source feature store, and contributed to the creation of both the feature store and MLOps categories.
Before starting Cleric, Willem led the open-source engineering team at Tecton and established the ML platform team at Gojek, where he built high-scale ML systems for the Southeast Asian Decacorn.
At the moment Demetrios is immersing himself in Machine Learning by interviewing experts from around the world in the weekly MLOps.community meetups. Demetrios is constantly learning and engaging in new activities to get uncomfortable and learn from his mistakes. He tries to bring creativity into every aspect of his life, whether that be analyzing the best paths forward, overcoming obstacles, or building lego houses with his daughter.
Agent-based systems are set to disrupt every single job, with the end state being a hybrid human-machine workforce. However, LLMs applied in an agentic way are still fraught with reliability and UX issues. In this talk, we'll explore specific strategies for overcoming these issues, such as context awareness, graceful degradation, and user-centric design principles.
AI in Production
Demetrios [00:00:05]: Muy buen amigo of mine, Mr. Willem Pienaar, is back at it. What's up, dude?
Willem Pienaar [00:00:15]: Hey, what's up, D? Hopefully I'm not a combo breaker today. Hopefully I can keep things on schedule. But it's been a fun day.
Demetrios [00:00:24]: It's going to be hard. Well, I'm not going to help you here because I'm going to show everybody the other evaluation metrics, stuff that I was showing beforehand. And I want to just mention, because we've got this evaluation survey, you can see some of the data that's already come through, and I think this one is relevant to Abigail's talk and also just to people that care about evaluation, which is probably everyone. How does your organization approach real time evaluation of large language models when deployed in production? Hmm. There's all kinds of answers that are coming through here. I don't know if you can even see that, Willem. It might be, like, too small for. Yeah, yeah.
Demetrios [00:01:06]: I'll make it a little bit bigger and I'll say that continuous performance monitoring is up there. That's one. Then you have user feedback analysis: collecting and analyzing feedback from end users. The only problem with that, as you probably know, Willem, is that end users are lazy. Like myself, I never actually want to give a lot of feedback. I just want to say thumbs up, thumbs down. And the other one that is high up there is manual review of quality assurance. And you know what that sounds like? You don't.
Demetrios [00:01:41]: You're not going to give me anything.
Willem Pienaar [00:01:44]: It doesn't sound very exciting.
Demetrios [00:01:47]: It sounds expensive. That's what that sounds like. That's what I'll say on that one. So the other thing that I was going to mention is that I found the tool I was using to test a bunch of different outputs from the endpoints that we were talking about in Dan's chat. So you can add different models here and you basically get a playground. So we could do all the Mistral models if we wanted. And I think let's just make sure that we're all good. Have you played with this? Can't even.
Demetrios [00:02:26]: Nobody can see what I'm even sharing. Hold on.
Willem Pienaar [00:02:28]: You're not sharing anything visibly.
Demetrios [00:02:30]: Oh, my God. Hold on a sec. Here's what I'm sharing now, hopefully. Can I get a prompt, Mr. Willem?
Willem Pienaar [00:02:37]: Yeah, let's do. What's the capital of Spain?
Demetrios [00:02:41]: What is the capital of Spain and how many people live there? So what you didn't see me doing earlier is I was playing with this and I was choosing different endpoints. So I think I have all of the Mistral endpoints selected from different providers. This one, I think, is from Together. This one is from Mistral itself. This one's from Cloudflare. So this one's a little bit different, from Octo. We've got. And I know I'm throwing you off on your timing, so it's a little bit different, but let's see what happens.
Demetrios [00:03:25]: Let's see what happens. Look at how fast Octo is. Dang. But Mistral Tiny is also damn fast. And so you can evaluate this pretty nicely. And this is something that Dan was showing us with his tool, also how you can evaluate the speed. He's taken a bit of a different approach, but it is cool to see what people are saying. Like, it's way too small.
Demetrios [00:03:53]: So there we go. Hopefully that makes it a lot better.
Willem Pienaar [00:03:58]: Dude, I'm just asking for some valid JSON output. That's the first thing that I.
Demetrios [00:04:02]: Would do, output this in JSON. Is that what you want? That's fine.
Willem Pienaar [00:04:12]: Just add that.
Demetrios [00:04:15]: Okay. Output this in JSON. All right, let's go. Let's go. Dude, the Together model is actually doing pretty. It's. No, this is good.
Demetrios [00:04:37]: They're all pretty dang good. I'll take it.
Willem Pienaar [00:04:40]: All right.
Demetrios [00:04:40]: Anyway, Willem, you've got a talk to give, man. I hijacked your moment, dude. Well, I'm excited because you've been working on agents pretty hardcore for the last, what, six, eight, nine months?
Willem Pienaar [00:04:59]: Yeah, six to eight months.
Demetrios [00:05:02]: So you've got all kinds of fun stuff to share with us about your trials and tribulations of your agent building skills. So I'll hand it over to you, and I'll be back in ten minutes, baby.
Willem Pienaar [00:05:13]: Sweet. Cool. So, I'm Willem, CTO of Cleric. We're focused on building AI agents for freeing engineers up from the production environment. We're talking a bit about the road towards faultless AI. Today, we're all being identified. The world is going to be different in a few years. AI agents, specifically large language models being applied towards automation, basically unlock a new class of automation that we couldn't previously achieve.
Willem Pienaar [00:05:48]: So what we can do today is we can react to novel situations and environments and learn from those environments and have AI agents basically take actions for us. So whether it's producing blog posts, generating code, drug discovery, contract review, or research assistance, there are teams of people focused on each one of these jobs and applying agents to automating that today. So I'll talk a bit about that and where things go well, where things don't go well. But specifically what I'm referring to is autonomous agents, and that's really where the value is, where you can let these agents run and they can reliably achieve a task that you needed to do yourself previously, unlike a copilot, where every step of the way you're leading the interaction and the engagement with the agent. So that's really the focus of this talk: where things go wrong and what needs to be done to fix that. So I just want to quickly touch on what an agent is. These agents typically function like this: they are given an objective.
Willem Pienaar [00:07:00]: They break down that objective into sub-goals and decompose it, doing planning. So this plan is a multistep graph, and often it's designed as a DAG, and the agent executes that graph by calling tools. It can do things like integrate with the calendar, it can execute code, it can query pretty much any API that you can give it access to and even interact with the real world in some sense. And so these agents also have memory, so they know short term, like, this is the task I'm working on, and this is what's relevant to the job I'm doing. But they also have infinite long-term memory. So they can use vector stores, they can bring in embeddings of information, and they can use long-term memory from document stores or even graph databases where the real-world representation can be stored. And so these agents work as state machines. They basically traverse this plan and execute and then course correct if things go wrong. But for the most part they're planners and doers, so they think, come up with a plan, and then they execute that plan, step back if they make a mistake, and continue along indefinitely until their objective is met.
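The plan-as-DAG idea described in this turn can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how Cleric or any particular framework does it: the tool functions, the `PLAN` dictionary, and `execute_plan` are hypothetical names, and the "tools" are plain functions standing in for real API calls. The only library call used is the standard library's `graphlib.TopologicalSorter`.

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, List

Memory = Dict[str, str]  # short-term memory: results of the steps run so far


# Hypothetical tools: plain functions standing in for real API calls.
def find_free_slot(memory: Memory) -> str:
    return "Tuesday 10:00"


def draft_invite(memory: Memory) -> str:
    return f"Invite for {memory['find_free_slot']}"


def send_invite(memory: Memory) -> str:
    return f"sent: {memory['draft_invite']}"


TOOLS: Dict[str, Callable[[Memory], str]] = {
    "find_free_slot": find_free_slot,
    "draft_invite": draft_invite,
    "send_invite": send_invite,
}

# The plan is a DAG: each step maps to the steps it depends on.
PLAN: Dict[str, List[str]] = {
    "find_free_slot": [],
    "draft_invite": ["find_free_slot"],
    "send_invite": ["draft_invite"],
}


def execute_plan(plan: Dict[str, List[str]],
                 tools: Dict[str, Callable[[Memory], str]]) -> Memory:
    """Run every step after its dependencies, storing results in memory."""
    memory: Memory = {}
    for step in TopologicalSorter(plan).static_order():
        memory[step] = tools[step](memory)
    return memory


if __name__ == "__main__":
    print(execute_plan(PLAN, TOOLS)["send_invite"])
```

A real agent would generate `PLAN` with a model and would need error handling around each tool call; the sketch only shows the traversal order.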
Willem Pienaar [00:08:15]: But most autonomous agents today are still unreliable, and that's for various reasons. So if you're building autonomous AI agents, this is obvious to you. You've seen AutoGPT and a lot of these frameworks blow up, but if you speak to people that are building products, they're still hitting their heads against walls. There are various reasons, and three of them come to mind. Context windows are still limited in length, so there's a dark art to what exactly you feed to the model in order to make it do certain things. So when it's generating a plan or when it's calling an API, what exactly is that plan? What exactly are the parameters to that API? And if you have to kind of constrain the real world and condense that into the context length, then that's already lossy. There are also challenges with long-term planning. The further the horizon of the plan is, the more unstable it becomes.
Willem Pienaar [00:09:14]: So they showed in the International Planning Competition that only 12% of plans actually executed correctly when built with an LLM. And so this is just very unreliable. And then finally, if you're using language for planning, an LLM is just inherently unstable, specifically for tool inputs. So failure is really a feature. But this is also something that we as humans do. We fail, we come up with plans, we try them, they don't work, we backtrack, we come up with different ideas. And so the system needs to be built with that failure as an expectation instead of treating it as an exception. So what you want is reliable plans so that they're cheap to execute, and you want to be able to course correct if something goes wrong.
Willem Pienaar [00:09:58]: So in one paper that recently came out, "More Agents Is All You Need," one of the approaches they take is to try and improve plan generation, and they do this with a sample-and-vote method. So they use agents in a collaborative setting, they generate many different samples, many different plans, and they take the plan that occurs most frequently. They use heterogeneous amounts of agents and different strategies to come up with these plans, they introduce entropy, and then the majority vote is selected. So if you're one agent, you take the plan that occurs the most and you just run that plan. But now, in some cases, if you're doing code gen, these plans could all be diverse. And so you can use a similarity-based approach to kind of condense or cluster them and choose the one that's most likely in that cluster based on cluster comparison. Or you could even take concurrency into account and branch: if you're willing to throw money at the problem, you can run multiple of these plans simultaneously.
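The sample-and-vote step described above can be sketched as plain majority voting. This is a minimal illustration, not the paper's implementation: `sample_and_vote` and `make_noisy_planner` are hypothetical names, and the planner is a seeded random stub standing in for repeated LLM calls. Exact-match voting only works when plans can be compared as identical strings; the similarity-based clustering mentioned for diverse outputs such as generated code is not shown.

```python
import random
from collections import Counter
from typing import Callable


def sample_and_vote(sample_plan: Callable[[], str], n_samples: int) -> str:
    """Draw n_samples plans and return the one that occurs most often."""
    samples = [sample_plan() for _ in range(n_samples)]
    winner, _count = Counter(samples).most_common(1)[0]
    return winner


def make_noisy_planner(seed: int) -> Callable[[], str]:
    """Hypothetical stand-in for an LLM: usually proposes 'plan A'."""
    rng = random.Random(seed)

    def planner() -> str:
        return rng.choices(
            ["plan A", "plan B", "plan C"], weights=[0.5, 0.3, 0.2]
        )[0]

    return planner


if __name__ == "__main__":
    planner = make_noisy_planner(seed=0)
    # More samples cost more calls but make the vote more stable.
    for n in (1, 5, 25):
        print(n, sample_and_vote(planner, n))
```

The `n_samples` argument is the dial from the talk: raising it costs more model calls in exchange for a more stable winner.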
Willem Pienaar [00:10:57]: And they also showed in this paper that the more agents you apply in this collaborative setting to generate plans, the better the accuracy becomes. In this case it's grade school math, but the same goes for any kind of complex task. And this is specifically interesting because they also show that the more complex the task, the better smaller models are at achieving that. So for planning, especially if you're on low-cost hardware, small models are now able to compete with bigger models. GPT-3.5 can compete with GPT-4, et cetera, et cetera, for plan generation. So basically, again, there's a dial, and if you dial it up, you can pay more money and get better plans. But that is a trade-off that you need to calculate on your own. Something that's also extremely important for agents is critics. So you can generate plans, but before you run them, apply critics.
Willem Pienaar [00:11:46]: And so in this case, there are two types of critics: hard-constraint and soft-constraint critics. The hard-constraint critics are things like, you're generating code: will this code actually execute on this machine? If it fails that test, then you should just not use that plan at all. And soft constraints can be something like, hey, is the style of this writing in line with my tone and my style? And so sometimes you'd have a score, and sometimes you'd still allow it through, or you'd rank them. But basically, you'd create this set of critics, and then you would update your board after your generation phase: have you passed all of these critics? If you haven't, regenerate the plans and then send them through another loop of the critics. Another technique that is also very useful is back prompting. So, imagine executing a plan where you want to send a file to somebody. You need access to a Google Drive.
Willem Pienaar [00:12:38]: You realize as you execute the plan that you don't have that access. You want to back prompt that failure. So you want to say, there's a tool exception saying we don't have access to Google Drive. Take it upstream into one of the nodes where a plan is being generated, introduce that into the prompt, and then have a new plan generated. This sounds very easy, but it's actually very hard if you're running any kind of concurrent agents. In our case, we spawn thousands of agents. How do you know which of these facts are relevant to other agents? So you need another layer that's also assessing relevance or significance or the other properties of that finding based on the plan that another agent is conducting or executing. And so this is also shown in one of the papers to be an extremely effective technique, and Tree of Thoughts and other approaches also use it.
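The critic loop and the back-prompting idea from the last two turns can be sketched together. Everything here is a hypothetical illustration: `generate_plans` is a stub standing in for an LLM planner, plans are just lists of tool names, and the hard and soft critics are deliberately trivial. The relevance layer needed for many concurrent agents is not modeled.

```python
from typing import List

KNOWN_TOOLS = {"google_drive", "email", "compress"}
ACCESSIBLE_TOOLS = {"email", "compress"}  # assumption: no Drive access


class ToolError(Exception):
    """Raised when a plan step cannot be carried out."""


def generate_plans(objective: str, feedback: List[str]) -> List[List[str]]:
    """Stub planner: drops any candidate that uses a tool named in feedback."""
    candidates = [["google_drive"], ["compress", "email"], ["fax"]]
    return [
        plan for plan in candidates
        if not any(tool in note for note in feedback for tool in plan)
    ]


def hard_critic(plan: List[str]) -> bool:
    """Hard constraint: every step must use a known tool, else reject."""
    return all(tool in KNOWN_TOOLS for tool in plan)


def soft_critic(plan: List[str]) -> float:
    """Soft constraint: prefer shorter plans (higher score is better)."""
    return 1.0 / len(plan)


def execute(plan: List[str]) -> str:
    for tool in plan:
        if tool not in ACCESSIBLE_TOOLS:
            raise ToolError(f"no access to {tool}")
    return "done: " + " -> ".join(plan)


def run(objective: str, max_rounds: int = 3) -> str:
    feedback: List[str] = []
    for _ in range(max_rounds):
        plans = [p for p in generate_plans(objective, feedback)
                 if hard_critic(p)]
        if not plans:
            feedback.append("all candidate plans failed the hard critics")
            continue
        best = max(plans, key=soft_critic)
        try:
            return execute(best)
        except ToolError as err:
            # Back prompting: feed the failure upstream to the planner.
            feedback.append(str(err))
    raise RuntimeError("no plan succeeded")


if __name__ == "__main__":
    print(run("send the report to a colleague"))
```

In the first round the Drive plan scores best and fails at execution; the error message goes back into `feedback`, so the second round plans around it.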
Willem Pienaar [00:13:27]: Another architecture that's becoming very dominant now is the neurosymbolic architecture. And it's basically what we've spoken about thus far. It's splitting your execution into two parts. One is the fast thinking. So if you consider the fast thinking, slow thinking methodology from Daniel Kahneman, the LLM is the fast thinker. It's instinctive, it comes up with an answer, but it's not always right. So you need another system, a symbolic engine or something that can do System 2 thinking, the slow thinking: reason about something, be deliberate, and verify it.
Willem Pienaar [00:14:00]: So, DeepMind solved these geometry problems with AlphaGeometry, and they did this with a language model and a symbolic engine that would validate those results and then circle back. If any of those failed, it would prompt the model again for new generations, and if it succeeded, it would add that to a synthetic database, and they generated hundreds of thousands of examples. And this model also applies to agents. Plan generation works in exactly the same way. And so that's effectively what I've described earlier in this talk. And they're very successful. They beat silver medalists and almost competed with gold medalists, which previously, with AI, was completely unheard of. So, the road to autonomy, and that's my final slide, is basically: break up your complex, larger jobs into smaller ones.
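The propose-then-verify loop described here can be sketched on a toy problem. This is not AlphaGeometry's implementation: the "fast thinker" is a hypothetical stub that guesses integers, the "slow thinker" is an exact arithmetic check for a root of x^2 - 5x + 6, and `solve`, `verify`, and `naive_proposer` are illustrative names.

```python
from typing import Callable, List, Optional


def verify(x: int) -> bool:
    """Symbolic check: exact arithmetic decides, no guessing involved."""
    return x * x - 5 * x + 6 == 0


def naive_proposer(rejected: List[int]) -> int:
    """Hypothetical stand-in for a language model: proposes the smallest
    non-negative integer that has not been rejected yet."""
    guess = 0
    while guess in rejected:
        guess += 1
    return guess


def solve(propose: Callable[[List[int]], int],
          max_attempts: int,
          dataset: List[int]) -> Optional[int]:
    """Ask the proposer, verify, and re-prompt with the rejected guesses."""
    rejected: List[int] = []
    for _ in range(max_attempts):
        guess = propose(rejected)
        if verify(guess):
            dataset.append(guess)  # keep verified results as synthetic data
            return guess
        rejected.append(guess)     # feed the failure back to the proposer
    return None


if __name__ == "__main__":
    synthetic_examples: List[int] = []
    print(solve(naive_proposer, max_attempts=10, dataset=synthetic_examples))
    print(synthetic_examples)
```

The split matters because the verifier never trusts the proposer: only answers that pass the exact check are returned or stored.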
Willem Pienaar [00:14:53]: Some of them can be fully autonomous, and some of them will still be human in the loop. But the real holy grail is, when do you introduce the human in the loop? So, if you have self awareness. So, that's something we're very focused on. You can drop from a full autonomous mode into a semi and bring in a human, but otherwise, you can just continue. So adding that self awareness is a critical step. But over time, regardless, we can add value with agents in some of these subworkflows. So, if you are interested in banishing humans from the production environment, come and chat to us at Cleric. That's it.
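One simple way to picture the drop from fully autonomous to human-in-the-loop mode is a confidence threshold. This is a sketch under assumptions, not Cleric's mechanism: the self-assessed `confidence` score, the `min_confidence` default, and the `choose_mode` function are all hypothetical.

```python
from enum import Enum


class Mode(Enum):
    AUTONOMOUS = "autonomous"
    HUMAN_IN_LOOP = "human_in_loop"


def choose_mode(confidence: float, min_confidence: float = 0.8) -> Mode:
    """Stay autonomous only when the agent's self-assessed confidence
    (a hypothetical score between 0 and 1) clears the threshold."""
    if confidence >= min_confidence:
        return Mode.AUTONOMOUS
    return Mode.HUMAN_IN_LOOP


if __name__ == "__main__":
    for score in (0.95, 0.40):
        print(score, choose_mode(score).value)
```

How the agent arrives at a trustworthy confidence score is the hard part, and the sketch does not address it.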
Demetrios [00:15:26]: What a strong ending.
Willem Pienaar [00:15:29]: Slide.
Demetrios [00:15:31]: Cleric careers. What was it, careers? Cleric I. Cleric IO careers. Go. Banish some humans from the production. I love it, man. This was awesome. So, a lot of stuff you put in there.
Demetrios [00:15:48]: I'm going to have to absorb that over the next couple of days because that was very cool. And I think the main thing that I will probably be asking you about later on is that trade-off that you were talking about. Like, when you get more agents, you get better results, but you can also do it with smaller LLMs, and so you don't necessarily need to burn the bank to get these tasks done, which I think is pretty promising. So I like seeing it, man. I appreciate you coming on here, teaching us this stuff. Cool. As always, it's a pleasure.