Building Reliable Agents // Eno Reyes // Agents in Production
Eno Reyes is co-founder and CTO of Factory.ai, an organization developing autonomous systems called Droids that automate software development tasks. Prior to Factory, he was an ML engineer at Hugging Face working on enterprise LLM deployment, fine-tuning, and productionization.
The field of agentic system design is an exciting and quickly developing area with significant implications for software use across various sectors. These systems differ from conventional software by employing decision-making processes that are non-deterministic and often unpredictable. By drawing insights from diverse fields such as robotics, cybernetics, and biology, we can begin to cultivate an understanding of how to construct systems that are more dependable than their individual probabilistic components would suggest.
Adam Becker [00:00:04]: Next up, Eno, let's see, are you around?
Eno Reyes [00:00:07]: Hey, how's it going?
Adam Becker [00:00:09]: Okay, I don't know if you. Where are you located?
Eno Reyes [00:00:12]: I'm based in San Francisco.
Adam Becker [00:00:13]: San Francisco. Okay. So far enough away not to have to swipe on the llama. One of the things that came up in our last conversation just now, the last talk, was some of the challenges of distributed systems, even in more traditional kinds of software, let alone the moment you're introducing agents and having to introduce a whole new conception of reliability, even for multi-agent systems. That sounds like an increasingly consequential and difficult problem to solve, and it sounds like something that you've been spending your time on. So I'm going to let you take it. You're sharing your screen.
Adam Becker [00:00:55]: Okay, cool. You know, I'm very excited to see what you have to share with us and how to build agentic systems reliably. I'll be in the background and I'll be taking in some questions. So I'll come back in about 20 minutes. And if folks have questions until then, please write them in the chat below and I'll get to them very soon. Eno, the floor is yours.
Eno Reyes [00:01:21]: Amazing. Well, appreciate you having me, and welcome to everybody here. My name is Eno Reyes, I'm the co-founder and CTO of Factory, and today I'm going to be talking about, as the slide says, building reliable agentic systems. This slideshow is basically a bit of a speed round on some concepts that the Factory team has encountered and some ideas we applied that have helped us raise the bar on reliability for our core product suite of Droids. For some additional context, Factory's mission is to bring autonomy to software engineering. What that means concretely is that we're building a platform that helps software engineers in organizations both small and large collaborate with AI systems, so they can work on all of the tasks that a software developer might need to work on. That of course includes coding, but it also goes a little bit beyond. A lot of tools are very focused on, for example, the IDE experience and writing lines of code. As we see it, there's a lot of overhead to being a software developer that touches code, documents, things like review, organizational practices, onboarding. And across this whole world of tasks, there is oftentimes a need for not just a simple one-off AI interaction, but rather a broader, multi-step, complex workflow.
Eno Reyes [00:02:46]: And so bringing autonomy to software engineering, for us, means we need to build systems that can handle the complexity of real-world environments. So, to kick this off, I like to start with a definition, because the term agent is really overloaded these days. A lot of people have talked about agents; every AI company says they build agents. But what really is an agentic system? There are a lot of different traits and features that you can pin to agents. What we've found, across the many different workflows, and some people use the term cognitive architectures, is that it's actually really hard to draw an explicit boundary around an agent. But there are three characteristics that we like to think of as being characteristic of an agentic system.
Eno Reyes [00:03:45]: The first is planning. Agentic systems make plans, which decide one or more future actions of the system. This can be simple, like just outputting a single action, or it can be complex, like a multistage, multitask plan that potentially coordinates multiple actors. Agentic systems also make decisions. That means they evaluate multiple pieces of data and select their next action or state, or some decision, based on, typically, an algorithm or selection process. A lot of people will call this reasoning if the decision-making process is sufficiently complex, or if it seems semi-unexplainable to us and we can't really point to the exact algorithm. And the third thing, arguably the most important, is environmental grounding. What this means concretely is that agentic systems read and write information to an external environment.
Eno Reyes [00:04:48]: And this enables them to achieve goals within that environment. Ideally, all of this information that the agent is able to react and adapt to helps it achieve its core goal. So what we're going to do is jump into these three concepts, and I'm going to talk through some high-level problems and some solutions that our team came up with to address them. Let's talk about planning first. One of the biggest issues with building long-term plans with an agent is that they tend to drift away from their core responsibilities. If you've ever played with tools like BabyAGI or a lot of basic agents, you might ask one to do a single thing, and if that task requires a bunch of individual sub-steps, it's very difficult to get it to converge onto a single path. So, inspired by the Kalman filters used in robotics and control theory, our team has put together a couple of techniques that pass intermediate context through the plan to ensure that the plan, as it continues, stays focused and reliable and converges.
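To make that concrete, here is a minimal sketch of the general idea, not Factory's actual implementation: each step's raw output is distilled back into a running plan state before the next step is generated, so the system keeps re-anchoring to the original goal. The `call_llm` helper and the prompts are placeholders for whatever model client and prompt format you use.

```python
# Minimal sketch: carry a filtered "plan state" through each step so the agent
# re-anchors to the original goal instead of drifting. Hypothetical helpers;
# swap call_llm for your actual model client.

def call_llm(prompt: str) -> str:
    """Placeholder for a real LLM call (e.g. an API client)."""
    return f"[model output for: {prompt[:60]}...]"

def plan_next_step(goal: str, plan_state: str) -> str:
    # The next step is conditioned on the goal AND the distilled state,
    # not on the full raw history of every previous step.
    return call_llm(
        f"Goal: {goal}\n"
        f"Progress so far (summarized): {plan_state}\n"
        "Propose the single next step that moves toward the goal."
    )

def update_plan_state(goal: str, plan_state: str, step: str, observation: str) -> str:
    # "Filter" step: merge what actually happened back into a compact estimate
    # of where we are, loosely analogous to a measurement update.
    return call_llm(
        f"Goal: {goal}\n"
        f"Previous state: {plan_state}\n"
        f"Step taken: {step}\nObservation: {observation}\n"
        "Summarize the current state of progress in a few sentences."
    )

def run(goal: str, execute_step, max_steps: int = 5) -> str:
    plan_state = "Nothing done yet."
    for _ in range(max_steps):
        step = plan_next_step(goal, plan_state)
        observation = execute_step(step)          # interact with the environment
        plan_state = update_plan_state(goal, plan_state, step, observation)
    return plan_state

if __name__ == "__main__":
    # Dummy executor so the sketch runs end to end.
    print(run("Add input validation to the signup endpoint", lambda s: "ok"))
```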
Eno Reyes [00:06:07]: Obviously there are some cons here, which is that if you are off in your initial prediction of what, for example, the next step should be, or in an individual step of reasoning, that error can propagate as you pass it through each additional step. The second technique is subtask decomposition. Basically, what that means is that if you break down your complex plan into smaller, more manageable tasks, it allows for finer control over plan execution. This is super helpful if you are trying to understand exactly how to solve complex problems, but the risk is that you have a higher potential of introducing an incorrect or misleading step, and a single misleading step can be the issue. The third is what we call model predictive control, and this is another idea taken from robotics.
Eno Reyes [00:07:04]: If you've ever worked on or around self-driving cars, you'll probably have heard this term. But basically, a plan is dynamic. In most complex tasks, you need to constantly reevaluate: what am I doing, and what is the next step that needs to be taken? Your initial plan at the beginning of a task really should look a little different from your final plan once it's executed. Adaptive planning based on real-time feedback can be a super helpful way to ensure that your agent is able to dynamically adjust to problems or issues it encounters. The benefits here are pretty obvious. I'd say the downsides are a little harder to track, which is that you really are introducing a significantly higher risk of straying from the initial correct trajectory. And that goes back to that first point: if you are tackling these complex problems with agents, you really do need to think about how to keep them consistent, but also able to adapt to new information.
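As a rough illustration of that model-predictive-control idea, again a sketch rather than a prescribed implementation: regenerate the remaining plan after every executed step, commit only to the first action of the fresh plan, and repeat. `call_llm` is a stub standing in for your model.

```python
# Sketch of MPC-style replanning: after each executed step, re-plan the rest of
# the task from the latest observations and only commit to the first action.
# call_llm is a stub; replace it with your model client.

def call_llm(prompt: str) -> str:
    return "1. inspect failing test\n2. patch the function\n3. re-run tests"

def replan(goal: str, history: list[str]) -> list[str]:
    """Ask the model for a fresh plan for the *remaining* work."""
    raw = call_llm(
        f"Goal: {goal}\n"
        f"What has happened so far: {history}\n"
        "List the remaining steps, one per line."
    )
    return [line.strip() for line in raw.splitlines() if line.strip()]

def run(goal: str, execute, max_steps: int = 10) -> list[str]:
    history: list[str] = []
    for _ in range(max_steps):
        plan = replan(goal, history)           # full re-plan with fresh feedback
        if not plan:
            break                              # model says nothing is left to do
        next_action = plan[0]                  # commit only to the first step
        observation = execute(next_action)
        history.append(f"{next_action} -> {observation}")
    return history

if __name__ == "__main__":
    print(run("Fix the flaky integration test", lambda a: "done"))
```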
Eno Reyes [00:08:06]: And then finally, the last thing that we found to be quite helpful is really defining what your plan criteria look like. The success criteria and structure of a plan are ultimately going to decide exactly how successful, or competent, your agentic system is. This can be implemented in a bunch of different ways, whether it's instruction prompting, few-shot examples, or static type checking of your plans. One big thing I'm not going to do is be too prescriptive about how you implement these techniques. Since all of you have different systems, it's a little hard to say, do this specific thing. But if you've built an agent or you want to build an agent, hopefully this is a helpful reminder that we need some detailed instructions and some success criteria, and we need them explicitly defined so that the agent isn't just running off trying to figure this out from scratch every time. The con here is that it is difficult to build and scale multiple explicit criteria. So if you have a very open-ended problem space, it's not always easy or obvious what the criteria should be, because you may encounter totally different versions of the same problem.
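One lightweight way to make plan criteria explicit is to force the plan into a structure you can check before executing anything. The sketch below uses plain dataclasses; the fields, the allowed action set, and the checks are illustrative assumptions, not a prescribed schema.

```python
# Sketch: define an explicit structure and success criteria for plans, and
# validate ("static-check") a plan before the agent is allowed to execute it.
from dataclasses import dataclass, field

ALLOWED_ACTIONS = {"read_file", "edit_file", "run_tests", "open_pr"}  # example action set

@dataclass
class PlanStep:
    action: str
    target: str
    success_criterion: str   # e.g. "tests pass", "file contains new function"

@dataclass
class Plan:
    goal: str
    steps: list[PlanStep] = field(default_factory=list)

def validate(plan: Plan) -> list[str]:
    """Return a list of problems; an empty list means the plan is acceptable."""
    problems = []
    if not plan.steps:
        problems.append("plan has no steps")
    for i, step in enumerate(plan.steps):
        if step.action not in ALLOWED_ACTIONS:
            problems.append(f"step {i}: unknown action {step.action!r}")
        if not step.success_criterion:
            problems.append(f"step {i}: missing success criterion")
    return problems

if __name__ == "__main__":
    plan = Plan(
        goal="Add retry logic to the HTTP client",
        steps=[
            PlanStep("edit_file", "http_client.py", "retry wrapper added"),
            PlanStep("run_tests", "tests/test_http_client.py", "all tests pass"),
        ],
    )
    print(validate(plan) or "plan looks structurally valid")
```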
Eno Reyes [00:09:24]: So that's planning. Moving over to decision making now. One of the biggest issues with LLMs is that they are next-token predictors, right? An LLM is likely to be a core part of your agentic system, although people do build agentic systems without LLMs. The output of an LLM is ultimately conditioned on all of its prior inputs, which means that as you build out a response, every additional layer of the response is influenced by that initial thought process. So there's a need to sample from multiple different outputs in order to get a better sense of what the correct answer might be. We call these consensus mechanisms, and they can be a lot of different things. People have tried things like prompt ensembles: give a bunch of different prompts that vaguely recreate the same question, then take all the answers and do something with them. Another is cluster sampling.
Eno Reyes [00:10:28]: So imagine getting 20 answers from the same prompt, clustering the answers that look similar, and then doing something like, for example, sampling from each of those clusters and comparing. And then finally there's self-consistency, which is probably one of the easier ones, and which just says: output a bunch of times and take the most consistent answer. All of these methods tend to improve the accuracy of your decision making, but they ultimately require greater inference cost, so there's a trade-off.
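Self-consistency is the easiest of these to sketch: sample the same prompt several times and keep the most common answer. The snippet below is a minimal illustration with a stubbed sampler; prompt-ensemble and cluster-sampling variants follow the same shape with more aggregation logic.

```python
# Sketch of self-consistency: sample N answers for the same question and keep
# the most frequent one. sample_llm is a stub standing in for a real model call
# with temperature > 0.
import random
from collections import Counter

def sample_llm(prompt: str) -> str:
    # Stand-in for a sampled model completion; replace with your client.
    return random.choice(["use a queue", "use a queue", "use a stack"])

def self_consistent_answer(prompt: str, n: int = 20) -> str:
    answers = [sample_llm(prompt) for _ in range(n)]        # n independent samples
    counts = Counter(answers)
    answer, votes = counts.most_common(1)[0]                # majority vote
    print(f"{votes}/{n} samples agreed on: {answer}")
    return answer

if __name__ == "__main__":
    self_consistent_answer("Which data structure fits a BFS frontier?")
```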
Eno Reyes [00:11:20]: The second kind of decision-making strategy is explicit reasoning or analogical reasoning. This is actually probably one of the best-studied LLM techniques, one of the classic confidence-boosting techniques for LLM-based systems. You've probably heard of things like chain of thought, checklists, chain of density; people have gotten really creative, tree of thought and so on. Basically, if you explicitly outline the reasoning strategy or the analogies guiding the decision-making process, you can significantly reduce the complexity of that decision making and make it way simpler for an LLM-based system to accomplish those tasks. The pro here is that with that reduced complexity, you're able to get significantly more consistent responses. Consistency doesn't always mean accuracy, but it does mean that the process that's followed will tend to look pretty similar. The downside is that it does reduce the creativity, or the natural instincts, of the LLM, because you are forcing a reasoning strategy. So if your reasoning strategy is not sufficiently general, or is just downright incorrect for a given problem, it's going to significantly reduce your agent's ability to solve that problem. So, you know, there are trade-offs here.
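A minimal sketch of what forcing an explicit reasoning strategy can look like in practice: the prompt template below imposes a fixed, checklist-style structure before a parseable final answer. The template wording, tool names, and `call_llm` stub are illustrative assumptions, not a canonical prompt.

```python
# Sketch: impose an explicit reasoning structure (a checklist-style chain of
# thought) so the model follows the same process every time. The template is
# illustrative; call_llm is a stub for your model client.

REASONING_TEMPLATE = """You are deciding which tool to call next.

Follow these steps in order, writing each one out:
1. Restate the task in one sentence.
2. List the tools available and what each is for.
3. Eliminate tools that clearly do not apply, with a one-line reason each.
4. Pick the remaining tool that best fits, and state why.

Finish with a line of the form: FINAL: <tool_name>

Task: {task}
Tools: {tools}
"""

def call_llm(prompt: str) -> str:
    return "1. ...\n2. ...\n3. ...\n4. ...\nFINAL: run_tests"

def choose_tool(task: str, tools: list[str]) -> str:
    output = call_llm(REASONING_TEMPLATE.format(task=task, tools=", ".join(tools)))
    # Parse only the structured final line, ignoring the intermediate reasoning.
    for line in reversed(output.splitlines()):
        if line.startswith("FINAL:"):
            return line.removeprefix("FINAL:").strip()
    raise ValueError("model did not produce a FINAL line")

if __name__ == "__main__":
    print(choose_tool("Verify the bug fix", ["read_file", "run_tests", "open_pr"]))
```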
Eno Reyes [00:12:25]: This one felt necessary to say, but one kind of obvious way to improve your agentic system's decision-making ability is to work on the model weights. If you have an open-source model that you know is really good and you have specific tasks in mind, decision-making tasks like, for example, classification or tool use, you should probably consider fine-tuning or otherwise changing the actual end model weights. The huge, obvious downside is that this is expensive and locks in quality. So if you have a known distribution and you want to improve your quality on that distribution, this is a great strategy. For us, we've found that fine-tuning is rarely the actual solution. Oftentimes base models are not just good, but actually better at a lot of problems, because they are so good across larger distributions of tasks. But your mileage may vary, and you may be building a system where fine-tuning is the right solution.
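If you do go down the fine-tuning route for a narrow decision task like tool selection, much of the work is curating examples from your known distribution. A hedged sketch: it only builds a JSONL file in the commonly used chat-message format; the example data is hypothetical, and submitting the file to an actual fine-tuning job depends entirely on your provider or training stack, so it is left out.

```python
# Sketch: turn logged (context -> correct tool) decisions into a JSONL dataset
# in the widely used chat-message format. Submitting the file to a fine-tuning
# job is provider-specific and intentionally left out.
import json

# Hypothetical examples drawn from a known task distribution.
examples = [
    {"context": "User reports a failing unit test in the auth module", "tool": "run_tests"},
    {"context": "User asks what a function in billing.py does", "tool": "read_file"},
]

SYSTEM = "Pick exactly one tool name: run_tests, read_file, edit_file, open_pr."

with open("tool_selection_finetune.jsonl", "w") as f:
    for ex in examples:
        record = {
            "messages": [
                {"role": "system", "content": SYSTEM},
                {"role": "user", "content": ex["context"]},
                {"role": "assistant", "content": ex["tool"]},  # target decision
            ]
        }
        f.write(json.dumps(record) + "\n")

print("wrote", len(examples), "training examples")
```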
Eno Reyes [00:13:31]: And probably the most interesting one is simulation. This is a pretty advanced decision-making strategy, but if you are able to operate in a domain where you can simulate decision making and simulate multiple decision paths, it can be extremely valuable. The basic outline is that you want to sample and then simulate forward multiple decision paths, and then determine the optimal choice based on some reward criteria. This doesn't need to be a full RL experiment. Language agent tree search, which I mentioned earlier, is a pretty advanced technique but a very doable one for exploring decision trees; it basically imitates Monte Carlo tree search using LLMs. You can imagine that if you're operating in a domain like software development, where you have tests which can be executed and the success criterion of a test is, does it pass or does it compile, you're actually able to model through these decision-making processes and end with a path that leads you, at the least, toward whatever reward you set up. The benefit here is just quality, generally.
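Here is a deliberately simplified sketch of the simulation idea in a software-development setting: sample several candidate patches, "simulate" each by running the tests, score with a reward, and commit to the best path. A real LATS-style search adds multi-step expansion and backpropagation of values, which is left out; every helper below is a stub.

```python
# Simplified sketch of simulation-based decision making: sample candidate
# actions, roll each one forward in an executable environment (here: a test
# suite), score them with a reward, and commit to the best. The helpers are
# stubs; a full LATS-style search would expand multiple steps per candidate.
import random

def sample_candidate_patches(task: str, n: int = 4) -> list[str]:
    # Stand-in for sampling n different patches from a model.
    return [f"patch variant {i} for: {task}" for i in range(n)]

def run_tests_with(patch: str) -> tuple[int, int]:
    # Stand-in for applying the patch in a sandbox and running the test suite.
    passed = random.randint(5, 10)
    return passed, 10  # (tests passed, total tests)

def reward(passed: int, total: int) -> float:
    # Reward criterion: fraction of tests passing. Could also include compile
    # success, lint results, etc.
    return passed / total

def best_patch(task: str) -> str:
    scored = []
    for patch in sample_candidate_patches(task):
        passed, total = run_tests_with(patch)      # the "simulation" step
        scored.append((reward(passed, total), patch))
    scored.sort(reverse=True)                      # highest reward first
    return scored[0][1]

if __name__ == "__main__":
    print(best_patch("fix off-by-one in pagination"))
```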
Eno Reyes [00:14:46]: The cost is that there are a couple of downsides. One, this can be slow, and it can be expensive. But what we found is that actually the biggest downside is that this is really difficult to model and debug. You are putting a ton of assumptions into your system when you build these simulations and rely on them for decision making, so you have to be really cautious about building these out. The scope of what you're trying to simulate, and where you use it, can really impact the amount of time your team is spending, the quality of the outputs, all this kind of stuff. I know this is pretty rapid fire, but hopefully you're taking notes, and if you have any questions, or if you're honestly just building in the space, hopefully this serves as a nice point of reference for how other people are thinking about building these things. So, onto environmental grounding.
Eno Reyes [00:15:40]: One of the most critical aspects of building agents and agentic systems is the set of interfaces your agentic system has to access the rest of the world. You really do need to be building dedicated tools for your agentic systems. In particular, if you're building an agentic system that interacts with computer systems, you should really be thinking about where you want the agent to be able to access things, and what abstraction layer you want your agent to be operating at. For example, a calculator is a very simple tool, right? If you give an agent access to a calculator, it'll be able to do math a lot better. If you give it access to a custom API wrapper around a couple of key actions in a platform, the agent is going to start to become a lot more fluent in that platform. If you give your agent access to a sandboxed Python environment that allows it to execute potentially arbitrary code to achieve its goals, it's obviously going to be able to do way more. But with that breadth comes a decrease in accuracy. So as you move up and down that layer of abstraction, you should think about where you want your agent to be operating, where it can be most effective at taking the reins of the tool, and where you need to actually just hard-code or define those interfaces.
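A small sketch of what dedicated tools at different abstraction layers can look like in code: each tool gets a name, a description the model sees, and a narrow signature, from a simple calculator up to a (heavily caveated) code-execution tool. The names, schema, and "sandbox" here are illustrative stand-ins only.

```python
# Sketch: a minimal tool registry at different abstraction layers. The schema
# and names are illustrative; the "sandbox" here is NOT a real sandbox.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Tool:
    name: str
    description: str          # what the model is shown when choosing tools
    run: Callable[[str], str]

def calculator(expression: str) -> str:
    # Narrow, low-level tool: high reliability, low breadth.
    # eval() on arbitrary input is unsafe; restrict to arithmetic in practice.
    return str(eval(expression, {"__builtins__": {}}, {}))

def create_ticket(title: str) -> str:
    # Mid-level tool: a wrapper around a couple of key platform actions
    # (hypothetical; in reality this would call your issue tracker's API).
    return f"created ticket: {title}"

def run_python(snippet: str) -> str:
    # High-breadth tool: arbitrary code execution. In a real system this must
    # run in an isolated sandbox; this stub does not execute anything.
    return f"[would execute in sandbox]: {snippet!r}"

TOOLS = [
    Tool("calculator", "Evaluate an arithmetic expression.", calculator),
    Tool("create_ticket", "Create an issue with the given title.", create_ticket),
    Tool("run_python", "Execute a Python snippet in a sandbox.", run_python),
]

if __name__ == "__main__":
    by_name = {t.name: t for t in TOOLS}
    print(by_name["calculator"].run("2 * (3 + 4)"))
```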
Eno Reyes [00:17:01]: One big downside here, as well, is that it's difficult to design and maintain these tools if they are agent- or LLM-specific. The second biggest thing for environmental grounding is explicit feedback processing. Your agent is likely to be operating in a complex environment, which means the feedback from different interactions and processes is going to be quite large. One example: if you're working with coding agents, you might encounter log data quite frequently, and this could be truly hundreds of thousands of lines of standard-out output and debug logs. So when you think about this log data, preprocessing is really critical to building your agentic system. If you think about humans, you've probably heard this before:
Eno Reyes [00:17:50]: We ingest huge amounts of data, right? The bandwidth of our eyes, our ears, our sensory capabilities is enormous, but we focus in on the details and the information that matters. Agents don't have that complex biological architecture, so you actually have to define that signal processing, that feedback processing, yourself. If you can parse through the noise and give your agent more of the critical data, it will be significantly better at solving problems.
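A toy sketch of that feedback processing: rather than handing the agent hundreds of thousands of raw log lines, keep only the lines that carry signal plus a little surrounding context. The keyword heuristics below are intentionally crude placeholders for whatever signal extraction fits your environment.

```python
# Sketch: compress a huge block of stdout/log output down to the lines that
# carry signal before it ever reaches the model. The keyword heuristics are
# crude placeholders; real preprocessing is usually environment-specific.
SIGNAL_MARKERS = ("error", "fail", "exception", "traceback", "warning")

def compress_logs(raw: str, max_lines: int = 50, context: int = 2) -> str:
    lines = raw.splitlines()
    keep: set[int] = set()
    for i, line in enumerate(lines):
        if any(marker in line.lower() for marker in SIGNAL_MARKERS):
            # Keep the interesting line plus a little surrounding context.
            keep.update(range(max(0, i - context), min(len(lines), i + context + 1)))
    selected = [lines[i] for i in sorted(keep)][:max_lines]
    dropped = len(lines) - len(selected)
    return "\n".join(selected + [f"... ({dropped} lines omitted)"])

if __name__ == "__main__":
    raw = "\n".join(["INFO step %d ok" % i for i in range(1000)]
                    + ["ERROR: AssertionError in test_login", "  expected 200, got 500"])
    print(compress_logs(raw))
```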
Eno Reyes [00:18:30]: The third thing here is bounded exploration. You really do need to allow your agent to gather more information or context about the problem space. If your agent just charges off on a task without any information gathering, or does one initial round of information gathering and then you just let it run, that can be effective if your agent is really, really tightly scoped. But as you build increasingly general systems, the need to enable your agent to retrieve, learn, and otherwise gain more information becomes pretty critical. This helps the agent understand the problem a little bit better, but you also have the potential to overload the context or stray from the key path, because retrieval is difficult and getting just the accurate information can be tough.
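A minimal sketch of bounding that exploration: give the agent an explicit budget of information-gathering calls before it has to commit to acting on what it has. The budget number, the retrieval stub, and the callbacks are arbitrary stand-ins.

```python
# Sketch: bounded exploration. The agent may issue a limited number of
# information-gathering queries before it is forced to act on what it has.
# The search function and budget are arbitrary stand-ins.

def search_codebase(query: str) -> str:
    # Stand-in for retrieval (grep, embedding search, doc lookup, ...).
    return f"[results for {query!r}]"

def explore_then_act(task: str, propose_queries, act, max_queries: int = 3) -> str:
    gathered: list[str] = []
    for query in propose_queries(task):
        if len(gathered) >= max_queries:       # hard exploration budget
            break
        gathered.append(search_codebase(query))
    # After the budget is spent, the agent must act with whatever it found.
    return act(task, gathered)

if __name__ == "__main__":
    result = explore_then_act(
        "Rename the config flag safely",
        propose_queries=lambda t: ["where is the flag defined", "who reads the flag",
                                   "tests mentioning the flag", "docs mentioning the flag"],
        act=lambda t, ctx: f"acting on {t!r} with {len(ctx)} pieces of context",
    )
    print(result)
```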
Eno Reyes [00:19:35]: And then finally, human guidance. This is kind of the last thing to say: you're probably building an agentic system that will somehow interact with a human, and so you should involve the human's input to improve your agent's reliability. That requires really careful UX design and thinking about where it makes sense to intervene or interact with an agent. You should really think deeply about this. And if your opinion is, oh well, we'll build the agent, and once the reliability increases we won't even need it to interact with humans, I would recommend rethinking how you're building the system, because at the end of the day, most of these systems are not going to be reliable enough to do large tasks on their own 100% of the time. Even if the agent disengages or fails to accomplish a task only, let's say, 10% or 20% of the time, at scale that's actually a ton. So you do need to think about the design of these interactions. Yeah. So that is pretty much it.
Eno Reyes [00:20:16]: Hopefully this was interesting and jostled some ideas. I just wanted to throw this up here: we are hiring, and so if you think that bringing autonomy to software engineering is a compelling problem, we really do think this is the highest-leverage problem in the world. So feel free to email me or the team if you're interested in joining.
Adam Becker [00:20:34]: Thank you very much for this. Folks in the chat absolutely loved the presentation, and we'll be sharing this out as soon as recordings are ready. I hope this one is one of the recorded talks; I know that some of them are. I'm very curious about the kind of work that led you to reason so abstractly about so many of these things. I mean, you've managed to pack so many different things into this. I have a bunch of questions; we're not going to get through most of them, for sure.
Adam Becker [00:21:07]: Let me see if anybody in the audience dropped a question. Okay, not a question yet for you, so let me take a stab at it first. In one of your first slides, one of the things you spoke about was trying to come up with ways to make sure that the reasoning converges. Do you remember that? Any tips on how to go about that?
Eno Reyes [00:21:33]: Yeah, absolutely. This is quite challenging, and it depends on whether you have a very simple problem, so I can try to ground it in very concrete examples. Let's say you are building an agent for code review, right? You just want your system to review code. The boundaries here are actually pretty tight: when I create a pull request, and maybe there's some logic around what justifies a code review for a pull request, I want a system to go and look at it, look at the other components, and then leave a code review. Those are some pretty undeniable requirements. If that system doesn't end with a comment in your GitHub, that system failed. It's pretty easy to put that on rails and say, look, we should converge on this specific reasoning, down this path. For a system that's a general coding agent, one you might want to task with something like, hey, I want to build a feature, there are truly a million different ways that your system can do that.
Eno Reyes [00:22:33]: And so if you have really, really guided boundaries, you can build that into the system. If you don't, you have to use techniques like what I mentioned, model predictive control, these Kalman-filter-like systems, in order to make sure your system keeps going down a path that is aligned with what you originally intended it to do.
Adam Becker [00:22:56]: There's another question here from occurred, if I'm pronouncing it right: what is the most differentiating factor? So let me say how I suspect the question arose: if you're trying to put together this agent and there are all these different things you can do, where should you start? How do you know that doing a given thing will help you differentiate? Suppose everybody is using the same model. What is nevertheless going to be the most useful for driving you to the best outcome?
Eno Reyes [00:23:28]: I really think it's what's on this slide right here: your ability to build the system. In particular, I'd go as far as to say the AI-computer interfaces are the differentiating factor. Most of the model providers are building really, really capable systems. o1 is really great at planning, backtracking, reasoning; a lot of this stuff you kind of get out of the box with these o1-style models. What o1 does not have out of the box is an integration, a tool, and a definition of that tool that works with some other system that you've built. So if you are building agents, I highly recommend the interfaces as a place to start.
Adam Becker [00:24:11]: Very valuable advice. Thank you very much. Next we have Brooke. And to be honest, one of the things that I wanted to do was ask you about the simulations that you had there, and I think that's going to be exactly what Brooke's talk covers, you know. Thank you very much. If you can please stick around in the chat and answer some of the other questions folks have had, that would be lovely. Thank you very much.
Adam Becker [00:24:36]: Eno.