Catastrophic agent failure and how to avoid it // Edward Upton // Agents in Production 2025
SPEAKER

I'm currently focused on building production-ready browser agents, working on both the intelligent agents themselves and the underlying infrastructure needed to support them at scale. My work emphasises creating resilient, reliable systems while establishing robust evaluation frameworks to ensure these automated browser agents perform consistently in real-world applications.
Before diving into the world of autonomous agents, I spent time at Netcraft developing services to combat online fraud for top banks and major brands. This experience gave me deep insights into building secure, enterprise-grade systems that protect organisations and their customers from digital threats - knowledge that now shapes how I approach creating trustworthy and dependable browser agents.
SUMMARY
Increasingly powerful agents are leading to increasingly high-stakes automation. From fighting fraud at scale with LLM pipelines to handling healthcare and insurance data with browser agents, I've observed my fair share of consequential agent failures. In this talk I'll share what I've learned about navigating the agent failure landscape.
TRANSCRIPT
Edward Upton [00:00:08]: So yeah, my name's Edward and I'm a founding engineer at Asteroid. A brief summary of what Asteroid does as a company: we provide agentic browser solutions to solve repetitive tasks in the browser, we build the infrastructure and platform for that, and we work with companies to help them migrate to using agents. From my experience at previous companies and at Asteroid, we see examples of agent failures, which are inevitable. But today we're going to take a brief look at specific real-world examples of when these failures become catastrophic, and how we try to avoid that at Asteroid. At a high level, our customers nowadays expect agents to replace human tasks, but with the same qualities that a human would have. That includes human-level accuracy, so they expect the same performance from the agent as from a human, adapting to changes and properly understanding the context of the domain it's working in; consistency, so if you provide the same inputs, you get the same outcome; and accountability.
Edward Upton [00:01:21]: When there's a human doing the work, you fully expect them to be the one you can blame if something goes wrong. With agents it's a bit harder, because do you blame the model builder, or do you blame the infrastructure provider? But what we're seeing is that our customers expect these three core components, especially in the domains we work in. We work in healthcare and insurance, and these are domains, as we'll see, that are susceptible not just to the expected agent failures, but to catastrophic ones, where they have real-world negative impacts. A key thing we've learned is that a working agent is one that has zero false positives. False positives, as we'll see in a moment, are the worst kind of outcome to have with an agent. So what makes a failure catastrophic? The standard hype use case of agents you see when a new model is released is people going and using chat interfaces to test them out. But an actually useful agent isn't operating in that kind of enclosed domain.
Edward Upton [00:02:25]: It's working with capabilities that have real-world side effects. This can be things like interacting with the world as a human would, and we're seeing more of it with full terminal access. That's one component that means an agent can really influence the outer world outside of its little sandbox, so its actions really do matter. The other component that makes a failure catastrophic is not being able to simply click undo. Agents naturally work in domains that have a lot of state. We specifically work in the browser space, and in a browser, whenever you click a button or move to a different page or whatever you do, state is building up.
Edward Upton [00:03:10]: And it's not easy to undo that state. We wish we could, but you can't just undo your actions. That goes hand in hand with the real-world side effects: you are making changes to the outside world, or to your internal agent state, that you can't simply undo. What this means is that when you reach a failure with a real-world, useful agent, it is quite often catastrophic. So it's important that you either stop yourself getting to that failure, or you identify that a catastrophic failure occurred so you can manually intervene and do the best you can, at a human level, to rectify the issue. I'm going to go through two examples of real-world situations we've seen agents complete, the main problem with what happened, and how we can solve it. The first example is a scam-intelligence mining product, which received emails and then sent out emails to try to mine for scam intelligence, and which behaved in an untoward way. What happened was we received a spam email from a bank.
Edward Upton [00:04:22]: What had actually happened is the bank had sent out an overdraft email, but they'd mistyped the recipient address, and the address they accidentally sent it to was a spam trap, whose data was then provided to us. So we thought, oh, this might be interesting, this might actually be a scam email. Now, in this product we did have a human approval process, but the issue was that the approval step was quite tedious. It's not an enjoyable thing to do, and you're looking at thousands and thousands of emails, so you can easily make a mistake. What happened is we humanly approved that message as one we had authorized this chain of intelligence mining to go ahead with. After that, the agent said, okay, awesome, we're going to go and mine this for data, pretend to be the actual bank account holder, and proceed to have conversations with the bank. Eventually the bank became suspicious, realized they probably weren't talking to the real person, and concluded they were talking to a human who was acting on behalf of the actual account holder.
Edward Upton [00:05:42]: And that was against their terms, so they sent a notification that they were going to completely block the account. Now, it's easy to think the bank blocking the account was the problem, but actually the problem arose from that human approval step. And we've noticed, and I'll get onto this a bit later, that having human approval is really good. Having a human in the loop, especially with new agents, is a really good idea. However, if you haven't got good enough tooling for it, humans just tend to make mistakes, and it's no better than the human doing the full task themselves. So that's the key.
Edward Upton [00:06:20]: Step number two here, the key issue with this solution, was that we didn't have the best human-in-the-loop tooling. The second example was one at Asteroid, where we had an agent booking healthcare appointments. Again, this is a very sensitive area. We received a request from a customer to book an appointment on behalf of a user. We ran the agent; it starts running, it starts filling in the form. However, at some point it went down a branch that meant it incorrectly didn't book an appointment, when the input data suggested that it should have. But because it still reached what we deemed at the time to be a successful state, at the end the agent output success. So from our level, we didn't see that anything had gone wrong.
Edward Upton [00:07:09]: From the customer's level, they were reviewing all the executions and thought nothing had gone wrong either. But this was actually a catastrophic failure, because we didn't correctly evaluate the output, and so we incorrectly classified it as a success. The side effect here is that the actual end user didn't get an appointment booked, and in the case of a serious health problem, that is not acceptable at all. This comes back to the unrecoverability component of failures: we weren't able to recover, or have a human intervene, because we didn't even know. And when you're running hundreds or thousands of executions a day, you can't go through each one and analyze the logs.
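A lesson from the first example is that approval queues fail when reviewers wade through raw data at volume. As an illustration only (this is not Asteroid's product, and every name here is hypothetical), an approval gate might surface structured context to the reviewer and default to blocking anything that carries a risk flag:

```python
from dataclasses import dataclass

@dataclass
class ApprovalRequest:
    """One item queued for human review, with enough context to decide."""
    item_id: str
    summary: str           # agent-written one-line summary of the proposed action
    evidence: list[str]    # key excerpts the reviewer should see, not the raw dump
    risk_flags: list[str]  # heuristic flags, e.g. "sender is a legitimate bank domain"

def review(request: ApprovalRequest) -> bool:
    """Default-deny: anything with open risk flags needs explicit sign-off
    and can never be waved through in a bulk-approval pass."""
    if request.risk_flags:
        print(f"[{request.item_id}] BLOCKED pending review: {request.risk_flags}")
        return False
    print(f"[{request.item_id}] queued for lightweight approval: {request.summary}")
    return True

# A mistyped bank email that landed in a spam trap would carry a risk flag,
# so it gets held rather than silently approved among thousands of others.
req = ApprovalRequest(
    item_id="email-4821",
    summary="Engage sender while posing as the account holder",
    evidence=["Subject: Your overdraft notice"],
    risk_flags=["sender is a legitimate bank domain"],
)
assert review(req) is False
```

The point of the sketch is only that the tooling, not the human, should carry the burden of spotting the one anomalous item in a tedious queue.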
Edward Upton [00:07:54]: So the two key learnings from these examples are, first, that you want a human in the loop, especially when you're building a new agent, with a good enough tool set that it's not a tedious process; and second, that you want good evaluators. So how do we solve this at Asteroid? The first, very obvious solution is to really scope what your agent can do. It can be very easy to build a full ReAct-based agent where you just give it a massive prompt and it thinks and acts, thinks and acts, going off on its own branch. We actually started with that at Asteroid, and what we found was that in the real world, our customers don't want this. They're happy to sacrifice a bit of build time for more predictability: something that is still agentic, and can still adapt to different inputs and slightly different pages, but is a bit more predictable. We also noticed that it's very good to always have this ability to fail at any point.
Edward Upton [00:09:02]: You don't want the agents to have a bias that keeps them progressing. The way we build agents, and the platform we've built, means the agents don't have a bias against failing; they can always fail out if they need to. Because at the end of the day, it's fine if you have a false negative, but not if you have a false positive. If you have a false negative when the agent could actually have completed the task, we can come in and resolve it, and it had no side effects on the outside world. Whereas if you have a false positive, you can't really get out of that state. The second key thing, which I touched on earlier with the human in the loop, is having really good tooling and visibility, almost a debug platform, for your users. Our users at the moment are fairly technical, so they're happy to go in and look at the logs. But it's also useful to have views that a human can take in at a glance.
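That fail-out bias can be sketched in a few lines. This is a hypothetical illustration, not Asteroid's implementation: every step may return a safe failure or an escalation, and the workflow never coerces an ambiguous result into success:

```python
from enum import Enum, auto

class Outcome(Enum):
    SUCCESS = auto()
    FAIL = auto()      # safe false negative: a human can rerun the task
    ESCALATE = auto()  # pause and hand control to a human mid-run

def run_workflow(steps):
    """Run named steps in order; any step may bail out. The workflow
    stops at the first non-success instead of pushing on optimistically."""
    for name, step in steps:
        outcome = step()
        if outcome is not Outcome.SUCCESS:
            return name, outcome
    return "done", Outcome.SUCCESS

# An ambiguous confirmation page escalates rather than being
# optimistically recorded as a booked appointment.
steps = [
    ("fill_form", lambda: Outcome.SUCCESS),
    ("confirm_booking", lambda: Outcome.ESCALATE),  # page text was ambiguous
    ("report", lambda: Outcome.SUCCESS),
]
name, outcome = run_workflow(steps)
assert (name, outcome) == ("confirm_booking", Outcome.ESCALATE)
```

The design choice is that an escalation or failure is always available and always cheap, so the expensive outcome (a confident wrong success) is the one the structure discourages.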
Edward Upton [00:09:53]: Your more typical consumer can then understand very simply what the agent is doing. Agent building, especially in these domains, is quite a collaborative process: if you're not just going to have one prompt that does everything from start to end, there's going to be an iteration cycle of building in more determinism, choosing where you want more agency, and then adding your rules and your guardrails. So visibility into what the agent is doing, with a recording of each step and where it happened, is really important. This has given us the tools to build better agents and improve the platform, but it also gives the end user confidence: they can see how the agent is working and that it's following a roughly defined path. Especially in healthcare and insurance, they want to know it's following this path, because they don't want it to deviate and do something incorrect. The other key thing here is always escalating if you need to, especially in the building phase. If the agent is stuck, maybe because it's missing some input data, you shouldn't just fail, because there might already have been side effects from the agent running up to that point.
Edward Upton [00:11:13]: If you've already submitted one part of the form but not the other, you've half filled in the form, so you don't want to fail too early if there's a way to recover, and you want to feed that back in to improve the agent for the next execution. The third feature, and I know a lot of talks have covered this today, is evaluation. Your agent is only as accurate in its output as the evaluators you use. What we found is that you shouldn't trust the agent's own evaluation. It's useful, because the agent has the full context and is in the flow of the task, but you also need a separate system that doesn't have this goal of just completing the execution, one that critically analyzes the full log against your evaluators. And we found that different customers need different evaluators. Getting to the end of the execution might be a success for one customer; for another, it might actually be a failure.
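The separate-evaluator idea could be sketched like this. It's a hypothetical illustration, not Asteroid's API: customer-specific evaluators run over the execution log after the fact, and the agent's own success claim is necessary but never sufficient:

```python
from typing import Callable

# An evaluator inspects the full execution log after the run,
# independently of the agent's own "success" claim.
Evaluator = Callable[[dict], bool]

def appointment_booked(log: dict) -> bool:
    # Reaching a confirmation page is not enough; the booking must exist.
    return log.get("booking_reference") is not None

def symptoms_entered_fully(log: dict) -> bool:
    return set(log.get("symptoms_entered", [])) == set(log.get("symptoms_requested", []))

def evaluate(log: dict, evaluators: list[Evaluator]) -> bool:
    """Overall success only if the agent claims it AND every
    customer-specific evaluator independently agrees."""
    return log.get("agent_claims_success", False) and all(ev(log) for ev in evaluators)

# The healthcare failure from the talk: the agent saw a "form submitted"
# page and claimed success, but no appointment was actually booked.
log = {
    "agent_claims_success": True,
    "booking_reference": None,
    "symptoms_requested": ["chest pain"],
    "symptoms_entered": ["chest pain"],
}
assert evaluate(log, [appointment_booked, symptoms_entered_fully]) is False
```

Because the evaluator list is just data, each customer can plug in their own definition of success without touching the agent itself.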
Edward Upton [00:12:21]: Think of a QA case: maybe you're testing where a user can go in a platform, and if they can get to a certain page they shouldn't have been able to reach, you want that evaluated as a failure. So having these flexible evaluators, running as a discrete system after the execution, is very important, so you can identify when the agent has failed and when you need to come in and provide human assistance. Cool. So thank you very much for listening; I only had a short amount of time to talk here. If you want to find out more about what Asteroid does, the website is there, and you can go onto our platform and build an agent. We've recently released a graph-based workflow which makes it quite a bit easier to build agents from the ground up. You can still just make it one node that builds a full agent from scratch, but you can play around with it and see what node structure works well, to make sure you build these reliable, production-ready agents.
Edward Upton [00:13:22]: And if there are any questions, I'll happily take those.
Adam Becker [00:13:29]: Excellent, Edward, thank you very much. Let's see if any questions come from the chat, and if not, I'll ask you a couple myself, although I know we're running a little short on time. So if you don't mind, while folks are thinking: can you go back a couple of slides? Yeah, this one? No, this one. Well, actually, slide two. Yeah, use case two. So what do we do about the fact that, basically.
Adam Becker [00:14:14]: The agent thought that it was confirmed, but because it had reached the confirmation.
Edward Upton [00:14:19]: Page.
Adam Becker [00:14:22]: That's what I'm understanding from the third stage. It did reach a confirmation page; it could just be that it was confirming a different thing, or maybe it wasn't the right confirmation page. Then it walked away with a completely different impression, and there was a gap between its impression and the reality that didn't get picked up by subsequent steps.
Edward Upton [00:14:45]: Right, yeah, exactly. We see that a lot of forms, especially healthcare forms, are very branching. Depending on, say, the symptoms you pick or what you're requesting, whether it's an appointment or whatever, you can often still reach a kind of success page, which basically just means they've received your inquiry, whatever the inquiry might be. What we noticed with this specific example is that the agent filled in the symptoms in such a way that it thought it didn't need to book an appointment, because it judged the symptoms not serious enough, when in actual fact they were serious enough to warrant one. So it still reached this final success page, because the inquiry was received; it probably showed some text like "form submitted, but you don't actually need to go and see the GP; we haven't booked you an appointment." And the issue is that if the symptoms provided had been input fully correctly, it should have booked, but you can't rely on agents performing 100% accurately.
Edward Upton [00:15:49]: Right. You've just got to make sure that, above all, they don't produce false positives, and that you can go back; whereas in this case we didn't even get that visibility.
Adam Becker [00:15:58]: Right, right. So this is connecting to a point that Brahma san is making in the chat, and it's leveraging the intuition you're now building up around how forms are laid out and what the logic behind forms is, let's say in healthcare. And the branchiness means that you then have to correct for them with that type of domain expertise. Brahma is asking: the evaluation seems so specific to individual domains, is that the case? And is there something more horizontal that can be built? Is there a flow that is generalizable despite individual differences in these domains?
Edward Upton [00:16:43]: What we found, especially with our early customers, is that they want solutions to two issues. They've been filling in this form for years; it's the same form, and it might be the same form across multiple healthcare providers. And a lot of the time they want to have this: you can see that for the last node in this graph, the only way you can get success is with this text being on the screen, and then the evaluators say, okay, we've reached success, but did you actually fill in these symptoms? Did you actually go down a different branch? I think the key thing here is that, yes, you could maybe get 80% of the way there with a one-prompt ReAct agent, very generalized, but our customers don't want that. They are coming from not using agents at all.
Edward Upton [00:17:42]: And they have humans doing everything. So it's really accountable and they kind of trust that humans are doing the job correctly and that they don't want to transition all the way to this very generalized approach which doesn't even work for them. The technology isn't there. We'll come to find as we run more agents for customers that we'll see these generalized things that we can generalize between agents and maybe new nodes that we can put in that help make the user trust our agent but are still under the hood more general to other use cases. So it's less hard coded.
Adam Becker [00:18:14]: Yeah. Okay, we've got another one from Tom; there's a couple here. Tom, thanks for your question. Question for Edward: is there an intractable trade-off between agency and determinism? The goal is autonomous but safe agents, but autonomy tends towards unpredictability, and safety tends towards non-LLM workflows. Is it all about finding these balances?
Edward Upton [00:18:38]: Yes, for sure. And different customers have different use cases; that's why we moved to a graph-based platform. It allows you to have just one node with a really general prompt: here's the goal, here's the input data, go and, I don't know, book me a holiday. That's the state-of-the-art example you see. However, most of our customers so far want a safer "do this" node with some determinism. So sometimes in some nodes you just have a full script that runs, because users need to know that that step will succeed. Logging in, for example, is one you just want to script. And also for speed: these agents aren't free in terms of price or speed.
Edward Upton [00:19:25]: So if you know that loads of different agents can use the same login page, you can just abstract that out. There's no reason not to abstract it out other than if that page updates right. And having the kind of backup agency if you need it is useful but if you can determinism for the kind of customers we're dealing with is kind of important.
Adam Becker [00:19:46]: We've got a couple of other ones. Nancy's asking: would you say these methods are enough to mitigate false positives, or are there more specialized approaches to false positives?
Edward Upton [00:19:55]: I think unless you human-label every single outcome, and even then humans make mistakes, you're always going to have false positives. When we took a different view on how to reach success, we reduced our false positive rate drastically, to the point where we rarely see them nowadays, for the customers that really want to make sure they never get false positives: the unrecoverable, booking-a-GP-appointment kind of tasks. Otherwise it's still a balance; it depends on the customer. You might have a general consumer who wants to automate a very simple task, and they don't really care about building evaluators because it's not a mission-critical problem.
Adam Becker [00:20:42]: Okay, last one here, from Aditya. How would we evaluate the reasoning cases, say information extraction, summarization, etc.? Is it just LLM-as-judge with a human in the loop? Is that the only solution, or is there something more robust that's possible to use?
Edward Upton [00:21:01]: Is this the reasoning behind the actions being performed, so before the agent does an action, or is this more on the evaluation side?
Adam Becker [00:21:14]: Let's say both.
Edward Upton [00:21:16]: For reasoning, we find we don't need to do much of it, and the reason why is that we provide the data to the agent fairly structured already. Because you've got these separate nodes, what you're actually asking in each node is often not that complex a task, because again, that's what our customers want: they want to know that the agent is going to complete the task correctly 100% of the time. In the case of evaluation, analyzing an existing execution, we find that the agent's own summarization of what was done, and its output classification, are sometimes not 100% correct. The evaluators often clean up the cases where the agent's own classification is incorrect, and that's because they're scoped and their compute is used for a different purpose: at that point you're not trying to be creative, you're just trying to classify.
Adam Becker [00:22:22]: By the way Edward, I think Aditya is doubling down. He says mainly for evaluation purposes.
Edward Upton [00:22:29]: Yeah, basically what I just said: for evaluation, you can trust a lot more that the compute you're using there is going to be better than the general agent's compute when it's producing its initial output, like deciding which node to go to. The evaluators make a decision irrespective of what the agent did, and you can use completely different models and completely different prompts there. Users can build their own prompts to evaluate whatever they want and output different classifications. So yeah, I hope that answers the question.
Adam Becker [00:23:09]: Okay, last one, promise. Fernando is saying humans still make tons of mistakes: fatigue, not paying attention, lots of other reasons. How close are you to having a smaller percentage of false positives than humans, do you believe?
Edward Upton [00:23:25]: Very, very close, I'd say, though it's difficult to measure. We're operating over so many different domains, and we would need data from many different customers to know, and I imagine a lot of the time those statistics are biased within the companies. The advantage of using agents is that they don't have a past; they're not trying to hide their mistakes. If an agent produces a false positive, we say, oh no, this made a false positive; there's no hiding going on there. Whereas if a human is doing it, they're often going to try to recover it themselves.
Edward Upton [00:23:59]: The false positive might never be known, so it's difficult to get this kind of data. What I think is the tricky part with agents is the accountability side: who do you blame if an agent fails? Do you blame the suppliers of the input data? Sometimes we see that the input data is missing and then we might fail, when we could maybe have deduced it: if a phone number starts with a certain code, you can deduce the country. But again, who do you blame? Is it the agent builder, the infrastructure provider, or the user who executed that run of the agent? That's the difficult part at the moment: figuring out how good our platform needs to be as the infrastructure provider. Sometimes the agent builders aren't responsible for the agent failures; most of the time it comes down to the data that was provided.
Edward Upton [00:24:51]: Therefore, like a legitimate negative case, a legitimate failure from a user's perspective.
Adam Becker [00:24:59]: Edward, thank you very much for joining us. Drop your LinkedIn in the chat below and I'll make sure to put it so that everybody can follow up, connect with you, and send you questions as they have them. Edward, thank you very much.
Edward Upton [00:25:13]: Thank you very much. Thank you for having me.
