LLMs & the Rest of the Owl // Neal Lathia // Agents in Production
AI agents change what we can achieve with software, but they also change how we have to think about building software systems. In this talk, I'll share some of the lessons we've learned while building a powerful AI agent for complex support settings.
Andrew Tanabe [00:00:04]: And I'm excited to welcome Neal Lathia to the stage. Neal is the co-founder and CTO of Gradient Labs, and he's here to talk to us about LLMs and the rest of the owl. So Neal, I'll let you take it away. You've got about 20 minutes here, and I will come in about two minutes before the end as a time check, and then also be ready to facilitate some Q&A at the end. So take a look.
Neal Lathia [00:00:28]: Audio still good on your end, right? Awesome. Hi everyone. I can't see anyone and I can't hear anyone, so this is like me speaking at the wall and at Andrew. So welcome to stage four. My name is Neal, I'm co-founder and CTO at Gradient Labs, which is a young startup in London, UK. Before that I was the Director of Machine Learning at Monzo Bank. And Demetrios has kindly invited me onto his podcast previously, which I highly recommend that you check out.
Neal Lathia [00:01:03]: So at Gradient Labs we're building an AI agent that automates complex customer support work for enterprises end to end. And I know that customer support and AI agents are two terms that seem to go hand in hand these days, but we launched ours a few months ago and we're now running it live with a few companies, specifically the type of companies that care about really precise, high-quality outcomes. So I'm here today to talk a little bit about what it's like to run an AI agent in production and some of the things that we've learned. The title of my talk, the Rest of the Owl, is riffing on that meme of how to draw an owl: you draw two circles and then you draw the rest of the owl. It feels a little bit like that today with AI agents. I've met countless engineers who have really successfully prototyped with LLMs. And actually the beautiful thing about that is that prototyping with an LLM is now easier than prototyping with any of the more traditional machine learning models.
Neal Lathia [00:02:16]: But the age-old story of the MLOps community is still true: getting a prototype into production is hard. And getting an LLM product, agent, or agentic workflow into production is actually a new kind of beast. If I think about the sorts of systems I've been building in the past, I've tiered it into three levels. In my previous role I had MLOps-type systems to build, model registries and feature stores, and the key question in building those is really: does it work well? Are the features up to date, are the models being served in real time, and so on. Then, looking at building systems that are more traditional ML, like fraud detection, the mindset shifts: not just does it work well and does it serve predictions quickly, but you also now need to factor in whether it is generally making good decisions. And today, in terms of building an AI agent, we've got this third thing to worry about: does it work well, is it generally making good decisions, and also, what are all the million ways that it can go wrong? Thinking about where we are today with AI agents, it's still a relatively new topic, but I'm going to make some assumptions: that production AI agents are still going to be built by a set of people who are coming together to automate some sort of bounded problem, and who really want to have some impact on the world by shipping it to production. The inverse of that would be an AI agent that builds itself, that is AGI, and is just going to go out and wreak havoc on the world. We're hopefully very far away from that. And particularly if you're in a startup like mine, I'm sure you're surrounded by smart engineers, you've got some mission that you want to achieve like ours, and you really want to get it out there and live as soon as possible.
Neal Lathia [00:04:32]: So in light of that assumption, the rest of my talk focuses on two things: the spectrum of challenges that AI agents bring to the forefront, and a few insights about how we're tackling them. Here's one way that we could cut up the AI-agents-in-production problem space. I've broadly taken four big areas. On the left, you've got to think about how your AI agent is going to interface with the rest of the world in order to do its job. On the right, you've got the LLMs that are going to be serving completions, whether they are your own LLMs, Hugging Face models you're running yourself, or OpenAI, Anthropic, Google, there are just so many out there. And then there's the meaty middle of the sandwich, which I suppose is the part that teams building AI agents are spending the majority of their time on: on the top, all of the things that the AI agent needs in order to do its job, and on the bottom, the code that is the AI agent itself and defines how it operates.
Neal Lathia [00:05:53]: So let's dive into each one of these one by one, starting from what feels like the simplest bit. All agents need a way to be invoked, a way to be told to do some work, and then a way to let the world know that that work has happened. In the world of customer support, that means integrating with support platforms, such that when conversations get started, the AI agent can pick them up and start replying and resolving customer issues. The first key difference here is that a more traditional, even machine-learning-based system is effectively an interface that's going to sit there and wait for requests, receive them, and then spit back results when they're ready. But with an AI agent, we need to be thinking not just about requests for it to do work, but also about getting it to stop doing work when things in the outside world change, or getting it to do different work when certain things in the outside world don't happen. An example of that is a customer writes in, they trigger the AI agent, it starts doing its job to find a reply, and then halfway through that, the human customer writes in again. So technically, all the work that the AI agent has been doing under the hood is no longer relevant because new information has come in. The inverse of that is the problem of customers abandoning a chat or not replying: all of these sorts of situations where you won't be getting any API calls anymore.
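To make that in-flight race condition concrete, here is a minimal sketch of a staleness check: before sending a reply that took several seconds to produce, the agent re-checks whether the customer has written in again in the meantime. The store and helper names below are hypothetical illustrations, not Gradient Labs' implementation.

```python
import time


class ConversationStore:
    """Toy in-memory store standing in for whatever holds conversation state."""

    def __init__(self):
        # conversation_id -> list of (timestamp, sender, text)
        self.messages = {}

    def append(self, conversation_id, sender, text):
        self.messages.setdefault(conversation_id, []).append((time.time(), sender, text))

    def last_customer_message_at(self, conversation_id):
        times = [t for t, sender, _ in self.messages.get(conversation_id, []) if sender == "customer"]
        return max(times, default=0.0)


def handle_customer_message(store, conversation_id, text, draft_reply_fn):
    """Draft a reply, but throw it away if the customer wrote in again mid-flight."""
    store.append(conversation_id, "customer", text)
    started_at = time.time()

    # Slow step: typically several LLM calls, so seconds rather than milliseconds.
    reply = draft_reply_fn(store.messages[conversation_id])

    # Race-condition guard: a newer customer message makes this draft stale.
    if store.last_customer_message_at(conversation_id) > started_at:
        return None  # abandon the draft; a fresh invocation will handle the new message

    store.append(conversation_id, "agent", reply)
    return reply
```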
Neal Lathia [00:07:38]: Maybe those situations are triggers to get your AI agent to reach back out to the customer and see if their problem has been resolved. One way that we've been thinking about this is in terms of handling fast and slow race conditions between the AI agent and whatever is on the other side, particularly because even a basic agentic workflow that's going to make two or three LLM calls is going to be much, much slower than an API call that is retrieving a row from a database. It's that constraint, that AI agents will invariably be slow, which leads to a variety of different race conditions. Jumping all the way to the other side, to the LLM layer: this is the part where you could argue it's just another set of outbound API calls. I'm going to import the Anthropic or OpenAI client, put in my API key, and I'm off to the races. Well, in a production system, we need to dive a little bit deeper and start thinking about the range of problems that can happen in this space.
Neal Lathia [00:08:48]: At its most basic, as you might be aware, making an API call to an LLM that successfully completes doesn't mean that it has successfully given you a good completion back. There's this concept of LLM calls that fail successfully that you need to manage. At the same time, with an AI agent like a customer support one, there'll be a requirement to differentiate between fast, very time-bound LLM calls, like answering your customers, and more delay-tolerant requests, like validating that the knowledge it holds is still up to date or processing new documentation asynchronously. All of this, if you are in a smaller startup like ours, happens under the potential constraints of rate limits from providers. This is an area that I've seen attracting a lot of attempted abstraction, where it feels like the type of problem a better abstraction would solve. But it's really difficult to do that in practice, because the performance of a prompt is strongly tied to the specific models that are being invoked. It's a little bit like the previous generation of deep learning models, where it was hard to separate the hardware from the model. In this case, it's hard to separate the LLM call from the prompt that's calling it.
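To illustrate the "fails successfully" point: an HTTP 200 from the provider doesn't mean the completion is usable, so it helps to validate the output before acting on it. Below is a rough sketch that assumes a hypothetical call_llm function standing in for whichever provider client is used; it is not Gradient Labs' code.

```python
import json


class BadCompletion(Exception):
    """The LLM call succeeded at the API level, but the content is unusable."""


def classify_intent(call_llm, conversation_text, allowed_intents, max_attempts=3):
    """Ask the model for a JSON intent label and validate it before trusting it."""
    prompt = (
        "Classify the customer's intent as one of: "
        + ", ".join(sorted(allowed_intents))
        + '. Reply only with JSON like {"intent": "..."}.\n\n'
        + conversation_text
    )

    last_error = None
    for _ in range(max_attempts):
        raw = call_llm(prompt)  # hypothetical helper returning the completion text
        try:
            intent = json.loads(raw)["intent"]
            if intent not in allowed_intents:
                raise BadCompletion(f"unknown intent: {intent!r}")
            return intent
        except (json.JSONDecodeError, KeyError, BadCompletion) as err:
            last_error = err  # the call "failed successfully": retry, then escalate

    raise BadCompletion(f"no valid completion after {max_attempts} attempts: {last_error}")
```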
Neal Lathia [00:10:26]: Now we're into the middle of the sandwich, and I've put it into two sections. It's very well known that the quality you'll achieve from an AI agent that needs to source information from a knowledge base is very much tied to the knowledge base itself. There's a lot of popular discourse on vector databases here, but in practice, what we've seen in a production system is that just taking all your documentation, integrating it, and throwing it into a vector database is almost a guaranteed way to reach a bad outcome. The insight behind that is that most documents, especially documents that are internal to a company, have not been written with an AI agent in mind. So they either make certain assumptions, like that you know what the company is, or they don't differentiate between information that is public and disclosable to customers and information that is private. There's an entire world of things to stay up at night worrying about to make sure that the knowledge you do pass through to the agent is clean and correct. And the biggest thing that we've found in running our AI agent for the last few months is that an even larger problem than trusting company documentation is that of missing information.
Neal Lathia [00:12:01]: I could probably spend an entire talk on that category, but the short headline is that most information inside of a company resides in people's heads and so never finds its way into documentation at all. So there's an entirely different approach that we've been adopting to solve for that. On a more technical front, there are two bits I'd like to point out. The first is that scaling a vector database is notoriously hard, and there's been a lot of talk about using things like approximate nearest neighbor (ANN) search as a way to get something working slightly faster. But as soon as you enter domains where you want answers from your AI agent that are very high quality, there's a problem with ANN, because approximate results are notoriously insufficient where high quality is of utmost concern. And then lastly, and this one is really more down in the weeds: one of the most important things we do when we ship an AI agent to production is to understand the things it has done in the past.
Neal Lathia [00:13:21]: So just having a knowledge base that stores live knowledge makes it impossible to debug an outcome that happened a few days ago if your knowledge has changed since. Similar to the story with feature stores, point-in-time references have suddenly become really, really important. That's all about knowledge. But naturally, AI agents are attractive solutions not just because they can do RAG, but also because they can invoke tools. A lot of demos are fascinating, and they do things like making Google searches, amending code in a repo, opening issues, and all these sorts of things. Once we venture... I think I'm back. Okay, thanks Andrew. Sorry for that interruption.
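Coming back to the point-in-time problem for a moment: one way to get that property is to keep the knowledge store append-only, so any query can be answered "as of" a past timestamp. A minimal sketch of the idea, not Gradient Labs' implementation:

```python
from dataclasses import dataclass, field


@dataclass
class VersionedDocument:
    """All historical versions of one document, so past agent decisions stay debuggable."""
    doc_id: str
    versions: list = field(default_factory=list)  # (effective_at, text), kept sorted by time

    def add_version(self, effective_at, text):
        self.versions.append((effective_at, text))
        self.versions.sort()

    def as_of(self, timestamp):
        """Return the text the agent would have seen at `timestamp`, or None if none existed yet."""
        latest = None
        for effective_at, text in self.versions:
            if effective_at <= timestamp:
                latest = text
            else:
                break
        return latest


# Debugging a conversation from a few days ago means querying knowledge
# as of that conversation's timestamp, not as of "now".
doc = VersionedDocument("refund-policy")
doc.add_version(effective_at=1_700_000_000, text="Refunds within 14 days.")
doc.add_version(effective_at=1_700_500_000, text="Refunds within 30 days.")
print(doc.as_of(1_700_100_000))  # -> "Refunds within 14 days."
```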
Neal Lathia [00:14:32]: Once we are outside of the domain of prototypes that can invoke public API tools and into the domain of an AI agent that is doing something for a company, you can rarely, if ever, allow your AI agent to make calls to private or sensitive APIs just for the purpose of evaluating it. So that's the challenge: how do you evaluate an agent that uses tools when it can't actually use those tools during an evaluation? Secondly, most of the tool demos that we see are retrieving data, maybe from the public Internet or from another source. The converse is that when you're using tools inside of a company, you need to start reasoning about the actual data that you get back. A very simple example: you've retrieved a customer's fraud status and everything looks fine, but then the customer says something else that triggers the fraud engine to change. The final area is that of agentic workflows themselves. Here I was thinking about what to say, and the best I can come up with is that when you go to engineering conferences about building large distributed systems, like banks or Uber or that flavor of system, it seems that there are industry-wide conceptual frameworks for how to build these things, regardless of the actual programming language being used.
Neal Lathia [00:16:23]: So you'll hear about Monzo bank for example being built with microservices. I don't feel like we're there yet with AI agents, so at least I don't think we're there yet from the sake of uniformity. But we do know that AI engineers want to be thinking about like the abstract behavior of the AI agent and not the lower level things like, you know, making sure that LLM calls get retried. So just to wrap up, here's some snippets on how we are thinking about this today at Gradient Labs. So the overall problem space I think, like I mentioned earlier, customer support automation is in my view the second most popular application for AI agents and LLMs. And it immediately brings to mind retrieval augmented generation. But we have certainly found that that is barely the tip of the iceberg for some of the largest enterprises that we are working with where just a rag based agent is going to probably only cut into 10% or less of the actual problems of what their humans are working on. So when we're building, we build AI agents from a high sort of medium and low level perspective.
Neal Lathia [00:17:45]: From the high level perspective, we architect our AI agents as state machines. So in one of our blog posts I made a very simplistic one of like imagine a state machine of you and I talking to each other. So you start talking and I enter the I'm listening state, you finish talking. So I figure out that it's my turn to start talking and we go back and forth like this until our conversation is complete. In practice, our AI agent is a state machine that has many more states than this. And some of them are deterministic, some of them are more agentic, but it allows us to control the flow between the different states using all the things that I mentioned earlier that matter, right? So what input we're getting from the outside world, timers, signals that things have changed and so on. And critically, whoever is working on this level of the AI agent is not working on anything that happens below it. But let me dive one level deeper.
Neal Lathia [00:18:46]: I've actually put a screenshot from our production code base to try and make this real. And this is I Call it the agentic bit. That's in between. And here's a very simple example where our AI agent is going to classify a customer conversation for its clarity. And if it thinks it's not clear, then it's going to respond to the customer and ask them to clarify. So this is like the most trivial example of something that an AI agent may need to to do. I think the key thing that I wanted to highlight here is we chose not to adopt any AI LLM framework just yet. We felt it was a little bit too early, especially given how much sort of like deep surgery we wanted to do here.
Neal Lathia [00:19:33]: But really the sort of developer experience that we wanted to have is that when you're writing this agentic in between, you're thinking at a behavioral level of what the AI agent is doing. So you're not thinking about the state level, you're not thinking about the LLM call level. You're somewhere in that middle ground, but hopefully you're writing codes that even when read, kind of makes sense. We've also adopted this approach of returning events every time that every single decision is made. And down at the lowest level, all our LLM calls are being executed with a durable execution engine, SHOUT out temporal so our AI engineers don't need to be thinking about like very low level behaviors like retrying things that have failed and so on. We're trying to strike that balance between enabling the agent to use different models, spread across different providers, and to have AI engineers just say, I want to use this model and not care about whether it's coming from Bedrock Azure or from another provider. And also quietly log everything that they're doing so that we can track all those completions and costs. And where it ends for us right now is we are live with a few partners.
Neal Lathia [00:20:57]: We've been running hundreds of conversations per day. And this is the thing that drives us is customers who find that experience so wonderful that they actually go ahead and thank our AI agent for its service. So that's my talk for today. Thanks everyone for joining this stage. I'll point you to our blog if you want to follow along on some of the technical write ups that we've been sharing. But feel free to also reach out to me by my email or on LinkedIn.
Andrew Tanabe [00:21:26]: Thanks, Neal.
Neal Lathia [00:21:27]: Thanks.
Andrew Tanabe [00:21:27]: That was really great. We had a lot of activity on the chat there, and I just wanted to take a couple minutes here to help with the Q&A. So we've got a question from Demetrios asking about, with all these different failure modes that you're talking about, these different levels of complexity that come in, especially when you go from research to production: how do you test these agents before you deploy them? Are your teams using A/B rollouts, some sort of champion-challenger model rollout, something like this? How do you go about testing in practice?
Neal Lathia [00:22:09]: Certainly in my previous roles, building the more traditional ML systems, things like A/B rollouts and shadow deployments were very, very feasible. With an AI agent, testing is a multifaceted thing; again, that could use a whole talk of its own. But one of the things that we have been really focusing on is being able to run simulations of what the AI agent would do in different scenarios on live, production-grade traffic. That's an entire other part of our platform, where we can spin up a PR of a version of our agent, run thousands of simulations, and then go in and either manually or automatically evaluate those simulations. The critical thing that changes from more traditional engineering is this: if you think of traditional engineering as a road from local dev to staging to production, this goes straight into production, but in an isolated evaluation environment as opposed to the live production environment.
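As a rough picture of the shape of such a simulation loop (all names here are hypothetical; this is not the Gradient Labs platform):

```python
def run_simulations(candidate_agent, recorded_conversations, evaluate):
    """Replay production-grade conversations against a candidate build of the agent.

    candidate_agent: callable that takes a conversation transcript and returns the agent's action.
    recorded_conversations: past conversations pulled from production.
    evaluate: manual or automatic scoring of each simulated outcome.
    """
    results = []
    for conversation in recorded_conversations:
        # No side effects: nothing is ever sent to a real customer during simulation.
        action = candidate_agent(conversation)
        results.append({
            "conversation_id": conversation["id"],
            "action": action,
            "score": evaluate(conversation, action),
        })
    return results
```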
Andrew Tanabe [00:23:21]: That's cool. It's interesting that you just put it right onto production data in a safe way, but then you get that diversity, that bell curve of weird edge cases, which I think is a lot of what you're talking about there. One more quick question here, let's see, from Vishal, asking what kinds of specific NFRs, that is, non-functional requirements, must be explicitly designed for in these agentic workflows where that agent is running. Are there specifics there that are different from a standard situation?
Neal Lathia [00:24:00]: So you mean, like, what? Well, my understanding of the question is this: the companies we're talking with right now all have slightly different needs with respect to how the agent behaves. A very simple one is language. If you imagine an AI agent that is only going to take on 70 or 80% of your customer support, then for the rest it's going to need to hand off to a human support agent. That means the AI agent can't talk in languages in which it can't hand the conversation off to another human. If the AI agent spoke to your customer in Italian and then handed the conversation off to your human agents, and none of them speak Italian, that's an example of the sort of requirement that we would lock down and add as a constraint. And then different companies have different policies with respect to, for example, letting the AI agent close out conversations, how long it should take to do that, what different channels they want to support, and so on. Right.
Neal Lathia [00:25:12]: So it's such a growing thing where, like companies want to have the dials. And when you enter the more high risk companies, it definitely even goes all the way down to the different intents that customers have. So in a bank, maybe they don't want your AI agent to talk to you about fraud. Right. But they do want to let it talk about other things. So those are the sort of things where we need to design a framework where we can control the AI agent both at the functional level, like set these timers for these values as well as the more like behavioral level. Like, don't talk about these things, but do talk about those.
Andrew Tanabe [00:25:50]: Yeah. I'd imagine it gets even more complicated as you get into like really heavily regulated industries and, you know, you're really just focusing on customer support here. But that's just such a small part of the application.
Neal Lathia [00:26:02]: Absolutely.
Andrew Tanabe [00:26:03]: Cool. Well, thank you so much, Neil. It's been great to have you and look forward to seeing where you go next.
Neal Lathia [00:26:09]: Cheers. Thank you, everyone. Bye.