9 Commandments for Building AI Agents
SPEAKERS

Paul van der Boor is a Senior Director of Data Science at Prosus and a member of its internal AI group.


At the moment, Demetrios is immersing himself in machine learning by interviewing experts from around the world in the weekly MLOps.community meetups. Demetrios is constantly learning and engaging in new activities to get uncomfortable and learn from his mistakes. He tries to bring creativity into every aspect of his life, whether that be analyzing the best paths forward, overcoming obstacles, or building Lego houses with his daughter.
SUMMARY
Building AI agents that actually get things done is harder than it looks.
Demetrios, Paul, and Dmitri break down what makes agents effective—from smart planning and memory to treating tools, systems, and even people as components. They cover the ReAct loop, budgeting for long tasks, sandboxing, and learning from experience.
It’s a sharp, practical look at what it really takes to design useful, adaptive AI agents.
TRANSCRIPT
Demetrios [00:00:00]: At any given point in time, there will be a bottleneck or something that doesn't work. It's like, oh, we need an authentication solution. We need a better RAG solution. We need a better embedding store. We need a better model. Yeah, I don't think there's a simple vector where I say, well, in that space, we'll never buy again or never build.
Dmitri [00:00:17]: How do you make sure that your agent has this idea: okay, this is ambiguous, I actually don't know what to do, so I need to go back and forth with a person to actually understand what the problem is? I prefer to have clear borders between things. Reason, act, observe; plan, act. It hardly ever happens like this.
Demetrios [00:00:46]: Today we're talking with Paul and Dmitri about building effective AI agents and the design principles that go into that. Paul is the VP of AI at Prosus and Dmitri is a Senior Director of Data Science. I myself am Demetrios, host of the MLOps Community podcast that you are listening to. Let's get into the conversation.
Demetrios [00:01:13]: For all this talk of AI workforce, I still haven't seen enough agents helping me with my work.
Paul [00:01:18]: Dude, your email has got to be basically your whole. When you opened your laptop, the notifications that you had were overwhelming me just in the first five seconds.
Demetrios [00:01:30]: Yeah. But I mean, it would be helpful if they actually start to take away some work. So far they're helping a lot, creating.
Paul [00:01:36]: More work, but it goes to that idea of the cognitive load and then how you can get this overload because information is so cheap now.
Demetrios [00:01:45]: Yeah.
Paul [00:01:45]: And how it can be so noisy. I continuously reference your idea: we plugged an agent into GitHub and we turned it off about six hours later because it was just constantly pinging and pinging and pinging.
Demetrios [00:01:59]: Super verbose, like giving all sorts of commentary that just then requires you to read that, process it, judge it and act on it or not.
Paul [00:02:07]: I call it like the agents or the LLMs. And AI in general is disrespectful of my time. It doesn't recognize that you have a.
Demetrios [00:02:17]: Finite amount of time.
Dmitri [00:02:18]: Yeah.
Paul [00:02:19]: And so it will be verbose and it will give you all of this information or it will ping you and just kind of like about nothing important. And so that discernment. I'm not sure if you figured out a way to get through that filter.
Demetrios [00:02:35]: Oh yeah. So I think there's a lot of interesting lessons from building agents for e-commerce, where each of the initial experiments we were doing, we were just building, let's say, a chatbot for shopping and ordering food, for buying a car, for looking through real estate. And you get people coming in and they say, you know, they're looking, let's say, on OLX, right, which is our secondhand marketplace. And they type in, you know, iPhone 15. Their start of the conversation is "iPhone 15," as if it's a search engine. Enter. And then back comes, like, 10,000 tokens of, yeah, you know, we've got this one over here. Expensive search.
Demetrios [00:03:12]: And then you're. Yeah, instead of just scrolling. And you realize that that's suboptimal. So how do we deal with that? To answer your question, we're starting to go into much more of an intuitive UI with GenAI, which is: you give people some information, of course, which tends to be more visual, much more structured. But at the end you have buttons. Listen, you now need to choose. Do you want more options? Different options? Cheaper options? Options near me?
Demetrios [00:03:37]: And so you kind of pre select what the next thing is the user wants to do.
Paul [00:03:42]: Yes.
Demetrios [00:03:42]: And you make it a clickable thing, as opposed to putting that sort of high cognitive load burden on them of actually having to describe all the other things that they want.
Paul [00:03:51]: So anyway, this feels like one of the principles in building out agents. We want to talk today all about the different principles that you need to think about on a technical level with Dmitri. But here, just from your side of the fence, how do you enable the team to go out and make sure that they are fulfilling these different principles? When I think about this, I reference the last talk that we had with Floris and how you all are doing hackathons all the time and you're trying to really push the boundary of what are some wild ideas of where we can throw in agents. Through that, I imagine you learn a lot.
Demetrios [00:04:36]: Yeah, there's a lot of different things here. The big open question. Yeah, we do a lot of different whatever-thons. Right. Vibeathons, codathons, hackathons.
Paul [00:04:44]: Oh, Vibeathon.
Demetrios [00:04:45]: We did do a Vibeathon recently where.
Paul [00:04:47]: That sounds different than what I thought it was going to be for a minute.
Demetrios [00:04:52]: And so the goal here, the role of our team, is to basically figure out where we can move AI into production, agents into production, into real products, into the real teams. Where is it ready? And the only way to do that, because the field moves so fast and the tools evolve so quickly, is to continuously, you know, roll up your sleeves and try them. In an AI team, you may assume that we all know and see and use all the tools every day, and we try to, but if you don't actively spend time on exploring that, you also fall behind. Right? We tried Devin a year ago, we tried, you know, Manus the moment it came out six months ago, we tried DeepSeek, all these things. And if you try them again today, they're significantly better. And many other tools that we tried didn't get significantly better yet. Right. So having this sort of continuous drive to experiment, to hack around and see if they work, is something we have very much in the DNA of the team, to make sure that we also understand what's real and can go into production. And at the end, our goal in the team is to make sure we build real things that are useful and aren't just a demo. To have the distinction of what's ready and what's not, you need to get your hands dirty continuously and try it all the time.
Paul [00:06:14]: I want to ask you how you think through forward compatibility. Because, like you said, you test tools at a certain point in time, and those tools are how they are, but the teams continue working on them and the technology continues to get better. That's great if you come back to it and keep revisiting it, but that seems like a lot of work. There are also things that, in my head, are continuously getting better, and you can try to make them better and brute-force this capability from the agents or the LLMs, or you can wait and maybe it'll get better in six months. So thinking through forward compatibility, and how that looks in a future that's moving so fast: six months down the line, is there going to be a longer context window so you don't need to think so much about trying to hack together a solution? Or is it something that you need to invest time into building yourself?
Demetrios [00:07:21]: Yeah, it's a great question. So at any given point in time when we build agentic systems, AI systems more broadly, there will be a bottleneck or something that doesn't work. It's like, oh, we need an authentication solution, we need a better RAG solution, we need a better embedding store, we need a better model. Right. Maybe the model doesn't perform at the same level, and you can think, well, to solve the model problem, in theory I could just go and train my own model. But I'll tell you, the fastest-depreciating high-value asset I know is LLMs. Right? Companies spend hundreds of millions of dollars training these things, and then three months later there's a new one and they switch.
Demetrios [00:08:04]: Right. Like it's a commodity in many ways. Whatever. There's a new one, everyone forgets about the old one. Right. And all the traffic goes there. So you don't want to end up in a situation. We're a small team, even though we spend a lot of time on building the right things.
Demetrios [00:08:18]: Like, you know, we will train models, but only for areas where we are very confident that nobody's going to solve that for us, because it's a specific domain and specific area, or we need it to be efficient, so it needs to be fast, and so on. So, forward compatibility here. Again, going through the list of problems: it could be, I don't know, maybe the model we use today is too expensive. Should we therefore not work with that model, or should we just assume that in six months it's going to be, you know, 50% cheaper? We actually operate under a couple of principles where we believe that technology is going to move in that direction no matter what. One is, models will get cheaper at a certain rate, and we make some assumptions on that. Second is, model context will continue to get bigger.
Paul [00:09:08]: We've seen that happening. That's so true.
Demetrios [00:09:10]: And so you don't want to continuously optimize too much on the RAG side, and chunking and so on, if you know that eventually you'll have near-infinite context. On certain modalities, whether it's voice or video or imagery, we have some expectations that they will get better over time. So you can already start to build solutions that today are 80%, but soon they'll be 99%. With that intuition of where things will move, we're fairly confident to start building already, and sort of bank on: in six months it'll be cheaper, it'll be better, and so on. But there are some areas where we actually need to build a solution ourselves, like software to authenticate as part of a RAG pipeline, to give you an example. Right. We need to retrieve documents that are inside the enterprise, inside the company, and we need to make sure that only certain types of agents can access them and use them for answering a certain question.
Demetrios [00:10:08]: We had that problem two and a half years ago, and we're pretty sure everybody who's building these things has that problem. Right? Because you need to also give agents the right permissions and so on. So then the question is, do you build for that yourself?
Paul [00:10:22]: So that is kind of the framework you're working under. And that's why you continuously revisit the different tools and see: did they get there yet? All right, we've got our eye on this set of providers, like the Composios or the Arcades or whatever it may be. You know, you need auth. And so you're keeping an eye on the space, you're testing it, then you're waiting, then you're testing it again and waiting, and you're seeing: is it advancing at the rate we're looking for?
Demetrios [00:10:53]: Absolutely. And there's just so many of them. And the problems you're trying to solve are real: observability, RAG, authentication, you know, logging, evals. Those are all real problems, and they all try to solve those real problems, but many of them aren't there yet. So we try them, whether it's a LangSmith, an Arcade, a Composio. And sometimes, actually most of the time, unfortunately, we've concluded that once we move to production, the tools are not there yet. But again, we're a small team, so I very actively encourage everybody in the team: if there's somebody out there that solves this general problem, let's use that tool as opposed to building it ourselves. Unfortunately, often the tools still aren't ready. But we're starting to see exceptions, with Arcade a good example, which we're now adopting because it's good enough for what we want to build.
Demetrios [00:11:49]: And there's several others now in the tool suite that we'll talk about. But there's another thing actually in our favor: it's much cheaper and much easier to build software now, because we've got all these tools. So in some cases we're actually faster and better off just building our own customized solution, with, of course, the Devins, the Cursors, the Windsurfs of the world, because it's much quicker than going out there to evaluate five vendors, test their solutions, which are never perfect for your situation, get a legal agreement, do price negotiations, and so on. It isn't only an economic question of what's cheaper or what's faster. Right. Sometimes we just need to move fast, and so we'll just put something together, and for some parts of the software that's fine to build our own. What we see is that it's much easier and cheaper these days to build your own than it used to be two, three years ago, where you just didn't have time to build, whatever. Right.
Demetrios [00:12:50]: Your own authentication solution or content management system, whatever. Right.
Paul [00:12:55]: So are there vectors that you feel like have been completely demolished, that you never want to build or you never want to buy anything in that space because it's so easy to build and that time to production is so fast?
Demetrios [00:13:13]: Well, it's never "always" or "never" or whatever. I think, you know, as engineers, we always have the natural propensity to build, to build it yourself. Right. It's fun. We prefer to do that, and so we understand it. So that's kind of where it's been. And then of course, as a team, you need to make sure that you choose wisely.
Demetrios [00:13:37]: Build or buy: building is becoming much easier and cheaper, so that's one reason why, for small things, we will build ourselves more frequently; and buy hasn't been ready for many of the cases where it concerns agents in production. But the space of options you can buy will of course be much richer two years from now. Right. You have all these startups that will mature, their solutions will mature, communities, open source projects, standards like MCP. We basically built a solution like MCP a year and a half ago; it was called the Tools Controller, and now we have MCP. And we're grateful, because now, you know, that makes all of our work much easier. Right.
Demetrios [00:14:17]: Because there's an agreed standard for this. So I think there will be some areas where. Yeah, I don't think there's a simple vector where I say, well, in that space we'll never buy again or never build. It really will depend a little bit on what others are doing, on compatibility with the models, with the other parts of the software stack you're interacting with.
Paul [00:14:38]: You've mentioned a few times different things that you're encountering and then later tech comes out.
Demetrios [00:14:45]: Yep.
Paul [00:14:46]: Or an open source project, or a standard like MCP. Do you see the folks on the team who are building tools specifically becoming more adept at the tool-building process? Or is it more in the vein of: I just want to have the agent, and I'm going to build this agent that goes and does something for me?
Demetrios [00:15:17]: So right now we are definitely thinking about most of the things we build as, let's say, a set of building blocks that the end user can quickly stitch together to solve their need. And I'll give you a specific example. Our internal AI assistant, Toqan, has been around for a while, as we talked about in other episodes; it's answered millions of questions for, you know, tens of thousands of employees around the world. And now people have lots of use cases as part of their daily workflows. Hey, I get this Excel sheet, I need to upload it and write a memo around it, like my monthly report or something. Or I've got some customer calls coming in, I need to transcribe them to figure out, you know, what's being said. So there are examples where people are starting to integrate this as daily updates, daily reminders of things that they're doing.
Demetrios [00:16:14]: Give me a plan, give me an agenda for the day, generate an ICS file based on these things I want to do that I can upload to my calendar, so I've got all my invites there.
Paul [00:16:23]: That's cool.
Demetrios [00:16:25]: But you will be doing something for you, and the next person in the team will do something else. We are now, with Toqan, building spaces where people basically have the ability to create their own tools, integrate into their Salesforce environment, their Monday.com environment, their Databricks environment, and have this general agentic capability that they can then point at those different tools to solve a certain problem. And many of those people are not, let's say, native AI practitioners, right? So we need to make it really easy for them. Think of it like GPTs, but, you know, customized for us, for Prosus, for commerce, for our world. Right. And that means that we have a certain set of systems that we integrate with, like I mentioned Databricks, but also other cloud environments with data in them, or financial systems like SAP, and some of our standard tools that we make available. So anybody in finance can connect their part of the database, or anybody in HR or the legal teams or the data engineers and so on, to create their own agentic system. So then it's about offering people a general agentic capability with a set of tools and giving them a no-code way to describe what this agent, this space, this workspace is going to do.
Demetrios [00:17:44]: And this Toqan workspace is going to be meant as an HR support bot, right? Then this HR employee basically points to what data that HR support bot needs access to and describes how it should behave. They describe that, click, and now it's published to everybody in Prosus, and people can ask it questions about paternity leave, about performance review processes, and so on, created by somebody who didn't have to program or understand any of the underlying agentic systems to build it.
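The HR bot described above boils down to a small declarative definition: a name, plain-language instructions, and the data sources the agent may read. This is a sketch of the idea with invented field names, not the product's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class AgentWorkspace:
    """Hypothetical no-code workspace: the creator points the agent at data
    and describes its behavior in plain language, with no programming."""
    name: str
    instructions: str                                 # plain-language behavior
    data_sources: list = field(default_factory=list)  # systems it may read
    published: bool = False

    def publish(self):
        # Publishing makes the workspace available to everyone in the company.
        self.published = True
        return self

hr_bot = AgentWorkspace(
    name="HR Support Bot",
    instructions="Answer questions about paternity leave and performance reviews.",
    data_sources=["hr-policies", "benefits-handbook"],
).publish()
```

The point of the design is that the only free-form input is the instructions string; everything else is a pick-list of pre-integrated systems, which is what keeps it usable for non-engineers.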
Paul [00:18:16]: Do you feel like, as you use agents more and more, it becomes clear that this is the way forward: these tools are going to be built with a no-code solution, but with engineers on the back end making it as easy as possible for the subject matter experts to put their magic touch on it and do their stuff?
Demetrios [00:18:44]: That's exactly right. So we actually have this theme across Prosus, AI workforce, and Sean will talk about that, where we want to make sure that everybody's got access to their own AI workforce. Consider that everybody should be able to have a team of junior analysts, some interns that are agents, that they can create and have work alongside them. And so we have very specific initiatives and objectives for the coming period to actually measure the number of agents created across the company, how often they're being used, how many questions they're being asked, to actually push the adoption of these basic capabilities for anybody in the organization. It doesn't need to be software engineers; it actually should be anybody. And we do that through Toqan and many other tools, of course, as well.
Paul [00:19:35]: Okay, so going to these nine commandments that Dmitri is going to talk to us about, I wanted to get specific on one that potentially isn't understood the same way by every person, which is the memory piece. I know there are almost two ways to think about memory. One is: I know that you like buying a certain type of shoe, or I know that you like these movies, the traditional kind of recommender system. I remember that you do these things, and so I'm going to know that for the next interaction with you. And then there's another memory, which is the agents remember how to do things, and so they have this tool in their tool belt. Like you were saying, when you're connecting the different systems together, it would be great if an agent learns how to do something within a system and then always has that in its memory, so that it's not just guessing when it tries to accomplish that same task.
Demetrios [00:20:39]: Again, this is such an interesting topic. I think the memory piece is an obvious area that we need to build. But once you start to think about the types of memory that these agents need to have to accomplish certain tasks, it becomes much more granular and, in a way, nuanced and complicated. What do you need to build? Short-term memory, long-term memory, you know, character, right. How do they behave over time, and so on. So there's lots of things to be said.
Demetrios [00:21:11]: Let me focus on the piece of agentic memory which is related to getting better at a certain task, which I think is where you were going, because there's all sorts of other memory, like memory about me because you talked to me 25 times, and so on. But let me focus on getting things done for you better and faster, whether it's helping me with a PR in the software world with an agentic system like Devin or Manus, or the e-commerce agents that we're building with the large commerce model and so on to help you find what you need. The overarching thing that we're seeing here is that these agents are now starting to learn from their experience with the world. And what does that mean? Say I go in and use an agent in a commerce setting, like we do at iFood and OLX and Takealot and eMAG and the companies of the Prosus group. Let's say I come in with a search query and the agent is going to have a conversation with me about, I don't know, an apartment I want to rent in Poznan in Poland. It will help you find an apartment: you'll say, well, I have this kind of requirement, it needs to be 60 square meters, this is my budget, it needs to be in this area, and you're refining that. And eventually the agent will know when this person contacted the renter, and that's a successful event. Now we've helped them find something they thought was interesting, and in some cases we also know when they actually made the agreement to move into that apartment. That journey is now something we can say: okay, that was a helpful agentic journey.
Demetrios [00:22:51]: There are also other cases where the person doesn't find it, or it doesn't succeed, or we ask the agent to browse the web but it didn't find the information it needed, and so on. That's also useful, because that's a negative example; in this case it wasn't helpful for the intent of the user. Those experiences, if you store them, you can actually use to train your model to do better reasoning, because all of these are reasoning agents now. They have a model where the first thing it does is reason. Like DeepSeek, we've all seen it, and now o1, o3, o4 and so on are first making a plan and thinking through how they should approach this problem. Now that's generic reasoning. It's great. It's generic reasoning for answering a certain question.
Demetrios [00:23:35]: Any question on ChatGPT, Claude, Gemini, and so on. What we're trying to figure out is what it looks like if you use the experience of the agents, storing when it was successful or not successful, and use those experiences to fine-tune the reasoning. So now the reasoning becomes really good at helping you build a certain type of software in your coding environment, because you always care about having these standards. Or it becomes really good at taking a food delivery search query and helping the user find what they actually wanted to order.
Paul [00:24:11]: And when you say fine-tune, you mean actually renting GPUs and fine-tuning, not just throwing it into the models' context?
Demetrios [00:24:21]: It's not even only fine-tuning. There are all the training variants you can imagine that we're working with now to basically create models that understand what the best path to success looks like. And the reason we're able to do that is because we can actually gather data very quickly on what works or not, because these are real experiences, real products in the world. That allows us to create that flywheel: collect experiences, got to the objective or didn't get to the objective. And it's sort of reinforcement learning, not in the traditional technical reinforcement learning sense, but you do store the successful path and that becomes input in the next training round.
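The flywheel described here, collect experiences, label them by outcome, and feed the successful paths into the next training round, can be sketched in a few lines. The trajectory fields and example queries are invented for illustration.

```python
# Minimal sketch of the experience flywheel: log each agent trajectory with
# its outcome, then split the log into positive and negative examples for
# the next training round.

def build_training_set(trajectories):
    """Each trajectory: {"query": str, "steps": [...], "reached_objective": bool}."""
    positives = [t for t in trajectories if t["reached_objective"]]
    negatives = [t for t in trajectories if not t["reached_objective"]]
    return positives, negatives

log = [
    {"query": "rent a flat in Poznan", "steps": ["filter 60 m2", "contact seller"],
     "reached_objective": True},
    {"query": "cheapest iPhone 15", "steps": ["search", "stuck on captcha"],
     "reached_objective": False},
]
good, bad = build_training_set(log)  # successful paths feed the next round
```

Whether the positives then go into supervised fine-tuning, a preference-style objective, or just a retrieval store is a separate choice; the logging and labeling step is the same in all cases.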
Paul [00:25:05]: And you're saying that creates more memory, like inherent memory in the model now.
Demetrios [00:25:13]: Right. And so it is a form of memory, because you're storing all the successful paths to a destination. Right. Or a successful action. Because we have hundreds of millions of interactions with users, those paths are stored; that's memory. Right. I actually learned how to help somebody find their shoe. Right.
Demetrios [00:25:32]: I actually learned what is the best way to help advise somebody on finding an apartment to rent in Poznan.
Paul [00:25:38]: Sorry, I thought you meant find your shoe like you lost it.
Demetrios [00:25:42]: Oh, no, yeah, sorry. Like buying a shoe. Right? So finding the shoe that you want for your next race.
Paul [00:25:47]: I lost my shoe and I needed the model to find it. Help me. Yeah, all right.
Demetrios [00:25:51]: Or if somebody says, hey, I want something healthy and quick for lunch at the office. That's a search query that, surprisingly, we're very bad at answering today, right?
Paul [00:26:04]: What's healthy for me? What is quick?
Demetrios [00:26:07]: Yeah. And does it matter if you're at the office? Does it matter if it's you versus me? All these different things where being very deliberate about what memory you're creating and giving the agent access to can make a big difference.
Paul [00:26:25]: I hadn't heard of or thought about memory that way: you're taking successful journeys, you're also taking the unsuccessful journeys, and then you're spending cycles on fine-tuning so that the agent has that as part of it. I had always looked at it as something that you do after the reasoning model is there, or after you've set up your agent, and it's just memory that you're bolting on top of it, with caching, for example. You can do that.
Demetrios [00:26:58]: So you can definitely do that. One area where we've done similar things is, you know, we've been playing around with web-browsing agents, like many others. So you give it a task: go and, I don't know, find me the cheapest iPhone 15 seller today, or help me order salad for lunch, let's say, in the food-ordering space. And then this agent starts to browse. It goes to the various websites. And what we saw initially was, wow, we tried this 50 times and it only succeeds 10 times, and maybe sometimes five times, because it got stuck on a captcha, it didn't scroll down, it didn't really find the items because the search term it used wasn't good. But what's cool is that you give it a simple task: go and find me a cheap, healthy lunch to order to my home.
Demetrios [00:27:54]: Then it goes to all the food delivery providers, it looks at the restaurants that are open, it browses the web, right, like you or I would. But that search space is actually pretty broad. Right. It will take all these steps and these paths, if you will, and out of the 50 tries, let's say five are what you would have liked, or what you would have done yourself. Right? And we then store those five, so that next time somebody comes in and searches for food, in a general sense, it will access those five paths as a reference, a successful way to get there. There will be slightly different things, because maybe I was ordering in a different area and you're ordering in a different country, or you gave a different preference for whatever you wanted.
Paul [00:28:37]: Or the website got updated.
Demetrios [00:28:39]: Well, the website got updated then. Yeah, exactly.
Paul [00:28:41]: Then you're screwed.
Demetrios [00:28:42]: Well, then you're going to find five more ways not to do it, not to succeed. And that can be accessed perfectly well in a cached way; it doesn't need to be trained into the model. We also try those things. But I think the important piece is: once you start to be deliberate about storing what's successful and what isn't, the path to get to a successful outcome, you can then feed that into the model either as a cache, or through RAG, or eventually, if it's large enough, like in our case, we actually also train models on it as well.
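Storing successful paths deliberately, as described above, can be as simple as a cache keyed by task. This is a naive sketch of the idea, not any production system; in practice the lookup would be fuzzy (similar queries, not exact string matches) and paths would be invalidated when a site changes.

```python
# Sketch of a path cache: keep only trajectories that reached the goal,
# keyed by task, so later runs can consult them as reference plans
# instead of re-exploring the whole search space.

class PathCache:
    def __init__(self):
        self._paths = {}  # task -> list of successful step sequences

    def record(self, task, steps, succeeded):
        # Only successful paths are worth replaying later.
        if succeeded:
            self._paths.setdefault(task, []).append(steps)

    def lookup(self, task):
        # Prior successful step sequences for this task, if any.
        return self._paths.get(task, [])

cache = PathCache()
for attempt in range(50):
    # Pretend only 5 of the 50 tries reached the goal, as in the anecdote.
    cache.record("order healthy lunch",
                 ["open app", "search salad", "checkout"],
                 succeeded=(attempt % 10 == 0))

references = cache.lookup("order healthy lunch")  # the stored successful paths
```

The same store can back either a prompt-time cache (inject the reference paths into context) or, once it is large enough, a training set, which matches the two options mentioned in the conversation.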
Paul [00:29:20]: It really goes back to what I was talking about with Yonis yesterday, and he said, oh my God, my take is that evals are your moat. And so the better you can get with your evals, the better you can expect your agents to perform.
Demetrios [00:29:39]: It's a really important realization. If you don't know whether what your model or your agent did was good, it's very hard to improve. That's simple, right? I think everybody understands that notion. But then what does that mean in the real world? It means you need to actually get feedback from the users, or know what they were trying to achieve, to know whether they were successful or not. And because we've got basically 2 billion consumers that we serve across various parts of the world, interacting with the platforms in different ways, we do have ways to know whether they got to their outcome. Did they find what they were looking for? Did they buy something? Did they contact a seller, if they're on the secondhand marketplace? And so on. That is an evaluation. Right.
Demetrios [00:30:29]: Evaluation is the technical term we would use as AI developers. Basically, was it good or not? Was it successful or not? And that flywheel is indeed a moat, because it means you can feed it back into your agentic systems, and they can use that to get better the next time a similar question comes in.
Paul [00:30:46]: And I like how you're saying there are many different ways, once you know if it was successful or not, which is step one.
Demetrios [00:30:52]: Yeah.
Paul [00:30:53]: Then you can figure out how to incorporate it into the technical side of the system, by caching it, or updating your RAG system, or fine-tuning. I also like that you are hyper-focused on certain tasks or certain verticals. For example, in e-commerce, if you're just trying to figure out the path through one of these food delivery systems when someone is using it, you know there's only a select number of things someone is trying to do when they get onto the app, and it's probably along the lines of ordering food.
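The outcome signals mentioned above (found the item, bought something, contacted a seller) can be turned into a simple per-session eval, which is step one before any caching, RAG, or fine-tuning. The event names here are invented for illustration.

```python
# Hedged sketch: score sessions as successful if any logged event
# indicates the user reached their goal.

SUCCESS_EVENTS = {"purchase", "contact_seller", "saved_listing"}

def session_succeeded(events):
    """A session counts as successful if any success event was logged."""
    return any(e in SUCCESS_EVENTS for e in events)

def success_rate(sessions):
    """Fraction of sessions that reached a successful outcome."""
    if not sessions:
        return 0.0
    return sum(session_succeeded(s) for s in sessions) / len(sessions)
```

Which events count as "success" is a product decision per vertical (contacting a seller on a secondhand marketplace, completing checkout in food delivery), which is exactly why the vertical focus makes the eval tractable.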
Demetrios [00:31:36]: Yeah. Or ordering travel, and so on. And we found that especially the generic agents, the broad ones, are very bad at some of these specific things. Like, if I actually wanted some help booking a flight, the agent gets stuck. Things that we think are very easy, these agents fail at. An example: scrolling is not something they automatically do or understand or think about. If I go to select my departure and arrival airports, or look at the dates, typically every travel website has a little pop-up dropdown you scroll through.
Demetrios [00:32:14]: You pick out the dates from whatever calendar you need. The agents suck at that. And so we were trying to help users do those tasks. And sure, you could perform at 50% on the OSWorld benchmark, or whatever other web browsing benchmark these agents are being benchmarked against. That actually doesn't translate very well to the use cases we are helping our users with. And so then storing this memory, right? And saying, okay, actually this is the way that you help the user find the flights from A to B, and you need to scroll, or you need to translate these dates into an action on a grid with a calendar, and so on. Those are the kinds of things that we need to build and be very specific about to get to a certain level of accuracy for the user.
Demetrios [00:33:01]: Because if it's not, in this case, 95% successful or useful, people won't use it. Right? If it gets stuck all the time.
Paul [00:33:11]: Chooses the wrong flight.
Demetrios [00:33:12]: Yeah, exactly. Or it doesn't accept cookies. If you're in Europe, you need to accept cookies. A lot of agents don't know that.
Paul [00:33:20]: They haven't been trained on enough European data.
Demetrios [00:33:23]: Yes.
Paul [00:33:24]: I want to bring in the experts now, though, so let's get to that conversation. We are here live with Dmitri in the studio. It's great to have you. I want to talk all about this unified rule set for building AI agents that you have put an enormous amount of thought behind. When I saw it, I thought, man, this is so good, we need to have a full conversation and podcast to break down some of these. Hopefully we can get to all of them.
Paul [00:33:53]: But I want to hit the most important ones to start. That being said, AI agents is a contested term. You have a special definition for it. Give it to me.
Dmitri [00:34:04]: Yeah, well, first of all, thanks for having me here. I think that for me, an AI agent is a solution that is able to achieve a task by selecting, by itself, the path towards a goal, and also by defining, itself, what the end is: where to stop. And I know that it's not complete, and you can have so many corner cases where you would point to a simple piece of software and say, well, it kind of does that. And I know that some people say, well, what would be good to add here is memory, so it can learn from the past, it can learn from its mistakes.
Dmitri [00:34:51]: And do better.
Dmitri [00:34:55]: Yeah, I think that it's a fair point. But again, there are so many cases where you say even without memory, you can build a successful agent.
Paul [00:35:03]: Yeah, it does give you that spectrum of: it's autonomous in a way. It also knows when to stop, it knows when to get more information. It's not a workflow that just gets kicked off, and it's not something that you have to hard-code and say do this, this, this, and this, with a little LLM sprinkled in, and then you call it an agent.
Dmitri [00:35:22]: Exactly. So not everything which uses an LLM and a flexible execution path is an agent; we need to be careful here. But also, having certain predefined rules doesn't mean something is not an agent.
Demetrios [00:35:40]: Yeah, yeah.
Paul [00:35:42]: There are a few fundamental principles when building agents and I think you, with this document, you put the most fundamental at the top. What is that?
Dmitri [00:35:55]: Well, for me the most important part is what the cycle of an agent looks like. When we started building agents, by the way, we didn't know it was called an agent yet. It was a couple of years ago, and we did this piece, I believe you talked to some of my colleagues about it, a data analyst. And we essentially implemented what is now called the ReAct cycle. So reason, act, where we say: most of the problems you need to address will be implemented by doing two separate parts. One part is what we call comprehension and reasoning, and another one is execution, or acting. This is why ReAct as a name is still a very popular concept.
Dmitri [00:36:47]: And depending on the task at hand, you will have a very different distribution of complexity between comprehend and execute, or between reason and act. So let me give you an example. If you think about a data-analyst type of agent, something which takes a problem and tries to extract information and provide you with an answer, the most complex part of this would be comprehension: how do we take what the user asks and bring it to the point where we know what data we actually need to retrieve, what report we need to write? It's very difficult. My favorite example, it's a real one: somebody goes to a data analyst and says, what is the fastest growing company? Fastest growing by profit, revenue, people, sales?
Dmitri [00:37:42]: Fastest.
Paul [00:37:43]: So open ended.
Dmitri [00:37:44]: Fastest growing last month, last year? What do you mean? Right. So how do you make sure that your agent has this idea: okay, this is ambiguous, I actually don't know what to do, so I need to go back and forth with a person to actually understand what the problem is. And then once you arrive at what the person actually means, say, the fastest growing company from our portfolio based on revenue in the past year, this is very easy to give to an LLM and say, okay, please translate this into SQL, for example, get this information, and then I'm gonna combine it and present it in the report.
Dmitri [00:38:30]: So for this type of task, comprehension is extremely long and difficult and complex, and sometimes almost impossible, and execution is somewhat simpler. You can also look at a simpler task. Let's say you give your tool a document and say, you know, extract all the information and tell me a couple of insights about A, B, and C. In this case, the comprehension of the task is very trivial. The execution, still not as complex as in the previous example, is bigger than the comprehension. So the distribution is different. And this is the first thing that we did, and I think ReAct was the first conceptual approach that we saw most companies implementing.
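The reason/act split Dmitri describes is usually implemented as a simple loop. A minimal, illustrative sketch, where `toy_policy` stands in for the LLM's reasoning step and the tool names and query are hypothetical, not from the conversation:

```python
# Minimal ReAct-style loop: reason (decide), act (call a tool),
# observe (record the result), repeat until the policy says "finish".

def react(task: str, tools: dict, decide, max_steps: int = 8):
    scratchpad = []  # short-term record of (action, observation) pairs
    for _ in range(max_steps):
        action, arg = decide(task, scratchpad)    # "reason" step
        if action == "finish":
            return arg, scratchpad
        observation = tools[action](arg)          # "act" step
        scratchpad.append((action, observation))  # "observe" step
    return "stopped: step budget exhausted", scratchpad

# Toy policy imitating the data-analyst example: run one query, then answer.
def toy_policy(task, scratchpad):
    if not scratchpad:
        return "sql", "SELECT name FROM companies ORDER BY growth DESC LIMIT 1"
    return "finish", f"Fastest grower: {scratchpad[-1][1]}"

tools = {"sql": lambda q: "Acme Corp"}  # stand-in query executor
answer, trace = react("fastest growing company by revenue", tools, toy_policy)
```

The key property is that the model, not the developer, picks the next action at each iteration; the developer only supplies the tools and the step budget.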
Dmitri [00:39:23]: When we started talking about agents, we started having more discussion: okay, but what else is there? So what I like to add is observe, think, act, and then reflect. It more or less takes those initial parts, comprehension and reasoning, and splits them in two. One is observe, and this is understanding the user and circumstance. Let me give you an example. We did a small demonstration where we created an agent that can order food for you. So you sit in the office, you basically click a button and say, hey, I'm in the office, I'm late, I want my usual.
Dmitri [00:40:21]: And that's it. And 20, 40 minutes later there is a courier at the door, and basically you've got your usual stuff. So a significant part of the initial agent work is actually just understanding the user's circumstance. So saying, hey, this is Demetrios, he's in the Prosus office, it's late. It immediately shrinks the space where the solution can be found, right? And then you come and say "usual," and it basically just solves it. It knows what to do, it hands it over to the execution part, and after that it's easy. So hopefully that illustrates the observe part a bit.
Demetrios [00:41:10]: Yep.
Dmitri [00:41:12]: The reflect part is something that is not usually implemented, but at the moment you see it in very specific scenarios. The best example is a coding agent: you've got an assignment, it executes, writes code, and then it reflects on the code, whether by trying to execute it, or trying to write and execute unit tests, or whatever. It sees the result of execution and says, okay, I'm not doing well on one or two or three metrics, so I need to repeat the loop. And this would be a part which is difficult to do without reflect, because then you're more likely to create solutions with errors, or suboptimal solutions.
Paul [00:42:06]: So the observe part is really this filter of: do I have enough information? Is the ask clear enough? If it can pass through that filter, then you can reflect and say, can we set something up so that if I do not have some piece of information, I go back and ask for it? So it's almost like, getting through the observe part is one filter, and the reflect part is almost another filter on top of it.
Dmitri [00:42:39]: The reflect part is at the end, once you execute something: what did happen during this execution? The observe part starts at the beginning, so we understand what circumstance we are in. And then the think part is: do we understand what needs to be done and how it shall be done? The think part is more often connected to the discussion about static planning versus dynamic planning; we can touch on that later. One of the things that I think people find confusing: we always prefer to have these clear borders between things. So we say reason, act; observe, plan, act; observe, think. It hardly ever happens like this.
Dmitri [00:43:30]: It's not so bad, because quite often when you think about what you need to do, or you think about the context of the request, you actually also need to do something. Allow me to again illustrate with the example I mentioned before, the data analyst. What helps if there is ambiguity in the request? The analyst can go read some documentation about the company, maybe go to the database and look at what data is available, so it doesn't come back to you with a range of random questions trying to decrease ambiguity. It comes back to you with something more precise: okay, but do you mean this or this? This is defined like this, and so on. And then it drives your communication with the agent, so it creates these loops.
Paul [00:44:28]: And sometimes the loops can be very small, other times they can be much bigger. But the key is that you have various loops happening throughout your agent journey.
Dmitri [00:44:38]: Yeah, absolutely. What I put there as a cycle is not prescriptive in the sense that you always have to do it. There are some use cases where you say, okay, reflecting is not critical, so nothing will break without it. In some other cases you say, without reflect you actually never get to top performance. In some cases you say, you know, observe is more about personalization and whatever. Like, if you've got a very simple task, take this recording, transcribe it and translate it, how much context do you need to understand? Probably zero. You just need to think: okay, I need to call the transcription tool, I need to call the translation tool if it's separate from the LLM.
Dmitri [00:45:25]: And then I need to basically call something else, and I'm done. Right? So observe and reflect are these additional things that, depending on circumstance, could be make-or-break, but the core elements are still think and act.
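The observe → think → act → reflect cycle Dmitri lays out can be sketched as pluggable phases, where observe and reflect are optional, matching his point that simple tasks need neither while coding agents lean heavily on reflect. All names and the toy task here are illustrative:

```python
# The cycle as pluggable phases. `observe` and `reflect` are optional;
# when reflect is supplied, its critique is fed back into the context
# so the next think/act iteration can correct course.

def run_cycle(task, think, act, observe=None, reflect=None, max_loops=5):
    context = observe(task) if observe else {}   # user + circumstance
    result = None
    for _ in range(max_loops):
        plan = think(task, context, result)      # what/how to do it
        result = act(plan)                       # execute
        if reflect is None:
            return result
        ok, feedback = reflect(task, result)     # did we do well?
        if ok:
            return result
        context["feedback"] = feedback           # loop with the critique
    return result

# Toy run: "write" a string, reflecting until it contains a required word.
result = run_cycle(
    task="greet",
    think=lambda t, c, r: t + (" " + c.get("feedback", "")).rstrip(),
    act=lambda plan: plan.replace("greet", "hello"),
    reflect=lambda t, r: ("world" in r, "add world"),
)
```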
Paul [00:45:42]: Are you looking at that as the agent will figure it out or you are trying to understand the use case?
Dmitri [00:45:49]: At the moment, I think, based again on our experience, it still sits a bit with the developer. So the people who build agents say, okay, this is what we need. But of course, the more sophisticated your agents get, the more you need this freedom. Indeed.
Paul [00:46:13]: Now there's another design principle that you have, which is extracting all capabilities as tools. And I liked this because it's not just other systems or other tools that you can use, like a Gmail tool, etc. Even humans can be a tool that you use, if I am understanding it correctly.
Dmitri [00:46:37]: Think that you are.
Paul [00:46:41]: More.
Dmitri [00:46:43]: Adventurous, let's put it this way, than me. I did not think about humans as tools, to be honest. I was thinking about other agents as tools. This was a bit of limit of my risk taken. But now you're absolutely right. So I think that basically if you look at Agent, what we again discussed at the beginning, it's important that Agent has various tools available because it will need flexibility to decide how to execute a task. And I think that in the simplest way, everything that it can call, everything it can use, whether mechanical or human, it's a different story, could be and should be abstracted away as a tool with some simple interface for input output. We recently had some ideas from various companies of how it may look like.
Dmitri [00:47:39]: But essentially, if I call a specific tool, it's called via API. If I call another agent, by the way, that could also be done as a call to a tool. If I ask a human for help, in a way, you can abstract it such that it basically looks like just calling a tool, and then you've got a lot of freedom, I think.
Paul [00:48:07]: Yeah, I was thinking humans just because it's almost like if you think about human in the loop in a way that is kind of a tool because you're getting the okay approval which can be a tool.
Dmitri [00:48:19]: I absolutely agree with you. So I see both sides of our argument. You're basically saying, look, sometimes when we give a request to an agent, we already, in a way, provide ourselves as a tool. And I can relate a very simple example. Sometimes I'm using one of these agents to help me create documentation, and what I put in the instructions for my request is: if you need more information, ask me. So this is explicitly saying, okay, use this interface to send the question to me, and I'll respond to you. And in a way, from the agent's perspective, there is no difference between sending me text and getting text from me, or sending text to some API and getting it from there. Maybe another similar example.
Dmitri [00:49:13]: If you use an Operator-like system, quite often they specifically say: look, you give it an assignment, it will execute, but if it gets stuck, it will ask you. Exactly. And it could be at a specified point, for example when you need to provide a payment, or it could be at a point where it says, you know what, I don't know what to do, so it's like error resolution. And in this case, again, I absolutely agree with you: it uses you as a tool.
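The abstraction Dmitri and Paul converge on, where an API, another agent, or a human all sit behind the same text-in/text-out interface, can be sketched like this. The `Tool` shape and all names are illustrative, not a real product's API:

```python
# Everything callable behind one interface: the agent core only ever
# sees `Tool.call(text) -> text`, whether the other end is an API,
# another agent, or a person.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Tool:
    name: str
    description: str              # what the planner reads when choosing
    call: Callable[[str], str]

def human_tool(prompt_fn=input) -> Tool:
    # A human-in-the-loop "tool": from the agent's perspective, sending
    # text to a person and getting text back looks like any other call.
    return Tool("ask_human",
                "Ask the user a clarifying question.",
                lambda q: prompt_fn(q + " "))

registry = {
    "search": Tool("search", "Web search.", lambda q: f"results for {q}"),
    # In tests we swap input() for a canned reply; in production this
    # would block on a real user response.
    "ask_human": human_tool(prompt_fn=lambda q: "revenue, past year"),
}
clarified = registry["ask_human"].call("Fastest growing by what metric?")
```

The design choice is that the planner never needs a special "ask the user" code path; ambiguity resolution is just another tool selection.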
Paul [00:49:41]: That's it. So the greater theme here is: how can you abstract as much as possible away, so that everything is a tool call? Tell me about the sandbox idea for the code.
Dmitri [00:49:57]: Yeah, so I think it's also very important. What we saw over and over again, and it's not only us, you see the same idea when you are using other products like Anthropic's or OpenAI's, is that you've got tools, but sometimes you've got tasks that cannot be executed by a single specific tool. And this is where you want the flexibility to just write code, and you see it happening: you basically have a code executor with limited capabilities. If you want to create graphs, if you want to run data analysis, if you want to ingest data, many different options. And I think this is critical, because it addresses a gap: there are so many small tasks that, first of all, you cannot foresee and build into a standalone tool, and second of all, you actually don't want to overload your agent with a list of, let's say, 1,000 tools it can select from.
Dmitri [00:51:14]: It's much better to say, okay, you've got these core tools that you often need, and then whatever other, say, 5% of scenarios you can cover by writing.
Paul [00:51:27]: Code, you build it yourself.
Dmitri [00:51:30]: The code? Yes, you build it yourself, essentially.
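The "code executor with limited capabilities" can be approximated very thinly as a subprocess with a time limit and a stripped environment. This is only a sketch of the interface such a tool might expose; a production sandbox would add containers or VMs plus filesystem and network policy:

```python
# A very thin "sandbox": run model-written code in a subprocess with a
# time limit and no inherited environment. Illustrative only; real
# sandboxing needs much stronger isolation than this.
import os
import subprocess
import sys
import tempfile

def run_in_sandbox(code: str, timeout_s: float = 5.0) -> tuple[bool, str]:
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        proc = subprocess.run(
            [sys.executable, "-I", path],  # -I: isolated mode, no user site
            capture_output=True, text=True,
            timeout=timeout_s, env={},     # empty environment
        )
        return proc.returncode == 0, proc.stdout + proc.stderr
    except subprocess.TimeoutExpired:
        return False, "timed out"
    finally:
        os.unlink(path)

ok, out = run_in_sandbox("print(2 + 2)")
```

Exposed to the agent as a single `run_code` tool, this covers the long tail of tasks without growing the tool list.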
Paul [00:51:33]: And this brings up the question of: when is enough enough? I think we mentioned how right now there's a very popular idea of long-running agents, and the more time you give them to think and to act, the better the outcome. And you had said there's a scenario where someone gave an agent a lot of time to try to execute a task, and it realized there was a library that it needed, or it needed a pull request merged in a library. So the agent went and sent an email.
Dmitri [00:52:08]: What was that story? Look, this is anecdotal. It popped up in my X feed, I think, so let's take it as an anecdote. But I absolutely believe it, because this is also something that we saw in some instances in our work: if you give a task to an agent and you do not limit execution, and it's a dynamic execution path. This is, by the way, probably where we need to talk about static plans versus dynamic plans. Yeah, it's a good point.
Paul [00:52:40]: Yeah, yeah, bring that in.
Dmitri [00:52:41]: Yeah.
Paul [00:52:42]: So.
Dmitri [00:52:44]: One of the things that people work on a lot: when a task is given to an agent, the agent has to think about the task, and then there are multiple options for how to go about it. One is you create an execution plan, saying, okay, this is my data, I'm gonna call this tool, I'm gonna take the result of this, call that tool, that tool, that tool, and then I come to an end. This is more or less referred to as a static execution plan. It's actually very nice for simple, predictable tasks. Again, going back to one of my previous examples, let's say transcribing and summarizing a meeting: it's very clear what needs to be done, there are no surprises between function calls about the outcome, so you can do static planning.
Dmitri [00:53:43]: It's really good in the sense that it increases the chances of convergence, so the agent will reach some result. But of course, for longer tasks there is an increased risk of running into an error in one of the intermediate steps, and it cannot correct, because the plan is fixed. Now, the alternative is dynamic planning. You basically say, okay, I understand what I want to do, I know my first step, I'm gonna call it, and then I'm gonna observe the result and call something again, and so on and so forth. Dynamic planning is great in the sense that it decreases the probability of getting stuck on a particular error, but it increases the probability of the agent never converging; it can go on and on. And this is what I mean when I say: how do you explain to agents this very human concept that better is the worst enemy of good? Because in theory it can go forever, trying to improve on your task and trying to get better and better results.
Dmitri [00:55:00]: So now, going back to the anecdote you mentioned: let's say you've got an agent that wants to write code, and it can write code, test it, see if it works or not, and then improve, and so on. And this agent has access to the Internet, so it's not a big leap for it to understand at some point: you know what, I can actually Google stuff. We already see this now; if you use o3, for example, it does a web search. If it runs into an error it cannot resolve from its own knowledge, it does a web search. Then if you also give it permission to write an email, it can contact people asking for help, like package developers. And if it can go on the web beyond web search, with tools like Operator and so on, it can go on Reddit, on forums; it actually posts questions. So again, back to what you mentioned earlier: it starts seeing the Internet and humans as tools to achieve a goal.
Dmitri [00:56:09]: And if you do not create any bounds on what your next step could be, then anything can happen. And again, it won't be a stretch to imagine that if you give it a credit card, it will hire somebody to do the job for it.
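The bounds Dmitri calls for can be made concrete as a dynamic-planning loop with a step budget and an action whitelist, so the agent re-plans freely after each observation but can never "hire somebody on Fiverr". The planner, action names, and policy here are all toy stand-ins:

```python
# Dynamic planning with explicit bounds: re-plan after every observation,
# but a step budget and an action whitelist keep execution from
# wandering forever or escalating to unapproved actions.

ALLOWED = {"read_docs", "run_query", "write_report"}  # no "send_email"!

def run_dynamic(task, plan_next, execute, max_steps=10):
    history = []
    for _ in range(max_steps):
        action, arg = plan_next(task, history)  # re-plan from observations
        if action == "done":
            return arg
        if action not in ALLOWED:
            history.append((action, "BLOCKED: action not permitted"))
            continue                            # planner sees the refusal
        history.append((action, execute(action, arg)))
    return "stopped: step budget exhausted"

# Toy planner: first tries an out-of-bounds action, then a query, then stops.
def toy_planner(task, history):
    if any(a == "run_query" for a, _ in history):
        return "done", "report written"
    if not history:
        return "send_email", "ask the maintainer"  # will be blocked
    return "run_query", "SELECT ..."

out = run_dynamic("monthly report", toy_planner, lambda a, x: f"{a} ok")
```

Blocking rather than raising lets the planner observe the refusal and route around it, which is the convergence-friendly behavior.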
Paul [00:56:24]: Yeah, it will go on Fiverr and say, can you do this for me? Which is a fairly interesting piece to look at, because if you want the best result possible, potentially you're okay with all of these outcomes. But, like you're saying, there's that fine line: how much better can we get it if we let it go, and we let it hire someone on Fiverr, and we let it post on Reddit when it gets stuck? Or maybe it just needs to incorporate you, the human, in sooner, and say: I'm stuck here, do you have any ideas for how I can make this better? So in that design principle, thinking about when to incorporate the human, or how to say enough is enough: is that an open question in your mind, or have you figured out a way to do that?
Dmitri [00:57:25]: No, I think it's pretty much an open question. When to stop is an open question. If you just leave it to the model, you typically observe two behaviors. One is that it's overconfident: it declares the goal achieved. Generally, if you look at LLMs nowadays, even at the step of reasoning and comprehension, it's very difficult to get a model to acknowledge it doesn't know something. You ask, okay, do you have enough information? More likely than not, it will say yes, even though it's obvious there's not enough information. In the same way, if you ask a model, okay, you executed, do you think you did well and got to a good result? It's more likely to say yes.
Dmitri [00:58:19]: The opposite side, and again it depends on what kind of instructions and guardrails you put in place, is that it will say no and just continue. So it's very difficult to strike a balance between this overconfidence and constant doubt. And I think it's still an open question.
Paul [00:58:37]: Well, it's funny because that was one of your pillars of what an agent is, is knowing when to stop.
Dmitri [00:58:44]: Yeah, yeah.
Paul [00:58:47]: Now talk to me about the memory piece. We talked with Paul for a minute in the beginning part of this episode about memory, and how you have these two different types of memory paths. You also mentioned this before: there's the memory of being able to complete a task in a way that it can do reliably each time. And Paul was saying, well, this is where, for us, the evals are so important, because if you can say what path got it to success, then you can update that information in the model's abilities or toolkit in various ways. Maybe it's through caching, or maybe it's updating the RAG system, or going as far as fine-tuning. So for you, how do you look at memory in the space of design principles for building agents?
Dmitri [00:59:47]: Yeah, so when I look at memory, I think about a couple of things. First, you've got short-term memory. I borrowed a term that I like from somebody: it's called scratch-pad memory, and it's the memory that you use within a task. You basically start with a blank list, and as you execute the task, as you call different tools, as you interface with a user, you write down what happens. So this is short-term memory within a task. If it's a conversation, it's conversation memory, and so on.
Dmitri [01:00:27]: And then you've got long-term memory, which is basically what happened overall: across many interactions, across different sessions, across different tasks. And I look at this memory from two different angles. The first angle is more or less personalization. You've got an agent, and it executes tasks on your behalf, and it learns something about you. Remember, going back to observe: understand the user and circumstance. We're talking about getting intelligent about the user. So next time you ask for "my favorite," it actually knows what your favorite is.
Dmitri [01:01:11]: It knows your communication style. It knows a lot about you.
Paul [01:01:15]: Basically preferences.
Dmitri [01:01:17]: It's preferences, a bit. People sometimes call it profiles, memories, whatever, but it's learnings about you that need to persist and be used later during task execution. Now, the second type of memory is: okay, I did this task, what went well, what do I need to do better? And this is not related to you; it's related to the way the agent executed the task. Allow me to illustrate. Again, going back to something like creating a report based on some business data: an agent looks at the conversation and says, you know what, I spent a lot of time searching for a city which didn't exist, because I didn't know that the city name could be encoded differently in the data storage, with Unicode. So next time there is a task like that, I first need to understand what the spelling is, whether it's correct, what the encoding is, and I need to verify that the city is known before I start pulling all information sources looking for something which isn't there.
Paul [01:02:35]: It's almost like the agent is doing a retro on its own.
Dmitri [01:02:38]: Exactly.
Paul [01:02:39]: Execution.
Dmitri [01:02:40]: Exactly, exactly. And this is also part of reflect. When I was talking about the cycle, I gave the example of code writing, where reflect was really within the loop. But you can also have reflect a bit outside of the loop, if you will, at the end of the overall execution: just looking at everything that happened so far and saying, okay, these are my learnings. And I think this is a very important part. Remember, at the beginning I said what agents are, and one of the elements is memory, and memory as a way of learning. This is the part that contributes to learning.
Paul [01:03:30]: How are you reincorporating that learning into the next time that the agent does that? How do you have this retro? The retro where the agent now understands I could have done it better in these ways. What are you doing to update its understanding so that it does do that?
Dmitri [01:03:48]: So essentially, you put this information into context. You essentially change the instruction to your agent. And this is also where RAG comes into play: eventually you will not be able to put all memory in, whether it's personalization memory or execution memory or whatever. You need to find the elements pertaining to the user and the task at hand which are relevant, pull them from long-term memory, put them in the context, and say, okay, this is what you need to take.
Paul [01:04:27]: Into account. Because basically you're saying: hey, by the way, the last time you did this, you said that you should have done these things. Make sure to remember to do that.
Dmitri [01:04:39]: Yes, essentially, though not phrased exactly like that. It's more prescriptive, saying: a good practice in this case is to do this, this, and this; you verify this, this, and this.
Paul [01:04:50]: But yes, okay, fascinating.
Dmitri [01:04:53]: And you also go through the memory. So it's not necessarily part of agent execution; it could be part of the environment, something outside, where you say: okay, you've got these learnings and learnings and learnings, but you sometimes need to compress them, to generalize them. Because you can't have 100,000 executions and then 100,000 learnings where 90% of them are the same. You just need to be able to generalize and compress.
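Folding learnings back into the next run, as just described, means retrieving the relevant ones, compressing out repeats, and prepending them to the instruction as good-practice notes. The relevance function here is a toy keyword overlap standing in for the embedding-based retrieval a real RAG setup would use:

```python
# Retrieve task-relevant learnings, deduplicate them (crude compression),
# and prepend them to the prompt as "good practices".

def relevant(task: str, learnings: list[str], top_k: int = 3) -> list[str]:
    def overlap(l: str) -> int:
        # Toy relevance: shared words between task and learning.
        return len(set(task.lower().split()) & set(l.lower().split()))
    ranked = sorted(learnings, key=overlap, reverse=True)
    seen, out = set(), []
    for l in ranked:
        if l not in seen:          # drop exact repeats
            seen.add(l)
            out.append(l)
    return out[:top_k]

def build_prompt(task: str, learnings: list[str]) -> str:
    notes = "".join(f"- {n}\n" for n in relevant(task, learnings))
    return f"Good practices from past runs:\n{notes}Task: {task}"

learnings = [
    "verify city spelling before querying the database",
    "verify city spelling before querying the database",  # duplicate retro
    "wait for the page element, not the whole page",
]
prompt = build_prompt("report revenue by city", learnings)
```

Real compression would also merge near-duplicates ("generalize"), typically with an LLM summarization pass over clusters of similar learnings.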
Paul [01:05:25]: Yes, yeah. And so then it's like this reflect part plays into another one of these ideas. Tell me how it's different from the internal critic. Is that the same thing or are they separate?
Dmitri [01:05:41]: It could be. I was trying not to separate internal and external reflect, in a way. If you think about an agent writing code, reflect is in fact an internal critic: it looks at the execution and says, okay, this is what the tests show we didn't do well. In another scenario, it could look at a written document and say: you know what, the user's instruction was to summarize this document but convey this emotion; did we do it or not? And it says, no, this emotion is not really properly conveyed, so we need to do another step. So this is in fact an internal critic, but you can also call it a reflect step.
Dmitri [01:06:38]: Yeah, but it's a reflect within execution. And then you've got another example which is reflect at the end of execution or on outside. And you can mix it, obviously.
Paul [01:06:50]: Yeah. There's a diagram somewhere in there that maybe we'll have to create, showing these loops that you're going through, especially the way you're thinking about it: observe, then think, then act, then reflect. And maybe reflect is at the beginning, maybe it's at the end, maybe it's at both and it's continuously reflecting after each step. It does sound like that can get expensive.
Dmitri [01:07:21]: It does. It does. And this is why, in principle, this reflect at the end: you don't need to do it only at the end, you can do it on every step as well. It's just extremely expensive. So many solutions that I've seen so far, and many solutions that we've built, are actually doing it at the end.
Dmitri [01:07:41]: And it's simply because it adds delays, it adds cost. But yeah, the more you can reflect, the better. You just need to make sure that eventually you converge, and don't reflect, reflect, reflect; again, better is the worst enemy of good. One thing that I wanted to say, based on your question, because you referred to what Paul said about learning from previous executions: one of the things you can do with the memory is actually optimize your execution path. So not only go for trivial things like the one I mentioned, where you say, okay, to avoid this mistake, this is a good sequence of steps, but actually optimize the execution path altogether. An example of this: you spoke to Floris, right? At one point he was working on an agent that uses the web, browsing websites.
Dmitri [01:08:41]: And essentially, the way we implemented it: if you've got a task, your execution is to go to the website and basically start browsing it. Finding out what is clickable, what the fields are that you can fill with information, and so on. And then you say, okay, I need to click on the menu, I need to look for a restaurant, for example, in a food-ordering case, then I need to click here to add to the basket, and so on, all the steps. But of course, once it's done and it's successful, it looks at the whole execution and says: you know what, at this step I actually don't need to wait until the whole web page is loaded. I can just put a trigger saying, okay, once this element is loaded, I can activate it, and then do the next, next, next.
Dmitri [01:09:39]: And suddenly, instead of spending two minutes browsing, you can do it in under 40 seconds, not wasting the website's resources, and not wasting yours either. So we moved from slow execution to fast execution. And then another thing happened: it looks at this slow execution and says, you know what, if we do this sequence of steps, we can actually predict where we end up on this website, because a website is essentially a limited set of pages, of views. And the place where we end up has a specifically formed URL that I can form already at the beginning, knowing where I need to go. This is what we call reflex. So next time I've got the task, when again I say "order me the usual," it says: okay, I just go there.
Dmitri [01:10:35]: And in parallel I'm also trying to do this fast, and in parallel I'm also trying to do this slow. If I'm good with the reflex execution, I'm good. If not, I fall back on the fast. If fast fails, I fall back on the slow. An example of when that happens would be: the website changed, an element changed. Right.
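The reflex-fast-slow pattern just described could be sketched roughly like this. It is a sequential simplification of the parallel racing Dmitri mentions, and all three executor functions are hypothetical stand-ins (the real paths would be a memorized URL, element-trigger browsing, and full page browsing):

```python
# Hypothetical sketch of the reflex -> fast -> slow fallback cascade.
# Executors are tried in order of increasing cost; the first success wins.

def run_with_fallback(task, executors):
    """Try (name, fn) pairs in order; return the first successful result."""
    for name, execute in executors:
        try:
            return name, execute(task)
        except Exception:
            continue                     # e.g. website changed, element moved
    raise RuntimeError("all execution paths failed")

def reflex(task):
    # Stand-in for jumping straight to a memorized URL.
    raise Exception("memorized URL is stale")   # simulate a changed website

def fast(task):
    # Stand-in for browsing with element-load triggers.
    return f"completed {task!r} via element triggers"

def slow(task):
    # Stand-in for full, wait-for-page-load browsing.
    return f"completed {task!r} via full page browsing"

path, result = run_with_fallback(
    "order me the usual",
    [("reflex", reflex), ("fast", fast), ("slow", slow)],
)
```

In the actual design the three paths race in parallel and the losers are cancelled; the sequential version just makes the fallback order explicit.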
Dmitri [01:10:59]: But this is also an example where reflect is very important. It's not critical in a sense that your application will not work without reflect, but it's very important in terms of optimization of performance. And this is why again, memory in relation to execution is important.
Paul [01:11:19]: Yeah, there are so many things that I want to comment on there, especially if you think about the way that happens. If it gets done with the reflex, then how do you call off the rest of the actions? How do you call off slow and fast?
Dmitri [01:11:41]: To be honest, this depends on the application; it's just an implementation detail. The concept itself is actually not new. I saw it many years ago. I believe Siri on Apple worked this way. If you remember the early version of Siri, you would say, call mom, and it would call mom. So what would happen, as opposed to now?
Paul [01:12:08]: It doesn't do that for some reason. It's been five years and it got worse.
Dmitri [01:12:12]: Yeah, models change, errors... I don't know. But essentially, if I understand what happened, it would take your voice and start processing it locally on the phone, and at the same time it would send it to the server, where you've got much more powerful processing. And then if the phone says, okay, I processed it and I actually understand what needs to be done, it just communicates back to the server saying, okay, I don't need your support, stop. Yeah, it's a bit of a waste of resources, but it's a significant optimization in terms of how quickly you can get.
Paul [01:12:51]: Results and the accuracy of that result.
Demetrios [01:12:53]: Yeah.
Paul [01:12:55]: Now, speaking of wasting resources, or budgeting, I wanted to touch on the idea of how you think about budgeting every action. And I'll bring this up because I spoke with Zach probably three, four months ago, and he was putting together different agents and an agent builder at his company. And when folks build agents at his company in staging, they then have this little number that says: if you are to push this to prod, we estimate it will cost this much money, because of the scale and the amount of LLM calls that we're going to be making and the resources needed. You were looking at budgeting in a different way, with long-running tasks and being able to say, if you exceed a $2 budget, then just stop, because I don't want this to all of a sudden get racked up to a $10,000 OpenAI bill.
Dmitri [01:13:57]: Yeah, yeah. So I think budgeting is an extremely complex question, to be honest, because one thing is to say, you know what, we just want to limit the cost of execution, and I think that's not particularly difficult. You can estimate the cost of execution, as you just said, based on the number of LLM calls or maybe some other things that it does. We actually did it, going back to the food ordering agent: we can very precisely say, okay, it did this food order, it looked for options, it reached out to the restaurant, it looked at whether it's cheaper to order via the platform or via the restaurant directly, and it selected the best option cost-wise. And this is the order that was made, for this amount. And from there we can also say, okay, but it ran so many calls, and we know on average what the infrastructure costs. So we can say, okay, it actually saved the user about $8 and it cost $1.50 to run.
Dmitri [01:15:17]: So you can put a very easy cap, saying, you know what? On average we are bringing this much value to the user, so if you're exceeding this value, please stop. That's a business discussion, not a technical discussion. Where budgeting becomes extremely difficult, in my opinion, is when it becomes part of a trade-off discussion. And unfortunately, this is where I do not see a lot of interesting solutions; please tell me if I just missed them. So remember, agents have tools, and if you look at most of the cases on the market now, these tools are not overlapping. So you cannot replace one tool with another.
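The "easy cap" side of this can be made concrete. A minimal sketch, where the per-call cost and the value estimate are made-up illustrative numbers, not figures from the actual food ordering agent:

```python
# Minimal sketch of a value-based budget cap: stop an agent run once its
# estimated execution cost exceeds the value it creates for the user.
# The per-call cost and value figures below are illustrative assumptions.

COST_PER_LLM_CALL = 0.02       # assumed average infrastructure cost, in dollars

class BudgetExceeded(Exception):
    pass

class BudgetedRun:
    def __init__(self, value_to_user):
        self.budget = value_to_user      # e.g. the dollars the order saved
        self.spent = 0.0

    def charge(self, calls=1):
        # Accumulate estimated cost; the cap is a business rule, not a
        # technical one, so it lives outside the agent's reasoning loop.
        self.spent += calls * COST_PER_LLM_CALL
        if self.spent > self.budget:
            raise BudgetExceeded(
                f"spent ${self.spent:.2f} > value ${self.budget:.2f}")

run = BudgetedRun(value_to_user=1.50)
run.charge(calls=50)                     # about $1.00 spent so far: fine
```

A second `run.charge(calls=50)` would push estimated spend past the value created and raise `BudgetExceeded`, halting the run.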
Dmitri [01:16:12]: But what happens if you can? A very simple example: you have three image generation tools. Each takes a different execution time, each differs slightly in the quality of the delivered results, and each costs very differently. Have you actually seen a good solution that is able to navigate this trade-off? Given the task at hand, which tool is the best option in terms of speed of answering, quality of the answer, and cost? This is where I think budgeting will be extremely important moving forward, because we build agents with more and more tools, and internally I already see a couple of use cases where we've got two tools doing more or less the same thing. So how do you choose the tool? How do you communicate what is important? What if the importance changes over the execution of your task? At one point, again going to the human example, how do you explain to an agent the concept that it's better to have an answer now than a good.
Paul [01:17:31]: Answer later, tomorrow or next week.
Dmitri [01:17:34]: Exactly. So this is where I think budgeting will be important, and this is where I still see a lot of opportunities for development.
Paul [01:17:44]: It's almost like you want an urgency knob that you can dial and you can say I'm cool with this one going as long as you want or I need this as soon as possible or somewhere in between.
Dmitri [01:17:55]: Yeah. So if it costs me less, takes me a week, and it's not a critical task, fine. If I need an answer now, how urgent is my now? You see how it's very difficult. Also, if you think about a human, you work in a team: how easy is it to convey, and also to understand, a sense of urgency? People say, oh, I'll pay whatever if it's done. But then what is whatever, and what is done? And then you realize, well, it's actually.
Paul [01:18:31]: Not whatever and it's not really done.
Dmitri [01:18:34]: Yeah. And whatever is different for different people. And definition of done. Exactly.
Paul [01:18:39]: It also reminds me of a paper that I read back in the day called Frugal ML. They were exploring a few different ways to bring down LLM costs. One was throwing two similar questions into one context window and getting the output for both. Another was a dynamic router: if the question was simple, it would go to a simple, smaller, self-hosted open source model; if the question was more complex, it would kick it off to the bigger model. Back in those days it wasn't a reasoning model, but it was something of the sort. Now, what you're saying is: imagine if tools had that same kind of router, or it was just a capability the agent had, so it could understand, these are my options, and considering I need X amount of urgency versus Y amount of budget, I'm going to choose this tool.
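The router being imagined here could be sketched as a simple weighted score over overlapping tools. Everything below is a hypothetical illustration: the tool names, their quality, cost, and latency numbers, and the scoring weights are all made up, and a real router would learn or calibrate them:

```python
# Hypothetical sketch of an urgency/cost-aware tool router: score each
# overlapping tool on quality, cost, and latency, weighted by the task's
# urgency and cost sensitivity. All tool stats are illustrative.

TOOLS = [
    # (name, quality 0-1, cost in dollars, latency in seconds)
    ("small_local_model", 0.60, 0.01,  2),
    ("mid_hosted_model",  0.80, 0.10,  8),
    ("large_api_model",   0.95, 1.00, 30),
]

def pick_tool(urgency, cost_sensitivity):
    """urgency and cost_sensitivity in [0, 1]: the 'knobs' from the discussion."""
    def score(tool):
        _, quality, cost, latency = tool
        return (quality                      # reward expected quality...
                - cost_sensitivity * cost    # ...penalize cost when budget matters
                - urgency * latency / 30)    # ...penalize latency when urgent
    return max(TOOLS, key=score)[0]
```

With both knobs at 1.0 the router prefers the cheap, fast tool; with both at 0.0 it simply picks the highest-quality one. The interesting open problem from the discussion is that these weights can change mid-task.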
Dmitri [01:19:43]: Yeah, exactly. It's exactly that. And I'm really eager to see solutions that actually address it. But yeah, we're still in a phase where we've got just a handful of tools, and you don't really have these options where you can trade one characteristic for another.
Demetrios [01:20:06]: That's all we've got for today. But the good news is there are 10 other episodes in this series that I'm doing with Prosus, deep diving into how they are approaching building AI products. You can check it out in the show notes; I'll leave a link.