MLOps Community

The Agent Landscape - Lessons Learned Putting Agents Into Production

Posted Feb 20, 2025
# Agents Into Production
# Agent Landscape
# Prosus
speakers
Paul van der Boor
Senior Director Data Science @ Prosus Group

Paul van der Boor is a Senior Director of Data Science at Prosus and a member of its internal AI group.

Floris Fok
AI Engineer @ Prosus Group
Demetrios Brinkmann
Chief Happiness Engineer @ MLOps Community

At the moment Demetrios is immersing himself in Machine Learning by interviewing experts from around the world in the weekly MLOps.community meetups. Demetrios is constantly learning and engaging in new activities to get uncomfortable and learn from his mistakes. He tries to bring creativity into every aspect of his life, whether that be analyzing the best paths forward, overcoming obstacles, or building lego houses with his daughter.

SUMMARY

Demetrios chats with Paul van der Boor and Floris Fok about the real-world challenges of deploying AI agents across the Prosus group of companies. They break down the evolution from simple LLMs to fully interactive systems, tackling scale, UX, and the harsh lessons from failed projects. Packed with insights on what works (and what doesn’t), this episode is a must-listen for anyone serious about AI in production.

TRANSCRIPT

Demetrios [00:00:00]: Welcome everyone to a conversation between myself and my good friend Paul van der Boor to talk all about how the Prosus team is using AI agents in their companies, specifically Prosus and maybe sometimes their portfolio companies. We get into all the nitty gritty details on how they have been innovating and what some of the challenges have been, specifically around using AI agents. Let's get into this conversation. We're going to be talking all about agents and more specifically how you all in this global group of companies, including Delivery Hero and OLX, are using agents. You've got over a thousand ML practitioners in this group and you're bringing agents and AI use cases to the over 30,000 people that make up the global collective group. And knowing that you've had some hard earned lessons, and that's really what I want to dive into: technical hard earned lessons, user adoption, UX, UI, all of the fun stuff, because you've been doing stuff since 2022 when ChatGPT first came onto the market and trying to figure out how to make that more useful. We should probably start with just like what we were talking about yesterday when we had a bit of a Mexican standoff and we said, so what is an agent? We both looked at each other and we're like, AI that can do stuff, right? You might have a better example of that.

Paul van der Boor [00:01:43]: Let's start with that. So what is an agent? Right, so an agent, in my simplest description, is essentially an LLM that interacts with the world. I like that. We've obviously had LLMs, or anybody in the space has been working with LLMs, for years now, right? And the first versions of them, GPT-2 and so on, everybody's been playing around with, sort of figuring out what's coming. But at the end of the day, those LLMs were fairly isolated tools, right? That's a little reasoning engine in a box, and you can give it a token and it gives you a token back. And that was of course super impressive. And we saw that with the ChatGPT moment that sort of jumped into everybody's lives. But one of the things that we saw as a fairly obvious next act of generative AI is agents: when these LLMs would be able to interact with the world. And how they interact with the world, obviously, it's almost like a dial, you're dialing it up, right? So the first things we saw were: well, maybe they can access the web, maybe they can access compute environments, maybe they can access APIs, maybe they can start to interact with the browser.

Paul van der Boor [00:02:55]: So that's sort of this idea of an agent. And to your point, we've been working on this for years, because at Prosus, we're probably one of the world's largest tech investors, focused heavily on e-commerce, serving about 2 billion consumers through those various companies that you mentioned.

Demetrios [00:03:14]: 2 billion.

Paul van der Boor [00:03:14]: 2 billion. That's a lot. Yeah, that's a lot, right?

Demetrios [00:03:18]: How big that is?

Paul van der Boor [00:03:19]: Yeah, in iFood in Brazil and Swiggy in India and Stack Overflow and Delivery Hero and OLX and many other companies. A hundred companies in the group. You know, there's a ton of different opportunities for AI in general. And then if you go to, you know, agents, it becomes incredibly interesting and exciting to see all the different things we can build. So that's why I've been investing in this space for a long time. With our team here in Amsterdam, we organized this conference.

Demetrios [00:03:48]: Yeah, that's kind of the, the inspiration for this whole series is because we did the conference together as a virtual conference. And then we realized we want to create more and go deeper. Because what I saw in that conference was that you all are doing some very advanced stuff with agents. And the conference was all around agents. It was agents in production, it was a virtual conference. We saw the most cutting edge things that are happening with agents out there in the world. And my conclusion was we need to have more conversations because I want to hear what you've done. And so this episode is going to be us breaking down the agent space, what it's comprised of.

Demetrios [00:04:29]: And then we'll bring on Floris, our good friend, who's going to talk about some hard earned lessons that you all have had. What agents you've tried that died, so the whole graveyard of agents, and then what agents you actually were able to stick with and have been providing business value. How are you looking at that business value? How are you putting metrics around whether the agent is useful or not? And so before we bring him on, we should talk a little bit more about what components make up an agent.

Paul van der Boor [00:05:03]: But maybe taking a step back and thinking about these agents, why it's so hard to make them work, and why the graveyard is still populating itself so fast, is because there's a lot of unknown pieces. So think of an analogy, right? I don't know if you've ever built things with a Raspberry Pi, or with Mars rovers. I have a young son, so we're building a Mars rover, which is essentially a Raspberry Pi which connects to a bunch of sensors, a microphone, a camera, and a memory chip so you can see sort of what its path is. And for me, if I take this to the world of GenAI, what we basically have today with these powerful GenAI models, large language models, is just the reasoning engine. It's basically the Raspberry Pi without anything else. Right. And that was the LLM. And now in the agentic world we're trying to figure out how do you put this into a system that can interact with the world, that can actually understand what history of interactions it had. So memory: like this Mars rover needs to know what its path is, where does it need to go. It needs to maybe be able to take actions and, you know, decide to go left or right.

Paul van der Boor [00:06:20]: In the agentic world it's I need to access an API to fetch information or I need to store, you know, create a file and store it somewhere.

Demetrios [00:06:27]: It needs wheels, that's it.

Paul van der Boor [00:06:28]: So the tools are part of it, because that gives the LLM the ability to interact with data that's up to date. Because of course, you know, the data it's seen during training has a cutoff, which is very different. So you need other data, maybe that's proprietary data or data related to something that happened today. It may need to actually generate its own data, to not just read but also write, because it knows that you and I are interacting and you've asked me certain types of questions over time. So that needs to be stored in memory. When does that memory get accessed? Then, when it generates an answer, it may want to actually think and critique that answer. So it's not just a one-shot token prediction, but it generates a plan based on what you're asking it to do. It can follow that plan, critique it, look at the end of the steps it's followed whether that plan materialized, go back to a step and revise.

Paul van der Boor [00:07:32]: So that entire system is what you need to have working not just once, but reliably, especially if you're going to ship it to real customers in production and so on. And that's been the journey we've been on. Right. The journey has been moving from a small unit of reasoning, the LLM, the Raspberry Pi, that now needs to be reliably connected to all these other pieces so it can help with much more sophisticated tasks. And in fact that's the promise, right? Moving from just a simple Q&A device to something that can actually help with much more sophisticated, complex tasks. Especially in the e-commerce journey in our ecosystem at Prosus, there's a ton of opportunity to do that.
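
As a rough sketch of the system Paul describes here — a reasoning engine wired to tools and a memory of its past steps — a toy loop might look like the following. Everything in it (`call_llm`, the `TOOLS` registry, the hard-coded decision logic) is illustrative, not Prosus's actual implementation:

```python
# Toy agent loop: the "Raspberry Pi" (reasoning engine) connected to tools
# and a memory of (action, observation) pairs, like the Mars rover's path.

TOOLS = {
    "search_web": lambda query: f"results for {query!r}",
    "save_file": lambda text: f"saved {len(text)} chars",
}

def call_llm(goal, memory):
    """Stand-in for the model: pick the next action from the goal + history."""
    if not memory:
        return {"action": "search_web", "input": goal}
    if len(memory) == 1:
        return {"action": "save_file", "input": memory[-1]["observation"]}
    return {"action": "finish", "input": "done"}

def run_agent(goal, max_steps=5):
    memory = []  # the history the agent "remembers" between steps
    for _ in range(max_steps):
        step = call_llm(goal, memory)
        if step["action"] == "finish":
            return memory
        # Interact with the world through a tool, then store the observation.
        observation = TOOLS[step["action"]](step["input"])
        memory.append({"action": step["action"], "observation": observation})
    return memory

for step in run_agent("weather in Amsterdam"):
    print(step["action"], "->", step["observation"])
```

The "reliably, not just once" part Paul emphasizes is exactly what this sketch leaves out: retries, critique of each observation, and revising the plan when a step fails.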

Demetrios [00:08:20]: The Raspberry Pi to the Mars rover. That's what we're doing right now. I like that metaphor. And there are some specific difficulties that arise when you start doing that. I think two things that we wanted to call out, especially right now, are: evaluation can be really difficult, and you're looking at latency requirements and then cost too. Yeah, because as you mentioned, if you're doing all of these LLM calls, it can add up quickly. And we're going to go over all that fun stuff later on in different episodes and break down specifically how you can do evaluation and what you can look for. But now we should probably talk about the different ways that you can use agents, or that agents manifest, I think we could call it.

Demetrios [00:09:12]: Because back to that conference, we saw lots of different ways that agents are being used. And in a broad sense, I kind of bucket agents into: you have agents that are like the computer use agents that came out from Anthropic, and it can use your whole computer and take over. Yeah, you have web agents, which are a little bit more of an intermediary: they're not using your whole computer, but they're using your browser.

Paul van der Boor [00:09:38]: Right.

Demetrios [00:09:39]: And then you have agents that are interacting with the world through APIs, which I think is probably the most common design pattern these days. And then you also can have voice AI agents, so interacting with an agent that you're talking to on the phone or maybe on a Zoom call. There's probably room for us to throw in there agents that you see in video games, NPCs, maybe. I don't know if you want to call that a full blown agent, but it feels like that could be one also.

Paul van der Boor [00:10:14]: Yeah, for sure. I mean, you're describing the spectrum of complexity and I think it also gives us a sense of where we're headed in the future. Some of it was very near. So indeed, the natural first set of tools that you want to give these agents to are APIs. Because they're well documented, well structured, you know what needs to go in, you know what you expect back, you can test against that. It makes evaluation a lot easier. So that's the first thing that if, you know, the agents in production that we are working with are typically going to be using well defined APIs that are fairly simple compared to doing, let's say, more open ended web browsing, for example. Of course we're testing and you know, we can share what we've learned and why that's also, you know, very hard and more expensive and takes time.
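
Paul's point that well-defined APIs make evaluation easier — you know what goes in, you know what comes back, you can test against that — can be made concrete with a tool contract that is checked on both sides of the call. The `get_weather` tool and its schema below are invented for illustration:

```python
# A tool with an explicit contract: typed inputs and a typed return shape.
# Because the contract is explicit, a tool call can be validated before and
# after execution, which is what makes API tools easy to test.

WEATHER_TOOL = {
    "name": "get_weather",
    "parameters": {"city": str},
    "returns": {"city": str, "temp_c": float},
}

def get_weather(city):
    # Stubbed response; a real tool would call the weather API here.
    return {"city": city, "temp_c": 18.5}

def call_tool(spec, fn, **kwargs):
    # Validate what goes in...
    for name, typ in spec["parameters"].items():
        assert isinstance(kwargs.get(name), typ), f"bad input: {name}"
    result = fn(**kwargs)
    # ...and validate what comes back.
    for name, typ in spec["returns"].items():
        assert isinstance(result.get(name), typ), f"bad output: {name}"
    return result

print(call_tool(WEATHER_TOOL, get_weather, city="Amsterdam"))
```

Open-ended web browsing has no such contract, which is one reason it is, as Paul says, harder and more expensive to evaluate.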

Demetrios [00:11:02]: We're going to have a whole episode on web agents. And so JR is going to come and be our resident expert. Because that's the fun thing too, that we should mention: we get to pull from all the folks that are working at Prosus who are doing deep dives on each one of these topics, and they get to come and tell us what they've learned over the past six months, just focusing only on that.

Paul van der Boor [00:11:26]: Right. So that's exactly what we're going to be doing. And I think if you walk along the sort of levels of sophistication, we need a framework for this. So somebody, I don't know, maybe somebody's come up with what are these sort of levels of complexities for agents. But, you know, from APIs to browsing, and just in these two levels, if we call them that way, there's so much opportunity still to make these things work. I mean, in the world of E commerce, online marketplaces, platforms, you know, there's so many things that if you just can give an agent access to the web, to the app, to the APIs that they can help you with. All of a sudden, think of booking trips, ordering food, helping you pick products and so on. Let's say after web browsing, the next thing is sort of just giving them access to a computer, a desktop.

Paul van der Boor [00:12:07]: And we're working with various companies, startups out there that do exactly that. And you can see the progress, like on benchmarks like OSWorld and so on, that these agents can now, you know, basically create pivot tables for you, with very brief instructions. Right. Or they can download files, they can process them and so on. And that is going to come soon; 2025 is probably going to give us a whole bunch of new exciting things and products on that front. Then if you go one step further, they can start interacting with the real world. Right.

Paul van der Boor [00:12:39]: So robotics.

Demetrios [00:12:40]: Yeah, that's another agent that I didn't even talk about. That's true.

Paul van der Boor [00:12:43]: So that's sort of the levels of sophistication that I think we expect to kind of see maturing over the next months. Quarters. Sometimes things go faster.

Demetrios [00:12:53]: There's some prime real estate for some real thought leadership there with that map of the difficulties.

Paul van der Boor [00:13:00]: Let's do it. They're volunteers.

Demetrios [00:13:03]: So one thing that we didn't really talk about is why use agents and why not just use traditional methods? Because it seems like we add a lot of complexity. There is that benefit of, hey, I can just tell something to Go do it and it'll do it for me. But a lot of times you end up banging your head against the wall because it is so difficult.

Paul van der Boor [00:13:27]: I think the simple answer is that you just add so many more possibilities, or tasks, that you can have these systems do for you. Once you move to an agentic world and you give them access to tools, it's the obvious next thing to do, because we've kind of gone through the question-answer world and we're doing that. And sure, these AI assistants are part of our lives now and they do that reasonably well. Of course there's a ton of room for improvement, but as they become agentic, there's many more things they can do. And again, at Prosus, we're a large ecosystem of companies that help our users do things easier, better, faster as they interact with our e-commerce platforms. We see that the agentic capabilities allow us to do much more on that front. And by the way, I will also say that one of the things you notice, and we'll talk about what we learned as you try to apply this to the current systems, is that the world is not ready, or it hasn't been made, for these agentic systems. The interfaces, the APIs, sure they exist, but they're not made for agents to interact with them.

Demetrios [00:14:36]: Well, they break all the time too. It's really hard to get a very trustworthy API, even just for weather. A weather API, which you would think is a solved problem and is simple, that's hard. And so trying to go a few steps deeper and get more complex APIs; each API is different, and it's constantly changing.

Paul van der Boor [00:15:02]: Right.

Demetrios [00:15:03]: And if you're not up to date with those changes or making sure that you have some way to keep your agents up to date, then you're looking at a whole world of hurt.

Paul van der Boor [00:15:13]: That's right.

Demetrios [00:15:14]: Preaching to the choir right here. There is something I want to bring up too, around how process works with the different companies, because I tend to funnel everyone that I know that is starting a company to you because you're in such a unique position. And this position is that you have a ton of users, a ton of ML talent, and you know what problems need solving. And so there's users from the portfolio companies, but then there's also internal users because the process group is gigantic. Maybe we can touch on that a little bit before we jump into it because that gives more insight on how agents are valuable to you all and how, you know, like what is actually worth doing versus not.

Paul van der Boor [00:16:04]: Yeah, that's a great point. Because of the setup we have at Prosus, we're a large global tech investor, the largest in Europe, with operations all over the world. Right. We've got food delivery in India and in Brazil, and classifieds in Eastern Europe, and education technology in the US, and media in South Africa, and many other companies. About 100 companies in the group, all with a tech angle, with their own tech hubs and AI teams. We are in a very unique position to be able to work with them closely on lots of topics. Our focus of course is AI, and increasingly now agents, to figure out how can we solve real user problems. Like I mentioned, we have about 2 billion consumers across the group.

Demetrios [00:16:50]: 2 billion.

Paul van der Boor [00:16:51]: That's a big group and they're all over the world.

Demetrios [00:16:54]: Every time you say that, I'm going to react with this year it's 2.

Paul van der Boor [00:16:57]: Billion, probably next year it's 3 billion. Oh my God. And we do believe that the agentic systems that we're building are going to be able to solve lots of our user problems: helping make bookings, make transactions easier, find the right products they're looking for, learn things faster. And so all of these, let's say, real user problems are things we're trying to solve for. Our team, the AI team at Prosus, is based in Amsterdam, and our job is to work very closely with the AI teams in the group companies to help them basically accelerate some of the cool use cases we think are going to be really valuable for the group, in the e-commerce space in particular. In doing that, we typically identify what the problems are. So when you build agentic systems in production, you know, we talked about all the issues: you need to make it affordable, you need to make it safe, you need to make it scalable.

Paul van der Boor [00:17:57]: So we identify problems, and then we go out there to find, you know, whether anybody's offering a product or a solution for them. So we'll typically talk to founders, startups in this space. When we like what they're doing, we either partner with them as design partners, or, you know, we can also invest in them. We've got tons of examples there, and that's cool.

Demetrios [00:18:20]: On the design partner front, it is so valuable to have a company that is so advanced and understanding what is important and then just to be able to plug in with you all. I know that a few companies I've introduced you to, they come back to me and they're like, oh my God, thank you so much. Because again, the whole reason we're doing this is I think you all are doing some of the most advanced stuff when it comes to agents. When you get companies that become design partners, they get to see how advanced you really are. And so if they're on the cutting edge, they get to see the scale of what you're doing and then recognize if their tech holds up to that scale.

Paul van der Boor [00:19:02]: Yeah, I think it's a valid point. I think the problems we face today are probably challenges that many others will face soon as well, whether that's in months or years, as they also start to build these systems in production at scale.

Demetrios [00:19:16]: One of those things was the cost, right?

Paul van der Boor [00:19:18]: That's one. Yeah, that's a great example where we've been continuously modeling what's the impact of agentic systems in production on the cost profile. And the numbers we were looking at at the beginning were just simply cost per token. But you realize that's not really representative, because as you use agents to answer questions or fulfill tasks, they use many more tokens, can do many more things. For our internal assistant, Toqan, that you spoke about, we measure how much time they save per user. Right. Per question. So these systems can do that much better.

Paul van der Boor [00:19:51]: They consume more tokens, so they become more expensive, but the value you get for that is higher. And so we model these things.

Demetrios [00:19:57]: I blatantly stole that, by the way. I took that from you guys from the Agents in Production conference and made a blog post on it in response to some other VC who was saying the same narrative that you see all over the Internet when a new model comes out, or a new update from OpenAI or Anthropic comes out, and they say the cost is just plummeting per token. And so I took your insight, or Euro's, one of the other speakers from the conference, which was: price per token is going down, but price per answer is actually going up, because of these complex systems that we've got going on and how many LLM calls you're making.

Paul van der Boor [00:20:37]: We all intuitively know the costs go down, but we measure this. Right? To give you an idea, over the last summer we looked at how many tokens we use to answer a given question in our internal assistant. It went up by 150%, so more than double, just the number of tokens per question. At the same time, that same period, it was about a three and a half month period, the cost per token went down by about 50%. But the questions we're answering, or the tasks we're doing with Toqan, also become more useful. Right. We're saving people more time, because they're using it more.

Demetrios [00:21:11]: Right.

Paul van der Boor [00:21:12]: They're also using it more. So then the tokens per user go up. So ultimately the token budget that we have starts to double, actually goes up. Right. And then we measure how much time we save per question, and that also goes up. So we actually model this and we have real-time insights we use. We benchmark various models on quality, on cost. We have our own leaderboard, as you know, and so on. We'll talk about that.

Paul van der Boor [00:21:38]: But one thing is, everybody knows the cost per token goes down, but what's the ROI that you get on that? Right. So the cost per unit of intelligence goes down, but does the return you get on that intelligence change as you build agentic systems and so on? So we're in a position to measure this across various tools.
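
Putting the ratios Paul quotes into a back-of-the-envelope calculation shows why cost per answer rose even as cost per token fell. The baseline values below are hypothetical; only the +150% and -50% figures come from the conversation:

```python
# Tokens per question rose 150% (x2.5) while cost per token fell ~50% (x0.5),
# so cost per question went UP by 25% over the same ~3.5 month period.

tokens_per_question = 1_000   # hypothetical baseline
cost_per_token = 0.000010     # hypothetical baseline, in dollars

new_tokens_per_question = tokens_per_question * (1 + 1.50)   # +150%
new_cost_per_token = cost_per_token * (1 - 0.50)             # -50%

before = tokens_per_question * cost_per_token
after = new_tokens_per_question * new_cost_per_token
print(f"cost per question: ${before:.4f} -> ${after:.4f} ({after / before:.2f}x)")
```

Whatever the baseline, the ratio is the same: 2.5 × 0.5 = 1.25, i.e. price per answer up 25% — which is the Prosus argument that ROI, not cost per token, is the number to watch.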

Demetrios [00:22:01]: Cost per unit of intelligence.

Paul van der Boor [00:22:03]: Yeah, that's a way to describe a token. Right. So you've got a token as just basically a part of a generation and together it's a system that has some intelligence. And so anyway we see the cost per unit of intelligence trends to zero over the long term.

Demetrios [00:22:24]: Zero?

Paul van der Boor [00:22:25]: Yeah. I mean, look at the. I think we spoke about this at the marketplace. The cost of a token equivalent to GPT-3.5 two years ago dropped by 98%. But of course we've got more sophisticated models now, the reasoning models, o1 and so on. But you also asked me about which companies we work with. Right.

Paul van der Boor [00:22:48]: So I'll give you one example. We learned that, we saw that as you build systems that get more degrees of freedom because they're agentic, they can do more things. They can, you know, just.

Demetrios [00:22:59]: You saying that scares me.

Paul van der Boor [00:23:01]: Yeah, but it's also. Well they, they can think of the.

Demetrios [00:23:04]: Way you're wording it just is.

Paul van der Boor [00:23:06]: They're generating, they're generating answers, they're going.

Demetrios [00:23:09]: Out to sources, they've got my bank details.

Paul van der Boor [00:23:12]: If you've given them to Toqan, it will be a great partner for you. It'll be very safe. Yeah, but we realized we need to make sure we understand what the risk is. And so we invested in a company called Prompt Armor. Their mission is basically to quantify, think of it like pen testing for GenAI systems, what the risk is. So we work with them. Of course, we invested in them.

Demetrios [00:23:37]: So trying to get my agent to buy stuff for somebody else, well this.

Paul van der Boor [00:23:42]: is more like on the security side. So it's like a running pen test. So it's on the infra, it's on the entire system. So just think of it like you have a chatbot or a system, it's GenAI-powered, that can give you answers. It can go out there and try to do prompt injection attacks, try to do data exfiltration and all of the other new vectors. Basically, you open up a whole new risk surface area that you need to understand. And that's one example where of course we invest in them, because we think it's a promising product.

Paul van der Boor [00:24:14]: But actually, coming back to the Prosus ecosystem: they can now offer their product, if we believe it's useful and we work with them, to everybody in the group, everybody that's building GenAI systems. And I think that comes back to how we work with founders. It isn't just about investing in them like a traditional fund would and hoping there are fabulous returns, but it's about how can this offering that they have, this group of founders, join this global ecosystem where the sum of one plus one is, whatever, 11. Right. Because we're now working together. What they do isn't just in itself an interesting proposition, but it makes sure it adds, and is additive, to everything else we do across the group. And we see that a lot, and we're doing that increasingly. Of course, our focus is e-commerce, but at the intersection with AI there's a ton of new ideas, propositions, products, techniques, tools that we're looking at very closely to kind of bring into the group.

Demetrios [00:25:15]: All right, now we got Floris with us, and it is cool to have you here to talk about all the things that you've been working on. All right, Floris, so what do you do?

Floris Fok [00:25:24]: Okay, yeah, thanks. Yeah. So I'm an AI engineer at Prosus, and the last year, one and a half years, my main focus was on agents. You know, building agents, testing agents, verifying their use cases, and building them out into real products. So it was mostly like these cycles of: we have an idea, you know, we want to build this POC or MVP and we want to know if it works. And in some cases, you know, I stuck around for a bit longer and I actually did some work to get it into production, but it's a lot of experimenting. And I think you mentioned already early in the podcast, applied R&D, I think, is a good way of positioning it. So.

Floris Fok [00:26:15]: Yeah. So very privileged to have this position.

Demetrios [00:26:17]: I'm always curious what. Because we used to have a lot of talk around what an ML engineer was and Is that somebody who's modeling, is it a data scientist who is specifically working on ML? And now there's the new term of AI engineer. So what is that? Like what is the day in, day out? Are you building evals, you're working with agents, you're creating agents.

Floris Fok [00:26:38]: Yeah, so I would say an engineer is there to solve problems. Like, an engineer will say: okay, this is what we need to have, you know, build it, I don't care how we reach it. Not with AI. Nowadays it's almost all software, so you're part software developer, but you're also part thinking like, you know, how can we position this product, how can users interact with it? You make a bit more decisions than a normal software engineer, because normal software engineers are working on a task-to-task basis. But we, or I as an AI engineer, are more like: I'm solving this task using AI, and how we fill that in, most of the time, is a blank piece of paper or a Miro board, and we just start building.

Floris Fok [00:27:35]: So I think that is kind of the nowadays AI engineer.

Demetrios [00:27:38]: Over the last two years you've been trying to play with agents. What are you looking at?

Floris Fok [00:27:42]: Yeah, well, it will be a larger number than I think many will expect. And I think using the word playing with agent is quite right. You know, we've been exploring, you know, it doesn't always need to be a good idea. We're just trying to work the muscle here. But yeah, I think over 20 projects that were related to building an agent that was solving a specific use case that at some point we thought was a really good idea.

Demetrios [00:28:11]: Yeah, and we'll get back to that because I want to ask a lot of questions around why did you ever think this was a good idea? But the, the other thing that is worth noting is how many now are actually still being used or are real projects. I guess they made it past that filter.

Floris Fok [00:28:31]: Yeah. So there are two that actually made it. But there is a caveat: a few of them were merged into one. Because these were exploratory projects, we saw value in them. Yeah. But as a standalone feature it was like, okay, it's not adding any value, but if we bundle this all, or we add it to our Toqan, then it adds value again.

Demetrios [00:29:03]: So Toqan, for those who don't know, what exactly is it?

Floris Fok [00:29:06]: Yeah. So Toqan is our general assistant. The idea started kind of like having this extra co-worker. So it started also on Slack, now it's also on the web. It's been evolving a lot, still is evolving. But it started as: you just send a Slack message to this agent and it will do part of your work. And of course it started with just simple summarizations, and now we're building it out into more complex systems where it can do full analysis and, you know, you can save stuff and kind of build this project on top of Toqan, having this interaction of back and forth like you would have with a real colleague.

Demetrios [00:29:52]: And the other one that still exists to this day is the SQL analyst.

Floris Fok [00:29:57]: Yeah, the Toqan analyst. Yeah, it's mostly used for SQL. And yeah, that one is really successful, because we really saw it was adding value and we were saving people time and money.

Demetrios [00:30:14]: Yeah, nice. And we're going to do a whole episode on that, like a deep dive case study. Now what I want to talk about, you've seen over 20 use cases. What are some green flags and red flags of an agent that is going to work versus fall flat on its face?

Floris Fok [00:30:33]: Yeah. So to come back to my earlier comment that we bundled a few: I think when we were really trying, like, okay, let's try many ideas, one of our experiments was: what if we did an agent that could do less, so more specific. We call these verticalized agents. Will the accuracy be much better, the consistency be much better, so people trust it more and use it more? So the test we ran was: we had an analyst agent which was making plots in Python, reading Excel sheets, doing statistical analysis, anything. But we saw that sometimes with cleaning data it would make mistakes. So we're like, okay, let's make a cleaning agent separately.

Floris Fok [00:31:22]: So you first go to the cleaning agent and it will clean, and then you can come back to the analyst and it will do the analysis. Yeah, so you have the separation. But actually people were not using the cleaning agent, because they said, yeah, but you know, I'd rather have that 80% of the time where we just finish the task in one go; that is so much easier than me having to switch through an extra step. Yeah. So the extra step was not worth it.

Demetrios [00:31:49]: Makes sense. Now what are other red flags?

Floris Fok [00:31:54]: Yeah, I think every agent that was really hard to test, saying like, you know, it's right or wrong. I had a colleague that was always saying, you know, we're measuring vibes. I think that was a really good marker of things you should avoid when.

Demetrios [00:32:12]: Building agents, if it's not binary, if it's not like: the code runs. And I've heard that a lot with coding agents and assistants, that one of the reasons they are such a strong use case is that it's like, the code runs or it doesn't, it compiles or it doesn't. And you know if the AI-generated code, or the agent that assisted you, worked or it didn't.
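
That binary signal is easy to automate. A minimal sketch in Python, assuming the agent's output is a standalone script (the sample snippets and the `snippet_passes` helper are illustrative, not anything Prosus described): run the generated code in a subprocess and treat a clean exit as a pass.

```python
import subprocess
import sys

def snippet_passes(code: str, timeout: float = 10.0) -> bool:
    """Binary check: the generated code either runs cleanly or it doesn't."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False  # hanging code counts as a failure too
    return result.returncode == 0  # exit code 0: it ran

# One snippet that runs and one that raises: an unambiguous signal, no vibes.
good = "print(sum(range(10)))"
bad = "print(undefined_variable)"
print(snippet_passes(good))  # True
print(snippet_passes(bad))   # False
```

Anything beyond pass/fail (is the output actually correct?) needs a second check, but this alone already removes the measuring-vibes problem.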

Floris Fok [00:32:36]: Yeah, exactly. And you see it also in o1 now. It's really funny that you see the same things that we see in agents in o1 and o3. Because OpenAI itself said, like, hey guys, if you want to do creative stuff, still just use 4o, because 4o is still preferred by humans as being better at creative writing. And that is exactly due to the same issue: o1 is being trained on being right or wrong, and the moment there's no right or wrong, it cannot improve itself. So o1 is amazing at all these analytical tasks.

Floris Fok [00:33:18]: But the moment you're getting into the.

Demetrios [00:33:19]: Creative stuff, it's anything subjective.

Floris Fok [00:33:22]: Yeah. And that's something we saw in the agents as well, is like, you know, when you're more in like this creative side, it's like, you know, how do you know it's right?

Demetrios [00:33:33]: Yeah, yeah. So making sure that whatever the task is, there's a clear way to evaluate if the task was executed or it wasn't. That's another green flag, I would say. Any other red flags that come to mind?

Floris Fok [00:33:53]: Yeah. So it's actually quite funny, because now it seems that the tables are turning. But like a year ago we had this web search agent, and that is one of those agents that also got merged with Toqan, but at the beginning it was a separate agent. And I can still remember the feedback of like, yeah, this can never work because, you know, the latency is way too big. It was doing deep research, and maybe you recognize that name.

Demetrios [00:34:26]: Sounds familiar.

Floris Fok [00:34:27]: It's something that Gemini or Google is now doing. They released this Deep Research, and now people are fine with, like, oh yeah, seven minutes: if I get a Google report with a lot of sources, that works. But you know, we were doing something similar like a year ago, and people were saying it takes too long. I don't know exactly; sometimes it took like 10 minutes. And yeah, I think the biggest change that happened is, you know, we were on Slack, so we couldn't provide this multi-page document. Or in theory we could, but at that time that was not the way we were thinking. We just wanted to have a concise message on Slack.

Floris Fok [00:35:09]: So that's where we're saying, yeah, for this, just this message, it's not worth the waiting time.

Demetrios [00:35:14]: Yeah.

Floris Fok [00:35:14]: So yeah, we didn't kill the project, but we didn't keep it as a separate agent. We just distilled it a bit and moved it into the more general one.

Demetrios [00:35:25]: That's funny because I find that my own workflow, I tend to ask AI a question and then go do something else. And so I'm in the camp of I'm totally cool with just waiting, seeing what happens, coming back to it. When I get to it, sometimes I forget and then come back a day later and it's like, oh yeah, yeah.

Floris Fok [00:35:45]: But we were in an era where ChatGPT was the norm, you know, and that responded immediately. You had the streaming, so within 300 milliseconds the first word started and you started reading. So if you then introduce a system that you need to wait five minutes for, people are like, no, that's too much. So it's also the public adapting to this view of these agents doing stuff. The more people know that there is work being done, the more they appreciate that waiting time, and they're actually like, oh yeah, it's normal. And like you're saying, you're doing this asynchronous work. I even have, like, three tabs open and ask it three questions at the same time.

Floris Fok [00:36:27]: And you're really multiplying your multitasking. Yeah, multitasking has a new definition now. It's quite funny.

Demetrios [00:36:36]: What are the ones that died?

Floris Fok [00:36:38]: So we had this hackathon, you know, since we're really like, there are no bad ideas, just develop. So the idea was: let's get the whole AI team for 24 hours, or it was a bit less, and make agents. And one of these ideas was one where I thought, yeah, this is going to be the agent from the hackathon. I would have invested. It was the Jira agent. Yeah.

Floris Fok [00:37:11]: So it was doing the JIRA tickets.

Demetrios [00:37:14]: Nobody likes Jira. That's so true.

Floris Fok [00:37:16]: And the thing was, we saw it working. So in this test setup, where they built the Jira board with the agents and they started adding tasks and changing tasks and asking for summaries about tasks, it was working really nicely. You know, again, in Slack, it was super useful. But the fun thing was, the moment we connected it to our Jira, saying, like, okay, we're going to be the first beta testers, it completely broke down. You want to know the reason why?

Demetrios [00:37:47]: What happened?

Floris Fok [00:37:48]: It was the human text: all these acronyms and all these really short sentences, the minimal information that was needed for a human to understand. It was messing up the agent. It was saying, like, you know, this is not a description. Half of the tasks didn't even have a description. But all the humans in the team are like, yeah, of course we're building this project, so this must be that. All that context was not available to the Jira agent. So that's why it completely did not work, and that's why we said, okay, let's not continue with this.

Demetrios [00:38:27]: It makes me wonder, though: if today you would take a different stab at it, and you added in some kind of a knowledge graph with Slack messages and with emails or with other context, do you think it would have been more successful?

Floris Fok [00:38:43]: Yeah, I think today I would make an interview agent as well, and first interview the team, saying, you know, provide me all the current information. Then convert that into documents and then supplement the Jira agent with that: if things are missing, look at this interview I had with your colleague.

Demetrios [00:39:03]: Yeah.

Floris Fok [00:39:04]: Maybe that clarifies, you know. That is a more stable approach. But another approach, and I think something that we'll be seeing much more of, is if you go AI first. Because the reason why that board was messy was because we needed to type every word ourselves.

Floris Fok [00:39:25]: You're doing basically SMS language, you know: you're trying to use the least amount of keystrokes. Yeah.

Demetrios [00:39:34]: Like T9 back in the day with people texting on flip phones.

Floris Fok [00:39:39]: Yeah. Like, instead of writing something out, I just add an emoji. But if you go AI first, you know, and you say, I built this board with AI and then I maintain it with AI, then there's this chance. That's the reason why the test was really successful.

Demetrios [00:40:01]: Yeah.

Floris Fok [00:40:02]: Because the test was built with AI and then questioned with AI, and then it understood its own language. So that is also a route you can take: okay, you just need to force people to remake the entire board.

Demetrios [00:40:17]: But, but it would be a board, it would be a whole separate tool.

Floris Fok [00:40:21]: It wouldn't have to be; using Jira or not is a design decision. You can still use Jira, but then just a fresh board, or you can make your own UI.

Paul van der Boor [00:40:33]: I think that's maybe something that we've learned over and over again: as you bring these systems into existing workflows or ways of working, in particular when we go into the e-commerce world, people have expectations around patterns of use, and of course it's not surprising, but it's so important to get that right. And in our world, we gave Toqan access to our GitHub and it would comment on code and so on. And we switched that thing off in no time because it was so noisy. And then we tried other products, like CodeRabbit, and it was very similar. Because at the end of the day, it's cheap to generate content and comments, but there's still cognitive load to go through it, and you want to spend that on high-value information. And so these things. One of our missions is to become sort of the best AI-first team. Right.

Paul van der Boor [00:41:35]: So we have AI assistants everywhere. We've got our own AI statistician, We got all these little AI layers. So we test everything, but very frequently on some of these workflows we let go of the tool because it doesn't make sense yet.

Floris Fok [00:41:49]: Right.

Paul van der Boor [00:41:49]: It doesn't work. And part of it is our expectations and how we interact with each other and the tools. But also the tools: like Jira in Floris's example, it wasn't made for interacting with these agents as it is today. Maybe it will be in the future, but it's not there yet.

Demetrios [00:42:08]: Linking that back to what you were talking about earlier on, the cost per intelligent unit, or what did you call it?

Paul van der Boor [00:42:16]: The cost per unit of intelligence. Yeah.

Demetrios [00:42:19]: And you think about how that is not a unit of intelligence. It's outputting something, but it's actually a unit of distraction.

Paul van der Boor [00:42:28]: Yeah. In this case it's. Well, it's costing our cognitive load. And you want to do the opposite.

Demetrios [00:42:34]: Right.

Floris Fok [00:42:34]: And every right question saves time, but every wrong question spends it.

Paul van der Boor [00:42:40]: Yeah, yeah. One thing to add here is that this come back to this theme of cognitive load that we add sometimes without thinking about the current state. We've got a big platform called olex classifieds platform, millions of listings being uploaded every day, and a natural place for us. A good example of the kind of work that we do is we try to see how agentic systems can help people transact goods. And I think it's one of the strange consequences of ChatGPT is that we've tried, everybody's tried to basically make a chatgpt for X for everything. Right. So we also naturally, when we started that journey, said, hey, we need a ChatGPT for OLX for people that try and buy and sell stuff on classifieds. And we realize that people today, I mean, hindsight is obvious, but they go to a website, they see a ton of images already, they have a search bar and that's how they discover.

Paul van der Boor [00:43:38]: And then we said, you know what, we're going to introduce a conversational agent. But what's the cognitive load placing on this user? Right. This user needs to come in and say, I'm looking for a piece of furniture for my home, which is such and such style and it needs to be under. And people just wouldn't use it. And so there was a huge, let's say, drop off. Because of that additional friction, we'd introduced the cognitive load for people to input things and even when they did use it, they put in blue couch.

Demetrios [00:44:11]: Yeah. So it's the same as search, it's.

Paul van der Boor [00:44:15]: Just a search bar. And then the agent would come with 500 tokens of questions and content and then the user would say, cheaper.

Demetrios [00:44:28]: Yeah. There is another thread that I wanted to pull on where you're talking about different layers of if an agent project makes it into production and one is you design it a certain way and you take as the creator of this agent certain design decisions and then the other is later it's out in production. And maybe it's increasing the cognitive load, but it could be that it's increasing the cognitive load on the user because they don't know how to properly use it, or it could be because it's a shitty project.

Floris Fok [00:45:07]: Yeah.

Demetrios [00:45:07]: So you have to decide later which one of them is it? Do we need to educate the users more or do we just need to kill the project or take a different design decision?

Floris Fok [00:45:16]: I think that blue couch example of Paul is amazing. Like, you know, 100%. All the developers working on that project were nicely filling in the full prompt, you know, like typing like, hey, I want this couch that looks like this. And the moment they indeed gave it to real users, they were like blue couch.

Demetrios [00:45:37]: Which seems so clear in hindsight. Of course, I don't want to have to fill out a form. I don't want to have to put more than I need to to get what I want. If I can do it in one click, that's better than typing out words.

Paul van der Boor [00:45:52]: Now there's another side to it right now, I think to the kind of things that we learned as we build these agents is that. But these systems have a much better ability to understand complex queries. So as soon as you put in something like modern couch, most search engines today fail. But a genai based system can actually understand what modern is and what that may look or feel like. And so we've sort of leveraged that and said, okay, well actually we need to represent our catalogs in ways these agents can link more complex queries. Whether a modern couch is an example in the classified space, but in the food space you can say something light and healthy. No search engine today in any food ordering platform knows how to handle that, but we actually can. These LLMs can suggest something light and healthy.

Paul van der Boor [00:46:45]: Give me five suggestions of what they look and feel like. But then you need to match that to the underlying catalog. So then you need to have a system that does this sort of what we've called magic or smart search to retrieve that. And then you have another layer which is like, well, if we can actually understand that and we want to overcome that friction, there are places in the world where we work, like Brazil and India, where people work with voice. And so if people can actually say through voice, hey, I'm looking for a quick meal tonight in my house for two people, that's fairly frictionless, right? If you can send it through a voice message and actually an agent can decipher that and say, oh, house is there. This is where they live. This is what the. I mean, these are meals that would satisfy two people.

Paul van der Boor [00:47:33]: And so you can actually take a much more unstructured, different modality of input from a user, give that to an agent, they can process that and translate that to a set of items that then can be presented to the user. So there are other opportunities that open themselves up because you've got stronger reasoning capabilities, multimodality and so on.

Demetrios [00:47:51]: Yeah, yeah, I do like that, how you don't have to think in the traditional way. And that's what's becoming clear: if you're trying to fit the agent into old workflows, it almost feels like a square-peg-in-a-round-hole situation. But when you start thinking out of the box, you think, okay, since the agent can do just about anything that we throw at it, what can we make into a new workflow, one the user isn't already trained on from how they're used to using the app or interacting with this or that? And you're going to get inevitable dead ends on that path, which I think you saw with the glorified search bar. But then, yeah, the voice note sounds incredible. If I could just send a voice note to an agent that would give me suggestions all the time, that is a really awesome use case.

Demetrios [00:48:47]: What are some other use cases that died though? This is where I want to hear.

Floris Fok [00:48:52]: Back to the graveyard.

Demetrios [00:48:53]: Yeah, I want to hear more than.

Floris Fok [00:48:54]: Halloween episodes.

Demetrios [00:48:57]: Because it's almost like that's where the best learnings are. Right? You always see people writing blog posts, and especially companies of your size, they're writing blog posts about the successes, but you don't really hear companies talking about the graveyards and what they had to do to get to that success.

Floris Fok [00:49:16]: I think I have one more that is quite interesting, because normally we were always so heavily focused on "can it scale", but here we had one example where we kind of forgot that part, and it also ended up in the graveyard. We called it the UX Researcher. We had real people coming to us with an issue, saying, hey, I have all these open forum questions that we get as reviews or comments on our products, but there are too many for me to process. Can we build an agent that goes through those comments and kind of summarizes them? Like, hey, name me the top three features people dislike about our site. These are questions that we foresaw, and we're like, okay, this is indeed something agents can solve. And when we started with this, we built this whole tool that was analyzing row after row, doing like a map-reduce.

Floris Fok [00:50:25]: So it was first checking what are the subcategories, then dividing into subcategories, and then for each subcategory finding given the user's objective, like what is then the answer? So it was combining all these techniques. It was super fancy.

Demetrios [00:50:43]: Wait, can I stop you right there real fast? Because if it was, was it a clearly defined workflow each time, or was it that you asked the agent and the agent would figure out the workflow on its own?

Floris Fok [00:50:59]: So it's. It could manipulate which workflow, but it was quite repetitive.

Demetrios [00:51:05]: Okay, so it almost choose its own workflow. Yeah, and that was where the agentic.

Floris Fok [00:51:09]: Part came in and, you know, we tested it with, you know, excels of like 100 rows, a thousand rows. And we're, we're testing this. And I was like, ah, we're working fine. And then we went back and they were asking stuff for excels of 100,000 rows. And it could not do that. You know, it was. It was like, like we knew it was going to be large, but, you know, we thought like 100 or a thousand, you know, because it could do like 10 without any sophisticated. So we already like multiplied it by 100.

Floris Fok [00:51:47]: But it was devastating. And the worst part was also that we designed this in a row by row and they had their answers in like a verticalized. They just transposed their. Their whole table. And it also broke the entire thing. And it was just a mess because we thought, we were so enthusiastic. We're like, this is a great use case. And we saw all these ways how we would solve it, but we completely forgot how they would solve it and asked.

Floris Fok [00:52:15]: We didn't ask enough questions.

Demetrios [00:52:16]: The basic question, how big of a file are we talking?

Floris Fok [00:52:19]: Yeah, but there wasn't a period where we were like, there are no stupid ideas. We just need to make agents. Agents, agents, to see what sticks.

Demetrios [00:52:29]: You did mention one thing about why you have that mentality, which I thought was pretty cool. And you all look at it like it's going to the gym and you're building the muscle of creating agents and you're trying to figure out how you can create these new workflows, these new products that are agentic first.

Floris Fok [00:52:50]: Yeah. Because if at some point, if you made enough agents, you know, it's like learning physics. You know, if you learn something new in physics, you walk around the real world and you see that, that formula taking form in real life. And the same happens with agents. You know, if you build a few, you'll be. Instead of, if you enter a site, you know, instead of seeing the ui, you know, you start to see tools.

Demetrios [00:53:14]: Yeah.

Floris Fok [00:53:15]: You know, you're like, hey, this, this can be a tool, and this can be a tool. And then I have a chat window and then I can just remove this entire ui, you know, that's how you start.

Demetrios [00:53:24]: Until a user asks for a blue couch.

Floris Fok [00:53:26]: Yeah. Well, it's like this whole new way of thinking and looking at things. It's something that you need to practice. Because the first time I saw the agent, you know, I remember I worked a week at Process and I was sat in this room and the only thing we knew is like, we're going to test a new agent. You know, Ahmed built this one. It was the analyst, it was doing like all these pattern analysis and they just gave like the chat window, like good luck, you know, go test it. We need to load test it. You know, does it scale, blah, blah.

Floris Fok [00:54:01]: And I was amazing. You know, it was, it was like I didn't know, I didn't know how it was doing it, but it was doing it. And, but it was also like, you know, where are the limits? You know, it was really, really hard to find those, you know, in the beginning because you didn't, you didn't know what the system was. And the more you were and at that time, you know, you just did some things, but you really saw that after three months of working with it and developing it, you know, where you were better, much better at testing it, you know, finding those edge cases because, you know, okay, this is how it works. So this is how I can annoy it or this is how I can make sure it works. And yeah, that was quite interesting to see this muscle grow.

Paul van der Boor [00:54:44]: Let me add to that why, because you asked: why do we do that? Right. And it's exactly that we need to understand what makes these things work. Why is it important for us as Prosus? Because we fundamentally believe that these agents are going to be able to help us build better products for our users. And we've made predictions around this. Right. You were at the marketplace. So one of the predictions we made was: in a year's time, 10% of the actions done on our platforms will be done by agents on behalf of our users. That's a pretty bold prediction.

Paul van der Boor [00:55:21]: And whether it's in 12 months time or 36, it will happen. We're fairly confident about that. Because these systems. Why won't you send out your agent to help you get whatever you need, whether that's food or other things, if it can do that reliably for you. But we can only build those things if we fully command the technology and have a very good intuition. And you can see how floor is, you know, by basically trying a ton of things with the rest of the team has developed that intuition. Right. We said we're not ready for that.

Paul van der Boor [00:55:53]: But this sure, that tool we can build, it'll get us to 80% accuracy. We measure these things, we test the tools and so on. So that's the larger picture of why.

Floris Fok [00:56:02]: It'S also because it's awesome, fundamental, playable. Yeah, it says the wow effect. You know, I think that agents won't give me that wow effect. I think that will still be a wow before that will be removed.

Demetrios [00:56:17]: And so maybe you can give us some just tactical things. When you're putting agents into production and you want to make sure that you've covered and you've checked all of the boxes, what are some things that you've learned or you've done that have helped you to make that jump?

Floris Fok [00:56:38]: On the B2C side, I think there are people who know much more than I do. But we did work with a few agents that were around data. So the data analyst.

Demetrios [00:56:50]: Yeah.

Floris Fok [00:56:51]: And there we really saw that improved in prompting and security is when we would repeat after we gave the answer is like we gave this answer under these assumptions. And so because we really tried to ask as many questions to make sure it was not ambiguous question. But that's just hard. And there were still a few questions that came through that like that defense mechanism. But at the end we were like, okay, then let's just recap and saying like, okay, you gave this question, I did this. So that means that I made these assumptions. And listing those at the bottom is also a way, a security way of saying like, you know, maybe I made a mistake. Because you don't.

Floris Fok [00:57:44]: You want to minimize mistakes, especially when you're doing data analysis. Yeah, because we want to position this tool as like, you know, you want to make decision based on this. Yeah, you know, we want to make. You want to make everyone be able to make decisions on data now the better those decisions are good. So one of these mechanisms is like the assumptions. And I think we haven't seen that in other tools.

Demetrios [00:58:06]: So you're just asking it as a final step in the prompt, like, tell us what you did, tell us what the prompt was.

Floris Fok [00:58:14]: Yeah, it's a separate mechanism. So it's not the agent itself. You know, we really want to kind of let the agent do its thing. But there's like a second LLM call or agent call that basically reviews the steps and saying like, okay, user started with this question. But I've seen you also added this filter in the SQL query. I see you change the date format. You know, maybe that changes things.

Paul van der Boor [00:58:39]: You know, it's like a proofreader basically. Right? Like a layer of checking before and validation before it gets sent back to the user.

Demetrios [00:58:47]: Yeah, it's critiquing everything, but it's not.

Floris Fok [00:58:49]: Saying like it's wrong or right, you know, but it ends to the user, like, is it right that I made these assumptions? Because mostly those assumptions are made because there was no other way of Calculating it or it was because it's some rule and some document that we added to the agent, you know.

Demetrios [00:59:04]: Okay, other thing that I want to finish on is your view of the evolution of prompting.

Floris Fok [00:59:11]: It's come a long way, you know, so like the first time I was using the large language models in some kind of like assistant way, it was like the NEO X, it was, it was an open source model, 20 billion parameters. And I remember, you know, prompting it like as if I was writing a paper and then stopping at some point and then it would like finish some complex question because it would do like, it would write the paper that would answer that question, which was insanely trickery. You know, it was, it was, it was basically we're tricking the lem, you know, and then three point the da Vinci came. And still, you know, we needed these tricks, we needed examples, we needed to kind of massage it into this pattern.

Demetrios [00:59:59]: Yeah.

Floris Fok [01:00:00]: And then the era came of instruct models, you know, and that's the beginning of chat GP where you could just ask a question and it would understand that that is an instruction. But this development kept continuing, so people thought like, okay, the moment you can ask a question, it works. But we've seen it in these system prompts, as you call them, that in the beginning we needed to tell them every single thing, like, this is how Python works, this is how you use, this is how you be friendly, this is how you use emoticons in your message. It's like we were writing the tiniest bits of corrections that we want to see consistently we needed to write down. So you had system prompts of like, like 3,000 tokens and maybe even more for some agents. But over time we had struggle converting these prompts from model over model. But what we actually saw is like if you just removed everything and started again with empty prompt and then adding the parts that indeed were failing, you ended up with a shorter list. So what actually was happening, you know, OpenAI was training these models better and better in doing many of these things people were forcing it into, to be part of the native behavior of agents or of models.

Floris Fok [01:01:23]: And that is, that is, that is a trend that I really see. Like if I now build an agent, you know, I literally start with three lines of system prompt. Wow, that was unimaginable in the time of 3.5, you know, it was not possible.

Paul van der Boor [01:01:37]: I think that's a really great outline of prompting, right. Going from these base models that were just dumb token. Next token predictors, the core autoregressive function to actually instruction fine tuning. Well, actually you at first had the few shot, then you had instruction fine tuning, then you had alignment. Now you have. Well, now you have the 01s, which basically do like chain of thought suggesting before they actually start executing. But because we're talking about agents, we also see that if you look at the prompt that we use in our gentic systems, they're essentially. It's a piece of code where you start inserting all sorts of parameters, right? So it's basically like dynamic prompt building or composite prompt building, where you've got placeholders for all sorts of things that come from the system.

Paul van der Boor [01:02:29]: And it can be information about the session or the context or the user or whatever. But also of course, you've got the tools and the function calling that you need to describe where. The way you describe it, flora, I think is absolutely right. Now we're kind of. You can't put in 2000 functions and describe them. It doesn't work yet. You can do a couple. We know kind of where the sweet spot is, depending on the model.

Demetrios [01:02:55]: Where have you found? Is it like 10?

Paul van der Boor [01:02:57]: It depends on the model, but no, it's more.

Demetrios [01:02:59]: You do.

Paul van der Boor [01:02:59]: It depends how complex the functions and so on.

Floris Fok [01:03:02]: And if you need to chain the functions, if they look alike, if they're super far apart, you can add as many as you want.

Demetrios [01:03:08]: It's when they look alike, right? Then it gets confused and it's like, yeah, that's the same, isn't it?

Paul van der Boor [01:03:12]: So we build evals on one. Can it actually pick the right function at the right moment, but then the next step is you've picked the right function, can it actually provide the right parameters for that function to be executed? Typically if you do code execution, you need parameters, or if you go to the web, they need parameters, search queries and so on. And that's a second evaluation, right? Can you actually ensure that when you've identified the right function, you pass it the right information to come back? Now, all of that stuff comes out of that prompt, right? So in fact, your question of how this prompting changed is super relevant for folks building agents because the way you think about a prompt and the orchestration around it, what information you pull in, what information you get back, if you do sequential chaining of tools in the agentic workflow, all that stuff needs to somehow it's very stateful, right? Needs to be stored somewhere, it needs to be managed. So anyway, we've ended up in all sorts of worms, cans of worms, because if we tried to make these things work as you change the model, add.

Demetrios [01:04:21]: A tool breaks everything.

Paul van der Boor [01:04:24]: Well, yeah, you need to make sure you understand what breaks.
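
The two evals Paul describes, first whether the agent picks the right function and then whether it passes the right parameters, could be sketched roughly like this. This is a hypothetical harness, not Prosus's actual code; `call_agent` and the case format are made up for illustration:

```python
# Hypothetical two-stage eval: (1) did the agent pick the right tool,
# (2) given the right tool, did it supply the right arguments?
# `call_agent` stands in for whatever client returns the model's tool call.

def eval_tool_choice(cases, call_agent):
    """cases: list of dicts with 'prompt', 'expected_tool', 'expected_args'."""
    picked_right = 0
    args_right = 0
    for case in cases:
        tool_call = call_agent(case["prompt"])  # {"name": ..., "args": {...}}
        if tool_call["name"] == case["expected_tool"]:
            picked_right += 1
            # Only score parameters once the right tool was chosen.
            if tool_call["args"] == case["expected_args"]:
                args_right += 1
    n = len(cases)
    return {"tool_accuracy": picked_right / n, "arg_accuracy": args_right / n}
```

Splitting the two scores matters: a low `arg_accuracy` with a high `tool_accuracy` points at the parameter descriptions in the prompt, not at tool selection.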

Floris Fok [01:04:27]: When it breaks, you learn something. I think that is really important. It's almost the first question I ask people who've built an agent or an agent system: what can't it do? Because that's really, really important. On one project, we constantly knew what it couldn't do, so we knew that was our next target. And then once we were able to do that, we'd spend a few minutes on, okay, where does it break now?

Demetrios [01:04:59]: Yeah.

Floris Fok [01:05:00]: And then, okay, that's our next target. And you move on, from target to target, until you're like, okay, these tasks are edge cases. We still know it doesn't work there, but that's off limits.

Demetrios [01:05:14]: Well, that goes back to that binary execution, right? Because you know, did it complete the task or not?

Floris Fok [01:05:21]: Yeah, exactly. Binary. A lot of people frame it as right or wrong, but really it's task completion. If I have eight steps to finish a task, I don't really care how it achieves the task. I just want it to achieve the task at a certain consistency. So that's also a binary thing. It doesn't have to be more than that.
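
That outcome-only scoring might look something like this in practice, a minimal sketch where `run_agent` and `check_done` are placeholders for a real harness:

```python
# Binary task-completion eval as Floris describes it: run the same task
# several times, check only whether the end state is correct (never how
# the agent got there), and report the completion rate.
# `run_agent` and `check_done` are placeholders for your own harness.

def completion_rate(task, run_agent, check_done, trials=10):
    passes = sum(1 for _ in range(trials) if check_done(run_agent(task)))
    return passes / trials
```

The "certain consistency" then becomes a concrete threshold, e.g. requiring a rate of 0.9 over ten trials before calling the task solved.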

Demetrios [01:05:46]: But then there's also the almost higher view of: are these the right tasks? If you ask an agent to do something, it may complete all the tasks, no problem, but they weren't the right tasks to begin with.

Floris Fok [01:05:58]: I think that's a good one, where you bring time back into it. If you waste more time trying to get that task to work, yeah, it is then automated, but is it worth it? The cognitive load of checking it and making sure it works needs to stay in proportion to the value it delivers. And with what you mentioned earlier in the podcast, the computer use, the web use, I think we're really in a stage where there will be a period where we're saying, okay, it can do it, but maybe I'll just make that pivot table myself, because typing it out will probably take longer than.

Demetrios [01:06:41]: And I want to use my computer.

Floris Fok [01:06:43]: Yeah, yeah. And I don't want to sit behind it. That's also one: does it save you time if you're not able to operate the computer at the same time?

Demetrios [01:06:51]: Or you need a second computer just for your agent.

Paul van der Boor [01:06:53]: I mean, for us, we generally think: make it work first, then make it fast, because users don't like to wait, and then make it cheap. And we're typically always pushing the frontier of: does it work? So it's perfectly fine to spin up basically ten agents that will try to solve your task, and whichever one gets to it first wins, because having a right answer is more valuable than the cost of running nine or ten of these things in parallel. So we're always trying to push the boundaries of: can we make it work?
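
The racing pattern Paul describes, launching several attempts in parallel and keeping the first valid answer, could be sketched as follows. Illustrative only: `attempt` and `is_valid` are placeholders, and a real version would also cancel the losing attempts and cap cost:

```python
# "Make it work first": race n parallel agent attempts at the same task
# and return the first answer that passes validation.

from concurrent.futures import ThreadPoolExecutor, as_completed

def first_valid_answer(task, attempt, is_valid, n_agents=10):
    with ThreadPoolExecutor(max_workers=n_agents) as pool:
        futures = [pool.submit(attempt, task, i) for i in range(n_agents)]
        for future in as_completed(futures):
            answer = future.result()
            if is_valid(answer):
                # Note: exiting the with-block still waits for the
                # remaining attempts; real code would cancel them.
                return answer
    return None
```

The economics Paul mentions fall straight out of this: you pay for all n attempts, and that is acceptable while the priority is proving the task is solvable at all.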

Floris Fok [01:07:28]: It's also interesting for Prosus itself to know which tasks are solvable by AI, because then we know there's a time factor: within X amount of months or years, this will be viable from a cost perspective. So we just need to know it could be solved. Whether it's the right time is then another question.

Demetrios [01:07:48]: Yeah, a huge shout out to the Prosus team for their transparency, because it is rare that you get companies talking about their failures, especially companies that are this big in the AI sector, and really helping the rest of us learn from what they had to go through, sometimes so painfully. A quick mention that they are hiring. So if you want to do cool stuff with the team that we just talked to, and even more, hit them up. We'll leave a link in the show notes. And if you're a founder looking for a great design partner on your journey, I highly encourage you to get in touch. We'll leave all the links for all that good stuff in the show notes.
