Getting to Grips with Web Agents

Posted Feb 25, 2025 | Views 27

# Token Data Analyst

# AI Agent

# Prosus

speakers

Paul van der Boor

Senior Director Data Science @ Prosus Group

Paul van der Boor is a Senior Director of Data Science at Prosus and a member of its internal AI group.

Chiara Caratelli

Data Scientist @ Prosus Group

Demetrios Brinkmann

Chief Happiness Engineer @ MLOps Community

At the moment Demetrios is immersing himself in Machine Learning by interviewing experts from around the world in the weekly MLOps.community meetups. Demetrios is constantly learning and engaging in new activities to get uncomfortable and learn from his mistakes. He tries to bring creativity into every aspect of his life, whether that be analyzing the best paths forward, overcoming obstacles, or building lego houses with his daughter.

SUMMARY

This episode explores the concept of web agents—AI-powered systems that interact with the web like humans do, navigating browsers instead of relying solely on APIs. The discussion covers why web agents are emerging as a natural step in AI evolution, their advantages over API-based systems, and their potential impact on e-commerce and automation. The conversation also highlights challenges in making websites agent-friendly and envisions a future where agents seamlessly handle tasks like booking flights or ordering food.

TRANSCRIPT

Demetrios [00:00:00]: Today we're talking web agents. This episode is going to be all about how the Prosus team has been leveraging web agents and why you would even want to go down the path of trying to get web agents to work. Because that was my biggest question: isn't it just adding complexity when you can use these API agents? Let's get into this episode. This is the MLOps Community and Prosus collab. We are doing this limited series all on agents in production. All right, man, web agents, here we go.

Demetrios [00:00:38]: We're getting into what they are, what they aren't. And you told me something yesterday or two days ago about how computer use is full use of the computer, like the name states. And then web agents is almost a hybrid of APIs and computer use. What exactly is a web agent? Why are they useful? Let's get into it.

Paul van der Boor [00:01:04]: Yeah, so we talked about what agents are. And my simple definition, going back to that, is that agents are LLMs that interact with the world. And one way to interact with the world, certainly one way that we as humans interact with the world is through.

Demetrios [00:01:21]: The Internet, the Web.

Paul van der Boor [00:01:22]: Right. And browsers more specifically. And because at Prosus, we're a large tech company, we interface with our consumers, about 2 billion of them, all over the world, through our companies and their products, which are almost all entirely on the web or through apps. As we think about what's the next stage of agents, of course, we're also exploring how the web can be used by agents to help users navigate, find, discover, transact, buy, and so on. And that's why we have an active, let's say, team working on web use for agents.

Demetrios [00:02:01]: Why only web agents? And why not just go full all in with computer use?

Paul van der Boor [00:02:05]: Yeah, I think the way we're starting to give these agents the ability to interact with the world is gradual. And that graduality, that incrementality, sits in the fact that you increase complexity step by step. The first access that we gave LLMs to the rest of the world was through simple APIs and function calling. And now we're saying, well, maybe they can actually interact with the world through a browser. And then of course, after that, it will very likely be the computer, as we are already seeing, and then maybe after that, the physical world. And so the web browser is the next step, the natural step that we wanted to take, also because, as I said, a lot of our interfacing with our consumers and customers happens through the web.

Demetrios [00:02:54]: I know that there is a ton of e-commerce stuff that you all are doing; OLX, for example, is a huge one. And if you have web browsing with agents, why would you use that instead of APIs? And how would you build an e-commerce site that's optimized for web browsing by agents? How do you think about that?

Paul van der Boor [00:03:17]: Yeah, let's start with the question why. Why would we want to do that? Well, as I mentioned, I think it is a very important way that we interact with the world today: the way we discover things, we learn, we may buy and shop and exchange goods in general. And so that's why we're trying to solve that at Prosus: all of our companies have a web interface to the world. And then the question is, okay, well, how do you start going down that road? I mean, why not just through APIs? Well, the truth is that most of the things that we want to give agents access to aren't made for agents to access. They aren't API-ified yet. And so.

Demetrios [00:04:05]: Wait, what do you mean by that? Like a photo on a website, or what is not API-ified?

Paul van der Boor [00:04:13]: Well, one of the predictions we've made at the marketplace is that 10% of the e-commerce transactions will be done by agents on behalf of consumers in the next year. Right. Whether it's 10% or not, I think it's safe to assume that agents will start to take actions on behalf of us on the web in some near future. But then the question is, okay, which types of action? Let's say you want to order a flight or you want to order a pizza to be delivered to your house today. You obviously go to the browser, you have a mouse, you click around.

Paul van der Boor [00:04:50]: There's no API for all those actions yet. Right. You go to a URL and the agent needs to then figure out, okay, where does it click? For example, does it do that based on vision on the site map? And so that's not a regular API. Obviously, there's a lot of APIs behind the website that you're interacting with, but there's no obvious place to kind of plug the agent into in a very structured way that the agent, for example, today interacts with all sorts of other tools that we're giving them access to.

Demetrios [00:05:24]: Yeah, and I want to call out too, that I've heard the pros and cons for both API and web browsing or computer use agents. And we're going to be having a full debate on this particular topic to hear different thoughts and viewpoints from engineers on why to use one and why to use the other and why it's the future. So when I think about web agents, one of the clear value props is that you can build it once and it can go out and do what it needs to do. And you don't rely on APIs, which is a huge selling point because APIs are very finicky. And so I've said this before, that you don't have to think about building a bunch of different API calls and then the agent has to choose which tool it's going to use, which API it's going to go out and find, and you don't have to worry about, oh, this API now changed the way that it works, and so we have to update, and then it wrecks the agent. You just build the web agent, it goes out, it explores, it interacts with the web in the way that we think about interacting with the web. And so it is more human-like.

Demetrios [00:06:41]: Now let's take the second part of this, which is how do you think about building websites, particularly in e-commerce, that are agent friendly? Web browser agent friendly.

Paul van der Boor [00:06:57]: I mean, as agents are developing, it's even unclear what agent friendly means. But to go back to your question, why not just use an API? An API today, as those of you listening to this podcast will understand, means you've got some kind of parameter and input, you send it somewhere else, and you expect a standard response. Even in the GenAI world, it could be like, hey, I've got a prompt that goes to an API, and that prompt is passed to a text-to-image model, which returns an image in a certain format. You know what's expected in, and you know what you can expect back. There's no API, for example, for Amazon.

Demetrios [00:07:37]: Or even LinkedIn doesn't have an API.

Paul van der Boor [00:07:41]: You can't say to, let's say, Amazon, "a gift for my niece" as the query, and then get something added to your shopping basket back. There's actually a set of steps that we do as humans to get there. You obviously look at the search, use the query, you search, you look at certain items, you click around, you maybe read reviews. So there are a lot of things that go into the shopping experience, and the same is true, by the way, for the food ordering experience, which we do a lot of, as we talked about, or finding secondhand goods in the classified space where you need to talk to sellers. And so all those steps are things that we now use the web and the browser for, and we're exploring how well agents could do any of these steps in the value chain.

Demetrios [00:08:25]: And you can't do that with APIs. That's a very clear reason why you would want to do it with a web agent.

Paul van der Boor [00:08:31]: Yeah, well, it's already designed for us. I mean, that's how billions of us interact with the web today, through the browser. If you can do that reliably, then that means you can very quickly start to give these agents a whole bunch of other capabilities. Because you can say, hey, go check online and book me a restaurant for two tonight at 7pm, and it will be able to look at what's available and so on, because it can just browse the web like you or I would.

Demetrios [00:08:59]: I will say, since we've been talking about agents so much in the last couple days, this whole week, I've had like an agent boot camp, coming to the Prosus offices and talking to everybody on the AI team about what they're working on, how they're dealing with agents, and what some of the challenges are. When I open my calendar to find the location of this place that we're recording this podcast at, and then I have to copy and paste the location into Uber or into Google Maps, I'm sitting there and I'm like, this is so backwards, because agents will be taking over this. Or at least, it doesn't need to be an agent: how does the intelligence of my phone not know that I have a calendar invite for this time? Maybe I should just already have an Uber that is being ordered so that I can get there on time.

Paul van der Boor [00:09:55]: Yeah, I think you're already thinking a couple of steps ahead in terms of navigating across an operating system and multiple apps. We're looking at frameworks that are made to do that sort of computer use across apps. Certain benchmarks like OSWorld measure against these kinds of tasks that require opening a directory, loading some files, reading them, putting them into Excel, and opening a pivot table. Right. Those are sort of multi.

Demetrios [00:10:26]: Multiple steps.

Paul van der Boor [00:10:27]: Yeah, multiple actions. Maybe copying the address like you mentioned, and ordering an Uber through their app. I think that's definitely coming. But take even the step before that, within one website. Let's say within an e-commerce website, like OLX's website, Glovo, iFood, eMAG, or Takealot, all these websites that we have in the group: if you want to go and buy something, even that is difficult because there are pop-ups coming in. If you're trying to order some food, it'll ask what side dishes you want, what toppings you want. Obviously there will be a whole bunch of other filters.

Paul van der Boor [00:11:10]: Sometimes we've designed things to ask "are you a human or a robot", like CAPTCHAs and so on. Those are all things that make it hard to solve the entire task of getting food or ordering, whatever.

Demetrios [00:11:25]: Hard for humans even.

Paul van der Boor [00:11:27]: Well, hard for some of us for sure. And even harder for agents at this moment.

Demetrios [00:11:32]: Yeah, it just reminds me of the whole conversation we had a few episodes ago about cognitive load, and how we want certain things to take the least amount of cognitive load possible so that we can use all that precious brain juice for something that actually is going to take the whole cognitive load. But let's now skip into the part where we get to talk with Chiara, who has been working on web agents for the last six months. She spent half a year diving into it and getting real contact with what's been working and what hasn't been working, playing around with web frameworks. Let's learn from her and get her insights.

Paul van der Boor [00:12:18]: Great.

Demetrios [00:12:19]: Chiara, you're here. Thank you for joining us. And I would love to start out with a brief overview of the project that you've been working on so that people can know the web agent journey that you've taken.

Chiara Caratelli [00:12:35]: So the first project that I worked on when I joined the team was to build an agent that will help people order food.

Demetrios [00:12:42]: Oh, nice.

Chiara Caratelli [00:12:43]: And it sounds like a simple task, but it's really complex. You need to understand the user: what they would like, what their dietary restrictions are, the context, also what time of the day it is, where they are located, whether there are events in the area. And this agent should be able to order food. So it should go to food platforms, maybe several of them, advise the user on what's available and where the promotions are, and ultimately be able to order food. And as Paul said before, not everything is available through an API. So we decided to delegate this task to a web agent. And web agents, I mean, there is a reason why they're popular right now. You talked about it already, and they're really powerful. And we saw that there were a lot available at the moment; a few new ones were coming out every month.

Chiara Caratelli [00:13:45]: So it was really a challenge to understand how to navigate this landscape of agents. So I immediately started to try out all these tools. There are a lot; MultiOn is probably the most famous at the moment, but there are a lot of open source projects as well from companies and research groups. And these tools are all really nice, but we discovered pretty soon that they have a lot of limitations. One thing that is really a big problem for agents is that websites are built for humans. So there is a lot of information that is not agent friendly at all. You have dynamic content that loads; the DOM of the page could be huge and could change. There are of course the standard things like CAPTCHAs.

Chiara Caratelli [00:14:42]: It sounded like something that would be really easy for us so we'll just delegate execution of this stuff to an agent. But it was not at all.

Paul van der Boor [00:14:52]: I mean, at this point we'd already been building the data analyst and other things which, from our perspective, seemed much more complex to get right, because you have to connect to a database, you have to run queries, you have the code executor, you have to validate. We were like, well, what we're going to do now is just send this agent to the web and basically navigate Beautiful Soup style. Just go. And actually it took forever. It didn't complete any of the tasks, even MultiOn at that time. We looked at all the other frameworks.

Chiara Caratelli [00:15:23]: Yeah. One good example is WebVoyager, which is one of the first web agents. They also published a benchmark. It's public. Many agents benchmark against this, and what they say is that the success rate can change a lot from one website to the other and from one task to the other. For instance, websites where you need to perform a lot of actions, like Booking.com or Google Flights, are really hard for an agent to navigate. Especially when an action can trigger something else that the agent doesn't expect. Like when you book a flight, you select an airport and the destination airport changes depending on the starting airport.

Chiara Caratelli [00:16:07]: Right. So this is all really complicated for an agent. So yeah, we started trying out all these tools. We built a lot of internal knowledge about them and all the techniques that they use. And at the end we decided to build basically our own framework, our own web agent. Yeah.

Demetrios [00:16:29]: And how does the logic work, or what does the backbone of a web agent do? Because I know the API agents use tools and you have the potential to do function calling or whatever. But with web agents, what does it look like?

Chiara Caratelli [00:16:49]: So it boils down to something really similar actually. An agent is something that has access to information and can decide what to do next. Right. So the information in this case could be the screenshot of the page or the DOM, the HTML. Lots of things that you can get from a browser. And what it can do are the actions that a human can also do, like clicking, entering text, scrolling, and so on.

Demetrios [00:17:15]: Those are like the tools it has.

Chiara Caratelli [00:17:17]: Yeah, those are nothing but tools that the agent can call. So the way it goes is that usually there is a planner that, once it gets the task, decides how to perform it. It could be several steps, for instance involving several websites. And then there is an agent that chooses which tool to use. If there is a cookie banner, for instance, it clicks Accept. It's something that could be unexpected. So that's why you need an agent. You need to be able to react to an open world.
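
The loop Chiara describes (observe the page, pick one of a few human-like tools, repeat) can be sketched roughly as follows. This is a minimal illustration, not the team's implementation: `FakePage`, the tool names, and the rule-based `choose_action` (standing in for the LLM call) are all hypothetical, and no real browser is driven.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class FakePage:
    """Stand-in for a browser page (hypothetical; no real browser involved)."""
    cookie_banner: bool = True
    query: str = ""
    log: List[str] = field(default_factory=list)


# Tools: the same actions a human has in a browser.
def click(page: FakePage, target: str) -> None:
    if target == "accept_cookies":
        page.cookie_banner = False
    page.log.append(f"click:{target}")


def type_text(page: FakePage, text: str) -> None:
    page.query = text
    page.log.append(f"type:{text}")


def scroll(page: FakePage, direction: str) -> None:
    page.log.append(f"scroll:{direction}")


TOOLS: Dict[str, Callable[[FakePage, str], None]] = {
    "click": click,
    "type_text": type_text,
    "scroll": scroll,
}


def choose_action(page: FakePage, task: str) -> Optional[Tuple[str, str]]:
    """Rule-based placeholder for the model that picks the next tool call."""
    if page.cookie_banner:           # unexpected obstacle: deal with it first
        return ("click", "accept_cookies")
    if page.query != task:           # task not yet entered in the search bar
        return ("type_text", task)
    return None                      # nothing left to do


def run_agent(page: FakePage, task: str, max_steps: int = 10) -> List[str]:
    """Observe, choose a tool, act; stop when done or out of steps."""
    for _ in range(max_steps):
        action = choose_action(page, task)
        if action is None:
            break
        name, arg = action
        TOOLS[name](page, arg)
    return page.log


print(run_agent(FakePage(), "pizza"))
```

The `max_steps` cap is one simple guard against the loops mentioned later in the conversation.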

Paul van der Boor [00:17:55]: By the way, this is a great example, because we were benchmarking on, and looking at others benchmarking on, WebArena, which turns out doesn't translate at all to the tests we were doing. For one, the actual average results didn't compare, but they were also super unpredictable. So one time they worked, one time they didn't. So we'd have to devise simulations where we would look at how many times out of 20 this thing would succeed on a task that we care about. Right.
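
The "how many times out of 20" measurement Paul mentions could look something like this. It is only a sketch: `flaky_agent` is a hypothetical stand-in that succeeds at random, where a real harness would launch the agent on the task and check the end state.

```python
import random
from typing import Callable


def success_rate(run_task: Callable[[random.Random], bool],
                 trials: int = 20, seed: int = 0) -> float:
    """Run the same task repeatedly and report the fraction of successes."""
    rng = random.Random(seed)  # seeded so the measurement is repeatable
    successes = sum(1 for _ in range(trials) if run_task(rng))
    return successes / trials


def flaky_agent(rng: random.Random) -> bool:
    """Hypothetical agent run that succeeds about 60% of the time."""
    return rng.random() < 0.6


rate = success_rate(flaky_agent, trials=20)
print(f"succeeded in {rate:.0%} of 20 runs")
```

Reporting a rate over repeated runs, rather than a single pass or fail, is what makes unpredictable behaviour visible.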

Demetrios [00:18:24]: Oh wow.

Chiara Caratelli [00:18:24]: Yeah, most of the time we saw agents getting stuck in loops and, yeah, just not knowing what to do next. They ended up stuck because maybe the task was not clear or the task that the planner gave was not clear.

Paul van der Boor [00:18:40]: Or the action space was not clear. Because we also saw that you're looking at a website and for us it's very obvious. You look at a website and there are all these patterns that by now we're all familiar with. There's the search bar. Scrolling is one that was super hard for these agents to do. But you've basically determined this is the website, and you've got some reasoning through some LLM that tells you what you want to do next. But where on this website do you click to do that? Just understanding which coordinate corresponds to the action I want to take. And that's not something these multimodal image models were good at.

Paul van der Boor [00:19:17]: Like just taking an image, understanding what are the kinds of actions that are there is fine. But then saying okay, then you need to click on such and such coordinate to execute that action or scroll down because it's probably lower on the page.

Demetrios [00:19:29]: Yeah, because I don't see it.

Paul van der Boor [00:19:30]: Or you don't see it because there's a privacy or cookie banner in the way.

Chiara Caratelli [00:19:34]: Yeah. One of the first things we worked on was crawling actually. So we looked at the open source frameworks that were around and we tried to use similar strategies, but we chose only the specific strategies that applied to our use case. And I think this is really important, because the way to make an agent succeed is to limit the number of choices it has to make as much as possible. All these tools were optimized for a specific goal, which is being able to surf the web. But our goal was different.

Chiara Caratelli [00:20:09]: So in our case, we couldn't use a tool like that. It wouldn't work for us. We needed to build a web agent that could interact with platforms to order food. And that's a different task. And since the scope of this task is smaller, we could optimize for that. So we built an agent that could get more information about the page. Depending on the platform that we were working with, we could prompt the agent to behave in a certain way. Like first you search in the search bar, maybe you need to enter your postal address, and so on.

Chiara Caratelli [00:20:48]: There are certain things that always go together. And yeah, this is also another thing we did. We took certain tools and we merged them. So if there are actions that you always do at the same time, why would you use two tools for that? For instance, when you search for something, you type in and then you press enter. So these are two separate actions, but you can combine them into one tool because we needed that. You basically never type without pressing enter. Yeah. Another thing was improving the scrolling.

Chiara Caratelli [00:21:29]: When you have long menus and lists of restaurants, you need to be able to fetch all the information. So we adopted these strategies to work with our use case and we got a good success rate for that. So I think the lesson here is that if you want to build a web agent for a specific task, keep in mind the task that you have to do and be smart about it. If there are things that you don't need, don't add them to the agent.
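
The merged "type, then press Enter" tool described above might be sketched like this. `RecordingBrowser` and its method names are hypothetical placeholders for whatever browser-automation SDK is actually used; the point is only that the agent sees one `search` tool instead of two.

```python
from typing import List


class RecordingBrowser:
    """Hypothetical browser stand-in that records low-level actions."""

    def __init__(self) -> None:
        self.events: List[str] = []

    def type_text(self, selector: str, text: str) -> None:
        self.events.append(f"type {selector} {text}")

    def press(self, key: str) -> None:
        self.events.append(f"press {key}")


def search(browser: RecordingBrowser, selector: str, query: str) -> None:
    """Single tool exposed to the agent: type the query and submit it.

    Merging the two steps removes one decision the model could get wrong.
    """
    browser.type_text(selector, query)
    browser.press("Enter")


browser = RecordingBrowser()
search(browser, "#search", "margherita pizza")
print(browser.events)
```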

Demetrios [00:22:00]: And so did you go and map out the trajectories and the user flow on these food ordering apps? And maybe it was like you would go to iFood and say, hey, I want to order pizza. And then go through that flow yourself so that you could use it as a golden data set for the evals of the web agent.

Chiara Caratelli [00:22:22]: So one thing that we did was to do all these flows manually. I think you cannot build something until you try to do it yourself and understand what the pain points are. So this was the first thing that we did. And then we tried to prompt the agent to interact with the web page in a certain way. And these instructions were loaded dynamically based on the page it was on. This was one thing. So all these methods don't really change the speed at which the agent works. But what we did was also storing all the trajectories that the agent had done.

Chiara Caratelli [00:23:06]: We defined three modes that the agent could operate in. One was the traditional mode, where it would grab a screenshot of the page, load the content, and then decide what action to take next. The other was a faster mode that didn't involve a screenshot, and we would use that on pages that we knew. So if we would search for a certain food and the food would be different, but the task would be similar, we would not load the screenshot of the page because that was not needed. The agent would know exactly where to click because it had seen that task before. And there was a third mode, which we called reflex mode, in which we would automate the web actions directly, like sort of a macro, let's say. Some parts of this can be automated. So why would you have an agent do it? Right? So, yeah, we combined all these things, and the agent would try to do things in a fast way, and then if it did not succeed, it would do it in a slower way.
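
The fast-then-slow fallback across the three modes could be organised roughly as below. The mode functions are hypothetical stubs: in a real system reflex mode would replay a recorded macro, fast mode would call the model with the cleaned DOM only, and the traditional mode would add a screenshot.

```python
from typing import Callable, List, Optional, Tuple


def reflex_mode(task: str) -> Optional[str]:
    """Replay a hard-coded macro if one exists for this task (no model call)."""
    macros = {"open_search": "done via macro"}
    return macros.get(task)


def fast_mode(task: str) -> Optional[str]:
    """Act from a previously seen trajectory, without a screenshot."""
    known_tasks = {"search_food"}
    return "done without screenshot" if task in known_tasks else None


def traditional_mode(task: str) -> Optional[str]:
    """Slowest path: screenshot plus page content, then decide."""
    return "done with screenshot"


# Ordered from cheapest to most thorough.
MODES: List[Tuple[str, Callable[[str], Optional[str]]]] = [
    ("reflex", reflex_mode),
    ("fast", fast_mode),
    ("traditional", traditional_mode),
]


def run_with_fallback(task: str) -> Tuple[str, str]:
    """Try each mode in order; fall back to the next one on failure (None)."""
    for name, mode in MODES:
        result = mode(task)
        if result is not None:
            return name, result
    raise RuntimeError(f"all modes failed for task: {task}")


for task in ["open_search", "search_food", "checkout_on_new_site"]:
    print(task, "->", run_with_fallback(task))
```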

Demetrios [00:24:12]: So it was almost like the slow way was the plan B. In case it couldn't do it fast, it would do it slow and you would give it a little bit more reasoning or you would give it the screenshots and it was more thorough.

Paul van der Boor [00:24:23]: I think that's interesting if you try to understand where we think these things are going. We start with a set of tools and frameworks that aren't really made to interface with the web necessarily. But through those three modes that Chiara just explained, and because of the trajectories that it knew were successful on specific sites for specific tasks, we were able to have the agent access that, let's say, learned action space for websites it was then, quote unquote, familiar with. And I think that's something you would want to have, if we think about our world: when we go to websites we're familiar with, we can navigate, right? You go to Booking.com and you know exactly what to do. If you've ordered your food many times, you don't need to rediscover that page. And as we are able to create that persistent or learned intuition about a website, we're starting to see these agents essentially become experts. Well, first familiar, then experts, and they can get you to your desired output much, much faster.

Demetrios [00:25:31]: And I know that I'm still trying to just separate what's different and what's newer. With the web agents and with that learned experience, something that it's seen, how are you saving it and how are you making sure that the agent has access to it? Are you throwing it in a database? Are you caching it? What does that look like?

Chiara Caratelli [00:25:50]: I think the storage itself doesn't matter that much as long as this is something that is privacy compliant and doesn't lead to leaking user information.

Demetrios [00:26:03]: But you're storing the path that it took or you're storing the action because you're not storing the screenshot and then loading that up again, right?

Chiara Caratelli [00:26:12]: No, we store the whole path and the state that the page was in. So the agent knows what the DOM looks like and which elements it can click on.

Demetrios [00:26:22]: Okay.

Chiara Caratelli [00:26:22]: Another thing I didn't mention before is that we did some work to understand how to clean this DOM, because there is a lot of information there, but the agent doesn't need all of it. It should get as little information as possible. So we only took the elements that were clickable, for instance, and we combined this with a screenshot. In fast mode, the agent could be a bit more blind, let's say, while still knowing where to click because the task was really similar.
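
As an illustration of the DOM cleaning described here, the sketch below keeps only elements that look clickable and drops everything else. It uses Python's standard `html.parser`; the set of tags treated as clickable and the fields kept per element are assumptions, not the team's actual rules.

```python
from html.parser import HTMLParser
from typing import Dict, List

# Assumption: these tags (or anything with an onclick handler) count as clickable.
CLICKABLE_TAGS = {"a", "button", "input", "select", "textarea"}


class ClickableExtractor(HTMLParser):
    """Collect a compact description of each clickable element."""

    def __init__(self) -> None:
        super().__init__()
        self.elements: List[Dict[str, str]] = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag in CLICKABLE_TAGS or "onclick" in attr_map:
            self.elements.append({
                "tag": tag,
                "id": attr_map.get("id") or "",
                "label": attr_map.get("aria-label") or attr_map.get("value") or "",
            })


def clickable_elements(html: str) -> List[Dict[str, str]]:
    """Return only what the agent needs to decide where to click."""
    parser = ClickableExtractor()
    parser.feed(html)
    return parser.elements


page = (
    '<div><p>Long menu description the agent does not need</p>'
    '<button id="add" aria-label="Add to basket">Add</button>'
    '<a id="cart" href="/cart">Cart</a>'
    '<span id="promo" onclick="openPromo()">Promo</span></div>'
)
for element in clickable_elements(page):
    print(element)
```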

Demetrios [00:26:52]: Nice. Now the other thing that I think we wanted to talk about was the differences between planning and execution and the models that you use for each of these. Because we know that there are the reasoning models that you probably are using for the planning, but then do you offload that onto a model that is smaller and just executes, or is it fine tuned? What does it look like?

Chiara Caratelli [00:27:20]: So for this specific task we use foundation models. We did experiments with all the major foundation models and we saw, of course, some differences. For the planner, of course it helps to have a model that is good at planning, like o1 for instance. And for execution itself, you don't really need a model that is good at planning, let's say, because as long as it knows what to do, this is a very limited action space. Right.

Paul van der Boor [00:27:54]: It's an important pattern that's emerging, the separation of planning and execution, as you start to interface with the world. Web is of course one of those interfaces. Planning itself requires a lot of reasoning and understanding of the intent of the user. Execution basically means I need to know the action space really well and be able to translate that plan into the action space of my world, which in the broadest sense could be the web, or it could be a domain like iFood or OLX or PayU or any of those websites that we know and understand well, and that the second agent, the execution agent, then needs to navigate to get to that outcome successfully.

Demetrios [00:28:40]: Is that where the simulations were coming in? And tell me more about what the simulations were and how those helped.

Paul van der Boor [00:28:47]: In terms of the simulations, I think what is also really nice with the web, which is different from other places we've been applying LLMs, is you can just send these agents out to go and explore the websites like a web crawler.

Demetrios [00:29:00]: Right.

Paul van der Boor [00:29:01]: Essentially you can basically say, go and find me a blue couch in Warsaw on OLX, and it can then go and explore. And as long as we've defined what success looks like, for example, found the couch or added it to cart or whatever it is, then we can do this a hundred times. And it learns which trajectories are most likely to get it to that state. So it's more exploration, to learn what these websites can and cannot do, that allows you to get to a system that actually is really good at executing within your catalog or your e-commerce environment.
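
A toy version of this explore-many-times idea: run a random explorer against a site model, keep the runs that reach the defined success state, and store the shortest one for reuse. Everything here is assumed for illustration, including the three-page `SITE` model, the action names, and the choice of "shortest successful path" as the thing to remember; a real agent would explore a live site and would not act at random.

```python
import random
from typing import Dict, List, Optional, Sequence, Tuple

ACTIONS: List[str] = ["search", "open_item", "add_to_cart", "scroll"]

# Toy site model (an assumption): (state, action) -> next state.
SITE: Dict[Tuple[str, str], str] = {
    ("home", "search"): "results",
    ("results", "open_item"): "item",
    ("item", "add_to_cart"): "cart",
}
SUCCESS_STATE = "cart"  # "what success looks like" for this task


def replay(path: Sequence[str]) -> str:
    """Apply a stored trajectory from the home page and return the final state."""
    state = "home"
    for action in path:
        state = SITE.get((state, action), state)  # unknown moves do nothing
    return state


def explore_once(rng: random.Random, max_steps: int = 8) -> Optional[Tuple[str, ...]]:
    """One exploration run; returns the action path if it reached success."""
    state = "home"
    path: List[str] = []
    for _ in range(max_steps):
        action = rng.choice(ACTIONS)
        path.append(action)
        state = SITE.get((state, action), state)
        if state == SUCCESS_STATE:
            return tuple(path)
    return None


def learn_trajectory(runs: int = 100, seed: int = 0) -> Optional[Tuple[str, ...]]:
    """Explore repeatedly and keep the shortest successful trajectory."""
    rng = random.Random(seed)
    successes = [p for p in (explore_once(rng) for _ in range(runs)) if p is not None]
    return min(successes, key=len) if successes else None


best = learn_trajectory()
print(best)
```

The stored trajectory can then be replayed on later runs, which is one plausible reading of the "learned action space" discussed earlier.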

Demetrios [00:29:43]: It's basically like you're mapping out the space, correct? And then once you have the map, you can traverse it easier.

Paul van der Boor [00:29:52]: Well, no, I think we're seeing it in computer use similarly, that people are starting to map out the applications. Right. So if you were to know every button in Excel or in Word or the commonly used apps, and if I go to you and say, hey, please make me a presentation in dark mode with such and such font and background, and you've done PowerPoint many times, you know where to go. And I think that kind of mapped action space is something you can simulate, essentially, because you just go and have these agents explore apps.

Demetrios [00:30:25]: Are you using hotkeys as tools?

Paul van der Boor [00:30:27]: You could. Yeah.

Chiara Caratelli [00:30:29]: Tab, for instance, is a very useful hotkey to know. You can do most of the things on a web page through Tab.

Demetrios [00:30:38]: Yeah.

Chiara Caratelli [00:30:38]: And yeah, this is also one of the reasons why it's good to separate planner and execution, because the executor has only limited tools and only needs to understand whether it has finished or not. So we would have a planner giving the executor a very specific task. For instance, if I were to search for a T-shirt on OLX, the first steps would be to open the OLX page, search in the search bar, and so on. The executor would take these subtasks, execute them, stop when it was finished, and give the response to the main agent, which would then process it and decide what to do next.
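
The planner and executor split can be pictured with a small sketch like the one below. Both functions are hypothetical stubs: in practice `plan` would be a call to a reasoning-oriented model and `execute` would be an agent with a handful of browser tools, whereas here they are hard-coded so the control flow is visible.

```python
from typing import List


def plan(item: str) -> List[str]:
    """Planner stub: break the user's request into very specific subtasks."""
    return [
        "open the OLX page",
        f"search for {item} in the search bar",
        "report the first results",
    ]


def execute(subtask: str, history: List[str]) -> str:
    """Executor stub: carry out one subtask and report when it is finished."""
    history.append(subtask)
    return f"finished: {subtask}"


def run(item: str) -> List[str]:
    """Main agent: get a plan, hand each step to the executor, collect reports."""
    history: List[str] = []
    return [execute(subtask, history) for subtask in plan(item)]


for report in run("t-shirt"):
    print(report)
```

Keeping the executor's job this narrow is what allows a smaller or cheaper model to be used for it, as discussed above.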

Demetrios [00:31:23]: Oh, nice.

Chiara Caratelli [00:31:24]: So yeah, you can map the space and give better information to the execution agent from both sides, basically both from the planner and the execution side. So I think there's a lot of room for improvement once we get more data.

Demetrios [00:31:45]: All right, so now let's talk about some of the frameworks that you used. You did mention WebVoyager, you also mentioned MultiOn. I imagine there were things that you really liked from some of these. And is there anything that just stands out to you from one of these web frameworks? Maybe it's a novel way of doing things, or a good way that you feel like was something that you brought back into your own framework that you created.

Chiara Caratelli [00:32:15]: So something I really liked was WebVoyager. It was the simplest. Many of the other frameworks built on top of that. But the separation between planner and executor was really clear. The executor didn't have many tools available, just basic web interaction through an SDK. In that case it was using Selenium, which is a testing tool. We decided to choose another one. But the strength there is its simplicity.

Chiara Caratelli [00:32:54]: So yeah, I think that's really powerful. And I also liked the visual approach because the DOM does not always point you in the right direction, because you could have misleading information there. But what the user sees is what is important at the end of the day. So yeah, that approach I think is really useful. Other frameworks build on top of that and they added more complexity. In terms of the planner, for instance, we saw Agent E, which added a more hierarchical type of planning that increases the success rate. It does, but it also makes the task more difficult and slower to execute. We saw other approaches like Monte Carlo tree search for planning.

Chiara Caratelli [00:33:47]: This is an open source project from MultiOn as well. In the end we decided to choose the simplest possibility because our task was clear, we knew what we had to do, and we ended up using this agent sort of as an API. We wrote code that was very modular, so we could delegate things to a web agent without needing to know what it was doing. It was kind of a black box within our application and would give us the response. And with that we could take action and interact with the user. Because in the end that's what's important.

Demetrios [00:34:24]: Basically, we're six months into your journey. Knowing what you know now, what would you tell yourself, if you could, six months ago about this whole journey?

Chiara Caratelli [00:34:34]: A lot of things. So let's start with this: I think the most important one is to really understand what problem you're trying to solve. Dive in, try to do the things yourself. Because web agents are automating tasks. So try to do it yourself. Try to see what the pain points are. I would explore all the possibilities that are around, but reminding myself that these tools are not necessarily what I need.

Chiara Caratelli [00:35:07]: The other thing that helps is to approach this as a software engineering problem rather than data science. And I say that because I come from a data science background. So this was really, really big for me.

Demetrios [00:35:21]: What does that mean? What's the difference?

Chiara Caratelli [00:35:22]: Yeah, I'll come to that. So this is not a data science project, but a software engineering project where there are some LLM steps. And this means that you can adopt all the good practices of software engineering, like keeping things modular, separating responsibilities, keeping things as simple as possible, and trying to have more control. I think you discussed this about the SQL agent. This is even more important there. But yeah, it's important to understand where you need the agent and where you don't, and try to limit the amount of LLM calls as much as possible. I say this because when you approach these agent projects, it's really tempting to use all these high-level frameworks with a high level of abstraction and do everything through an agent. But this is not the right way to do it.

Chiara Caratelli [00:36:24]: I mean, it's nice for a proof of concept, to play around, but if you need to build something that works, you need to have control. So it really helps to think of different modules. So I have a planner that needs to think very well, but the execution part doesn't necessarily need to be done by an agent. Like if there are things that can go through a deterministic approach, it's much better. An example: in this tool we had to pull user information because we had to understand the user's background, whether they had a data restriction, for instance, things like that. We didn't always need it, but most of the time we needed it. I very naively built an agent that could interact with the database and retrieve the information, but actually we didn't need to do that. Why would you use an agent if you can just pull the data and add it to the context, to the prompt?
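The user-information example above can be sketched as a plain lookup followed by string formatting, with no agent in the loop. The `users` table, its columns, and the function names `fetch_user_profile` and `build_prompt` are assumptions made for illustration; the source does not describe the actual schema or prompt.

```python
import sqlite3


def fetch_user_profile(conn: sqlite3.Connection, user_id: int) -> dict[str, str]:
    """Deterministic lookup: one parameterized query, no LLM call."""
    row = conn.execute(
        "SELECT name, city FROM users WHERE id = ?", (user_id,)
    ).fetchone()
    if row is None:
        return {}
    return {"name": row[0], "city": row[1]}


def build_prompt(task: str, profile: dict[str, str]) -> str:
    """Put the retrieved profile straight into the prompt context."""
    lines = [f"- {key}: {value}" for key, value in profile.items()]
    context = "\n".join(lines) if lines else "- (no profile on record)"
    return f"User profile:\n{context}\n\nTask: {task}"


# In-memory database with a hypothetical schema, just to exercise the functions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'Ana', 'Lisbon')")
prompt = build_prompt("Order lunch for today", fetch_user_profile(conn, 1))
```

Because the lookup is ordinary code, it behaves the same way on every call and costs no LLM tokens, which is the control Chiara is describing.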

Chiara Caratelli [00:37:28]: So that was a big revelation for me because it made things much simpler and gave us control. So make use of the frameworks where you need them, but keep in mind that you can go low level and have more control over the parts that are important. And test things, try to find edge cases, try to find what doesn't work, and have fun. That's also.

Demetrios [00:37:58]: That's what you would have told yourself? You didn't have fun?

Chiara Caratelli [00:38:01]: Yeah, I had a lot of fun.

Demetrios [00:38:02]: Yeah. The idea of where to use an agent and when to use it is really a fascinating point because like you're saying you can build an agent to do this thing, but if you can do it without an agent, it's going to be more predictable. I've heard the other side of the argument be I can prototype an agent or I can create an agent so fast it's almost faster if I do it through an agent versus if I do it through traditional software development. Have you seen or do you have thoughts on that?

Paul van der Boor [00:38:39]: I mean, I think in general for prototyping, that's definitely true. You can prototype a lot of stuff very quickly, but at the end of the day, you know, we are, our main job is to take something that we can see work and scale it.

Demetrios [00:38:52]: Yeah.

Paul van der Boor [00:38:53]: So our next step, because of the size of our platforms, is always to scale it to tens, hundreds of millions of users. And there, deterministic workflows are much preferred if you can, or at least narrowed down.

Demetrios [00:39:06]: So many reasons. Right.

Paul van der Boor [00:39:08]: So I think we tend to, yes, prototype, you know, in any way we can, quickly. But then after that, to Chiara's point, we need to distill the actual essence of, you know, the system into one that can be put into production and scale. And often that may also include function calling and agentic components, but it's not as sort of free rein as, you know, maybe what we do when we just start exploring it.

Demetrios [00:39:36]: Yeah, yeah, yeah. Did you create any benchmarks or particular evals for this project?

Chiara Caratelli [00:39:43]: Yeah, we chose some tasks that were representative of typical e-commerce interactions and we tried to optimize for those. Then we added variations of those, trying different possibilities, until we were happy with it.

Demetrios [00:40:04]: Did you have a certain accuracy score that you needed to get above?

Chiara Caratelli [00:40:08]: So our target was 80%. Yeah.
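A minimal sketch of how a task-level success rate could be checked against the 80% target mentioned here. The task names and the `run_task` callable are hypothetical; the source does not say how success was actually measured.

```python
from typing import Callable

TARGET_SUCCESS_RATE = 0.80  # the 80% target from the conversation


def success_rate(tasks: list[str], run_task: Callable[[str], bool]) -> float:
    """Fraction of tasks for which run_task reports success."""
    if not tasks:
        raise ValueError("need at least one task to compute a success rate")
    passed = sum(1 for task in tasks if run_task(task))
    return passed / len(tasks)


def meets_target(rate: float, target: float = TARGET_SUCCESS_RATE) -> bool:
    """True when the measured rate reaches the target."""
    return rate >= target


# Hypothetical benchmark: representative tasks plus one variation.
tasks = [
    "search for a T-shirt",
    "search for a T-shirt under a price limit",
    "open the first listing",
    "order a pizza",
    "order a pizza without an address",
]
# Stand-in runner (assumption): pretend only the task missing an address fails.
rate = success_rate(tasks, lambda task: "without an address" not in task)
```

In practice `run_task` would drive the web agent end to end and compare the outcome with an expected result for that task.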

Demetrios [00:40:11]: Okay.

Chiara Caratelli [00:40:11]: But of course it heavily depends on the task and on the website and on the user itself because it really, it depends what the user wants and like how specific the request is as well. So of course there is a whole planning step where the agent talks with the user and tries to understand whether it has all the information. And then the other part, the execution needs to have all the right information to be able to perform the task. So this was very important. For instance, you cannot order food if you don't have an address. The user has to be willing to provide the address. Yeah. So yeah, this was a challenge.

Chiara Caratelli [00:41:00]: And to go back to the deterministic approach, there is some deterministic component in here as well. Because if I need to perform a certain task, I need to inform the web agent. I need to give it the right information to be able to do that. And that's deterministic. The planner needs to know that it needs to provide an address, list of dishes and so on. And yeah, that's really important and it increases accuracy a lot. We tried both ways and that's definitely better.
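The deterministic check described above can be sketched as a lookup of required fields per task type, run before anything is handed to the web agent. The task type `order_food`, the field names, and `missing_fields` are hypothetical; only the address and list of dishes come from the conversation.

```python
# Required inputs per task type (hypothetical names, for illustration only).
REQUIRED_FIELDS: dict[str, tuple[str, ...]] = {
    "order_food": ("address", "dishes"),
}


def missing_fields(task_type: str, collected: dict[str, object]) -> list[str]:
    """Return the required fields that are absent or empty, in a fixed order."""
    required = REQUIRED_FIELDS.get(task_type, ())
    return [field for field in required if not collected.get(field)]


# The planner would keep asking the user until this list is empty.
still_needed = missing_fields("order_food", {"dishes": ["margherita pizza"]})
```

Keeping this check in plain code means the planner cannot forget to ask for an address, which is one reading of why the deterministic version was more accurate.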

Paul van der Boor [00:41:36]: I think one way to kind of see what we concluded from doing all this work is that on the one hand, I think if we just take an agent and want to go to one of our existing websites or platforms, we can get pretty far. But I think there's a sweet spot where we can also, because there's a lot of things we don't display on websites that may be useful to help a user going through an e-commerce journey. To give you an example, when we're in OLX, we know it's a classifieds space. Buyers and sellers of secondhand goods. We know what the reputations are of certain sellers. We know the location of people, we know what kind of things they've searched for in the past. We know what the supply and demand are, we know what reasonable prices are for categories. Those are not things that an agent necessarily has access to if it just goes to the website of a marketplace.

Paul van der Boor [00:42:27]: But if we're building the agent as the marketplace and we've got access to all those rich marketplace dynamics and information and customer reviews, that is certainly relevant at the moment of going through a transaction. That's where I think we can create really useful agents. And that's certainly a conclusion we drew, because of course here the experimenting we're doing is just purely from the outside in. But if we combine that and build an agent that is integrated with the platform and is available to the user at the moment they want to find or exchange things, that's going to create a completely new, AI-first, you know, e-commerce experience, and, you know, we'll hopefully be able to talk about some of that.

Demetrios [00:43:12]: Well, it goes back to building your website for humans or for agents, because you can also expose that data for other agents to use or you can choose to not expose that. And I know that's a debate topic that we're going to have too, because it's like, well, if this is useful for me, then it might be useful for other agents. And if we expose it in a way that a human's not going to see it. But if an agent is using the website for some reason, they will be able to see it. I don't know how that would look, because if it's not exposed in the GUI and you're using a web agent, the web agent has access to the GUI, right? Or it also has access to the DOM. So maybe you put it there and then that exposes it.

Chiara Caratelli [00:43:57]: We saw that some websites have started to add markdown, for instance, with the description of the page. That's already really helpful. So I see some progress in this direction. This helps especially with e-commerce because you might have a lot of items on a page, so it's just much faster if the agent can load them. The lesson we learned is that we had to be very specific with the instructions. Break them down as much as possible. So limit the amount of...

Chiara Caratelli [00:44:30]: Sorry, limit the amount of thinking that the executor has to do. Delegate that to the planning agent. So have the instructions as detailed as possible. Break them down into steps. Try to make use of all the tools you have available, but select them in a smart way. Like if you only need to do a certain interaction on the page, make only those tools available to the agent. It doesn't need to have the whole space. I've learned so much in these six months.

Chiara Caratelli [00:45:04]: I don't know what. Like if I look back six months ago, I'm a totally different person now, so.
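Chiara's advice about making only the needed tools available can be sketched as a per-subtask allowlist. Everything here is hypothetical: the tool names, the subtask kinds, and `tools_for` are illustrative stand-ins, and the stub functions only return strings instead of driving a browser.

```python
from typing import Callable

# Full tool space (stubs standing in for real browser actions).
ALL_TOOLS: dict[str, Callable[[str], str]] = {
    "click": lambda target: f"clicked {target}",
    "type_text": lambda text: f"typed {text}",
    "press_key": lambda key: f"pressed {key}",
    "scroll": lambda direction: f"scrolled {direction}",
    "go_back": lambda _: "went back",
}

# Allowlist: which tools each kind of subtask is permitted to use.
TOOLS_BY_SUBTASK: dict[str, tuple[str, ...]] = {
    "fill_search_bar": ("click", "type_text", "press_key"),
    "browse_results": ("scroll", "click"),
}


def tools_for(subtask_kind: str) -> dict[str, Callable[[str], str]]:
    """Return only the tools the executor is allowed to use for this subtask."""
    if subtask_kind not in TOOLS_BY_SUBTASK:
        raise ValueError(f"unknown subtask kind: {subtask_kind}")
    return {name: ALL_TOOLS[name] for name in TOOLS_BY_SUBTASK[subtask_kind]}


search_tools = tools_for("fill_search_bar")
```

A smaller tool set gives the executor fewer ways to go wrong on a narrow step, which matches the point about limiting how much thinking it has to do.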

Paul van der Boor [00:45:11]: Well, we're always looking for more smart people, so interns, others, if you want to come check out what we're doing, reach out to us.

Demetrios [00:45:20]: Nice. Yeah, really smart people. Except for some of them that sit at this table.

Paul van der Boor [00:45:29]: That's all you too, Demetrios. That's why we work together.

Demetrios [00:45:32]: Exactly. Yeah. There we go. The Prosus AI team is hiring and you can find all the links to everything you need to know in the show notes below.
