We’re Using AI Agents at Work (and it’s amazing) // Paul van der Boor & Euro Beinat
Paul van der Boor is a Senior Director of Data Science at Prosus and a member of its internal AI group.
I am a technology executive and entrepreneur in data science, machine learning and AI. I work with global corporations and start-ups to develop products and businesses based on data science and machine learning. I am particularly interested in Generative AI and AI as a tool for invention.
A presentation from Euro and Paul about agents, and what they have learned from their 15,000 colleagues who use AI agents every day.
Euro Beinat [00:00:20]: Welcome back. About two years ago we decided to release an AI assistant to all our colleagues across Prosus. It's an AI assistant based on agents. What we want to share now is what we learned over the last two years, both in terms of how we use this agent-based AI assistant to amplify our work and change the way we work, and how we use the same technology to change the applications that we offer to our consumers. Many of you will know Prosus; for those who don't, a few facts about this group. We are a global technology group, and we focus predominantly on e-commerce.
Euro Beinat [00:01:01]: And e-commerce for us is a full spectrum: things that you might do several times a day or week, like ordering food, ordering groceries or receiving parcels at home; at the other extreme, decisions that you take once in a while, like buying a house or changing cars; and everything in between, with electronics, fashion and so on. We operate platforms in about 100 countries around the world. These platforms connect the buyer and the seller, the student and the teacher, the restaurant and the diner. You see here some of the brands in the group. I want to congratulate Swiggy. Swiggy is based in India, and Swiggy had a successful IPO today.
Euro Beinat [00:01:44]: So congratulations to Swiggy. Because of the nature of what we do, we had to start investing in artificial intelligence and machine learning early on. We won't go through the whole process, but you can imagine that we handle billions of transactions every day, so machine learning is at the core of what we do. It has to be, because it fuels growth. We started around 2018; at this moment, the ecosystem of AI experts and data scientists across the group is about 1,000 people, with several hundred models in production. And we also measure hundreds of millions of dollars of impact in practice from all these models.
Euro Beinat [00:02:21]: So AI is core to our growth. What I want to share, however, is what we learned in two cases. The first one is how we use AI agents to amplify our own work, and there I'll talk about Toqan, our own assistant that we developed in house; I'll explain in a second why. The second is how the same technology, the same core AI agents, also powers OLX Magic, a platform that is offered to our consumers in e-commerce. Let's start with Toqan. It's our personal and team assistant at Prosus.
Euro Beinat [00:02:57]: What does it do? Well, as you can imagine (sorry for that), it is a very general assistant. It is designed for everybody in the group to experiment with artificial intelligence, with generative AI and LLMs. It lives in Slack, because a lot of our engineering teams are on Slack, or on the web. It's agent based, and it is LLM-agnostic: it uses all the LLMs out there, though of course it has preferences. It connects to our databases and is designed for our own use cases, which is the reason we decided to develop it in house. There's strong security and privacy.
Euro Beinat [00:03:34]: It is designed to produce analytics and insights while respecting privacy, of course. And it's composable, in the sense that companies in the group can take pieces and integrate them into their own applications. At this moment it is offered only to our employees. Of course we have some external users, but the focus here is how to make our companies better — not only the companies we control, but also the companies we invest in, where we might have a minority participation. The reason we built it is essentially one. Once we started doing large-scale field tests of LLMs — this was about 2020, which was early, and GPT-3 was not really good for much, but it was promising — we had the impression it could offer solutions to a huge number of use cases we couldn't really solve before. People using it in our field tests were coming up with solutions and ideas that we didn't really expect.
Euro Beinat [00:04:34]: So we came very quickly to the conclusion that we can't really dictate how this is going to evolve. The best thing we can do is give the best possible tools to everyone, so that they can test them and figure out by themselves what this can do for them. We call this a collective discovery process: a large collective discovery exercise where everybody in the group has the possibility of using this assistant to work better, faster, deeper — whatever creates value for them — but also to test use cases that could go to consumers. We started in summer 2022, before ChatGPT. In 2023, when we started measuring adoption, we had perhaps 1,000 to 2,000 people using it, and about 50% of all usage of this tool was engineering based.
Euro Beinat [00:05:25]: So engineers were the first group of people who were convinced that this could make their life easier — developers and software engineers in particular, but also product managers and so on. However, this tool had a lot of limitations, and the main one is what we talked about in one session earlier. Every time you interacted with the tool, you had four icons, four emoticons. The first one is thumbs up. The second one is thumbs down. Then, of course, love.
Euro Beinat [00:05:54]: And there's one specific one that we had to add, and that's Pinocchio: it's how you signal that the tool made things up. You're lying to me, I can't trust you. In October 2022, when we measured this, it was about 10%, so 1 in 10 answers was one you could not really trust — and you don't know which one. So it was very hard to use the tool at that time. Now, time goes by.
Euro Beinat [00:06:16]: You can see how this hallucination rate went from 10% to about 1%, where it now lingers. We believe that 1% will probably stay there, stable — it is probably something you cannot reduce — but in several cases we can put sufficient guardrails around it so that it can still be completely removed for specific use cases. Why did this happen? Because models became better, because users became better at using the tool — so they don't go where it doesn't work — but also because we introduced agents. Agents are one of the keys to making these tools function better. Somewhere around the beginning of this year, January 2024, we introduced agents. And when we talk about agents, this is the stack that we use. The main components are the framework, the orchestration and the set of tools.
Euro Beinat [00:07:08]: You can access this stack through Slack, through the web, or through APIs. A request goes to an agentic framework which decides how to unpack what you want to do into pieces that can be executed in parallel. It then figures out which tools to use; it goes to a router that picks the tools. And then, of course, there's a loop of reflection that goes on. All these components together are what make this agent stack work. I'm sure many others in this conference today will have similar stacks, or perhaps something else that we can learn from. There is also an important element here, which is that of insights.
Euro Beinat [00:07:44]: There is another model that lives inside this stack, and it is designed to create statistics about usage. It is not designed to look at what goes on, but to group the conversations — say, this one is about learning, this one is about a certain programming language, and so on — which is what helps us build the next version of the tool in this mechanic of continuous learning. If you look at how the tool works, here is a simple example: a data explorer. It's an agent used by people who want to get access to data in English, because they don't know how to write queries. So they go to Toqan, and Toqan, which is connected to the data lakes, can figure out an answer to something expressed in English. Of course, what it needs to do is find out which tables and columns to use, create the query, execute the query, check that everything works and there are no errors, and otherwise come back and change the plan. And it can use a variety of tools.
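For readers who want a concrete picture of the plan-execute-reflect loop described here, the following is a minimal sketch of a text-to-SQL "data explorer" agent under the assumptions just stated. It is not the Toqan implementation: the `llm()` and `run_query()` helpers and the toy schema are hypothetical stand-ins.

```python
# Minimal sketch of a plan-execute-reflect loop for a text-to-SQL
# "data explorer" agent. All names (llm, run_query, SCHEMA) are
# hypothetical stand-ins, not the actual Toqan internals.
import json

SCHEMA = {"orders": ["order_id", "user_id", "amount", "created_at"],
          "users": ["user_id", "country", "signup_date"]}

def llm(prompt: str) -> str:
    """Placeholder for a call to whichever LLM the router selects."""
    raise NotImplementedError

def run_query(sql: str):
    """Placeholder for executing SQL against the data lake."""
    raise NotImplementedError

def data_explorer(question: str, max_attempts: int = 3) -> str:
    feedback = ""
    for _ in range(max_attempts):
        # 1. Plan: choose tables/columns and draft SQL for the question.
        sql = llm(
            f"Schema: {json.dumps(SCHEMA)}\n"
            f"Question: {question}\n"
            f"Previous error (if any): {feedback}\n"
            "Return only a SQL query answering the question."
        )
        # 2. Execute the generated query against the data lake.
        try:
            rows = run_query(sql)
        except Exception as err:
            # 3. Reflect: feed the error back so the agent can revise its plan.
            feedback = str(err)
            continue
        # 4. Summarize the result in plain English for the user.
        return llm(f"Question: {question}\nRows: {rows}\nAnswer in plain English.")
    return "Sorry, I could not answer this question from the data lake."
```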
Euro Beinat [00:08:47]: In this case it uses a query creator and a query executor, but eventually also the creation of diagrams. This set of tools is part of the normal agent application, and we can see this becoming the standard of what our assistants are going to look like. After the introduction of agent-based assistants, we see that engineering usage has gone down and non-engineering usage has gone up, which, by the way, also corresponds with the tool becoming much more usable across the board. So many more non-engineers are using it: 41/59, which is more or less the 40/60 technical versus non-technical composition of our teams. Broadly speaking, it is used by 24 companies at this moment, by the more than 15,000 people mentioned before — about 20,000 using it continuously — and it is getting very close to about 1 million requests per month. What have we learned from all this? The first thing I want to underline is that we made the bet early on that the costs of running this tool would become acceptable.
Euro Beinat [00:10:04]: When we designed the first version of this, around March 2023 and even before, the cost of tokens — for GPT-4, or rather GPT-3 at the time — was extremely high. So the bet was that it was going to go down. This is what happened: it went down 98%. That is fairly common with everything we see across the other models. However, that's not the whole story.
Euro Beinat [00:10:30]: If you look at the cost of the agent interactions — this is between May and August this year, a very short amount of time, just four months — you can see that the cost per token, the yellow line, went down about 50%. At the same time, the number of questions went up (I think that is the blue line), and the tokens per question went up even further (the red line), because these tools require many more calls to the LLMs, many more interactions, and because the users' questions — precisely because the tool works — become more complicated and complex. So the trend of cost going down and usage going up more or less compensate. And we can see the green line, which is the cost per interaction.
Euro Beinat [00:11:14]: At this moment it lingers around 25 cents per time you actually touch the assistant and go through an agent process. I think this is interesting, because this green line — the total cost of ownership — is what will eventually drive adoption. At this level it is good. Hopefully it goes down, but it could also go up; I'm curious to hear what you experience as well. The next thing I want to focus on is some of the things we learned about the impact. We do this in particular for work, because we want to make sure that everybody in the group works better, works faster, works more independently and so on.
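To make the compensating trends on that cost chart concrete, here is a toy back-of-the-envelope calculation. The numbers are made up for illustration and are not the figures behind the actual chart; they only show how a ~50% drop in price per token can be offset by a doubling of tokens per question, leaving the cost per interaction roughly flat at around 25 cents.

```python
# Illustrative only: hypothetical numbers showing how falling token prices
# and rising tokens-per-question can roughly cancel out.
price_per_1k_tokens_may = 0.010   # $ per 1k tokens (hypothetical)
price_per_1k_tokens_aug = 0.005   # ~50% cheaper four months later
tokens_per_question_may = 25_000  # hypothetical agentic call volume
tokens_per_question_aug = 50_000  # more tool calls and reflection loops

cost_may = price_per_1k_tokens_may * tokens_per_question_may / 1000
cost_aug = price_per_1k_tokens_aug * tokens_per_question_aug / 1000
print(cost_may, cost_aug)  # 0.25 vs 0.25 dollars per interaction
```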
Euro Beinat [00:11:52]: There is one thing that we can always measure, and that is time saved. Here you can see again the model that lives inside Toqan: it looks at the tasks, which are the rows, and then these bars show the intensity of requests — how popular each request is — while the small numbers show the time saved. This model has been trained to recognize how much time you save every time you ask for something, in comparison to the best alternative you have. It also accounts for the fact that sometimes we waste your time, because you try to get something and you don't get what you want. Overall it is about 48 to 50 minutes per day, roughly 10%. The first question is: what do I do with that? The second question is: is this the main and most important thing? If you ask users, it is not.
Euro Beinat [00:12:42]: The answer is no. What users say is that yes, they can work a little bit faster, but the main thing is that they can work more independently: they don't have to tap a colleague on the shoulder to get an answer, they can find it themselves. The third thing is that they can work outside their comfort zone: they might not be good at coding in a certain language, but they become good at it because they have this tool available. And the fourth is that it gets them going, so it eliminates writer's block. If you want, you can tie these things together as making the entire organization a bit more senior, which is probably far more important than the 50 minutes saved per day — also because those 50 minutes are spread over a long array of micro-productivity bursts, which are non-trivial to turn into automation.
Euro Beinat [00:13:31]: However, the other thing we learned is that if you really want to get the value of all this, you need to change the way you work — not only improve the way you work, but change it. That seems obvious, but when you see it in practice, with numbers, it is much more impactful. Here is iFood. iFood is the largest food delivery company in our group, based in Brazil, and we worked with them to give everybody access to data in English. It means that everybody who has a question about data — be that in customer support, operations, marketing and so on — can ask the data lake in English. You can see that it has an impact in time saving: it automates about 21% of requests, meaning that 21% of requests can be handled through the agent, through the assistant, without going to the data lake and writing the queries yourself. That's the automation we mean.
Euro Beinat [00:14:27]: The other thing you see on the right is that in some cases you have an automation effect, but you also have an additional-work effect. In this case, the fact that you can solve these data questions with the tool makes it possible to work through a vast backlog of questions that otherwise would have taken much longer to resolve. We thought the first users of this tool would be the data analysts — the people who are experts with the data and know how to query it. And that was, in fact, the case. However, the real savings, the real change, comes when everybody else is answering their questions directly. We don't really want the data analysts to be the only users of this; actually, we want the opposite.
Euro Beinat [00:15:14]: We want the entire organization to be good at querying the data, in such a way that you don't have to go to a data analyst for this. And this is exactly what happened. Most of the questions are solved at the source, which removes about 190 to 200 person-days per month from the data analysts and altogether leads to a 75% reduction in time to insight. So yes, time saved is important, but the more important thing is that once we realized this, we needed to act on the organization to make sure it actually happens. This is what we learned from applying some of these things. We have published blogs about this; if you are interested, please reach out.
Euro Beinat [00:15:53]: We're happy to share. There are many other things that we learned, but they have to do with the way we develop applications. Those use the same technology that powers Toqan, and I want to invite Paul on stage, who is going to share some work on Magic.
Paul van der Boor [00:16:11]: Thank you, Euro. So indeed, we're talking about two agents in production today, and the second one is one we co-developed with one of the companies in the group, called OLX. OLX is a very large online marketplace for classifieds: you can buy and sell second-hand goods but also services — think of electronics and fashion, but also real estate, cars, jobs and so on. They're present in many countries all over the world, and we've been working with them to figure out what the next generation of the shopping experience, powered by GenAI, looks like. So I want to share that with you.
Paul van der Boor [00:16:56]: We called it OLX Magic, and before I explain it, I'm going to show you a demo. It is powered by agents, just like Toqan; it borrows a lot of the technology we built for Toqan and puts it inside this experience. I want you to look at some of the features highlighted throughout the experience — many of them are powered by GenAI to better help the buyer discover and find the products they are looking for. The app is currently live, particularly in Poland, and a lot of it looks and feels like a traditional e-commerce experience. You go in, you put in a query like "espresso machine", and immediately you get a set of results. Those results — you can see the blue, teal-colored features — are then refined: it suggests whether you want to look for semi-automatic espresso machines, or other filters and criteria that help the user narrow down what they're looking for.
Paul van der Boor [00:17:57]: Then the results get updated according to the choice the user made, as suggested by OLX Magic. You can also see custom carousels that are generated on demand for every search query, personalized to the user. You can also refine your search and say, "actually, I'm looking for machines with a milk frother", and a custom filter — one you've essentially created by describing it — gets added to the search query, and the results are updated live, again powered by GenAI and LLMs in the background. One of the features I love most about this experience is the highlighted, contextual filters per item — this one has a milk frother, this one doesn't — which try to surface the things you care about as a buyer. Another GenAI-powered feature is smart compare: two products selected by the user are put side by side, with criteria identified on the spot for those particular items, plus a recommendation for the end user based on the query so far. So, as you can see, the experience of finding and interacting with items on the OLX website is now infused and sprinkled with all sorts of GenAI features, and we've seen from users that this really changes the way they interact and find what they're looking for.
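As a rough illustration of how a free-text refinement such as "with a milk frother" could be turned into a structured filter of the kind just described, here is a hedged sketch. The prompt, the filter schema and the `llm()` helper are assumptions for illustration, not the OLX Magic code.

```python
# Sketch: turning a free-text refinement into a structured search filter.
# The llm() helper and the filter schema are hypothetical.
import json

def llm(prompt: str) -> str:
    raise NotImplementedError  # call to an LLM of your choice

def refinement_to_filter(user_text: str) -> dict:
    prompt = (
        "Convert the shopper's refinement into a JSON filter with keys "
        "'attribute', 'operator', 'value'.\n"
        f"Refinement: {user_text}"
    )
    return json.loads(llm(prompt))

# Example: refinement_to_filter("I'm looking for machines with a milk frother")
# might return {"attribute": "milk_frother", "operator": "equals", "value": true},
# which is then applied to the result set alongside the original query.
```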
Paul van der Boor [00:19:28]: It is not just a query and a set of results, but a much more assistive and also advisory experience, powered by OLX Magic. And since the theme of the conference is Agents in Production, I also wanted to highlight some of the tooling we've built and put under the hood of OLX Magic to make this possible. As you can see, it has a lot of the elements that traditional agent frameworks and products have, that we've seen others talk about. It has access to tools. It's powered by a set of very powerful large language models. It has the capability to plan — to figure out which steps it actually needs to take to get to the right answer. It has memory, which is important not just for the context of the conversation, but also the context of the particular user trying to find something on the OLX website. Of course, the tools that OLX Magic has access to are very specific to its job: it can search the OLX catalog, both by visual search and by text search.
Paul van der Boor [00:20:34]: It has access to the web — web search — so it can also pull in the latest information about what people are saying about different products. Together, that creates the ability for OLX Magic to answer and guide the user through the journey. Now, one of the things I also want to highlight is what we've learned, again, as you put agents into production with these new GenAI-powered products. There's a lot we are discovering along the way, and the first lesson is about the actual user experience. Probably precipitated by the ChatGPT moment, everybody out there was trying to build a ChatGPT for X — for OLX, for iFood, or for any other website or online experience you can imagine — which basically starts with a blank-canvas conversation where you can write and ask about, for example, a marketplace. What I've just shown you is a couple of product iterations down the road, because we learned that this blank-canvas conversational experience is so far from what a user expects in a commercial setting that we needed to go back to the patterns users are familiar with when they search for items online. The second lesson is about agents and the frameworks we build: it isn't just about agents, it's about e-commerce agents.
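As a hedged illustration of the components just listed — tools, planning, memory — here is a minimal sketch of how an e-commerce agent's tool set and per-user memory might be declared. All names are hypothetical and are not the OLX Magic internals; the point is only to show tools specialized to the shopping job plus memory that carries both conversation context and user preferences.

```python
# Sketch of how an e-commerce agent's tools and per-user memory might be
# declared. Names are hypothetical, not the OLX Magic implementation.
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Tool:
    name: str
    description: str
    run: Callable[..., object]

def catalog_text_search(query: str): ...
def catalog_visual_search(image_url: str): ...
def web_search(query: str): ...
def compare_items(item_ids: list[str]): ...

TOOLS = [
    Tool("catalog_text_search", "Search listings by text", catalog_text_search),
    Tool("catalog_visual_search", "Search listings by image", catalog_visual_search),
    Tool("web_search", "Look up product reviews and specs on the web", web_search),
    Tool("compare_items", "Compare two selected listings side by side", compare_items),
]

@dataclass
class UserMemory:
    # Conversation context plus longer-lived preferences used for personalization.
    conversation: list[str] = field(default_factory=list)
    preferences: dict = field(default_factory=dict)  # e.g. {"budget": 300, "city": "Warsaw"}
```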
Paul van der Boor [00:22:01]: What I mean by that is that the agentic systems we put together need to cater to the e-commerce journey, which means the tools, the prompts, the evals and the other components — the memory, for example — need to be tailored specifically to personalization, and the guardrails need to be built for an e-commerce journey. So you end up building an almost domain-specific type of agent and agentic system. In many ways it builds on the same components we talked about for, say, Toqan, which is a much more horizontal agent experience, but it is refined and constrained to that e-commerce role. The third lesson is that your GenAI system is only as good as your search, which is connected to the second point about e-commerce agents. The search journey and the pipelines we had built historically didn't work for the OLX Magic experience, and we needed to rebuild them from the ground up: think about how we created embeddings for listings based on titles, images and descriptions, and then built an entire retrieval pipeline, using prompts and LLMs to filter which items were most relevant given the user conversation. It required a lot of work to get us to this point.
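For concreteness, here is a minimal sketch of a listing-retrieval pipeline of the kind described above: embed each listing from its title and description, retrieve nearest neighbours for the query, then let an LLM filter the candidates against the conversation. The `embed()` and `llm()` helpers and the listing format are hypothetical stand-ins, not the production pipeline (which, per the talk, also uses images).

```python
# Sketch: embed listings, retrieve by similarity, filter with an LLM.
import numpy as np

def embed(text: str) -> np.ndarray:
    raise NotImplementedError  # any sentence-embedding model

def llm(prompt: str) -> str:
    raise NotImplementedError

def index_listings(listings):
    # listings: iterable of dicts with "id", "title", "description"
    return [(l["id"], embed(l["title"] + " " + l["description"])) for l in listings]

def retrieve(query: str, index, k: int = 20):
    # Rank listings by cosine similarity to the query embedding.
    q = embed(query)
    scored = [(i, float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v))))
              for i, v in index]
    return sorted(scored, key=lambda s: s[1], reverse=True)[:k]

def llm_filter(conversation: str, candidates, listings_by_id):
    # Ask the LLM which retrieved listings actually match the conversation so far.
    titles = "\n".join(f"{i}: {listings_by_id[i]['title']}" for i, _ in candidates)
    return llm(f"Conversation: {conversation}\nCandidates:\n{titles}\n"
               "Return the ids that truly match, comma separated.")
```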
Paul van der Boor [00:23:20]: So I think this is the moment where we take a couple of questions. I think Euro and Demetrios are ready to join us on stage, in case you have any questions about this particular set of agents in production that we shared with you today.
Demetrios Brinkmann [00:23:37]: There are a lot of questions coming through. I think one of the most popular ones in the chat is: how long — how many human hours — did this take to put together?
Euro Beinat [00:23:50]: Which one — the assistant, Toqan, or Magic?
Paul van der Boor [00:23:54]: Well, Magic has been around for a couple of months now, but as I mentioned, it's gone through various iterations. There are a lot of different things we discovered, and the experience has changed a lot — the underlying tools, the LLMs, the prompts. So we're on a journey that started, like I said, months ago. For Toqan, it's more like years, as you described, Euro. That's something we started with some of the earliest LLMs out there, so we've been on this journey for a while now.
Euro Beinat [00:24:25]: Yeah, yeah.
Demetrios Brinkmann [00:24:26]: And what about Toqan?
Paul van der Boor [00:24:28]: Yeah, Toqan has been the one that's been around the longest.
Euro Beinat [00:24:30]: Over two years. Yes.
Demetrios Brinkmann [00:24:31]: Do you have a number of people that are working on it? Actual human hours?
Euro Beinat [00:24:38]: Well, we don't know the human hours, but the Toqan team was about 15 people, something like that.
Demetrios Brinkmann [00:24:43]: Okay, okay, cool. 15 people working two years.
Paul van der Boor [00:24:46]: Powered by agents as well, so.
Euro Beinat [00:24:48]: Powered by agents.
Demetrios Brinkmann [00:24:49]: Oh, well, that gets to a great question. On the way you're looking at how you're saving time — what is that? Can you break that down a little bit more? Is it just subjective, or is it...
Euro Beinat [00:25:00]: No, no, no. So we trained a model to understand the difference between working with and without the assistant. There is a model that lives inside Toqan, and the only thing it does — it has been trained first, right, it's an LLM — is look at every interaction.
Euro Beinat [00:25:16]: It looks at it and says: this interaction is about, I don't know, debugging code. And for this type of task, we have seen through the training set that you save a certain amount of time. So it actually estimates the time saved by looking at the interaction and computing, let's say, comparison estimates.
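For a concrete picture of this kind of estimation, here is a minimal sketch under the assumptions just described: classify each conversation into a task type with an LLM, then apply a per-task time-saved estimate learned from a labelled training set. The categories, the minutes and the `llm()` helper are hypothetical, not Toqan's actual model.

```python
# Sketch: classify the conversation, then look up an estimated time saved.
def llm(prompt: str) -> str:
    raise NotImplementedError

# Hypothetical average minutes saved per interaction, by task type,
# relative to the user's best alternative (negative for wasted attempts).
MINUTES_SAVED = {"debug_code": 12, "draft_text": 8, "data_query": 15, "failed_attempt": -3}

def estimate_time_saved(conversation: str) -> int:
    task = llm(
        "Classify this conversation as one of: "
        f"{', '.join(MINUTES_SAVED)}.\nConversation: {conversation}"
    ).strip()
    return MINUTES_SAVED.get(task, 0)
```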
Demetrios Brinkmann [00:25:35]: So here's another great one. Now the chat's coming through for real. There's a lot of questions.
Euro Beinat [00:25:41]: All right, we've seen many about costs, because we touched on that. Let's say: why do the costs go up while the token costs go down?
Paul van der Boor [00:25:48]: Right.
Euro Beinat [00:25:49]: I think we haven't published anything specific about how many tokens we use in the end-to-end journey, from the moment you have a question to the moment you get the answer, with all the loops in there. We probably should do that. I don't think we have. Right.
Paul van der Boor [00:26:01]: We haven't. But I can see there are builders watching, because this is probably the most exciting part: you can see how the number of tokens per question changed once we introduced agents, and how the cost per token changed as we introduced cheaper, smaller models for specific tasks. So there are a lot of very interesting applied lessons that we learned, and we should maybe publish a little bit.
Euro Beinat [00:26:20]: Yeah, we should just publish it, because it's for everybody.
Demetrios Brinkmann [00:26:22]: On this, someone was saying that potentially there are ways to do prompt caching to minimize cost. Are you doing any of that?
Paul van der Boor [00:26:29]: Yeah, we do. I mean, we've probably tried everything you can imagine in terms of optimizing cost: switching models, picking the right model for the right question, caching, using reserved instances, our own GPUs. We've tried all the combinations you can imagine.
Euro Beinat [00:26:47]: This is the best we can get so far.
Demetrios Brinkmann [00:26:48]: All right, well, speaking of money — this is the last question, and there's a ton in the chat, so I'll let you guys answer the rest there — how are you looking at the ROI of OLX Magic?
Euro Beinat [00:27:02]: Oh, in terms of... Listen, OLX Magic has just started. We haven't published anything yet, but what we measure are the metrics that make sense for e-commerce: transactions, engagement, and so on. We see the upside, but this is something we are not sharing yet, because it's still a project in development and it's developing rapidly. I can tell you it is positive.
Paul van der Boor [00:27:26]: In general, we look at this as: make it work, then make it fast, and then make it cheap.
Demetrios Brinkmann [00:27:31]: Make it cheap at the end. Well, awesome, guys. I really appreciate you doing this.