MLOps Community

Tool definitions are the new Prompt Engineering

Posted Dec 23, 2025
# Prompt Engineering
# AI Agents
# AI Engineer
# AI agents in production
# AI agent usecase
# system design

Speakers

Chiara Caratelli
Data Scientist @ Prosus Group

I'm a Data Scientist with a PhD in Computational Chemistry and over nine years of experience using data to solve complex problems in both academia and industry. Currently, my work at Prosus involves developing practical solutions using advanced machine learning methods, with a particular focus on AI agents and multimodal models.

I thrive in fast-paced environments where I can collaborate across different fields to create impactful, real-world applications. I'm passionate about exploring new technologies and finding creative ways to integrate them into meaningful solutions.

Outside my day job, I enjoy experimenting with machine learning projects, automation, and creating content to share my experiences and insights. I'm always eager to learn from others, exchange ideas, and build connections within the data science community.

Alex Salazar
Co-founder/CEO @ Arcade.dev

Alex Salazar is a seasoned Founder/CEO and auth industry veteran. As VP of Product Management at Okta, Alex led their developer products, which represented 25% of total bookings, and managed their network of over 5,000 auth integrations. Later, as GM, he launched an auth-centric proxy server product that reached $9M in revenue within its first year.

Demetrios Brinkmann
Chief Happiness Engineer @ MLOps Community

At the moment Demetrios is immersing himself in Machine Learning by interviewing experts from around the world in the weekly MLOps.community meetups. Demetrios is constantly learning and engaging in new activities to get uncomfortable and learn from his mistakes. He tries to bring creativity into every aspect of his life, whether that be analyzing the best paths forward, overcoming obstacles, or building lego houses with his daughter.


SUMMARY

Agents sound smart until millions of users show up. A real talk on tools, UX, and why autonomy is overrated.


TRANSCRIPT

Demetrios Brinkmann [00:00:00]: We've got Chiara to my left, Alex to my right. CEO of Arcade. Chiara is working on iFood. She's at Prosus, lead data scientist.

Chiara Caratelli [00:00:08]: Data scientist?

Alex Salazar [00:00:09]: You're leading, in our book.

Demetrios Brinkmann [00:00:15]: You've been working on a project at iFood. You gave a talk last night. I loved it. I wanted to bring you on here to talk about some of your learnings.

Chiara Caratelli [00:00:22]: For those who don't know, iFood is the biggest food delivery company in Brazil. It has a huge order volume, 160 million per month. Brazilians use it a lot. What we realize is that users often go to the app and they don't know what to order. They are a bit undecided and sometimes they get frustrated because they have too many options. It's a paradox of choice, right? So the problem we solve with this agent is to help them decide. We built something that knows who the users are, knows their preferences, their habits and the price range they're willing to pay. And we use that plus the user's questions to suggest the best options for them, even proactively.

Chiara Caratelli [00:01:06]: And we have a flow where the user can search and refine and then order at the end. We prepare two interfaces for the agent. One is in the app and one is on WhatsApp. WhatsApp is very popular in Brazil. I think there are 160 million active users. So they use it a lot. If you don't know, people use it to order food with just voice messages. It's really smooth.

Chiara Caratelli [00:01:35]: They just send a voice note to the restaurant and they can order like that. So we leverage this familiarity with the conversational interface. Both had different challenges in terms of UX, how we show results to the user. It often happens with agents, right? The biggest challenges are not AI related, but UX and adoption related. So the agent that we built is kind of a ReAct agent. We didn't use a fancy multi-agent setup because we needed things to go very fast. Users are hungry, they don't want to wait. So we need to have the simplest flow.

Chiara Caratelli [00:02:16]: We need to make sure that the recommendations that we give to them are good, they work for them. And we also need to remember if they don't like certain foods; the agent really needs to be a smart companion, almost reading their mind. One of the things we wanted to make sure is that the conversational interface was not just textual and there were multiple modalities for the user to talk to the agent, because when they're hungry, they don't want to spend time typing. So we made sure that the tools that we built were connected to UX and UI elements, so that the user could directly interact with them, to have some short circuits, let's say. It's like swiping. Oh, so yeah, we also implemented a swiping interface.

Demetrios Brinkmann [00:03:03]: Voice also.

Chiara Caratelli [00:03:04]: Also voice, yeah, yeah. So voice is a modality, but it's still textual in a way. What we did was also making sure that there were buttons to click to do actions quickly without having to type. Like, "I want the third item that you showed me," that's not needed. One interesting thing we noticed is that people on WhatsApp leaned way more towards this conversational behavior than in the app. In the app, people really want to use buttons. They expect a different type of interface. So our tools needed to work very well. One of the challenges we encountered was that the tool definitions that we had made a lot of sense for us, but they didn't make sense to someone external who would use them for the first time.

Chiara Caratelli [00:03:53]: We realized this afterwards, of course, with trial and error. What we noticed was that we created the tools, right, and things made sense. But as soon as we were getting edge cases in production, we were adding these to the tool: "Whenever the user wants to order something and you don't have enough information about it, make sure to call the get information tool." There are a lot of edge cases for these flows, and it quickly becomes code, like if-else statements that you don't want. Right.

Chiara Caratelli [00:04:28]: So the exercise we did was to try to standardize these tools. Like, think: if I would share this tool with another team and they would want to use it in their own agent, how would I write things there? The idea is to make them as clear as possible. Would the name of the tool even make sense? Right. And after that we had a massive improvement in latency, because we could cut a lot of tokens and our system became much more stable.
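To make the standardization exercise concrete, here is a minimal sketch contrasting an edge-case-encrusted tool definition with one written as if another team would consume it. All names and fields are hypothetical, not iFood's actual tools.

```python
# Hypothetical illustration: a tool description that accreted production
# edge cases, versus one rewritten for an external consumer.

bloated_tool = {
    "name": "get_info",
    "description": (
        "Gets info. Call this whenever the user wants to order something and "
        "you don't have enough information about it, but NOT if the cart "
        "already has items, unless the user asked about promotions, in which "
        "case call it only after the search tool..."  # and so on, per edge case
    ),
}

standardized_tool = {
    "name": "get_item_details",
    "description": "Return price, options, and availability for one menu item.",
    "parameters": {
        "type": "object",
        "properties": {
            "item_id": {
                "type": "string",
                "description": "Catalog ID of the menu item.",
            }
        },
        "required": ["item_id"],
    },
}
```

The shorter definition also cuts tokens on every request, which is where the latency win she describes comes from.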

Alex Salazar [00:04:57]: Making sure that a tool encapsulates the right things is the hardest part of the problem. And I think people really struggle with that. They think of it as an API. Oh, we just have an API, we'll call it. Well, that's not really going to work. There's a lot of usability. It's like, well, what's the user experience going to be? What's the intention that the user has? What's the intention that the agent needs to have? And then figuring out, okay, well, what's the right encapsulation of that in a tool is what everybody's struggling with. It's a new paradigm.

Alex Salazar [00:05:30]: It's a mental model. What we see as a best practice as people try and build that is that, you know, there's like a laddering of tools. And so you might have a shared tool that is a workflow, or maybe even an MCP server that's integrating to another service. But even if you have those things, the agents tend to do better if you then build a very domain-specific or agent-specific tool to capture that particular agent's nuances. And so when you get that, you get both the accuracy and the lower latency that you're looking for, because you're pushing more off to deterministic software. And then you're able to abstract out the common elements so that other teams can reuse them really easily.

Demetrios Brinkmann [00:06:22]: I love how you talk about the difference between just a basic API and then an agent needing to consume some kind of a service, and that one key word, which is the intention, and how that intention plays such a big part in what it is trying to consume.

Alex Salazar [00:06:42]: Yeah, I think there's a lot of confusion right now, because every engineer is familiar with APIs. And then you throw in MCP and people go, oh, it's the same thing. Let me just wrap an API in MCP and success. We have an agent and we have tools. And it doesn't work, for a bunch of reasons. MCP is just a wire protocol. But more importantly, tools are not APIs. And so in engineering speak, an API is a service contract for the downstream service. So we'll use Google Drive as an example.

Alex Salazar [00:07:16]: It has an API. That API is a service contract on how Google Drive works. And yes, there's a lot of work that goes into building a really nice Google Drive MCP server. Like we have to make it a bit more workflowish, make it more intention based to kind of remove or at least abstract away the structured inputs that are required. So an LLM has no idea what a Unix timestamp is. So your MCP tool has to know what yesterday is as a concept and then translate that to Unix timestamp. But even with all of that, if you're building a sales agent, for example, and a rep is going to ask, hey, I need the brochure for this product because I'm going into this customer account, the agent doesn't care at all about Google Drive. That's not its intention.

Alex Salazar [00:08:09]: Its intention is to find the brochure that it needs. And so even if you give it a pristine, beautiful Google Drive MCP server, you're asking for higher latency and you're asking for hallucinations, because now you have to stuff the context window with explanations of how to find the brochure using the MCP server, which means it's going to have to take multiple turns figuring out what the right folder is and where the right files are and how to determine which ones are brochures and which ones aren't brochures and which is the right brochure. But if you instead give it a get brochure tool, that get brochure tool can call the Google Drive MCP server. Then all of a sudden, all the agent has to do is say, oh, I need a brochure. There's a get brochure tool. I have the context that I need. Let me submit that as arguments, and then it's done. It's one call.

Alex Salazar [00:08:59]: It's low latency. You don't stuff the context window. You minimize the amount of tool definitions that you pass over. Then likely, much of the code inside of the get brochure tool is deterministic. You know, chances are it's not calling a model, or if it is, it's calling one maybe once, for something very specific. So the whole system just gets more accurate and faster. And so what that means is a tool is actually kind of the inverse of an API. While an API is a service contract of what a downstream service looks like, like Google Drive, a tool, in my opinion, is kind of like the service contract for the agent's intentions. It's what it expects to do.

Demetrios Brinkmann [00:09:45]: Okay. And being able to write the proper tool definition is really the key.

Alex Salazar [00:09:52]: It's like half the battle. Yeah, it's like half the battle.

Chiara Caratelli [00:09:54]: That's why evaluations are so important as well.

Demetrios Brinkmann [00:09:58]: Well, I think also I just want to highlight such an important thing that you said yesterday and just now, again, which is let somebody else look at your tool definitions. Let them see if they can understand it. And if they can, then you can try it with the agent. But if they can't, you already know. All right, this is probably where the problem is.

Chiara Caratelli [00:10:20]: Yeah, exactly. And this exercise really forces you to have clear tool definitions. And I think it's also important to try to limit the agent's choices as much as possible. If there is a lesson I've learned in this year of building agents, it's this one. It decreases hallucinations. It decreases context bloating as well. If you have tools that are always called together, like in the get brochure case, it makes so much more sense to create a workflow and encapsulate that into a tool. A tool is something the LLM can use to take action.

Chiara Caratelli [00:11:03]: It doesn't need to know what's inside. It can even be another agent. Right. So yeah, one way is also to use a multi-agent setup, of course, but from the point of view of the main agent, it doesn't really matter.

Alex Salazar [00:11:17]: Yeah, we think a lot about this. So yeah, in a perfect world you give it one tool. In a perfect world, the agent doesn't have to think at all and you don't even need an LLM. Right, because it's faster, it's cheaper, it's deterministic. But that's not the real world. The power of an agent is that it can handle generality, and so there's this careful balance. Sam, my co-founder, talks about turning the knob on determinism. And so you can give it fewer tools, and arguably today you should, because you want to minimize error rates. But the point of all of this, where all this is going, where all the investments are going, where the model companies are investing heavily, where the orchestration systems like LangGraph are investing heavily, where we're investing heavily as a world, is the opposite: where you can turn up the non-determinism, you can turn up the ability to give it as many tools as you want, and then let the agent intelligently decide what to do. That's very hard.

Alex Salazar [00:12:33]: That's where we're going to get to at the limit. We ourselves have achieved incredible things in the lab that we haven't yet announced. We're making it possible to kind of turn that dial up on the tool level. But for most people today, you're right. You're better off really thinking through in a multi agent system which are the right tools based on the node that I'm in or the state that I'm in and trying to be very specific. But the benefits of giving it more tools are huge because then you have fewer nodes, the agent's more intelligent, but it really depends on what you're trying to do.

Chiara Caratelli [00:13:15]: Yeah, also I think we haven't touched on this topic, but the way you construct tool output is also important, because you can put a lot of instructions there as well. So by limiting the amount of choices, I don't mean the agent shouldn't have tools, like we should try to make it as deterministic as possible. We can still leave freedom, but I think we need to be smart in where we put the information, where we put the instructions. So if, for instance, I always have certain options for tools after a given tool is called, I can put the instructions in the tool response. I don't need to bloat the system prompt with those instructions.
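A minimal sketch of that pattern: the tool's return value carries the follow-up instructions, so they enter the context only on turns where the tool was actually called. Function and field names are illustrative.

```python
def run_search(query: str, user_id: str) -> list[dict]:
    # Placeholder for the real, deterministic catalog search.
    return [{"item_id": "sku-1", "name": "Margherita pizza", "price": 39.9}]

def search_catalog(query: str, user_id: str) -> dict:
    results = run_search(query, user_id)
    return {
        "results": results[:5],
        # Next-step guidance rides along with the response instead of
        # permanently bloating the system prompt:
        "instructions": (
            "Present at most 3 options. If the user picks one, call "
            "add_to_cart(item_id). If none fit, call refine_search(filters)."
        ),
    }
```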

Demetrios Brinkmann [00:13:59]: I've heard that a ton. That dynamically inserting context, that's one way of doing it. There's all kinds of fancy ways that you can make sure to leave things out that need to be left out, and then serve them when the agent needs them. It's like, this is a need-to-know basis here.

Chiara Caratelli [00:14:15]: Yeah.

Alex Salazar [00:14:16]: Well, I think I'll go back to this concept of layering up tools. If you look at APIs, this pattern already exists. You've got your low level system APIs, you've got your workflow APIs in the back end and then you have the APIs that the mobile application or the JavaScript talks to. There are three different sets of APIs. When you get all the way to the top where it's a JavaScript app talking to the back ends, those APIs are very, very specific to that application most of the time. Similarly, in what you're describing, and we talked a little more about this last night, so I'm going to steal from last night. You can insert UI code in the response, you can present a table. It doesn't have to be just text.

Alex Salazar [00:15:08]: You can present a React component in a tool; that gets very agent specific. But that's kind of the point. You minimize the amount of work being done by the rest of the system by the tool carrying and doing a lot of the heavy lifting. And you know, I'll go back to the Google Drive example, right? Sure, it can pull the brochure, but it can also go pull a bunch of context that you know the model's going to need on the next turn.
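As a rough sketch of UI-in-the-tool-response: the tool can return a declarative component spec next to its text, and the app decides how to render it. The `card_list` schema here is invented for illustration.

```python
def suggest_dishes(user_id: str) -> dict:
    dishes = [{"item_id": "sku-42", "name": "Feijoada", "price": 54.0}]
    return {
        "text": "Here are some options based on your usual orders.",
        "ui": {
            "component": "card_list",
            "items": [
                {
                    "title": d["name"],
                    "subtitle": f"R$ {d['price']:.2f}",
                    # Tapping the card short-circuits straight to a tool call:
                    "action": {"tool": "add_to_cart",
                               "args": {"item_id": d["item_id"]}},
                }
                for d in dishes
            ],
        },
    }
```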

Demetrios Brinkmann [00:15:38]: Yeah. It reminds me of a conversation I had with Zach from Sierra and he was saying that a lot of times, since they're doing voice agents and it's real time voice agents, what happens is they'll have almost like a supervisor agent that will recognize and preemptively assume if this conversation is going in a direction that I think I may need this context for, they just go and grab it just in case it comes up and all right, we have it now, I can give it to you. And then you don't have that user experience where the person is waiting on the other line because the context needs to go and you need to grab it and bring it back. And that takes a little bit longer. It's just like, let's have everything kind of loaded up and then if we need it, we can serve it to the agent that's interacting with the human.

Alex Salazar [00:16:30]: Well, I mean, I think that speaks to, you know, how quickly the history has changed. It's November of 2025. Right. And the conversation is no longer about accuracy and consistency being the blocker to production. Five months ago, maybe even less, that was the only conversation we were having. Now we're talking about latency. Yeah, that is the biggest problem. We're going to prod now, but now we're just having a mediocre experience because we're all waiting 30 seconds.

Alex Salazar [00:17:03]: And it's amusing to me because if we were waiting 30 seconds five months ago, we would have been totally cool with it. But now we're like 30 seconds.

Chiara Caratelli [00:17:12]: This also connects to what I said earlier about WhatsApp versus app experience. So on WhatsApp, people are totally fine with waiting because they expect it. It's a familiar interface. They don't expect it to go so fast, probably because normally you have another person speaking on the other side, but on app, you're punished if you don't deliver in time. The expectations are completely different, even for the same user.

Alex Salazar [00:17:38]: In a perfect world, where you had user feedback from everybody who's interacting with your agent in real time, I'd be very curious to see a generational distribution on patience for waiting for the agent. Because one of the things that I see, but I'm very curious about, is that the generation of people who are, let's say, under 30 expect everything to be agentic. If they see a menu where, you know, they've got to click around, they just bail. The older generations are the opposite. They expect a snappy UI, very snappy and responsive buttons and clicks. But I'm very curious what the generational distribution might be on patience for latency.

Demetrios Brinkmann [00:18:32]: And I also feel like when you're in WhatsApp, you can go and you can talk to your friend, you can look at something else and then come back. And so you get that. Because I imagine that you get a notification when the agent is done and it's sending you the information. You don't have to sit there and wait the 30 seconds. Looking at the WhatsApp chat, when I'm in WhatsApp, I'm talking with three or four people at the same time and I'm going back and forth between those conversations.

Chiara Caratelli [00:18:58]: Exactly. The data that we got was really clear about this. That's why we spent so much time refining UX, thinking how to present the data to the user. There's a lot of work done around the agent that is not the agent itself, and I think this was one of the biggest challenges. Also, users are familiar with AI by now, but it's still hard to trust an AI interface to suggest your food. People are pretty sensitive around that. They have their preferences.

Chiara Caratelli [00:19:30]: So we really needed to build customer adoption. Also in terms of the persona that we developed, it needed to be friendly but also engaging. We wanted people to come back. Right. It all connects to tools at the end: how you define them, how you use them, how smart you want them to be.

Alex Salazar [00:19:52]: Can I call out something you said that I think is perfectly on point today? The intersection between an application and an agent is now complete. And you mentioned earlier, right, that so many of the things that you ran into as you were building this weren't necessarily the model or the AI, it was the application. And I feel, again, speaking at least for today, we'll see what the world looks like in three months, that's itself a huge transition. The debate is now, well, how do I make this UI work properly and how do I deliver this to the user properly, as opposed to getting all caught up in the super deep ML data science of the AI.

Chiara Caratelli [00:20:44]: And this also connects to evaluations. How do you evaluate such a system? Because when we talk about evaluations, we have these standard metrics in mind, like faithfulness, helpfulness, you name it. But they're not necessarily connected to the business value that your app brings. How can you leave UX out of the evaluation? You need to take these UX elements into account. That's part of the user journey, and that's also part of the agent. It's really, really hard to separate these two.

Alex Salazar [00:21:15]: We were joking last night that every developer believes so strongly in test-driven development that they recommend every other developer do test-driven development. They're just too lazy to do it themselves. Right. But I feel with agents, especially around evals, you have to start there. And so I'm curious on this point, which I think is a really incredible point, around assessing the connection between business value and the user UI. How did you structure your evals? What were the evals you ultimately landed on?

Chiara Caratelli [00:21:50]: We have different levels of evals, right? You have those more connected to development, like making sure everything works, like regression tests, things you can evaluate with code. That's the foundation. We also have a golden dataset that we run. What we realized afterwards, and I think the evals community is shifting towards this as well, is that error analysis is really, really important. We started, like many teams, thinking of metrics like: is the agent helpful? Did the agent satisfy the user request? Which are okay, but they're very generic. You need to write them in a way that is very specific to your product. And when you don't have a product yet, this is really, really hard to define. So we were lucky enough to have a community of iFood employees.

Chiara Caratelli [00:22:49]: There are more than 7,000 people employed at iFood. So they used our app and they gave us feedback. There's nothing that can substitute looking at the data. We had to go and look at the traces, identify the errors that we were seeing. And once you have a good overview of what can go wrong in your application, that's when you can build a taxonomy of errors. So with this error taxonomy, you can build a test set, and it can inform you of things like how good you're doing with your evaluation and whether the agent is performing better or not. This is connected to LLM-as-judge. When you write an LLM judge, having this very specific domain knowledge helps a lot.

Chiara Caratelli [00:23:40]: So we couldn't separate it from error analysis. It's something we learned later. If I would start a new project, I would start from that straight away.
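To make the taxonomy-driven judge concrete, here is a hedged sketch: the judge rubric is generated from an explicit error taxonomy rather than generic "helpfulness". The labels are illustrative, not iFood's, and `call_llm` is a placeholder for whatever model client you use.

```python
ERROR_TAXONOMY = [
    "ignored_dietary_restriction",  # e.g. suggested meat to a vegetarian
    "wrong_price_range",
    "redundant_clarifying_question",
    "tool_misfire",                 # called a tool with wrong or missing args
]

JUDGE_PROMPT = """You are grading a food-ordering agent's conversation trace.
For each error type below, answer yes/no and quote the offending turn:
{taxonomy}

Trace:
{trace}

Return JSON: {{"errors": [{{"type": ..., "evidence": ...}}]}}"""

def judge(trace: str, call_llm) -> str:
    prompt = JUDGE_PROMPT.format(
        taxonomy="\n".join(ERROR_TAXONOMY), trace=trace
    )
    return call_llm(prompt)
```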

Demetrios Brinkmann [00:23:48]: But you're also checking for evals on the tools and which tools were called. Was it the right tool? So there's higher level objectives, did it satisfy the user, but then there's those very nuanced pieces, right?

Chiara Caratelli [00:24:02]: Yeah, we have, of course, evals for the tools. Some of our tools are smart tools. You could call them agents themselves. Like searching in our iFood catalog: you need to take into account user preferences, sort the items, pick the ones that have the most variety and, like, match with the user intent and profile. So that's just a tool, but it deserves a whole set of evaluations itself.

Demetrios Brinkmann [00:24:25]: And just, is that how you were able to see which tools were used together and then create workflows from the tools?

Chiara Caratelli [00:24:33]: We analyzed the data; we did a lot of ad hoc analysis. One of our team members is completely dedicated to this, assessing the quality of the agent and understanding what goes on behind the scenes and also what goes on in front of the user. Another thing that we did, I didn't mention yet, but I presented it yesterday, was to create an eval set that was going to be picked up by an agent that would impersonate a user. That was something defined by the product team. What we noticed was that it was very easy to define scenarios and what the agent should do in those scenarios. What was hard was to create an LLM judge that could judge any scenario. But given a scenario, you could say the agent should do this and that. So we started with a set of queries defined this way.

Chiara Caratelli [00:25:31]: And sometimes you needed some preprocessing steps to get there. Like, for instance, making sure there is already an item in the cart when you start. Things like that. And then we built an interface between the agent that would test our endpoint and the endpoint. And this interface made sure that when we were returning UI elements, these were also shown to the user-impersonating agent. So the agent could choose to click on certain UI elements, and at the end we would judge the outcome of all of this. So I think this allowed us to also test UX in a way. It doesn't substitute A/B testing, of course; online testing is another story. But it helped us to identify regressions.
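A minimal sketch of that simulation loop, under the assumptions above: a scenario defines the persona, goal, and rubric; a user-impersonating LLM converses with the agent endpoint (and can "click" returned UI elements); a judge grades the transcript against the scenario's rubric. All names are illustrative.

```python
from dataclasses import dataclass, field

@dataclass
class Scenario:
    persona: str          # e.g. "vegetarian, in a hurry"
    goal: str             # e.g. "order a veggie burger under R$40"
    setup: list = field(default_factory=list)  # preprocessing, e.g. pre-filled cart
    rubric: str = ""      # what the agent should do in this scenario

def run_scenario(scenario, agent_endpoint, user_llm, judge_llm, max_turns=8):
    transcript = []
    message = scenario.goal
    for _ in range(max_turns):
        reply = agent_endpoint(message)   # may contain text and UI elements
        transcript.append(("agent", reply))
        # The impersonating LLM sees the UI elements too, and may respond
        # by "clicking" one rather than typing.
        message = user_llm(persona=scenario.persona, history=transcript)
        transcript.append(("user", message))
        if message == "DONE":
            break
    return judge_llm(rubric=scenario.rubric, transcript=transcript)
```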

Alex Salazar [00:26:24]: On the tool side, how much work went into evaluating the tools themselves, to make sure that you designed them, or at least their definitions and parameter definitions, properly, and to see if the model was selecting them correctly at the right time?

Chiara Caratelli [00:26:40]: Yeah. So we did evaluate if the agent was calling the right tools, of course. It depends on the tools. I think some tools are so clear to use, like create cart. Some definitions are so clear that you don't need to spend too much time on that. You just want to know that the agent can pick it up correctly. But some others are whole workflows, like the searching I was talking about before. That had a lot of evaluation being done, and there is still a lot going on there, because it's basically building a recommendation system.

Chiara Caratelli [00:27:17]: Right.

Demetrios Brinkmann [00:27:18]: With tools.

Chiara Caratelli [00:27:20]: Yeah. So what we have is a user who is searching for something or maybe just exploring options. They just say, I'm hungry, surprise me, give me promotions without any intent. What you're building is a homepage in the agent. So you want to show options that are really, really good for them, even when they don't specify anything that's a recommendation problem.

Demetrios Brinkmann [00:27:49]: Are there ways that you're plugging in? Because I imagine when it comes to the workflow, you're plugging in structured data and unstructured data. In this, you're giving the context to the agent of the last three, five meals that this person ordered, when they ordered, the timing, what they like in general, all of these features, quote unquote, that you would normally put in a recommendation model, but now you're serving it up to the agent in different moments of that workflow.

Chiara Caratelli [00:28:22]: Yeah, yeah, exactly. We have a team at Prosus AI that is completely dedicated to building a model we call LCM, Large Commerce Model. So it's a model that was fine-tuned based on user behavior in our apps in the ecosystem, so whether a user searches for something, likes something, or orders something. Using this model, we built representations of who the users are. So it goes way beyond the last orders. We know what type of customer they are. We have some segmentation as well in there that occurred naturally while using this model.

Chiara Caratelli [00:29:04]: We know their patterns. So that's like the core of who the user is. And we have similar things for restaurants and items. So it's much more nuanced than just a list of order history. We use that in the main agent to select the best way to communicate with them, to select the tools in the best way. But within the tools, we are also connecting to our food recommendation system. When we plug in dishes, we also have models that were trained to show the best recommendation. We're doing this in several places.

Demetrios Brinkmann [00:29:45]: It's almost like there's a very complex. Yeah, there's like a RecSys tool in a way, or like, I'm going to use the recommendation tool and it calls a machine learning model.

Chiara Caratelli [00:29:55]: Yeah, yeah. There are several models for different use cases for that. Plus the agent has the brain, and we have an LLM-friendly representation of the user. Like, for instance, if you order pizza in Brazil, you can put a lot of toppings and customization. If you always put pepperoni on your pizza, I know you're a meat lover. You don't see it from the order history, but I can extract this nuanced information. So the next time you say, I want something healthy, I'm gonna propose something healthy with meat, because I know you like it. Right.

Chiara Caratelli [00:30:30]: So yeah, there is a lot of emergent patterns that you just can't get.

Demetrios Brinkmann [00:30:36]: With traditional machine learning models because they're so specific and so here you can infer it because of the LLM being that intelligence.

Chiara Caratelli [00:30:45]: Yeah, exactly. That's the beauty of exploring this field. And we learn this as we go. There is so much, so many learnings on this and we also get feedback from the user which is amazing to see.

Alex Salazar [00:31:01]: I feel like I'm now excited for this app. When are you guys going to be in San Francisco? Where are we going with.

Demetrios Brinkmann [00:31:09]: Yeah, that's it. We gotta go to Brazil. That's the easier option than them coming to San Fran. Do you have some kind of a checker agent that makes sure what's happening is actually what should be happening, like an overseer? I don't know how you would architect it.

Chiara Caratelli [00:31:27]: We have a system in place for guardrails, making sure, you know, that the response we give makes sense, let's say. But we didn't choose a full-blown multi-agent setup, because we wanted to keep it simple. Our use case is relatively simple. I think what's difficult is giving the right recommendations, having the right context, but the context of the agent is pretty self-contained. We don't have an agent to schedule a trip, for instance; that would be another agent. But the set of tools that our agent has are compatible with each other. We don't need another agent for that.

Chiara Caratelli [00:32:13]: What we did, however, was to have a dynamic system prompt that would change based on the state, so that we would not need to have so much information every time. And yeah, that of course helped with latency and having fewer choices to make.
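A hedged sketch of a state-dependent system prompt: each turn gets only the section relevant to the current conversation state, instead of every instruction all the time. The states and wording are invented for illustration.

```python
BASE = "You are a food-ordering assistant."

STATE_SECTIONS = {
    "browsing": "Help the user narrow down options; offer the swipe interface.",
    "cart_open": "The cart has items. Offer to review, edit, or check out.",
    "checkout": "Confirm address and payment method before placing the order.",
}

def system_prompt(state: str) -> str:
    # Only the active state's instructions are included, so off-path
    # instructions never consume context or tempt the model.
    return f"{BASE}\n{STATE_SECTIONS[state]}"

# e.g. system_prompt("browsing") carries no cart or checkout instructions.
```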

Demetrios Brinkmann [00:32:34]: To go back to the stratification we were talking about earlier: the way that someone interacts with iFood on WhatsApp is still almost like through the iFood app, but that's just to verify their profile, and then they go back to WhatsApp. Did I understand it correctly?

Chiara Caratelli [00:32:46]: This is just to kick off the first authorization flow. So we need to make sure that you connect to your profile. So yeah, if they write to us from a phone number that doesn't have an iFood account, we need to make sure that they create one.

Demetrios Brinkmann [00:33:05]: And then you have all of the information about that user in iFood.

Chiara Caratelli [00:33:08]: Yeah, and we connect, including the payment.

Demetrios Brinkmann [00:33:11]: Options and all of that stuff. And so the agent I'm assuming just says do you want to use your regular credit card or do you want to use one of these options?

Chiara Caratelli [00:33:21]: There are some systems in Brazil for payments that we are connecting with.

Demetrios Brinkmann [00:33:26]: Yeah, but I remember Nishi talking to me about how hard it was to context engineer the types of payment systems that people would ask for. And it goes back to what you were talking about at the beginning: you end up just adding all these edge cases to the prompt, and before you know it your prompt is so bloated. And if last night's six-hour stream taught me anything, it is: use the least amount of context necessary.

Chiara Caratelli [00:33:58]: If you think about what a user can ask in an app to order food: they could ask for a specific payment method, price range, delivery time, distance from the restaurant; or they're vegetarian, they want gluten free, they want discounts, they have a specific membership that allows them specific discounts and they want to use that. There is so much that it's really impossible to give all this information to the agent, because then the agent needs to convert that into a query. Right. If you give all this information to the main agent, it's going to blow up the prompt. So what we did was convert the query into something more LLM friendly with some basic knowledge from the main agent, and then give context of what the user wants. In a way, this is like preparing a task for another agent. So we delegate, and the main agent doesn't need to worry about anything.
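A rough sketch of that delegation, under stated assumptions: the main agent normalizes the raw request into an LLM-friendly task, and a search sub-agent owns the long tail of filters (diet, price, payment, discounts). Both functions and their contract are hypothetical.

```python
def delegate_search(user_query: str, user_context: dict, call_llm) -> str:
    # The main agent only rewrites; it never has to know filter semantics.
    task = call_llm(
        "Rewrite this food-ordering request as a structured task for a "
        "search agent. Extract any filters (diet, price range, payment "
        "method, delivery time).\n"
        f"Request: {user_query}\nUser context: {user_context}"
    )
    return search_agent(task)

def search_agent(task: str) -> str:
    # Placeholder: this sub-agent holds all the filter, membership, and
    # payment-method knowledge, keeping the main prompt small.
    return f"[search results for task: {task}]"
```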

Demetrios Brinkmann [00:35:01]: And that first touch is just through, like, a small language model or sentiment analysis, or is it still an LLM call?

Chiara Caratelli [00:35:10]: No, we use an LLM straight away for that. Yeah, it's simpler. But we implemented some classification later along the way. So yeah, we have some classifier steps to simplify.

Alex Salazar [00:35:24]: Nice. How much are you using the foundational models like OpenAI or Claude versus your own?

Chiara Caratelli [00:35:34]: I would say we use foundational models almost everywhere. We use smaller, fine-tuned models for specific steps in the workflow, like creating representations for the users, for instance; that's a task where there is a fine-tuned model. But for the conversational part we haven't explored that yet. I think foundational models are so good right now that if you don't need to fine-tune, you probably shouldn't do it. I hear a lot about fine-tuning, like, maybe we should fine-tune to fix the output. No, it's a specification problem.

Chiara Caratelli [00:36:15]: You're not specifying enough. Or maybe you're specifying too much in the prompt. If something can be fixed there, that's the first step.

Alex Salazar [00:36:24]: Well, I think that's a great point I'd love to dig into because you know so much more about this than most people. Where do you draw the line on when and where to fine tune? You guys have an incredible amount of data and I know that much of it is leveraged. Where do you decide to do that work?

Chiara Caratelli [00:36:48]: So I think one of the biggest reasons why you might want to fine-tune is cost, and latency as well. If you're building a model for a very specific task, and the model doesn't need to be able to converse with the user, it just needs to take user data and build a representation of the user, for instance, that's a very specific task. And then you want to scale it to 60 million users and you want to do it every day. A foundational model is not going to work very well. It's going to be extremely expensive. And also performance-wise.

Chiara Caratelli [00:37:27]: Right. Because at the end we're talking about embeddings, and if you're working in a very specific space like food delivery, you want the model to clearly differentiate terms that might look the same but are actually semantically different. That nuance you don't get in a foundation model. So at the end, when you're fine-tuning, you're changing the space where your model moves, and you're pushing the meanings of words that might look similar farther apart from each other. That's the thing you want to do, and we proved that for our application. It's better; actually, we A/B tested this and it gets better results. If you want to just kickstart a project, then yeah: first make it work, then make it cheap, then make it fast, et cetera.
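She doesn't describe the training setup, but a common pattern for this kind of domain embedding fine-tune is contrastive pairs, e.g. with the sentence-transformers library. This is a generic sketch, not the LCM pipeline; the base model and example pairs are placeholders.

```python
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder base model

# Positive pairs pull domain near-synonyms together; in-batch negatives push
# everything else (including look-alike but different dishes) apart.
train_examples = [
    InputExample(texts=["calabresa pizza", "pepperoni pizza"]),
    InputExample(texts=["açaí bowl", "frozen açaí dessert"]),
]
loader = DataLoader(train_examples, shuffle=True, batch_size=16)
loss = losses.MultipleNegativesRankingLoss(model)
model.fit(train_objectives=[(loader, loss)], epochs=1)
```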

Alex Salazar [00:38:27]: It's interesting, we see this as well with a lot of customers, again, today versus last year. People used to start with all of their own data and with fine-tuning; that was the beginning of the journey. Today it's the end of the journey. Instead you throw a foundational model at it. Most of the time it's good enough. You do some prompt engineering, you attach some tools and you're good to go. But when you want that extra last inch, that's where everybody's using fine-tuning. I'm curious, within the work that you've been doing, where's the dividing line between fine-tuning an existing large language model, whether it's a foundational model or an open source one, versus building your own model that might not be LLM-based?

Chiara Caratelli [00:39:17]: Yeah, it's a very fine line. Sometimes I think more traditional machine learning approaches are good to get patterns from the data, like when the data is numeric or when you have graph-type data, like collaborative filtering, for instance. So we're also working on that. I think traditional ML can find patterns that a conversational type of LLM would not necessarily find, because it can also leverage connections between user data points. You're embedding different types of information. For instance, to give an example with collaborative filtering, the model might know that we are similar users, and I order some food that you haven't ordered yet, so it will suggest you this food. I don't necessarily have this connection in an LLM that is conversational. There is a very fine line. What we are trying to do is use LLMs for finding patterns based on user behavior, things that are not just hard data.

Chiara Caratelli [00:40:31]: Like the example I mentioned before, where you order pizza with pepperoni: you might be a meat lover. I know you're not.

Demetrios Brinkmann [00:40:38]: I'm the opposite. But I understand.

Chiara Caratelli [00:40:41]: For the sake of argument, just to give you an idea. And then I can extract the filter that I will use in every query to filter the data that I show you. With traditional machine learning I wouldn't get these patterns out. It's a hard question, I think, because at the end it boils down to what type of data you want to show to the user. I think for data to feed to LLMs, sometimes it makes more sense to get a text-based representation. But when you need to do operations like search in a database...

Chiara Caratelli [00:41:22]: Then you need vectors; you need to optimize based on vector representations. Yeah, it's a very good question.

Demetrios Brinkmann [00:41:33]: It's cool to hear that you're still grappling with it and there are these pros and cons of each and obviously it's always a trade off. I like the idea of how you can leverage one and at the same time kind of plug in the machine learning models. So it's like the majority of the stuff is happening with the large language models and the agents that are going and they're doing stuff and maybe one node or one tool that it can call is a machine learning model.

Alex Salazar [00:42:00]: I think one of the things that I find most interesting about Prosus is how far out ahead and how advanced you folks are in agent building. You've got a target for, what, 30,000 agents in the organization?

Chiara Caratelli [00:42:14]: Yeah.

Alex Salazar [00:42:15]: Which is funny because, you know, we can laugh about how crazy that number is, but then how many do you actually have right now? It's more than anyone else. And so not that far off. Yeah. Like, you have at least over a thousand at this point.

Chiara Caratelli [00:42:29]: Yeah, over 10,000.

Alex Salazar [00:42:30]: Yeah. And so, you know, we can joke about 30,000, but most people are struggling to get one out. Right. And we've talked a fair bit, and I'm sure we'll talk more in different sessions, about what it took to get there, both from a technical perspective and from an organizational perspective. But you had made a really interesting point about tools and how, you know, different teams start in one way, but then eventually you start layering in governance. I'm wondering if you could share more about that.

Chiara Caratelli [00:43:03]: Yeah. I think this is a transition that happens in other organizations as well, where we are approaching GenAI with a very vertical type of vision. We're trying to build a product for a specific use case. Right. So what you do, if you're a team building this, is you go all in, very deep in this vertical direction. But if you do this and you multiply it by 10, 20 teams in an organization, you realize that often similar tools are being built. For instance: answer the user's questions using a knowledge base. What I think tools are really good for is also creating layers of governance, where, for instance, in the case of the knowledge base, there is a team that makes sure that it works very well, and then it can be shared across different teams. The trick here is that the teams defining these rules need to be very LLM savvy as well, because the interface that this tool exposes needs to be used by an agent.

Chiara Caratelli [00:44:14]: What we see sometimes is that these definitions get too verbose. There are a lot of requirements for how the tool should be used and, like, the language that should be used. Sometimes you don't need all of that, and you really need engineers to work together to define the interfaces of this. And I think if you nail that, then you can really scale up, because you can then share tools across the organization. And this also applies to agents, by the way, in multi-agent setups.

Demetrios Brinkmann [00:44:45]: It's having that ownership of the tools. Being able to clearly define: this is your tool, you're expected to keep it up to date, to make sure that it's working and that the agent can consume it in a way.

Chiara Caratelli [00:45:00]: Yeah, exactly. And when you create tools that are directly connected to the public image of the company, like when you define a persona that represents the company, then the type of people who write these rules are not developers. The person defining these rules maybe is a designer or the product team, but you cannot give the rules to the agent as they are. You need some translation layer in here. And I think governance is really important, and it's important to get it right, because once you do it once, then it's more maintainable. And if I would let a tool be used by other agents, for instance, expose it through an agent-to-agent framework or let other teams use it, then I also need to make sure that it is used correctly.

Alex Salazar [00:45:58]: Yeah, yeah. I think what we've been seeing, which echoes your experience, is that it's very easy for an agent team to get their agent working in a silo; they'll build the tools they need and it'll work. But there's almost always an organizational context: that agent team is focused on its agent, but its manager or the director or the VP or the CTO or CIO is looking horizontally. And most organizations are typically building more than one agent. And the same problems that everybody saw in the last cycle with APIs are the exact same problems now with tools. MCP or not, take the wire protocol out of it. Okay, great: these two people are inserting user records into the CRM. Why do they have two different tools that are being maintained separately with different logic? That should all be the same thing.

Alex Salazar [00:47:11]: Let's elevate that and make that a shared tool. But then when you do that, you suddenly introduce a governance problem that's never been resolved before with tools. How do we do versioning? How do we do ownership? Who gets access to which tools? We talked about this last night: Agent A is a customer-facing agent and Agent B is an internally facing agent, and those teams probably shouldn't be seeing the same tools, because the policy at an organization, I think, which is the policy here, is that internally facing agents shouldn't have any access to anything outside of the organization. And so sharing is the beginning of it, but the moment you start sharing tools, which is a best practice, it looks at first like a productivity gain, but immediately you inherit a bunch of governance challenges.

Alex Salazar [00:48:07]: So, for example: who has access to the tool? Which agent? What's the policy for how access is being doled out, not just to the individual developers on different teams, but to the nature of the agent itself, internally facing agent versus externally facing agent? And how do you handle versioning? And if you're the one that wrote the user-insert tool for the CRM and my agent depends on it, who owns the tool? Who's in charge of bug fixing it? Is it me? Is it you?

Chiara Caratelli [00:48:42]: When this impacts multiple agents, changing a tool definition is actually changing the prompt that the agent has access to. So it's crucial to evaluate because if you change that, you're going to impact all these downstream tasks. And it's really crucial to have good evaluations in place, not just for the tool, but for how the agent will use this tool and will interact with that.

Alex Salazar [00:49:09]: To link it back, I think this is where laddering up tools really matters. Because my agent consuming the shared user-insertion-into-the-CRM tool is likely going to have a different set of evals than yours, because we just have different contexts, we have different intentions. And I'm going to bloat your tool irreparably for all the other agents if I try and insert all of my needs and demands onto it. But if instead I just layer my own domain-specific or agent-specific tool on top of yours, then suddenly sharing and repeatability still work without having to sacrifice accuracy and consistency and latency for the agent. So now we can share the same tool, and I can have my own little domain-specific issues with my own set of evals that are maintained in my agent, separate from yours.

Demetrios Brinkmann [00:50:10]: I see what you're saying where it's like you normally have to go really deep on something and really craft it so that you understand the problem set, you understand how to build that agent in the best way possible. But at the same time, I feel like if you're getting that detailed with it, you're now creating a whole lot more work for every team.

Alex Salazar [00:50:33]: Yeah, you are absolutely creating more work. But this is the trade-off, right? It's all about knobs and dials. If you're a single-agent team, you don't need this. Yeah, right. But individual agent developers, you know, aren't going to be thinking about this. It's the moment that you're running an agent program, you know, call it a center of excellence or whatever. I mean, the moment you're building multiple agents across an organization.

Alex Salazar [00:51:02]: If you're a Fortune 2000, and even here you have a mandate for 30,000 agents, this becomes a requirement at scale. If everybody's rebuilding the wheel every single time, the productivity gains of flipping the script become pretty high. Because when I go start my new agent, I don't have to go build all this stuff from scratch. I can go see in the registry: what are all the different tools that already exist? Let me pick that one, let me pick this one. And then I can build my own tools on top to personalize them, and everybody wins. But that's just the developer productivity piece. If you think about senior leadership and the governance problems, it becomes possible to put your arms around it. If you don't have any kind of reuse or reusability or registry or things like that, then when the CISO or the compliance team or senior leadership worried about performance wants to go see what's happening, they have to go look at, what, 300,000 tools? That's not going to work.

Alex Salazar [00:52:17]: They're all marginally different. That's a fail, right?

Chiara Caratelli [00:52:19]: Yeah. At the same time, there is no substitute for really digging deep into a problem, because starting the other way around, I think, would be a disaster. Like, if you start from defining what tools can be built for the teams before you're even solving the business problem, that is a recipe for disaster. You're going to over-engineer things, and this rule is going to have to be changed so many times. So what we found that works was really to build something first. Building an agent in production is way different from thinking about an agent. You find challenges you didn't even think about. So before you have done that, I think you shouldn't even start thinking about how this can be shared across teams.

Alex Salazar [00:53:10]: I agree, with a few nuances. I think that's generally the best practice. Again, we talk to a lot of customers and they're all coming from different angles. There's a particular agent team that needs to go unlock a particular piece of functionality; they want to talk to email or calendar or some custom service. And then we'll also talk to CIOs and their VPs who are trying to architect the organization and the enterprise for agents. And we have a saying internally, which is: regardless of where we come in as a vendor to help an organization, we always drive the conversation to the first agent, for that same reason. Before anybody gets caught up in building the Sistine Chapel of complex agent governance and systems, let's first make sure that you've got your first agent at least successful and working and you've got the right patterns. And you're absolutely right, you should go deep first, and then once it's working, abstract out and start optimizing for sharing and reusability and all that fun stuff.

Alex Salazar [00:54:22]: But you can't start there.

Chiara Caratelli [00:54:24]: However, with some third-party tools that are pretty widely accepted by now, like managing calendars and emails, I think it's a good bet to.

Alex Salazar [00:54:36]: Yeah, those are easier, which is how you guys use us.

Chiara Caratelli [00:54:38]: Good reason to incorporate those, right?

Alex Salazar [00:54:41]: Yeah, everybody should use Arcade.

Demetrios Brinkmann [00:54:45]: And that we can do. The governance piece is wild.

Alex Salazar [00:54:49]: Oh my God. It's become like the conversation everywhere. It's crazy because like three months ago I never heard the word.

Demetrios Brinkmann [00:54:57]: Well I remember I told you after I was in San Francisco I was like all these CISOs talking about governance.

Alex Salazar [00:55:02]: When you and I spoke I was like, ah, governance.

Demetrios Brinkmann [00:55:04]: Yeah.

Alex Salazar [00:55:04]: Now it has been every single conversation we're in. It is just, it's wild to me how fast everything's changing. The maturity curve for most organizations is rapid.

Chiara Caratelli [00:55:13]: We also notice teams using multi-agent setups way more. And sometimes agents look like they have a multi-agent setup where the agents look very similar to each other. And when you start to dig deeper into the reasons, it's governance.

Alex Salazar [00:55:28]: Yeah. I'll tell you what I'm seeing, though. Here's my bet: a year from now, when we do this podcast again, I think multi-agent systems are going to be a lot less common. Oh, interesting. Yeah, we're starting to see the early signs.

Demetrios Brinkmann [00:55:38]: It felt like, as we talked through this, agents and tools can almost be interchanged in a way.

Alex Salazar [00:55:47]: Yeah. Actually, our sales engineer, Shub, who's incredible, basically proved to all of us in a really dramatic way, and I don't think he was even doing it on purpose, that an agent is really just a collection of prompts and tools. And people say it, but nobody really believes it, because, you know, they all want to use all the big systems. And Shub just slammed together a little YAML-based agent builder and it fucking works. Like, take that. Yeah, it just works. You just need prompts and tools and you're done. And it's got LangGraph under the hood. But it's wild, right? And what we've been seeing is the scope definition of a sub-agent is getting bigger and bigger and bigger. As context windows get bigger, as it can handle more tools, all of a sudden you're seeing bigger sub-agents.

Alex Salazar [00:56:44]: And so if you take that to the limit, you're not going to see sub-agents as complex as people have been saying.
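For a flavor of the "prompts plus tools" claim, here is a toy YAML agent spec and loader; it's an illustration of the idea, not Arcade's actual builder, and every name in the spec is a placeholder.

```python
import yaml  # pip install pyyaml

SPEC = """
name: brochure-agent
system_prompt: |
  You help sales reps fetch collateral. Prefer the get_brochure tool.
tools:
  - get_brochure
  - search_crm
model: some-foundation-model   # placeholder
"""

agent = yaml.safe_load(SPEC)
print(agent["name"], agent["tools"])  # brochure-agent ['get_brochure', 'search_crm']
```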

Chiara Caratelli [00:56:54]: You.

