Building Agentic Tools for Production // Sam Partee
Speaker

Sam Partee is the CTO and Co-Founder of Arcade AI. Previously a Principal Engineer leading the Applied AI team at Redis, Sam led the effort to create the ecosystem around Redis as a vector database. He is a contributor to multiple OSS projects, including Langchain, DeterminedAI, and Chapel, among others. While at Cray/HPE, he created the SmartSim AI framework and published research on applications of AI to climate models.
SUMMARY
Building agentic tools for production requires far more than a simple chatbot interface. The real value comes from agents that can reliably take action at scale, integrate with core systems, and execute tasks through secure, controlled workflows.
Yet most agentic tools never make it to production. Teams run into issues like strict security requirements, infrastructure complexity, latency constraints, high operational costs, and inconsistent behavior. To understand what it takes to ship production-grade agents, let's break down the key requirements one by one.
TRANSCRIPT
Sam Partee [00:00:05]: Well, I'm glad to be here. Today I'm going to be talking about building agentic tools for production. This talk is really more about writing the tools. For the past three years I've been writing tools, which is about as long as LLMs have been able to even produce JSON reliably. Maybe not even three years, really.
Sam Partee [00:00:27]: So in this talk I'm just going to share some things that we've found that work, and then I'm going to specifically hone in on database tools. Mostly because they're used in a lot of internal use cases recently, but also because they shed light on a specific attribute of different types of tools and how you can use them. So, without further ado.
Sam Partee [00:00:53]: Today, like I said, I'm talking about individual types of tools, and these are two examples from the Arcade MCP library. I wanted to start with this so that everybody can have a chance to grab that QR code, and so you know the library that we released to write these tools right at the beginning, because I always appreciated when other people did that in their talks. So, first steps, bare minimums. And I know that other people are maybe not going to say that these are the bare minimums, but if we're talking about taking tools to production, if you're going to have an LLM, a non-deterministic, probabilistic endeavor, inside of your code going to production, we found that these are essentially the bare minimums. 1. Generate and take care of your tool schemas. I say generate because today most MCP libraries force you to have a separate schema from your tools.
Sam Partee [00:01:53]: We found that most people don't take care of these as well as they should, especially when they're separate from your tools. That's why we generate them in Arcade. We actually don't allow you to have a tool that doesn't carry Annotated on any non-context parameters, because the data types that are allowed in other frameworks are typically a footgun; they really do hurt the overall LLM accuracy, recall, what have you, of the entire process. You see send_slack_message up there: channel and message have annotated parameters. We actually generate the schema from those annotated parameters, and we restrict the data types.
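As a rough sketch of what generating a schema from annotated parameters can look like (this is illustrative plain Python, not Arcade's actual implementation; the helper and the allowed-type table are assumptions):

```python
import inspect
from typing import Annotated, get_args, get_origin, get_type_hints

# Hypothetical schema generator, not Arcade's real code; it illustrates the
# principle that the schema is derived from the function signature, so it
# can never drift out of sync with the tool itself.
ALLOWED_TYPES = {str: "string", int: "integer", float: "number", bool: "boolean"}

def generate_schema(func):
    hints = get_type_hints(func, include_extras=True)
    hints.pop("return", None)
    properties = {}
    for name, hint in hints.items():
        # Every non-context parameter must be Annotated, or we refuse the tool.
        if get_origin(hint) is not Annotated:
            raise TypeError(f"parameter {name!r} must be Annotated")
        base, description = get_args(hint)[:2]
        if base not in ALLOWED_TYPES:  # restrict the data types on purpose
            raise TypeError(f"type {base!r} is not allowed in a tool schema")
        properties[name] = {"type": ALLOWED_TYPES[base], "description": description}
    return {
        "name": func.__name__,
        "description": inspect.getdoc(func) or "",
        "parameters": {"type": "object", "properties": properties,
                       "required": list(properties)},
    }

def send_slack_message(
    channel: Annotated[str, "The Slack channel to post to, e.g. '#general'"],
    message: Annotated[str, "The plain-text message body to send"],
) -> str:
    """Send a message to a Slack channel on behalf of the user."""
    ...

print(generate_schema(send_slack_message))
```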
Sam Partee [00:02:49]: So, like, you can't have a list of lists of dicts, right? You can't have a list of floating-point numbers. Actually, I think you can have a list of floating-point numbers now. But the reason is the cardinality of those arguments and how difficult it is for an LLM to actually choose the correct value. Just think about an enum with three strings, all of which are presented to the LLM in the schema, versus the completely infinite list of floating-point numbers, right? One is much easier to choose from than the other. So really take care of your tool schemas, pay attention to them, and also generate them if you can. Also, most important, I think, are the auth requirements. Every tool needs to have its own auth requirements. The agent is taking action; this is not just a context-gathering endeavor but an actual action on behalf of the user, whether you are going to use your company's secrets or you're going to, you know, use OAuth to make that tool actually authorized as that user.
Sam Partee [00:03:58]: Each individual tool needs to be labeled with those authorizations, those scopes, such that the least amount of privilege is applied to that individual tool. So if you think about Google Drive, for instance: reading a file, or even just posting a file, right? Writing a file is much different than editing a file or deleting a file. Those are three different levels of privilege. And if you try to deploy your agent product at an enterprise, they're going to expect this level of least privilege applied to each of your agents.
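A minimal sketch of per-tool least privilege (the requires_auth decorator here is hypothetical, standing in for whatever your framework provides; the Google Drive scope URLs are real OAuth scopes):

```python
from typing import Annotated

# Illustrative only: a hypothetical decorator that attaches OAuth scopes to
# a single tool, so an authorization layer can enforce least privilege per
# tool rather than per agent.
def requires_auth(provider: str, scopes: list[str]):
    def wrap(func):
        func.__auth__ = {"provider": provider, "scopes": scopes}
        return func
    return wrap

@requires_auth("google", scopes=["https://www.googleapis.com/auth/drive.readonly"])
def read_file(file_id: Annotated[str, "The Google Drive file ID to read"]) -> str:
    """Read a file's contents. Needs read-only access and nothing more."""
    ...

@requires_auth("google", scopes=["https://www.googleapis.com/auth/drive.file"])
def write_file(
    name: Annotated[str, "Name for the new file"],
    content: Annotated[str, "File contents to upload"],
) -> str:
    """Create a file. A broader scope, attached only to this tool."""
    ...
```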
Sam Partee [00:04:38]: Evaluating your tools is very important. And note, I didn't say evaluating your agent. Most people today are evaluating their agent, if they're, you know, doing it right, but they rarely actually evaluate their tools. The tool eval is, I think, a little-known part of taking these kinds of systems to production.
Sam Partee [00:04:59]: Tool-calling systems, I'd say. It's not necessarily something that you see a lot in terms of coverage of MCP or what have you, but tool-calling evals are the reason why our Arcade tools, not just the API tools, work, right? We have a reputation for tools that work, and that is because each of those tools has evals that cover pretty much the entire gauntlet of the cases that you see. And some of them can be very simple, but the fact that we run them across 15 different LLMs every night makes it such that we can have essentially a CI/CD-type process for every single tool in the repertoire. So that is one aspect that is really important but little known. Arcade also provides those with its eval framework, but there are a lot of different eval frameworks out there that can do this; it's just not practiced as much as it should be. I'll show you a couple different things here and specifically point out the execution dependencies. If you have a database string, it should go in something that is not an env var.
Sam Partee [00:06:21]: We've been doing that way too long in this field. A .env file is not a production-ready method to use. Annotate with examples of the argument, typed as a JSON string; LLMs are good at JSON now. You should have pagination and limits, just like a regular API, but have the defaults explained and placed inside of the string annotation. That greatly increases the LLM's ability to get the pagination and limits parameters correct at runtime.
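For example, a hypothetical tool with the pagination defaults spelled out in the annotation strings themselves, so the model sees them in the schema:

```python
from typing import Annotated

# Hypothetical example: the defaults and bounds for pagination live in the
# annotation text, so the LLM sees them in the generated schema and tends
# to fill the parameters correctly at runtime.
def list_documents(
    collection: Annotated[str, "Name of the collection to list documents from"],
    limit: Annotated[int, "Max documents to return per page. Defaults to 20; "
                          "values above 100 are rejected."] = 20,
    cursor: Annotated[str, "Opaque pagination cursor from a previous call. "
                           "Omit (empty string) to start from the beginning."] = "",
) -> str:
    """List documents in a collection as a JSON string, one page at a time."""
    ...
```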
Sam Partee [00:07:05]: And like I said earlier: eval, eval your tools. I forget that framework's name, but if you eval the description of your tools every time you change or mutate those tools, you'll be able to tell how they act and how they actually respond to different LLMs in the different situations in which they're called.
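A bare-bones sketch of the shape of a nightly tool-call eval (Arcade ships its own eval framework; this generic loop, with its stand-in model call, just shows the idea of gating CI/CD on whether each model picks the expected tool with the expected arguments):

```python
# Generic sketch of a tool-calling eval, not Arcade's eval framework.
CASES = [
    {"prompt": "Tell #general we ship at 5pm",
     "expect": ("send_slack_message", {"channel": "#general"})},
]
MODELS = ["model-a", "model-b"]  # in practice, the 15 LLMs Sam mentions

def call_model(model: str, prompt: str) -> tuple[str, dict]:
    """Stand-in for a real chat-completions call that returns a tool call."""
    raise NotImplementedError

def run_evals() -> float:
    passed = total = 0
    for model in MODELS:
        for case in CASES:
            total += 1
            name, args = call_model(model, case["prompt"])
            want_name, want_args = case["expect"]
            # Pass if the right tool was chosen and the critical args match.
            if name == want_name and all(args.get(k) == v
                                         for k, v in want_args.items()):
                passed += 1
    return passed / total  # gate your CI/CD on this score
```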
Sam Partee [00:07:26]: So I'll now dive into database tools specifically. Database tools can be tricky, especially since they can mutate really sensitive data if you're giving them the ability to write or edit that data.
Sam Partee [00:07:44]: There are two kinds of tools that I want to explain when it comes to databases, and that also sheds light on how a lot of different tools can be made. Exploratory and operational: that splits the tools domain into two groups, and it really shows you different ways of actually building your tools and then also using your tools. Exploratory is something more like a context-gathering action, and operational is much more specific, with tightly scoped privileges, where you can be sure that there are only a set number of outcomes. And so operational really is not something that necessarily needs a human. It can be an ambient kind of action, because it should be very tightly scoped in what it's able to do. But an exploratory action is really meant to be used in tandem with a human being.
Sam Partee [00:08:46]: And so you can have this kind of back and forth. I'll show you an example here. So this update_user_payment_plan right here, you see a couple different things. The tool context there, or context, is just the MCP context that's passed through, so ignore that for a second. There's the user ID, which most of the time comes from a get-user-ID step or is passed through, but this payment_plan argument is the one I want to talk about. It actually is an enum. So this function really doesn't do much other than pick from a set number of options for a payment plan. It is very tightly scoped in its ability to act and in its outcomes.
Sam Partee [00:09:32]: So that is an operational tool and this can be an ambient action because there's only a set number of verifiable outcomes.
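A sketch reconstructed from that description (the names are assumed from the talk, not copied from Arcade's repo, and the MCP context parameter is omitted):

```python
from enum import Enum
from typing import Annotated

class PaymentPlan(str, Enum):
    FREE = "free"
    PRO = "pro"
    ENTERPRISE = "enterprise"

# An operational tool: the enum constrains the outcome space to three
# verifiable values, which is what makes it safe to run ambiently.
def update_user_payment_plan(
    user_id: Annotated[str, "ID of the user whose plan is being changed"],
    payment_plan: Annotated[PaymentPlan, "The plan to move the user to"],
) -> str:
    """Set a user's payment plan to one of a fixed set of options."""
    ...
```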
Sam Partee [00:09:44]: This is just another example that's applied in a different use case. This aggregation window Enum is something that we use a ton because it's really not helpful for an LLM to have a start date and an end date. It is much better for it to have yesterday or last week or last month. It understands those semantics much better.
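A sketch of such an enum, with a server-side resolver as one hypothetical way to turn the semantic window into concrete dates:

```python
from datetime import date, timedelta
from enum import Enum

# Semantic windows an LLM understands well, instead of raw start/end dates.
class AggregationWindow(str, Enum):
    YESTERDAY = "yesterday"
    LAST_WEEK = "last_week"
    LAST_MONTH = "last_month"

def to_date_range(window: AggregationWindow) -> tuple[date, date]:
    """Resolve the semantic window to concrete dates on the server side."""
    today = date.today()
    if window is AggregationWindow.YESTERDAY:
        return today - timedelta(days=1), today
    if window is AggregationWindow.LAST_WEEK:
        return today - timedelta(weeks=1), today
    return today - timedelta(days=30), today
```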
Sam Partee [00:10:11]: Exploratory tools, you can imagine them as kind of like poorly designed tools in some ways, but that's because they're more open-ended on purpose. So take this example of grouping users by UTM source and break it down into a number of different tool calls: discover the databases, discover the collections, discover the schema of the available collections, aggregate those documents, maybe search through those documents, do a progressive search, meaning start narrow and widen out.
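A sketch of that decomposition as read-only stubs (all names here are illustrative, not Arcade's tools):

```python
from typing import Annotated

# Illustrative exploratory tools: each gathers context only and mutates
# nothing, so a human can steer the loop between calls.
def discover_databases() -> str:
    """List the databases the user is authorized to see, as JSON."""
    ...

def discover_collections(
    database: Annotated[str, "Database to list collections from"],
) -> str:
    """List the collections in a database, as JSON."""
    ...

def get_collection_schema(
    database: Annotated[str, "Database containing the collection"],
    collection: Annotated[str, "Collection whose schema to describe"],
) -> str:
    """Describe a collection's fields and types, as JSON."""
    ...

def aggregate_documents(
    database: Annotated[str, "Database to query"],
    collection: Annotated[str, "Collection to aggregate over"],
    group_by: Annotated[str, "Field to group by, e.g. 'utm_source'"],
) -> str:
    """Run a read-only aggregation and return the grouped counts as JSON."""
    ...
```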
Sam Partee [00:10:52]: This way you can actually explore iteratively, in tandem with a human being. This is much more like the tools that you'll see something like Cursor use. It will start with "I need to go look up or find this information," and then it will slowly, iteratively get to your answer. Exploratory tools are what most people are used to, because until recently we didn't really have the facilities to do authorized, reliably evaluated actions on behalf of the user. Operational tools and exploratory tools are much different in how you actually create them, but also in how you use them. This one is a good example of that type of workflow, not even Cursor-like code editing, but something like an analyst agent app or something like that. One point I want to make here is that a really, really helpful thing to do in your tools is to return a prompt.
Sam Partee [00:11:54]: So whether it's an error, like Arcade's retrieval tool errors, or it's not even an error, right? For one of that small number of possible responses, you send back an actual prompt, such that when that individual response is sent to the LLM, it says something like, take progressive search for example: you searched too narrowly, widen out your search to the medium scope, where the enum is narrow, medium, wide, or something like that. Those responses where a prompt is returned have a huge impact, and it's a little bit of a cheat code when it comes to building tools.
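A sketch of that "return a prompt" trick, with a hypothetical search backend stubbed out:

```python
from enum import Enum

class SearchScope(str, Enum):
    NARROW = "narrow"
    MEDIUM = "medium"
    WIDE = "wide"

# Instead of returning an empty result, the tool's response tells the LLM
# exactly what to do on the next call.
def search_progressive(query: str, scope: SearchScope) -> str:
    results = _run_search(query, scope)
    if not results and scope is SearchScope.NARROW:
        return ("No results at the narrow scope. Call search_progressive "
                "again with scope='medium' to widen the search.")
    return "\n".join(results)

def _run_search(query: str, scope: SearchScope) -> list[str]:
    """Stand-in for the real search backend."""
    return []
```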
Sam Partee [00:12:40]: So I talked a little bit about this, but really the important part here is: exploratory tools should not act with write privileges. They should be scoped down. You should make sure that they are not able to actually mutate any resources. They should be read-only. And operational tools should be tightly scoped and have a set number of outcomes that are verifiable. If you take anything away from the talk, it's that if you want to take LLM tools to prod, and you want to have them actually act on behalf of a user, or do anything of value, or anything that is tough in terms of the domain they're acting in, your operational tools should be different than your exploratory tools, at least in the database domain. Together these are really powerful, because if you are able to search and explore and find out, and then act in an operational capacity, you're able to have agents that can both figure things out and then act once they have. Lastly, this creates a lot of tools.
Sam Partee [00:13:49]: And so one thing we've done is we've created this gateway concept, where you can have all of your MCP servers, running anywhere or even hosted on Arcade, all go through a gateway, and then you can use them with your agent. We think the future is federated MCP servers. You should absolutely go check this feature out. It's really cool. And this way your agent never uses any more tools than you assign it, and you can select them right in the UI. Pretty cool. Anyway, thank you very much. And that is all.
Sam Partee [00:14:21]: Check out the QR code for Arcade MCP and start building some tools. Thanks.
Allegra Guinan [00:14:26]: Thank you, Sam. That was awesome. Everybody that's tuned in, please drop your questions into the chat so that we can answer them. We do have a few minutes for any questions that come up. I can also get us started with some questions in case that's helpful.
Allegra Guinan [00:14:44]: Yeah, I'm wondering, because it's sort of a different way than people are used to thinking about building their agents. Of course it's all a new space, but everything you're introducing is new as well. Do you think there's some unlearning that has to take place, or are there any challenges in helping break this down for people? And what makes sense for best practices when you are going into this hierarchy?
Sam Partee [00:15:08]: I think your question is definitely really good for the agent hierarchy of needs talk that I definitely should have told Demetrios I wasn't going to be doing. But I think in general, the thing that has actually made it tough to educate people in the space is that MCP blew up so quickly and everybody wanted to use it. Because immediately, tool calling, I mean, I don't want to call myself a hipster and say I was there before everybody else, but I thought tool calling was really cool. No one was using it and I didn't know why. And then MCP made it so that everybody and their mother started using tool calling. But not many also looked up tool calling and how to do it effectively. Right. It was just: I want to use Cursor, I want to use Claude Code. And I think the MCP spec gets a lot of flak for being immature, but really, how long has HTTPS been around? You know? We have time, and it is a needed protocol, and it will get better, and people will learn more about it and, you know, listen to talks, hopefully like this one, that help them out.
Allegra Guinan [00:16:22]: Awesome, thank you. We do have some questions in the chat, so I'll read them out. First one is: is it open source?
Sam Partee [00:16:29]: Yes. Well, the framework I just pointed out, yes. And, you know, everything that you would need in an MCP server, you can use it for free, yada yada.
Allegra Guinan [00:16:42]: Awesome, thank you. And there's another one on logging: do you have a system to log, or something to audit as well?
Sam Partee [00:16:48]: Yes, that is probably one of the more asked-for things. It's OTel, so OpenTelemetry. Pump it to whatever you want to view your logs in, you know.
Sam Partee [00:16:59]: Metrics, logs, traces.
Allegra Guinan [00:17:02]: Yeah, perfect. And what are some of the biggest challenges you're seeing? I mean, you touched on it a bit with the fallout from MCP.
Sam Partee [00:17:12]: They're endless.
Sam Partee [00:17:14]: Biggest challenges. I would say I don't have enough essays.
Sam Partee [00:17:21]: No offense. Ah, good question. The use cases are getting more diverse. I would say that is one of the more interesting and also difficult pieces because when they get more diverse, you haven't built them before. And so figuring out new ways to do things can be both interesting and difficult.
Allegra Guinan [00:17:46]: Yeah, fair answer. It surely is.
Allegra Guinan [00:17:51]: And we have another one. In addition to providing clear tool descriptions, would it be beneficial to include a few real-world, few-shot examples to illustrate how the tool can be used?
Sam Partee [00:18:01]: This comes up quite a bit and it is really hard to say. So there's a poison pill with few-shot examples, which is that in some cases the model really anchors to the example. And it's also based somewhat on the LLM. So this helps in cases where the domain, not quite like an operational tool, but even if it's exploratory, is kind of smaller and the few-shot examples can be generalized. If they can't be generalized, unlike the case of an enum or something like that where you can provide the exhaustive set of examples, the LLM will inherently choose what's in the few-shot examples even if it's not relevant. So if you have a really good search system, where you can prove that the recall is, like, 95-at-10 or above, then yes, few-shot examples help, but they have to cover the domain and your search system has to be good.
Sam Partee [00:19:04]: Otherwise it's just not worth it. LLMs are smart enough now that they don't need it.
Allegra Guinan [00:19:11]: Thank you.
Allegra Guinan [00:19:13]: And is it a good idea to provide all tools in one MCP gateway? Will the quality of the responses go down if there are too many tools provided to the agent?
Sam Partee [00:19:22]: Yes, undoubtedly. You should not have too many tools and you should not put all of them in one gateway. The point of the gateway feature is that you can have as many as you want for as many agents as you want. The nice part about it is that you can have a bunch of MCP servers, but unlike just using them yourself, you can pick individual tools. So like if it comes with 100 tools or 26, like GitHub, you can pick just four from that one and Dropbox and yada yada. It's one of the features that I've wanted for forever. So when we put it out, I was very happy about it.
Allegra Guinan [00:19:58]: Amazing. And do you believe there are other standards that will be needed to supplement MCP, or is MCP all you need?
Sam Partee [00:20:07]: MCP is not all you need. There's another layer on top. We're actually working with a lot of people on this. I don't know what it is yet, honestly. Some people call it skills. I don't love the Anthropic skills approach. It works. It's cool.
Sam Partee [00:20:25]: I don't want to edit markdown files like that. It just seems like a lot of effort. But that abstraction level is what's next. Taking the tools and then building them into more comprehensive actions that we impart our intuition into. I think that's next. I don't know exactly what it looks like, to be honest with you.
Allegra Guinan [00:20:45]: Yeah, that was another question that came up: whether you had a chance to view Anthropic's article on code execution with MCP, and what your thoughts are.
Sam Partee [00:20:53]: So I guess yeah.
Sam Partee [00:20:56]: Can I say no comment? Am I allowed to say that?
Allegra Guinan [00:21:00]: I mean, you're allowed to say no comment.
Sam Partee [00:21:05]: Publicly? No, I have nothing to say about it.
Allegra Guinan [00:21:08]: Fair enough. Okay, sorry, person with this question, no comment on this one. But I think what you mentioned about skills is interesting anyway.
Sam Partee [00:21:17]: Yeah, there are things you can read between the lines on that one. I'll say that there are tons of security loopholes right now, easier ones than are covered there, that people don't even think about. And you should be cognizant of your security if you're deploying MCP, even if you're just doing it for yourself at home. Like, if you're opening it up to the Internet, you should really triple-check your security and think about it. Or use Arcade. Either one. Yeah.
Allegra Guinan [00:21:54]: Okay, "think about it," that's the answer here. I think that's great advice in all cases. Okay, last question. What do you think would be the use case to go with a database tool, comparing it with an API tool?
Sam Partee [00:22:08]: Ah, so we put out API tools because people were like, you don't have enough integrations. And so now we've got thousands of them. But the thing about API tools is, if you just take an OpenAPI spec, and there are so many of those now, and just turn it into tools, it's not like what I was mentioning earlier with the annotations, trying to caveat them and eval them specifically for LLMs.
Sam Partee [00:22:38]: If you actually eval an API tool versus one that you wrote and evaled, what you'll find is that it doesn't cover the corner cases or edge cases nearly as much. And it's because the parameters are usually too many, too high cardinality, and things like datetimes that LLMs don't naturally speak. "11, 10, 25": that's not necessarily a great string for an LLM to tokenize and parse. So it's really helpful to take those API tools, like the ones we provide, and just write on top of them. We call those Agent Optimized now, I think, I don't know. But the API tools, you can just import them and write on top of them and change the description and schema and whatnot and go on your merry way.
Allegra Guinan [00:23:34]: Perfect. Okay, that was all of our questions. Thank you so much, Sam. It was a pleasure having you on. Thank you everybody who tuned in and for all of your awesome questions, and I hope you enjoy the rest of Agents in Production.
Sam Partee [00:23:46]: Thanks, y'all. Have a good one.

