MLOps Community

How to build agents that take ACTION

Posted Oct 13, 2025
# Building Agents
# AI Agents
# Arcade

SPEAKERS

Alex Salazar
Co-founder/CEO @ Arcade.dev

Alex Salazar is a seasoned Founder/CEO and auth industry veteran. As VP of Product Management at Okta, Alex led their developer products, which represented 25% of total bookings, and managed their network of over 5,000 auth integrations. Later, as GM, he launched an auth-centric proxy server product that reached $9M in revenue within its first year.

Arthur Coleman
CEO @ Online Matters

Arthur Coleman is the CEO at Online Matters. Previously, he held roles including VP Product and Analytics at 4INFO.


SUMMARY

Alex Salazar, CEO of Arcade, argues that chatbots are useless without the power to take action. In his view, the real value of AI lies in agents that can actually do things: trigger workflows, manage authorizations, and connect to tools like Gmail or Slack. Salazar calls out why most "agents" never make it to production (security nightmares, high costs, latency, and poor accuracy) and says Arcade fixes this by giving developers the tools to build real, authorized agents fast. He challenges the industry's obsession with data, insisting that AI's future is about software, not datasets, and that the heavy lifting by OpenAI and Anthropic has already changed the rules.


TRANSCRIPT

Arthur Coleman [00:00:00]: Alex Salazar is the CEO of Arcade, and he is both a hands-on technologist and a true product visionary. Arcade's a great fit for him because he's been a tool builder for a long time. And as I said, Arcade's been a tremendous supporter of MLOps. I go to a lot of their events; I learn so much. Alex, thank you very much. He is a unique combination: he's teaching a hands-on coding class on Arcade this morning, but he's also a visionary technologist. He knows how to turn infrastructure into innovation at his customers.

Arthur Coleman [00:00:37]: And with that, please welcome Alex Salazar of Arcade.

Alex Salazar [00:00:45]: That's a heck of an introduction. Thank you. I'm going to have you be my hype man for everything. Okay, so hi everybody, I'm Alex Salazar, I am the founder of Arcade, and we're going to talk about how to build agents that take action. And this is a topic that we're really passionate about, because the word agent gets thrown around a lot, but it's abused. And in our opinion, an agent isn't an agent if it can't take an action. Otherwise it's just a chatbot with a cooler name. So, PSA: what do we do? What is Arcade? And then I'll stop talking about Arcade and talk about agents.

Alex Salazar [00:01:23]: We're a platform that helps you take actions. And so if any of you have tried to use tools or MCP in your agents, you've discovered that it's hard. We make it easy, and there are a number of ways that we make it easy for agent developers. The biggest one, the one we're most known for, is agent authorization, or agent auth. And this is: how do you get an agent to talk to something like Gmail or Twitter or Salesforce? Well, you have to prove that the agent, on behalf of a user, is allowed and authorized to take a particular action on a particular resource. And that is incredibly difficult to do. And we make it one line of code. The other thing that we do is we have a number of out-of-the-box tools, MCP servers if you will, that have already been evaluated and built by hand by our team. They're not just API wrappers; they're legitimate intention-based tools that will perform very well on tool selection and parameter prediction, which in English means your agent's gonna use them properly.
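
For readers who want to see the shape of that flow, here is a minimal sketch using Arcade's Python client (arcadepy). It follows the pattern of Arcade's public quickstart, but the tool name, method names, and response fields shown here are recalled approximations, not a verified API reference.

```python
# A minimal sketch of delegated agent auth with Arcade's Python client.
# ASSUMPTIONS: the tool name, method names, and response fields below are
# recalled from Arcade's quickstart and may differ from the current SDK.
from arcadepy import Arcade

client = Arcade()  # reads ARCADE_API_KEY from the environment
USER_ID = "user@example.com"  # the human the agent acts on behalf of

# Request authorization for one specific tool on behalf of one user.
# If the user has never consented, this yields an OAuth URL to visit once.
auth = client.tools.authorize(tool_name="Gmail.SendEmail", user_id=USER_ID)
if auth.status != "completed":
    print(f"Please authorize: {auth.url}")
    auth = client.auth.wait_for_completion(auth)

# Subsequent calls run with the intersection of the app's registered
# scopes and what this particular user actually granted.
result = client.tools.execute(
    tool_name="Gmail.SendEmail",
    input={"recipient": "bob@example.com", "subject": "Hi", "body": "Hello"},
    user_id=USER_ID,
)
print(result)
```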

Alex Salazar [00:02:31]: And then, because every agent is gonna have its own special sauce, we provide a really nice SDK to allow you to build whatever tools and MCP servers you need for your agent, regardless of what we have out of the box. So that's enough about us; let's talk about agents. It's hard to believe that a year ago you really weren't seeing many agents and the conversation was just starting. And three years ago none of this was even happening. This has all happened so fast. But the biggest change that we've seen is that we all started in chat. ChatGPT now feels kind of old. When you go into it, you go into Claude, and it feels a little antiquated.

Alex Salazar [00:03:18]: And those were the most advanced agents in the world a few months ago. And what's happened is that it's no longer just about chat; it's really about taking action. And the reason action matters so much is that the real value of all software is workflow automation. And in order to do workflow automation, you need to be able to take actions on other systems, beyond just retrieval of information. The other big problem with agents: if any of you in this room have tried to build an agent, you've experienced that it's really easy to get a demo working, and you feel really excited. Maybe you raise some money with VCs on it, and then you try and take it to production, and you realize 90% of the work happens after the demo. All the re-architecting happens after the demo. And as a result, 70% of agents don't go to production, because teams can't get them to production grade.

Alex Salazar [00:04:19]: Now, there are a lot of reasons why agents aren't going to production. We see four most often. The number one reason agents don't go to production is they can't get the security right. And it's not tinfoil-hat security, like the Russians are going to break into the system. It's: I can't connect to Gmail, I can't connect to salesforce.com, I can't connect to all these systems that are really important for this agent's feature set to work, because they require something like OAuth2 on behalf of a user, what's called delegated user authorization, and we don't know how to do that. The second reason why agents don't go to production is cost. When you start thinking through the agent flows and how many LLM calls you're making, and you're trying to get accuracy and you can't get it, what do you do? You start stuffing more context into the agent, and that's a lot of tokens.

Alex Salazar [00:05:15]: Then you start doing chain of thought and more loops, and before you know it, you're spending way too much money per request. And the return on investment doesn't work anymore for the agent, and so a lot of teams end up killing it. The third reason is latency. Same exact reason, same exact problem, except it's not really a cost issue. It's the fact that the user is having to wait five minutes, and they're not going to. And so they're not going to use your system if you took it to prod. And the fourth one is accuracy, which is probably the biggest performance metric for any agent. Even if you get everything working, if it's not accurate enough, which typically means north of 80% and probably higher, users aren't going to trust it, which means they're not going to use it.

Alex Salazar [00:06:00]: So with that: we have the privilege of seeing a lot of agents being developed because of what we do for a living, and we have the privilege of also seeing which ones go to production, and the difference between what's being developed and what's actually going to prod. But if I take all of those customer stories and aggregate them for a second, that same problem statement is how we started the company and how we evolved into what we are. So very, very quickly, I'll give you our story. We originally started as a site reliability agent. The original concept for Arcade was: we were going to see something like high latency on a server, and from there we would do a bunch of diagnostic work. We'd figure out the root cause, and then we'd issue remediation or, at a minimum, give you a rule-out list. And our demo was incredible.

Alex Salazar [00:06:51]: And we then went to go build it for production, and we ran into two problems. The first problem we ran into was compounding error rates. Right? If you think of the diagnostic flow for something like a server, there are a lot of different things that you're going to check. And so you start going 10, 11, 12 steps down. If there's a hallucination anywhere in that chain, the whole chain breaks. It got so complex that we started seeing 100% failure rates, and so we couldn't go to prod on that. And then the second problem we ran into was we were giving ourselves superuser access to every system, which no one will ever do in production.

Alex Salazar [00:07:29]: And when we tried to give ourselves what's called scoped access, or privileged access, we learned the hard way that there was no way to do that in agents. And so we ultimately had to invent a new way to make this work for ourselves. And the way we did it was we pushed a lot of the logic into the tool layer and minimized the use of the large language model. And when we did that, it worked. But in doing that, we realized we had built something more important, which is Arcade. But the lessons we learned are what I'm gonna walk you through. So we talk about this as the agent hierarchy of needs: what are the things that your agent needs to have in order to be the agent you want it to be when it grows up? And so the first thing an agent needs: it needs evals.

Alex Salazar [00:08:16]: This is probably the most boring part of the conversation, but it's the foundational part of the conversation. Nobody likes testing; nobody wakes up super excited to build evals. But this is what is killing most agent development. Now, you don't need to have some enormous dataset of evals to make this work. Just sit down in Google Sheets and write out for yourself, by hand, the inputs you expect, some scenarios, and the outputs you expect based on those scenarios. And then you at least have the basis of a very simple eval system. And that allows you to keep going back to it, to make sure that you are remotely on track for what you want to do. And yes, eventually, when you want to go full production grade, you are going to beef that up.
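
A minimal sketch of that "scenarios in a spreadsheet" starting point, expressed as code; run_agent is a placeholder for however you invoke your own agent, and the scenarios are illustrative:

```python
# Hand-written scenarios with expected outcomes, checked on every change.
# `run_agent` is a placeholder for your own agent entry point.
cases = [
    {"input": "Reply to Alex's last email and say I'm running late",
     "expect_tool": "reply_to_email"},
    {"input": "What's on my calendar tomorrow?",
     "expect_tool": "list_calendar_events"},
]

def run_agent(prompt: str) -> dict:
    """Placeholder: call your agent, return e.g. {'tool': ..., 'args': ...}."""
    raise NotImplementedError

passed = 0
for case in cases:
    result = run_agent(case["input"])
    ok = result.get("tool") == case["expect_tool"]
    passed += ok
    print(f"{'PASS' if ok else 'FAIL'}: {case['input']!r} -> {result.get('tool')}")

print(f"{passed}/{len(cases)} scenarios passed")  # track this number over time
```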

Alex Salazar [00:09:00]: But it starts with just having a basic set of evals. It starts to look a little bit like Testament development. And the reason we put that first is that agents are, you know, they're non deterministic. And so the thing that makes them so hard for traditional software developers to get right is that the existing testing infrastructure doesn't work. They have to move to evals. And starting with the evals first is going to ground you in the use case. It's going to help you descope it and it's going to help you test to make sure that you built the right thing. That's going to work within the scope of a single agent.

Alex Salazar [00:09:34]: It starts with really defining what the agent's going to do, and then using that definition to keep yourself grounded on whether or not the agent did it consistently enough. And so the next layer here is the model. Models are now relatively easy. Yeah, we can debate religiously Claude versus OpenAI versus the open source models, but that conversation just isn't as fundamental as it was a year ago. Many of them, especially the flagship models, are all good enough. Yes, we can debate 1%, 2%, 5% here and there, but it's not like 10 or 20% differences anymore. And so if you're going to pick a model, pick a model that's great for your use case. Play with it, see if you like it.

Alex Salazar [00:10:14]: Cost, latency, everything depends on your application. But you pick the model that works for you. I would argue strongly that if you're building an agent, one of the most important performance metrics for whatever model you're going to use is performance on tool selection and parameter prediction, because an agent is not an agent if it can't take action. And the way agents take action is through tool calling. Today, the flagship models are far ahead of the community models on tool selection. That will change. Ask me next week and it might be totally different, but at the moment, today, that's the case.

Alex Salazar [00:10:48]: The next layer, once you've picked your model, is your orchestration layer. Now this, I would argue, is what's really unlocked agents in 2025. And there are a lot of these: there's LangGraph, there's Mastra, there's CrewAI, there's the OpenAI Agents SDK. This list keeps getting longer and longer. I won't take a position on which one you should use, but pick one that feels right to you. I would argue that one of the most important features in an orchestration system is that, one, you can wrap your head around it and it fits how you code, because if you can't figure it out, you're never going to go to prod. But also, human in the loop ends up mattering.

Alex Salazar [00:11:31]: In almost every use case we see in production, there's heavy human in the loop functionality. And so if the orchestration system doesn't really give you a good human in the loop experience, you're going to run into a wall at some point in the development process. So I would front load that analysis. But if you look at how most of these orchestration systems work, they have some notion of a workflow might be a graph based workflow, might be a hierarchical workflow, but there's a series of steps that the agent orchestration system is going to take the large language model through based on user input and system context. If you look at most of those workflows, for a truly agentic system, almost all of the branches on those workflows terminate in an action. They terminate in a tool call. The next part of an agent is the tools. What is it that this agent can actually do? Prior to tools, all the agent is doing is just pure reasoning and pure thought.

Alex Salazar [00:12:40]: This is where action happens. This is why MCP is so exciting. This is why MCP has been the dominant conversation for the last few months: because everybody now realizes that this is the next leg of the stool for them. Now, when you're building tools, things get very complicated. We see a lot of mistakes when it comes to tools. The first mistake is: we're going to grab an API, and we're just going to give the API to the large language model. If any of you have built agents, you've learned the hard way.

Alex Salazar [00:13:08]: It doesn't work very well. APIs require structured input: an API needs a Unix timestamp, and your model has no idea what day it is, and it for sure can't be trusted to generate a timestamp. So what do you do when the user says "email Bob"? The API requires a unique identifier for Bob. The large language model has no idea who Bob is, and it for sure has no idea what his unique identifier is. So how are you going to bridge that gap? And so then you start to see people say, okay, well, I'm going to build an MCP server or some other type of tool. I'm going to put natural language around the APIs, natural language around some of the input parameters. But that still doesn't work very well.

Alex Salazar [00:13:49]: This is why most tools out there are kind of garbage. Because, again, let's do an email example: reply to Alex's email. Awesome. Who's Alex? Which email, in which time window? Let's say we know what the time window is. How do we translate that to the timestamp that the API requires downstream? Let's assume we figure all of that out. There is no reply-to-email endpoint on Gmail. So how is the model, how's the agent, how's the orchestration system going to process a reply if there isn't already a ready-made API endpoint for replies? Now, if you take a Gmail example, there's a search-email API endpoint. It has like 40 input parameters, so you're asking for hallucinations.

Alex Salazar [00:14:37]: But let's assume you resolve that and you find the right email. Well, then, if you want to actually reply and maintain threading, you have to unpack the MIME structure that an email is in. There is no API endpoint for that. So you can either leave it to the large language model to generate code on the fly to go unpack MIME, or you yourself have to go build a tool to unpack MIME, which is a pain. It's like 200 lines of code. I wrote it for our product. It sucks. And then, after all of that, you insert your email, you repack the MIME, and then you hit the send endpoint.

Alex Salazar [00:15:09]: The send-email endpoint, that's the easy part. Now, you can make all of these discrete APIs in your own service. That'll work. It's a pain, but you can do it. But now you're asking the large language model to make three decisions unnecessarily, because it has to happen the same way every single time. And so where a tool comes in, a well-designed tool, is you abstract all of that away. It's not about the API endpoints; no one cares about those. It's about the agent's intentions.

Alex Salazar [00:15:37]: What's the agent trying to achieve? That's the tool. Anything short of that and you're just asking to make more language model calls. You're trying to stuff more context into the windows to try and help the large language model bridge the gap between the APIs you've offered and the intention it's trying to execute. That's why our tools work so well and most tools suck. Is that one insight. You don't have to use our product for that. You can go do that yourself. But that's how tools get written.

Alex Salazar [00:16:08]: They're intention-based workflows, not service-based description contracts. I'll give you an easy example. We're building a sales agent, and I, as a salesperson, want to walk into an account. I'm going to Goldman Sachs, and I say, hey agent, prepare me for this meeting I'm about to have. What do you want the agent to do? You want the agent to check your email, you want it to check the CRM, and maybe, based on some context, you want it to pull the right brochures for you, so that you walk in with the right brochures in hand, or at least digitally in hand. Well, say I give it a Google Drive tool, and maybe it's just a wrapper around the Google Drive API that explains Google Drive perfectly. My agent doesn't care about Google Drive. My agent wants to find the right brochures.

Alex Salazar [00:16:54]: So if I ask the large language model to use its intelligence to infer how to navigate Google Drive to find the right brochure, it might do it. It'll be slow, it'll be multiple LLM costs, and it might get it wrong. But if instead I give it a tool that is fine brochure, the chances of it getting it right are much higher. It's a lot faster and it's a lot lower token costs. And so as an agent builder, I am exposing to the model a tool for its intentions as opposed to a tool that expresses a lower level service. Does that make sense? Okay, so let's assume you get this right. How many of you here have tried to build an MCP server? Have tried to use MCP in your agents? Yeah. About half of you.

Alex Salazar [00:17:50]: How many of you have tried to do email or social media as part of one of your first sample applications? How many of you got it working? How many of you got it working for more than one user? One of you? I'm impressed; I want to see how you did it. The next problem is, no matter how well you get this working, you're going to run into the agent auth wall. We run into this at almost every single organization we talk to. This is arguably the number one reason why agents fail in production at organizations: they get everything working in a demo, they get it working for exactly one user. And so we see two common failure patterns on agent authorization.

Alex Salazar [00:18:39]: Pattern number one is what we all saw with RAG systems, which is you give the agent its own identity. Vendors will call this non human identity. It's an identity. It's like a human, but it's different. So we need to treat it like a new person. It's a new worker. It's a new knowledge worker. It's just automated, it's intelligent.

Alex Salazar [00:18:56]: That didn't work. That's. It's never worked. This is why RAG systems really struggled. What's the level of permission you give the RAG system? Any level of permission above zero. Then you run into a second problem is who do we give access to the agent. So let's say that we give. We create an agent that's doing compensation management in our organization.

Alex Salazar [00:19:17]: Does it get access to the CEO's compensation? Does it get access to the intern's compensation? What level of the organization can it see? We define whatever that's going to be. Now the intern comes in to work for HR and he has access to the agent. What can the intern see? If the intern's access is lower than the agent? You've just created what's called an authorization bypass vulnerability. The CISO will freak out. When you look at RAG systems, the mass majority of them are working on public information. They're getting the lowest level of permission possible to avoid this problem. This is why all those really smart support apps that you interact with on the websites, you can ask all the questions that it regurgitates the public knowledge base. But the moment you ask it where your order is, it cannot answer the question.

Alex Salazar [00:20:03]: It's because they can't figure out how to authenticate and authorize as you and pull your information and prove that it's you asking for it and not you asking for his information. Huge Problem. The other failure pattern we see is what is common in most MCP servers, which is you use the user's credentials. I put in my own credentials and the MCP server is acting as me. And that is secure. That is legitimately secure, but it's very unsafe. I have seen Cursor try and delete my root directory. It couldn't because it doesn't have pseudo access got blocked.

Alex Salazar [00:20:41]: It apologized profusely. It was very kind but thankfully it didn't have access to. That's why giving it your credentials is a terrible idea because you have the ability to delete directories in drive, you have the ability to delete emails, you have the ability to do all kinds of different things that you may or may not want the agent to also be able to do. The right way to do this is what's called delegated authorization. It is to take the intersection of what the agent's register to do not as an identity but as an application and then take the user's identity and what they're allowed to do and then take the intersection. If the agent's authorized to do it because they pre registered as an OAUTH application on the downstream service and requested scopes and claims as part of their registration process and the user's also permission to do it, then the agent should be allowed to do it. And if one of those two statements is not correct, the agent should not be allowed to do it. This is very difficult to do so far as I know.

Alex Salazar [00:21:46]: I think we might be the only service that can do it. But that's going to change in the MCP specification. This has just been accepted. We were the authors of that contribution. So my hope is it'll be merged into the spec by the end of the year. Regardless of which vendor you use. This should be possible soon. But once you have all of these in some form or fashion now you can have true agentic action.

Alex Salazar [00:22:13]: That's what everybody's here for. That's the thing we're most excited about. This is the point at which you start to really see ROI on agents, because now you can have real business workflow automation. So, bringing it back to Arcade, for those of you who want to learn more about this: one of the things that we get asked for the most is, how do I make this work for my data stack? I've got Snowflake, I've got Databricks, I've got Airbyte, I've got some other data system that I want to talk to. This whole notion that an MCP server or a tool is best built when it's domain-specific and intention-based, versus a raw service contract that just explains what the downstream system is, is most pronounced when it comes to data. And so our head of engineering, Evan Tahler, is going to be giving a talk later today. I hope you all make it.

Alex Salazar [00:23:12]: And he'll be giving that talk with a former colleague of his from Airbyte. And if you're interested at all on building agents that take real action that can go to production and not have a CISO freak out, I hope you play with Arcade. I recommend you go to our website, it is arcade.dev register also we've got really cool sample applications and so we have many, many, many sample apps. One of the coolest ones that I think most people here will be really interested in is we have a Slack bot that can do all of this from within Slack. It's open source, you can play with it, make it your own and you can start to interact with your own systems. You can interact with Gmail or Outlook or Docs or social media or whatever you want from within side of Slack Agent Also we have a discord. So if at any point you have questions or you run into problems or you wish we had a feature that we don't have, jump in the discord, let us know. We are actively monitoring it.

Alex Salazar [00:24:10]: So with that, thank you all so much.

Q1 [00:24:14]: Thanks, Alex. The agent hierarchy of needs is based on Maslow's hierarchy of needs from organizational psychology, or is similar to it, I see. My question is: where does quality of data come into place? We see the hierarchy, the various tiers of things, right? So quality of data is essentially the foundation of all foundations, I believe. So what is your thought on it?

Alex Salazar [00:24:41]: Oh my God, I love this question. Okay, so there's somebody I was talking to with Apple from Apple today about this very same issue. I don't know if he's in the audience but hot take. Quality data is less important today than it was before in large part because there's old AI and there's new AI and old AI was model, model, model, model, model, training, training, training, training, data, quality, data, labeling all the data science stuff as of the beginning of this year. That's not really what's being built. Like there's been this overnight transition to what I call new AI and new AI looks a lot more like software development. It is assembly through composition. And so the models are critically important, but they're no longer the differentiation.

Alex Salazar [00:25:39]: The first iteration of Any agent is an unoptimized implementation of Anthropic or OpenAI. And then as a step two or three or four if needed, they'll start to then start doing fine tuned training. But increasingly they're not even having to do that. They're starting to hit some of the targets without any optimization. Now, if you have really clean data and you have lots of it, incredibly valuable, but by no means is it a gate anymore.

Q1 [00:26:07]: I have terabytes of actual data. Nobody touched it, it wasn't trained anywhere. This is actual results, it drives actual results. If it's garbage there, you can have the best model. What will be the accuracy?

Alex Salazar [00:26:20]: Oh, because that hard work you're describing was already done for us by OpenAI and Anthropic and the work they did for most use cases we're seeing ends up being sufficient. And so as an agent developer, I can just skip that for now and go focus on every other part of that stack that I just described. And so when I look at most agents being built, there really isn't much of a conversation happening at all around that particular piece of the stack. They will get there as a stage two or three, but it's not stage one anymore. Next question.

Arthur Coleman [00:26:52]: We're pretty much out of time.

Alex Salazar [00:26:54]: One more, one more.

Q1 [00:26:59]: You have information leakage from the history, and then also the context.

Alex Salazar [00:27:04]: Oh, that's a really complex question. Let me play it back to you really quick. Yeah, I'll repeat the question. So to play it back, to make sure I understood it, you're building an agent. There's some degree of user authorization that the agent can operate on its behalf. There's all this context like conversational history that you're collecting as part of this process. The user has now either expired or has revoked the authorization, but the data still persists. How do you handle that? That's a GDPR question.

Alex Salazar [00:27:37]: The easy answer is you should have a script in your code that goes out and finds the data in your data stores and deletes it. But that's ultimately a privacy compliance issue and less of an architectural challenge because the answer to it is an answer that web systems had already resolved. You could go back in the databases and scrub it all. It's very hard though.

Q1 [00:28:01]: What we found is that this invalidates most of our context, because in an enterprise environment the permissions and authorizations change quite often.

Alex Salazar [00:28:13]: Well, okay, that's a separate part. Yes, if the authorization is to downstream systems, then yes, all that context should be shut down, because if the user is no longer authorized, that data should no longer be available. That's where things like token refresh and token management become really important. That's a longer conversation about how to handle it offline, but it's its own talk at that level. Thank you all for your time.

