MLOps Community

Governance for AI Agent Deployment

Posted Dec 05, 2025
# AI Governance
# AI Agents
# AI infrastructure

SPEAKERS

Spencer Reagan
R&D @ Airia

Passionate about technology, software, and building products that improve people's lives.

Demetrios Brinkmann
Chief Happiness Engineer @ MLOps Community

At the moment Demetrios is immersing himself in Machine Learning by interviewing experts from around the world in the weekly MLOps.community meetups. Demetrios is constantly learning and engaging in new activities to get uncomfortable and learn from his mistakes. He tries to bring creativity into every aspect of his life, whether that be analyzing the best paths forward, overcoming obstacles, or building lego houses with his daughter.


SUMMARY

Spencer Reagan isn't shy about sharing his takes. In this episode, he and Demetrios Brinkmann get real about the messy, over-engineered state of agent systems, why LLMs still struggle in the wild, and how enterprises keep tripping over their own data chaos. They unpack red-teaming, security headaches, and the uncomfortable truth that most "AI platforms" still don't scale. If you want a sharp, no-fluff take on where agents are actually headed, this one's worth a listen.


TRANSCRIPT

Spencer Reagan [00:00:00]: You know, so many things in technology repeat themselves. And I look back at stuff, whether it was Blockbuster or in this case, MySpace, right, when we were all trying to find the right song for our MySpace intro, no one ever imagined that MySpace could end up in the dustbin. So my question is, who is it? Who is it that we're looking at today that's a major player that's going to end up in the dustbin? Who is going to be the Tom of the AI industry? Who looks the most like Tom? Can we imagine a meme where we've got Tom next to some other AI personality whose organization is now in the dustbin, just as unsurprisingly, or just as surprisingly, as MySpace was? So I think that would be my spicy take, is that one of these players, not pointing at anything specific, not naming any names, but somebody's gonna end up like Tom.

Demetrios Brinkmann [00:00:56]: Yeah, you said something there. When it comes to actually transforming this new way of doing software into something that is usable, I instantly wanted to ask you, what are some use cases? What are some ways that you've been seeing this actually having success out there with businesses?

Spencer Reagan [00:01:20]: Most of the time it's around summarization and expansion of ideas. Whether that's looking through spreadsheets or records, or evaluating written text at scale, putting together normalized summarizations or expansions that can then be used. Whether that's filling out a lead sheet for a sales team or an RFP, or it's a customer that needs to evaluate all the schools in their district, right, and evaluate the teachers in them. These are generally spreadsheet-focused tasks. Right? Spreadsheets are just an extension of databases, things like that. But there's this idea that we need to run through these things and evaluate them and transform them, add color. And I think one of the first places we've seen these LLMs be so impactful is this idea of semantic meaning.

Spencer Reagan [00:02:11]: It allows us to normalize concepts down to a set of important criteria. It allows us to expand into other concepts in a way that gives us direction, in a way that human beings just aren't great at at scale and repetition. But by having a similar instruction set, we're able to do that. Those are the things that I'm seeing really transform the most. The question-and-answer aspect of agents, where we're asking a question about some data, is pretty important too. We've seen a lot of customers do these interesting setups with agents where they'll constantly evaluate or look across a set of changing information, right? This can be a set of Slack channels or Teams channels or emails or Jira bugs, you name it. And they'll have agents look across this stuff at a regular interval and organize it into some graphs, into a database, whatever works well for them. And then it gives them the ability to not just get a summarized output, but to ask follow-up questions.

Spencer Reagan [00:03:14]: And so what we see is leaders out there, leaders that normally would get an email in their business and it would say, hey, this is broken, there's an escalation over here because this data center is offline, and they've got to call up about four people to get the story right. They're grabbing their cell phone, they're dialing numbers, they're hitting Slack huddles, they're trying to get answers, right? It's tough. Those same leaders can just go ask an immediate question of this set of data that's been organized properly and get the exact same answer out almost immediately. That is incredibly transformative to their business. Their ability to respond to issues and questions in real time. I mean, that's powerful stuff.

Spencer Reagan [00:03:50]: And that's what we see now. The challenge to those setups is they take a lot of preparation, right? It takes a lot of figuring out how to get this in place so that in that moment you can get that value out of it. And so there is that investment, right? And I think the whole industry is kind of figuring out what are these structures, how do we evolve this? But there is this moment of we've got to invest a lot and figure out how to organize this to then get the use out of it.

Demetrios Brinkmann [00:04:13]: It's funny, you did say this key phrase which is organized properly. The data that is organized properly. And I was like, oh man. Every employee that has ever touched any type of data is sitting here listening right now thinking, oh, wouldn't that be nice?

Spencer Reagan [00:04:34]: Yeah, yeah, absolutely. And it's one of those things. Early on, when we were having conversations with customers, one of the first things that would always come up is, yeah, my data is just this giant mess, or it's this, or it's that. I'm thinking of a specific meeting now: a chain of 20 hotels, and everybody organizes the client data, the customer data, differently. Everybody looks at the preferences, where they ate, the reservations, what they had with their kids, differently. And so it becomes this great wealth of information that no one can really use, because you've got no way to look across it in an easy way. But lo and behold, it turns out that these systems, these LLMs, are really good at that.

Spencer Reagan [00:05:11]: Right. Giving them a structured output to, asking them to structure unstructured data, generally not in a, in a really mathematical, precise way, but in a way that just normalizes some of this stuff so that then you can use that suite of data to really have unified customer profiles across all those 20 hotels. Right. Just by letting the LLMs know, here's what I need output. Here's what's important for you to look for in this conversation, in this email set in this reservation list. Look for this, put it in this right place, and that gives them that body of information that ultimately then becomes really valuable.

Demetrios Brinkmann [00:05:44]: Yeah. Two conversations that I had instantly come to mind, one being with the head of AI at Wise, Igor, and he was saying, you know, it really helps to look at LLMs as a tool that can transform unstructured data into structured data. And that's very much what you're saying right there. Like, hey, let's normalize this data. We've got it in many different shapes and sizes and forms and schemas, whatever. But if we can throw it into the LLM and get just a gut check on where we're at, that can be seen as one step in the graph. And then you can go and do the different things that you normally do with data in the next steps.

Spencer Reagan [00:06:33]: Yeah. Even if it's just as simple as going back to that customer I was just referring to: what dinner reservations every guest makes at the different properties, what time they like to eat, what they order, things like that. Each one of this customer's properties is very different and stores that information differently. But it's not too difficult to run through that and say, hey, I need a customer profile on how they eat: what restaurants, what types of food, what time they want to eat, how many people they take, the kids, stuff like that. You can identify what the important points are. Then the LLMs are great at going through just a giant ream of unstructured information and giving you that unified customer profile across all those properties.
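A hedged sketch of what that normalization step can look like in practice, using a JSON-mode chat completion. The field list, model name, and records are illustrative, not from the episode; the point is the pattern Spencer describes: tell the model what to look for and what shape to return.

```python
import json
from openai import OpenAI  # any LLM API with structured output would work similarly

client = OpenAI()  # assumes OPENAI_API_KEY is set; model choice is illustrative

PROFILE_FIELDS = ("restaurants, cuisine_preferences, usual_dining_time, "
                  "typical_party_size, brings_children")

def dining_profile(raw_records: list[str]) -> dict:
    """Normalize messy per-property guest records into one unified profile."""
    prompt = (
        "Here's what's important to look for in these reservation notes, "
        f"emails, and CRM exports: {PROFILE_FIELDS}. "
        "Return a single JSON object with exactly those keys.\n\n"
        + "\n---\n".join(raw_records)
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},  # force structured output
    )
    return json.loads(resp.choices[0].message.content)

# Records stored differently at each property still normalize to one shape.
profile = dining_profile([
    "Guest dined at steakhouse, 7pm, party of 4, kids menu x2",
    "Res: La Terraza 19:30 -- window seat, 2 adults, 2 children",
])
```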

Demetrios Brinkmann [00:07:13]: Wow. Because before what I imagine was happening is that if you had a very important person at the hotel, you would have a human go and look at all of the different events that happened. Maybe it's in some kind of CRM and they get to see that data. But as you said, if it's dispersed across various hotels and different physical buildings, they might not be privy to see what is in the other hotel, even if it's under the same brand.

Spencer Reagan [00:07:47]: Exactly. And that was exactly the challenge they had. Right. Sometimes it was in spreadsheets, sometimes it was in actual CRMs. Right. Some of it was local. It was all different.

Spencer Reagan [00:07:57]: And so it was almost impossible to get use out of that data.

Demetrios Brinkmann [00:08:01]: God, that's such a great use case. That is one of those ones where you're almost like, it seems too easy. But then you recognize how valuable that is and how useful, especially in the hospitality business. Right. The hospitality business is known for, the more hospitable you can be, the better. And so if you can understand when someone wants to go to dinner and already have that in mind, and maybe you're helping them by preemptively saying, we've got a table ready for you, we know what you like, or we've got this special because we know that you really enjoy XYZ, since we've seen that you've ordered it every single time in the last 10 hotel visits. That is an incredible experience for a hotel goer.

Spencer Reagan [00:08:54]: Correct, correct. It's a win for everybody, right? And it's something that's done by an LLM with some really clean prompting and a described output that just runs 24/7. You know, again, at that point, it just becomes compute. You know, go ahead.

Demetrios Brinkmann [00:09:11]: Oh, go ahead. Sorry, sorry.

Spencer Reagan [00:09:13]: As you say, you hit the nail on the head, though, with one comment in there. You got it exactly correct in that it's as valuable as you make use of it. And while that information can feel a little bit like you're looking inside what I'm doing and my preferences, I want that. Right when I check into the hotel: hey, by the way, Tuesday has the special you like, and there's a reservation at the one that you love. Would you like to go ahead and make it? Those are valuable moments, right? Yeah.

Demetrios Brinkmann [00:09:45]: Yeah.

Spencer Reagan [00:09:46]: But again, it's about using the LLMs for what they're good at. I think when we see some companies struggle, or some implementations struggle, it's because you're looking for a little bit of magic that maybe isn't quite there. The rule of thumb that I always use when I approach customers or businesses that want to build some true agents, some true agentic software, is you need to start with this understanding: if you brought in an intern, right off the street, an intelligent intern but fresh to the business, and you sat them down, could you describe to them in natural language what it is you need? And for the example we just gave, you could, right? You could bring an intern in.

Spencer Reagan [00:10:26]: You could say look, here's this giant pile of information. It's all over the place. You got to log into 15 different hotels. But what we need you to do is look through every customer, look at their dining and pull this information out. If you can do that, if you can articulate how you would describe it to just an intern that you hire off the street, chances are it's pretty easy to build an agent out of it. When, when folks struggle is when they get to that moment they're like, well I don't know how to do that. I want, you know, I want this data to talk to that data. And if you can't articulate what that is in natural language, you're probably not going to be successful building an agent to do it.

Spencer Reagan [00:10:58]: But um, that's usually the rule of thumb that I will, I'll guide people down.

Demetrios Brinkmann [00:11:03]: I instantly think about a conversation I had with Alexa, who works at Bloomberg, and she was singing praises for some of her mentors at the company and at her past companies who, as she put it, could smell projects that had strong ROI. It was like they had a nose for it, because that's kind of what you get as you progress in your career and you have so much experience. You recognize that, yeah, if I can move the needle on this, that is going to be valuable. And it makes me feel like a lot of projects that I've seen, or a lot of ways that we're going about architecting these agentic flows or creating something, they're fun, they're cool, but they're completely over-engineered. And it isn't this smelling-the-ROI that you're talking about. And that's why I keep going back to, God, that is such a. I'm not going to say it's simple, because there's probably the data and actually getting that data to the LLM, but it's like one LLM call, I imagine, or maybe a few.

Demetrios Brinkmann [00:12:13]: It's not like you're creating some sub agents, some multi agent systems. You're not going and trying to, to this over engineering that sometimes you see with agentic use cases, 100%.

Spencer Reagan [00:12:27]: 100%. You know, Karpathy, I've listened to a particular talk of his, I think it was maybe at a Y Combinator orientation or kickoff or something, where he really articulated well software 2.0 versus software 3.0, and understanding that 2.0 is the deterministic component and 3.0 is where we're leaning on the LLMs and this new idea of summarization and semantic understanding, tool calls. But how important it is to understand when to use the deterministic stuff and when to use this new technology, and not just hope that, hey, let me throw AI at the entire problem and imagine something's going to come out of it. Knowing specifically what to use the AI for and what not to is really just key to it. But yeah, absolutely, man, that helps you understand the ROI. And you can see it coming from a mile away, because you can see the tasks that are going to require that level of repetition. It's almost the stuff that you don't want to ask somebody to do because it's such a bear of a project. Right? Those are the things, and you can see them coming from a mile away.

Spencer Reagan [00:13:28]: To your point, yeah.

Demetrios Brinkmann [00:13:29]: Oh, what a great call, that it's something that you don't want to do because it's a bear of a project. It probably involves a whole lot of data or a whole lot of rote work. And by recognizing that, you can get out in front of it and try to see, yeah, can AI help with this? And knowing where to use it and knowing how to scope it. That's where you can add the most value.

Spencer Reagan [00:14:01]: 100%. We're seeing now that initial phase of what is the meat on the bone for LLM automation or tasks. We're seeing that kind of starting to be exhausted from a conceptual standpoint. What we're seeing now is that the edge of this is being pushed into what we would really have considered human moments. Right. Things that are real interactions with browsers. I think all the AI companies are shipping some great browsers now; some of them are fantastic products. But just outside of that, the problems in business that folks are wanting to automate now, they're describing to us a website they go to and log into, and they check this and they click that and they upload it over here.

Spencer Reagan [00:14:43]: And these are, these are really very human interactions. And we built all of these systems and all this technology for human beings to do this, to look at the web rendering, to click the buttons. And now we're almost retrofitting this, this whole set of technologies onto understanding, you know, rendering the browser window, looking at it. Not that the LLM needs that, but because we Built that for humans in our. In our eyeballs. That's the way we're now instructing the LLMs to go take the tasks.

Demetrios Brinkmann [00:15:11]: Yes. The GUI that we have for humans now is really like difficult in so many ways for an LLM to be able to navigate and execute on reliably.

Spencer Reagan [00:15:25]: Yeah, I think. And you know, it's funny, we see that in so much, so much stuff in technology. Just infrastructure that's in place, whether it's the Internet itself or airlines or roads or something. It's almost as we get the new technology in, we almost have to retrofit onto the infrastructure that we made for us as human beings initially. We very rarely kind of rip it out and start over. We just kind of add the technology. Think about cars or any of the other modern technologies. They're really just kind of retrofitting technology onto the human interaction that we built initially.

Demetrios Brinkmann [00:15:59]: Yeah, 100%. I used to live in Park City, Utah, and in Salt Lake. And somebody will have to fact-check me on this, because it might just be something that I heard when I was in high school, before Google. But I heard that they had such wide streets because you had to be able to do a U-turn with a horse and buggy. And so all of their streets were just gigantic. I think it was like four lanes on each side, you know, because you had to be able to do that U-turn.

Demetrios Brinkmann [00:16:41]: It never got updated as we stopped using horse and buggies. It just, ah, that's the way it is. All right, cool. Let's put cars on it now and we just have more room for cars.

Spencer Reagan [00:16:52]: Same thing. Yeah, yeah, 100%. And it's funny because I often think about the, you know, the bandwidth limitations of. We interact with these AIs and these LLMs through interfaces that are grossly underscoped for the amount of information they could pass back and forth to us. But we can't do much more than type or speak and we can't ingest much more than what's printed on the screen or what's shown in front of us. And that's absolutely the full, like, small pipe limitation. I don't think that that changes until we get entirely new paradigms. Right.

Spencer Reagan [00:17:25]: These. The neural interfaces. Right. Or something like that. That truly changes that. Until then, we're stuck shoving all this AI technology through these little bitty straws that we initially thought were just plenty for us.

Demetrios Brinkmann [00:17:38]: Right, dude? Well, that's funny that you're calling Out. Hey, it needs to be a new paradigm. I even just am saying, let's rethink the chat. I've had a few awesome guests on here and seen a. A few great talks where folks will talk about how in different areas of software, we have different interfaces that we can try and borrow from when we interact with LLMs or agents even. And one way that I've heard this be proposed is that when we're in Lightroom and any photographer knows that when you take a picture and then you go to edit it in Lightroom, you have a histogram and you can bring up certain colors and push down other colors or, or bring out the shadows. And you have much more controllability when you have that. But with agents, we just have chat.

Demetrios Brinkmann [00:18:41]: We say something and then you hope that it understands your intention to the best of its ability. And then it. Then it goes and you wait a few seconds or a few minutes and come back and see like, did it, did it get what I wanted? But wouldn't it be cool to have a little bit more of that type of interface where we have all these knobs and all these ways that we can express our intent in a way that isn't just through language? Because language is very fuzzy 100%.

Spencer Reagan [00:19:14]: But what's interesting is these systems are built off of language, right? So it's almost that becomes the programming language and becomes the instruction language. I will say, though, that, you know, chat as a, as a singular interface probably doesn't encompass everything we need. You know, I see a lot of some of the most successful customers that I interact with, building different UIs that try to meet that person, where they are and what they're doing. You know, you can present some information that's AI summarized, give them some quick buttons. You know, something as simple as, you know, email, for instance. Right. An AI that just looks through your inbox and then puts a draft reply, or puts a reply, you know, draft in your drafts for each of the emails you have in your inbox. It's one of those things.

Spencer Reagan [00:20:02]: You didn't go to a chat in your face and say, you know, I'm going to paste in an email here. You tell me what the response is. It may or may not be that you use that draft, but it's right there, right? And so you're taking the interface that you work in, that makes sense. And you're bringing the AI or the technology to it. You're kind of trying to merge both of them. Now, the reality is, if that draft isn't Right. Or you want to change something about it to your point. Maybe the variables are known enough to where you could do levers, but I think that we're never going to get away from either typing text or speaking text as a, as a way to communicate.

Spencer Reagan [00:20:39]: Even if 70 or 80% of the interactions can kind of boil down to more very tactical, purpose built buttons, interaction points, touch points. I think that that way of communicating language probably just, it's probably here to stay with these AIs just by the way they've developed from language infrastructures and nets.

Demetrios Brinkmann [00:21:00]: So do you have any other examples that are maybe simple on the surface, but just have been super valuable for companies?

Spencer Reagan [00:21:16]: Let's see, simple on the surface, super valuable for companies. Yeah.

Demetrios Brinkmann [00:21:19]: Or not even simple on the surface. Forget that I even said that. Just is there anything that comes to mind that you've been seeing has such strong or ROI or. I don't know where I was going with that acronym, but anything you've seen in that regard that has such strong ROI that you have to kind of shout it out right now?

Spencer Reagan [00:21:45]: I would. You know, really it's. Each industry or each organization in a company really gets so much different value. Right. HR is really obvious in terms of being able to summarize and expand and look for things legal, legal deals in so much contract language and needing to go through red lines and understanding intent and evaluating, you know, changes in documentation and comparing negotiations, development. Right. You know, in coding it's really obvious coding is one of the, one of the first things that these, these, you know, systems really excelled at. And I think maybe because the outcome is so provable, I'm not sure that they've been so important there.

Spencer Reagan [00:22:27]: But I mean marketing is one of the places that I think we see the most. Right. If you think about what you can do from a design perspective, a branding perspective, that used to be some really difficult, long, arduous work of a designer really working with pixels. And what we've seen is those design teams, the ones that are being successful, are taking, taking hold of this and instead of trying to compete with an individual design, they're trying to figure out how to empower as many employees as possible to get that same output. Right. So if you're a design team understanding vibe coding, right. Understanding how Claude or Chad, GPT, Codex, any of them actually builds some Vibe coded UI if you will. And then if you're a design team trying to put rules in place so that you can empower your workers, your employees to get that same output so I think that's one of the most impactful that we've seen.

Spencer Reagan [00:23:27]: Just in terms of human hours taken out of the loop. It doesn't, doesn't degrade the quality output, it doesn't offset the people needed, but it just makes it so much easier to make more content. I think we're probably seeing that everywhere across the board, this flood of content. But it does help to have that really personalized content that you can really speak to individual demographics, groups, purposes. We're seeing that being one of the most transformative, I think.

Demetrios Brinkmann [00:23:56]: Oh, interesting. So you're not saying like a design firm that will churn out 200 different ideas and then choose three of them. You're saying personalizing in a way that wasn't necessarily possible beforehand.

Spencer Reagan [00:24:12]: Absolutely. I think everything, you know, social media was already kind of becoming this short form content thing. You know, I think NRF earlier this year it was all over, it was, you know, everyone needs to move to short form content. It's all about the gen Y, it's all about the experience in person. But the short form, easy to consume content and the, the reason for that I think is because it is more personal. Right. These feeds, these algorithmic feeds that everybody gets, you know, they're really, they're really customized to you and your needs and wants. And the more content, the more granular the content is, the more targeted those things can be.

Spencer Reagan [00:24:50]: And there's arguments you can make about whether that targeting is good or bad. But at the end of it, that ability to have material that speaks to individual people more specifically I think is valuable. We've seen it be valuable, we've seen the customers like it and it just matches the flood of this social media driven demographic, hyper personalized experience that we're seeing evolve all over the world.

Demetrios Brinkmann [00:25:17]: So what are some ways that you have encountered challenges when building with AI and overcome them? What are some things that just were hard?

Spencer Reagan [00:25:32]: You know, tools and integrations are one of the toughest. You know, initially I think everybody was impressed with. You know, I ask a question of a PDF and I can get an answer which was pretty important, don't get me wrong, all 18 months ago or what, you know, as fast as we're moving. But now what we're seeing is as we've passed that initial hey chat with PDF and we've passed this phase of rag where we need to take everything and ingest it and turn it into vectors and then search across it, we're seeing this ability to empower LLMs and agents with tools in A way that makes these calls optional, right? MCP is huge. Just tool calls and function calls in general. And that's where really, you know, as much as I've shied away from the term magic, that's when a lot of the magic's happening right now. Because what we see is, you know, a certain set of instructions and prompting giving the LLM kind of some real agency about I'm going to go look something up, I'm going to read what I got, I'm going to decide if I need to look up more. I'm going to maybe make an action or take an action because of that.

Spencer Reagan [00:26:36]: And so that there's two parts of that that's really difficult. Number one, once you've given an agent the ability to just use a tool or call an API, it's really important that they know exactly how, when, why, maybe what not to call. Because it's very easy for an agent to go call an API that returns back just 100,000 words or something and immediately you've got an issue. So that's one thing, but the second thing is the security because you've given some sharp scissors to a very toddler esque entity in terms of its ability to run around and make some, make some, some issues or cause some havoc. So you know what we're seeing, one of the biggest challenges is in that, in giving that agency, but with some constraints. You know, I think that's where we get back to this idea of governance and how do we make these agentic software systems something that can be used safely in business because they are non deterministic. And just by the virtue of the definition, it's incredibly difficult to put a discrete rule set around something that's non deterministic. But what you can do is get a general sense about what you want to empower these things to do and then maybe zoom out and start taking a look at, well, if it does this, does that mean I need to change the rules for the next couple of things? Right? So you start getting this idea of understanding how to constrain some of this behavior, even though you're trying to give a blank slate at the same time that tension, that balance is really where we see the sweet spot in building these successful, repetitive, trustworthy systems right now.

Demetrios Brinkmann [00:28:14]: It's funny you mention that one, because I've been meaning to create a fun series, or just a set of TikTok videos, where I play a character that is an agent personified, called Johnny Drop Tables. And it's like, this agent keeps deleting our fucking database, what's going on here? And Johnny Drop Tables doesn't even realize that they're doing things wrong. They're just doing it wrong, you know. But the other thing is, the question that I feel like every agent builder is going through right now is exactly what you had mentioned: how much harness is too much harness, and where is that sweet spot? Have you found any tricks when it comes to that?

Spencer Reagan [00:29:07]: It's not necessarily a trick, you know, so many of these things. I'll go back again to the human analogy. It's like that intern, right? If you just give them that set of tasks on Monday and then check in with them on Friday to see if they got there, I mean, maybe, right? But the real answer is you got to be on top of them, you know, once an hour, every couple of hours, you got to check in. And that's what we see with agents. And so much of this is really just, you know, so many human analogs, but for this specific one, it's that it's checking in until you get a level of trust. We have some really interesting tricks that we do that help anytime an agent hits an error, making sure that agent understands the error, and in a separate stream, evaluating how it could potentially prevent itself from hitting that error again and going ahead and recommending that change. So there's things like that that you can do, but there's still no substitute for right now, just a little bit of oversight and guidance because the vectors, you know, how these things can go wrong, are just so massive, right? It's just this literal 360 degree ball. If you could go any direction and you've just got to stay on top of it like a, like an intern, like a new employee, until you get a level of trust.

Demetrios Brinkmann [00:30:20]: Yeah. It makes you feel though. And I don't know if you've had this realization. It's like, wow, this is really cool, but this is unscalable. If I have to be checking in continuously, all right, if I have one agent, that's great, but if I've got thousands that I want to be doing all of the stuff that I normally do throughout the day, then me checking in with them constantly is not really going to work.

Spencer Reagan [00:30:49]: Correct. 100%. And you know, I kind of often will hold the, the one person billion dollar company up as a kind of a North Star. Right. You know, is sort of a meme we all had at the beginning of this. How long is it going to take before there's the one person billion dollar Company when we get there, what will that look like? Right? Because if you, if you can envision that and you can get there and you sort of work backwards, you start thinking through this and you. That's where there's these human analogs that I keep coming into, really come into play. Maybe not because they're perfect, but because again, we're talking about replacing systems that were already human infrastructure.

Spencer Reagan [00:31:22]: Right. So it's back to that, that, that conversation we were having earlier about, you know, the infrastructure is made for humans. So if we're going to replace them or we're going to, we're going to, we're going to attach LLMs to these tasks, that's probably how we're going to look at it. And if you work backwards from that, you come up with the idea that the important tasks, the big important tasks are probably going to be held by things that you trust more for one reason or another. Those are probably going to be more expensive, they're going to be more capable, they're going to have more experience, just like the people that you would trust in that organization. So you get this idea there's probably going to be this, this tier, this pyramid, this hierarchy, if you will, of the types of agents and the resources they have and what they cost you versus how much you can trust in them and how autonomous they'd be.

Demetrios Brinkmann [00:32:11]: Oh, I hadn't thought about that. Yeah, especially because we kind of just hit the API now and we pay for input tokens and output tokens. We don't think. And I guess in a way, different models could be seen as that you're paying for the more expensive model. But yeah, I could see a world where you're willing to allocate a certain budget. And I think Anthropic kind of did this for a while. I don't know if they're still doing it with the idea of saying allocate X amount to trying to solve this task. And I'm okay with you spending up to that much.

Spencer Reagan [00:32:55]: Yeah, a hundred percent. We see this. One of the first questions we get from customers when they start talking about these things in business is, wait, how can I, how can I put some constraints on the spending? Right. How can I budget what this is, Whether that's a budget for the entire project or, you know, X tokens per month or I don't want you to go over more than this per day. That usually is one of the first things that we get. And that's actually not that complicated of a system put in. We have it in like a lot of people have that in and it's one of the, one of the things that's needed right out of the gate just to kind of control that. But again, it is that same human analog.

Spencer Reagan [00:33:31]: Right. If you think about, I'm going to put a new business unit together, I'm going to hire some people. One of the first things in your head is what's the budget?

Demetrios Brinkmann [00:33:38]: And you know, the funny thing is though that no matter what the budget is, we as humans still find ways to mess it up. And you, I just think like to a conversation that I was having with a friend about Snowflake and how you can set a budget for. Let's say that like my monthly budget on Snowflake is 40 grand. But that doesn't mean that you can't eat up that 40 grand in one lookup, you know, or one equation. And then your whole monthly budget is done because of one thing someone put. And so having much more scoped budgets is almost what we're lacking in a way. And I feel like there's a very clear parallel with the OpenAI or anthropic APIs because you can put a budget for your company, but I can also use that budget or I can have my one engineer use that budget in one day. And then the rest of the company's like, what happened? I thought we had 40 grand to spend.

Spencer Reagan [00:34:48]: Yeah, rolling windows are the best way. Right. You know, if you think, you know, X amount per day or per 24 hour rolling window or per week or stuff like that. So there's, there's decent metrics that you can do that you can put in place. I'd say the more important thing is really kind of understanding when you hit that where's it coming from? Right. You know, what caused that hiccup? Usually it's some sort of rogue lookup or something like that. You know, an API call that, you know that it was supposed to return 30 pages of PDF to human. Not something that went straight to a context window.

Demetrios Brinkmann [00:35:21]: Yeah, exactly, exactly. I like that. Rolling windows, that seems like something that is again fairly obvious for somebody who has experience. You instantly call that out. You're like, no, no, no, we've dealt with that before. I know how to not blow a bunch of money up in the. Make that whole money truck go boom, you know, so that feels like a good one. The, the other piece though of it.

Demetrios Brinkmann [00:35:49]: Like, do you see ways to look at budgeting per person or is it per team? Like how do most people like to structure it? Because if we're looking at it again, like humans, you kind of have budgets per team or per category or activities, right?

Spencer Reagan [00:36:11]: Yeah. So we see all the above. We see budgets assigned to individual people, users, we see budgets assigned to projects, we see budgets assigned to, you know, individual tasks per day. We see all of that. And it really kind of depends on what is the set of tasks. Right. One thing that is interesting is there's never a question about was the output worth it. Right.

Spencer Reagan [00:36:36]: Unless there was some mistake where you've called something and you grabbed a bunch of data you shouldn't have have and it just blew up the token window for the day and you hit your limit for the most part. As much as we talk about how expensive these tokens are, the reality is compared to. Right, they're astronomically cheap if you really think about and you're using it properly from a development standpoint, a marketing standpoint, HR, legal, all the things we talked about. The actual cost that you spend on the tokens is just minuscule compared to, generally speaking, what it would cost to have a human being do it. Now that's got to change. Right. There's these companies have these massive investments in Thropik OpenAI XAI, massive investments in these data centers. Everybody that's running the open source stuff from Asia as well is just huge investment.

Spencer Reagan [00:37:27]: And so there's going to be a natural arbitrage. If I'm getting massive value out of some tokens that aren't costing me a ton and somebody's got a giant data center they're looking to monetize, that's going to arbitrage out. So I think we're in this weird golden age of, you know, it's, it's, it's the juice is absolutely worth the squeeze every time, if you will. But it'll be interesting to see how this plays out. We're going to get to a place pretty quick where there's. Where it's painful. Right. Where the value is much more associated with, with what's.

Spencer Reagan [00:37:59]: What the output is. You people are going to have to make different decisions. That's where the budgeting is really going to come into play. I think we start getting closer to that cost from the value.

Demetrios Brinkmann [00:38:11]: Yeah. Enjoy the subsidies while they last.

Spencer Reagan [00:38:14]: Exactly, exactly.

Demetrios Brinkmann [00:38:17]: I have heard different companies going about this with outcome based pricing. Have you tried to experiment with anything like that?

Spencer Reagan [00:38:27]: We haven't. What we do is the software that we offer enables someone to do all this stuff in a way that's organized that they can hand it for carry and feeding that has all the security and governance stuff we've talked about. We offer our own keys, for instance, for customers. You could charge an account with us, we burn that down. Or they can use their own keys, bring their own keys. We're really just agnostic to which approach the customer takes. It's not something that is important. We really try to focus on the value that our software provides.

Spencer Reagan [00:39:03]: So we're sort of observers in this price war as well. At the same time, we're watching it unfold. At the same time we haven't seen many people that are pricing outcome. It's really just time and materials and it's not deliverables based, if you will. If you go back to a PS comparison.

Demetrios Brinkmann [00:39:30]: Yeah, well, I. This is probably a good moment to just jump into what you guys are doing.

Spencer Reagan [00:39:35]: Oh yeah. So Aria is an orchestration platform, you know. And we began by looking into some applications that we wanted to build and some of them were in health and wellness, some, some other areas. And we built some AI applications. They became our reference applications. And then we stepped back and said what's the lowest comedy mode? The problem that we wanted to solve was there's an enormous amount of complexity in these new technologies. They're brand new. There's not a huge knowledge set out there in the market.

Spencer Reagan [00:40:07]: There's lots of organizations out there that have really competent developers, app developers, IT teams, but they don't have AI experience means that they are probably capable of doing all of this if we just give them the right tools and put it together in the right way. So that's where we started. We said, hey, there's this need for everybody to adopt AI, find ways to bring this into their organization and get value out of it. How can we make it simple for the teams that are in place today to be able to do that? That's the system we started put together. We put together a system that's agnostic of the model provider of the tool, of the token of whatever it is. We just wanted to put together a canvas, if you will, to assemble AgentIQ business software to test it, to manage it, to monitor it, to have ways to have it be reliable. And then we've added in a lot of other feature sets, the evaluations, the red teaming, the security guardrails, a lot of the constraint stu that I talked about earlier, that ability to let your agents go call tools, but then also not do some of the more destructive actions or change those actions around as the agents go through their execution based on what they've looked up, you know, if they've looked at some sensitive information, they shouldn't be emailing outside the company at the same time. Some of these things that are, you know, just really simple to understand, but haven't existed in this new paradigm of, of AI technology.

Spencer Reagan [00:41:27]: So that's what we do at Aeria and that's really what we put together. All of the security and guardrail and stuff that we, that we market, that we provide value with is available whether you build your agents inside Aeria or not. But we did begin as a true agent building platform.

Demetrios Brinkmann [00:41:44]: Tell me more about the red teaming, because that always fascinates me and it always feels like it's gotta be the most fun job. Just the professional Red teamer.

Spencer Reagan [00:41:57]: We do have a PhD on stab and he does love what he does in that aspect. Our red teaming, we tried to build it from an agentix standpoint almost from day one. We really acknowledged that. Boy, as you build these agents, QA is just really tough. Standard integration tests just don't work right. So you have to come up with these ideas, these schemes of adversarial testing, because really, even if you're testing a conversation or you're testing an agent for a particular output, you can't just ask it one question. You have to have a conversation. You have to have a conversation with essentially almost different personalities.

Spencer Reagan [00:42:34]: You have to come, how does an angry person react to this conversation? How does a person who doesn't speak English as their first language, a person who is too verbose, all of those things, you have to have always come up with these ideas of, of who are these personalities? And then let that be an adversarial challenge to the agent you're looking to test. Wow. And once you've done that, that's actually not the hard part. The hard part is you get all these answers, you get all these conversations, how do you evaluate them? So you've generated all this synthetic data about how your agents perform and then what's important about that, right? How accurate were the eight answers that came out? How closely do they compare with what you expected it. How well can you trust that this agent is going to behave the right ways in these environments?

Demetrios Brinkmann [00:43:20]: Sounds like that's a very expensive task. And that's why I, I mean, you've undoubtedly heard LLM as a judge is kind of what most people throw at that. But I am not convinced that you can just. LLM as a judge it, especially for something like red teaming, where it is a little bit more high risk, you.

Spencer Reagan [00:43:42]: Know, help me Understand what are you worried about with LLM as a judge? We go to pros and cons, but I'm curious.

Demetrios Brinkmann [00:43:49]: Yeah, yeah. I feel like the main thing on there is that you can not know what you don't know. If you're throwing LLM as a judge at it, then you feel like things are going well because the LLM has said yeah, this is all good, we're golden here. But in reality it's not among a large corpus of data. You may let something slide under the cracks and if it is a red teaming use case, it just feels like you want to have someone be really getting intimate with those multi turn conversations that you've created to know and to understand the way that the agents are communicating.

Spencer Reagan [00:44:48]: Yeah, I'd say that's accurate. You know, we generally encourage a combination of spot test and LLM as a judge because you can get some value of scale out of LLM as a judge across multiple iterations that you can't doing true human eyeball spot checking. But to your point think it's just as important because it's probably one of those things where we, you know, encourage both approaches. But I don't know that there's a, I don't know that there's a solution on either side. You know, the, the scale of being able to run these types of evaluations without needing a hundred people in a room reading reams of conversations is valuable. Yeah, you know it is.

Demetrios Brinkmann [00:45:25]: Yeah, I, I see what you're saying. It's like you use both so that you can almost like sanity check, have one sanity check the other.

Spencer Reagan [00:45:36]: Absolutely, absolutely. An interesting paradigm. I don't know if you've seen this or it's come up. We found though, in times when the evaluation is against different base models. I don't know if you've seen this. Instead of just evaluating based on personality, if you're evaluating a different model that you might use in a few different agents or a few different models that you might use in one agent agent. We've seen that the judge evaluating it, the models tend to prefer themselves so much that you have to make sure you get a diverse set of judges because it's undeniable. If you look at the statistics and you do this and you just measure multiple models and you evaluate their efficacy on the answers and you use models to evaluate that efficacy.

Spencer Reagan [00:46:19]: You undoubtedly see over time the models prefer their own answers. It's really interesting.

Demetrios Brinkmann [00:46:25]: I mean, sounds a lot like some humans I know again, right?

Spencer Reagan [00:46:31]: There's the, there's the theme I guess. Right. As much as we might not want to think that we are as deterministic or as understandable as we might be, it might be that we're a lot like these things.

Demetrios Brinkmann [00:46:45]: And are you evaluating the conversations for a certain set of important features, I imagine, and then evaluating or just looking back on how well they are tool calling or how well the agents are actually like, executing what was asked of them. Can you break down that? Because I've heard some folks talk about how they'll go as far as to create heat maps for the agents and really see where the agents fail over many different simulations. You can get like, a clear picture because of the heat maps that you've created.

Spencer Reagan [00:47:29]: Yeah, and I've seen that, so that's really compelling. Generally, what we encounter from our customers is they know very specifically what they want in an outcome. They know very specifically how they want an agent to react. They have an idea of what they would prefer. So it's less about a. A true evaluation of kind of choose your own adventure and let's see where we end up and where we can get. And it's more about. Here is the exact answer that I would expect or would want in that case.

Spencer Reagan [00:47:57]: And so most of the evaluations that we see our customers run are how close am I to the true north of what I said I wanted the agent to answer in this particular situation?

Demetrios Brinkmann [00:48:08]: Oh, that makes it so much easier, I imagine.

Spencer Reagan [00:48:13]: Simpler, you know, but then you still just only have this linear approach of, you know, how, how, how like, was this answer, which, you know, sometimes just a percentage can be a little bit misleading, but generally gives our customers what.

Demetrios Brinkmann [00:48:28]: They'Re looking for, you know, Amazing. Now, is there anything else that you want to talk about that we didn't.

Spencer Reagan [00:48:34]: Hit on, you know, we're seeing. I know, I know we talked a lot about integrations and, and that the tool calling being where agents really start breaking out of that and start. Start exhibiting behaviors that look really cool, that look like they're figuring stuff out. We're spending a lot of time there. That's, to me, one of the more important areas. And, you know, the idea that, you know, we all were settled on RAG for a little while and, and we just had return log minute generation as a way to get information. LLMs, as you know, I think it was a flash in the pan. And what we're seeing now is this idea that it makes more sense to empower agents to go look up the data and do what we described.

Spencer Reagan [00:49:18]: Choose your own adventure. Right. Decide if you need to look up more. The idea of rag feeding matching chunks in is probably going to really diminish in its efficacy and its applicability going forward. And so I think what we are seeing is just that importance, that importance of understanding how do you equip an agent with tools? And it kind of comes down to apps, right? We all have apps, things that we interact with and now we have agents. How do we make our agents securely talk to our apps? You know, we've seen MCP just proliferate, just everywhere. Everybody's running these GitHub repos. You know, you want an MCP server for something you search and we're just, we're grabbing things out of GitHub like it's hugging face, right? We don't even know the guy that wrote this MCP server, but it says it's going to go check that API and give me the price of gold in Zimbabwe.

Spencer Reagan [00:50:11]: Who knows, right? And so we're just running this, this GitHub repo. We're just running it, right? We're just yeeting around API keys. It's nuts. It's almost as if we've taken decades of security practice and just throwing it out the window. You know, the amount of API keys you can find in GitHub repos now or the things that are just being thrown around in emails is nuts. Now there's some real truth to why everybody's doing that. There's insane value to these tool calls. The value though is about letting the LLMs go get this information and interact with these agents or interact with these apps in business.

Spencer Reagan [00:50:48]: That's incredibly dangerous because essentially we're now saying, hey, I'm going to give this agent an API key to go interact with my Atlassian products set or my Salesforce product set. That's dangerous. And we don't, we don't then have identity. We have trouble with data governance and access. So the real next frontier I think is in trying to figure out how authentication fits into all this, right? We need to give these agents identities or we need to let them use the identity of the person executing them, right? Which starts to come into this idea of oauth of user based authentication of dynamic client registration. Lots of really detailed stuff about how can we make sure that when we go execute an agent and it has five tools that when it calls those tools, it's doing it with the credentials of the user that initiated the call. And I think that's going to be one of the biggest unlocks for us as we turn this Corner into securely empowering these agents to interact with our business systems and figuring out how do we handle this authentication throughout the system, throughout the pipeline of this agent, taking a query, going to execute some tools, doing some things, but doing it in a secure way with identity that matches our existing practices. To get out of this whole, like, you know, let's email around some API keys and run some random GitHub repos.

Spencer Reagan [00:52:02]: But so that's, that's where we're going and we're putting a lot of energy and information into it. Go ahead.

Demetrios Brinkmann [00:52:09]: Especially when it comes to dynamically provisioning permit permissions and taking them away, because you don't always want the agent to be able to do what you want it to do right now, now, 100%.

Spencer Reagan [00:52:23]: And that's where that idea, those constraints comes in, right? That idea that, you know, simple basic rules that say, as I described earlier, if an agent goes and looks up something in my Atlassian, in my jira, right? And then it goes to send an email, I don't want that email to be sent outside the company because it's just looked up sensitive information that's in the context window. It might, you know, in a, in a, in a bad scenario might find some reference in a JIRA bug to a customer and a problem that they're expecting an answer to and that customer's address might be in there. And look, I've had agents do some crazy stuff when you give them access to do it. But you know, you could see a scenario where an agent's going through and it's looking up a jury ticket and it finds that there was a customer and that customer's contact information is right there and it wanted to know about that bug and the agent just says, let me do this. Right? Those are the scenarios we really want to try to avoid, right? We can understand what that scenario is and the risks of it before that agent even executes. A lot of times you can just look at agentically the prompting and instructions of an agent and know what it should and shouldn't be doing. But this simple idea of this, if this, then that statements around agent execution, around tool calling and security dynamically adjusts in real time based on what that agent has done, based on where it's going. To your point, right? All this happens based on a rule set because it's got to happen faster than we can keep up with.

Spencer Reagan [00:53:46]: So we have to establish that rule set ahead of time and just know that our agents are going to work in sight it. Even if there's some intelligent, dynamic stuff in the meantime. But, but that's, that's what we're seeing is the, the way that we can actually finally cross this chasm of businesses have these cool prototype agents, but they're scared to actually using the soft.

