
The Future of Compute: How AI Agents Are Reshaping Infrastructure // Diego Oppenheimer - Keynote // Agents in Production 2025

Posted Jul 23, 2025
# Agents in Production
# Hyperparam
# AI Agents
# AI infrastructure

SPEAKER

Diego Oppenheimer
Head of Product @ Hyperparam

I am a serial entrepreneur, executive, product developer, and investor with a deep passion for data and AI. Currently, I do deep advisory work for startups and scale-ups.

My journey includes being a co-founder at Guardrails AI, CEO in Residence at Factory, a venture fund focused on AI investments, founding Algorithmia (acquired by DataRobot), where I served as CEO, and leading teams at Microsoft to develop key data analysis products like Excel, SQL Server, and Power BI. I have applied my extensive experience and involvement in the AI/ML community to help drive innovation and set industry standards.


SUMMARY

The rapid evolution of AI agents is exposing a widening gap between their unique computational needs and today’s infrastructure. This keynote cuts through the hype to highlight why traditional compute paradigms—mainframes, VMs, containers, even serverless—are struggling to keep up with agents’ bursty, stateful, and hardware-hungry workloads. We’ll examine the economic and technical inefficiencies organizations face, from unpredictable scaling to persistent state management, and why simply “tweaking the cloud” won’t cut it. Expect a candid look at the real operational challenges, the architectural dead-ends, and the tough question: do we adapt existing frameworks, or is it time for a radical rethink of how we design and manage compute for the AI era? Actionable insights, not wishful thinking.


TRANSCRIPT

Diego Oppenheimer [00:00:00]: Hey everybody, thank you so much for having me. Thank you for the patience. So today what I want to cover is a little bit of an introduction, thinking through the infrastructure of what's going on as these intelligent, cognitive workloads start to become reality. I call it the infrastructure paradox: what happens when software starts to think? I'll rush through the bio to spare you: I've been working on AI, data, and infrastructure for the last 15 years across multiple companies that I either co-founded, built, or helped build out. So it's a space that I'm super excited about. I couldn't be more excited about the time that we live in and to be here chatting with you today.

Diego Oppenheimer [00:00:54]: So what are we going to cover today? Just to give you a little idea: there are going to be some really, really good talks throughout the rest of the day that are really practical and to the point. I want to give you a little bit of high-level thinking about what's actually happening in infrastructure. The first thing is talking about what I call the great mismatch: how our current systems are essentially misaligned with how thinking software is probably going to behave, and I'll talk a little bit about what I mean by that. The reason why I believe this to be true is that there are these emerging patterns, these signals, starting to happen inside our software, inside the infrastructure that we're building. They're starting to indicate that maybe what we've had up until now is not how infrastructure needs to be built out for the future, and that's something we should start thinking about.

Diego Oppenheimer [00:01:48]: And then I'm going to challenge you all to a thought experiment. If we had the opportunity to reimagine infrastructure from the ground up, and we had the ability to really think about what cognitive workloads would look like from first principles, how would we design the infrastructure of the future? How would we design our production systems? Then beyond that, some possible future ideas of how we might want to reshape and build systems for the next decade. First, I want to give you an idea of what this might look like, maybe some of the problems. Let's just say we send off our autonomous agent, we let it loose in production, and we want to think about what cracks first. I've mapped three different resource planes that you actually pay for. On the compute side: maybe that agent goes and spawns 12,000 forward passes in sub two minutes, forks 15 sibling agents, each one with its own credential scope. Suddenly what's happening at that compute layer is that our model API queues start spiking.

Diego Oppenheimer [00:03:04]: Our P95 goes to shit, our scheduler starts reacting to the signals, and we start scaling up and down really rapidly, burning through a lot of cash. Maybe at the network and streaming layer, we suddenly are streaming 10,000 tokens directly back into the UI. Because of that, what we see is WebSockets start to buffer overflow. Our reverse proxy gets into a 502 loop. The end user starts experiencing these partial completions, these retries, and this amplifies the load. Maybe we look at our memory layer, right? And suddenly we're seeing that our agent is pulling 2 gigabytes of embeddings and docs into its context window. It's emitting eight-megabyte chain-of-thought logs each cycle. Our cache miss rate is a complete disaster. We're pulling 6 gigabytes of cold S3 reads, thrashing the page cache.

Diego Oppenheimer [00:03:59]: Our application performance management system actually truncates logs at 500 characters, so our forensic debugging is completely worthless. And finally, these agents have been lifting credentials and fanning out those credentials in a completely cross-tenant way, breaking every single security paradigm that we're supposed to have across these credentials. So this could all happen in about an hour, or maybe even less, in a world of autonomous agents. So your compute budget, network fabric, and storage: could they survive this? And a hint: the reality is that the traditional scale-up, scale-down playbooks that assume predictable CRUD systems just don't translate to how agents behave. So keep this scenario in your head as the future where things might break. That's the thought exercise. Now I want to give a little bit of a history lesson, a very short one. For the last seven decades we've essentially built infrastructure around predictable patterns that follow deterministic, clear rules of engagement.

Diego Oppenheimer [00:05:07]: We have the request-response pattern: simple client-server interactions with predictable latency and resource usage. We have the store-retrieve pattern: data follows clear patterns of persistence and access, with known capacity requirements. And we have the scale-up, scale-down of infrastructure responding to metrics. Deep diving decade by decade: 1960s mainframes, we optimized for batch processing, running jobs sequentially with minimal user interaction. In the 80s we moved into the client-server pattern, built for transactions and database operations with clear response patterns. In the 2000s we got the web, and so now we're talking about designing for web-scale elasticity and resiliency across distributed systems. Then we move into serverless, where we're engineering for event-driven, ephemeral execution with minimal infrastructure management, and we can react to certain metrics to understand how we should be scaling our infrastructure.

Diego Oppenheimer [00:06:13]: But suddenly we get to today: what's the infrastructure paradigm that's actually going to emerge to support these cognitive workloads? That's what I really want to talk about today, and there are some clues that we might have to think about things differently. I'm always of the idea that we shouldn't reinvent the wheel, and it's really not necessary to reinvent the wheel. But there are some clues right now in agent patterns that are saying: hey, we might actually have to rethink this from first principles. So, software that thinks: agents don't just process. They break the fundamental rules we've designed our systems around. They perceive: they can continuously look at the environment and detect meaningful changes. They can act autonomously, on their own, without human triggering.

Diego Oppenheimer [00:07:04]: They can reason: they can apply logic and inference to make sense of information as it's coming in. And they can remember, in the sense that there are massive context windows across multiple interactions that need to flow up and down the chain of thought so that you can actually complete these tasks. These four fundamental characteristics are very different from any software we've built before, and that might have downstream implications for the way that we build our infrastructure and our systems. To give a little bit of a visual around the fundamental difference: in a traditional CRUD app, we have this very linear input-output, predictable resource consumption, mostly stateless operations with some level of explicit persistence. User, client, HTTP endpoint; we have the business logic, we might retrieve some stuff, and then we return the response. When we start seeing agentic workflows, we're seeing a lot more cyclical reasoning loops with feedback. We're seeing resource consumption that is not based on clock time but on the complexity of the task being done, which has cost implications, timing implications, and downstream compute-triggering implications.
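To make that contrast concrete, here is a minimal sketch of the two shapes in Python. The names are hypothetical, and llm_step stands in for whatever model call the agent makes; this illustrates the pattern, not anyone's production code.

```python
def crud_handler(request: dict) -> dict:
    """Traditional pattern: one linear pass, predictable resource consumption."""
    record = {"id": request["id"], "data": request.get("data")}  # store/retrieve
    return {"status": "ok", "record": record}                    # single response

def agent_loop(goal: str, llm_step, max_steps: int = 20) -> list[str]:
    """Agentic pattern: a cyclical reasoning loop whose cost scales with task
    complexity (how many reasoning steps it takes), not with request count."""
    context = [goal]                    # stateful context that grows every step
    for _ in range(max_steps):
        thought = llm_step(context)     # each pass may trigger tools and more compute
        context.append(thought)
        if thought.startswith("DONE"):  # the model, not the caller, decides when to stop
            break
    return context
```

The cost of crud_handler is fixed per request; the cost of agent_loop depends on how long the model keeps thinking, which is exactly the budgeting problem described above.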

Diego Oppenheimer [00:08:20]: We have this idea that we now need stateful context requiring persistent memory. It's not just different, it's almost orthogonal to anything we've built. The fundamental patterns just don't match. Here are a couple of deeper dives around those signals. We start seeing these compute tensions. We have memory versus compute: agent memory needs to be persisted, but we want compute that demands ephemerality. We want to be able to just bring in the compute and take it down, yet we want this really persistent memory. In the current architectures we have, if you think about a Lambda function or something, we have to choose: do we want the persistence or do we want the ephemerality?
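One common way to resolve that tension, sketched below under assumed names (ExternalStateStore is a stand-in for Redis, S3, or a database), is to keep the compute Lambda-style ephemeral while the memory lives outside it:

```python
import json

class ExternalStateStore:
    """Stand-in for a persistent store (Redis, S3, a database).
    The compute stays ephemeral; the memory does not."""
    def __init__(self):
        self._data: dict[str, str] = {}

    def load(self, agent_id: str) -> dict:
        return json.loads(self._data.get(agent_id, "{}"))

    def save(self, agent_id: str, state: dict) -> None:
        self._data[agent_id] = json.dumps(state)

def ephemeral_worker(agent_id: str, message: str, store: ExternalStateStore) -> dict:
    """Lambda-style function: spin up, hydrate state, do one step, persist, die."""
    state = store.load(agent_id)       # hydrate the persistent memory
    state.setdefault("history", []).append(message)
    store.save(agent_id, state)        # persist before the compute vanishes
    return state
```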

Diego Oppenheimer [00:09:15]: We have a cost structure disruption, because today expenses actually spike with the cognitive depth of the workload, not the throughput. We've always been paying for throughput, we've been paying for response time. And now we have to understand that our cost structure is actually based on cognitive depth. Our models for predicting and controlling cost break down completely. Probabilistic debugging: we had a little bit of this early on, when we were starting to build out the first MLOps systems way back in the day. We already had a taste of what happens when we're dealing with probabilistic software, and what breaks in the software development lifecycle around that. Today, when we look at these agentic systems, how do we trace reasoning that's inherently probabilistic? Traditional debugging assumes a deterministic execution; even in a traditional MLOps pipeline, we generally understand the path that was covered.

Diego Oppenheimer [00:10:05]: But now, as we let agentic workflows go out, the path that is covered is completely probabilistic based on the task. And now we have to go debug that, we have to audit it, we have to be able to understand what's going on. And then finally we have this concept of swarm coordination, which is agents actually collaborating at machine speed. And our orchestration systems are going to struggle with the communication density, because it's not just HTTP requests, right? We're now passing this memory back and forth, which is very, very large, and that is clogging up all the communication protocols.
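On that probabilistic-debugging point, here is a minimal sketch of what untruncated, replayable reasoning traces could look like; the span schema is a made-up illustration, not a real tracing library:

```python
import time
import uuid

def record_span(trace: list, parent_id, kind: str, payload: dict) -> str:
    """Append one reasoning step to a flat trace so the one path the agent
    actually took can be reconstructed and audited after the fact."""
    span_id = uuid.uuid4().hex
    trace.append({
        "span_id": span_id,
        "parent": parent_id,   # links steps into the (probabilistic) path taken
        "kind": kind,          # e.g. "goal", "llm_call", "tool_call", "fork"
        "ts": time.time(),
        "payload": payload,    # full prompt/response, never truncated at 500 chars
    })
    return span_id

trace: list = []
root = record_span(trace, None, "goal", {"goal": "summarize the Q2 report"})
record_span(trace, root, "llm_call", {"prompt": "...", "response": "..."})
```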

Diego Oppenheimer [00:10:44]: So here are some of these emerging patterns to start thinking through. Every agent has a story. Traditionally, we have the model weights shared across all agent instances: I'm calling an API and calling a model. We could measure it in gigabytes, and it was static; model weights were essentially static during runtime, and that was pretty well supported by current systems. We spent a bunch of time trying to figure out how to run these things on GPUs.

Diego Oppenheimer [00:11:03]: And that is true. But overall we could support it with our current architectures. In the agent-state world, we have state that is unique per agent instance and grows with the interaction depth of each agent. It contains the full conversation history of previous agents that have worked on the task, and we need to track goals and reasoning chains. Our stateless paradigms, especially around memory, completely hit a wall. And so every conversation, every goal, every piece of reasoning all needs to be persisted somewhere now, and it needs to be usable as these components fan out. The second emerging pattern is the collaboration imperative.

Diego Oppenheimer [00:11:51]: We're going from compute isolation to what I call compute intimacy. Single-agent systems: simple vertical scaling, one conversation thread that manages mostly everything, clear resource boundaries in terms of what we're calling and when we're calling it, response times, and somewhat well-understood monitoring. I'm giving us a little bit more credit than we deserve right now, but I would say it's pretty well understood. In a multi-agent system, this complexity grows exponentially. We have these massive fan-out and fan-in operations happening at machine speed. We have shared context requirements and we have to do coordinated decision making. While we've built for good security and proper systems design based on isolation, we're now actually being forced to enable intimacy between these components.

Diego Oppenheimer [00:12:40]: When agents collaborate, they create communication patterns our systems were not built to support. The third pattern is an economic inversion: we are moving from the concept of pay-for-uptime to pay-for-thought. In the traditional model, we pay for resources no matter what the utilization is. Obviously we have serverless and we have scaling up and down, but ultimately we reserve resources for a period of time and then let them go. The capacity planning we do there is really around peak load handling. We're always trying to figure out what peak load is going to look like, and we're trying to determine that to be able to provision infrastructure. All our cost optimization methodologies are around improving utilization percentage.

Diego Oppenheimer [00:13:24]: That's really what we're looking for. And in this new, emerging model, we're actually paying for reasoning cycles, cycles that produce value. The capacity planning focuses on reasoning depth: how much chain of thought is going to go through it. And cost optimization means improving cognitive capability per dollar. So it's a very different economic paradigm for how we think about our systems and their optimization. This essentially produces an inversion: we're moving toward a model where the value is in the thinking, not in the runtime. And then we get to the cold start paradox. Anybody who's played around with a bunch of serverless functions knows we go back and forth between efficiency versus responsiveness, and we have to make this choice.

Diego Oppenheimer [00:14:13]: Serverless cold starts: about 100 milliseconds to get a function up and running, as a typical initialization time. Human expectations for response time in a natural conversation are sub-250 milliseconds; machines are way below this. And now we have to pass in the meaningful context that's required for these agentic interactions to work, somewhere between 100 megabytes and a gigabyte, potentially. Serverless gives us efficiency, but cold starts kill conversations. Do we keep everything warm? Well, now we just burn money across the entire system. This tension now defines one of the agent infrastructure challenges in terms of how we have to think about it.
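A quick back-of-envelope on those numbers shows why the context transfer, not the function start, is the hard part. The bandwidth figure below is an assumption for illustration; the other numbers are from the talk:

```python
COLD_START_S   = 0.100   # ~100 ms typical serverless initialization (from the talk)
LATENCY_BUDGET = 0.250   # sub-250 ms human conversation expectation (from the talk)
CONTEXT_BYTES  = 500e6   # 100 MB to 1 GB of context; 500 MB taken as a midpoint
BANDWIDTH_BPS  = 1.25e9  # ASSUMPTION: ~10 Gbit/s network, i.e. 1.25 GB/s

context_load_s = CONTEXT_BYTES / BANDWIDTH_BPS
total_cold_s   = COLD_START_S + context_load_s

print(f"context load alone: {context_load_s * 1000:.0f} ms")   # ~400 ms
print(f"cold path total:    {total_cold_s * 1000:.0f} ms vs "
      f"{LATENCY_BUDGET * 1000:.0f} ms budget")                 # ~500 ms vs 250 ms
```

Under these assumptions, moving the state alone blows the conversational budget, which is why "keep it warm or load it cold" becomes a money-versus-latency choice.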

Diego Oppenheimer [00:14:57]: So now I've given you a couple of clues on the cost, how the interactions work, the network and interaction density, and the ability we have to move state around: some problems that we might see as these mega agentic systems come into view. What if we took a step back and said, hey, we don't have 70 years of history, let's design infrastructure specifically for cognitive workloads? The thought experiment I invite you all into is: let's forget everything. What would we do? What would be the first principles we'd start from, from the ground up, to think through how we would do this? Because everything needs a name and I invented one, I call it the Agenda. It's a potential reimagination, from first principles, of what it might look like to build out these systems from scratch. I have five principles that I think are worth talking through, and the first one is state-resource decoupling.

Diego Oppenheimer [00:16:08]: Resources flow to the need. The first principle is about decoupling state from resources. This means breaking the rigid bonds between where data lives and where processing happens: compute, memory, and storage operate as resource pools with on-demand consumption. Agents consume exactly what they need, when they need it, and we eliminate over-provisioning by no longer sizing for peak load or worst-case scenarios. This is about dealing with the unpredictability of what the agents might actually want to do.

Diego Oppenheimer [00:16:43]: Principle number two is interaction-driven provisioning, which I call infrastructure that breathes: resources that scale with the cognitive demand of the interactions, not just the request count. Try to understand what the cognitive load of an interaction might look like, and understand how to scale resources behind that. Start doing pattern recognition around usage and proactively allocating resources where needed. We do this today somewhat, but now we have to do it around reasoning and the complexity of those reasoning tasks. We really want to be able to go from zero to hero in milliseconds, because we want that natural conversation flow to keep happening. Principle three is context persistence: state that transcends execution boundaries seamlessly. We don't just want to save state; we want to version it, we want to make it searchable, we want to be able to roll it back. We need memory for agents to become a first-class citizen.
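Here is a minimal sketch of what memory as a first-class, versioned, searchable citizen could look like; VersionedMemory is a hypothetical in-memory stand-in, and a real system would back it with durable, indexed storage:

```python
import copy
import time

class VersionedMemory:
    """Don't just save agent state: version it, search it, roll it back."""
    def __init__(self):
        self._versions: list[dict] = []

    def commit(self, state: dict, note: str = "") -> int:
        """Take an immutable snapshot; returns the version number."""
        self._versions.append({"ts": time.time(), "note": note,
                               "state": copy.deepcopy(state)})
        return len(self._versions) - 1

    def rollback(self, version: int) -> dict:
        """Restore any earlier snapshot (cross-execution continuity)."""
        return copy.deepcopy(self._versions[version]["state"])

    def search(self, needle: str) -> list[int]:
        """Naive substring search; a real system would index or embed."""
        return [i for i, v in enumerate(self._versions)
                if needle in str(v["state"]) or needle in v["note"]]

mem = VersionedMemory()
v0 = mem.commit({"goal": "file the report", "history": []}, note="start")
mem.commit({"goal": "file the report", "history": ["drafted section 1"]})
state = mem.rollback(v0)   # wind the agent back to before it went wrong
```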

Diego Oppenheimer [00:17:42]: We can do that via state serialization, we can do that via memory versioning, we can do that via cross-execution continuity, which is being able to pass context between environments without losing fidelity or breaking security boundaries. And we need to make this all searchable. Principle four: we want fluid execution boundaries, compute without borders that can flow from device to edge to cloud. We want on-device for privacy-sensitive, low-latency, offline capability; edge for latency-sensitive reasoning, regional compliance, and real-time coordination; and cloud for the most computationally intensive tasks, global optimization, and long-term memory. Also, the reality today is that the absolute best reasoning models are only available in the cloud, so we're going to want to use that. So we have to think about these execution boundaries and how we can, depending on the task, move up and down that stack. And finally, the one I'm probably most excited about: starting to think about market mechanisms for optimal resource allocation. This is about markets, not managers.

Diego Oppenheimer [00:18:51]: So, resource bidding, where agents bid for resources based on their task priority or their value creation potential; value-based prioritization, where the system allocates based on where it thinks the most value is going to be created, not just on input requests; and then self-organizing systems, where resources essentially flow to the highest-value workflow through market mechanisms at any given time. Certainly this is not a prescription, but it's a lens for thinking about the future. These principles could reshape how we build systems for the next decade, and I think could actually allow us to rethink how we build infrastructure and production systems over time. Again, it's a lens, not a blueprint. I think these principles help us see where the friction exists today. They suggest where innovation might emerge moving forward. And finally, they challenge assumptions about our infrastructure and the way it works today.
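As a toy illustration of "markets, not managers", here is one hypothetical way a bid-based allocator could rank requests. The names and the value-per-GPU-second heuristic are assumptions, a lens rather than a prescription:

```python
from dataclasses import dataclass

@dataclass
class Bid:
    agent_id: str
    gpu_seconds: float     # resources the agent is asking for
    value_estimate: float  # expected value of completing the task, in dollars

def allocate(bids: list[Bid], capacity_gpu_s: float) -> list[Bid]:
    """Grant capacity to the highest value-per-resource bids first,
    instead of first-come-first-served request counts."""
    winners = []
    for bid in sorted(bids, key=lambda b: b.value_estimate / b.gpu_seconds,
                      reverse=True):
        if bid.gpu_seconds <= capacity_gpu_s:
            winners.append(bid)
            capacity_gpu_s -= bid.gpu_seconds
    return winners

bids = [Bid("triage-agent", 10, 50.0),
        Bid("report-agent", 60, 90.0),
        Bid("cleanup-agent", 30, 15.0)]
print([b.agent_id for b in allocate(bids, capacity_gpu_s=70)])
# -> ['triage-agent', 'report-agent']: resources flow to the highest-value work
```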

Diego Oppenheimer [00:19:46]: But there are still some problems, things that keep me up at night, some hidden challenges that are not really being addressed. The good thing is, I was looking through the schedule of the conference and there are plenty of talks addressing this, which is awesome. So you're going to have a great day. Authentication: who's the agent acting for? How do we give access? How do we take away access? How do we control and understand who and what and where? There's the governance angle: how do we audit machine-speed decisions? We have compliance: how do we allow agents to handle regulated data? And I know there's a talk later today on a really important subject that I think most people are not thinking about, which is cost attribution. Who pays for thinking? What department? Where does it go? This is a really interesting concept because it's pretty new to the world of agents: what department pays for the thinking that's going on in these agentic workflows? So, three questions about the future as we explore it together: what infrastructure assumptions today no longer serve us? How do we build systems for workloads that think? And what happens when apps... I'll leave you with those for today.
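On that cost-attribution question, a minimal sketch of what tagging reasoning spend to a department might look like; the per-token price is an assumption purely for illustration:

```python
from collections import defaultdict

PRICE_PER_TOKEN = 15 / 1_000_000   # ASSUMPTION: $15 per million tokens

ledger: dict[str, float] = defaultdict(float)

def meter(department: str, tokens: int) -> None:
    """Tag every reasoning cycle with the department that asked for it,
    so 'who pays for thinking' has an answer at month end."""
    ledger[department] += tokens * PRICE_PER_TOKEN

meter("finance", tokens=120_000)   # forecast agent thinking
meter("support", tokens=40_000)    # triage agent thinking
meter("finance", tokens=300_000)   # forecast agent thinking harder

for dept, cost in sorted(ledger.items()):
    print(f"{dept}: ${cost:.2f}")  # finance: $6.30, support: $0.60
```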

Diego Oppenheimer [00:21:10]: The reason why this matters today is that I think we're at a huge inflection point. Not because these agents are some magical thing, but because they're just different. We've seen and covered over the last couple of minutes why the infrastructure is different, why the behavior is different. I think this is an opportunity for companies to thrive in and really shape what the next decade of infrastructure management is going to look like. I invite you all: let's figure this out together. To me, the future of infrastructure is not about fast servers; it's about systems that support software that reasons. I welcome you all to one of my favorite conferences, which is Agents in Production 2025. If you're interested in reading a little bit more, with Priyanka I developed a pretty in-depth white paper around why the future of compute might look different and how agents are reshaping infrastructure. You can check out the QR code here to go look at that.

Diego Oppenheimer [00:22:05]: And hit me up on LinkedIn. Challenge me if you think I'm completely wrong. If you agree, I want to hear from you. If you're playing or working in this space, I also want to hear from you. So don't be shy; feel free to reach out, and I'd love to chat about this further. Thank you very much for your time, and sorry for the delays getting started.

Demetrios [00:22:27]: I've got to give you a round of applause there. That is so good, man. I really appreciate that. Of course, I knew you were going to bring the heat, and it was well worth the wait, because a lot of this stuff I am just hearing for the first time. But as you say it, it's like: oh yeah, how are we going to do that? That is going to be a headache. I have just one question, and I'll let people throw their questions into the Q&A while I'm asking this first one. The main thing that I was wondering about was this memory versus compute tension.

Demetrios [00:23:06]: And why were you saying that? Is it just because compute is elastic and memory isn't, or...? I didn't quite get where that tension was.

Diego Oppenheimer [00:23:17]: Okay, so the memory versus compute tension is really around this: we want compute to be just in time. Serverless functions are a great example, right? I call one, it spins up, it does some stuff, it kills itself. They're completely stateless. I want the compute of an agent to work that way, but I want it to have full context of all the previous conversations. I want one agent to be able to pass what it was working on, the full conversation history, to the next one. And so you have this tension of decoupling what the compute requirements of a task look like versus what the memory requirements look like.

Diego Oppenheimer [00:24:00]: There's a bunch of work going on around this. Google and Meta are working on something called CXL, which, from a data center perspective, completely separates memory from the actual compute cycles, which is really interesting. But that's the decoupling that's required right now in terms of that memory persistence. We want agents to be able to work with full memory and context of everything that happened before, because that's how they work and how they get better. But we only want them to use the compute that they need at that time. And it's not just an API call, because they might be doing other tool calls, they might be doing processing.

Diego Oppenheimer [00:24:38]: We don't really know the shape of the task it's doing at that time, outside of the clear OpenAI call or something like that.

Demetrios [00:24:48]: So basically it's a little bit of this philosophy of pets versus cattle, and we do want pets in memory, but we don't want them in compute, or we haven't been wanting them in compute. And now we kind of need to, in a way.

Diego Oppenheimer [00:25:09]: Yeah. So again, it's just the pattern that is needed for these workloads to execute. Right now, you have a full day of asking people this question, so go ask them: outside of "hey, we have some level of compute calling OpenAI or Claude or whatever it is," what's the container? What's the shape? Do you have a VM that's running? Are you putting everybody on that VM? Is each agent spinning up its own Docker container and running in it? Are they super thin, just FastAPI wrappers and some Pydantic functions? Which I heard is essentially how all of ChatGPT runs.

Diego Oppenheimer [00:25:58]: It's not a bad thing. What I'm saying is it's about the shape, right? The shape of how these functions are running and the complexity of the compute that's running inside of them.

Demetrios [00:26:09]: So we've got some questions from the incredible chat, and I'm going to start asking away. Have these agents been tested in an open environment, without constraints, to see how they operate? Imagine if our brains operated with the same constraints as the agents; we wouldn't be able to think clearly. Maybe that's more of a thought than a question. Let me jump in.

Diego Oppenheimer [00:26:33]: I mean, I think there have been experiments, right? One of the ones I think is interesting, that you can read about, and I can't remember what they called it: Anthropic ran like a vending machine.

Demetrios [00:26:48]: I saw that.

Diego Oppenheimer [00:26:49]: Right? Which is...

Demetrios [00:26:53]: And it lost a ton of money.

Diego Oppenheimer [00:26:56]: Yeah, well, it ordered... shout out to Silicon Valley for making the joke 10 years ago, the show where the guy's agent ordered 1,000 pounds of burgers or whatever. Essentially that's what happened, but at a smaller rate. That's an unconstrained version of something like this, where agents went off and tried to run a business. It was a disaster. But that's just a point in time.

Diego Oppenheimer [00:27:24]: I mean, this stuff is all going to work.

Demetrios [00:27:26]: So we've got another great question coming through here in the Q&A, and it is: how can we refine this further to develop a framework for agentic workflow implementation? In particular, how do we handle agency? Who is responsible for the agent's outputs?

Diego Oppenheimer [00:27:45]: That's a great question. In terms of developing frameworks, my general principle is: try not to change anything that's not broken. So I was actually very reticent to say, oh, we need to rethink how infrastructure works. I really didn't want to, in the sense that I was pretty sure we could fit most of this to current compute patterns. But then as you start digging into it, you go: maybe it's not going to fit. It's like our old conversation around LLMOps and MLOps, right? Is there really a new ops or is it the same? We had that conversation for a while. But here, the core thing, probably the first thing that matters, is really this memory and context piece.

Diego Oppenheimer [00:28:36]: I think that's the most complex piece of the puzzle. There's this concept of context engineering, and really being able to help the agents get the context necessary to complete tasks, and to manage that over time, is probably the number one thing to figure out from a principles perspective. It's the most complex piece and the one that really doesn't fit the current paradigms that well. The second one is probably auth. If we go back to principles we had before, when we were building a lot of analytics tools: does it represent Diego? Does it represent a service account? Is there something in the middle? We have to create these permissions dynamically, potentially at request time, and we have to revoke them dynamically. We have to be able to understand not only that we created hundreds of permissions dynamically and revoked them, but also to go back in history and understand that we did that. So that complexity really goes into it; that's probably the second place where I think it's really important to rethink. Ultimately, today, on the traditional compute side, the density of compute, we're still offloading most of the GPU work to providers, which is really complex and not solved. But we're essentially forgetting about it and saying: OpenAI, that's your problem; Together, that's your problem.

Diego Oppenheimer [00:30:08]: And we just kind of make API calls, and what we do inside of the compute layer beyond that tends to be pretty thin today. I think that complexity is going to grow from a computation perspective. So that's why I would say memory, then auth, and then the general governance is what's going to make it acceptable.
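A minimal sketch of that dynamic grant/revoke-with-history idea; CredentialBroker is hypothetical, not a real auth product:

```python
import time
import uuid

class CredentialBroker:
    """Mint short-lived, scoped grants for an agent acting on a human's
    behalf, and keep an append-only audit log of every grant and revoke."""
    def __init__(self):
        self.audit_log: list[dict] = []
        self._active: dict[str, dict] = {}

    def grant(self, on_behalf_of: str, agent_id: str,
              scopes: list[str], ttl_s: float = 300) -> str:
        token = uuid.uuid4().hex
        record = {"on_behalf_of": on_behalf_of, "agent_id": agent_id,
                  "scopes": scopes, "expires": time.time() + ttl_s}
        self._active[token] = record
        self.audit_log.append({"event": "grant", "token": token,
                               "ts": time.time(), **record})
        return token

    def check(self, token: str, scope: str) -> bool:
        rec = self._active.get(token)
        return bool(rec and scope in rec["scopes"] and time.time() < rec["expires"])

    def revoke(self, token: str) -> None:
        self._active.pop(token, None)
        self.audit_log.append({"event": "revoke", "token": token, "ts": time.time()})

broker = CredentialBroker()
tok = broker.grant("diego", "crm-agent", scopes=["crm:read"], ttl_s=60)
assert broker.check(tok, "crm:read") and not broker.check(tok, "crm:write")
broker.revoke(tok)   # the audit log still shows the grant ever existed
```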

Diego Oppenheimer [00:30:29]: The other part of the question was: who's responsible for this? I don't have a good answer. Ultimately, is it still the technical teams that are building all of this inside organizations? Do they have to own everything? One could argue: you built it, you launched it, you own it. And there's a good thesis for that.

Demetrios [00:30:50]: But it's a lot easier to build something.

Diego Oppenheimer [00:30:53]: Right.

Demetrios [00:30:53]: You can have teams that aren't necessarily super technical building stuff too, which means...

Diego Oppenheimer [00:30:59]: Frameworks and harnesses for keeping things properly guardrailed are going to be super important.

Demetrios [00:31:09]: Well, I do love the idea that if we are giving auth and permissions to folks and you're in a regulated space, you need an audit trail on that. And that is a really tricky problem, because in case anything does, God forbid, go off the rails, you need to be able to show who gave it the permission and how much permission it had. So it's not only the auth part; it's the auditability of that auth and access. And so I love that. I also love the question of who pays for it, whose dime this is going on, whose budget it goes on. But there's more questions in the chat. I've got like one or two more and then we're going to jump to our next talk.

Demetrios [00:31:54]: Oh my God, there's so many good questions. Do you feel like... okay, here's one from Lack. Shout out to Lack; he's giving a talk later. Is this cognitive architecture relevant for boring enterprise applications with a few hundred simultaneous users?

Diego Oppenheimer [00:32:17]: I think, again, if we are thinking about systems that behave based on cognitive patterns versus traditional compute patterns, the answer is probably yes; the question is how acute the problem is. Just because you're a small enterprise doesn't mean you can't spawn 15,000 agents. The problem isn't based on size of company anymore. It's really around, if you think about it, the exponential growth curve of the tasks that are being generated and the spawning of those agents to go do stuff, even if each one is doing very, very little work. So I think that's where the complexity lies. There's the fan-out, and then the other one is the depth of the cognitive load: how much thinking these agents are doing and what they do during that thinking process, which is probably calling other tools.

Diego Oppenheimer [00:33:16]: Again, the main thing is: today we have these super large distributed systems, and we kind of understand the full cardinality of all paths. It's large, but we generally understand it because it's deterministic: we spun up this many things here and this many things there. In true agentic workflows, we have no clue. The model might decide to go build 50,000 ways of doing something and go do the task, if it determines that's the way to do it. And so I think the size of the company and the boringness of the enterprise are just not the determining factors here. It really comes down to cognitive depth and the potential fan-out and coordination required across that system. And those two are probably related.
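One hedged sketch of how you might bound that fan-out today: a shared spawn budget that the whole agent tree draws from, so a model that decides to build 50,000 ways of doing something hits a ceiling instead of the pager. The names are hypothetical:

```python
import threading

class SpawnBudget:
    """A global cap on concurrently live sub-agents."""
    def __init__(self, max_agents: int):
        self._sem = threading.BoundedSemaphore(max_agents)

    def try_spawn(self) -> bool:
        return self._sem.acquire(blocking=False)  # deny instead of queueing

    def release(self) -> None:
        self._sem.release()

budget = SpawnBudget(max_agents=100)

def run_subagent(task: str) -> bool:
    if not budget.try_spawn():
        return False   # backpressure: the parent must re-plan or wait
    try:
        ...            # the sub-agent's actual reasoning would run here
    finally:
        budget.release()
    return True
```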

Demetrios [00:34:07]: That's fascinating to think about. Okay, last one for you: if the best practice is having a backend server for handling the API calls to OpenAI or Anthropic, how important are traditional system design principles, and how do we scale API calls to these research labs?

Diego Oppenheimer [00:34:30]: I mean, anybody who's been building with these systems has suffered the frustration of being rate limited, of being backed off, of downtime. It's just true, right? If you look at the issues on, you know, Claude Code's GitHub, there are like 14 pages of people saying "I'm being rate limited." And again, today we're mostly non-critical systems using this stuff, right? I mean, I would argue that if my Claude Code goes down, it's a pretty critical system to me, because I'm worthless without it. But once we start moving into truly, truly critical systems, where downtime costs money, where there's experience loss and stuff like that, those backend systems matter a lot. Because we have to understand what happens: do we have routing to other models? Do we have alternative paths? All the traditional resiliency that we've built into distributed systems is probably still relevant. Not probably: it is still relevant. And so I think a lot of those principles don't change. And again, I hope nobody leaves today's talk thinking, oh, we need to go reinvent the wheel, because that's actually the opposite of what I want. I'm just saying: hey, the wheel has some holes, and we've got to figure out whether we have to burn the whole thing down and go from the ground up, or whether we can just adapt existing systems.

Diego Oppenheimer [00:36:03]: But there are clear pieces of current systems that just will not work in this future agentic world, and hopefully that's what people take away from today.
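On the resiliency point, a minimal sketch of the traditional playbook applied to model APIs: retry with backoff, then route to an alternative provider. The provider callables are stand-ins; no specific SDK is assumed:

```python
import time

def call_with_fallback(prompt: str, providers: list, max_retries: int = 3) -> str:
    """Try each provider in order; back off on failures (rate limits,
    timeouts, 5xx) before moving to the next alternative path."""
    last_error = None
    for call_model in providers:            # e.g. primary, then fallbacks
        for attempt in range(max_retries):
            try:
                return call_model(prompt)
            except Exception as err:
                last_error = err
                time.sleep(2 ** attempt * 0.5)   # exponential backoff
    raise RuntimeError("all providers exhausted") from last_error
```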

Demetrios [00:36:13]: There's so much good stuff in the chat that I am just seeing. There's one more for you, lightning round: if memory and compute are beginning to be separated in newly designed cognitive infrastructure, does the data in memory become communal to all agents?

Diego Oppenheimer [00:36:33]: Well, that's a great question, because you want them to be able to do handoffs, but memory boundaries are famously really hard to get right from a security perspective. This is actually one of the things that was really, really hard when sharing GPUs. Now we have models so big that sharing a GPU between multiple models is not really a thing; we actually put multiple GPUs against one model. But memory boundaries from a security perspective are a nightmare. They're actually really hard to do, and yet we do want this handoff. I don't have a great answer; I recommend you go read some of the stuff that's happening with CXL, which Meta and Google are working on, which is interesting.

Diego Oppenheimer [00:37:21]: But it's a great question, and I don't have a great answer. The fact is, it's exactly that: we want this communal context. Is it shared in physical memory? Or...

Diego Oppenheimer [00:37:34]: Do we have some atomic unit that gets versioned, that we can all pull on, that we can register, that we can understand? Like, hey, agents actually requested it and got it, so we have that audit trail. It probably leans more toward that second one. But it's a great question, and I don't have a great answer.

Diego Oppenheimer [00:37:53]: It's a big TBD, at least in my mind.

Demetrios [00:37:54]: Yeah, there still needs to be some thought around it, or just some iterations on it. So, Diego, dude, it's always a pleasure when you come here. I really appreciate you doing this. I learned a ton, and I love getting a little bit of insight into how your brain works. You can come late to my keynote any day.

Diego Oppenheimer [00:38:17]: Sorry about that. Feel free to reach out to me on LinkedIn if there are any questions that weren't answered. Thanks, everybody, for the time. I'm sorry I'm late again, and have a great day.

Demetrios [00:38:27]: See you later, dude.
