MLOps Community

The Future of AI Agents is Sandboxed

Posted Dec 19, 2025 | Views 6
# AI Agents
# Sandboxes
# Runloop.AI

Speakers

Jonathan Wall
CEO @ Runloop.ai

Jon was the tech lead of Google File System, a founding engineer on Google Wallet, and then the founder of Index, which was acquired by Stripe. He is building Runloop.ai to bridge the production gap for AI agents with one-stop sandbox infrastructure for building, deploying, and refining agents.

Demetrios Brinkmann
Chief Happiness Engineer @ MLOps Community

At the moment Demetrios is immersing himself in Machine Learning by interviewing experts from around the world in the weekly MLOps.community meetups. Demetrios is constantly learning and engaging in new activities to get uncomfortable and learn from his mistakes. He tries to bring creativity into every aspect of his life, whether that be analyzing the best paths forward, overcoming obstacles, or building lego houses with his daughter.


SUMMARY

Everyone’s arguing about agents. Jonathan Wall says the real fight is about sandboxes, isolation, and why most “agent platforms” are doing it wrong.


TRANSCRIPT

Jonathan Wall [00:00:00]: I don't know if this is the right term for it, but the GitHub-ification of workflows is going to be important. It could be that someone figures out a very generic and extensible way to have a staging environment that outputs work that people approve or don't. That's maybe more open-ended, but I think that's going to be ultimately another important piece of this puzzle.

Demetrios Brinkmann [00:00:29]: Sandboxes are very hot these days. They are the topic du jour. And I would love to talk with you since you have been in the weeds of sandboxes for so long. Just break down, what exactly do we mean when we hear that term being thrown out in the wild?

Jonathan Wall [00:00:51]: Yeah, so I think it can mean a few different things, but in the context of agentic execution, typically what this means is you're going to give an agent its own sandbox environment where it can do what it pleases and you don't have the same kinds of security concerns. Right. It gets to operate in its own isolated environment. A key to this is also setting that environment up so that the agent has any tools or any context it needs to do its job. So typically, when people talk about sandboxes in the agentic context, what they're really talking about is giving the agent effectively its own computer. We've seen that agents tend to perform better the more tools you give them. Giving them their own computer, I mean, that's effectively the most powerful tool there is, really.

Demetrios Brinkmann [00:01:52]: And are you giving them beefy compute as they need it? And how do you look at the different environments that you set up for them? Is it something where you want to make sure they have tools to go and search the web, or is it something a little bit more constrained?

Jonathan Wall [00:02:09]: Yeah. So on the beefiness of the environment: you can create these sandboxes via API on our platform, Runloop.ai. You can specify the size you want; you can specify how much memory, how much compute, and how much disk you want to be available. Depending on your workload, it might be that you want lots and lots of these things running in parallel on relatively lightweight boxes. Or if you have heavier-weight tools, like if you're messing with FFmpeg, or sometimes npm and pnpm can be pretty beefy too, you might need more compute. So it's really kind of dealer's choice there. We have default sandbox images that we launch with common tools, like Python and Node, pre-installed.
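To make the sizing point concrete, here is a minimal sketch of what launching sandboxes with explicit resource limits might look like. The client, method names, and parameters are hypothetical placeholders for illustration, not the actual Runloop API.

```python
# Hypothetical client and parameter names, for illustration only --
# not the actual Runloop SDK surface.
from hypothetical_sandbox_sdk import SandboxClient

client = SandboxClient(api_key="YOUR_API_KEY")

# Lightweight boxes for many parallel, cheap tasks.
small = client.create_sandbox(cpus=1, memory_gb=2, disk_gb=10)

# A beefier box for heavy tools like FFmpeg or large npm/pnpm installs.
large = client.create_sandbox(cpus=8, memory_gb=32, disk_gb=100)
```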

Jonathan Wall [00:03:01]: But most folks end up using one of our products called Blueprints, which lets our users specify the precise context that they want in the container image. A blueprint is basically a Dockerfile. Why is it more than a Dockerfile? There are certain things that are hard to express in a Dockerfile. Docker-in-Docker, which we support, is kind of awkward to express from one Dockerfile, for example. So the blueprint lets our user say, hey, here's the laundry list of stuff that I want in that container when it launches as part of the sandbox environment. Interesting things that we include in the blueprint context: we have an object API where you can upload arbitrary objects and say, hey, I want that on that container image.

Jonathan Wall [00:03:54]: We have an agent API where you can specify agents that you want to include in an image. We also have a notion of code mounts, where you can say, I'd like to include this code repository in this image. These are all part of the toolkit for configuring your sandbox. The Blueprint is a way to statically, ahead of time, build the container image with all of that in there. You can also do it dynamically when you launch the dev box, which is handy when you're experimenting or getting set up. So you can dynamically put all that stuff onto your sandbox environment at launch time. I think developers will typically start by launching dev boxes and dynamically adding the stuff they need.

Jonathan Wall [00:04:36]: And then when they're like, okay, I've got the right recipe together, let me put that into a blueprint. Now it's all ready and it just launches that way. It's just faster to launch like that.
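A sketch of that "experiment dynamically, then bake a blueprint" workflow, again with hypothetical names (the object, code mount, and blueprint calls are assumptions standing in for the APIs described above):

```python
# Hypothetical names -- experiment on a dev box, then freeze the recipe.
from hypothetical_sandbox_sdk import SandboxClient

client = SandboxClient(api_key="YOUR_API_KEY")

# 1) Experiment: launch a dev box and add what you need at runtime.
box = client.create_sandbox(cpus=2, memory_gb=4)
box.upload_object("schema.sql")                       # object API
box.mount_repo("github.com/acme/billing-service")     # code mount
box.exec("pip install -r requirements.txt")

# 2) Freeze the working recipe into a blueprint (roughly a Dockerfile plus
#    objects, code mounts, and agents) so future launches come up pre-baked.
blueprint = client.create_blueprint(
    base_image="python:3.12",
    objects=["schema.sql"],
    code_mounts=["github.com/acme/billing-service"],
    agents=["my-coding-agent"],
)
fast_box = client.create_sandbox(blueprint=blueprint.id)
```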

Demetrios Brinkmann [00:04:47]: So many questions for you, man. Like, first off, you're not specifically creating tools in these different environments, are you?

Jonathan Wall [00:04:59]: It's really up to our users to build their agents as they see fit and choose how to compose them with tools and skills and things like that.

Demetrios Brinkmann [00:05:09]: And that's how the repo comes into play. It's like, here, throw this repo in there. Then the agent can understand what's going on, and it can grab the needed MCP servers and look in the different directories or whatnot.

Jonathan Wall [00:05:24]: You certainly could do things that way. So typically the code mount is really quite literal. It just means, tell me what GitHub repositories you want in this image and I'll put them there. And it might be a code repository that you intend to work on, like, hey, I'm going to write some code in here. Or it might be a code repository that you're using, as you're suggesting, as a source for like tooling and whatnot. This also kind of speaks to the agent API. Like these are very powerful and composable kind of APIs. There's many ways to skin any particular cat, so to speak.

Jonathan Wall [00:06:03]: So the Agent API similarly lets you specify an agent you'd like to put on the dev box. You can specify an agent that you want to be made available via npm, or you can also point to a GitHub repository, or you can actually upload an object that is a tarred up version of your agent. So you have many different ways to kind of compose these things and ultimately it kind of becomes dealer's choice on how you use these APIs.

Demetrios Brinkmann [00:06:33]: Do you find that folks will run various agents inside of one environment?

Jonathan Wall [00:06:40]: Yeah, often. Sometimes people will run one agent. They'll have a workflow: some folks use things like Temporal outside of the dev box, and they'll launch the dev box with whatever target context they're trying to work on. You might use something like Temporal or any other workflow tool that's maybe part of your product layer. And then you'll reach into the dev box environment and invoke maybe a first agent. When that completes, you move on to another stage of your product workflow and invoke a second agent. And it's possible that you could snapshot or suspend and resume. So our dev boxes have useful orchestration primitives.

Jonathan Wall [00:07:20]: So snapshot is, hey, we're going to take a point-in-time copy of this. You can fork it or you can just save it for later. Suspend/resume is a feature that uses snapshots; it's a nice product feature on top that uses snapshots to suspend a dev box and then resume it. People who have human-in-the-loop workflows often use this, because if you're waiting on a person for a while, maybe you should shut down the dev box. When the person comes back and makes a decision, you might want to resume it. So again, these are composable primitives, and it really comes down to the application our user is ultimately trying to build.
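A minimal sketch of the human-in-the-loop pattern described here, with hypothetical method names (snapshot, suspend, and resume are the primitives from the conversation; everything else is an assumption):

```python
# Hypothetical SDK names -- suspend while waiting on a person, resume later.
from hypothetical_sandbox_sdk import SandboxClient

client = SandboxClient(api_key="YOUR_API_KEY")
box = client.create_sandbox(blueprint="agent-workspace")

box.exec("run-agent --task 'draft the migration plan'")
snapshot = box.snapshot()   # point-in-time copy; could be forked or saved
box.suspend()               # stop paying for idle compute while a human reviews

# ... hours later, the reviewer approves or rejects the draft ...
box.resume()
box.exec("run-agent --task 'apply reviewer feedback'")
```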

Demetrios Brinkmann [00:08:04]: And I think there's something I'm still trying to wrap my head around, which is: why have a sandbox for the agent versus just having that agent be some API that you can ping?

Jonathan Wall [00:08:22]: Yeah. So there are a number of questions underneath that one question. But in a way you're asking, hey, why don't I just host agent execution on a normal server? And there are certainly agent patterns where you can do that. But the second you want the agent to have access to its underlying compute, where it wants shell access, bash access, you want it to be able to author code and execute code, for security reasons you really don't want that running right next to your server. If you have a server running and it's hosting some number of these things, first of all, they have very unpredictable load and usage, because they're effectively going to execute tools and do whatever the LLM suggests they do. Right. So they have really unpredictable usage patterns.

Jonathan Wall [00:09:22]: And if you're going to give them bash or shell access, or access to dangerous tools, and that's running next to your server, there's nothing to stop your agent from kill -9'ing the server it's running on or rm -rf'ing the root file system where your server is running. And that would be problematic. So that's really where the sandbox term originates: you're putting it in a sandbox where it's isolated, and the blast radius of things it can impact is small. Typically speaking, the LLMs and the agents are usually not doing silly things; they're usually trying to make forward progress. But they can make mistakes, or they can just do things that are very resource-expensive. Right. Things that chew up all the cores and memory on the system. So for this reason, unless you have an agent with a very specific and narrow use pattern, if you're looking to give it access to things like bash tools, compilers, the ability to execute arbitrary code, you probably want it in its own sandbox environment. That's the security-and-isolation rationale for doing it.

Jonathan Wall [00:10:42]: I think the flip side of that coin is just opening up the world of capability to your agent. As I said earlier, a computer is effectively the most valuable and powerful tool out there, right? Why not give that to your agent? What giving it its own sandbox lets you do is really open up its toolkit and let it do a lot of things that would otherwise be a little nerve-wracking to have running on your server. You can say, hey, clone repos, download stuff from the Internet, parse files, you have full access to the file system, write yourself a to-do list and work your way through it. You can write sample code or intermediate state to the local file system. So by giving your agent its own computer, you really do end up giving it a lot of power and a lot of open-ended capability.

Demetrios Brinkmann [00:11:42]: So many great things there. And yes, thank you for clarifying that, because I had the inkling of, yeah, sandboxes, of course: there's limiting the blast radius. You don't want it doing stuff in the same place where you have maybe the production environment, because it can go AWOL or just...

Jonathan Wall [00:12:05]: Yeah, I think these start causing trouble. The answer is really that there's the pro and the con, and they're actually opposite sides of the same coin. The con is you're like, oh my God, I don't want my stuff getting rooted or destroyed, or maybe the agent's gonna do something bad. But the pro is, hey, this agent's cool and smart, give it the best tools, and yeah, we should.

Demetrios Brinkmann [00:12:26]: We should give it access.

Jonathan Wall [00:12:27]: It ends up with the same answer. So, I don't know, pick your rationale. But I think you end up with the same result either way.

Demetrios Brinkmann [00:12:34]: So if I understand this correctly, then you're saying, for an agent that maybe is able to look into databases, are you copying whole databases over into these environments so that the agent can play and do what it wants? But also, if you have Johnny Drop Table's agent, it's not going to destroy your last, whatever, year of data.

Jonathan Wall [00:13:03]: Yeah, you certainly could. There are a few different ways you could do that. Earlier I mentioned the object API. The object API is more or less a product layer in front of something like S3. So you certainly could take a copy of some set of database tables, put it into the object store, build it into a blueprint or make it available when you launch the dev box, and then your agent has its own private copy. The other way you would do something like that is to use an MCP server that has scoped credentials that say, hey, I have read-only access to an external database. That would be the other way to do that sort of thing.
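A sketch of the two options just described: a private copy of the data inside the sandbox, or scoped, read-only credentials to the real database. Client, method names, and connection strings are hypothetical.

```python
# Hypothetical names -- two ways to give an agent database access safely.
from hypothetical_sandbox_sdk import SandboxClient

client = SandboxClient(api_key="YOUR_API_KEY")
box = client.create_sandbox()

# Option 1: private copy -- ship a table dump in via the object store and restore it.
box.upload_object("orders_snapshot.dump")
box.exec("pg_restore -d scratch orders_snapshot.dump")

# Option 2: scoped credentials -- point an MCP server at the real database
# using a read-only role, so nothing can be dropped or rewritten.
box.set_env("DATABASE_URL", "postgres://readonly_user:...@db.internal/prod")
```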

Demetrios Brinkmann [00:13:51]: I want to just touch on one other piece that you mentioned and it's worth highlighting, which is the agents can be resource intensive. And so when you have a sandbox, you can kind of control that and make sure that if they do go a little bit wild with the resources, it's only up to the amount that you specified beforehand.

Jonathan Wall [00:14:15]: Yep. They're in their own isolated sandbox; the most they can use is the full size of the sandbox. And in our product layer we'll actually start to alert you. We put it into the log traces when you go over, I think, 80 or 90% of memory or CPU use; we drop an alert. So it's like, hey, getting hot in here. Maybe you need a bigger sandbox, or maybe you need to tone down the stuff you're trying to do, one or the other.

Demetrios Brinkmann [00:14:46]: And because this is really geared for production, how do you see it scaling horizontally when you want to start doing that? Is it just replicating the same sandboxes, or is it getting a bigger sandbox?

Jonathan Wall [00:15:03]: Yeah, typically people just stand up a large number of sandboxes. Like some of our customers will stand up like 6,000 at once.

Demetrios Brinkmann [00:15:10]: Wow, nice.

Jonathan Wall [00:15:11]: Yeah, sometimes these are people who are trying to drive RFT or jobs like that, and they'll just stand up a whole pile of these things at once. So there's kind of an interesting observation here. This is, I think, another one of the reasons: if you step back a little and look at the success of things like Claude Code, Codex, and now actually LangChain with Deep Agents, they increasingly have this metaphor of really direct computer use. Right. When you run Claude from your terminal on your laptop, it's using your laptop as if it were its own. Same thing with Codex. So you're starting to see this pattern of agents directly using compute become popularized.

Jonathan Wall [00:16:00]: And I think there's another advantage to that. We mentioned a minute ago maybe using something like MCP to give an agent access to non-local resources. And if you look at the concept of bundling an agent with its code sandbox, you have this nice atomic unit: okay, here are your existing tools, here's the amount of compute and RAM and so on you're allowed. Hey, you're in this box and I know you can use all of it, but this is a nice atomic unit. You can then assign identity to that specific dev box. Right. And that identity can also be part of how authentication to MCP tools is handed out.

Jonathan Wall [00:16:46]: So I think there's just a huge number of very practical reasons to think about deploying agents in this pattern.

Demetrios Brinkmann [00:16:53]: And the identity would be whatever. Johnny Drop Table's agent.

Jonathan Wall [00:16:58]: Yeah, Johnny Drop Table's agent, you know: here's the agent ID, it's running on this sandbox ID, I provisioned this sandbox with this specific MCP access credential, which is read-only database access. So Johnny Drop Table doesn't go too crazy and drop my production database.

Demetrios Brinkmann [00:17:17]: He has been known to do that.

Jonathan Wall [00:17:19]: He has been known to do that.

Demetrios Brinkmann [00:17:21]: You can't give him that access, man. That's the problem.

Jonathan Wall [00:17:24]: Well, yeah, I mean, the database nerds would say you can also create virtual databases that are clones of your production database to solve this. So there are many layers of the stack at which people will solve all these problems. And actually, all the way back to your database question earlier, hey, can you provision a database onto the dev box? I had suggested using object mounts for that. We also have a bunch of customers that use Dockerfiles for this. As I mentioned earlier, we support Docker-in-Docker. People will take Postgres, jam it in a Dockerfile with some amount of storage or some subset of their production database, and use Docker-in-Docker for that. So as it is in computer science, there are always many ways to solve a problem. It's just: what is the most expedient and judicious way for you to do it now, with what you're currently trying to do?

Demetrios Brinkmann [00:18:20]: Well, it's interesting, before we hit record, you mentioned that agents work a lot more like humans than traditional software.

Jonathan Wall [00:18:29]: Yeah, I think this is also a thing that makes the whole give-an-agent-a-computer insight even more valuable, right? I'm sure you've worked on a REST server or coded, you know, maybe a FastAPI or a Next.js server, right? Think about normal servers. You have schema'd inputs: typically REST, or maybe gRPC with proto schemas. You have some amount of, hey, work comes in this door, but it's strongly typed, right? And that maps somewhere deterministically to code, and that code executes. And sure, some requests are more expensive than others, but by and large there is a deterministic set of paths through your system. Ultimately state gets stored, probably in a database. Maybe something goes out the other door to a message queue or some other API. But you effectively have this deterministic mapping of requests and inputs, storage, maybe outbound requests. Now think about an LLM. Natural language is coming in one side. It has some tools it can invoke, but you have no idea what it's going to do; it's an inherently probabilistic process. And if you even look at what Claude Code or a lot of these agent frameworks do now, they start doing things with the local file system. They're like, ah, that's a big task. I'm going to write myself a plan and store it on the file system.

Jonathan Wall [00:20:01]: Just like a human being at their laptop at work might have a white piece of paper or a pad next to them and be like, oh yeah, things I have to do, check that off, so I don't let my teammates down, right? So you start to see that agents, not to anthropomorphize them, have very different working patterns. I'm sure you have seen this using agents too, where you'll ask it to do something and it maybe isn't 100% sure about an API or a library, and it'll say, hey, I'm going to go over to the Internet and download this API or download some README documentation. That's how I work too: okay, I want to go code against this thing, I'm pretty sure I know what I need to use, but let me go look at the documentation. So I think the way that agents use compute is a lot different than a traditional server, or even a lambda function, which is just kind of a point-in-time, one-request slice of a server, right? And we really do believe that what we're working on here is a new compute pattern. You might pause and be like, what do you mean it's a new compute pattern? You're just using a computer. But think about this: you're effectively saying, I have one piece of software that's going to, in an open-ended fashion, use its own computer.

Jonathan Wall [00:21:31]: And if you think about the last time we saw a new emergent type of compute pattern, I would argue it's maybe the advent of Spark and Databricks, right, where Spark is kind of a bunch of servers that work gets farmed out to, but it has this workflow and a DAG. So I think this might be that different and that unique, that this is just a new compute pattern. And probably two or three years from now, the language around all the stuff we're doing and other people are doing will get standardized, and you'll be like, oh yeah, there are these dedicated platforms for running agents, and of course it should be that way. But this is the arc we're trying to walk along, to establish this as being just that.

Demetrios Brinkmann [00:22:32]: I could not help but think about Neo in The Matrix. When you said, oh yeah, I need to know, it's like: I know kung fu. I went out there, I downloaded what I needed, and now I know kung fu.

Jonathan Wall [00:22:44]: Maybe that's skills, right? That is exactly the kung fu skill.

Demetrios Brinkmann [00:22:50]: The other thing that I wanted to hit on was when you have these environments happening and you have these agents kind of getting the run of it, are you observing everything that they're doing to then be able to go back and eval what they chose and how they chose it?

Jonathan Wall [00:23:13]: Yeah. So that's a really important question. It actually touches on an entire other part of our product suite that we haven't talked about yet. So let me give you the straightforward answer and then the less straightforward answer. The straightforward answer is, yeah, we have good built-in observability and debuggability as part of our platform. You can see and audit everything that happened, everything that took place for the agent, and we'll collect trajectories as well if you configure our platform that way.

Jonathan Wall [00:23:45]: So that is a great way for you to say, hey, what actually happened here? Let me debug it. But as you talk about evals, I think that's another very important thing. We have a product we call Benchmarks. What benchmarks effectively let you do is take dev boxes in a known state where you can drop your agent in to try to solve a problem. And you can give us scoring functions to evaluate how well your agent did. We think this is pretty important. If you've gone through a bunch of effort to build an agent to accomplish some domain-specific task, you want to constantly measure your agent's ability to perform that task. And given that this is a stochastic system, it's not like a unit test or an integration test case, right.

Jonathan Wall [00:24:36]: It's not going to be 100%. And so what the Benchmark product lets you do is create a suite of these tests, run your agent against those tests, observe the results, and say, okay, great, am I getting better or worse? Or let's say Claude 4.5 Opus just dropped, and you're like, you know, I'm going to spend the money, I'm going to use the more expensive model. Well, how much better does your agent perform with it? Go right into the exact same benchmark, flip the model you're using, run through the benchmark, and measure the difference. You had called these evaluations, and I think it's worth pointing out the rationale for why we call them benchmarks. A lot of evals are a little more tightly scoped to measuring the quality of independent LLM requests and responses.

Jonathan Wall [00:25:33]: What we're really trying to do is longitudinally measure the end-to-end ability of the agent to accomplish its task, maybe across hundreds of LLM calls and hundreds of tool calls. And oftentimes, to evaluate the performance of the agent, you need to understand the context it was working in. So let's take a code base, for example. Let's say you have a coding agent that, I don't know, is updating dependencies or something like that. You might start off with a repository in a known state, you invoke your agent, and then your scoring function looks at your dependency file after it made the changes and says, okay, are all the dependencies higher than this version or not? It should have updated all these things; I'll score that. But you could also make sure that the repository builds, right, that it didn't hallucinate some version that doesn't exist. So this is an example where, even in that relatively mundane or simple case, the agent is probably going to talk to the LLM a few dozen times and make at least four or five, maybe more, tool calls. And then, to evaluate the correctness of its results, you need to look at the file system.

Jonathan Wall [00:26:50]: Okay, what is the state of my package.json or my requirements lock or uv.lock file, right? And, by the way, does this build? Because it could have made up some new version that is higher than the old version but doesn't exist. So again, to get back to benchmarks, the point is that for practitioners out there: you're going to build an agent, you want it to accomplish some specific goal, and you need to consistently measure that you're getting better, or at least not losing ground. We think benchmarks are the right way to do that. In your initial question you hinted at a notion of: can we take a runtime environment and maybe turn it into a benchmark? We're not there yet. I think that's ultimately the coolest thing: someone has their production traffic running against a bunch of dev boxes and flags to us, hey, that benchmark, or I'm sorry, that dev box, DBX 1, 2, 3, 4, 5, whatever, that didn't end so well.

Jonathan Wall [00:27:57]: Can you capture all the logs and the initial state and turn that into a benchmark, so I can go work against that again and again until I improve that particular pattern?
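To ground the dependency-update example above, here is a sketch of what one of those scoring functions could look like: check that the agent actually changed the pins and that the project still builds. The paths, package name, old pin, and build command are illustrative assumptions.

```python
# Sketch of a benchmark scoring function for a dependency-update task:
# did the agent bump the pins, and does the project still build?
import json
import subprocess

def score(workdir: str) -> float:
    with open(f"{workdir}/package.json") as f:
        deps = json.load(f)["dependencies"]

    # The repo started from a known state, so "changed from the old pin"
    # is a cheap proxy for "was updated".
    bumped = deps.get("left-pad") != "^1.3.0"

    # Guard against hallucinated versions: the build must still succeed.
    build = subprocess.run(["npm", "run", "build"], cwd=workdir)
    builds_ok = build.returncode == 0

    return 1.0 if (bumped and builds_ok) else 0.0
```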

Demetrios Brinkmann [00:28:08]: Yeah, those fail cases are almost, they're as valuable as gold, right? Because if you have an agent that is working well against your benchmarks and then you start to see, oh, it's got a little bit of a blind spot there, that's very valuable. But what you already have set up sounds like it is infinitely valuable just to understand the basics of: is this shippable?

Jonathan Wall [00:28:42]: Is this shippable? Does it do what I think it's supposed to do most of the time, within some percentage points of just chaos? And you can also get further: some of our customers do pretty cool things where, you asked about multiple agents before, they'll have an agent run and then an agent that runs after it, which evaluates whether it did a good enough job and then kicks it back to the previous agent. Right. Kind of the LLM-as-a-judge type of thing. So you'll have the agent that did work unit A, the agent that judged work unit A, and then you either pass on to agent B or kick back to agent A. Once you have these frameworks in place, there are a number of cool methodologies you can start to implement to try to increase your accuracy and make sure you don't produce low-quality results.
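A sketch of that worker-then-judge loop, using the same hypothetical client as before; the agent commands, the "PASS" convention, and the retry bound are all assumptions for illustration.

```python
# Hypothetical sketch of an LLM-as-a-judge loop inside one sandbox.
from hypothetical_sandbox_sdk import SandboxClient

client = SandboxClient(api_key="YOUR_API_KEY")
box = client.create_sandbox(blueprint="agent-workspace")

for attempt in range(3):                         # bound the retries
    box.exec("run-agent --role worker --task 'update the dependencies'")
    verdict = box.exec("run-agent --role judge --task 'grade the worker output'")
    if "PASS" in verdict.stdout:                 # judge approves; move on to the next stage
        break
    # otherwise the worker gets kicked back with the judge's feedback
```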

Demetrios Brinkmann [00:29:40]: Well, I really like the idea of looking at it as a holistic system, as opposed to just: did this one LLM call, one of potentially hundreds, actually work?

Jonathan Wall [00:29:54]: And if you think about things like code, right, the LLM might be telling you to patch one file. Well, it's very hard to judge that: one file patch is immaterial in the absence of all the other patches you just did. That piece, that single LLM call, is not particularly useful on its own. Right.

Jonathan Wall [00:30:17]: Without the tool chain, the rest of the code base, and all the other patches. So ultimately it's just measuring things at a different scale. What is the longitudinal result of my end-to-end execution, as opposed to something finer-grained?

Demetrios Brinkmann [00:30:35]: Do you remember that meme back in the day, the little girl looking at the camera with the burning house behind her, and it said, you know, "Worked on my machine. It's an ops problem now."

Jonathan Wall [00:30:47]: Yeah, yeah.

Demetrios Brinkmann [00:30:49]: It just reminds me a little bit of that, how you're basically helping so that that is not happening, that meme is not a reality.

Jonathan Wall [00:31:00]: Yeah, yeah. And that is one of the all-time greats.

Demetrios Brinkmann [00:31:04]: Yeah. You mentioned that you have folks using 6,000 sandboxes when they're scaling out. Are all of these agents highly isolated? Is there any type of multi-tenancy that you can have? Is it that you want some kind of shared state sometimes? How do you see different customers working through those questions?

Jonathan Wall [00:31:32]: Yeah, that's a good question and actually an area of maybe some active development for us.

Demetrios Brinkmann [00:31:39]: Oh nice.

Jonathan Wall [00:31:40]: Right now, our sandbox supports the notion of opening tunnels. So when you first launch a sandbox, by default there's no network access to it. But you can say, hey, I'd like to create a tunnel, I'd like to be able to route network access to this. And some people use this to implement a product UX. Right. They might have a front-end server running, pick a place. Right.

Jonathan Wall [00:32:01]: Vercel, Railway, whatever. And then they launch an agent, and when they want to communicate with the agent, they might open a tunnel. We support WebSockets, HTTP, whatever. And what you can do if you want the agents to talk to each other is open tunnels, and they can talk to each other over those tunnels. One area that we're starting to look at now, and the folks at LangChain and the Deep Agents team have been encouraging us to look into this, is what it looks like for these agents to potentially have a shared file system. So if each sandbox is totally isolated, what does it look like if you mounted some sort of shared file system, so they could communicate through the file system? That would be potentially interesting.

Jonathan Wall [00:32:50]: It's not something that is imminent from a product perspective, but there are probably a couple of things that we're going to prototype soon around a notion of volume mounts. Like, okay, how do I have a part of the file system that I can save? Because when the dev box is gone, it's gone. But how do I have an output directory that gets saved someplace, where I can retrieve the agent's intermediate work or its notes or whatever later? And then, once you have some sort of file system abstraction that you're saving outside of the dev box, what if it was read-write and shareable? So these are things we're toying with. We need to make sure we have the right use cases and the right technology before we actually turn it into product, though.
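Since this is explicitly a prototype-stage idea, the following is only a speculative sketch of the "output directory that outlives the dev box" concept, with invented method names and an invented destination bucket:

```python
# Speculative sketch -- persist an agent's output directory before teardown.
from hypothetical_sandbox_sdk import SandboxClient

client = SandboxClient(api_key="YOUR_API_KEY")
box = client.create_sandbox(blueprint="agent-workspace")

box.exec("run-agent --task 'investigate the flaky test' --out /work/output")

# Export the agent's notes and artifacts outside the sandbox, then shut down.
client.export_directory(sandbox_id=box.id, path="/work/output",
                        destination="s3://agent-artifacts/run-123/")
box.shutdown()
```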

Demetrios Brinkmann [00:33:38]: And so right now has the philosophy been very much along the lines of cattle, not pets type ideas?

Jonathan Wall [00:33:46]: Cattle, not pets, meaning like the.

Demetrios Brinkmann [00:33:51]: I don't know if you remember that. I heard it from DevOps folks, where they would say you want to treat your Kubernetes clusters like cattle, where you're okay killing them at any point in time.

Jonathan Wall [00:34:05]: Yeah.

Demetrios Brinkmann [00:34:05]: Or them dying. And as a vegetarian, I'm just gonna say I do not like this, but it was what I heard and lightning strikes.

Jonathan Wall [00:34:17]: Yeah, I got you. I got you. I think, A...

Jonathan Wall [00:34:24]: There's that, which is, I think, effectively what you're talking about: recoverability. Like, if something got halfway done and, I don't know, we run on top of AWS and other cloud providers, and they're good, but they're not perfect. Machines do go away. Right? So that would be the recoverability argument: how do I recover if an agent went away? I think maybe the more interesting thing, though, is whether that allows interesting inter-agent coordination opportunities. You could imagine a pipeline where you had agent A on one dev box, and agent B, and then agent C, all on separate dev boxes. And you start off the pipeline by asking agent A to do something, and it writes something to a file system. Then you tell agent B to read that and continue working on it.

Jonathan Wall [00:35:14]: And so on; it could pass it on to agent C. So you might end up having different orchestration patterns that you make available too. But this is a little forward-looking, you know. I'll be sure, I'll be sure to give, give...

Demetrios Brinkmann [00:35:32]: You an email update.

Jonathan Wall [00:35:33]: Yeah, an update when we, when we start to formalize some of these things.

Demetrios Brinkmann [00:35:37]: Well, it also feels very much like the idea of giving the agents memory, in a way, and how you can let these ideas, whatever it is that they're working on or whatever they've learned, persist after the sandbox isn't around anymore.

Jonathan Wall [00:35:57]: Yeah, precisely. And what if, you know, you can let agents self-author skills or tools for themselves, save those, and then pass them to the next one? I think that's where this starts to become quite interesting.

Demetrios Brinkmann [00:36:14]: Well, there is another thing that I wanted to get into, around frameworks versus harnesses, and I thought you had some great points on that. Can you break down how you look at these two things?

Jonathan Wall [00:36:29]: I think it's Harrison Chase from LangChain who has been pushing this distinction of framework versus harness, and I actually think it's a pretty important one. I agree with him on this, and I think we're about to see a lot of people take advantage of these harnesses, so to speak. They just make it much easier to take a really capable agent, like Claude Code or a deep agent, and build on top of it with your own customizations. The harness approach is more batteries-included; it just makes it easier to accomplish a goal. Power users will still build their own stuff from the ground up, I'm sure, like startups that are exclusively tackling one specific thing. But for people out there who are looking to build simple workflows or just automate simple things, I think this will be very powerful.

Demetrios Brinkmann [00:37:24]: And I'm not sure if I fully understood what a harness is exactly.

Jonathan Wall [00:37:30]: When you're using something like the Claude Agent SDK or LangChain Deep Agents, you're assuming that what is already there is already good at managing context, doing things like planning, and doing things like tool calls. Maybe you provide additional tools, but you're not really writing the code to implement tool calling, and you're not really writing the code to manage context. I'm sure if you use Claude Code, you see where it compacts its conversations, stuff like that. Right? Doing that really well is hard. I'm sure startups out there that are purely agentic startups are going to have their own stuff top to bottom, because they're specialists in what they do. But people who are looking to build an agent that takes two PDFs and turns them into one database entry might be able to build on top of this and say, great, I have this PDF-reader skill, I have this export-to-database-with-my-schema skill, and I'll just give you these inputs.

Jonathan Wall [00:38:34]: You know how to call tools and skills. If they're using the Claude Agent SDK, it's: I'll just build on top of you, and you're a very capable harness for managing context, invoking tools, making a plan, sticking to it, and updating it. I think the distinction is really about how many batteries are included and how much the thing just does for you out of the box, without you needing to do that much.

Demetrios Brinkmann [00:39:03]: I hadn't even thought about how, with LangGraph, it was very deterministic. You're always playing with a determinism-versus-non-determinism model, or way that you want to build your agent. And with LangGraph you could get very detailed: I want you to do this, and then I want you to do that, and then if you need to, you have a loop and you keep doing that, and then once you get this information you go and do that. Right. But a harness throws all of that out the window, if I'm understanding it correctly, and it's saying it comes with the ability to know what it needs to do next.

Demetrios Brinkmann [00:39:48]: Good enough.

Jonathan Wall [00:39:50]: Yeah. I think if you look at how the Claude Agent SDK implements skills, you can do it as natural language in a skills Markdown file. You can specify: do A, then do B, then do C, and the harness is good enough at ingesting that and doing it. For Deep Agents, their notion of a skill is, I believe, called a feature, but it's a similar concept. So instead of you coding that, you're saying, here are workflow primitives, sequences of steps I want you to take, and I'm going to put them in feature files or skill files. And then you're depending on the harness to be adept at following those. Again, I think for easier use cases you're taking advantage of the fact that these agent harnesses are just getting smarter and smarter, and hopefully you have to do a little less to get an equivalent result. I think that's maybe a better crystallization of what it was

Jonathan Wall [00:40:56]: I was trying to say. Yeah, and I think directionally this is going to keep happening too. These underlying cores are just going to get more and more powerful, and the same goes for Codex, not to leave them out of the conversation. They have the advantage of huge usage; they can tune models specifically for their own agents; they can invent their own protocols and conventions for things like skills and MCP. And I think, at least for people doing relatively simple agents, it's going to be much easier to just use these harnesses.

Demetrios Brinkmann [00:41:41]: That does make a lot more sense, because I remember seeing a lot of friends' graphs in LangGraph that had a lot of different nodes and a lot of different steps. And there was a lot of logic behind: you do this, and then you do that, and then if you have this or if you have that, think about it, or here's some... It was almost like a glorified workflow. And I do know that workflows, in a way, are still quite valuable. I was just talking to a lot of folks when I was in Amsterdam last week about how they'll bundle up a workflow: if they see that these three tools are generally called together, then they'll just bundle that up into a workflow, because it's not like we want any ambiguity about whether, after calling this tool and then this tool, you should call that tool. It's just: no, these are the three tools, they're always called together.

Demetrios Brinkmann [00:42:44]: And we're trying to abstract up one level on the tool calling.

Jonathan Wall [00:42:50]: And.

Demetrios Brinkmann [00:42:53]: It feels like this idea of a harness is very much that. Like, when we had Sid on here from Claude Code, he mentioned that one of the biggest things folks who are building agents these days are grappling with is how much harness is too much harness.

Jonathan Wall [00:43:11]: I think that's one of the things that Claude Code really excels at as well: tool use. It's a little bit of a tangent, but yeah, it's very judicious about when and how it uses tools.

Demetrios Brinkmann [00:43:23]: Speaking of the tools and the agents being able to execute tools, going back to that conversation with Sid, he mentioned that a lot of their big improvements and performance gains with Claude Code came when they were deleting stuff from what Claude Code had to do. So instead of saying it needs to use this bash tool, it would just say, here's how bash works, you go figure it out and create your own bash command, and things like that. Which ties into the environment. And it also ties into another thing that I've heard is very difficult with agents, which is the verification, the making sure that if you wrote something, as you mentioned earlier, does it actually build?

Jonathan Wall [00:44:29]: Yeah.

Demetrios Brinkmann [00:44:29]: Can you compile it and will it run afterwards? Or is this just some random thing that looks better but doesn't work?

Jonathan Wall [00:44:39]: Yeah, I think the bash point you made goes back to the give-agents-their-own-computers point. Bash, I don't know if you've ever had a friend who is, like one of my coworkers is just unreal with bash; he'll chain 30 commands together. I mean, I'm a capable bash user, but he's next level. With something like bash you can do almost anything. We're talking about 50 years of UNIX tooling that has gone into this kind of overlooked program, between awk and sed and ripgrep and git grep and jq and pipes; you can do almost anything. So to some extent this goes back to the give-an-agent-a-computer point. What you were just observing was: hey, you know how bash works, you know the common tools in bash, use them. Yeah.

Jonathan Wall [00:45:40]: I don't know how often you use any of these CLI tools for coding, but you'll often see, if you ever move files or do stuff like that in a larger code base, it'll say, oh my God, there are too many references to the file you just moved for me to hand-edit; I'm going to write my own bash script. And it'll do git grep piped to xargs piped to sed to replace the imports, and you're like, kudos, no one had to write a tool for that. You were able to use a composable, extensible tool, which is your own computer, in a way that is empowering to you as an agent. So I think that is a pretty cool argument for less is more, and for more powerful basic tooling components being a real lift.
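For readers who don't live in the shell, here is a rough Python equivalent of the kind of throwaway script an agent might write itself after moving a file. The directory layout and the old/new import strings are made up for illustration.

```python
# Rewrite every occurrence of the old import path after a file move --
# the same job as the git grep | xargs | sed one-liner described above.
from pathlib import Path

OLD, NEW = "from utils.helpers import", "from common.helpers import"

for path in Path("src").rglob("*.py"):
    text = path.read_text()
    if OLD in text:
        path.write_text(text.replace(OLD, NEW))
        print(f"rewrote imports in {path}")
```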

Demetrios Brinkmann [00:46:36]: What else is on your mind these days?

Jonathan Wall [00:46:40]: When enterprise is really going to get going a little bit more with agents, I think, is quite interesting. I think everyone has used some form of coding agent by hand, or, if you're curious, you've authored an agent and seen how powerful they are. But it will be very interesting to see when enterprise starts to catch up and you start to see more actual agents live in production for enterprises. I think that's going to be an important part of the broader industry life cycle. Enterprise adoption, I think, is important.

Demetrios Brinkmann [00:47:20]: Are there some things that you feel are big blockers to that right now?

Jonathan Wall [00:47:28]: Yeah. Okay, good leading question. Well, I'll finish my other two curiosities, which I think are related to what you just asked. I'll also be curious to see the future of MCP and of things like skills. I think these are all important components. To your question, what are the blockers? I would say that until recently, before some of these harnesses, it was hard to build agents.

Jonathan Wall [00:47:57]: It wasn't impossible. But I do think the batteries-included aspects of these harnesses, like the Claude Agent SDK, LangChain Deep Agents, and the Codex SDK, are going to democratize it, or make it easier for people to build agents successfully. So I think that was one piece. I think people are still deciding on it in their heads. But I think having a very simple, isolated sandbox runtime environment, like what we're building, is an important piece of the puzzle. Companies like mine, and there are other people trying to do this, there are a lot of cool companies innovating in the space, but the existence of a very easy, secure deployment fabric for deploying, observing, and auditing agents I think is important. I think another thing that MCP, and to a lesser extent skills, addresses is how you actually get context into those agents.

Jonathan Wall [00:48:53]: I think there are a lot of people working on making the auth story around MCP a little more graceful. That's going to be an important piece of the puzzle. Basically, giving agents short-lived, read-only, credentialed access to operate in an enterprise environment is going to be important. And I have one more that I'll add. I think one of the reasons agents have been so successful, there are many reasons, but why they've been so successful in coding so far, has been the pre-existence of a workflow that lends itself to agents being involved. What I really mean is things like Git and pull requests. You and I are engineers.

Jonathan Wall [00:49:38]: Let's say we were working on something together and you're like, hey, John, can you add something? Built into Git is, effectively, an immutable audit log of every change that's taken place in the repository. And not only does it have this audit log, it has this wonderful feature of just a convention whereby, when I want to make a change, I'm going to send you a review, and you're going to look at it and say that's good, or that's bad, or, John, that's junk, change this. And then once you've approved it, now both of our names are on it and it gets added. There aren't that many other systems with those properties. Right.

Jonathan Wall [00:50:16]: Like lawyers, I guess you have redlines in Microsoft Word or something like that. But you don't have a propose, review, journaled approach for most other professional workflows that I'm aware of. And I think that's actually, let's just say MCP and all the auth stuff is magically solved, and companies like mine and others trying to build this sort of fabric are perfect and deployed everywhere, and these harnesses are so great.

Jonathan Wall [00:50:50]: You can just drop a bunch of skills and a few MCP tools in and everything just works. I think the piece I just described to you is still going to be an important workflow component that is kind of missing. Like, what is the pull request of an insurance claim auditor?

Demetrios Brinkmann [00:51:07]: Yeah.

Jonathan Wall [00:51:08]: Or a marketing app, or any other industry, really. Right. I think it's really a notion of staging, reviewing, accepting work and committing it, and having the ability to revert, which is maybe messier or harder in those other kinds of spaces. You do a git revert, pretty straightforward, right? Maybe it's a little harder to reverse a table update in Workday or Salesforce or something like that. Yeah, but this is just something I'm noodling on.

Jonathan Wall [00:51:41]: I don't really view it as Runloop's problem, but I view it as a broader industry problem: what is the GitHub of all these other systems that agents need to interface with? Because the GitHub thing makes it quite natural for you to have an agent where you're like, ah, go do a pull request to update these dependencies or something like that, and then you review it and you're like, yeah, it passed CI, it looks sane to me, I approve, done. And if something's wrong, we can revert.

Demetrios Brinkmann [00:52:13]: But also it has that ability of does it run or does it not?

Jonathan Wall [00:52:18]: Yeah, the CI step is important too. And it's kind of a funny thing, where you're just sitting in this rather highly evolved tooling ecosystem you take for granted, and it's just not top of mind that other industries don't necessarily have this. Or maybe I'm wrong, maybe some of these other industries do have these things and I don't know about it. But I think this will...

Demetrios Brinkmann [00:52:43]: It's not ubiquitous. Yeah, I think certainly not as common.

Jonathan Wall [00:52:47]: Yeah, I don't think there's a GitHub for lawyers.

Demetrios Brinkmann [00:52:50]: No. And I wonder, too, if that is one of the reasons the training has been so valuable on all of this data. I'm not sure about the ins and outs of how they trained the models specifically, like the Claude models, but did they train on the actual issues and the PRs as much as they trained on the code?

Jonathan Wall [00:53:22]: Yeah. Well, you're surfacing an important observation, which is that your pull request log includes the motivation and the rationale, as well as what presumably, ultimately, a human did to solve that problem. So I think that's very powerful. I think there are other aspects of code that made it one of the first things for AI to be great at. It's more concise; it's not an open-ended language like human language. There's strict syntax, a strict set of keywords.

Jonathan Wall [00:54:00]: But maybe as importantly, you have deterministic verification functions that you don't have for natural language. Right. I can say, you know, Claude or Deep Agents or Codex, whatever, write me hello world. And it can deterministically verify that it correctly did that job. It can compile; that's one deterministic tool it's able to run. It can also run things like linters and formatters. But then ultimately it can also just run the code and check: did it say hello world? Now think about a lawyer.

Jonathan Wall [00:54:36]: A lawyer proposing some sort of legal doc has no tool that says, yo, compile that; did it say John deserves to do whatever I'm writing in the note? There are no deterministic verification functions. And what those deterministic verification functions let AI do is self-play. They can go try things, discover things that don't work, and then ultimately discover things that do work. This can go back into their pretraining; they learn. It's almost like the AlphaGo thing, right? The OG AI moment from Google, where they basically let the AI play against itself: it just had to observe the rules of Go, there was a deterministic scoring function for who won, and they just let it run and run and run. And it learned novel things. I guess there are many aspects, but I think the deterministic scoring function is a big one.

Jonathan Wall [00:55:32]: The strict syntax, the availability of fast feedback, you understand.
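To make the deterministic-verification idea concrete, here is a small sketch of the kind of check described above: run the program and test its output. The script name and expected string are illustrative assumptions.

```python
# Sketch of a deterministic verification function for the "hello world" example:
# did the program run cleanly, and did it print what we asked for?
import subprocess

def verify_hello_world(script: str = "hello.py") -> bool:
    result = subprocess.run(["python", script], capture_output=True, text=True)
    compiled_and_ran = result.returncode == 0
    said_hello = "hello world" in result.stdout.lower()
    return compiled_and_ran and said_hello

print(verify_hello_world())
```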

Demetrios Brinkmann [00:55:38]: Like when you're talking about this lawyer use case, you probably don't get as fast of a feedback loop, especially if you now have to let it go through court. That's going to take months or years.

Jonathan Wall [00:55:53]: And think about it too, right? You make some trivial programming error, and a linter or a build tool is going to tell you what line it was on.

Demetrios Brinkmann [00:56:01]: Yeah.

Jonathan Wall [00:56:01]: There's no make for law. And even if there was, and it was correct in California, it'd be wrong in some other state.

Demetrios Brinkmann [00:56:13]: Yeah. How would that even look? That is so true.

Jonathan Wall [00:56:17]: Yeah.

Demetrios Brinkmann [00:56:17]: And that's not just for law, but, as you were mentioning before, insurance, any of these areas we play in that aren't so clear-cut, like: does this compile or not?

Jonathan Wall [00:56:29]: So I think the next areas where you'll start to see rapid growth will be things where you have some sort of verification function. It might not be as clean and as perfect and complete as the tooling we have in software engineering. And I do think that, I don't know if this is the right term for it, the GitHub-ification of workflows is going to be important. It could be that someone figures out a very generic and extensible way to have a staging environment that outputs work that people approve or don't. That's maybe more open-ended, but I think that's going to be ultimately another important piece of this puzzle.

Demetrios Brinkmann [00:57:13]: That is so cool to think about, the GitHub-ification of other areas of work, and how valuable that will be for the future agents of the world, I think.

Jonathan Wall [00:57:25]: I think it's important, and I think it's missing. Like: the agent had these inputs, produced that output, human being X reviewed it, it was pushed into Salesforce or Workday.

Demetrios Brinkmann [00:57:36]: Or whatever system. Or rejected and told, yeah, to go back to the drawing

Jonathan Wall [00:57:42]: Board and fix changes, and then turned into a benchmark. See, you get it. Exactly. Yeah.
