MLOps Community

Durable Execution and Modern Distributed Systems

Posted Mar 17, 2026
# AI Agents
# AI Engineer
# AI agents in production
# AI agent use case
# System Design

Speakers

Johann Schleier-Smith
Technical Lead for AI @ Temporal Technologies

Johann Schleier-Smith is Technical Lead for AI at Temporal Technologies, the leading provider of durable execution. He previously founded Crystal DBA, which developed agents to manage cloud infrastructure and was acquired by Temporal. He also co-founded if(we), which built a collection of social networks with over 300 million members and was acquired by The Meet Group (NASDAQ:MEET). Johann serves on the board of Sama, a leading provider of training data for computer vision applications. He holds a Ph.D. in Computer Science from UC Berkeley and an A.B. in Physics and Mathematics from Harvard University.

Demetrios Brinkmann
Chief Happiness Engineer @ MLOps Community

At the moment Demetrios is immersing himself in Machine Learning by interviewing experts from around the world in the weekly MLOps.community meetups. Demetrios is constantly learning and engaging in new activities to get uncomfortable and learn from his mistakes. He tries to bring creativity into every aspect of his life, whether that be analyzing the best paths forward, overcoming obstacles, or building lego houses with his daughter.


SUMMARY

A new paradigm is emerging for building applications that process large volumes of data, run for long periods of time, and interact with their environment. It’s called Durable Execution and is replacing traditional data pipelines with a more flexible approach. Durable Execution makes regular code reliable and scalable.

In the past, reliability and scalability have come from restricted programming models, like SQL or MapReduce, but with Durable Execution this is no longer the case. We can now see data pipelines that include document processing workflows, deep research with LLMs, and other complex and LLM-driven agentic patterns expressed at scale with regular Python programs.

In this session, we describe Durable Execution and explain how it fits in with agents and LLMs to enable a new class of machine learning applications.


TRANSCRIPT

Johann Schleier-Smith [00:00:00]: It's all about developer productivity, but different aspects of developer productivity, right? Here are some common patterns for building agent systems. Okay, boom, agent SDK. I want it interacting with the world. I want it long-running, reliable, async, temporal. What you can do now is you can just go, you can fix that error, save the file, restart the server, and that'll just finish.

Demetrios Brinkmann [00:00:42]: Should we chat durable execution?

Johann Schleier-Smith [00:00:44]: Let's do it. All right, let's do it. Let's talk about durable execution.

Demetrios Brinkmann [00:00:47]: Can you give me the breakdown, the TL;DR of what it is?

Johann Schleier-Smith [00:00:50]: So durable execution basically just means that your software does what it's supposed to do. At Temporal, we talk about making software crash-proof, and that means that whatever it is, the program starts at the beginning and it's going to run to the end regardless of what happens with failures, which are particularly common in the cloud, right? You have servers that are flaky, services that are overloaded, APIs that are down, rate-limited, you name it. There's a mechanism within Temporal for recovering from that. And what's important about it is that the programmer doesn't need to do extra work to do that. Of course you can write retry logic. Of course you can think about distributed systems. Like you can do it. People have been doing it.

Johann Schleier-Smith [00:01:41]: People have been building reliable software, but how do you make it easy? And that's really what the breakthrough is with modern durable execution in systems like Temporal. Plan for the best, but expect the worst. For folks that are operating at scale and in the cloud, maybe not the complete worst, but often, you know, most of the worst things that you can imagine are actually just going to happen. That's just the reality of scale. That's why SREs exist. What you're doing with durable execution, and I think I probably should take a moment to just sort of explain some of the mechanics of how it works, but what you're doing is you're basically taking that concern of making things reliable and you're putting it in a piece of software, right? Which is, you know, say the Temporal software. And you're saying, okay, we have reliability over here, we have business logic over here. And those can be treated as separate concerns.

Johann Schleier-Smith [00:02:45]: And Temporal is open source, so it's MIT licensed. You can run it on top of Cassandra or Postgres for your persistence layer. You can also use Temporal Cloud if you want, but it's, it's all of the same functionality. You can migrate back and forth. Let's talk about how it works because this is one of the things that it definitely took me a minute to wrap my head around it. I think it takes most people a minute to wrap their head around it. And then once you kind of get it, it just, it makes a lot of sense. So from the programming model perspective, what you're doing is you're working with an abstraction that is going to basically cover up a lot of the things that are complicated about distributed systems.

Johann Schleier-Smith [00:03:34]: So you're gonna be working not with servers as your building blocks, but with programming-language-level constructs. And in particular, these are, in our terminology, workflows and activities. And each of those are basically just functions. So regular code in whatever language you want. And there is a distinction between the types of code that you put in those. So activities are doing I/O, and basically they're arbitrary code. What makes durable execution work is that the workflow code is a restricted programming model. And in particular, the restriction that we have on that is to say, if something goes wrong and we rerun that code, provided we give it the same inputs as it had the first time it ran, it's gonna do the same thing again.

Johann Schleier-Smith [00:04:33]: And so this is called deterministic execution. It's what you get in, you know, normal code. If you have, you know, for loops, as long as you're not calling, say, a random number generator or an LLM, which is sort of the obvious case for an activity, then it's gonna do the same thing, right? And so for example, if you are coding an agent with Temporal, or if you're using one of our integrations, in general that agentic loop is workflow code, right? Because what is it doing? It's iterating over calling the LLM, calling tools, calling the LLM, calling tools until it's done. Okay. That's a deterministic program that does that. And then whenever those activities, which are say tool calls, maybe calling an API or calling an LLM, whenever those are made, whenever that result comes back, what happens in the Temporal SDK is that it takes that result and it saves it out to the Temporal server.
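The replay mechanics described here can be sketched in plain Python. This is a toy illustration of the idea, not the Temporal SDK: activity results get recorded in an event history, and on re-run the deterministic workflow code consumes the recorded results instead of re-executing completed activities.

```python
# Toy sketch of deterministic replay (not the Temporal SDK).
# Activity results are recorded in an event history; on a re-run,
# the deterministic workflow loop reuses recorded results instead
# of re-executing activities that already completed.

def run_workflow(history, activity_fn, steps):
    """Run a deterministic loop, replaying recorded activity results."""
    executed = 0   # activities actually run during this execution
    results = []
    for i in range(steps):
        if i < len(history):       # replay: reuse the recorded result
            result = history[i]
        else:                      # first execution: run and record it
            result = activity_fn(i)
            history.append(result)
            executed += 1
        results.append(result)
    return results, executed

history = []  # in the real system this lives durably on the server
first, ran_first = run_workflow(history, lambda i: f"tool-result-{i}", 3)
# Simulate a crash and restart: same code, same recorded history.
second, ran_second = run_workflow(history, lambda i: f"tool-result-{i}", 3)
```

Because the loop is deterministic, the second run reaches the same final state without re-running any activity, which is exactly why the workflow code has to be deterministic and the I/O has to live in activities.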

Johann Schleier-Smith [00:05:39]: So that's where it's captured durably in the Temporal server, which is gonna be backed by a database, or by Temporal Cloud. In the case of Cloud, it can even be replicated to multiple cloud regions, right? So, you know, we actually see this when, say, you have a major cloud provider outage. You have a region go down, and you can actually take a running program, like mid-execution, and move that from one region to another region, which is just a whole other level of reliability. Yeah. Because when you think about it, we think, oh, maybe I could replicate my database, right? But will a database move a running query? Is there any database? I haven't done exhaustive research, but is there one that will move a running query from one region to another when it fails? And what about— Exactly. Yeah. Or when that region fails, or a business process.

Johann Schleier-Smith [00:06:26]: So that's the type of thing that this model lets you do and it lets you do it with your regular programming, like regular code. And I think that's, I mean, that's basically the thing that's so brilliant about it where I give the founders and the, you know, the other folks who are innovating on this, like just a ton of credit for figuring this out.

Demetrios Brinkmann [00:06:45]: If we're not using durable execution, what does that look like?

Johann Schleier-Smith [00:06:49]: Yeah, there's a bunch of different things that people do. So I'll say there's some folks who just build systems that are just not that reliable, right? And I built those in my day too. And that means that you are spending a bunch of time chasing down what went wrong. You're gonna write some scripts against the database, right? And you're gonna say, hey, you know, maybe I had a transaction, maybe I'm doing something simple. Let's just take the money-moving example, debit and credit, right? Can I write a job that goes through at the end of the day and figures out where things broke, and then writes a script to fix all those? I can do it, right? It's kind of doable. Do I want to do that? Is that robust? Is that reliable? Does that scale to a team and complex functionality? What if there's more business logic? What if it's not just debit and credit? What if there are, you know, some checks, know-your-customer, any of these kinds of things, right? It's going to get really messy really, really fast.

Johann Schleier-Smith [00:07:46]: So then kind of another level on top of that is people say, okay, well, now I'm going to start to use event-driven architecture. I'm going to start having queues and I'm going to put my process together this way. And so again, can you do it? Like, yes, you can do it. You can cover all the edge cases. You can have the dead letter queues and you can have the compensation logic and you can do all of that, right? But with a durable execution abstraction, that code just becomes really simple. It gets cleaned up. You make sure that that activity code, on a piece-by-piece basis, is idempotent in terms of the operations on the database.

Johann Schleier-Smith [00:08:24]: That basically means you have some key. If that key, you know, the customer key, the transaction ID, is unique, then anytime it retries, the duplicate will just, you know, fail out. It'll make sure it gets done exactly once. And that really cleans it up. And then, you know, I'd say another class that we're seeing now, particularly with the agentic systems, is checkpoint-based solutions. Okay, so what is a checkpoint? A checkpoint basically means take my entire program state and save it down. You also see this often in machine learning training, for example. And, you know, checkpoints are okay, right? But they are coarse-grained, generally speaking.
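The idempotency-key pattern in the debit/credit example can be sketched like this. It's a hypothetical illustration: the dictionaries stand in for database tables, and the transaction ID plays the role of a unique-key constraint so that a retried activity applies the debit at most once.

```python
# Hypothetical sketch of an idempotent activity: a unique transaction
# ID acts as the idempotency key, so a retry after a crash is a no-op.

ledger = {}                 # stands in for a table with a unique key
balances = {"alice": 100}   # stands in for the accounts table

def debit(txn_id: str, account: str, amount: int) -> bool:
    """Apply a debit exactly once per transaction ID."""
    if txn_id in ledger:    # duplicate key: the retry "fails out"
        return False
    ledger[txn_id] = (account, amount)   # the "unique insert"
    balances[account] -= amount
    return True

debit("txn-42", "alice", 30)
debit("txn-42", "alice", 30)   # retried after a crash: changes nothing
```

In a real database you'd get the same effect from a unique constraint plus a transaction, rather than an in-memory check.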

Johann Schleier-Smith [00:09:09]: So if you are in the middle of, say, multiple tool calls or maybe you have a human-in-the-loop interaction, it's not clear that that point in the code is going to necessarily be a good point to write a checkpoint. And on top of that, you're going to have to write checkpoint logic. And there are some frameworks that do an okay job of sort of hiding some of the work of writing the checkpoint logic. But in the general case, you're still writing a function to take all my state, save it out, and bring it back. And then you're getting more coarse-grained recovery.

Demetrios Brinkmann [00:09:47]: So I thought this was a form of checkpointing. How is it different?

Johann Schleier-Smith [00:09:51]: Yeah, so I think you can define the terminology potentially in different ways. If I were to set aside code and just talk about maybe something like file systems, or backups, right? I could say, okay, well, I'm gonna snapshot this file system and then maybe I'll do some incremental backups or incremental snapshot updates. Durable execution is basically doing it incrementally, in the sense that you have some starting state, which in general is nothing, right? The program starts, maybe has some initial arguments, and then as it runs and as it interacts with other systems, with tools, with LLMs, it's getting sort of the incremental deltas to each state, and that's being saved. And then because we have certain guarantees about how that program runs, we can apply those deltas back to that state. For programs that do run for an extended period of time, say you have an agent that's going to run forever, you may also, with Temporal, find points where you say, okay, you know what, this is where I'm going to save a full snapshot of my state. The history has just gotten too big, and it's a good point for me to do it. Right. And in some cases, say it's a chat, that's the top of that loop, where you say, okay, now we're pausing the agent.

Johann Schleier-Smith [00:11:22]: Let's see what the user wants to do next. That can be a place where we can say, okay, and in Temporal language it's called continue-as-new, and now we basically do a full checkpoint. But the durable execution model is a different abstraction. It gives you what a checkpoint would give you, but it's a very different way of doing it, and that's why we call it something different.
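The continue-as-new idea can be sketched as a toy example: when the accumulated event history gets too long, the workflow restarts fresh, carrying only a compact summary of its state forward as the new run's input. The names and the summarization rule here are illustrative, not Temporal's API.

```python
# Toy sketch of continue-as-new: once the history grows past a limit,
# compact it into a summary and start a fresh run with that summary
# and the remaining work as the new inputs.

MAX_HISTORY = 5

def chat_turns(turns, summary=None, history=None):
    """Process chat turns; 'continue as new' when history grows too big."""
    history = history or []
    for i, turn in enumerate(turns):
        history.append(turn)
        if len(history) >= MAX_HISTORY:
            # Full checkpoint: compact history into a summary, then
            # "restart" with the remaining turns and an empty history.
            new_summary = f"{len(history)} turns summarized"
            return chat_turns(turns[i + 1:], new_summary, [])
    return summary, history

summary, tail = chat_turns([f"turn-{i}" for i in range(7)])
```

In real Temporal the restart is a server-side operation that truncates the workflow's event history, not a recursive call; the recursion here just makes the handoff of compacted state visible.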

Demetrios Brinkmann [00:11:50]: Is it right to assume that basically you're guaranteeing the end goal in a way? Like you're saying, okay, a checkpoint is more for— it's almost reactive. Like I'm checkpointing along the way so that if shit hits the fan, I can come back to that checkpoint. But you're saying like, we're going to guarantee you we get to the end. You don't necessarily need to know if we—

Johann Schleier-Smith [00:12:17]: if shit hit the fan along the way. Absolutely. You don't need to know about it. And you actually also don't need to think about writing that save code in most situations. I'll say, like I said, you can do continue-as-new, you can do checkpoints for things that are going to run for a long time and generate state for a long period of time. Temporal jobs can run forever, right? I said they make sure they run to completion, whatever that is. That actually could be infinity.

Johann Schleier-Smith [00:12:48]: But yeah, say you're doing, you know, model training or something like that you're probably familiar with. You get to a certain point, you say, okay, now would be a good time to save a checkpoint, right? Maybe I've done a pass of fan-out and back. Okay, checkpoint this thing. Or maybe after so many iterations or steps, write a checkpoint. And then I also have to write the code to recover, which is to read that checkpoint. If I have one single program that does one thing linearly, okay, maybe I can do that. Maybe that's not so bad, right? But if things are branching out, right? If I'm doing hyperparameter optimization, I've got a whole bunch of these that are running in parallel. And then maybe they also sometimes communicate and feed updates back to each other. Now getting clean checkpoints across a concurrent program like that basically doesn't really work.

Demetrios Brinkmann [00:13:49]: Yeah, it's like a, uh, it's like Santa Claus. It's nice in theory, but it doesn't exist.

Johann Schleier-Smith [00:13:57]: Yeah, don't tell my kids. Yeah, exactly.

Demetrios Brinkmann [00:14:00]: We used to, in Spain, there was a moment that we would, um, joke when something was like, uh, not real. Yeah, it'd be like, oh, it's the parents.

Johann Schleier-Smith [00:14:10]: That was the same, like, yeah.

Demetrios Brinkmann [00:14:13]: Santa Claus, it doesn't exist. It's the parents. Yeah, it's kind of alluding to that. So, okay, I'm getting a better idea of it now. You're a databases guy though, right through and through. You spent a lot of time in databases. Marry these two worlds for me.

Johann Schleier-Smith [00:14:26]: Yeah, so prior to being at Temporal, I had actually started a company that was in the database space, looking at serverless databases and also AI for database administration. So Crystal DBA, the company, was acquired by Temporal last summer. Congrats. Thank you. What was really interesting when we did that acquisition was that the missions of the companies were basically the same, even though the technology was actually quite different. So at the end of the day, when we think about databases and Temporal, it's about making the systems run reliably without developers needing to think about it. Right? So if you think about that idea of just write the code, put it in some system, and it'll do the right thing.

Johann Schleier-Smith [00:15:32]: Where does that happen? Actually, it happens in these transactional database systems, right? They give you those guarantees: it'll run end to end, it'll make all of the changes. Now, there are a bunch of drawbacks there. One is that you're often writing in some arcane language, whether it's PL/SQL or some other old SQL dialect; everybody's kind of got their own variant that just works for that one specific database, no standards. And just to really, you know, take me back to the '80s or '90s, typically, in terms of programming language style and capability. And, you know, I can't define things the way I normally do, so on and so forth. So what you're getting with modern durable execution is that guarantee, basically, of my software is, you know, bulletproof. It does what it's supposed to do. And you're getting it with whatever programming language you want.

Johann Schleier-Smith [00:16:37]: Now, you know, I will say there are some things that databases give you in addition. So for example, they give you sort of a consistency model around the way that different programs interact, which durable execution doesn't do on its own. Maybe it will someday, I don't know. But definitely being able to just bring it to regular programming languages is just a huge thing.

Demetrios Brinkmann [00:17:03]: That guarantee that you get is only in the database.

Johann Schleier-Smith [00:17:06]: So that's, I mean, that's the other big thing: that works if you are monolithic, right? If your application is, you know, one database and one front end or something like that, then great. But what happens if I have more than one database? Basically everyone does. 100%. Unless you're, you know, super small. Yeah. You're going to have multiple systems. Yeah, exactly. You're going to have your older systems and your newer ones, your legacy system and the one that's coming next.

Johann Schleier-Smith [00:17:39]: You have to be able to connect all that together. And that's why you really want to take that concern of reliability, basically, and just put it in a box and say, okay, it's handled over here. If you're on Temporal Cloud, you know, it's handled by these folks who, by the way, more than half, I think, of the engineering organization at Temporal actually just works on making the cloud thing reliable. Oh, nice. And at scale. Well, I would hope. Yeah. So there's a whole bunch of people who work on doing that, and doing that really, really well, which is why so many top brands trust Temporal to run the durable execution for them.

Demetrios Brinkmann [00:18:27]: Who do you normally have using it? Is it the SREs that are bringing it in? Like, who's champion for Temporal?

Johann Schleier-Smith [00:18:36]: So there are a couple of patterns and we're seeing more emerge. One is that in organizations that have platform teams, the platform team is often the owner for Temporal. And well, because they're the ones that get called, right? Yes. And they're the ones who have a responsibility. You have a platform team. So somebody's gonna say, okay, reliable infrastructure. Somebody owns that. Somebody's measuring it.

Johann Schleier-Smith [00:19:07]: Somebody is. But also, I'd say, developer productivity, right? Because the idea of a platform team is, hey, let's have some shared infrastructure so that all of our other teams can move faster and we can kind of have some standards and best practices around that. Different orgs do it in different ways, but the productivity angle of Temporal also really plays into that, where, you know, we see platform teams that say, okay, we can offer this to our developers, and then all the application teams can just build a lot faster.

Demetrios Brinkmann [00:19:38]: There's like flavors of Airflow, right? There's flavors of like Argo workflows. How do you kind of like place yourself in that market map?

Johann Schleier-Smith [00:19:49]: I'd say Airflow is relatively domain-specific, you know, ETL, data movement, and you could build an Airflow with Temporal, with durable execution. And I think some of our customers effectively do that. They build their own sort of in-house frameworks for different sorts of processes. Oh, I'm not saying that they're running Airflow exactly, but they will do Airflow-like things, maybe with their own framework, you know, which connects to their data sources and sinks and so forth. We definitely have folks who build DAG executors on top of Temporal. That's sort of a normal thing to do. Temporal is a more general programming model. That's sort of one of the key things.

Johann Schleier-Smith [00:20:43]: And so where you see that displacing things like Airflow sometimes is that, okay, if I needed to ETL a bunch of records, maybe Airflow's the right solution for that, right? But say I got a dump of legal documents and media files and other sorts of things. I need to figure out what these things are, maybe do some enhancement on them, and then maybe I'll either dump them somewhere else or I'll index them, you know, either in Elastic or a vector database of some kind. The code that you write for that is not necessarily most naturally expressed as a DAG. Just write Python code, right? Call these different things and run it with durable execution. And then, you know, you'll make sure that you're not missing page 406 or whatever it is.

Demetrios Brinkmann [00:21:41]: Okay. Can you gimme more examples like that?

Johann Schleier-Smith [00:21:44]: Yeah. Hyperparameter optimization. So, you know, I can write a driver program for that as a Python script. And as long as the server that I'm running that on, maybe it's even my laptop, as long as that doesn't crash, right? Great. Now what if I want to close my laptop and walk away, or if I have that running for a longer period of time, right? Even overnight, maybe even for a week. Maybe I put it on a virtual machine in the cloud, and who knows what will happen with that VM. You don't have to worry about any of that with Temporal, because the actual part of the program that matters is the program state. The compute is all essentially ephemeral, fungible.

Johann Schleier-Smith [00:22:34]: You can replace the compute. What matters is the state of that program, which iteration it is on, which rollout, and all that is going to be saved in the server. And then there's, you know, a server process that knows that if any of the compute resources go away, it needs to go find someplace to run that again.

Demetrios Brinkmann [00:22:56]: In a way, it's like the idea of like this declarative type of stuff where you just say, hey, look, I want to declare this. I don't care how you figure out what is going to happen. Like abstract all that away from me and just make sure that what I declare is what happens.

Johann Schleier-Smith [00:23:17]: Yep. And to make that concrete, I'm going to declare something like: here is my program, Python, TypeScript, whatever you name it, and I've put that on a server. So the way it works, in terms of the model, is that we talked about the Temporal server that's holding the state, backed by a database or the cloud. Then there is the worker, which is going to be listening on that server and saying, what should I run? And then you'll have a client, which is going to issue sort of the declaration of: go run this workflow, these are the arguments, go run that code with these arguments. Once that's acknowledged and once that returns, the system is going to go make it happen.
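The client/server/worker split can be sketched as a minimal in-process toy. The names are illustrative, not the Temporal API: the client only declares "run this workflow with these arguments", a durable queue stands in for the server, and a worker polls for work and executes the code.

```python
# Minimal in-process sketch of the client/server/worker model
# (illustrative names, not the Temporal API).

from collections import deque

task_queue = deque()   # stands in for the server's durable task queue
results = {}           # stands in for durably recorded outcomes

def client_start(workflow_fn, *args):
    """Client: declare a workflow run; the client itself does no work."""
    task_queue.append((workflow_fn, args))

def worker_poll_once():
    """Worker: ask the server 'what should I run?' and execute it."""
    workflow_fn, args = task_queue.popleft()
    results[workflow_fn.__name__] = workflow_fn(*args)

def greet_workflow(name):
    """A trivial workflow body: regular code."""
    return f"hello, {name}"

client_start(greet_workflow, "world")   # the declaration
worker_poll_once()                      # the system makes it happen
```

The design point this mirrors is that the declaration and the execution are decoupled: if a worker dies mid-task, the server still holds the task and can hand it to another worker.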

Demetrios Brinkmann [00:23:57]: Yeah.

Johann Schleier-Smith [00:23:57]: It reminds me of the Kubernetes philosophy very much: just, yep, here's what I want. Yeah. System, go figure it out. Exactly. Yeah, exactly. In terms of other examples, agents are definitely—

Demetrios Brinkmann [00:24:12]: Well, that was kind of where I was going to go is it probably gets a little bit more messy when you have agents doing this, right? Because—

Johann Schleier-Smith [00:24:23]: So it could. But actually it works pretty well. There's a misconception out there that people have, and I understand it can be confusing when you think about LLMs being non-deterministic, and then you say, well, wait a minute, but there's this deterministic execution thing. And so how is that gonna work? Is that even gonna work? Can I even use durable execution with LLMs? And I think, as we talked about before, that agentic loop actually is deterministic code; it's the output of the LLM that isn't, even though it's controlling the flow. So the difference with an agent, really, versus say other types of LLM-enabled applications, is that the control flow at the end of the day, do I continue, do I call a tool, is going to be dictated by the LLM responses, right? Versus perhaps one of those document processing examples, where, yeah, you're still using LLMs, but you're gonna say apply transformation A, B, C, summarize, and so forth.

Johann Schleier-Smith [00:25:32]: And that's actually a big use case, right? But that's sort of one box of uses, which is basically non-agentic AI. It's still AI, it's still generative AI, it's still LLMs, it's still really cool, really useful, new, but it's not agentic. With agentic now, the LLM is gonna decide what to do next. But effectively the way that works with Temporal is that that's still state relative to the workflow: the LLM said call this tool, that comes back in an activity, and then we run it and it's captured, and it's captured durably. Where we see a lot of additional value for Temporal in agents is, A, long-running, right? So if you are, say, monitoring something, or you have some business process, and now we're seeing a lot more of this too with the emergent OpenClaw-type uses where things are running on timers, right? It's in the name. It's kind of obvious, but that's just super, super natural: okay, this is going to run every half hour, every day at 3:00 PM, whatever it is. It's gonna check something, and then you have to wake stuff up. You have to do whatever you need to do.

Johann Schleier-Smith [00:26:45]: You have to remember where you were. Yeah. And then you're gonna go back to sleep for a while. And the important thing there is that you don't need to have compute resources for that. What matters while it's sleeping is just, okay, what is the state? Where was I? So all that is saved in the server, and then, you know, it's basically brought back and rehydrated to whatever it was in that code path at that right time. The other case that's pretty valuable is for human in the loop. Really, we talk about humans, but frankly, it's any sort of interaction with the outside world. So if it was a human, it could be, okay, I am getting to a certain point and then maybe I have a tool call that is actually going to go generate a document for signature and send it out for e-signature and then continue when it gets that back.

Johann Schleier-Smith [00:27:38]: Right. And so maybe that's gonna be quick, maybe that's not gonna be quick. We don't really know. If it's not quick, we want to be able to reclaim the resources, which in Temporal are basically gonna be cached compute resources. And then it'll just come right back to where that was and continue in that flow. But you can also, you know, set a timer on top of that, which is to say, oh, it hasn't been signed, and it's been, you know, 6 hours or 24 hours or whatever it is.

Johann Schleier-Smith [00:28:05]: Okay, great. Now, you know, we're gonna kick off a reminder, or we're gonna expire it, or actually somebody came along and said, we have to retract this. And so one of the things that Temporal lets you do, that actually some of the other durable execution frameworks don't quite have the same level of flexibility on, is it allows you to have interactions with that running workflow program through what we call signals, updates, and queries. So you can basically write these handlers that sit in that class. You've got your entry point, kind of your main function, right? But then you can have these other pieces of code, and they're basically running in that environment. And so if that is, you know, some sort of, oh, we changed our mind or whatever it is, it'll just come right in there, and it'll interact with all that state in that program, and it'll do whatever it needs to do, and just execute with regular code.
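The handler idea can be sketched in plain Python (again an illustration of the shape, not the Temporal SDK's decorators): alongside the main body, signal handlers mutate the running workflow's state from outside, and query handlers give a read-only peek at it.

```python
# Plain-Python sketch of the signal/query shape (not the Temporal SDK):
# a class holds the workflow's state; the main body advances it, a
# signal handler mutates it from outside, a query handler reads it.

class DocumentSignatureWorkflow:
    def __init__(self):
        self.status = "pending"

    def run_step(self):
        """Main workflow body: advance unless someone retracted it."""
        if self.status == "pending":
            self.status = "sent_for_signature"

    # Signal handler: an external "we changed our mind" event.
    def signal_retract(self):
        self.status = "retracted"

    # Query handler: read-only, answers "what are you doing right now?"
    def query_status(self):
        return self.status

wf = DocumentSignatureWorkflow()
wf.run_step()
wf.signal_retract()   # retraction arrives while the workflow is waiting
```

In Temporal these handlers would be methods on the workflow class marked as signal/query handlers, and the server routes external messages to the running instance; the sketch only shows why shared in-class state makes that interaction natural.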

Demetrios Brinkmann [00:29:05]: Tell me more about that, because that sounds fascinating. It's almost, in a way, you've got this scale-to-zero, like serverless type of style. Yes. And then you've also got these little side quests, or different types of pathways that you can go down. If-else statements. If this happens, then we've got this logic over here that you can go and run.

Johann Schleier-Smith [00:29:31]: Yep. It is in your workflow, right? It is regular— let's focus on Python. So with Python, you tend to have an async/await style of programming. And so you can have concurrent things happening at different times, right? And in Temporal, what we've actually done in our SDK is we've replaced Python's async event loop with a deterministic one that basically makes sure that the interleaving on replay is always gonna be exactly the same. And so that allows you to do any of the things that you would do with concurrent programming, like, you know, having queues and handlers and stuff like that. You basically get your full concurrent programming model inside of this durable execution. So, you know, oh, I ran something. Oh, maybe it's taking too long for that thing to finish and I need to get a response back to a customer.

Johann Schleier-Smith [00:30:38]: Right? Okay. Just sleep. We replaced the sleep function, right? So that instead of doing the system sleep, it's gonna be a Temporal sleep. And then that means that if it's short, it'll actually keep it cached. It'll be deterministic and durable, so if it did crash during that sleep, it would recover. But if it's long, we'll actually, effectively, intentionally crash it. We'll clear the cache, flush it out, reclaim the resources, and then come back again.
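The durable-sleep idea can be sketched as a toy: the first execution actually waits and records that the timer fired; a replay after a crash resolves the sleep from the record instead of waiting again. The function name and history shape are illustrative, not Temporal's API.

```python
# Toy sketch of a durable sleep: record that the timer fired, so a
# replay after a crash resolves instantly instead of waiting again.

import time

def durable_sleep(history, key, seconds):
    """Return True if we actually waited; False if resolved from replay."""
    if key in history:
        return False              # replay: timer already fired, don't wait
    time.sleep(seconds)           # first execution: actually wait
    history[key] = time.time()    # durably record that the timer fired
    return True

history = {}
waited_first = durable_sleep(history, "timer-1", 0.01)
waited_replay = durable_sleep(history, "timer-1", 0.01)  # after a "crash"
```

This is also why a long sleep can be "intentionally crashed": since recovery replays from the history, nothing is lost by evicting the process while the timer is pending.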

Johann Schleier-Smith [00:31:10]: But yeah, full programming model. That's the thing that's really different about durable execution.

Demetrios Brinkmann [00:31:16]: How are you seeing folks take advantage of this? Because I instantly am thinking about the OpenClaw situation and thinking, oh well, what I would do is probably just program in: hey, every X amount of time, just wake up the LLM, or send off something to an LLM with context.

Johann Schleier-Smith [00:31:37]: Here's what's happened, here's all of the information. So there are a bunch of folks who build on top of Temporal directly and build different sorts of experiences. I can't go into the specifics, but I can tell you that in addition to OpenAI's Codex on the web, we have Replit and Lovable, and they use Temporal, as do a bunch of other folks. Just using those primitives, you can build, from the ground up, very robust agentic systems. Now, another thing we've done is a number of integrations with some of the leading agent frameworks — for example, the OpenAI Agents SDK, Pydantic AI, and the AI SDK by Vercel.

Johann Schleier-Smith [00:32:38]: Those are ones that have been released. Mm-hmm. We're continuing to work on others. We have integrated with Braintrust on the observability side, and we've integrated with Langfuse. So our goal with our integrations is really to give users a choice of which pieces of the AI ecosystem they want to bring in to build their agentic system, right? And let us handle the durability, let us make it reliable, let us give you the ability to use these primitives to send messages and queries and ask the agent, what are you doing right now? Those are all things that come with the Temporal programming model, and when we integrate with those frameworks, it makes it easy for you to access them.

Demetrios Brinkmann [00:33:38]: Yeah, I was thinking exactly that, where you're almost using the temporal state, feeding that to the agent to say, here's where we're at, now what should you do next?

Johann Schleier-Smith [00:33:51]: Yeah, it's really just separating different concerns. So the Temporal state is your program state. Another way to think about it is: whatever your stack and memory state is, is effectively protected by Temporal — it's your programming-language state, right? Where are you in terms of that execution path? And then the agent framework is about making it easy to put together multiple agents and guardrails — or, increasingly now, with the coding agents and the file systems and all these other pieces, it's actually moving towards almost this new category, which is the harnesses, right? It's all evolving quickly, and we're evolving with that ecosystem. But the key thing is really developer productivity — different aspects of developer productivity. Here are some common patterns for building agentic systems: okay, boom, agent SDK. I want it interacting with the world, I want it long-running, reliable, async: Temporal.

Demetrios Brinkmann [00:35:03]: Mm-hmm. Okay. So now can you break down, like, what do I do first?

Johann Schleier-Smith [00:35:08]: Okay. So the way that you start is you really have two options. You can download the Temporal CLI and run it on your laptop, which is a totally good way to do it, or you can start on Temporal Cloud — but you do need a Temporal server of some kind. And then you get the SDK. These days, what I would recommend, frankly, is that you just grab your favorite coding agent and create a project. We have a Temporal skill, and you can probably just ask it to build you a hello world. Look, we've got great documentation and tutorials on the website and so forth, which I definitely encourage people to look at as well.

Johann Schleier-Smith [00:35:49]: But if you want to be forward-looking, and if you kind of want to see where we're going with some of this stuff, actually just say: hey, agent, teach me Temporal.

Demetrios Brinkmann [00:36:01]: Yeah, that's a common pattern that I've heard folks talking about: before they'll jump into anything new, whether it's a new programming language or a new tool — Yes. — they'll ask the chat. Yes. Give me the TL;DR, and then go a little bit deeper. And then once you have this mental model of how it works, you can start using the tool much more effectively.

Johann Schleier-Smith [00:36:28]: Yes. Grab the skill, because that will ensure sort of the consistency. The LLMs, the foundation models, are pretty good — they understand Temporal pretty well. But there are certain edge cases, and gosh, we know that LLMs aren't always perfect. The skill will definitely accelerate that.

Demetrios Brinkmann [00:36:45]: Yeah. It feels like right now in this conversation, we've established what it does and why you would want to use it, right? Then maybe the next step is, all right, now I'm using it. What are some things that I should be focusing on while using it?

Johann Schleier-Smith [00:37:05]: Yeah. So I definitely recommend that people spend a little bit of time wrapping their head around the model. Some of this is just because it's a little bit different sometimes; I think if you just focus on it and understand it, then you'll be good. But just understanding: okay, I'm going to have this worker, right? Which is where the code actually runs. I'm going to have a starter program that's going to be the client that initiates that code. And then I'm going to have the Temporal server, backed by whatever reliable backend it has. Those are kind of the different pieces.

Johann Schleier-Smith [00:37:41]: So making sure you understand that part of the model. And making sure you understand that you have activities, which can do anything, and workflows, which are restricted — that's kind of the model. So really thinking about: well, how do I break my program up into these pieces? The simple question to ask is usually: does it do I/O? If it does I/O, that means it goes in an activity, right? Because that means state is going to come in that we need to save. And then, yeah, work on modeling your problem. We have a number of sample applications, and we have an AI cookbook that we're continuing to add recipes to, so those are good to pull from — you can see if there's something that matches what you're trying to do. And then the other thing that I encourage folks to do is to use the Temporal UI.
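
The "does it do I/O?" rule can be sketched in plain Python: I/O-doing functions become activities (retryable, with results saved), while the workflow is just the orchestration. The function names and the retry helper here are illustrative stand-ins, not Temporal APIs — in Temporal, retries are a server-side policy rather than a loop in your code.

```python
# Workflow/activity split, simulated with plain functions.
def with_retries(fn, attempts=3):
    # Stand-in for Temporal's activity retry policy.
    last = None
    for _ in range(attempts):
        try:
            return fn()
        except Exception as e:
            last = e
    raise last

calls = {"n": 0}

def fetch_document():            # activity: does I/O, may fail transiently
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("flaky network")
    return "raw document text"

def summarize(text):             # activity: would call an LLM in real life
    return text.upper()

def document_workflow():         # workflow: pure orchestration, no I/O itself
    doc = with_retries(fetch_document)
    return summarize(doc)

assert document_workflow() == "RAW DOCUMENT TEXT"
assert calls["n"] == 3           # succeeded on the third attempt
```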

Johann Schleier-Smith [00:38:33]: If you go to the cloud, it'll just be there. If you start up the server in your local environment with the CLI, you'll get a URL. You click that, you open it up, and basically what you can then see is the program executing. You can see the workflow, you can see the activities, you can dive in, you can see the arguments and the return values, and you can see any retries.

Johann Schleier-Smith [00:39:00]: You can see failures. And actually, one of the cool things — I encourage folks to do this; it kind of blew my mind when I first saw it — is: suppose you write your code and you're just developing, right? And you did something wrong, or your coding agent did something wrong. There's a bug, it's a syntax error, whatever — Python blows up three-quarters of the way through. What you can do now is just go fix that error and save the file.

Johann Schleier-Smith [00:39:35]: I know where you're going. Restart the server, and that'll just finish. Wow. And that can obviously save you time in your local development loop. But also think about if I've set some job to run — and this is always my fear. Look, it's great to use type annotations and other stuff in Python to try to reduce this, but you kick that job off, it runs for a week, and now it's broken because of something silly, right? And now you're like, okay, well, I kind of have the pieces; I can pull them together and write a new program that will pick up from where that one broke. And I'll do that.

Johann Schleier-Smith [00:40:16]: But with Temporal, you don't do that. You just fix the thing, and then the thing finishes. It remembers where it broke. Yes. It can move to the new version of the code.
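
That fix-and-resume behavior can be modeled with a journal of completed steps: on restart with the fixed code, journaled steps are skipped and execution picks up where it broke. A toy illustration in plain Python (the step names are made up), not how Temporal's event history actually works internally:

```python
journal = {}   # step name -> saved result; survives the "crash"

def step(name, fn):
    # Run each named step at most once; replay returns the saved result.
    if name not in journal:
        journal[name] = fn()
    return journal[name]

def pipeline(buggy: bool):
    a = step("extract", lambda: [3, 1, 2])
    b = step("sort", lambda: sorted(a))
    if buggy:
        raise TypeError("stand-in for a bug three-quarters of the way through")
    return step("report", lambda: ",".join(map(str, b)))

try:
    pipeline(buggy=True)          # first run blows up near the end
except TypeError:
    pass

# "Fix the error, save the file, restart": earlier steps are not redone.
assert pipeline(buggy=False) == "1,2,3"
assert set(journal) == {"extract", "sort", "report"}
```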

Demetrios Brinkmann [00:40:27]: Yeah. It also makes me think of how a lot of the time we talk about memory for agents, right? There's the personalization side of memory, and then there's the 'I did something as an agent, and so how do I remember how I completed that?' This kind of moves in that direction a little bit.

Johann Schleier-Smith [00:40:47]: It is. So one of the things that's interesting about Temporal — and this is kind of where I think 2026, frankly, we know it's going to be a wild year, and we'll see how things play out — is that the file system has been sort of the main repository for memory in most of these agents. Certainly something like OpenClaw is using the file system. If I'm using Claude Code, it's keeping a session history in the file system and so forth. And that's okay, and you can build file-system-enabled agents and applications with Temporal. But actually, you don't necessarily need a file system for a lot of agents.

Johann Schleier-Smith [00:41:33]: So if you're going to be generating code, running scripts, doing sophisticated computer-use type tools, then you probably want a file system. If your agent is calling APIs and LLMs, working with databases, then you can probably keep all the state that you need in Temporal, right? And certainly if it's episodic memory relating to a particular session, that should be totally fine. For memory that crosses sessions — frankly, that's an area where we're still experimenting and exploring, and the whole ecosystem is too, both in terms of how you structure it and where you put it.

Demetrios Brinkmann [00:42:26]: Yeah, and memory is such a fascinating one, because you want that memory, but you don't want to leak that memory either — if you're going across sessions and it's not the same person for that session.

Johann Schleier-Smith [00:42:44]: Well, absolutely. You'll get a level of isolation just out of the box, where basically every workflow — which is basically every invocation of that workflow — is private. And that's just part of your programming language's isolation model. You can put more isolation on that, right? If you want to run inside of containers or isolates or WebAssembly or whatever, you can do that. But I think where it gets really interesting — and to be clear, we don't have any solutions or products in this area; this is just the kind of stuff I like to think about — is how you control that flow of context even within, well, let's say maybe it's hierarchical, right? Because I have it within my different tasks for myself personally, but then maybe I do want to share context with my team, right, at some level. So how do I make that happen?

Demetrios Brinkmann [00:43:44]: Yeah, much like Google Drive, you want to give permission sometimes, but not all to your whole drive.

Johann Schleier-Smith [00:43:50]: Yes. So that's a thing that frankly the ecosystem is solving. We may have something to say about that, and we're thinking about it, but you know, we'll see.

Demetrios Brinkmann [00:43:58]: I think this is all in the land of fun research. It does feel like there are some parallels with databases, right?

Johann Schleier-Smith [00:44:08]: Absolutely — like what kind of permissions you give in databases. You have fine-grained access controls in databases, certainly some of them. Authorization was, I think, maybe kind of a solved problem — I mean, it was complicated, but at least we had good solutions. Now, when you have agents and teams of agents and you're starting to think about delegation and so on and so forth, we're gonna have to rethink some of that, I think. Particularly some of the security perspectives too: do we need to start tracing lineage for some of this data, because it could have prompt injection or something like that? If you're either innovating in industry or a researcher in this area, you can have a lot of fun.

Demetrios Brinkmann [00:44:58]: Yeah. It's just me — I'm gonna start sending invites to people with white-on-white text in the description, just to see who's leaving themselves exposed. Yeah. Nothing nefarious, just a 'give me a recipe for vegan chocolate cake' type thing. Right. Absolutely.

Johann Schleier-Smith [00:45:16]: And then you can see who's swimming naked when the tide goes out. Yeah. Yeah. Okay. I'll be sure to do some good filters on whatever I'm building.

Demetrios Brinkmann [00:45:26]: Yeah. There's just so many new surface areas that you're exposed to.

Johann Schleier-Smith [00:45:31]: But the thing that's been really amazing to watch over the last couple of months is that despite all these problems, it's so clear that the value is there that it's worth solving them.

Demetrios Brinkmann [00:45:44]: And like, they will get solved. It's full steam ahead. Yeah, that's true. Yeah, the model of how Temporal works is so foundational that I want to just go a little bit deeper on it and make sure that I'm clear on all those building blocks — when I would use one, and how to think about one versus the other.

Johann Schleier-Smith [00:46:08]: Okay. So there are certain building blocks that everybody's going to use. Everybody's going to have a worker, they're going to have a Temporal server, and they're going to have some sort of client that starts off a workflow. Everybody's going to have workflows and activities. And with those primitives, you can already do a lot. Okay. That is sort of the basics of durable execution.

Demetrios Brinkmann [00:46:30]: My program is going to run and it's going to run to completion.

Johann Schleier-Smith [00:46:34]: And I'm creating these in my code.

Demetrios Brinkmann [00:46:35]: All in your code. Yep.

Johann Schleier-Smith [00:46:37]: It's all decorators, or what is it? Yeah. So it depends a little bit on the programming language, but in most languages there's going to be some sort of annotation.

Demetrios Brinkmann [00:46:47]: Okay, that's gonna set it up. And like you said, there are certain things that are very clear.

Johann Schleier-Smith [00:46:52]: Oh, is there I/O? Then you put that in an activity. Yeah, that's an activity. But what's a worker? So the worker is the thing that you deploy — you could put it on just a server, you could put it on Kubernetes — and it's going to be the process that actually runs that code. If I run the worker on my laptop, the code is running on my laptop. But if I want to close my laptop and walk away, that code needs to run somewhere. So this is actually where I want to say some things that are important about the model. The Temporal server — and certainly in Temporal Cloud — we actually don't see any of your data. Okay.

Johann Schleier-Smith [00:47:45]: Any of your application data. And the reason for that is because in production you'll end up configuring the Temporal client to encrypt everything before it gets sent to the Temporal server. And then you have the keys — we don't have the keys. Yeah. Or if you're hosting it yourself, you can do the same thing. That system doesn't have the data.

Johann Schleier-Smith [00:48:11]: So what does it mean if you can't see the data? Well, actually you can't compute on that data either. And so that's where you bring up the worker, right? The worker is running in your VPC, your trusted environment. That's where you're going to decrypt the data, and then it's going to run just like it would run ordinarily, on decrypted data.
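
The shape of that data-privacy model can be sketched in a few lines: the client encodes payloads before they reach the server, and only the worker decodes them. For the sake of a self-contained example, base64 stands in for real encryption — a production codec would use an actual cipher with keys you hold, which this sketch deliberately does not attempt:

```python
import base64

# Toy payload codec: base64 is NOT encryption; it just illustrates that
# the server stores an opaque blob while the worker sees the real data.
def client_encode(payload: str) -> bytes:
    return base64.b64encode(payload.encode())   # leaves your side encoded

def worker_decode(blob: bytes) -> str:
    return base64.b64decode(blob).decode()      # decoded only in your VPC

secret = "customer PII"
on_server = client_encode(secret)               # what the server stores
assert secret.encode() not in on_server         # server never sees plaintext
assert worker_decode(on_server) == secret       # worker computes on real data
```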

Demetrios Brinkmann [00:48:36]: Okay.

Johann Schleier-Smith [00:48:37]: So the worker is where the code gets executed. The worker executes the code. Yes. And a simple way to deploy that worker is to some sort of container service, and those can be set up to auto-scale and so forth.

Demetrios Brinkmann [00:48:53]: But you can really deploy however you want. Okay. So you're not dealing with the auto-scaling?

Johann Schleier-Smith [00:48:58]: We can give you some signals. Okay. So for example, in Kubernetes, there is an auto-scaling mechanism that can talk to the Temporal server and say, hey, this thing needs to be scaled up. And this is an area where we're continuing to do more work and make it more and more transparent for people, because of what we're ultimately driving towards. Frankly, there's a pretty deep connection between durable execution and serverless. I have no concrete product roadmaps to share right now, but just looking at the ecosystem, as well as the things that we're thinking about internally, it makes a lot of sense to move towards a serverless model.

Demetrios Brinkmann [00:49:51]: Well, it's almost like you guarantee the durable execution, and a way to get out in front of that is to auto-scale before stuff blows up, in case that's one of the problems. It's almost like you're preemptively trying to do things so that you never have to restart.

Johann Schleier-Smith [00:50:12]: So you could definitely predictively scale. Yeah, that's sort of one option, and you could think about ways to do that. But if you go back to serverless function-as-a-service, that is actually mostly reactive — they just found ways to scale it really fast, with a bunch of tricks and a bunch of bin packing. The way I think it really comes together is the original promise of serverless, what got people so excited, and frankly why that name attached itself to function-as-a-service. They didn't come out and say, hey, this is serverless; that was something the community attached to it. It's this idea that I should be able to write the program, save the program to this massive computer called the cloud, and then the right stuff should just happen, right? Why would you burn a bunch of resources when my program isn't doing anything? Don't do that. Why does my program break in the middle? Don't do that. Just do the right thing.

Johann Schleier-Smith [00:51:24]: And we can get into all these definitions of serverless — I did my Ph.D. on serverless; I walked through them all. Does it mean it's stateless? Does it mean this? Does it mean that? It's less of a thing now because AI is the new buzzword, but roll back eight years and it was all about serverless, right? You could serverless-wash stuff in your marketing and whatever. At the end of the day, though, it's just: the thing does what it's supposed to do.

Demetrios Brinkmann [00:51:57]: That's all serverless is — I don't need to worry about server nonsense.

Johann Schleier-Smith [00:52:03]: Yeah. The parallel there for what you're doing with Temporal is so clear. It's very connected, and I think it's effectively solving the same problem: the cloud is introducing complexity. It's giving us a lot of benefits, right? Benefits in scale, benefits in resource elasticity, which drives the cost, benefits of getting access to all types of cool hardware and GPUs — all these benefits. Right.

Johann Schleier-Smith [00:52:39]: But then it has in the past had these fundamental — actually, I'll take that back — non-fundamental limitations and costs that came along with it. Yeah. And when I think about the progression of the last ten years and what's really been happening, the way I interpret serverless is sort of taking all of those non-fundamental awkwardnesses away.

Demetrios Brinkmann [00:53:11]: And that's basically what it is. Yeah. You get more room to do cool stuff, but when you add this complexity, there are just more ways that it breaks. Yep. Okay. So you have the worker — that's a fundamental building block. Yes. You also have different pieces; break down these other pieces for me.

Johann Schleier-Smith [00:53:32]: Yeah, so you've got the worker — that's where your code is running. You have the server — that's where your state is being made durable and tracked. And then you have the client. And what are you writing on top of that? You're writing in terms of these simple primitives: the activities, which are where you're doing the I/O, and the workflow, which is the control of that — sort of your main entry point to that program.

Demetrios Brinkmann [00:54:04]: Okay. So the workflow is what I'm writing — what we want it to do. Exactly. Whatever it is. If it's that document processing pipeline, it's: do XYZ. Yep. And that's, as it says, the workflow.

Johann Schleier-Smith [00:54:18]: Yep. So kind of a clear thing: do it beginning to end. It could be my agentic loop.

Demetrios Brinkmann [00:54:23]: I want you to loop until somebody says stop. Okay, whatever it is.

Johann Schleier-Smith [00:54:27]: And this is on the client? Well, let's see. The workflow and the activities, those are running on the worker. The client — which you can also trigger from within workflows and activities — is what sends the command, say, to start the activity or to start the workflow. And that's where you can have that sleep. Yes, yes. We actually do have a new feature coming out, standalone activities, for when you have something simple that needs to run without even being in a workflow. You can do that and get the benefits of the reliability. That thing is going to run.

Johann Schleier-Smith [00:55:09]: It's guaranteed to run. It'll be retried and so on and so forth. It's all in your dashboards and UI as well. But now we're getting into power-user features. There are a lot of power-user features. Okay. There are things like schedules, which you can think of as more sophisticated cron functionality. We've talked about the signals and the updates.

Johann Schleier-Smith [00:55:31]: We talked about the workers being able to update to a new version of the code. One of the things we've released recently is something called worker versioning. Sometimes you actually don't want to go to the new version of the code, or maybe you want to do a blue-green deploy or something like that, so for more sophisticated operators we provide a lot of control over how that gets rolled out. We also have a feature called workflow pause, which means that — either on command, or when it encounters some sort of exception condition in your logic — you can say: wait, just stop this; we're going to look at it before it makes progress. Right.

Johann Schleier-Smith [00:56:18]: So there's a lot that's continuing to be developed. We are working on support for larger data sizes — large payload storage. Right. So if you have something like some of these media files, you may not want to send them to the Temporal server and store them there, even encrypted. You say, you know what, that thing lives in S3, or something like that. But I want Temporal to understand it, I want Temporal to track it, I want to see how large the objects are and so forth.

Demetrios Brinkmann [00:56:54]: So large payload storage makes that work.

Johann Schleier-Smith [00:56:55]: So in that case, you just send the metadata? There are different variants of it, but yes. At the end of the day, what's going to happen is that, say, that output that got generated — the Temporal client, or the SDK, is going to be configured to go put that in your blob storage, remember the path to it, encrypt that, and put that in the Temporal server instead of actually putting the object in the Temporal server. Yeah, yeah, just have the pointer. Exactly, exactly. At the end of the day, we're talking about pass-by-reference rather than pass-by-value. Yeah. We're also doing streaming, which is another thing that's really important in agentic use cases. There's actually pretty good support for it out of the box — people do it all the time with Temporal today, either using the basic primitives, the updates and signals and queries, or sometimes using third-party storage or third-party streaming like Redis Streams, for example.
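
That pass-by-reference idea is easy to sketch: the big object lives in blob storage, and only a small pointer plus metadata goes into the workflow history. A dict stands in for S3 and a list for the server's history; the key names are illustrative, not a real API:

```python
blob_store = {}   # stands in for S3
history = []      # stands in for the Temporal server's event history

def record_output(key: str, data: bytes) -> None:
    blob_store[key] = data                             # big object stays in your storage
    history.append({"ref": key, "size": len(data)})    # server tracks metadata only

def load_output(event: dict) -> bytes:
    return blob_store[event["ref"]]                    # worker dereferences the pointer

video = b"\x00" * 50_000_000                           # a large media file
record_output("s3://bucket/run-1/video.bin", video)

assert history[0]["size"] == 50_000_000
assert len(history[0]) == 2            # history holds a pointer + size, not the data
assert load_output(history[0]) is video
```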

Johann Schleier-Smith [00:58:03]: So there's a bunch of different ways that you can do it. We're making it first class.

Demetrios Brinkmann [00:58:08]: So that's another thing that we're working on.

Johann Schleier-Smith [00:58:12]: For what use cases? For streaming? Like audio? So certainly audio is one of the use cases — we have folks with agentic applications that are audio applications. And then it's also any sort of updates that you're going to get in between your main interactions. So if I'm in a coding agent, and that coding agent is telling me that it's running a bunch of tools and checking these files and doing other things, those are all, from the perspective of a UI, streaming updates, right? So basically you don't wait until the end of the workflow, or until a point of pause or human input, to say what you're doing. And then of course you combine that with things like the signals — now you can start doing interruption, right? Because I say, oh, wait a minute, I don't want you doing that; I want you doing something else. There are layers upon layers of functionality that you can access. I think the good thing is that you don't have to start with all of that.
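
A minimal sketch of that streaming-updates pattern using the primitives mentioned above: the running workflow appends progress events to its state, and a query-style handler lets a UI poll incrementally with a cursor instead of waiting for the end of the run. Plain Python with made-up names, not the SDK:

```python
class CodingAgentRun:
    def __init__(self):
        self.events = []

    def _emit(self, msg: str) -> None:
        self.events.append(msg)       # progress the UI can stream

    def run(self) -> str:
        self._emit("running tests")
        self._emit("editing files")
        self._emit("done")
        return "ok"

    # query-style handler: UI asks "what are you doing?" mid-run
    def events_since(self, cursor: int):
        return self.events[cursor:], len(self.events)


agent = CodingAgentRun()
agent.run()
batch, cursor = agent.events_since(0)
assert batch == ["running tests", "editing files", "done"]
newer, _ = agent.events_since(cursor)
assert newer == []                    # nothing new since the last poll
```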

Johann Schleier-Smith [00:59:28]: And increasingly now — you know, we'll see — but I expect that you're going to be able to access pretty much all the features through your coding agent just by asking for them. We've been doing this skill, we put it all in there, and it really should be able to do it.

Demetrios Brinkmann [00:59:45]: The coding agents are pretty good. The thing that I see is you have to know what to ask for. So that's the new thing, and that's why I think it's really cool that we get to talk about it. Hopefully somebody that's listening goes, oh cool, now I understand Temporal has that feature, so I can go ask my coding agent.

Johann Schleier-Smith [01:00:04]: Yes, for that. Yes, that's—

Demetrios Brinkmann [01:00:05]: I think it's great that you point that out, because not a lot of people are writing code by hand these days. No, but you still want these things to happen, and so if you know that they're there — it's latent — you can go probe the coding agent and say, hey, use the Temporal streaming feature for this.

Johann Schleier-Smith [01:00:28]: Yes, absolutely.
