Managing Memory for AI Agents // Ben Labaschin // Agents in Production 2025
SPEAKER

I'm a Principal Machine Learning Engineer at Workhelix, where I've been building enterprise-scale GenAI platforms and production ML systems as a founding engineer. I wrote the O'Reilly book "What Are AI Agents?" and am currently writing a follow-up called "Managing Memory for AI Agents" about the practical considerations of working with and managing data with AI agents. I recently published research in AEA Papers and Proceedings on measuring firm-level exposure to large language models and their potential productivity impacts. I spend most of my time figuring out how to actually deploy AI systems that work reliably in production—from async LLM APIs and embedding systems to the messy real-world challenges of putting GenAI into enterprise workflows. My work spans the full stack, and my interests range from the theory of AI and its impact on labor to deploying production-grade LLMs and AI agents.
SUMMARY
Drawing from my forthcoming publication, I'll explore the foundational decisions that determine whether AI agents deliver lasting value or become expensive technical debt. Rather than focusing on specific frameworks or tools, I'll cover the core tradeoffs between flexibility and performance, the memory patterns that actually matter for agent reliability, and how to architect systems that can evolve with the rapidly changing AI landscape. The key insight is understanding what problems agents fundamentally solve—automating complex, multi-step workflows—and designing memory and coordination systems around those core needs rather than getting caught up in today's specific technologies. You'll leave with a framework for making architectural decisions that will serve you well regardless of which models, frameworks, or tools become dominant next year.
TRANSCRIPT
Ben Labaschin [00:00:00]: Everyone, I really appreciate you having me here to talk about managing memory for AI agents. My name is Ben Labaschin. I'm a Principal Machine Learning Engineer at Workhelix. I work with agents all the time; I was working with an agent probably 15 seconds before I got on here, watching it work, as I'm sure many of you are. And I think that agents aren't the future. I think they're the now.
Ben Labaschin [00:00:36]: I've actually thought that for a while. Back in 2023, O'Reilly reached out to me and they said, hey, do you want to write a publication on agents and their emergence into our use cases? And I was so excited to do it. And then relatively recently, they asked me again: hey, Ben, would you like to write a follow-up on managing memory for AI agents? And I thought, well, this couldn't be more timely. I think managing memory for AI agents is honestly one of the biggest topics to be thinking about right now, and I hope in this presentation I can convince you of all the nuances of why that's the case. So I'm happy to get into it and speak to you about that. Okay, so first of all, I'm talking about AI agents and I'm saying managing memory for AI agents.
Ben Labaschin [00:01:24]: But what is memory for agents? Well, I actually don't think all of us are going to entirely agree on that. I think a very simple and perhaps straightforward answer is that memory is data for agents. And under that interpretation, I think it can make a lot of sense. Because of my background in machine learning, and data science before that, I understand data. And if a lot of you are software engineers, developers, or machine learning engineers, data is something that we understand. So we say, okay, we can get our hands around this. Some data is relevant, other data not so relevant to your tasks. Okay, well, now this just becomes a sort of retrieval and storage problem.
Ben Labaschin [00:02:05]: And I'm going to talk a lot about storage and retrieval throughout this presentation. But what becomes relevant here for memory and agents is that it's not so straightforward. That's one thing that I write about in the publication, and one thing that I think we should keep in mind when we're thinking about agents. And that's because agents come with a catch. We kind of all know what the catch is at this point; it's sort of trivial to say. Agents are by their nature dynamic, right? They're not programmable.
Ben Labaschin [00:02:35]: Not programmable in the traditional sense, anyway. So on the one hand, we can think about memory from a sort of data-centric perspective, and data is something we understand: databases, retrieval. On the other hand, the system that is accessing and using that data is not traditional, right? It's dynamic. It's everything that we think of when we think of agents; it's one of the benefits of agents, right? So the agents themselves are tool-based systems under constraints, accessing information, trying to provide us information or do work for us. Right? That's a relatively vague, general definition of agents. But they need to access that data and they need to store that data, and that becomes a little bit more complicated.
Ben Labaschin [00:03:20]: So this storage-and-retrieval paradox between, you know, classical and non-classical programming becomes more complicated because the retention of knowledge itself is stochastic. We're trying to tell the LLM, hey, store this information, don't store this information, but it's operating under constraints. The storage itself can be classical in a sense; if you've read Designing Data-Intensive Applications, then when you're thinking about memory management you're off to a really good start. You can store text that you've given to an agent as embedded lists of floats that have inherent meaning to the models that embed them, but not to us. You can even use relatively traditional infrastructure like pgvector, et cetera, to store this information. But the behavior of the knowledge, and the storage and retrieval by the agents, that is the thing I'm really trying to harp on here. The same task can return different results. The same query can return different results.
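To make the classical half of that concrete, here is a minimal sketch of memory-as-data: embed text, store it in Postgres with the pgvector extension, and retrieve by vector similarity. The table name, embedding model, dimensions, and connection string are illustrative assumptions rather than anything prescribed in the talk.

```python
# A sketch of agent memory as "just data": embed text, store it with pgvector,
# and retrieve by vector similarity. Table name, model, and DSN are assumptions.
import psycopg2
from openai import OpenAI

client = OpenAI()

def embed(text: str) -> list[float]:
    # Any embedding model works; text-embedding-3-small returns 1536 floats.
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return resp.data[0].embedding

conn = psycopg2.connect("dbname=agent_memory")  # hypothetical database
cur = conn.cursor()
cur.execute("""
    CREATE EXTENSION IF NOT EXISTS vector;
    CREATE TABLE IF NOT EXISTS memories (
        id        serial PRIMARY KEY,
        content   text,
        embedding vector(1536)
    );
""")
conn.commit()

def remember(text: str) -> None:
    # Store the raw text alongside its embedding.
    cur.execute(
        "INSERT INTO memories (content, embedding) VALUES (%s, %s::vector)",
        (text, str(embed(text))),
    )
    conn.commit()

def recall(query: str, k: int = 5) -> list[str]:
    # Nearest-neighbor retrieval over the stored embeddings.
    cur.execute(
        "SELECT content FROM memories ORDER BY embedding <-> %s::vector LIMIT %s",
        (str(embed(query)), k),
    )
    return [row[0] for row in cur.fetchall()]
```

The stochastic half, deciding what goes into that table and when, is the part the talk keeps returning to.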
Ben Labaschin [00:04:33]: And, as something I will be talking about a lot, language itself is relatively fuzzy. I don't think this is talked about quite enough, but I think it should be talked about a bit more. I mean, the Romantic poets of the past were talking about this all the time; it's not anything particularly surprising here. But language is fuzzy. The thing that you say and mean is not necessarily the thing that I say and mean. And of course LLMs are pretty good at being able to, you know, be sort of internal vector databases for us and store sort of that latent space of knowledge. However, when it comes to AI agents and their storage of memory, what should they choose to retain? Well, the language itself is fuzzy.
Ben Labaschin [00:05:22]: That becomes a little bit more difficult. So we can use different types of algorithms to try to do that, and I'll talk about some of those today, but I want to keep all of this at the fore of our minds: it's not simply, oh, we just need to design the proper algorithm. One of the things I conclude from my research is that I'm not quite sure there's ever going to be a perfect algorithm, because that would entail that language itself is something that can be constrained in that way. This leads to an overarching point. Typically, traditionally, we talk about AI agents and memory as if there are three types of memory, and that in and of itself is a taxonomy, right? But the problem with that is that taxonomies are by their nature limited and might not capture what memory really is. And that leads me to a conversation I was having during the research for this. I was talking to a really smart guy at Redis, his name's Andrew Brookins. And I gave him sort of a trick question, or sort of a gotcha question, not because I was trying to trip him up, but because I really wanted to hear his answer, and it came really fast.
Ben Labaschin [00:06:41]: And it's this. We talk about AI agents and memory, right? And memory in itself is sort of a loaded term, you might say, because it conflates human thinking with agents. And agents aren't humans. Why should we follow the kinds of thinking humans do when it comes to AI agents? Should we assume that that's the right way to think about this type of thing? And he gave me a really, I thought, well-thought-out and empathetic answer to the problem. He said it is a flawed way of thinking about agents. Memory is a flawed way, but, and I'm paraphrasing here, despite its flaws, it's actually very useful. And that's because with memory and AI agents, we sort of take from the literature of psychology, and psychology, for all of its flaws, has a lot of research into humans making decisions with their memory, under constraints. And what we see there is that humans will make all these choices, all these decisions, that even under those constraints can be very stochastic, that can be very random.
Ben Labaschin [00:07:53]: Some of them are sort of pointed, but under certain constraints they're not. And that is very helpful, because as you can probably tell, that's a lot like what AI agents do as well. So thinking about it in terms of memory and human-type memories is probably flawed, but it's probably flawed in the right way. And I wasn't sure what the answer was there, but coming out of this research, I end up in that space: it's probably the right type of flawed. So even though it's flawed, let's talk about the right type of flawed here. Traditionally there are three types of memory that we tend to think of when we think of AI agents and memory, right? There's short-term memory, or working memory. I think I skipped one. Yeah.
Ben Labaschin [00:08:36]: So let's start with short-term memory, though. There's short-term memory and working memory. That's really intuitive, right? That's: I am speaking to the agent, I am giving it context, it is talking to me. What is it doing right now? That makes a lot of sense. Let's go to another very intuitive type of memory, which is long-term memory. Long-term memory, also pretty intuitive. There's actually a discussion about three different types of thinking there as well, which is episodic, semantic, and procedural. Now, is that exactly how the brain works? Is that exactly how memory works? I can almost guarantee you it's not; it's not that simple.
Ben Labaschin [00:09:13]: But for working with agents, it might be very useful. Episodic can be thought of as past conversations, semantic as sort of preferences, and procedural, as you might imagine, as methods, how to do certain things. Then there's this initial one that I accidentally skipped, but that might be for the best, because it's the least intuitive, right? That is sensory memory. Not everyone agrees that this is a category for agents, but those who do, and those who build around this concept, think of it as sort of a perception layer, right? Deciding what should stay in memory versus what should not. And that actually leads us to this point here, which is: well, how do you decide what should be saved in memory? If we kept all of the information that's given to an agent, like the long, long conversations I sometimes have with Claude, for instance, that would probably be a mistake. That would be an economic mistake, because it's expensive, and it might be a computational mistake, because of the quadratic explosion of search, right? So we've got to be careful about what the agent retains. This classification in and of itself can be non-deterministic for the agent, right? We want to instruct it on what to keep, but the agent ultimately decides what's going to be kept.
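One way to make that perception layer explicit is to put a triage step in front of storage: classify each candidate memory into one of the long-term types above and score its importance, persisting only what clears a threshold. A minimal sketch, where the prompt, threshold, model, and store_memory helper are all illustrative assumptions:

```python
# A sketch of a "perception layer": classify and score each candidate memory,
# and persist only what clears a threshold. Prompt and helpers are illustrative.
import json
from openai import OpenAI

client = OpenAI()

TRIAGE_PROMPT = (
    "You triage agent memories. For the exchange below, return JSON with:\n"
    '  "type": one of "episodic", "semantic", "procedural"\n'
    '  "importance": an integer from 0 (trivial) to 10 (critical to remember)\n\n'
    "Exchange:\n{exchange}"
)

def triage(exchange: str) -> dict:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": TRIAGE_PROMPT.format(exchange=exchange)}],
        response_format={"type": "json_object"},
    )
    return json.loads(resp.choices[0].message.content)

def maybe_store(exchange: str, store_memory, threshold: int = 6) -> bool:
    # store_memory is a hypothetical persistence hook (e.g. the remember()
    # helper sketched earlier, extended with a memory_type column).
    verdict = triage(exchange)
    if verdict.get("importance", 0) >= threshold:
        store_memory(exchange, memory_type=verdict.get("type", "episodic"))
        return True
    return False
```

Note that the triage step is itself an LLM call, so the classification is exactly as non-deterministic as the talk warns: the same exchange can be scored differently on different runs.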
Ben Labaschin [00:10:43]: It's going to hit the right APIs for storage, et cetera. That leads me to this point: we know that, we've seen it in experience. If you work with Claude Code and you're about to get your conversation compacted, which is a bunch of different memory storage techniques, then you might want to rush and get as much out of Claude as possible, because it's going to forget some things, it's going to forget some details, it's going to approach the problem in a fundamentally different way than it just was with the current context window it had. I thought Joel Grus had a great meme about this the other day, which is "me finishing a task before Claude Code compacts." And I thought that was very relevant to how I think about memory, and how I think a lot of us do. In the future this might not be the case, but as AI agents stand now, this is very much the case. So there we go. So, talking about this storage challenge that we have: do you store everything? Well, different companies and different backends have different approaches, and maybe you're building your own agents now, so you might want to consider them as well. There are a few algorithms that are out there.
Ben Labaschin [00:11:55]: I mean, there are plenty, and there are plenty more coming out every day. But some that I think are relatively relevant here are importance scoring, cascading memory, intelligent compression, and vector store offloading. A lot of these can be thought of like FIFO and LIFO, or just compression in general, or, you know, just storing memory as you go and summarizing. Or you can even think of this as recommendations and user preferences. A lot of this is pretty ordinary retrieval and storage machinery that we've been using for decades. The problem is the system itself choosing when and where to store that memory, and how to do it. It's very important to keep in mind as well that this memory, as I said before, is embedded into vectors, and those vectors go into databases, et cetera. Now, there are some innovations in this space that I think are relevant here.
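Before getting to those innovations, here is what the cascading, FIFO-plus-summarization pattern just described might look like in a minimal sketch. The buffer sizes, model, and summarization prompt are illustrative assumptions; the long-term store could be the pgvector table sketched earlier.

```python
# A sketch of "cascading" memory: a bounded FIFO working buffer that, when it
# overflows, summarizes its oldest turns and offloads the summary to long-term
# storage. Sizes, model, and prompt are illustrative assumptions.
from collections import deque
from openai import OpenAI

client = OpenAI()

class CascadingMemory:
    """FIFO working memory that summarizes and offloads its oldest turns."""

    def __init__(self, store_long_term, max_turns: int = 20, evict_batch: int = 10):
        self.working = deque()                   # short-term / working memory
        self.store_long_term = store_long_term   # e.g. remember() from the pgvector sketch
        self.max_turns = max_turns
        self.evict_batch = evict_batch

    def add_turn(self, role: str, text: str) -> None:
        self.working.append(f"{role}: {text}")
        if len(self.working) > self.max_turns:
            self._compact()

    def _compact(self) -> None:
        # Pop the oldest turns (FIFO), compress them, and push the summary
        # down the cascade into long-term storage.
        oldest = [self.working.popleft() for _ in range(self.evict_batch)]
        summary = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{
                "role": "user",
                "content": "Summarize these conversation turns, keeping names, "
                           "decisions, and preferences:\n\n" + "\n".join(oldest),
            }],
        ).choices[0].message.content
        self.store_long_term(summary)

    def context(self) -> str:
        # What the agent actually sees in its prompt.
        return "\n".join(self.working)
```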
Ben Labaschin [00:12:59]: Two that I thought I'd bring up in this conversation are NER, which is very classic, and semantic caching, which is in some ways classic, in some ways not. So let's talk about NER first. NER, you know, is named entity recognition. Well, if we perform NER on the conversation that's happening, maybe we capture a lot of the important information but reduce the search space for the storage and retrieval aspects of the memory. Very helpful, right? So if we can perform NER very quickly, maybe memory becomes relatively just as effective, but with less space. And that's something that people who are building agents think about all the time, especially for power users such as myself, who might even break Claude with how much I use it. And then there's something out there called semantic caching. It's a concept that's out there, but I think Redis is doing it relatively well here, which is storing information that's frequently accessed, which is, again, a classic Designing Data-Intensive Applications technique, right? Can we store the information that's used more often, and then, instead of using the agent to go back to the vector database and grab the information in a computationally heavy way, can we use the cache? Caching is like the savior of most apps, right? If you can cache at most layers, or cache frequently used things, then your ultimate compute costs are going to go down.
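Redis ships production tooling for this (RedisVL's semantic cache, for example), but the core idea fits in a few lines: before hitting the model or the vector database, check whether a semantically similar query has already been answered. A toy in-memory version, where the similarity threshold and the embed_fn hook are assumptions:

```python
# A toy semantic cache: reuse a stored answer when a new query is close enough
# in embedding space to one we've already answered. Threshold is an assumption.
import numpy as np

class SemanticCache:
    """Reuse a stored answer when a new query is semantically close to an old one."""

    def __init__(self, embed_fn, threshold: float = 0.9):
        self.embed_fn = embed_fn      # e.g. the embed() helper from the pgvector sketch
        self.threshold = threshold    # cosine similarity required to count as a hit
        self.entries: list[tuple[np.ndarray, str]] = []

    @staticmethod
    def _cosine(a: np.ndarray, b: np.ndarray) -> float:
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    def get(self, query: str) -> str | None:
        qvec = np.array(self.embed_fn(query))
        for vec, answer in self.entries:
            if self._cosine(qvec, vec) >= self.threshold:
                return answer         # cache hit: skip the expensive model or database call
        return None

    def put(self, query: str, answer: str) -> None:
        self.entries.append((np.array(self.embed_fn(query)), answer))
```

In use, you would call cache.get(query) first, and only on a miss call the model and then cache.put(query, answer).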
Ben Labaschin [00:14:27]: So, NER and semantic caching: two ways to approach saving money and being more efficient with our memory storage and retrieval. And that kind of leads me to the economics of memory itself. I mentioned this a little bit before, but these companies are spending money to provide us these services, and some of them are spending a lot of money, which is very intuitive for all of us; we're probably very well educated in this space. But maybe one thing that's a little less talked about, I think, is the fundamentals behind the economics here, which is, you know, when memory storage and retrieval is efficient, the marginal cost shifts down on the curve, right? The marginal cost per use of the AI goes down. And that means something. That means complex tasks become more economically viable. You can have longer sessions with your agent; you don't have to compress as often.
Ben Labaschin [00:15:27]: Your marginal benefit starts to exceed your marginal cost. And so what's really important, for the people who are giving these talks, the ones before me and the ones that come after me, is that every time we add an innovation in the space of memory, specifically with AI agents, something happens. I'm going to bring us back to Econ 101. My background's in econ, so I tend to think of these things in these kinds of fundamentals, which is this: on the Y axis we have marginal value; on the X axis we can think of task complexity, number of API calls, cost, et cetera. When you are using agents to their full capacity, your marginal value is meeting your marginal cost, right? That's basically what we're seeing here.
Ben Labaschin [00:16:10]: And as we get an innovation in this space, as storage becomes more efficient, as retrieval becomes more accurate, something happens: the curve shifts. The marginal cost of engaging the AI shifts down, and that effectively shifts our marginal benefit up, which means that we can do more with our agents. This is more of a fundamentals approach to thinking about memory. But I guess the takeaway I would specify here is that when memory storage and retrieval improve, like the kinds of things a bunch of us are doing here at this event, marginal costs go down as a consequence, and then we can do more things with our agents. And I believe that means they're going to diffuse into society more as well, and they're going to diffuse into more tasks as well, because more complex tasks can be done at a lower cost. So I'll just finish up here, because I know I'm running out of time. I don't think that taking a wild guess about what agents will be in the future is particularly useful, personally, because one thing I've learned in the last few years in particular is that trying to predict the future is sort of a fool's errand.
Ben Labaschin [00:17:32]: What I will say is that, on some timeline, it seems to be the case that we'll move beyond these static LLMs that are trained once and then just used; I think we're going to go to a space where continuous learning happens. I'm already seeing papers out there that show that patching LLMs as you go might be a viable solution. So I think that'll happen, and I think that will mean, for the state of agents, that you'll start to get more personalized agents over time, which is pretty helpful and something you might like. I think we're also going to double down on this perspective of human memory and how, even though it is flawed, it can be very applicable to agents and LLMs in particular: things like compression during REM sleep, or REM sleep consolidating experience into long-term memory. I think that's going to be a space that expands over time. But I very much appreciate having this opportunity to speak about why memory is important for AI agents and the different aspects of it. You can check me out at ben.dev; the publication for O'Reilly should be out in the fall.
Ben Labaschin [00:18:39]: And if you do have any questions, feel free to drop them in the chat or just reach out to me. I'm happy to talk. I love talking to people about this kind of stuff, and I think about it all the time. So yeah, thank you so much, everyone.
Adam Becker [00:18:54]: Ben.dev is a great domain. Ben, this was absolutely wonderful.
Ben Labaschin [00:19:01]: Oh, I'm so glad to hear it.
Adam Becker [00:19:02]: I love that. We'll see if folks have questions in the chat. The other day we had a meeting in our reading group, so we have a monthly MLOps reading group if people are interested in that; by the way, I'll put that in the chat as well. And we go over a bunch of papers, and we had a phase where we mostly just focused on agentic memory.
Ben Labaschin [00:19:23]: Oh yeah, really?
Adam Becker [00:19:24]: It was also very interesting because there's all these different approaches. We read all these different papers where you think about how. So if you go back a few slides.
Ben Labaschin [00:19:32]: Yeah, of course.
Adam Becker [00:19:33]: The non-determinism of where to store the memory, and how to store it, and how to retrieve it. Can you go back? Maybe it was.
Ben Labaschin [00:19:44]: Yeah.
Adam Becker [00:19:47]: Yeah, no, just that one.
Ben Labaschin [00:19:48]: Yeah.
Adam Becker [00:19:49]: The storage decisions. No, no, the next one.
Ben Labaschin [00:19:51]: Oh, sorry. Okay. Yes. Oh yeah, exactly.
Adam Becker [00:19:53]: Storage decisions themselves are non-deterministic. The agent must decide dynamically what's worth keeping. That's so interesting. And I wonder if, like. And you know, this can and probably does evolve over time.
Ben Labaschin [00:20:05]: Right.
Adam Becker [00:20:06]: And perhaps memory that used to sit in one place, maybe right now it makes sense to move it into another, and then you move it back into the place where it's most relevant. And managing all of those transitions sounds very much like the way I imagine the brain works. Right. And I think it does. I was reading this book, I'd never heard of it before, this Why We Sleep book.
Ben Labaschin [00:20:31]: I read it. Yep.
Adam Becker [00:20:32]: And I was just going through a very intensive data engineering phase in my life when I read that. And I was like, oh, this is.
Ben Labaschin [00:20:40]: Like the same thing.
Adam Becker [00:20:42]: And, or at least, there are so many parallels, and I wonder if that's. You read that book. Did you have a similar. What was your impression?
Ben Labaschin [00:20:50]: I mean, for all that there are, you know, a lot of criticisms of the book, I found personally that the book was very informative, even at a large scale, you know, directionally, about the brain and about memories. And I also drew a lot of parallels to this conversation we're having here. I think that the routing problem of storing and retrieving memories, managing them and retaining them, is going to be one of the most important aspects of agents moving into the future. And I do not think, as I mentioned earlier, that there's going to be one, like, self-attention mechanism or whatever that's going to solve all agent memory. Because I don't think memory is like that. Just as an example, even when you try to tell yourself to remember something.
Adam Becker [00:21:37]: Right.
Ben Labaschin [00:21:38]: Like, I try to tell myself, well, I'll stop sharing, but I try to tell myself, hey, I'm studying something, remember it. That doesn't mean that I will. And I'm a conscious agent, right? So I think that this routing problem is going to be pretty difficult. And I look forward to working on those kinds of problems in the future. I think I'd be very interested in building out those types of things.
Adam Becker [00:22:03]: Yeah, and just to that point about being conscious: every time I try to remember something, at some point I realized that I think there's a trade-off between fast retrieval and fast writing.
Ben Labaschin [00:22:17]: Yeah, right.
Adam Becker [00:22:18]: And I might have gotten that from, maybe even from, like, that Designing Data-Intensive Applications book. I don't remember at what point it became very clear that this is the trade-off. And then I began to realize that this is the same trade-off that I'm operating with internally. Right. So, like, somebody can say something. Sure, I'll store it in some, like, short-term memory and then it's going to be flushed out. And if you ask me to retrieve that a week later, there's very little chance I'll be able to.
Ben Labaschin [00:22:41]: But no, very low chance.
Adam Becker [00:22:43]: But if I try to really, like, emblazon the thing and tattoo it into my brain, if I spend a lot of effort and I come up with different techniques for how to remember it, the chance increases that I'll be able to retrieve it.
Ben Labaschin [00:22:54]: And that's very similar to how I think agents work. I'm not someone who thinks that these are intelligent systems in sort of an AGI sense. But you notice, just working with an agent, that you can say, hey, retain this information, and it might not; but if you give it a structure that it's forced to work within, then it will retain that information. And I think those techniques and those structures are effective for only certain problems. Right? Like, the structure for one space, for it to recall memory and to work with a tool, is not necessarily going to be the proper structure for another one. And so, all of these strategies, prompt engineering be damned.
Ben Labaschin [00:23:36]: I think it's the structures and the methodologies for memory, ultimately, that are going to be very relevant into the future, and that we'll be perfecting for a while.
Adam Becker [00:23:46]: Ben, hearing you talk, there is a meme I want to share with you in a little bit. I'll share my screen and we can go through that in a minute. But I want to see if we might have a couple of questions in the audience. Here we have one from Brahma: can we ask the LLM to drop or clear all the context windows under a certain scenario?
Ben Labaschin [00:24:06]: Yeah, I mean, you can. I'll just go back to Claude Code, because I think it's sort of ubiquitous at this point. That isn't to say that there aren't programmatic controls for these types of things. You can do a clear, right? You say, okay, stop, clear. And that'll flush the system, or that'll compact the information, or you can just create a new instance. I think where it becomes difficult is when you're really trying to retain information. That's when it becomes a lot more dynamic than when you're just enforcing a pre-designed programmatic call.
Ben Labaschin [00:24:43]: If that makes sense.
Adam Becker [00:24:45]: Yeah, we got another one from Ricardo: in the context of memory management approaches, what have you tried that you've seen has the worst results?
Ben Labaschin [00:24:58]: Oh, that's such a good question. I love that one. Was that Ricardo?
Adam Becker [00:25:02]: Ricardo, well done.
Ben Labaschin [00:25:04]: Nice job. That's really interesting. What is the worst tactic I've used? The worst tactic, and I've been guilty of it time and again, because sometimes you're lazy (I think these things teach me to be lazy, and I need to stop that, right?), is when a problem's not working and I keep trying to tell the agent to just figure it out instead of planning. If there's one thing that I would suggest to anyone, it's that almost at all times, if your agent is not following a to-do list and a plan, then the agent is failing. You know, you're not working with it as efficiently as you should.
Ben Labaschin [00:25:45]: So that's my suggestion for you. If you're going to work with an agent and it's doing some work for you, tell it to give you a plan, criticize the plan, make sure the plan is step by step, and then, when you're done explaining, make sure it's following the plan. If it's not, you're in a poor state. So basically, pressing enter and not being engaged with the agent, that's the bad strategy, my friend.
Adam Becker [00:26:09]: Nice, Ricardo. We hope you're satisfied; let us know otherwise. You know, Ben, the reason I'm grappling with these memory considerations right now is because, and I think you mentioned something similar to this, it depends on the size of a conversation. Current conversations can be just multiple sentences, or they could be much longer. We're building a platform to help facilitate better political conversation, some of it in particular about Israel and Palestine. And we're having people speak sometimes for 15 hours a day, like a 15-hour conversation, and we have a bunch of different people.
Adam Becker [00:26:44]: And now I'm feeding that context to LLMs to make a lot of decisions about how to move the conversation forward in more interesting ways. But now I start to bump up against limits. Basically, the more context I feed into it, the more it just gets distracted. Okay, so now I need to be much more precise about what I'm actually feeding into it. And that is, again, a dynamic consideration, because it's very context dependent: what's actually happening in the conversation right now that's worth keeping? And that depends on what the AI thinks should be the next steps in the conversation.
Adam Becker [00:27:19]: And all of that is still open-ended. We're trying to figure it out.
Ben Labaschin [00:27:25]: You know what I would do? I mean, not what I would do, but one interesting idea that comes to my mind, which doesn't mean it's the right thing to do, is, you know, it depends on what the goal is for the agent. Right? One thing I found in the research, and in a lot of people's experiences: if you have many different agents focused on different things, and each is really good at one thing, and then you have a central planner, or maybe a few central planners, it tends to get better results for what you're trying to get out of it, versus if you have one agent, in this case, that's containing all the information for all the conversations; that starts to become difficult. So if you can split up that context and say, oh, this topic is for this agent, this topic is for this agent, this nuance is for this agent.
Ben Labaschin [00:28:07]: Then suddenly each one is just good at that one aspect, feeds up to the central planner, which then feeds out to the user.
Adam Becker [00:28:14]: Right. So perhaps you might even store those memories separately, by different agents. So you can have, like, one giant vector store, but it's really kind of partitioned into subspaces.
Ben Labaschin [00:28:26]: That's exactly what I'm thinking. You partition the data and the memory based on context, and then the central agent basically says, oh, this topic goes to this agent, this topic goes to this agent, because the memories go there. And then you still have to do that kind of compression and storage, but it's only stored relative to, you might even think of it as, its relative latent space or its local maximum or something.
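A minimal sketch of that partitioning idea: a central router classifies each exchange into a topic, and each specialist agent reads and writes only its own partition, which could be a namespace in a single vector store. The topic list, classifier prompt, and in-memory partitions dict are illustrative assumptions:

```python
# A sketch of partitioned memory: a central planner routes each exchange to a
# topic partition, and specialists only ever see their own partition.
from collections import defaultdict
from openai import OpenAI

client = OpenAI()
TOPICS = ["history", "current_events", "personal_stories", "logistics"]  # illustrative

def route(exchange: str) -> str:
    # Central planner: pick exactly one topic partition for this exchange.
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f"Classify this exchange into exactly one of {TOPICS}. "
                       f"Reply with the topic name only.\n\n{exchange}",
        }],
    )
    topic = resp.choices[0].message.content.strip()
    return topic if topic in TOPICS else "logistics"   # fall back to a default bucket

# Stand-in for per-topic vector store namespaces.
partitions: dict[str, list[str]] = defaultdict(list)

def store(exchange: str) -> None:
    partitions[route(exchange)].append(exchange)

def context_for(topic: str, k: int = 20) -> str:
    # A specialist agent only ever sees its own partition.
    return "\n".join(partitions[topic][-k:])
```

The compression and storage steps sketched earlier would then run per partition rather than over the whole conversation.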
Adam Becker [00:28:51]: I'll try it out and let you know. I'll keep you posted on how this works.
Ben Labaschin [00:28:54]: Yeah, it'd be really cool to hear.
