MLOps Community

A-MEM: Agentic Memory for LLM Agents // April Reading Group

Posted May 01, 2025 | Views 28
# Agentic Memory
# LLMs
# AI Agents

SPEAKERS

Arthur Coleman
CEO @ Online Matters

Arthur Coleman is the CEO at Online Matters. He previously held roles including VP of Product and Analytics at 4INFO.

Adam Becker
IRL @ MLOps Community

I'm a tech entrepreneur and I spent the last decade founding companies that drive societal change.

I am now building Deep Matter, a startup still in stealth mode...

I was most recently building Telepath, the world's most developer-friendly machine learning platform. Throughout my previous projects, I had learned that building machine learning powered applications is hard - especially hard when you don't have a background in data science. I believe that this is choking innovation, especially in industries that can't support large data teams.

For example, I previously co-founded Call Time AI, where we used Artificial Intelligence to assemble and study the largest database of political contributions. The company powered progressive campaigns from school board to the Presidency. As of October 2020, we had helped Democrats raise tens of millions of dollars. In April of 2021, we sold Call Time to Political Data Inc. Our success, in large part, is due to our ability to productionize machine learning.

I believe that knowledge is unbounded, and that everything that is not forbidden by laws of nature is achievable, given the right knowledge. This holds immense promise for the future of intelligence and therefore for the future of well-being. I believe that the process of mining knowledge should be done honestly and responsibly, and that wielding it should be done with care. I co-founded Telepath to give more tools to more people to access more knowledge.

I'm fascinated by the relationship between technology, science and history. I graduated from UC Berkeley with degrees in Astrophysics and Classics and have published several papers on those topics. I was previously a researcher at the Getty Villa where I wrote about Ancient Greek math and at the Weizmann Institute, where I researched supernovae.

I currently live in New York City. I enjoy advising startups, thinking about how they can make for an excellent vehicle for addressing the Israeli-Palestinian conflict, and hearing from random folks who stumble on my LinkedIn profile. Reach out, friend!

Nehil Jain
MLE Consultant @ TBA

Hey! I’m Nehil Jain, an Applied AI Consultant in the SF area. I specialize in enhancing business performance with AI/ML applications. With a solid background in AI engineering and experience at QuantumBlack, McKinsey, and Super.com, I transform complex business challenges into practical, scalable AI solutions. I focus on GenAI, MLOps, and modern data platforms. I lead projects that not only scale operations but also reduce costs and improve decision-making. I stay updated with the latest in machine learning and data engineering to develop effective, business-aligned tech solutions. Whether it’s improving customer experiences, streamlining operations, or driving AI innovation, my goal is to deliver tangible, impactful value. Interested in leveraging your data as a key asset? Let’s chat.

Matt Squire
CTO and Co-founder @ Fuzzy Labs

Matt is CTO and co-founder at Fuzzy Labs, a consultancy dedicated to using MLOps to help technical teams get the most out of AI and ML. He enjoys AI, bio-inspired computing, and functional programming.


SUMMARY

This paper introduces a novel agentic memory system that dynamically organizes knowledge—going beyond traditional methods by linking memories contextually, adapting over time, and evolving as new information is added. Inspired by the Zettelkasten method, this system allows LLM agents to build a structured yet flexible network of past experiences, improving their ability to tackle complex real-world tasks.


TRANSCRIPT

Arthur Coleman [00:00:00]: My first time being a moderator, so I'll ask your patience as I learn the ropes. I've already learned a few things at this point, like making sure I get the right host link to start the show. But we're there, and I want to talk about what we're going to talk about today and introduce our speakers. We are talking about agentic memory, A-MEM, for LLM agents. I wish I had spelled that correctly. I think it is a very important paper. Is someone else admitting people or am I doing that? Okay, so in reading it, I found it exceedingly interesting.

Arthur Coleman [00:00:42]: I'm not going to bias the discussion here, but I know that we had some interesting discussions on it in the Reading Group Leaders Slack channel and I think you're going to find it quite fascinating when we get into the discussion. Our speakers today are Adam Becker, who is the COO of the MLOps Community, but he's also founder of Headon, which is an AI for political discussion. I need to understand that better, Adam. Matt Squire, who's the CTO of Fuzzy Labs, but even more importantly the author of the MLOps W newsletter, which I read and will tell you is very interesting. You should sign up for it. And then Nehil Jain, who I can't tell you what he does because he's founding a stealth startup, but he is a fellow of On Deck and he was previously a technology expert at McKinsey, and I hope he was an expert in AI, but you never know. It doesn't say that. So great speakers.

Arthur Coleman [00:01:40]: Our agenda today: Adam's going to go over the historical context for A-MEM, the background as to how we got here, because I think it's really important to understand the prior art in this space, what its limitations were, and what A-MEM was trying to accomplish. Second, we're going to have Matt talk about the basic structure of an A-MEM memory and its link generation, because the core of A-MEM is effectively a super token or meta token that is a memory. And then Nehil is going to talk about the results of the work and some commentary on it. And then we're going to get into the discussion. Before we go, there are some things we ought to talk about first, and I'll always start with this. First of all, these sessions, they belong to you, okay? They're intended that way. The presenters will present for 35 minutes and then we will discuss for 25 minutes. And the more you participate, obviously the better the outcome. So some rules of the road.

Arthur Coleman [00:02:43]: This is a no judgment zone. It's a safe space to learn and there are no dumb questions. And trust me, I need this myself because I'm the guy usually asking the dumb questions. Put your questions in the chat. This is my first time moderating this way, so I have to pick up the questions from the chat. I'll be putting them in Obsidian, my Zettel... I can never remember the term. Zettelkasten.

Arthur Coleman [00:03:09]: It's a new note-taking capability which, as a result of reading this paper, I started to try to get into, and I had some interesting experiences with it. As the discussion occurs, if we're interactive, raise your hands. There's the ability to do that in Zoom and I will call on people to make comments. And with this I'm going to turn it over to Adam. Adam, I'm going to stop sharing and put it over to you.

Adam Becker [00:03:36]: Thank you very much, Arthur. Matt, do you want to take over? Do you want to start with context and then I'm going to take over once you're done?

Matt Squire [00:03:43]: Yes, the notes slightly contradict what we thought we were doing, but that's not a problem, actually. I'll do the background and context and then I'll hand over to Adam, who's going to talk about some of the implementation and the code as well. Let me just share my screen. I just want to be able to present the paper more than anything. So hopefully everybody can see that. And as always, I'm happy to take questions and thoughts and comments as I go. Please feel free to interrupt me, let me know

Matt Squire [00:04:23]: if someone can't see my screen. Okay, so I suppose the big context for what we're looking at here, the idea of giving agents memory... I think to be able to think about that, we need to take a little bit of a step back and think about some example applications. One of the things that we're looking at internally, and I'm just going to use this as a motivating example, is as follows. Imagine that we are... Imagine you're a... Wow. Hello everybody. Am I still here?

Arthur Coleman [00:05:22]: Yes, you're still here, but you were cutting in and out.

Nehil Jain [00:05:26]: I was not sure if it was me or it was Matt, maybe.

Arthur Coleman [00:05:30]: Matt, go on mute. Take your video off.

Matt Squire [00:05:33]: No, I just completely disconnected, actually. Apologies for that. All right, let me start again from the beginning then because I'm not sure how much everyone caught of that. I think I was talking about motivating examples. So the key thing here is that we need a motivating example in order to be able to think effectively about agents and moreover about memory for Agents. One example that we've been looking at internally that I want to use is as follows. Imagine that you are a DevOps engineer or an SRE engineer, and you are responsible for a vast amount of infrastructure in production. You have different web services, backend services, databases.

Matt Squire [00:06:16]: They live in different cloud environments, they have replicas, they're spread all over the world. And the key is that you're responsible for keeping all this stuff up and running. I think as MLOps engineers, this is something that. I'm sorry, I'll reshare the screen in a second once I come back to the paper. Apologies.

Arthur Coleman [00:06:34]: I think that's what's called a crash. If you use the wrong version of Zoom, the 32-bit on a 64-bit operating system, it will crash.

Matt Squire [00:06:45]: Interesting. Well, I'll tell you what. Well, I'll share my screen again in a moment and we'll see what happens in that case. Yeah. So you're responsible for keeping all this stuff up and running. Now, one thing we've been looking at is can we replace that task with an agent? So can we build an agent? Which, let's say, once it becomes aware that a particular system has crashed, its job is to go out and inspect the logs for that system and maybe come up with a solution. Maybe it looks at the logs, it finds some errors, it goes and looks at the code and it comes back and it suggests not only the cause, but also perhaps a fix for that as well. It says, maybe if we change this code here or scale this service in this way, whatever it might be, it comes back with a suggestion.

Matt Squire [00:07:35]: Now, that kind of system would be all the more interesting if. If it can learn about the infrastructure that it's managing as it goes along. Imagine then you can say, well, this particular service, maybe the, I don't know, the authentication service crashed and I went out, I discovered a particular error. It was because we didn't have enough memory provisioned on the node that it was deployed to something really simple like that. So, as an agent, I suggested we increase the memory. I ran that by a human expert. The human expert said, yes, approved, go ahead and do it. So the agent has done a bunch of actions, but it's also learned something.

Matt Squire [00:08:21]: The idea then is, well, can we store that somewhere? Can we remember what it did and what the solution was and what the human feedback was, and then next time around, use the same information. So that's kind of the context of all of this stuff comes from. So I am going to try to share my screen again and I'm hoping it doesn't crash. I'll just do a full screen share and we'll see how we get on. So can everyone still see my screen? Can everyone still hear me? That's the key thing here. Marvelous. Good. So that's the kind of background that I wanted to give just to sort of motivate some of this stuff.

Matt Squire [00:08:59]: Now we come to the paper. So they start out by talking about the general broad capabilities of large language models and the advancements in the field. This stuff is familiar if you've been following large language model development. If you haven't, the key thing here is to note that they talk about the integration of LLMs with tools, so the use of tools. But they highlight a challenge: we need memory to provide that long-term ability to interact with an external environment. The DevOps agent that I described is exactly that kind of case, where it needs to work over months or maybe years and it needs to be able to build up a body of knowledge about the environment that it's operating in and the system that it's interacting with. There are existing ways for LLM agents to store information. There are existing solutions for memory and they talk about some of those.

Matt Squire [00:09:58]: So they talk for instance about, and you can kind of look into the details of each of these solutions yourselves, but they talk about this thing called MEM0. They talk about graph databases. The idea of a graph database is that we have a bunch of entities and we store the relationships between those entities. That's quite interesting and quite powerful and it's closely analogous to ultimately what they propose in this paper. And they talk about some limitations. So graph databases provide a structured organization for memory systems, but they rely on predefined schemas. And the authors claim that similarly other systems that they've looked at rely on predefined schemas. And that's up here somewhere, I'm not entirely sure where, but the common theme they're pulling out when they critique existing solutions is that reliance on a predefined schema and the lack of flexibility that that gives us.

Matt Squire [00:10:59]: Ultimately, they want to move away from these rigid structures. They want to create what they call a universal and flexible memory system that allows an LLM agent to have long-term interactions in complex and evolving environments. So crucially, what they're kind of after here is: can we have something that is almost akin to how humans learn things? You know, I'm interacting with my environment, I've learned facts about the environment, and some of those facts will relate to others. I start to build in my head a set of relationships between different facts. You know that the authentication system has crashed twice in the past because we didn't have enough memory. That sometimes we see this particular error in the log of some other service. But actually it's not a problem.

Matt Squire [00:11:58]: The DevOps engineers have told me that this is fine. It's actually not a critical issue. Don't worry about it. Or maybe we found that often on a Friday night the service becomes overwhelmed with traffic and there's a need to scale it. So we start to build up all these little facts and connections between the facts that represent our overall knowledge of the environment that the agent is operating in. So their approach, which they call A-MEM, they claim enables dynamic memory. Dynamism means in this case that it can evolve over time, it can adapt to changes in the environment, it can adapt to things that we don't currently know, but may know in the future. We don't impose a structure, we allow...

Matt Squire [00:12:46]: And we'll get into this when Adam talks about the implementation. What's really interesting about this is we basically allow the large language model to specify what information is relevant in a particular piece of memory, but we don't impose a particular structure or a particular set of expectations over what, what that memory looks like. So we end up being able to store text and tags and as you'll see, embeddings that connect to these different things. But we're not imposing anything more than that. We're trying to be as general purpose as possible. So for each new memory, they construct comprehensive notes. They have these text attributes that attach to the memories. They also have embedding vectors that they store with them.

Matt Squire [00:13:34]: So they can ask: which memories in my system are similar to a particular one? And they can, over time, every time they learn something new, update everything else. They can go and update existing memories. And that's quite useful, right? Because, I don't know, there are classic examples, like you look at a change of government, for instance; the President changes. And if that happens, then it's necessarily the case that other members of government change at the same time. So if you've got a bunch of disconnected facts, like the President is so-and-so and the Defence Secretary is so-and-so and so on, you need to update all of those facts at the same time if one of those pieces of information changes. So they're kind of looking at that. Their approach is inspired by a system of knowledge organization called Zettelkasten, which I'm told is

Matt Squire [00:14:34]: a German word. The first part means note and the second part means box. The Zettelkasten method, the idea is that you organize knowledge into very atomic notes, very discrete, small pieces of information. So you might have one note that just says who the president is, or what the solution to a particular error on this particular server is. And then you also tag those notes. You kind of have some general way of categorizing the notes. You put related notes together, you organize them together in boxes, which is hence the box component of that. What they want to do is take that and almost say: what if we give an LLM the ability to use the Zettelkasten method in order to organize its knowledge? And as I say, what they're going for ultimately is something that resembles how humans learn and how humans build knowledge.

Matt Squire [00:15:35]: Now, let's just see if there's anything else that is important to cover here before I hand over. So I've talked about the update mechanism, but we'll look at that in more detail in the next section. That's crucial, because we're not just storing new pieces of information each time; every time we store new pieces of information, we are also updating what else we know. And they talk about how they conduct evaluations as well, finally. They want a standard baseline for measuring how this memory system impacts performance on certain tasks. I think that probably covers enough of the background and the motivations behind this. Are there any questions from the other panelists, or anything I might have missed here?

Adam Becker [00:16:32]: Not on my end, no. I think that's great. Great context.

Matt Squire [00:16:36]: Marvelous. In that case, I will, if I can get hold of my mouse, stop my share and hand over to the next presenter.

Adam Becker [00:16:47]: Thanks, Matt. I think one thing that was interesting is you said, like, perhaps we can organize memories the way that humans organize them. I don't know if it means how humans organize knowledge and information or if that's how human memory is organized. And I think, Arthur, this is your question in the chat too: is this actually a reflection of what we currently do sort of naturally, or is it a very formulaic and methodical exposition of how we should be doing things? And I wanted to dive a little bit into this because I don't know if it happened to you guys, but a few years ago, Zettelkasten just went viral again. It might have these, like...

Arthur Coleman [00:17:32]: Right.

Adam Becker [00:17:32]: Like these cycles. And a few years ago a bunch of people texted me and they're like, hey, there are all these different startups and everybody's trying to organize notes and memories in like a Zettelkasten type of way, and now there's some AI that's helping here and there. It became like a big thing. I remember that was the first time I heard about it. And now reading this paper, I realized, oh man, I think I drew the same connection.

Matt Squire [00:17:53]: Okay.

Adam Becker [00:17:53]: I think it was the same method that went viral a few years ago. If that's the case, let me do a little bit more homework about it and I'll share my screen.

Arthur Coleman [00:18:01]: But Adam, while you're doing that, I would argue, having tried the Zettelkasten note-taking capability over the last week, that it's really good for AIs. It's not necessarily as well suited for humans.

Matt Squire [00:18:15]: There are two things that come to mind on this as well. One is that I'm not convinced it really does reflect how humans intrinsically store memory. Maybe it's closely related, but I did wonder about, like, if you started to think about weighting the links (this will make more sense when we look at the implementation), if you started to think about weighting the different links, that maybe gets a little bit closer. The second thing though is we are not typically conscious of how we're storing memory, whereas here we're asking the LLM to be conscious of it, to be cognizant of it.

Nehil Jain [00:18:49]: Yeah, there's also the long term, short term, which I don't think exists in this pattern at all.

Matt Squire [00:18:53]: Yeah, indeed.

Adam Becker [00:18:55]: Yeah. There's long term, short term. The other thing that doesn't exist here is almost like the value, the incremental value of memory, in the sense that, oh, storing this is making me more efficient and more productive in a certain way. And therefore there is no reinforcement learning type of feedback loop that you'd sort of expect to happen, which I imagine happens with human memory, right? Where memory that is just so totally useless might get flushed out and memory that I rely on constantly might be a little more emboldened. I think they might be getting at that a little bit with the count, the retrieval count, but we can talk about that. So I tried to figure out a little bit more about what the history of this Zettelkasten method is. And if you've ever seen actual historians doing work, it always looks really messy and they always have these little index cards, especially if you go to libraries.

Adam Becker [00:19:48]: I remember spending some time in the Library of Congress, and they would have these boxes specifically for different index cards that people write in order to categorize and reference all of their thoughts. And I didn't quite understand it, but it turns out there's a very fascinating history to this entire practice of coming up with these little cards. And I think that if we understand those cards, we'll get a really good sense of the intuition behind what it is that these guys are doing with A-MEM. So I'll just go through it very, very quickly. Sixteenth to eighteenth centuries: at some point we move away from notebooks to just these little slips of card. And at some point they start building these cabinets that are very specific for cards that have indexes, so that you can then look up by index what those cards are.

Adam Becker [00:20:31]: Carl Linnaeus comes up with a standardized 5-by-3-inch slip. And then, nineteenth to mid-twentieth centuries, you see a bunch of books about how to organize knowledge, especially for history research, and then more and more for biology research. And they say exactly how you should do it. And so it becomes much more formulaic. And then you could see it in, like, Niklas Luhmann, who created a 90,000-card Zettelkasten with unique branching IDs. So at some point they come up with IDs.

Adam Becker [00:21:01]: At some point they come up with branching. At some point they come up with, I have a card and that card should reference other cards. And maybe I come back and I erase a particular reference and then I come back to it. So this is kind of like the intuition behind this whole thing. I thought it was very, very interesting. You can see some diagram of this here. This is some version of this Zettelkasten. The idea is very simple.

Adam Becker [00:21:21]: Let's say we start with just a card and we write the idea on it, right? So we write an idea, we give it an ID, let's say ID one, and then we also tag it. Let's say it's about computing. And then we come up with another idea, okay, idea two. It's also computing, but maybe about the environment. And then we come up with another idea, but that idea is actually related to idea A, but related in a nesting type of way.

Adam Becker [00:21:50]: So it is hierarchical, right? So instead we don't just categorize it as one, we don't identify it as one; we say it's 1/1, right? So it's almost like a sub-memory or sub-note to that parent note. And you could continue to do it and cascade down like a hierarchy. And so this one, let's say idea G, would be 1/1/1. And then this one is related to idea E, right?

Adam Becker [00:22:14]: And so you can write here, okay, it's related through 1/4, 2. So this is the ID of the other idea that it's related to. And then you can have a very large kind of referential index by tag. So you say, oh, if you're interested in environment, okay, find this ticket and that ticket and this ticket and that ticket. So is the intuition here clear? Does that make sense to everybody? Right. So that's sort of the inspiration behind how they're going to go about creating A-MEM. Their innovation is twofold. One: link generation.

Adam Becker [00:22:50]: So a new memory comes in, and let's establish connections between this memory and other memories. We'll figure out how they do it, but that's the idea: you get a memory, you embed it. And two: let's evolve the complete memory structure, because with the introduction of this new node, this new memory, perhaps now we can see the other memories that are related to it in a different light. So perhaps it isn't that the idea itself changes.

Arthur Coleman [00:23:21]: Right?

Adam Becker [00:23:21]: You're not gonna go and erase the idea. I don't think. I don't think that's what they're doing. The idea remains the same, but maybe the other references, and maybe there's some more metadata and maybe there's some more. Maybe there's some other stuff that we're sandwiching this idea around. Right? And that can change. And so essentially what you're seeing is that those memories begin to evolve. There's another concept.

Adam Becker [00:23:44]: I think it's buried deep in their code, but I haven't seen it actually implemented, where some of these ideas can then merge, or you can prune and delete certain connections. I don't think that they've actually implemented this, but they sort of pay lip service to it. So that's the idea: come up with a note, come up with a memory, learn what it's related to, and then see whether any of those relations need to change and evolve in some way. Once that happens, then you can use it in a much more interesting way, because you can now engage with an agent or with an LLM, and that LLM is retrieving only very relevant memories, memories that have evolved over time based on the introduction of new memories. So that's kind of the general concept here. Also, maybe let me plant one flag about a distinction that perhaps we can explore in the discussion: the difference between knowledge and memory.

Adam Becker [00:24:43]: Because if you think about this, this isn't really memory. It's not like, oh yeah, I remember when I was five, I ate ice cream. This isn't that kind of memory. It's more about, oh yeah, this king invaded that other kingdom, something relatively specific. What is the connection between memory and knowledge? I think there's a lot more to be said about this. But for now, I think maybe one way to reflect on this might just be...

Adam Becker [00:25:12]: Let's start with an example. So this might be the content. Content is the raw input that we get from the interaction with the environment. It could be a part of a conversation, right? So it could be something like this: Hey, Calvin, long time no talk. A lot has happened.

Adam Becker [00:25:28]: I've taken up photography and it's been great. Been taking pics of the scenery around here, which is really cool. That's it. That's going to form the seed of that memory. Now what we do is we feed this into an AI and we say, hey, give me some context about this. So essentially the AI is taking this and adding a little bit more semantically rich and interesting discussion around it. So the context might be: the main topic is the speaker's new hobby of photography, highlighting their enjoyment of capturing local scenery, aimed at engaging a friend in conversation about personal experiences.

Adam Becker [00:26:02]: A little bit higher level, a little bit more standardized. In addition to that, it comes up with tags. So in the same prompt we asked the AI, we said, also give us a bunch of tags that are relevant to this content. For tags, it gives us: hobby, photography, personal development, conversation, leisure. We also say, give us some keywords, some other nouns that might be useful in understanding what this is about: so, photography and scenery. We then add a couple of other things like metadata, like the time, when was it, as well as the related memories.

Adam Becker [00:26:35]: Related memories, for now, is just going to be blank because we've just received this piece of content. We create a memory out of it, right? So we made a call to the AI, enriched the memory, and now we're going to do a bunch of different things with it. The first thing that we're going to do is an embedding of all of the textual components: the content and the context and the memory tags and the keywords. Just come up with a long vector and embed it in some space and collect all of the relevant nearest-neighbor memories. So this is a very rough cut. We're taking one memory, we've enriched it with AI, now we've embedded it, and now we're saying, okay, given that vector, give us all of the other ones that are close to it. You could get 5, you could get 10, you could get 20. Nehil, you might be talking a little bit about the differences there in the K, what should be the value that you're going to keep as the nearest neighbors.

Adam Becker [00:27:35]: But because we've only done this type of rough cut, think about it, we're not actually going into the nuances of the relationship between one concept and the other concept. Really, all we know is that they're close by in that high-dimensional space. Right. So then we take all of these memories and we feed those into an AI and we say: examine each of those memories. These are related memories. Are any of these actually good? Should we form links? And if we should form links, go through each one of those memories and potentially update their properties. Which properties? These: the context, the tags, the keywords.

Adam Becker [00:28:18]: I don't think they're changing the content again. I think the content ends up remaining the same. This is kind of the seed kernel for the memory, but that's I think the entire process. You take content, enrich it with some more data, embed it, collect a bunch of similar memories, examine each one of those memories in finer detail and see whether or not you should zoom in and modify any of its properties. Okay, so far so good.

Nehil Jain [00:28:48]: One thing that came to my mind when you were presenting this example was the tags piece. So last year I took my huge Notion notes database and I was like, this is all over the place, can you just categorize it by tags? And I found it very, very hard for LLMs or any system to know the right ontology for an essentially infinite space of content. Did they talk about how to constrain it, or how to know what the right tags are? I think that is an unsolved problem. You can just randomly create a lot of tags and it might actually cause more harm than good for this kind of system.

Adam Becker [00:29:26]: I think they just don't talk about it. But you could see, look at this: several broad categories or themes for classification, including domain, format, and type tags; at least three tags, but don't be too redundant. That's as much as they've given. Same thing for keywords: several specific, distinct keywords that capture key concepts and terminology, ordered from most to least important.

Adam Becker [00:29:53]: Don't include keywords that are the name of the speaker; at least three keywords, but don't be too redundant. That's it. In my mind, I'm right there with you. I wouldn't say this is a missed opportunity; these guys have to publish a paper, right? But there's a lot of research opportunity here, and I do think that this should ultimately be bound up with some reinforcement learning type of thing, because the keywords are likely to continue to change and evolve and you should be mindful of them. Okay, so we can see the whole thing diagrammatically real quick. There's the environment interacting with the LLM agent in some way. We get a conversation out.

Adam Becker [00:30:32]: Can you help me implement a custom cache system for my web application? Send it to an LLM. Create that note with the context and the keywords and the tags and all of that stuff. Retrieve from the memory bank: perhaps you've already embedded a bunch of other memories in your vector database. You keep the top-k nearest ones and then you analyze each one of those; insofar as you need to form connections, update those connections, and insofar as you need to evolve any one of those, actually go through the evolution. What does evolution mean again? It means changing any of that other kind of metadata: change the tags and change the keywords and change the context. So that's the idea. Now what happens during retrieval? A query comes in, right? So let's say now I say, hey, do you remember what I wanted to do? Or even...

Adam Becker [00:31:30]: Let's imagine that I'm referring to something in the cache system. Can you help me implement a custom cache system for my web application? It needs to handle both memory and disk storage. Afterwards, I'm going to ask it a different question. It's going to retrieve a memory that is related to cache retrieval, all these different things, and then it's going to feed that into the AI as context. Okay. There's a lot of interesting things that you could do about this retrieval mechanism right here. And we'll see if we have time to get into it. But we could be a little bit more formal about it.

Adam Becker [00:32:03]: When an agent interacts with its environment, we construct structured memory notes that capture explicit information and LLM-generated contextual understanding. So m sub i, really, is just a set. c sub i is just the original content; then there's the timestamp, the keywords, the tags, and the contextual description. And then this is going to be the embedding vector, right? So we're going to take this and we're going to embed it, and then we're going to have links to all of the other memories. So this is the definition of memory. We don't need to be too fancy about it. You can just see it here.
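For reference, the note definition Adam is reading off the slide can be written as a set

m_i = {c_i, t_i, K_i, G_i, X_i, e_i, L_i}

where c_i is the original content, t_i the timestamp, K_i the keywords, G_i the tags, X_i the LLM-generated contextual description, e_i the embedding vector, and L_i the links to related memories. The symbol names beyond c_i follow the talk's ordering and are a best-effort reading, not a quotation of the paper.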

Adam Becker [00:32:39]: This is a memory note, which is a class they've defined. You can feed in, you must feed in, content; everything else gets generated on its own. Right? So we have keywords, links, retrieval count (Nehil, I think this is what we were talking about), last accessed. Perhaps if it was accessed just very recently, perhaps you might want to surface it. Not sure. Okay, context and category and tags. Then the context generation, okay, we've sort of seen this a little bit.

Adam Becker [00:33:08]: Just a prompt: generate a structured analysis of the following content by identifying the most salient keywords, extracting core themes, creating relevant categorical tags, and then we tell it how to format the response: keywords, context, tags. And then they sort of enforce that schema when they call the OpenAI API. We could see that if we want. You create the embedding vector, right? It's just a concatenation of all of the textual components, and then you pass that to however it is that you're embedding. I think in this case they're using Chroma, so ChromaDB, and they're just embedding it there. But they have a couple of different versions of this. Okay, now we have the agentic memory system; that's kind of the big engine that does all of that stuff.
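A minimal sketch of the memory-note fields just listed, where content is the only required input and everything else is generated or defaulted. Field names are assumptions based on the talk, not the repository's exact class:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List

def _now() -> str:
    return datetime.now(timezone.utc).isoformat()

@dataclass
class MemoryNote:
    content: str                                       # raw input, the only required field
    context: str = ""                                  # LLM-generated contextual description
    keywords: List[str] = field(default_factory=list)
    tags: List[str] = field(default_factory=list)
    category: str = ""
    links: List[str] = field(default_factory=list)     # IDs of related memory notes
    retrieval_count: int = 0                           # bumped each time the note is retrieved
    timestamp: str = field(default_factory=_now)       # creation time
    last_accessed: str = field(default_factory=_now)   # updated on every retrieval
    embedding: List[float] = field(default_factory=list)
```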

Adam Becker [00:33:56]: You can specify what retriever you want to use, and there are a couple of interesting options for the retriever; we can get into that. So, okay, add note. This is literally what it looks like. You create a memory note, feed in the content, the timestamp, perhaps other keyword arguments if you already have them. So that's the note. You process the note, you process the memory, and there's an evolution label, like: should we evolve, yes or no? And then you get back the note, and then you add it. This is the embedding part: you're now going to take the document itself and add it to your vector database, so that we have that also available for future retrievals.

Adam Becker [00:34:41]: And every now and then you see this little mod here. Every now and then you can consolidate the memory. They don't do it every single time, but let's say every five times, every ten times. And until then, they kind of store things in memory (the memories are held in memory here), and then they flush them into the database, probably just for efficiency. Okay, this is the retrieval piece.
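A hedged sketch of the add-note flow just described, reusing the MemoryNote sketch above: create the note, run the LLM link/evolution step, and only flush to the vector database every few additions. Class and method names are illustrative, not the authors':

```python
class AgenticMemorySystem:
    def __init__(self, retriever, process_memory, consolidate_every=5):
        self.retriever = retriever              # wraps the vector DB (e.g. ChromaDB)
        self.process_memory = process_memory    # LLM link-generation / evolution step
        self.consolidate_every = consolidate_every
        self.pending = []                       # notes held in memory between flushes
        self.notes = []

    def add_note(self, content, **kwargs):
        note = MemoryNote(content=content, **kwargs)
        note, evolved = self.process_memory(note)   # returns the note plus an evolution label
        self.notes.append(note)
        self.pending.append(note)
        # Consolidate only every N additions (the "little mod"), then flush.
        if len(self.pending) >= self.consolidate_every:
            for n in self.pending:
                self.retriever.add_document(n)
            self.pending.clear()
        return note
```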

Arthur Coleman [00:35:06]: Oh yeah, Adam, we're a little long. It's 8:38.

Adam Becker [00:35:10]: I think we're basically done. I think we're basically done. So give me one more minute, let me wrap this up. When you find relevant memories, you just actually make a call again to the search. So you're calling search with the query, the initial query. And this is the evolution itself: you're just saying, given all these memories, go about evolving them; you can see how they actually do it. And then you have the suggested connections here. Again, these are just memory IDs, and I think we're essentially done.

Adam Becker [00:35:46]: This is just the actual implementation of what it looks like. If you're interested, you can see it in their code base. They've open sourced it, but it's not very clean. Then you embed it again, and then for retrieval you get the query and you get your top documents out, either through cosine similarity or by a combination of cosine similarity and just actually using the keywords and seeing if the keywords match. So you're doing some kind of weighted average of both; it isn't just the embedding. So they test out different things and they have all of the code for that. That's all I got.
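A rough sketch of that hybrid scoring idea: a weighted combination (alpha) of embedding cosine similarity and a simple keyword-overlap score. This is an illustration of the technique being described, not the repository's exact retrieval code:

```python
import numpy as np

def cosine_similarity(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def keyword_overlap(query_keywords, note_keywords):
    q, n = set(query_keywords), set(note_keywords)
    return len(q & n) / max(len(q), 1)

def hybrid_score(query_vec, query_keywords, note, alpha=0.5):
    # alpha weighs how influential the embedding similarity is versus keyword match.
    semantic = cosine_similarity(query_vec, note.embedding)
    lexical = keyword_overlap(query_keywords, note.keywords)
    return alpha * semantic + (1 - alpha) * lexical
```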

Arthur Coleman [00:36:30]: That was a fabulous summary. So Nehil... I mean, really well done, Adam. Thank you. Nehil, you're up with the results. Then we'll go into questions.

Nehil Jain [00:36:43]: Yeah, yeah, let me share my screen. Hopefully this works correctly.

Arthur Coleman [00:36:58]: And again, please put your questions if you haven't in chat. I've been capturing all of the comments and questions and turning comments into questions. So keep going guys.

Nehil Jain [00:37:09]: Cool. So you guys can see my screen?

Adam Becker [00:37:14]: It looks like a gradient to me.

Nehil Jain [00:37:16]: Yeah. Okay, as long as you can see something.

Adam Becker [00:37:20]: Yeah.

Nehil Jain [00:37:20]: So when they were comparing their results with the other things that both Matt and Adam mentioned, these were some other options they could have looked at for comparing memory. So I wanted to take stock of what's going on before we look at the metrics and what the results were. They used the LoCoMo dataset, which is a dataset built specifically for this purpose: very, very long chats, like 35 sessions and around 9K tokens per session. So it's a lot of conversation that they need to retrieve from; otherwise, given how much context sizes are increasing, you wouldn't even need memory. Then the first baseline is basically: just take it and give it to the LLM without memory, and shove in as much as possible. I'm pretty sure you can't do that with all LLMs, but for the ones where you can fit the tokens, just put them in and then trim whatever is left. Then they had another technique, ReadAgent, which is basically doing RAG, skimming through the different pages and then giving chunked summaries of the different memories, but still giving all of it. And then MemoryBank is one of the things that we were discussing.

Nehil Jain [00:38:36]: I think Adam asked that question and I responded after reading this: one thing they're not doing even in A-MEM is what MemoryBank was doing, where you forget older memories because maybe they're not relevant anymore. So you're weighting based on time, and it feels more like how humans do it. But the way they were storing it was structured. MemGPT is another open-source project, and they do that short-term/long-term thing, and they're also giving more weight to recent stuff versus the long-term memory. And the metrics that were interesting are basically the F1 score, which is: is the answer that A-MEM got right and complete, versus what the other methods got and what the ground truth is in the LoCoMo dataset. These metrics are especially interesting for most of us in the MLOps community because we have to do a lot of evals to actually get the thing ready for production.

Nehil Jain [00:39:42]: And I find this section of most papers very interesting: how they actually decided whether it is correct or not. So that's why I put a little extra emphasis on the metrics themselves. BLEU is a way to check how much word similarity there is between the ground truth and the A-MEM responses; just simple word matching. ROUGE is kind of similar, but it's doing substrings. So it's like n-grams: what was the longest substring that was an exact match. So there they're also checking the order of the words, and of course the words as well, like BLEU. This is simple.

Nehil Jain [00:40:17]: SBERT is semantic distance. And then another thing they're checking is how much money it costs: what is the total cost of putting all this memory in? Because eventually it'll be both a latency and a cost problem if you're just shoving everything into context. And I remember earlier this year we had a discussion about very large context models. And one of the things we were seeing then was: can you actually put everything in memory? One is, yes, what is the recall of it? But apart from that, what are actually the latency and cost benefits or detriments? So even though A-MEM won on multi-hop, it didn't win on open-domain and some adversarial stuff, because I think the base LLMs are just trained to be better at that. But it was multi-hop, where you have to look at multiple different sessions to actually get the answer; I think that's where A-MEM won by a huge margin, and it was also cheaper. So I think that's where it's showing the best promise.
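For reference, here is a generic token-level F1 of the kind Nehil is describing (BLEU, ROUGE, and SBERT similarity would come from standard libraries). This is a plain illustration, not the paper's evaluation harness:

```python
from collections import Counter

def token_f1(prediction: str, ground_truth: str) -> float:
    pred = prediction.lower().split()
    gold = ground_truth.lower().split()
    overlap = sum((Counter(pred) & Counter(gold)).values())  # multiset intersection
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)
```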

Nehil Jain [00:41:23]: In the next slide I'll talk a little bit about some things I found suspicious overall in the paper and things that they could have done better. They ran an ablation to figure out: if you don't do linking, or if you don't do evolution, what is the result? And it seems like the best result is when you do both linking and evolution on your memory. So for this paper, it's a good win that you need both of them and they are significantly adding to the performance. This is what Adam was mentioning: they ran a bunch of different K values, like how many nearest neighbors, because eventually it goes back to the token length and it can also hurt recall, and so K is equal to 30. But they were doing it with GPT-4o mini, so I don't know what happened; parts of the studies were all done with GPT-4o mini, probably because it's a research group running it, for cost purposes. But I think there's something there where maybe K is different for different models, and you might have to do a bunch more testing here rather than just saying, I ran 10 different K's and this is the right answer for everything.

Nehil Jain [00:42:33]: Then the last one: they did find that if you do evolution versus if you don't, you find tighter clusters of the memory itself. Which means, kind of similar to one of the points that Adam was mentioning, that you form a gold set of memories: the things which you come back to a lot more, which are more trusted, which you have high retrieval on. So they did find that if you plot it in a space using t-SNE, your memory after evolution has tighter clusters and you can bring things together, which is very useful. But then there were some things which I found sus. I'm like, really? If you look at the paper and the metrics, GPT-4o mini outperforms GPT-4o by a lot. And I'm like, is that really the truth? Is that what's happening, that a smaller model with less power can outperform? At least in their study. I mean, yesterday OpenAI released all the other new models, so, well, we need to redo the whole paper with newer models.

Nehil Jain [00:43:36]: But yeah, this was one such thing for me. The other thing they didn't talk about, I think, and correct me if I'm wrong here, is what happens with runaway evolutions. Coming from the data world, I've seen this a lot, where you modify things and you can just have a cascading effect which you cannot control and it's very hard to roll back. And what do you do after that? For my purposes, I felt that the LoCoMo dataset doesn't cover everything. Yes, it covers single-hop, multi-hop, etc. But it doesn't cover a lot of the case that Matt actually motivated us with, which was a code-related case; they didn't talk about how it does on other general-purpose tasks, only conversational memory stuff. And also knowledge bases: what if I bring my own database of structured memory or embedding space that you need to retrieve from as well? Does that change the results at all? And the last thing is: what is the latency of doing this, doing the retrieval and the evolution, and then actually making it go to production? Those were some things I found kind of questionable.

Nehil Jain [00:44:45]: I understand that. I think, again, Adam was saying that they have to publish a paper, totally get it, but it just feels like there's more that can be done. Yeah. So that's all I got.

Arthur Coleman [00:45:01]: Given the time, I'm going to go fast here. We have about 12 minutes for questions and discussion. Let me look at some of the questions and go back to something early on. Someone raised the cost problem on this. Let me ask the speakers: how big a problem do you perceive this to be? Because reading it, I had this question. And if so, do you think things like compression or other approaches will allow us to get to a point where this actually is cost-effective in production, or is it already cost-effective in production? Do you think that this is a limited enough set because they're limiting the number of links?

Nehil Jain [00:46:05]: Yeah, I think in the cases where A-MEM works well, the cost won't be a big problem, because the cost of all of this is going to go to almost zero, right? Every day you're getting a newer model with more intelligence, which is cheaper and cheaper and cheaper. So I think by the time this becomes mature, the intelligence of the base models will be cheap enough. That's my hypothesis, at least.

Adam Becker [00:46:30]: Yeah. I think that the biggest challenge that I had was in figuring out what counts as a memory in the first place. Am I triggering this once a second, once a minute, once an hour? I understand how they're testing it in a setup where I'm just texting with the agent; fine. But really, agents are operating in a much broader space than just interaction data of that type. And so I can conceive of myself doing it for knowledge. I can conceive of myself doing it for conversations.

Adam Becker [00:47:10]: I can conceive of myself doing it for utterances within conversations. And I feel like how I choose to deploy it is just as relevant to the cost question.

Arthur Coleman [00:47:22]: In the code, did you see a timeline? When they were doing updates to the linkages, was it like every hour? Was it variable?

Adam Becker [00:47:31]: It was based on the number of updates necessary. So they kind of threshold it; it's like, every five modifications, they'll do it all at once.

Arthur Coleman [00:47:47]: Other comments from anyone? Okay, let me take another one. First of all, feel free to ask questions; I can go from the question list or you guys can ask questions. Anyone have a question for the speakers specifically? All right, if we're not going to get that, one of the questions was about TF-IDF and similarity. Can we talk a little bit about the similarity methods that they're using? And do you think that is flexible and optimal to use?

Adam Becker [00:48:40]: Yeah, I mean, I can. I don't know if they've described it in the performance and in the results, but in their implementation they tried a few different things. They tried TF-IDF just for the keywords, and that would be kind of the retrieval mechanism. And then they tried just a very simple embedding, and then they look at the nearest neighbors on the basis of that. And then they tried a combination: they call it an alpha, and you're just weighing how influential the component of just the embedding vector, the cosine similarity, should be versus the TF-IDF for the keywords.

Adam Becker [00:49:22]: And I don't know what they've landed on. It was just all in the code. I don't think they mentioned it all that much in the paper. But yeah, same.

Nehil Jain [00:49:31]: They haven't mentioned that in the analysis, like the experiment section. But if they're doing hybrid search, using BM25 for keywords plus, based on the code you were showing, cosine similarity, I think that covers the performance you'll get from TF-IDF too; that's my gut feeling. But yeah, they haven't described it too much.

Adam Becker [00:49:50]: Yeah, that was my intuition. Yeah.

Arthur Coleman [00:49:55]: Because obviously there are other similarity metrics that you could use, and that I've used in the past. TF-IDF is not always, or cosine similarity is not necessarily, the best similarity metric in all situations.

Matt Squire [00:50:09]: But I wonder how much the similarity metric matters in the grand scheme of things versus overall having the right information and organizing it in a sensible way. We may find that it's marginal gains on the selection of the particular method you use.

Arthur Coleman [00:50:29]: That's an interesting comment in the area for research. Very interesting.

Adam Becker [00:50:34]: Yeah. Yeah. I would think that, because you have two components here. The first is just grabbing, very roughly, all of the potential neighbors.

Nehil Jain [00:50:42]: Right.

Adam Becker [00:50:42]: And the second is going neighbor by neighbor to figure out if they actually need modifications. I wonder if that second process is where you get even more bang for your buck. I wasn't persuaded that that one is done with sufficient...

Nehil Jain [00:50:59]: I think somebody has a question.

Isa [00:51:00]: Yes, it's related to what you guys are talking about. I found it a bit strange that they are doing this text processing with the LLM, then creating these links, which I guess are going to be semantically similar to the content, and then they end up stuffing all of this into a vector and just doing semantic similarity. In any case, I mean, if the content was semantically similar, surely adding the links and then just doing semantic lookup... intuitively, it just feels like this wouldn't work, for me. So maybe in the code you mentioned they were doing BM25 just on the links or something, but I didn't get that at all from the paper.

Adam Becker [00:51:40]: Yeah, not even on the links. I think they did BM25 just on the keywords.

Arthur Coleman [00:51:50]: You were the first person to raise your hand; I didn't see it. My apologies. So I'll be watching for that. Now I know how to do that. Now I'm watching. Let me ask another question which I think is important, which is about M, M prime, and M double prime, the evolution of the memory.

Arthur Coleman [00:52:07]: Now, I come from the entity resolution world. For example: Arthur Coleman, Arthur L. Coleman, 123 Main St., 456 Smith St. Are these the same person? And what we do over time is keep the history of those memories, watch how they go, and over time use the historical information to validate whether that's the same person as we build the database. Adam, are they keeping M prime when they go to M double prime? And do you think that loses information? Or, if they're not keeping it, would keeping it necessarily improve the quality of the contextual data? Because it would allow us to go back and have a richer potential history on that link.

Arthur Coleman [00:52:52]: Because again, what they've added is context, right? The whole point of this was to make a richer token, and very brilliantly too, in my mind. But do you think that the loss of data, if they don't keep it, is a problem?

Adam Becker [00:53:11]: They make space for it, but I haven't seen them actually use it. So if you see the code here, are you guys able to see my screen? They have evolution history here. They're not making use of it. In the context of entity resolution, I have seen this be useful, right? Even when I know there's a Jon Snow and a Jonathan A. Snow, they both live in Boston, and at some point I have more data to tell me that they're actually the same person, even though in the beginning they started out separate. I'm going to rely on the fact that yes, now I know that both of them are the same. Right.

Adam Becker [00:53:51]: So I suspect that in the trend of connections there are, like, meta-insights to be gained. But what is the shape of the pruning and the merging and the context update? They're not taking that into account, and I think that would be fascinating to plug in.

Arthur Coleman [00:54:12]: Yo Matt, any comments on that?

Nehil Jain [00:54:15]: Yeah, I just feel like a lot of these questions, at a meta level, are all thoughts and experiments that actually need to be run, because of the non-deterministic nature of how LLMs work. It's very hard to definitively say, you know, this is going to give you this result, and there are some limitations on how many experiments they could have run as well. But definitely more experiments to be run to see which hypotheses actually land.

Matt Squire [00:54:41]: And what about the quality of evaluations on memory-related tasks? I sense that there's probably more work to be done there, because, and I agree with you, Nehil, that we need to evaluate all these different things, but then how do we evaluate them? What's the baseline that we're comparing against here?

Nehil Jain [00:55:01]: In general, agentic evals are just hard. It's a very new thing. Maybe the next paper should be about the state of the art of evals for agents, if we haven't discussed that already.

Arthur Coleman [00:55:15]: And something you said triggered something else for me which we haven't talked about, which is the new agent-to-agent capability that came out of Google Next. If you haven't seen this, guys, you should, because they're talking about having a... not a repository, what do you call it? A reference where people find other agents, and they can find them and link to them and use their functionality. But then, if one has a memory and another has a memory, are they sharing memory? How does that work? I think that's going to be one of the more interesting applications of this problem, because if two agents, three agents, four agents want to work together, they have to be working off common information to do the best work that they can. So that's an interesting challenge to this model. Lastly, I want to ask our audience, we have about a minute: does anyone else have thoughts? Do you guys think this is brilliant? Do you think it's like, eh, I like other methods better?

Arthur Coleman [00:56:11]: You know, Maja, Reese, Savannah, Chris, anyone? I'm looking at the screen of all our listeners. I'm trying to engage you guys into contributing here. Anyone have thoughts? You know, on a scale of 1 to 10, how are we doing with this thing?

Nehil Jain [00:56:26]: I think Isa has one more question.

Isa [00:56:29]: Yes, yes I do. Maybe just touching on what Nehil said about being a bit suspicious about the results. Didn't you guys also find it suspicious that the brute-force LoCoMo method on the models with a 128k context window didn't just always win? Because they're saying that in this LoCoMo dataset the average session history is 9,000 tokens, so you could just put everything in the context. Surely that must be better than going through all this convoluted lookup.

Nehil Jain [00:57:04]: No, I think we discussed a paper earlier, in January, where there's the lost-in-the-middle problem. So probably that's what's happening. If you have a very, very full context, it doesn't put the same emphasis everywhere and doesn't have recall across the whole context properly.

Isa [00:57:19]: Okay, that would make sense.

Arthur Coleman [00:57:22]: All right folks, we are at 9 o'clock. Even I'm going to keep us on time. I want to thank our speakers. You have been fabulous in presenting this information in a way that even I could understand, and I'm the least technical technical guy in the room. So thank you everybody. I hope you found it valuable. Benoy, where is the video posted? Is it posted in the reading group chat?

Nehil Jain [00:57:44]: Yep, that's right. When it's available, we'll notify everyone in the Reading Group channel, so stay tuned for that. And if you guys have any suggestions for what we should cover in the next session, please do drop the links in the Reading Group channel. And yep, Arthur, you did a tremendously great job, honestly.

Arthur Coleman [00:58:01]: Thank you. I'll get better with time, guys.

Matt Squire [00:58:03]: Thank you.

Arthur Coleman [00:58:03]: Thank you so much. Live long and prosper. Be well.

Matt Squire [00:58:08]: Thanks, everybody.

Nehil Jain [00:58:09]: Thanks, everybody.

Matt Squire [00:58:10]: Bye.

