MLOps Community

Overcoming Agentic Memory Management Challenges

Posted Oct 17, 2025
# Agentic Memory
# AI Agents
# Cortex

SPEAKERS

Biswaroop Bhattacharjee
Senior ML Engineer @ Prem AI

Biswaroop Bhattacharjee is a Senior ML Engineer at Prem AI, hacking with LLMs, SLMs, vision models, and MLOps in general. Biswaroop has also worked on ML platforms and distributed systems, with stints at startups in conversational voice AI @ Skit.ai, chatbots from the pre- and current-LLM era @ circlelabs.xyz, and a bit of fashion hyperforecasting @ Stylumia.ai.

Demetrios Brinkmann
Chief Happiness Engineer @ MLOps Community

At the moment Demetrios is immersing himself in Machine Learning by interviewing experts from around the world in the weekly MLOps.community meetups. Demetrios is constantly learning and engaging in new activities to get uncomfortable and learn from his mistakes. He tries to bring creativity into every aspect of his life, whether that be analyzing the best paths forward, overcoming obstacles, or building lego houses with his daughter.


SUMMARY

What if AI could actually remember like humans do?

Biswaroop Bhattacharjee joins Demetrios Brinkmann to challenge how we think about memory in AI. From building Cortex—a system inspired by human cognition—to exploring whether AI should forget, this conversation questions the limits of agentic memory and how far we should go in mimicking the mind.


TRANSCRIPT

Biswaroop Bhattacharjee [00:00:00]: So imagine memory as a tool. It is still debatable whether AI agents actually need a mechanism to forget. Imagine it's not us who are controlling these tweaking parameters, but an agent. The vision memory should be able to communicate properly with the textual memory that it has access to from day to day.

Demetrios Brinkmann [00:00:21]: You dove into the memory world for the past five, six months because you built out Cortex. Can you give me a lay of the land at a high level?

Biswaroop Bhattacharjee [00:00:32]: So when I was starting, I think at that time there was not anything very specific. There was MemGPT and there was LangMem, which were quite popular at that time. What piqued my interest was that A-MEM came out around then, I think five or six months ago. Also, the MLOps Community had a very nice reading group discussion on A-MEM, and we discussed the whole paper there, which was very helpful for understanding things even better. One of the key ideas I saw in A-MEM was: let's say you are inserting something into the memory, or database, or whatever sort of collection the agentic system will have access to. The key differentiating factor was that it forms relationships.

Biswaroop Bhattacharjee [00:01:26]: I'm having this talk with you right now, and I knew a few days back that we were going to have it. So currently I am connecting this current state with my past memory, because they're interconnected, and if you look, there's a relationship between these two memories. We can also name these relationships, something like "is a result of": the current state is a result of the previous state that we were discussing. A-MEM does something similar. It takes the different memories that you want to store inside the whole memory collection and forms relationships between them. If I dive a little bit deeper: say you are inserting something into the memory. If it's the first thing you're inserting, you just insert it directly. But before inserting, you analyze it a bit and come up with a summary of what it's about.

Biswaroop Bhattacharjee [00:02:32]: We are calling that the context. Then we also come up with keywords and tags, so what the main key things in it are. By the way, this memory chain can be as big or as small as you like. That's on the client side, and it depends on how you actually use the memory system. That's what A-MEM was trying to do.
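A minimal sketch of the insertion-time analysis described above: each memory gets stored with an LLM-generated context summary plus keywords and tags. All names here (`MemoryNote`, `analyze`) are illustrative, not the actual A-MEM or Cortex API, and the LLM call is faked with trivial heuristics so the sketch runs offline.

```python
from dataclasses import dataclass, field

@dataclass
class MemoryNote:
    content: str                       # the raw memory text
    context: str                       # LLM-written one-line summary
    keywords: list[str] = field(default_factory=list)
    tags: list[str] = field(default_factory=list)
    links: dict = field(default_factory=dict)   # relation name -> linked note ids

def analyze(content: str) -> MemoryNote:
    """Stand-in for the LLM call that extracts context, keywords, and tags.
    Faked with trivial heuristics so the sketch runs without a model."""
    words = [w.strip(".,").lower() for w in content.split()]
    keywords = [w for w in words if len(w) > 6][:5]   # crude keyword pick
    return MemoryNote(content=content,
                      context=content[:60],           # crude "summary"
                      keywords=keywords,
                      tags=["note"])

note = analyze("Discussed agentic memory relationships with the MLOps community")
print(note.keywords)   # ['discussed', 'agentic', 'relationships', 'community']
```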

Demetrios Brinkmann [00:02:55]: In a way it's like a knowledge graph. You've got this chunk and then you have metadata around it and that metadata connects to other pieces of metadata.

Biswaroop Bhattacharjee [00:03:05]: Yep, yep. The metadata doesn't directly connect, but you have this metadata in place. So currently, let's say you have a few memories already in place, and then a new memory comes in. It does a semantic search across all the memories in the collection and takes out, let's say, the top K memories. Once it has the top K memories, it asks the LLM with their memory representations. So we don't use the whole memory; we use the summary, the keywords or tags, and whatever other metadata we store. It can even have the timestamps. Then we ask the LLM, with the new memory that's coming in: hey, do you see any sort of relationship between these existing ones, which we found to be similar to the incoming memory? If you do, then connect them. What I mean by connect them is: update the existing metadata and also give nicer metadata to the incoming memory.

Biswaroop Bhattacharjee [00:04:09]: And that's how the whole graph keeps forming. It makes a connection between the two. Then, at the time of using the memory, let's say a query comes in and we have to look for the most relevant ones. What we do is analyze this query with an LLM. So one differentiating thing in A-MEM is that on every insertion, and before every retrieval, we do an LLM call, because we are trying to preprocess. A-MEM doesn't really focus heavily on the latency side, but more on the quality of the memory that you are getting out.
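The insertion-time neighbor search described above, finding the top-K most similar existing memories before the LLM decides on relationships, could be sketched like this. The embeddings, store layout, and function names are assumptions for illustration, not the real implementation.

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def top_k_similar(new_vec, store, k=3):
    """store: list of (note_id, embedding). Returns ids of the k nearest
    existing notes, which would then be shown to the LLM (via their
    summaries and keywords) to decide whether to form relationships."""
    ranked = sorted(store, key=lambda item: cosine(new_vec, item[1]), reverse=True)
    return [note_id for note_id, _ in ranked[:k]]

store = [("a", [1.0, 0.0]), ("b", [0.9, 0.1]), ("c", [0.0, 1.0])]
print(top_k_similar([1.0, 0.05], store, k=2))   # ['a', 'b']
```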

Demetrios Brinkmann [00:04:52]: I was going to say it sounds very akin to a search problem.

Biswaroop Bhattacharjee [00:04:56]: Yeah, yeah, for sure. So currently we take out the metadata and do a semantic search using two things: one is the query itself, and the other is the keywords that we extracted from the query. We do two retrievals in parallel at the same time, then take those two results, feed them into the context, and do whatever we want with it, like pass it to another LLM to answer something.
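The dual retrieval just described, one search on the raw query and one on its extracted keywords, merged, might look like this. The corpus, `toy_search` stand-in, and merge policy are all hypothetical; a real system would call an embedding search against the vector store.

```python
CORPUS = {
    "m1": "fix the car engine",
    "m2": "python list sorting",
    "m3": "car insurance renewal",
}

def toy_search(text, k):
    """Toy stand-in for embedding search: ranks notes by word overlap."""
    words = set(text.lower().split())
    ranked = sorted(CORPUS, key=lambda nid: -len(words & set(CORPUS[nid].split())))
    return ranked[:k]

def dual_retrieve(query, keywords, search_fn, k=2):
    """Run the two searches -- raw query and extracted keywords -- then
    merge the hits, deduplicating by note id while preserving order."""
    hits = search_fn(query, k) + search_fn(" ".join(keywords), k)
    seen, merged = set(), []
    for nid in hits:
        if nid not in seen:
            seen.add(nid)
            merged.append(nid)
    return merged

print(dual_retrieve("fix the car", ["car", "engine"], toy_search))   # ['m1', 'm3']
```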

Demetrios Brinkmann [00:05:30]: That's very much on the personalization side, right?

Biswaroop Bhattacharjee [00:05:33]: Yes, I mean, you can do anything with that. And this is the most basic core part of A-MEM, where it didn't really perform that well compared to the existing methods that were there. But what was striking to me was the relationships forming in between.

Demetrios Brinkmann [00:05:53]: Okay.

Biswaroop Bhattacharjee [00:05:54]: And I was trying to think very deeply about how humans think about memory and how we can actually improve it using this relationship foundation. Because the only place I saw these relationships in place was in GraphRAG. But I currently feel that in GraphRAG we have all of these different nodes and we compress down information way too much. That's one of the issues, and that was not there in A-MEM.

Demetrios Brinkmann [00:06:26]: Then you got inspiration from A-MEM. You thought deeply about how memory works within humans, and then you said, let's take it another step and let's create our own memory.

Biswaroop Bhattacharjee [00:06:40]: Yeah, yeah. I was brainstorming with a few of my friends, whoever was in this space and reading papers, basically. I think I used deep research to understand how humans think, specifically from the memory perspective, at a very high level and also a bit lower level. So there's this thing called long-term memory and short-term memory. We have recently seen short-term memory in the AI or LLM space: it's just the recent window of the chat that you're having with the AI agent or conversation assistant. That's one form of short-term memory, though I wouldn't say it's quite that, because it has to be a bit bigger than that. But long-term memory is crucial. I think that's where all the magic happens.

Biswaroop Bhattacharjee [00:07:44]: So the relationship forming, everything being stored nicely, and this whole big graph that we are storing, that is part of the long-term memory. And we don't expect long-term memory to form instantly. It can always keep running in the background; it will take its time. But as we are having this conversation, we are building onto things constantly in the back of our minds. And I'm using certain parts of it, but that's from my short-term memory. Short-term memory is sort of like a window; it's constantly updating, updating, updating.

Biswaroop Bhattacharjee [00:08:24]: But in long-term memory we are just storing everything, kind of dumping it. And while we are dumping, things are processing in parallel in the background, extracting lots of metadata out of it and interconnecting things.

Demetrios Brinkmann [00:08:37]: Yeah, I remember taking an online course a while back called Learning How to Learn. It talked a lot about this, about how we create memories because that helps us learn. One of the things it mentioned was that the more you access something from long-term memory, the more solidified it becomes as something that you now know. So it's about frequency. That's why you do flashcards when you're trying to learn something: you try to remember it, and maybe you can, then you flip it over and there it is. So you access it again, and you basically do that until you know it, and then you can spread out the frequency.

Demetrios Brinkmann [00:09:28]: So if you're trying to learn something and you're struggling with it, maybe you're doing these flashcards every day or twice a day, three times a day. But as you then don't need to flip over the card to remember something, you can space it out and do it once every three days, once every five days, once every week. And so it feels like what you're talking about, it has a little bit of that inspiration. But the other piece that I think is interesting is that you're also mentioning we've got this long term memory and we're able to throw things in it. How do we weight it and how do we know that things are important from it? Because for humans we have something that is like, oh, the more frequently that we access it, the more important or the more solidified it becomes in our memory.

Biswaroop Bhattacharjee [00:10:18]: So I have used these Anki flashcards. I think they're quite popular among everyone who's trying to learn something. I remember trying to learn Japanese; I was learning hiragana characters through Anki flashcards. And the concept is spaced repetition, I think that's the term. That's quite popular. And this is something very much seen in humans specifically, because we tend to forget. We tend to forget things a lot.

Biswaroop Bhattacharjee [00:10:52]: And this is something which is still debatable if the AI agents actually need a mechanism to forget or not. So there are a few people who actually believe that, yeah, AI systems need to forget because information gets outdated. We have to prune the things out. And there are like lots of strategies to remove things out. And also some people believe that, no, it's an AI agent, It doesn't have to exactly act like a human, like just dump everything and just use the most important ones and archive maybe in some way.

Demetrios Brinkmann [00:11:25]: Yeah. Well, also the other piece I remember that you were mentioning too was how humans sleep. And sleep is a huge factor in how we are able to bring things into our long term memory.

Biswaroop Bhattacharjee [00:11:37]: Right, exactly, exactly. There are papers which say that humans keep on processing when they're sleeping. The term is consolidation: human memory consolidation happens when you are resting, when your stress levels are quite low and your brain is not multitasking and doing a lot of things. But AI agents don't need to sleep, so they can do so many things in the background.

Demetrios Brinkmann [00:12:10]: But when you're thinking about how to consolidate different memories, how did you go about that? Because as you were just saying, you're inserting memories with their metadata and then you're constantly updating this knowledge graph.

Biswaroop Bhattacharjee [00:12:26]: So Cortex is something I built on top of A-MEM at first. I took all the inspiration from A-MEM on how it was happening, and I thought, hey, we can actually improve this thing, and there are so many possibilities in here. So the first thing I did... Oh, before I mention this, I have to also ask: have you ever used Obsidian?

Demetrios Brinkmann [00:12:49]: Yeah, yeah, yeah, of course.

Biswaroop Bhattacharjee [00:12:52]: So they use a method called, I think I'm pronouncing it wrong, Zettelkasten, something like that. What it is: you can reference things on any page from anywhere. At the end, Obsidian gives you this whole interconnected graph where you can check different clusters, see all the things you have mentioned and which page connects to which page. Your whole writing journey is kind of captured in a graph, and all the interconnected webs in between come out for you to see, so you can get the whole bigger picture. The A-MEM paper mentions Zettelkasten a lot, and A-MEM forms relationships between memories in the same way. So what I thought is: currently we value things differently all the time.

Biswaroop Bhattacharjee [00:13:49]: Our values depend constantly on a lot of factors. It might be our own biases, what our previous memory was, what we have seen, what we have gone through every day in our life, which is totally different for every single person. And by person in this case, I mean a memory. Okay, so the key change I think I made was that I named the relationships that each memory is connected by. What I mean by that is I had some key terms. Initially I was just experimenting: what if we name a relationship "extends" or "definition of"? So I had some hard-coded key terms here, and I tried to extend them more.

Biswaroop Bhattacharjee [00:14:45]: So that was there. There were also key terms like the reciprocal of "defines." So a memory defines this memory, but we don't really know how much it defines it. If we talk about it mathematically: zero means there is no definitional relationship between them, and one means it fully is the definition of the other. So it can be scored.

Biswaroop Bhattacharjee [00:15:17]: So we can actually score that. What is the relationship strength between these two?

Demetrios Brinkmann [00:15:22]: Nice.

Biswaroop Bhattacharjee [00:15:23]: And this is something we don't do manually, or as any sort of instantaneous thing in the pipeline. While we are forming these metadata connections and relationships between memories through an LLM, we ask it to also give a score, and to only score it if it's very confident that a high-weighted relationship is present. That gives you the weight of the whole thing. The other thing is that we also ask it to name the relationship out of these possible naming schemes. That's just a big set that we have, and it basically chooses how they are connected to each other: does it extend, or is it a definition? And there's one more option, which is: should you merge the two memories? Because sometimes in our memory we actually tend to merge things into each other, and I feel like that's very important. So that's a feature I also added.
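The named, scored relationships just described could be sketched like this. The relation names come from the conversation; the threshold value and the `add_relation` helper are assumptions, standing in for the LLM's confidence judgment.

```python
# A fixed vocabulary of relationship names; the real set in Cortex is
# presumably larger -- these are just the examples from the conversation.
RELATION_TYPES = {"extends", "defines", "is_defined_by", "is_a_result_of", "merge"}
CONFIDENCE_THRESHOLD = 0.7   # illustrative cutoff; only confident links are kept

def add_relation(graph, src, dst, name, strength):
    """Store a named, scored edge src -> dst. `strength` in [0, 1] is the
    LLM's confidence that the relationship really holds (0 = no relation,
    1 = fully defines); weak links and unknown names are rejected."""
    if name not in RELATION_TYPES or strength < CONFIDENCE_THRESHOLD:
        return False
    graph.setdefault(src, []).append((dst, name, strength))
    return True

g = {}
add_relation(g, "talk_today", "plan_last_week", "is_a_result_of", 0.9)
print(g)   # {'talk_today': [('plan_last_week', 'is_a_result_of', 0.9)]}
```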

Demetrios Brinkmann [00:16:41]: And yeah, sometimes we can falsely remember things because I think a lot of times we will merge two memories and then we'll remember. Oh, that was maybe if I've been to a place twice. Yeah, but it was a long time ago. I mix up which time I did something.

Biswaroop Bhattacharjee [00:17:03]: This is all experimental. I just wanted to see what happens at the end. There were a few types of merge; one is an update, which is what you just said. In an update, we just concatenate different memories into one, and it becomes a new memory. And by the way, we are maintaining all the history in the metadata.

Demetrios Brinkmann [00:17:26]: Okay, if you need to re Separate them at a certain time, if we can do that. And did you take any different approaches from AMM on the LLM calls and the search and retrieval style?

Biswaroop Bhattacharjee [00:17:40]: Yes. In A-MEM they just had a simple prompt where, at the time of inserting a memory, they would process it first and then insert. And what I mean by insertion is that it forms relationships with the most similar memories it can find in the database. And at retrieval, it would take the query, do a semantic search, take the top K memories, and also go one level deeper: for all of those top K memories, you can see what the other connected memories are, and you take them into account also. Obviously they will be less similar to the original query that you have.

Biswaroop Bhattacharjee [00:18:29]: Yeah, I think those two were the key factors. But something we added in Cortex was bidirectional connections. There was only a single connection between memories, as far as I remember. I changed that to bidirectional connections, because it doesn't make sense for a connection to only go one way. For every connection represented in the graph, it shouldn't only be "it defines this"; we can also have a backward connection saying "it is defined by." So at the time of retrieval, if one of them comes up, the other gives more context to the model, and we can use the two together.

Biswaroop Bhattacharjee [00:19:24]: This is what we call depth in GraphRAG. We can customize how deep to go, how many connections deep. Generally in practice it's nice to use just a depth of one or two, because three gets way too much sometimes, and too much noise might come out.
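The bidirectional links and depth-limited expansion described above might look like this sketch. The `link` and `expand` helpers are hypothetical names; the backward relation names are the reciprocals mentioned earlier.

```python
def link(graph, a, b, forward, backward):
    """Store both directions, e.g. a -defines-> b and b -is_defined_by-> a."""
    graph.setdefault(a, []).append((b, forward))
    graph.setdefault(b, []).append((a, backward))

def expand(graph, seeds, depth=1):
    """Follow links outward from the retrieved seed notes up to `depth`
    hops. As noted above, depth 1 or 2 is usually enough; going deeper
    tends to pull in noise."""
    seen, frontier = set(seeds), set(seeds)
    for _ in range(depth):
        frontier = {nbr for nid in frontier
                    for nbr, _ in graph.get(nid, []) if nbr not in seen}
        seen |= frontier
    return seen

g = {}
link(g, "m1", "m2", "defines", "is_defined_by")
link(g, "m2", "m3", "extends", "is_extended_by")
print(sorted(expand(g, ["m1"], depth=2)))   # ['m1', 'm2', 'm3']
```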

Demetrios Brinkmann [00:19:42]: Were you vectorizing this also?

Biswaroop Bhattacharjee [00:19:45]: Yes, it's a fully vectorized dataset. We just use a single vector DB; there is no graph DB that we are managing for this specific use case, because at the time of retrieval we are just doing semantic search. And by the way, what I'm talking about right now was the intermediate stage of Cortex, when it was just coming out and I was experimenting.

Demetrios Brinkmann [00:20:12]: Oh, there's a V2.

Biswaroop Bhattacharjee [00:20:14]: Yeah, there will be improvements on top of it. Yeah.

Demetrios Brinkmann [00:20:16]: All right, let's talk about that a little bit.

Biswaroop Bhattacharjee [00:20:17]: Sure, sure. So currently people are coming out with these AI assistants or agents where they have to do multi-domain tasks. The domains are not really connected to each other; they can be connected somehow, or they might not be connected at all. Let's say you have ChatGPT right now, or any sort of AI assistant which you use to ask a lot of different things. It can be about your work, it can be about your personal life, or it can be somewhere in between, because your friend might be working as a co-worker with you. There have to be some sort of distinctions, or collections I would say, that we currently, automatically have as humans in our minds: we know how to separate them and also form connections between them.

Biswaroop Bhattacharjee [00:21:13]: So I kind of visualized this, and I did some research on the same. This is hierarchical collections; that's what we are going towards right now. The existing systems, whatever we are seeing in the AI agent space for memory, only focus on a flat memory structure, that's what they're calling it. And what happens in a flat memory structure is there is no sort of hierarchy.

Demetrios Brinkmann [00:21:45]: Flat hierarchical structure, are you talking about something that is less like notion where we have all of our different files or any file structure where you've got your file, you click into another file, you click into another file and it just keeps going down the line. That's not what you have right now and that's where you think it's going. Or is it because right now, as you mentioned, it's more like Obsidian where everything is just on this flat connected space and you have clusters over here and clusters over there. But there's no hierarchical way of doing it.

Biswaroop Bhattacharjee [00:22:23]: Exactly. Yeah, exactly. You explained it very nicely. Just like how a file system works. In a file system there is always this hierarchical structure. First of all, if you're searching for something on your laptop or desktop, you're looking for, let's say, a folder which has a very high-level overview of what you're looking for. Then you go inside it, and then you look for maybe very specific things. So this is like the concept of categories and subcategories.

Biswaroop Bhattacharjee [00:22:57]: Imagine at a high level you have a lot of different topics in your mind, like work, or your personal life. Now all of these different topics have subtopics. Inside work, it can be the current company that you're working at. Maybe you have a side project that you're also working on, and it's making money for you. And money maybe connects with finance, which is a bigger topic in your life in general. And for each of these top-to-bottom structures, there's a sort of triangle.

Biswaroop Bhattacharjee [00:23:33]: We can see sort of like a triangle forming which has like multiple things connected to each other. It's like a tree, basically. Yeah, just a tree. This is something that humans, I think, use deeply because when we are thinking of something, we sort of think. We think like very fast, but we sort of think through hierarchies first and kind of connect things together. So we go down from top to bottom. Also we are taking into account our short term memory, which creates some sort of biases on how we are kind of traversing the whole thing. But this is something which is missing on the current systems.

Biswaroop Bhattacharjee [00:24:14]: It doesn't really represent these hierarchical collections or the topics directly. So this is something we should represent also and take into account and sort of build like a hybrid search which uses both of the. Both of the previous way we were doing like retrievals and storing. And also it makes use of the whole hybrid thingy. What I just told about the collections. So in cortex we are saying it, it's smart auto collections.

Demetrios Brinkmann [00:24:44]: So but what does that enable? Is it just that it's faster search?

Biswaroop Bhattacharjee [00:24:50]: No, it's not faster search. What it enables a higher quality search. So it gives you much lesser noise compared to what I mean flat search will give. Just to give an example, let's say you search for something like fix this. Now fix this can mean literally anything. It can mean like fixing your car or it can mean, yeah, what is this exactly? So fix this. Like what is even this? So this, it means so many things in your life. And if you use a flat hierarchical model, it can just fetch out the fix your car or like, I don't know, fix your friend's brain or something like that.

Demetrios Brinkmann [00:25:31]: Anything with fix in the metadata, it's going to grab.

Biswaroop Bhattacharjee [00:25:34]: Exactly. We are kind of providing like the whole. We are providing an option to give context. Also this actually needs. This is ultimately a context problem. So when we are kind of doing retrieval, there's an optional context parameter. But overall it should first of all go through the. What are the top level collections are.

Biswaroop Bhattacharjee [00:25:58]: What are the top level keywords are? I will talk about it in terms of two, two ways. One is like let's say you're inserting some sort of data. So memories, these memory systems in general, I think kind of has two key functions. One is insertion, another is retrieval. So imagine you just have something which you're constantly inserting or it's happening in the background and at the background you can constantly keep on retrieving and use that however you want in your agentic systems. Keeping these two things in mind, if we talk about insertion for let's say this auto collections thing which we have in Cortex. So when a new memory comes in, let's say you have some existing memory already in place. And when a new memory comes in, it kind of categorizes based on what the user Persona and what are your priorities are.

Biswaroop Bhattacharjee [00:26:50]: This we can actually write down initially. So based on that it kind of categorizes that if that memory represents just on high level like work or if it represents work, then kind of subcategory can be first job. And then there can be some more subcategory like Python. Maybe it's something related to Python that you have are trying to like store in your memory now sort of in this way. We like every single memory. When it comes in, it sort of creates like all of these topics and then it's subcategories only if it means something. I mean it's optional. It can just be a very high level thing.

Biswaroop Bhattacharjee [00:27:35]: Also that I hate my job. So that's like just something which is like connected to your work and it doesn't really have any other smaller subcategories to it. It might by the way, so something it can have like let's say emotional aspect of it. So job then emotions. Now we form all of these like small, small categories and after it reaches a certain threshold, we check what are the frequencies of all of these different memories and what are their categories are. So if it exceeds a certain threshold we form a collection out of it. So now this is a collection and anything whenever like we. By the way, I didn't mention that how we are kind of forming these categories.

Biswaroop Bhattacharjee [00:28:24]: So at the time of getting the metadata out of a query, if you remember. So in Ammon or.

Demetrios Brinkmann [00:28:31]: Or in Cortex, you're making the LLM call.

Biswaroop Bhattacharjee [00:28:33]: Yes, in LLM call we are also asking to give it a category specifically and with some context which is optional to take into account that hey, it already exists with these. Try to keep it minimal and noise free. And so that it doesn't really give you any kind of category because you can name one thing in many different ways and that won't be good for us. So how it started, it should kind of like keep on going on that direction and not really diverge a lot. After this what happens? We ask the LLM for the different categories and their connections. I mean not connections, the categories and the Subcategories, just name them basically. So we check the frequency, we form all of these like small, small collections. And when we are following, when we are creating these collections, what we are saying that hey, now we have this, let's say 15 or 16 memories together, we are forming this collection.

Biswaroop Bhattacharjee [00:29:32]: Let's give it a description, like what the whole thing is about. So now the description will be sort of like what are the key things that we are keeping in the memory? So this is something that happens in the background, by the way. It's not something is actively happening when you are like using the memory system. So it keeps on running on the background, which keeps on checking that if it has reached a certain threshold or not. If it did, then start a background process to get a collection out of it, where you give it a summary which we are also calling as a description and give it a query helper. So what query helper is, this is something that will come at the time of retrieval. It kind of, it is a prompt, so it's a meta string, you can say, which will help you to create a prompt on how to query this.

Demetrios Brinkmann [00:30:28]: And so when you're making that background call and you're putting it into, you're inserting it into memory, you're doing both at the same time. One for the hierarchical structure and then also all of the stuff that you were talking about before with is defined by and what the score is. And so it's one LLM call that will give you all of that and then you can parse that out and say, all right, we have everything we need now for both of these structures.

Biswaroop Bhattacharjee [00:30:55]: These two are happening parallelly in the background. So it's not really synchronous. So they don't really affect the latency together. So one we are calling as the global search, it searches things like globally and the other we are calling as like auto collections search. So this is like a very constrained and narrowed down search if you look at it. So at the end we take some of this and take some of this and only take the most important ones that matches the most with the query, like what actually we are looking for and just feed that or return that as a memory retrieved item. You can say. So yeah, to continue about I think how I was talking about the retrieval of the auto collections.

Biswaroop Bhattacharjee [00:31:43]: So we sort of like form this query helper which is like how should you even go about querying this collections? Because now if the queries like fix this now if it matches with the collection. Python, sorry, collection work. Let's Say now in work, the query helper would be based on the description. Does it really help to even query this? So there's like two options. One is to query or even not query. So now if you actually want to query this whole collection, then how should you modify this query? Because in Rag there is this concept of like query expansion or rephrasing where you actually rephrase the query based on what the context is. Sometimes like fix this might be related to the previous 10 tens that you are talking about because if you don't provide it some sort of context it can't understand. So query helper takes into account like the context and this new query and it checks if it's relevant to this collection or not because it has access to a description also.

Biswaroop Bhattacharjee [00:32:55]: And since it has all of these things, it generates an answer, yes or no. If it's yes, it modifies the query, to something like "fix this thing about Python," so the whole query becomes retrievable, searchable. Okay, since I mentioned this point, there is one key thing. At retrieval time, on the smart collection side, we do two things. One is we take this query and do a semantic search over all the collections we have, based on their descriptions, and figure out, let's say, the top four most similar collections. Now, that match gets only 30% importance, so we form a sort of composite score, because currently we give it only 30% importance.

Biswaroop Bhattacharjee [00:34:04]: We give 70% importance to what we query inside all of these top-K collections. Each of these collections might have 15 or 16 memories, or whatever threshold you've set. Among them we do a semantic search with the modified query, for each of them, by the way, and that gets 70% importance. Finally, from the two scores, 30% and 70%, we come up with a composite score for each memory point. Then we sort it and see which are the most relevant ones. Let's say you select only the top J out of the final composite-scored memories you've retrieved. Then you take that from the auto-collections retrieval, take the whole global search retrieval into account as well, select however many you need, and give it back as something the LLM system can use.
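The two-stage scoring described above (30% on the collection-description match, 70% on the within-collection match) can be sketched as follows. The data shapes and helper names are illustrative assumptions; only the top-K/top-J structure and the 0.3/0.7 weights come from the conversation.

```python
import math

def cosine(a, b):
    # Plain cosine similarity over list-of-float embeddings.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def composite_search(query_vec, collections, top_k=4, top_j=5,
                     w_collection=0.3, w_memory=0.7):
    # Stage 1: rank collections by description similarity, keep top-K.
    ranked = sorted(collections,
                    key=lambda c: cosine(query_vec, c["desc_vec"]),
                    reverse=True)[:top_k]
    scored = []
    for coll in ranked:
        coll_score = cosine(query_vec, coll["desc_vec"])
        # Stage 2: score each memory inside the surviving collections.
        for mem in coll["memories"]:
            mem_score = cosine(query_vec, mem["vec"])
            composite = w_collection * coll_score + w_memory * mem_score
            scored.append((composite, mem["text"]))
    # Keep only the top-J memories by composite score.
    return [text for _, text in sorted(scored, reverse=True)[:top_j]]
```

In a real system the stage-2 search would hit a vector index rather than loop in Python, but the composite-score arithmetic is the same.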

Biswaroop Bhattacharjee [00:35:11]: Yeah, that's like the whole overview of it, I guess.

Demetrios Brinkmann [00:35:15]: What about when you want time-based questions, or recency? Say I tell it, "tomorrow I'm doing this," but then tomorrow becomes last week.

Biswaroop Bhattacharjee [00:35:29]: I realized this a bit late while we were building Cortex, but it has support for this. There are two cases I can see right now. One is recency-based queries, and the other is queries over a particular date range. A date-range query might be something like, "do you remember what happened between March 2025 and April 2025, in the first 15 days?" And the other can be something like, "hey, what did I talk about today, or yesterday?" So for these date-based queries you have two options. One is to provide a date range, and the other is what we're calling a temporal weight.

Biswaroop Bhattacharjee [00:36:32]: So what is temporal weight? The weight can be anywhere between 0 and 1. A temporal weight of 0 means you don't give any bias to recency; retrieval just happens the normal way, the whole hybrid search. But if the temporal weight is 1.0, you do the retrieval from just the STM, the short-term memory. Since short-term memory is a window, you give importance to the most recent items only. And then the temporal weight can be somewhere in between, let's say 0.7.

Biswaroop Bhattacharjee [00:37:25]: At 0.7, at retrieval time we go to the long-term memory with the whole query. And if you remember, at search time we always have a limit on how far to look back, the size of K. We multiply that by some factor based on the weight, so we get an even bigger window, and then we take all of it into consideration. Because the auto-collection search has the composite-score mechanism, it doesn't matter if one collection gets a higher score, since it only carries 30% importance; 70% of the importance might sit on something lower down the list. So overall the composite score gives you a better distribution.
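The temporal-weight routing just described might look roughly like this. The endpoints (0 = no recency bias, 1.0 = short-term memory only) come from the conversation; the exact expansion factor for K and the function names are assumptions.

```python
def retrieve(query, stm, ltm_search, base_k=10, temporal_weight=0.0):
    """Route retrieval based on a recency bias in [0, 1].

    stm: short-term memory window (most recent items last).
    ltm_search: callable (query, k) -> list of memories; stands in
    for the hybrid long-term search.
    """
    if temporal_weight >= 1.0:
        # Pure recency: answer from the short-term window only.
        return stm[-base_k:]
    if temporal_weight <= 0.0:
        # No recency bias: plain hybrid search over long-term memory.
        return ltm_search(query, base_k)
    # In between: widen the look-back window proportionally to the
    # weight, then let composite scoring re-rank downstream.
    expanded_k = round(base_k * (1 + temporal_weight))
    return ltm_search(query, expanded_k)
```

The intermediate branch only widens the candidate pool; the actual recency preference still emerges from scoring, which matches the "bigger window, then composite score" framing above.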

Demetrios Brinkmann [00:38:36]: Okay, I think I'm understanding this. Is it keywords that trigger how much weight you're giving to long-term versus short-term?

Biswaroop Bhattacharjee [00:38:48]: With keywords, yes. So we have a whole set of keyword checks.

Demetrios Brinkmann [00:38:52]: If I say "last week," that's a keyword, or if I say "last year," that's a keyword, and that determines K. Yes, exactly.

Biswaroop Bhattacharjee [00:39:00]: So "recent" would have something like 0.6 as the weight; we shouldn't keep it much higher than that. "Last week" can have its own value. So the temporal weight is decided based on what the keywords are, and we have a huge set of keywords. We did it this way because it's efficient, but an even better method would be to ask an LLM, if you're okay with more latency on the whole system.

Biswaroop Bhattacharjee [00:39:34]: But that's something I think we can all improve, how we currently decide the weight; the functionality is there. And there's one more parameter where we can pass date ranges, so it constrains the whole search to between those two dates, and it uses the auto-collection thing and the hybrid global search within that date range only. Whenever you take a query, you can always extract a date range from it using the whole context before querying the memory system, so you have a date range to look within.
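A keyword-to-weight lookup like the one described could be as simple as the sketch below. The keyword set and nearly all the values are invented for illustration; the transcript only anchors "recent" at roughly 0.6 and says the real system uses a much larger keyword list.

```python
# Illustrative mapping from recency keywords to temporal weights.
# Only "recent" ~ 0.6 is grounded in the conversation; everything
# else here is an assumed placeholder.
KEYWORD_WEIGHTS = {
    "just now": 0.9,
    "today": 0.8,
    "recent": 0.6,
    "yesterday": 0.6,
    "last week": 0.4,
    "last month": 0.2,
    "last year": 0.1,
}

def temporal_weight(query: str) -> float:
    q = query.lower()
    # First matching keyword wins; 0.0 means no recency bias,
    # i.e. fall back to the plain hybrid search.
    for keyword, weight in KEYWORD_WEIGHTS.items():
        if keyword in q:
            return weight
    return 0.0
```

This is the cheap path the speaker describes; swapping the lookup for an LLM call trades latency for robustness to paraphrases.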

Demetrios Brinkmann [00:40:12]: So it seems like you've got a lot of data you're pulling from, whether it's the short-term context, and you're deciding how much short-term versus long-term versus the hybrid search, or the file structure, for lack of a better word, I can't remember what you were calling it, and the knowledge-graph style.

Biswaroop Bhattacharjee [00:40:33]: Yeah.

Demetrios Brinkmann [00:40:34]: Do you find that it all kind of collapses on itself if there's a right or wrong answer? Because I imagine it all leads you to the same place if there's one answer. But maybe when it's very fuzzy, you can just get this bloat of a whole ton of noise.

Biswaroop Bhattacharjee [00:40:53]: Yeah, yeah. First, on the fuzziness: when similar things come up, we obviously do deduplication there, because at the core the memories are represented by an ID in the database, so you can deduplicate on that. Then, how much noise you actually incorporate depends on your limit, and what I mean by limit is how many memories you actually want to retrieve. So you can always tone it down.

Biswaroop Bhattacharjee [00:41:26]: It can be, "I just want two memories," and that reduces the noise by itself. And if you're losing out on something, just increase the K, so you get more memories out of it. You can also totally shut down the global search; that makes it super constrained.
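The merge-and-deduplicate step across the two retrieval paths could look like the sketch below. The memory record shape and field names are assumptions; the grounded parts are ID-based deduplication and the hard limit on how many memories come back.

```python
def merge_results(global_hits, collection_hits, limit=5):
    """Merge the two result lists, drop duplicate IDs, cap at limit.

    Each hit is assumed to be a dict with at least an "id" key, and
    each list is assumed pre-sorted by relevance. Collection hits are
    taken first since they come from the more constrained search.
    """
    seen, merged = set(), []
    for mem in collection_hits + global_hits:
        if mem["id"] not in seen:
            seen.add(mem["id"])
            merged.append(mem)
    return merged[:limit]
```

Setting `limit` low (or passing an empty `global_hits` list when the global path is disabled) is exactly the noise-versus-recall dial described above.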

Demetrios Brinkmann [00:41:51]: Can you do that on a case-by-case basis, or do you just... oh, really?

Biswaroop Bhattacharjee [00:41:54]: Yeah, it's just a flag in the codebase. Global search gets totally shut down and only the auto-collections thing keeps working, so you get something much more constrained. If it's a case where you don't want any false positives, that would probably be the way.

Demetrios Brinkmann [00:42:13]: Have you messed around with... I feel like there's potentially some cool stuff you could do where it's cascading: first you try with one, and if it doesn't work, you go a little bit deeper, and a little bit deeper. Maybe you try with just the short-term memory plus hybrid, or if you can't find it, then you add in the knowledge graph, and if you can't find it there, then you say, all right, let's try again with everything and see if we missed something.

Biswaroop Bhattacharjee [00:42:44]: Or let's increase K. This is something I have in the plan. Currently, the whole of Cortex is something a human is using, right? And it has a lot of these different parameters. But what I want to convert it into is a tool, where it's not humans using Cortex, it's agent systems actually using Cortex. So imagine memory as a tool. Currently the MCP trend is skyrocketing, so I guess it makes sense to call it an MCP tool.

Biswaroop Bhattacharjee [00:43:22]: But imagine it's not us controlling these tweaking parameters, but an agent. If it's an agent, it's automatically processing your request and deciding, hey, what should the date range be while calling Cortex? It has Cortex as a tool, so it can tweak all of these things and see which parameters give it the best result.
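The cascading idea floated here, escalating from the most constrained search to the broadest only when nothing comes back, could be sketched as below. The tiering is speculative, a proposal in the conversation rather than a shipped Cortex feature, and the callables are hypothetical.

```python
def cascading_retrieve(query, searchers, min_hits=1):
    """Try searchers in order (cheapest/most constrained first).

    searchers: list of callables query -> list of memories, e.g.
    [stm_plus_hybrid, with_knowledge_graph, everything_wide_k].
    Stop at the first tier that returns enough hits.
    """
    for search in searchers:
        hits = search(query)
        if len(hits) >= min_hits:
            return hits
    return []  # nothing found at any tier
```

An agent driving Cortex as a tool could implement this loop itself, widening parameters (more tiers, bigger K) between calls, which is the direction described above.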

Demetrios Brinkmann [00:43:50]: Okay, let's get some coffee from Tiero and then we can. All right, we're back. Maybe we can talk now for a second about context engineering and how this all fits into that?

Biswaroop Bhattacharjee [00:44:04]: Yeah, sure. We at Prem are thinking very deeply about how to create agentic systems, and as we build one internally, we've realized that when you give agents all these different tools, keeping the tool usage, and the number of tools each agent has, in check is highly, highly important. Making sure there's only one thing on the memory side can be crucial.

Demetrios Brinkmann [00:44:40]: Because now you're eliminating noise, in a way. If you only have one tool for memory, then you don't have to go and use up all of this memory-search aspect if it's not needed.

Biswaroop Bhattacharjee [00:44:58]: Yeah, that reduces a lot of noise in the number of tools it has access to. There's a paper out there showing that as the number of tools goes up, the task completion rate of an agent actually goes down. The graph is kind of like this, I would say. Yeah.

Demetrios Brinkmann [00:45:20]: And I've heard people talk about how much more expensive it is if you now have to give access to all these different tools. Your input tokens go up.

Biswaroop Bhattacharjee [00:45:28]: Yeah, yeah. The agents just get more and more confused with so much noise from all the different tool definitions. There are strategies to eliminate that; there's been a very nice blog about this same context engineering by Manus, where they talk about prefix caching and how to mask a tool from the LLM so that it doesn't use it when it's not required, or doesn't even have it in its context. All of these new patterns are emerging as a whole, the whole context-engineering thing. It was all prompt engineering before, but it's so much bigger now, because how you use these tools and how you feed in all the information to complete a task has become the main thing, and more and more people are working on it. Giving the whole memory system as a tool to an agent makes a lot of sense, because we don't have these static flows anymore, where only for certain cases a human would retrieve what's required; an agent can do the same thing.

Biswaroop Bhattacharjee [00:46:42]: And we're constraining how much freedom to give the agent by working on the memory tool itself. The way we're building Cortex, we're deciding how much freedom to give the agent that uses it as a tool: we can give it a lot of freedom with a lot of different parameters, or we can constrain it down to very minimal parameters. That's one way, I think, we can reduce entropy overall. Yeah.
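A hedged illustration of what "memory as a single tool" might look like, with the knobs discussed in this conversation (K, temporal weight, date range, the global-search flag). The schema loosely follows common function-calling conventions; it is not Cortex's actual API, and exposing fewer of these parameters is exactly the freedom-versus-constraint dial described above.

```python
# Hypothetical tool definition an agent framework could register.
# Every field name here is an assumption for illustration.
MEMORY_TOOL = {
    "name": "search_memory",
    "description": "Retrieve relevant memories for the current task.",
    "parameters": {
        "query": {"type": "string"},
        "k": {"type": "integer", "default": 5},
        "temporal_weight": {"type": "number", "minimum": 0, "maximum": 1},
        "date_range": {"type": "array", "items": {"type": "string"}},
        "use_global_search": {"type": "boolean", "default": True},
    },
}

def constrained_view(tool, allowed=("query", "k")):
    """Strip the schema down to a minimal-freedom variant."""
    params = {name: spec for name, spec in tool["parameters"].items()
              if name in allowed}
    return {**tool, "parameters": params}
```

`constrained_view(MEMORY_TOOL)` yields the minimal-parameter version; handing the agent the full schema is the maximal-freedom version.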

Demetrios Brinkmann [00:47:11]: Last thing: you're interested in vision memory also.

Biswaroop Bhattacharjee [00:47:16]: Yes. This is a coming-soon feature in Cortex that I'm working on right now. There are some nice implementations by a lot of different people. Currently, people are thinking a lot about how to represent all forms of senses, or signals, I would say, as part of agentic memory. Because right now it's all text; everything people are working with is just text. But the trend I'm seeing is people diving much more deeply into video, vision, and audio as well. To see the bigger picture, I think we have to target all five senses.

Biswaroop Bhattacharjee [00:48:07]: The way humans see, feel, and experience different things through their five senses, something similar has to emerge from the agent ecosystem, because all of these have to merge together, and there should be a way for them to intercommunicate. The vision memory should be able to communicate properly with the textual memory it has access to day to day, or it can be just some audio. Obviously, you can compress these down to a single domain, like compressing a video down to just text, so everything interacts as text at the base, and the same for audio.

Demetrios Brinkmann [00:48:48]: Yeah, yeah.

Biswaroop Bhattacharjee [00:48:49]: But you're losing context there. This is not lossless compression; you're compressing down the video and losing a lot of context, even if you explain very nicely what the video is about. There's this nice thing that got released by Memories AI. They're working on something they call large visual memory models, or something like that, where they process a whole video to make it more searchable, more indexable. And this indexing isn't just converting it into text; it's more like an embedding.

Biswaroop Bhattacharjee [00:49:35]: But feel free to check out Memories AI. I think that's the start of what we're going to see in the coming months, say the next six to twelve months, where we're able to properly index and store video data, or any sort of vision data, or audio. At the core it will just be numbers, just vectors, or however you want to represent them. It shouldn't be text, because text should be a more high-level view of it. At the core it should all be the same thing, something that keeps most of the context without any sort of lossy compression, I would say.
