Beyond Prompting: The Emerging Discipline of Context Engineering Reading Group
SPEAKERS

I'm a tech entrepreneur and I spent the last decade founding companies that drive societal change.
I am now building Deep Matter, a startup still in stealth mode...
I was most recently building Telepath, the world's most developer-friendly machine learning platform. Throughout my previous projects, I learned that building machine learning-powered applications is hard - especially hard when you don't have a background in data science. I believe that this is choking innovation, especially in industries that can't support large data teams.
For example, I previously co-founded Call Time AI, where we used artificial intelligence to assemble and study the largest database of political contributions. The company powered progressive campaigns from school board to the Presidency. As of October 2020, we had helped Democrats raise tens of millions of dollars. In April 2021, we sold Call Time to Political Data Inc. Our success, in large part, was due to our ability to productionize machine learning.
I believe that knowledge is unbounded, and that everything that is not forbidden by laws of nature is achievable, given the right knowledge. This holds immense promise for the future of intelligence and therefore for the future of well-being. I believe that the process of mining knowledge should be done honestly and responsibly, and that wielding it should be done with care. I co-founded Telepath to give more tools to more people to access more knowledge.
I'm fascinated by the relationship between technology, science and history. I graduated from UC Berkeley with degrees in Astrophysics and Classics and have published several papers on those topics. I was previously a researcher at the Getty Villa where I wrote about Ancient Greek math and at the Weizmann Institute, where I researched supernovae.
I currently live in New York City. I enjoy advising startups, thinking about how they can make for an excellent vehicle for addressing the Israeli-Palestinian conflict, and hearing from random folks who stumble on my LinkedIn profile. Reach out, friend!

Matt is CTO and co-founder at Fuzzy Labs, a consultancy dedicated to using MLOps to help technical teams get the most out of AI and ML. He enjoys AI, bio-inspired computing, and functional programming.


Arthur Coleman is the CEO at Online Matters. Additionally, Arthur Coleman has had 3 past jobs including VP Product and Analytics at 4INFO.
SUMMARY
LLM performance isn’t just about the model—it’s about the scaffolding we build around it. “Context Engineering” reframes the conversation: prompt design is the toy problem, while the real frontier is systematically engineering the information environments that shape model behavior. Surveying 1,400+ papers, this work defines the field’s taxonomy—retrieval, generation, processing, management—and shows how it powers RAG, memory, tool use, and multi-agent systems. The survey also reveals a paradox: LLMs can absorb increasingly complex contexts but remain clumsy at producing equally complex outputs. This tension signals a coming split between research obsessed with cramming more into context windows and the harder question of whether models can ever match the sophistication of what they’re given.
TRANSCRIPT
Arthur Coleman [00:00:00]: What I'm showing on screen. So. And I'm getting pinged by so many people joining at the same time, I'm doing bit slicing here. We have a really interesting discussion this morning on a paper that I'm sure you've all read word by word, I'm positive. At 60 pages, I can tell you I read it again, at least a part of it, last night, and with just the amount of content, you know, catchphrases, words on different subjects, I was brain dead by the time I finished. So I'm looking forward to getting a summary of it all from our speakers, who are really quite good at making things simpler for all of us. So let me introduce them to you again.
Arthur Coleman [00:00:48]: I'm getting pinged all over the place, so I apologize if I seem a little off. Rohan Prasad is a senior staff engineer at Evolution IQ. Rohan's company uses AI to do insurance claims; they claim that they are making things more humane. Rohan, I gotta tell you, I'm not sure I'm that excited about that, and we could have a long conversation about the ethics of AI and insurance, so we'll take that offline. Just a wise-ass comment from the moderator. Adam Becker is the founder of Headon and one of our continuing and ongoing speakers and organizers of the MLOps community.
Arthur Coleman [00:01:30]: He's been around since the beginning and is well known to everyone, and I've yet to challenge him on his knowledge of Greek and Latin history even though I keep threatening to do so. And then we have Matt Squire, who's the CTO of Fuzzy Labs. He's also very active; I don't know if you're one of the organizers of our London group, but I got your newsletter last night, which I think is called MLOps WTF, a typically British inside joke there. So I enjoy that.
Matt Squire [00:02:03]: WTF means whatever you want it to mean.
Arthur Coleman [00:02:06]: Exactly. Our contact info. We'd love to talk to you, love to connect with you. Our LinkedIns are there. I'm sorry that they're not clickable for you. Can somebody put them in the chat?
Rohan Prasad [00:02:19]: Yep, I'll do that.
Arthur Coleman [00:02:20]: Don't worry about it. Thank you. And when you do connect with us, please indicate in the note, where it says you can send a note with the request, that you met us at an MLOps event, because I cannot tell you how many invites I get every day from people I've never met, and I tend to reject them unless I know them. So please make sure you identify yourself as coming from this group, because I will always accept a request from this group. Okay, so today we're going to talk about context: everything you ever wanted to know about context engineering but were afraid to ask. Adam is going to start on the three sort of foundational areas: context generation, context management. And I'm working from memory here.
Arthur Coleman [00:03:11]: What's the third one? Context processing. There we go. Then Rohan's going to go into RAG and memory systems for about 18 minutes, and then Matt's going to talk about tool-integrated reasoning and multi-agent systems. We are not going to cover every part, especially the evaluation side and the dataset side, because we just don't have time. We wanted to make sure that what we did cover, we would really cover well and give you insight into those pieces. So given limited time, I'm going to turn it right over to Adam now. Sorry, one last thing before I do. For those of you who have been here, you've heard me give this spiel before.
Arthur Coleman [00:03:48]: These sessions belong to you. They are interactive. We're going to talk for about 45 minutes today and then have a discussion. The more you participate, the better the outcome. Thank you for starting the recording. This is a no-judgment zone, and anyone who's asked questions here knows that we are a very understanding and accepting group. If you've been here before, you know we're not sitting in judgment; we want you to ask the questions.
Arthur Coleman [00:04:18]: It's a safe space to learn. The folks who have participated, like Stephen, are really tremendous additions to the conversation, so please do jump in. There's a Google Doc; the link is in Slack and in the chat, so please use that. And the way we do questions is a queue, first in, first out basically; it's just the order the questions are asked.
Arthur Coleman [00:04:45]: So put your name in, whoever asked it, and then the question. I will ask you to please ask the question yourself; I'm not going to repeat it. So if you put a question in, I will say, you know, John, please go ahead and ask your question. Okay. All right. Oh, and don't forget to fill out the post-event survey. And with that I will turn it over to the guys.
Adam Becker [00:05:11]: Arthur, thank you very much. In addition to that link, I want to also put a link to the Miro board in case people want to follow along; I'll put it in the chat, and I'm going to share my screen to look at that Miro board. And I want to just reflect on this distinction between context engineering and prompt engineering. Over the last couple of years, everybody's been hearing about prompt engineering, but more recently we've been talking about context engineering, and this paper really seems to be trying to parse the difference here. Okay, so I want to zoom out, because this is a survey paper, after all, and so it gives us an opportunity to take a step back and reflect on everything that's been going on. If you think about it, over these last few sessions of the reading group, what we've been primarily doing is learning about different research on how to manipulate the context given to LLMs.
Adam Becker [00:06:03]: So this is what we've been doing: we've really just been managing the manipulation of context. And the reason is that while LLMs show very strong capabilities, those capabilities are nevertheless fundamentally governed by the context that you give them. And so the more we ask from LLMs, you can sort of think about it as the LLMs asking more of us in return. These are no longer just input-output, very specific task-following instructions.
Adam Becker [00:06:32]: No, no, no, now they're becoming the complete reasoning engines for entire applications, and so they demand a much more nuanced understanding of the context that you feed them. Okay. So as we ask more from LLMs, they in return ask us to manage the context better. This is a survey paper, which means they reviewed probably over a thousand different research papers from the last few years, and they tried to give us a map that allows us to navigate this landscape.
Rohan Prasad [00:07:00]: Right.
Adam Becker [00:07:00]: So we should be able to understand where all of the research goes and have some ability to intuit what's happening in the field. And so what they're trying to do is come up with a framework that allows us to do this, and they've come up with a few different things. The first is foundational components. These are the building blocks: context generation and retrieval, context processing, context management. We're going to go into them in a little more detail, but these are the building blocks. On top of those building blocks, we can now begin to build various implementations.
Adam Becker [00:07:30]: And you see a lot of papers that examine a particular implementation: they fit together some elements from context processing, some elements from context generation, and then there's a big paper about the implementation. Then we have evaluation, which Arthur said we're not really going to be talking about, and we're also not going to talk about future directions and challenges. Those are fascinating sections and they're not very long. So the rest of it is, yeah, 60-something pages, plus another hundred pages or so of references, evaluation, future challenges. I think those are really fun if you just want to spend five or ten minutes. So the way we're going to handle it now is I'm going to go over the foundational components. Obviously we're not going to zoom into each paper they mention, but I want to at least motivate why we have each of these as a component and put in your mind some vision, some pictures of what kind of research goes into each of them, so that when I hand the mic over to Rohan and to Matt, they'll be able to cover implementations in much greater detail.
Adam Becker [00:08:28]: But at least you'll have it in your mind how all of this fits together. Okay, so we don't have much time; I'm going to go fairly quickly over all of this. But you can follow along in the Miro board, you have the link. All right, so context engineering versus prompt engineering. They go about it by trying to be very formal in their definition.
Adam Becker [00:08:48]: We can talk about whether or not this was a wise thing to do, but the way they go about it is like this. They say the model, the standard probabilistic autoregressive model parameterized by theta, generates an output sequence given an input context C, and it's trying to maximize the conditional probability. This is just a traditional, standard autoregressive large language model. Now, historically, what do we mean by this C, this context? In the paradigm of prompt engineering, C is just a monolithic static string. It's just text; I give you the text, that's it. Now they're saying, as we have more demands from LLMs, this is insufficient.
Adam Becker [00:09:27]: So we have to reconceptualize this context, instead of one monolithic string, as a dynamically structured set of different components. And then all these different components are going to be stitched together and orchestrated in some way using an assembly function. They spend another couple of pages really motivating how all of this is a large optimization problem, and therefore you should be bringing in intuitions and mathematical constraints. I'm not sure that's necessary for us right now, but we should at least reflect on what the different components are and how they're assembled together. The different components might be c_instruct, the instructions that you give the system; the external knowledge that you're giving the system; some tools, the functions that it's able to call; the persistent memory; the dynamic state of the user, the world, the multi-agent system; and then the user's query itself. So the idea is that unlike in the world of the static string, now we have to manage all these different components.
Adam Becker [00:10:28]: So you can see that the difference between context engineering and prompt engineering really comes down to this: the model is not just taking in a prompt, it's taking in a dynamically structured, assembled set of different things. And what we're trying to do is still maximize the performance of the system, but we have to maximize it given certain constraints. The primary constraint, the guiding thing happening here, is something we saw a couple of weeks ago with context rot. The constraint is that we have this demon here: the self-attention mechanism imposes quadratic computational and memory overhead as the sequence length increases. That's the problem, and we're going to try different ways to escape this problem.
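To put rough notation behind the formulation Adam is describing (the symbols follow the survey's general setup; the exact component names are approximate):

```latex
% Standard autoregressive generation conditioned on a context C
P_\theta(Y \mid C) = \prod_{t=1}^{T} P_\theta\!\left(y_t \mid y_{<t},\, C\right)

% Prompt engineering: C is a single static string (the prompt)
% Context engineering: C is assembled from components by an orchestration function A
C = A\!\left(c_{\text{instr}},\, c_{\text{know}},\, c_{\text{tools}},\, c_{\text{mem}},\, c_{\text{state}},\, c_{\text{query}}\right)

% The goal is to pick the assembly that maximizes expected output quality,
% subject to the context-window budget that quadratic attention makes expensive
A^{*} = \arg\max_{A}\; \mathbb{E}\!\left[\,\mathrm{Reward}\!\left(P_\theta(\cdot \mid C)\right)\right]
\quad \text{s.t. } |C| \le L_{\max}
```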
Adam Becker [00:11:15]: Okay, so just keep in mind this demon is always here. The different foundational components are the following. First is context generation and retrieval. The best analogy I came across is that this is just sourcing the ingredients: we're going to take some things we have in the fridge, sometimes we have to go to the store, whatever, and we're going to take all the different ingredients and put them on the counter. Next is context processing.
Adam Becker [00:11:39]: Right. So we're going to cook the ingredients together in different ways: we're going to slice it, heat it, microwave it, boil it, do different things. And last is context management, which you can think about like the pantry system. So what gets to stay on the counter, what goes in the fridge, and at some point we need to get rid of something. This morning my girlfriend asked me if I'm going to eat the soup today, because if I'm not, then she's going to take it from the fridge and put it in the freezer. I thought to myself, that's context management. She probably read the survey paper.
Adam Becker [00:12:10]: So what we're going to do is go through each one of them, and I'm going to give you some intuition, maybe one or two minutes per component. Okay, so let's start with sourcing the ingredients. What kind of research has gone into sourcing the ingredients? They break it down into a few different components. The first is prompt engineering and context generation. Just think about zero-shot, one-shot, few-shot; people have done research on these things. If you're not familiar with them, just follow along.
Adam Becker [00:12:39]: This is when you don't give it any examples, when you give it one example, or when you give it a bunch of examples; there's still a lot of research being done in this space. Next is chain of thought. So perhaps you're not just asking a question; perhaps the prompt itself decomposes the problem into multiple smaller subproblems and intermediate reasoning steps. So rather than just a basic input-output, perhaps it's a chain of thought: you say, okay, first reason about this like this, and then like that, and then like that.
Adam Becker [00:13:06]: Or perhaps it's multiple chains of thought, right, and then you're going to pick the best one. Or perhaps it's a tree of thoughts, where you're not just picking the best one: maybe you hit a dead end, so you retrace your steps and then go on to another branch. Then there's also a bunch of research going into higher-level and more sophisticated topologies. This comes from Graph of Thoughts, quote: "When working on a novel idea, a human would not only follow a chain of thoughts or try different separate ones, as in tree of thoughts, but would actually form a more complex network of thoughts."
Adam Becker [00:13:40]: For example, one could explore a certain chain of reasoning, backtrack and start a new one, and then realize, oh yeah, I had an idea from a previous thought, I should be implementing it here. So this is graph of thoughts. Now you can zoom out, and there's a bunch of research asking how we should think about the entire class of topologies here, what the scope of these topologies should be, and what works and what doesn't. Cognitive architecture integration, for example, is where you give it very specific cognitive functions: clarify the goal, decompose, filter, reorganize, pattern-recognize; a bunch of psychologists porting ideas over from psychology into figuring out how to solve these problems. Okay, so all of this is prompt engineering and context generation, still under sourcing the ingredients, right? That's the kind of research that goes into it. Next is external knowledge retrieval. The idea is that a lot of the knowledge you want to add to your context is not parametric.
Adam Becker [00:14:34]: Right. It's not already embedded in the parameters of the model; you need to get it from somewhere, and that's where RAG comes in. I imagine everybody's familiar with RAG. There's a lot of different research that still goes into RAG. So for example, ComposeRAG. Here you go.
Adam Becker [00:14:47]: An agent is trying to decompose a query and then do retrieval for the different components. For example: "Mick Carter is the landlord of a public house located at what address?" Okay, let's decompose it. What is the public house where Mick Carter is the landlord? What is the address of the answer to number one? And we're going to do RAG for each one of these, right? So this is one example. We also have recursive embedding. These guys, RAPTOR: "We introduce the novel approach of recursively embedding, clustering and summarizing chunks of text, constructing a tree with differing levels of summarization from the bottom up." So there's a lot of RAG stuff.
Adam Becker [00:15:23]: Very interesting. This one dynamically decides when we should even retrieve information and when we should not. We have automatic prepending from knowledge graphs; these guys specifically: "We first retrieve the facts relevant to the input question from a knowledge graph, based on semantic similarities between the question and its associated facts, and then we prepend the retrieved facts to the input question in the form of a prompt." So all of this is RAG stuff, still an active line of research. Next is dynamic context assembly.
Adam Becker [00:15:54]: So now that you have all of these different components, let's say relational data, visual data, textual data, how do you actually feed it into an LLM? There are a lot of things to be done at the architectural level, but you could also just invest in a verbalizer that converts all of these different modalities into text, and there's a bunch of research on that. There's also a bunch of research on optimizing the prompt. Sometimes you can optimize it with an evolutionary approach; this is called Promptbreeder, and you come up with various fitness scores to see which prompt is going to be the best one. And then we have a bunch of frameworks; you're probably familiar with some of them. So this is all for stitching together all these different things before you start to process them.
Adam Becker [00:16:39]: So processing is the cooking, right? You're cooking the ingredients together here. The problem again is this demon: the self-attention mechanism imposes computational and memory overhead. If that's the case, well, maybe we shouldn't be using the transformer architecture in the first place. Maybe we should use a non-transformer, or maybe we should do subquadratic-time modifications to the transformer. This is where Mamba comes in. Quote: "Many subquadratic-time architectures such as linear attention, gated convolution, recurrent models, and structured state space models have been developed to address transformers' computational inefficiency on long sequences." So they're questioning the entire premise here, and there's a lot of work being done on that. Another is the Toeplitz neural network for sequence modeling.
Adam Becker [00:17:26]: Quote: "While showing good performance, transformer models are inefficient at scaling to long input sequences, mainly due to the quadratic space-time complexity. To overcome this, we propose to model sequences with a relative-position-encoded Toeplitz matrix." There are also a lot of approaches that say, okay, fine, we're going to stay with the transformer, but how are we going to optimize it? This is all the research going into that; some of it is very interesting. Okay, so still cooking the ingredients: refining the prompt throughout. This one is Self-Refine. The idea is that an initial output generated by the base LLM is passed back to the same LLM to receive feedback. So an example: the user says, "I'm interested in playing table tennis."
Adam Becker [00:18:09]: The response is, "I'm sure it's a great way to socialize." And the feedback from the same LLM is: dude, you didn't even give them any information about table tennis. That's not good. Okay, so let's refine it. These are some ways to cook the information; a minimal sketch of that generate-feedback-refine loop is below. And then another thing that we saw a few sessions ago was DeepSeek-R1, which naturally learns to solve reasoning tasks with more thinking time. So all of this is still going on.
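A minimal sketch of that Self-Refine-style loop, assuming a hypothetical llm() helper rather than any particular framework:

```python
# Minimal sketch of a Self-Refine style loop: generate -> critique -> rewrite.
# llm() is a placeholder for whatever chat model you use; not the paper's implementation.

def llm(prompt: str) -> str:
    """Placeholder for a call to a chat model."""
    raise NotImplementedError

def self_refine(task: str, max_rounds: int = 3) -> str:
    answer = llm(f"Task: {task}\nGive your best answer.")
    for _ in range(max_rounds):
        feedback = llm(
            f"Task: {task}\nDraft answer: {answer}\n"
            "Critique this answer. If it is already good, reply exactly STOP."
        )
        if feedback.strip() == "STOP":
            break
        # Feed the critique back into the same model to produce an improved draft
        answer = llm(
            f"Task: {task}\nPrevious answer: {answer}\nFeedback: {feedback}\n"
            "Rewrite the answer, addressing the feedback."
        )
    return answer
```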
Adam Becker [00:18:37]: Next is relational and structured information integration. I think we should probably do a whole session on this; I just barely got to taste it. Cross-modal attention mechanisms learn fine-grained dependencies between textual and visual tokens directly within the LLM embedding space, and there's a lot more work going on inside LLMs architecturally: not just how to verbalize multimodality, but how to arm the LLM itself, architecturally, with multimodal capabilities. And then last is the pantry system. What stays on the counter? What goes in the fridge? The idea here is that modern LLM memory architectures employ sophisticated hierarchical designs to overcome fixed context window limitations. Some of these we've already looked at.
Adam Becker [00:19:18]: MemGPT basically mirrors the way that an operating system virtualizes memory. Quote: "To enable using context beyond limited context windows, we propose virtual context management, a technique drawing inspiration from hierarchical memory systems in traditional operating systems, which provide the illusion of an extended virtual memory via paging between physical memory and disk." There's a lot of work that goes into all of this; they mention a few more papers. A-Mem, I believe we looked at that: agentic memory with the Zettelkasten method. So all these different things for memory.
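A toy sketch of that operating-system analogy; the class and method names here are illustrative, not MemGPT's actual interface:

```python
# Toy sketch of OS-style context paging: keep a bounded "main context" that goes
# in the prompt, and spill older items to an external store that can be paged back in.
# Names and structure are illustrative only.

from collections import deque

class PagedMemory:
    def __init__(self, budget_tokens: int = 2000):
        self.budget = budget_tokens
        self.main_context: deque[str] = deque()   # what actually goes in the prompt
        self.archive: list[str] = []               # external storage ("disk")

    def _tokens(self, text: str) -> int:
        return len(text.split())                   # crude stand-in for a tokenizer

    def add(self, item: str) -> None:
        self.main_context.append(item)
        # Evict oldest items to the archive when over budget (paging out)
        while sum(self._tokens(t) for t in self.main_context) > self.budget:
            self.archive.append(self.main_context.popleft())

    def recall(self, query: str, k: int = 3) -> list[str]:
        # Page relevant archived items back in; real systems use embeddings here,
        # this uses naive keyword overlap just to show the shape of the idea.
        scored = sorted(self.archive, key=lambda t: -sum(w in t for w in query.split()))
        hits = scored[:k]
        for h in hits:
            self.add(h)
        return hits
```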
Adam Becker [00:19:54]: Another thing that's very interesting: here you go, modifications to the architecture itself with CAMELoT. The idea here is to sit on top of an existing backbone LLM and augment it with an AM module, an associative memory module. They draw the AM in the first attention layer here just as an example: keys and values are calculated for every token, and the keys are used to search for relevant memorized tokens in the memory bank and return them. So the idea is, how do you take even existing architectures and superpower them a little bit? That's the kind of work that goes into the pantry system, managing memory: what stays, what goes. Then there's the work that goes into cooking the ingredients: how do you handle long sequences, how do you refine and adapt existing prompts, and then multimodality, obviously. And then how do you source the ingredients: prompt engineering, context generation, all the interesting things happening with RAG, and then assembling it all together, perhaps optimizing the prompt in the process.
Adam Becker [00:20:55]: So those are the foundational components. And beyond that, I hope to just give you a sense of kind of like where the research is. And now we're going to see how some of this is actually being implemented.
Arthur Coleman [00:21:08]: Thank you, Adam. And let me say two things, both to the audience and to the speakers. I have learned, or we have learned, from prior sessions that cutting you off or jumping to the next topic doesn't work; there's a flow to the natural conversation. So I'm not going to cut you guys off at the front end, but I will limit us to leave time at the back end. So please be aware of time.
Arthur Coleman [00:21:32]: Rohan and Matt, I realize you may run a little long, and we may have a few fewer questions because there's so much we're trying to cover. But just so everyone knows, I'm not going to cut people off again; we did that before and it was very discombobulating for our viewers. With that, I turn it over to Rohan.
Rohan Prasad [00:21:52]: Everyone, can you guys all see my screen?
Matt Squire [00:21:55]: Cool.
Rohan Prasad [00:21:56]: So what I want to talk about here is a natural extension of what Adam was talking about. Adam was talking about what's really happening right now in context, and I want to deep-dive into a couple of different frameworks, specifically around RAG and memory systems. We all know roughly what RAG is: it's about being able to retrieve data from someplace and pass it into context. The specific thing we want to optimize here, as part of that function Adam was going over, is that knowledge component, c_know: how do we actually get knowledge that's relevant to a particular prompt, to a particular query? And what the paper proposes, going past very naive RAG approaches, which I think of almost as spray-and-pray, where you essentially just take your query, embed it, maybe toss it at 15 different things and hope you get back the right stuff with some light optimization, is modular RAG architectures, agentic RAG systems, and graph-enhanced RAG.
Rohan Prasad [00:23:03]: What's interesting, though, is that as you keep diving into these systems, you see that a lot of them, especially in that graph-enhanced RAG space, are doing a mixture of all three. So I figured it's helpful to first talk about what agentic RAG is and what modular RAG is, and then talk about what that looks like through the lens of a graph-based RAG system. Agentic RAG is very simply abstracting away certain parts of your data sources with agents. So instead of directly interfacing with a particular data source, or just trying to hit a data source correctly, you think about it more intelligently: how do I actually know whether I should hit this particular data source? How do I know whether I should use this data source versus this other one? Are there ways I can think about retrieving information from that data source a little more successfully? At a high level, in a modular RAG system, agentic RAG is actually woven in quite a bit, and we'll go through that in a bit more detail. But it starts off with the question or query we're getting. This is the user's prompt; it might already have some context, it might just have some detail. And the very first thing we should ask is: do we even need to retrieve things from a system? After that, we look into what we actually want to get out of the particular question.
Rohan Prasad [00:24:27]: Is it just one monolithic question that's easy to answer? Is it easy to break down? Then we look into various retrieval mechanisms: do we want to use dense retrieval like vector embeddings, or do we just want to do a simple keyword search? And this is the part where we want to iterate a little more. After we get some information, we want to see whether that information is relevant, but also whether it's more relevant than some other piece of information. This is where we loop back and forth, until we finally get all the answers to our sub-questions or our main question, and then we answer it. And this is where it starts getting more agentic, where we start using a pattern like LLM-as-a-judge. A more particular example: someone might ask a question like, "How many times did the plague occur in the place where the painter of The Worship of Venus died?" And really there are a lot of different questions in this. If I had to answer it, I'd probably break it up into three different components and look each one up. And that's exactly what we're talking about with modular RAG systems.
Rohan Prasad [00:25:28]: We take three separate components, we figure out how to answer each sub-component, and then we check: hey, is this a good enough decision? Maybe some of these are dependent on others. So for example: the painter of The Worship of Venus is Titian. That answers our first question, and it becomes a dependency for the second question, which in turn becomes a dependency for the third. So eventually we're able to get the answer of 22 just by working through all these pieces. A rough sketch of that decompose-and-resolve loop is below.
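A rough sketch of that decompose-retrieve-resolve flow; llm() and retrieve() are hypothetical helpers, not part of any specific framework named in the paper:

```python
# Sketch of a modular/agentic RAG flow: decompose the question, answer sub-questions
# in dependency order, then compose a final answer. llm() and retrieve() are placeholders.

def llm(prompt: str) -> str: ...                 # hypothetical chat-model call
def retrieve(query: str) -> list[str]: ...       # hypothetical vector / keyword / graph search

def answer_multi_hop(question: str) -> str:
    # 1. Decompose into ordered sub-questions; later ones reference earlier answers as {q1}, {q2}, ...
    plan = llm(
        "Break this into numbered sub-questions, one per line, using {q1}, {q2} "
        f"placeholders for earlier answers where needed:\n{question}"
    ).splitlines()

    answers: dict[str, str] = {}
    for i, sub_q in enumerate(plan, start=1):
        # 2. Resolve dependencies on previously answered sub-questions
        resolved = sub_q.format(**answers)
        # 3. Retrieve evidence and answer; an LLM-as-judge step could filter weak evidence here
        evidence = retrieve(resolved)
        answers[f"q{i}"] = llm(f"Question: {resolved}\nEvidence: {evidence}\nAnswer briefly.")

    # 4. Compose the final answer from the sub-answers
    return llm(f"Original question: {question}\nSub-answers: {answers}\nGive the final answer.")
```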
Rohan Prasad [00:26:26]: Where this actually gets interesting is when we start talking about graph-based RAG. GraphRAG is pretty simply taking your information and embedding it into a graph database so you can traverse a graph. LLMs do a lot better when they have access to structured information, and that's the real value proposition here: by capturing entities and their relationships to other entities, we get a very powerful mechanism for giving much more in-depth context, so we can see how various things relate. The tool I'm focusing on here is LightRAG. I was trying to bias towards tools that have some production use cases, that people are using today. The way LightRAG works is based on two flows: I'm going to go through the indexing flow and then the retrieval flow. From the indexing perspective, we first get a bunch of input documents, and then we segment them.
Rohan Prasad [00:27:06]: This is pretty standard stuff. Where we start getting into more interesting details is that we break those documents into chunks, and to identify the entities and relationships, we actually use LLMs to do that. This is where we start getting into more agentic RAG approaches: we're already injecting LLMs and agents into the pipeline. Then we curate various relationship and entity key-values, and based on these, we store them in two vector stores, one for the entities and one for the relationships. And after that we create a graph.
Rohan Prasad [00:27:44]: And then we go and optimize our graph: in essence, we merge identical entities, we merge relationships, and just make the graph nice and easy to traverse. Last but not least, we generate the vectors. This is where it comes back to taking these key-value pairs and generating entity vectors and relationship vectors, and we have our final knowledge graph. So it's not just a graph database; there are also vector stores on top of it. What this looks like from a retrieval perspective is that there's low-level retrieval and high-level retrieval; LightRAG does this in two different ways.
Rohan Prasad [00:28:22]: Low-level retrieval is focused on retrieving very particular entities and their closest nearest neighbors, and then high-level retrieval is really about traversing the graph. The way I like to think about this is that the two are better suited to answering two different types of question. One is more specific to a particular node. Let's say we indexed the book Dune. If you ask, "Who is Paul Atreides?", that's a particular node in the graph, and you might want to see what Paul Atreides' immediate nodes or relationships are to understand what's going on. But if you ask, "Who's the main character of Dune?", you might need to traverse the graph and actually understand how the various nodes connect to each other. In terms of the typical retrieval flow, low-level and high-level follow a very similar pattern.
Rohan Prasad [00:29:16]: They're just hitting different vector stores or using slightly different queries. You take your query, and the first thing you do is use LLM extraction to pull out keywords; jumping back, that's really a different form of question decomposition. So at this point we're taking the query, finding various keywords, decomposing it, making it easier to parse through our various systems, and then we're embedding those queries and hitting an entity vector store and a relations vector store. We're getting back some vectors; this is a very typical similarity search. We get back our top-k results, and then we go to our knowledge graph, which is our graph database.
Rohan Prasad [00:29:58]: And what we're getting here are the related relations, the related text, the top-k results, and the particular context. So that's in essence how this works. To show you a bit more of a live example, I spun up a really quick notebook. It's a bit contrived, but I spun up LightRAG: setting it up, initializing the RAG, and then going through and indexing a book. We're specifically indexing A Christmas Carol by Charles Dickens. What we're doing here is first embedding all the details and loading them into a key-value store, which is just the naive way to maintain our text context, but we're also storing it into a graph structure, storing our various entities and relationships.
Rohan Prasad [00:30:52]: And then what we can try doing is asking it a question. So, like: what are the themes of the book? What do the characters learn about each other? This is a bit of a contrived example, but you can see in this particular case, when we're naive, we're still doing okay on the themes. We get some understanding about character insights and learnings, but we're not really combining those as much as we'd like to. If we use an actual graph traversal or a hybrid method, you can see we're still getting very similar themes, which is what we'd normally expect, but we're now also getting that these particular entities are related to each other and what their relationship is. And we can also see that we're pulling various nodes of the graph and how they reference each other. Roughly what that notebook does is sketched below.
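Something like the following, based on LightRAG's public examples; constructor arguments and initialization details vary across versions, so treat it as a sketch rather than copy-paste code:

```python
# Sketch of the demo notebook flow using the LightRAG package; model and embedding
# configuration are omitted, and the exact API differs between versions.
from lightrag import LightRAG, QueryParam

rag = LightRAG(working_dir="./christmas_carol")   # llm/embedding functions would be passed here

# Indexing: chunks the text, uses an LLM to extract entities and relations,
# and builds the key-value store, the two vector stores, and the graph.
with open("a_christmas_carol.txt") as f:
    rag.insert(f.read())

question = "What are the themes of the book, and what do the characters learn about each other?"

# Naive mode: plain chunk similarity search, no graph traversal.
print(rag.query(question, param=QueryParam(mode="naive")))

# Hybrid mode: combines low-level (entity neighborhood) and high-level (graph traversal) retrieval.
print(rag.query(question, param=QueryParam(mode="hybrid")))
```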
Rohan Prasad [00:31:54]: Jumping to the next section, I think it's a really good segue between RAG and memory, because RAG and memory are actually very similar, except for some small, subtle differences. They have a lot of similarity in terms of the underlying tooling, the way you access data, and how you manage it. The fundamental difference is that RAG is stateless and memory is stateful. In theory, you should be able to send a request to an LLM from a RAG system and it shouldn't have any context of what happened the previous time. That often presents a challenge, especially when we want LLMs to understand historical context: what has it responded before? If you log into something like ChatGPT or Gemini, you can see that it's retaining your past history or your past conversations in a session, and that's where memory comes in. The way I like to think about it is that RAG is the agent's I/O into the world, memory is the agent's internal state, and agentic reasoning is the CPU which decides how it uses these various tools. Now, jumping more into memory systems.
Rohan Prasad [00:33:00]: The main part we're trying to optimize here is the context around memory, so really just knowing about prior interactions. The obvious example we often talk about is short-term mechanisms, but really we divide this into three categories: the input prompt, which is just what's being passed in; short-term memory, which is what's in your typical chat window, in your context; and long-term memory, which is where we start getting into more of the new tools that are coming out: how do we actually preserve user preferences, state, and history? The big thing here, and this goes back to the demon Adam was talking about, is that LLMs really struggle with reasoning over extended context. We don't want to just take the entire chat history and slam it into an LLM's context, because that ends up creating separate problems.
Rohan Prasad [00:33:50]: You could take the full memory, which is what some systems are doing today, or you could use no memory at all, but then you lose context. Really, what you want to get to is the sweet spot of limited memory: how do you efficiently pick which part of your history is relevant to your particular window? Where memory is a bit more nascent and new compared to RAG is that, while there are evaluation frameworks and metrics, the general assessment is that it's hard to know whether we're getting it right today. There are some ways we can evaluate it, but most of those are evaluated on training data, and we haven't really figured out how to evaluate this well. So that's just a caveat. One place where I think there's a good system is Mem0.
Rohan Prasad [00:34:43]: And I picked Mem0 very specifically because it actually has a lot of similarities to LightRAG, the system we just looked at; they're very similar, but they fundamentally solve slightly different problems. Mem0 works by taking messages and, once again, using an LLM to break out entities and relations, very similar to LightRAG from the earlier section. It then maintains a graph of memory: information on a particular user, a particular session, a particular episodic history. What it's doing is first checking whether any new information the LLM is producing conflicts with existing information.
Rohan Prasad [00:35:28]: Is the user providing anything that is new or different? For example, someone might say, hey, I like pizza, but then later say, hey, I don't like pizza. So how do we resolve that? Last but not least, that conflict detection, checking whether things in the graph differ, feeds an update resolver, which goes and makes sure the graph stays optimized and easy to traverse. I also really like the way they specify how they think about memory in three different contexts: factual, episodic, and semantic. The way I like to think about this, for a particular example, is that something factual might be: the user's favorite city is Kyoto, Japan.
Rohan Prasad [00:36:15]: Something episodic is that the last time the agent interacted with the user, in the last session, they ordered coffee from that favorite city. And then there might be semantic memory, and this is where we get back to the graph: maybe we can make the inference that the preferred coffee is from the favorite city. Now, again, talking through a particular example with Mem0, this is a pretty contrived but simple example. We start by initializing certain memories. In this particular case, I'm trying to create a personal assistant. It's really important for something like a personal assistant to retain memories of past conversations, because you're thinking of it as someone, or an agent, that can actually help you with future requests or asks. So in this case, I've set up an assistant, given it some random auto-generated details, and I'm going to have it assist in a particular flow.
Rohan Prasad [00:37:20]: So the user is going to ask a question, and Mem0 is going to search for relevant memory. I've already uploaded some stuff, that's the historic context; we find it and retrieve it as historic context. Then I'm using Ollama right here to generate a personalized response, storing it back in Mem0 for future conversations, and finally returning the response to the user. So I might ask it a question like, "I'm feeling like Italian today, what types of food can I eat?" Based on that, it pulls in some details.
Rohan Prasad [00:37:50]: It pulls in some other stuff too; that's just a little bit of tweaking that probably needs to happen. But it ends up determining that, hey, I'm in the mood for Italian food, and there are certain vegetarian options, because this is what it knows about this particular user. Then I might ask it another question: what are some lunch options? This is where we talk about pulling in both short-term and long-term memory. We need the long-term memory, things like vegetarian options, working in downtown Brooklyn, or a nut allergy, but we also need the short-term stuff, for example the context that the user is currently preferring Italian food, so we should keep that as well. So it's retrieving these things based not just on long-term but also short-term memory.
Rohan Prasad [00:38:41]: And it's using that to answer the question. That's the high level of what memory systems look like, and a little bit of a deep dive into memory versus RAG. The takeaway I'd come back to is that memory is really important, and the distinction is that it's stateful; that's how it differs from RAG. But fundamentally they use the same underlying tools and have a very similar basis.
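A minimal sketch of the assistant flow Rohan describes (search memories, generate with Ollama, write the exchange back), using the public mem0 and ollama Python packages; return shapes and required configuration vary by version, so this is illustrative only:

```python
# Sketch of a Mem0-backed personal assistant: retrieve memories -> generate -> store back.
# Based on the public mem0 and ollama packages; exact return shapes differ across versions.
from mem0 import Memory
import ollama

memory = Memory()
USER_ID = "demo-user"

# Seed some long-term facts about the user
memory.add("I'm vegetarian, I work in downtown Brooklyn, and I have a nut allergy.", user_id=USER_ID)

def assist(question: str) -> str:
    # 1. Retrieve memories relevant to this question (shape of the result varies by version)
    hits = memory.search(question, user_id=USER_ID)
    facts = str(hits)

    # 2. Generate a personalized answer with a local model via Ollama
    reply = ollama.chat(
        model="llama3",
        messages=[
            {"role": "system", "content": f"Known facts about the user:\n{facts}"},
            {"role": "user", "content": question},
        ],
    )["message"]["content"]

    # 3. Store the exchange back so future turns can use it as memory
    memory.add(f"User asked: {question}\nAssistant replied: {reply}", user_id=USER_ID)
    return reply

print(assist("I'm feeling like Italian today. What types of food can I eat?"))
```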
Arthur Coleman [00:39:17]: Thank you, Rohan. I want to say this is probably the area of most interest; we got a few questions, too. But I have to say, for me the interaction between context windows, short-term memory, and long-term memory is one of the areas I focus on a lot, as I think together they're the way you get to less hallucination and a better experience for the user. But that's a longer conversation. I will let Matt take it over to talk about tool-integrated reasoning and multi-agent systems, and I'm going to let it go as long as it goes. Matt, there are a few questions, but this is so much material that I'll let you go through as far as you can. If you can leave us even five minutes at the end, that'd be great.
Arthur Coleman [00:40:07]: I'll let you know when there are five minutes left. You've got about 13 now; do what you gotta do.
Matt Squire [00:40:13]: Great. Thank you. Thank you, everybody. So, you know, for my part I'm not going to go through as many of the different research papers covered in the survey as the other two presenters have. What I'd like to do is skim through the paper and we'll run through it here, but I'm certainly not going to read every single word out, because we'd be here a lot longer than 13 minutes. What I'd like to do is attempt to bring some of these concepts to life through examples of the prominent technologies we're seeing in our own work building agentic systems and LLM systems in the wild, and just highlight the things that really stand out. After all, people can read the whole paper themselves afterwards.
Matt Squire [00:41:01]: So where do we begin? Well, there are two things I want to talk through just to round off everything else that's been covered. One is tool-integrated reasoning and the other is multi-agent systems. So we start with tool-integrated reasoning, and you might say, well, okay, what actually is this thing? Broadly speaking, it's the capacity of a large language model to interact with the real world. We've talked about context as a concept in general, the idea of contexts that evolve over time and how that impacts the engineering of the context. We've talked about memory, the ability of an LLM to actually learn about its world and build a world model. The final piece of the puzzle is: can we have the model interact with the world itself? I suspect a lot of people on this call will be familiar with this from very practical experience with things like MCP, right? These tools and frameworks, and this huge ecosystem of MCP servers that's emerged over the last six to twelve months: ways that we can describe to an LLM how it can call out to an API, get a result back, or have an effect in the real world. The thing about that is that it's not just about calling an API, and I'll circle back to that in a second.
Matt Squire [00:42:31]: In any case, the authors break this down into a couple of different subcategories. One of them is function calling, another is the reasoning itself, tool-integrated reasoning, and the third is environment and agent interaction. Function calling, I think, is the most straightforward one to understand, because it's that intuitive idea: I've described to the model, in some way, a tool or a thing it can go out and do, and I've given it the ability to do that. That most prominently shows up in MCP.
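For concreteness, a tool description typically looks something like the JSON-schema style used by several chat APIs; exact wrapper fields differ between providers and MCP servers, so this is just the general shape:

```python
# Illustrative tool description in the JSON-schema style many chat APIs use.
# Exact wrapper fields differ between providers and MCP; this is the general shape only.
get_weather_tool = {
    "name": "get_weather",
    "description": "Get the current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {
            "city": {"type": "string", "description": "City name, e.g. 'Manchester'"},
            "units": {"type": "string", "enum": ["celsius", "fahrenheit"]},
        },
        "required": ["city"],
    },
}

# The model sees this schema alongside the conversation. When it decides the tool is needed,
# it emits a structured call such as {"name": "get_weather", "arguments": {"city": "Manchester"}};
# the application executes the real function and appends the result back into the context.
```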
Matt Squire [00:43:07]: I then want to highlight, as I said, that it's not just APIs. You can see some of the thinking in this diagram. On the left we have the three different areas of what we mean by augmenting something with tools: the ability to call functions, the ability to reason about calling functions, and then the ability to interact with an environment. And they have this phrase, going from "text generator" to "world interactor." Fair enough. But let's look at what they have as example tools. They actually don't mention APIs at all, but they talk about search engines.
Matt Squire [00:43:42]: Fair enough. They also talk about things like calculation engines. Imagine you have a function that just does a mathematical operation. You don't want the LLM to do the mathematical operation; after all, it's a text generator. But maybe you want to do some maths as part of a bigger problem you're asking the LLM to solve. That's where you might have a calculation engine, or the ability to query a database, maybe through SQL, maybe through JSON, or whatever it might be, or the ability for users to interact: maybe we can ask the user a question and use the response as part of the workflow this agent is performing. A lot of the foundational ideas here are very old, and we'll see that as a common thread.
Matt Squire [00:44:31]: But nowadays, a lot of what people are thinking about concretely when they talk about this stuff is MCP and things like it. They also talk, and quite a lot of this is quite academic and dense, about how the evolution of training models to do these things has taken place, so you can look through the historical research there. It's quite often the case with these survey papers that you're really looking at the historical development of the ideas, which is interesting, but of course, if you're trying to build something today, you're probably more interested in where we're up to now, so you can skip down to what's contemporary. Tool-integrated reasoning is all about how we allow a model to make decisions about which tools to use, in what contexts and when, and moreover to combine those tools, to chain them together into different sequences of operations. A very concrete example I can give, which we were working on at Fuzzy Labs as a research project, is an agent that does site reliability engineering. The idea is that this agent sits there in Slack.
Matt Squire [00:45:50]: So that's one tool: it's able to receive messages from and send messages to Slack. Then, when a live production system crashes, some kind of notification shows up in Slack; that's just standard software engineering, standard infrastructure, and we don't worry too much about how that works. The agent picks up the error, and then it has another couple of tools it can use. It goes out and looks at the application logs and says, okay, I can see the error that's taken place, I've got a bit more detail.
Matt Squire [00:46:20]: I've got a stack trace here. It can go and grab the source code from GitHub, and it can raise a pull request on GitHub as well. What we're asking our agent to do is diagnose an issue and come up with a proposed solution. That's a very high-level task, but in order to do it, the agent has to be able to reason through and say: okay, first I need to get the logs, then I need to look at those logs and analyze them, then maybe I need to do something else, maybe I need to tell somebody what I found and how severe I think this issue is, and then maybe I need to go to GitHub and raise an issue or a pull request if I think I can solve it.
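A stripped-down sketch of that kind of tool-reasoning loop; the tool names and the plain-JSON dispatch are hypothetical stand-ins for what a real MCP or function-calling setup would provide:

```python
# Stripped-down sketch of a tool-reasoning loop for an SRE-style agent.
# Tool names and the dispatch format are hypothetical; a real system would use MCP
# or a provider's function-calling API rather than parsing raw JSON text like this.
import json

def llm(messages: list[dict]) -> str: ...        # placeholder chat-model call

TOOLS = {
    "get_logs":   lambda service: f"(logs for {service})",
    "get_source": lambda repo: f"(source of {repo})",
    "post_slack": lambda text: "ok",
    "open_pr":    lambda repo, title, body: "pr-url",
}

def run_agent(alert: str, max_steps: int = 8) -> str:
    messages = [
        {"role": "system", "content":
            "You are an SRE agent. Available tools: " + ", ".join(TOOLS) +
            '. Reply with JSON {"tool": ..., "args": {...}} to call a tool, '
            'or {"final": "..."} when you have a diagnosis and proposed fix.'},
        {"role": "user", "content": f"Production alert: {alert}"},
    ]
    for _ in range(max_steps):
        decision = json.loads(llm(messages))
        if "final" in decision:                                 # the agent decides it is done
            return decision["final"]
        result = TOOLS[decision["tool"]](**decision["args"])    # execute the chosen tool
        messages.append({"role": "user", "content": f"Tool result: {result}"})
    return "Gave up after too many steps."
```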
Matt Squire [00:47:01]: Large language models, at least in their early years, were not capable of doing this task; they weren't capable of reasoning tasks more broadly. But it's that question of being able to reason about the use of tools that we're really interested in here, and in the paper they survey a couple of different techniques and the development of the ideas that allow us to do it. One is plain prompting: we use prompts to guide the model through those reasoning steps. Better than that, we can fine-tune the model to be better at reasoning about tools. The final stage is reinforcement learning.
Matt Squire [00:47:40]: It was mentioned earlier: the DeepSeek-R1 paper, which we covered a couple of reading group sessions back. There, they were using reinforcement learning to train reasoning engines, and it's exactly the same principle really; we're just refining the task. The task we're interested in is the ability to reason about tools rather than the ability to reason more broadly. And then agent-environment interaction. What I'll say just before I get to that is that they talk concretely about a lot of different frameworks, and this is quite an interesting table, something to study in more depth: all of these different approaches to doing tool reasoning, or tool-integrated reasoning as they term it.
Matt Squire [00:48:24]: Agent-environment interaction I won't say very much about, mainly because we don't have much time, but again this is about what the environment is, how we interact with it, and how we manage that. We'll skip over evaluations, again deliberately, because of time. Then I want to move on and talk about multi-agent systems, and I'll try to be brief because we've got nine minutes left and I know we want to allow time for questions. I can't not point out that the term "multi-agent system" is a very old term; it predates large language models by decades.
Matt Squire [00:49:04]: What they're talking about here is essentially the same as the historical idea: multiple autonomous agents that are able to coordinate and communicate in order to complete tasks. In the paper they talk about different ways of doing communication, different communication protocols; they talk about ways to orchestrate the agents and ways to coordinate the workloads and the contexts those agents need in order to do the task. I like to think of this as doing Conway's Law on purpose, just to make concrete this idea of multiple agents working together. In software engineering we often talk about an anti-pattern related to Conway's Law: the idea that in a sufficiently large team and a sufficiently large software project, the software architecture starts to reflect the structure of the company that builds it. So you end up separating, I don't know, the finance system in the same way that you happen to have separated the humans doing finance in the business, and so on. Well, the idea a lot of people have been talking about recently is: what if you replaced all these roles in a business with agents that are specialized to a particular task? So now you actually kind of want Conway's Law.
Matt Squire [00:50:29]: You want a multi-agent system architected such that each agent knows its own specialism: it has its own knowledge, its own context around its specialist area, and its own tools to do those specialist jobs. And now if, I don't know, we want to launch a new product, maybe we need the advertising agent to come up with an advertising campaign for that product, but it doesn't need to figure out who we need to hire to build it, or how we finance it, or any of these things. We have these independent agents doing independent jobs. As far as communication protocols go, one of the things they talk about is MCP, not too surprisingly. This idea, often described as USB-C for AI, is a way for us to describe APIs to large language models so that LLMs can consume them.
Matt Squire [00:51:21]: Now, MCP is actually more about telling models about tools, but the same idea applies to telling models about models, or rather agents about agents. They also talk about A2A, agent-to-agent communication, and agent cards, which is the idea that an agent can describe what it's capable of and what it specializes in, so that other agents can delegate tasks to it and understand what capabilities are available in their wider environment. Let me see what else I want to say here. They talk about a few of the challenges, I suppose, in terms of coordination and orchestration, and that's a really big topic that at this point we don't have time to get into in detail. What I will say is that it strikes me that the challenges here mirror the challenges we know of if anyone's done distributed software systems: you have lots of components in a software system that all have their own piece of the task, they all need to share some level of context, they need to coordinate, and often they need to agree on an ordering for a particular task where there are dependencies. So all of these challenges described in the paper seem to me, at least, to mirror challenges that are well known in distributed systems research as well.
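For illustration, an agent card is essentially a machine-readable self-description; the field names below are approximate rather than the exact A2A schema:

```python
# Illustrative "agent card": a self-description another agent can read before delegating work.
# Field names are approximate, not the exact A2A schema; the endpoint URL is hypothetical.
advertising_agent_card = {
    "name": "advertising-agent",
    "description": "Plans and drafts advertising campaigns for new products.",
    "url": "https://agents.example.com/advertising",
    "skills": [
        {"id": "campaign-plan", "description": "Produce a campaign plan for a product brief."},
        {"id": "ad-copy", "description": "Write ad copy variants for a given channel."},
    ],
    "input_modes": ["text"],
    "output_modes": ["text"],
}

# An orchestrating agent can scan the cards of available agents, match a task to a skill
# description, and delegate, much like service discovery in a distributed system.
```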
Matt Squire [00:52:50]: And I think that's probably all I need to say about it right now. There's a huge amount of information we've covered today. So thanks to everyone for sticking with us. I'll stop for questions.
Arthur Coleman [00:53:00]: Hey, let me bring up the questions list. We only have two questioners in this list, so let's start with Sandeep because we have limited time. Choose one of your questions, Sandeep, because I want to let Sam ask his and let's go from there. And can you, if you want, put your video on if you feel comfortable? I always like to see who's asking the question. There he is. Hey, Sandy.
Adam Becker [00:53:33]: Hey.
Rohan Prasad [00:53:34]: Yeah, really great talk. I think I'll just go with my second question, which is: how do knowledge-graph-based RAG approaches deal with multimodal knowledge? It does seem as if the chunking and indexing strategies would have a big impact on the quality of the retrieval. So I'm just wondering if you've come across any research in that vein. Thank you. Yeah, I can talk about specifically what LightRAG does here. I haven't played around with this a ton, so take this with a grain of salt, but they released a tool called RAG-Anything, which integrates with their particular service, and I'm pretty sure it's pluggable into other systems as well. Essentially that's what offers multimodal support in their system. At the end of the day they're creating different kinds of vectors in which they store entities and relationships, and then linking those into their particular knowledge graph.
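As a rough illustration of the mechanism described here, the sketch below shows one way multimodal chunks could be indexed into a knowledge graph. It does not use LightRAG's or RAG-Anything's actual APIs; the embed and index_chunk helpers and the graph layout are hypothetical, showing only the general pattern of embedding chunks of different modalities, attaching the vectors to extracted entities, and linking co-occurring entities.

```python
# Hypothetical sketch: multimodal chunks (text, image captions, table summaries) are
# embedded, stored as per-entity vectors, and linked into a toy knowledge graph.
# Not the actual LightRAG / RAG-Anything API; names are invented for illustration.
from hashlib import sha256


def embed(content: str) -> list[float]:
    # Stand-in embedding; a real system would call an embedding model here.
    digest = sha256(content.encode()).digest()
    return [b / 255.0 for b in digest[:8]]


graph = {"entities": {}, "relations": []}


def index_chunk(chunk_id: str, modality: str, content: str, entities: list[str]) -> None:
    """Store a vector per extracted entity and link co-occurring entities."""
    for name in entities:
        graph["entities"].setdefault(name, []).append(
            {"chunk": chunk_id, "modality": modality, "vector": embed(content)}
        )
    # Naively relate every pair of entities that co-occur in the same chunk.
    for i, a in enumerate(entities):
        for b in entities[i + 1:]:
            graph["relations"].append({"source": a, "target": b, "chunk": chunk_id})


index_chunk("doc1#p3", "text", "Acme acquired Globex in 2021", ["Acme", "Globex"])
index_chunk("doc1#fig2", "image", "caption: Globex revenue chart", ["Globex"])
print(len(graph["entities"]), "entities,", len(graph["relations"]), "relations")
```

The chunking and entity-extraction choices (how chunks are split, which entities get pulled out, how relations are inferred) are exactly where retrieval quality is decided, which is the point the questioner raises.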
Arthur Coleman [00:54:36]: All right, Sam, I don't know who that is, but there he is. Sam. Oh, Sam Comer. Of course, I should have known that. Sam, welcome back. Go ahead.
Matt Squire [00:54:48]: Yeah, firstly, thanks for the great talks, I really enjoyed that. And apologies in advance, this question might be quite specific, but something we previously looked at at my place is GraphRAG. Out of the box these things look quite good, but one downside I've seen is that it's quite difficult to build things like entity and relationship management and mapping in a way that maintains easy updateability, especially in scenarios where your underlying knowledge base is constantly changing and you need to keep regenerating the graph. Do you have any takes, tips, or war stories from applying things like that in prod? It'd be really good to get your insights on that.
Rohan Prasad [00:55:39]: Yeah, we have a slightly different challenge where our data is very stale and old, so I don't have that specific problem. But what I'd be curious to hear about in your particular case is the nature and type of that data. For example, the way I'd contrast memory and RAG-based systems is that memory systems are designed to be a lot more updatable, finer grained, and smaller. So if it's more along those lines, maybe it's a slightly different problem. But if it's truly a knowledge base issue, there are other tools out there where you can plug and play that graph. I'm not trying to give you a non-answer on how you keep this stuff up to date, but I think there are parts you'd probably have to take on yourself outside of the tool. One of the benefits of these frameworks is that they take on the entity and relationship creation for you.
Rohan Prasad [00:56:34]: So you're kind of saying, I'm okay with that level of control being out of my hands, if that makes sense.
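One way to picture "taking that on yourself outside of the tool" is incremental re-indexing, sketched below under assumptions of my own: track a content hash per source document and re-extract entities and relations only for documents that changed, instead of regenerating the whole graph. The content_hash, extract_entities, and incremental_update helpers are hypothetical and not part of any GraphRAG framework.

```python
# Hypothetical sketch of keeping a GraphRAG-style index fresh without full rebuilds:
# re-extract entities only for documents whose content hash has changed.
from hashlib import sha256


def content_hash(text: str) -> str:
    return sha256(text.encode()).hexdigest()


def extract_entities(text: str) -> list[str]:
    # Placeholder: a real pipeline would use an LLM or NER model here.
    return [w for w in text.split() if w.istitle()]


def incremental_update(graph: dict, hashes: dict, documents: dict[str, str]) -> None:
    """Upsert graph entries only for new or changed documents."""
    for doc_id, text in documents.items():
        h = content_hash(text)
        if hashes.get(doc_id) == h:
            continue  # unchanged document, skip re-extraction
        hashes[doc_id] = h
        graph[doc_id] = extract_entities(text)  # replaces stale entries for this doc


graph, hashes = {}, {}
incremental_update(graph, hashes, {"d1": "Acme bought Globex", "d2": "prices fell"})
incremental_update(graph, hashes, {"d1": "Acme bought Globex", "d2": "Initech grew"})
print(graph)  # only d2 was re-extracted on the second pass
```

A production version would also have to reconcile relations that span documents and garbage-collect entities that no longer appear anywhere, which is where most of the real difficulty the questioner describes tends to live.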
Matt Squire [00:56:43]: Yeah, that makes sense. Thank you.
Arthur Coleman [00:56:47]: Okay, I'm going to go to Srikanth, who just added a question I think is very interesting. We have about a minute. So Srikanth, if you can put on your video and ask a question, that'd be great. Yeah.
Srikanth [00:57:01]: Hi Arthur. Hello Adam, Matt, and Rohan. I see that graph database skills are lacking in most of the data engineers and data scientists in the industry, so I'm afraid GraphRAG would gradually decline in adoption. The complexity that GraphRAG, or the graph database itself, brings is hard to support or maintain at scale.
Arthur Coleman [00:57:41]: Yeah.
Rohan Prasad [00:57:42]: Do you have anything to comment? Or, is the question that adopting this kind of system is going to be hard because there's not enough expertise in the industry? Is that the question? Could you reframe it?
Arthur Coleman [00:57:58]: Interesting. Yeah, yeah.
Rohan Prasad [00:58:04]: Sorry, go ahead.
Arthur Coleman [00:58:06]: I was.
Rohan Prasad [00:58:10]: I was going to say, my two cents on this is that it's going to be really dependent on the company and the expertise and skill sets you have there. At one of the past places I worked, everything was based on graph-based workflows, so it was very integral; we were big adopters of Neo4j early on, so we had that expertise in the company. Not to give a non-answer, but my two cents is that it's very much going to be contingent on your workforce. As for whether it's going to decline in RAG architectures, I don't think that's going to be the case, because at the end of the day, to Sam's point, people are going to want to understand what those entities and relationships are. Is there a way to tweak and tune that to make it more important? I think it'll just be a natural way for people to develop skill sets over time.
Rohan Prasad [00:59:08]: Where I think GraphRAG, or some of these tools, presents a way for you to quickly prototype and build something out, which is very conducive to a small startup; as you start getting larger, you'll build that expertise. I don't know if that answers the question, or if, Arthur, you had more to add to that.
Arthur Coleman [00:59:28]: Okay, guys, we're unfortunately a minute or two over, but I'll forgive us. Very good session. Thank you everyone for coming. Don't forget to fill out the post-event survey, especially since it's the first survey paper we've done that's 60 pages long. I'd love to hear whether you think it was a good use of your time, or whether we should stick to more detailed topics and shorter papers as a rule, which I'll vote for. Anyway, great day, everybody. Loved having you.
Arthur Coleman [00:59:54]: Thanks for your participation. We'll see you next month.
Matt Squire [00:59:57]: Thanks, everybody.