AI REWIND 2025 - MLOps Reading Group Year-end Special
Speakers

I'm a tech entrepreneur and I spent the last decade founding companies that drive societal change.
I am now building Deep Matter, a startup still in stealth mode...
I was most recently building Telepath, the world's most developer-friendly machine learning platform. Throughout my previous projects, I had learned that building machine learning powered applications is hard - especially hard when you don't have a background in data science. I believe that this is choking innovation, especially in industries that can't support large data teams.
For example, I previously co-founded Call Time AI, where we used Artificial Intelligence to assemble and study the largest database of political contributions. The company powered progressive campaigns from school board to the Presidency. As of October 2020, we helped Democrats raise tens of millions of dollars. In April of 2021, we sold Call Time to Political Data Inc. Our success, in large part, is due to our ability to productionize machine learning.
I believe that knowledge is unbounded, and that everything that is not forbidden by laws of nature is achievable, given the right knowledge. This holds immense promise for the future of intelligence and therefore for the future of well-being. I believe that the process of mining knowledge should be done honestly and responsibly, and that wielding it should be done with care. I co-founded Telepath to give more tools to more people to access more knowledge.
I'm fascinated by the relationship between technology, science and history. I graduated from UC Berkeley with degrees in Astrophysics and Classics and have published several papers on those topics. I was previously a researcher at the Getty Villa where I wrote about Ancient Greek math and at the Weizmann Institute, where I researched supernovae.
I currently live in New York City. I enjoy advising startups, thinking about how they can make for an excellent vehicle for addressing the Israeli-Palestinian conflict, and hearing from random folks who stumble on my LinkedIn profile. Reach out, friend!


Sophia Skowronski is a Data Scientist at Breckinridge Capital Advisors with previous experience as a Business Analyst at Pledge 1%. Sophia has also worked as a Data Science Intern at Candid, an AI Investigations Intern at Deep Discovery, and held roles at Singularity University and the Global CO2 Initiative. Sophia holds a Bachelor of Arts in Astrophysics and Cognitive Science, as well as a Master's degree in Information & Data Science from the University of California, Berkeley.


Hey! I’m Nehil Jain, an Applied AI Consultant in the SF area. I specialize in enhancing business performance with AI/ML applications. With a solid background in AI engineering and experience at QuantumBlack, McKinsey, and Super.com, I transform complex business challenges into practical, scalable AI solutions. I focus on GenAI, MLOps, and modern data platforms. I lead projects that not only scale operations but also reduce costs and improve decision-making. I stay updated with the latest in machine learning and data engineering to develop effective, business-aligned tech solutions. Whether it’s improving customer experiences, streamlining operations, or driving AI innovation, my goal is to deliver tangible, impactful value. Interested in leveraging your data as a key asset? Let’s chat.

Sonam is a data scientist turned developer advocate.


Arthur Coleman is the CEO at Online Matters. He has also held three previous roles, including VP Product and Analytics at 4INFO.
SUMMARY
AI REWIND 2025 was less “wow, agents everywhere” and more “uh… this is messy.” We called out brittle agents, bloated context windows, sketchy orchestration, evals that still don’t reflect reality, and why open models are quietly eating the ecosystem. If you think 2025 was a victory lap for AI, this episode might annoy you—in a good way.
TRANSCRIPT
Binoy Pirera [00:00:00]: All right, everybody, thank you so much for joining. So just to give you a little bit of context, usually when we do these reading groups, we select a research paper and we invite our wonderful speakers to dissect it, dive a little bit deeper into it and see what's in there. But this time we're not really covering a paper. Instead we asked the community what they think were the most consequential things that happened in AI last year, and we got a bunch of responses. Based on those responses, we selected a few of them, and helping us go through all of them are our wonderful speakers, as usual. We have Adam, the founder of Headon, and Sophia, who's a data scientist at Breckinridge Capital. We have Nehil, who is a member of the technical staff at Anyscale.
Binoy Pirera [00:00:44]: We have Rohan, he's a software engineer, platforms, at EvolutionIQ. And we have Sonam, who's a developer advocate at Telnyx. And we have Lucas, who's a senior AI engineer at Stone. He's joining us for the very first time. Thank you, Lucas. And also we have Anna Yoon, who is a member of technical staff at OpenAI. So thank you guys so much for joining and helping us go through these developments. And just so you know, just to make it a little bit easier for us to go through these developments, we classified them into three clusters, right? Just to make it easier for the flow.
Binoy Pirera [00:01:22]: So the first cluster is architecting and designing AI native workflows. And as you can see, we've got agents, memory, context engineering, MCP, evals and vibe coding in it. And then this is cluster two. We'll go through each of these individually. I'm just giving you a very high level background on what we're about to cover. The second cluster is multi agent orchestration and model cognition. And this is cluster three: evals, metrics and open source.
Binoy Pirera [00:01:49]: So without further ado, I think I've taken enough time. I'm going to hand over to cluster number one, which will be covered by Rohan, Sonam and Sophia. All right, guys, take it away.
Sonam Gupta [00:02:04]: Thank you, Binoy. So many thanks to everybody who joined. And I am going to talk about agents, because that's one of my favorite topics to talk about these days. I'm going to quickly share my screen. One second.
Lucas Pavanelli [00:02:25]: There we go.
Sonam Gupta [00:02:26]: Okay, can everybody see my screen?
Binoy Pirera [00:02:33]: All good.
Sonam Gupta [00:02:35]: Okay, perfect. All right, so the first topic I want to talk about is AI agents: from experiment to production. Now, in the past couple of years we have heard a ton about AI agents. Last year, when I was talking about agents, the common question I used to get is, what is an AI agent? And the best answer I could come up with is this: think of the LLM as the brain. The agent gets access to tools, and combining the brain, which is the LLM, with those tools, it behaves autonomously. It can make some decisions on your behalf. And that's what your AI agent is.
Sonam Gupta [00:03:19]: Now, from last year to this year, I was wondering, how did agents leave the lab? Models became systems, basic prompt engineering became workflows, stateless calls became long-running loops, and all of this moved into a production state. It didn't happen just because models got smarter. It happened because we wrapped models with control tools, memory and even guardrails, which is, by the way, a very important component in agents. And agents became orchestrated systems, not just APIs that you call once. So that's where it became more than demos and became a lot more involved in your daily work. And agents are not all equal, not just one pattern: we have task agents, conversational agents, multimodal agents, which is now the talk of the town, and also background agents to do some monitoring, triage, automation and so on.
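To make the "LLM as the brain, wrapped with tools and guardrails" framing Sonam describes concrete, here is a minimal sketch of that control loop in Python. The call_llm stub, the get_weather tool and the message format are placeholders for illustration, not any particular vendor's API.

    import json

    def get_weather(city: str) -> str:
        # Placeholder tool; a real agent would call an external API here.
        return f"Sunny and 20C in {city}"

    TOOLS = {"get_weather": get_weather}

    def guardrail_ok(action: dict) -> bool:
        # Guardrail: only allow tools that were explicitly registered.
        return action.get("tool") in TOOLS

    def call_llm(messages: list) -> dict:
        # Stub standing in for a chat-completion call. A real agent would send
        # `messages` to a model and parse its structured output into an action.
        if not any(m["role"] == "tool" for m in messages):
            return {"tool": "get_weather", "args": {"city": "New York"}}
        return {"final": "It is sunny and 20C in New York."}

    def run_agent(user_goal: str, max_steps: int = 5) -> str:
        messages = [{"role": "user", "content": user_goal}]
        for _ in range(max_steps):
            action = call_llm(messages)              # the "brain" decides what to do next
            if "final" in action:
                return action["final"]
            if not guardrail_ok(action):             # check guardrails before acting
                messages.append({"role": "system", "content": "Tool not allowed."})
                continue
            result = TOOLS[action["tool"]](**action["args"])   # act through a tool
            messages.append({"role": "tool", "content": json.dumps({"result": result})})
        return "Stopped: step budget exhausted."

    print(run_agent("What's the weather in New York right now?"))

The point of the sketch is the shape of the loop, decide, check, act, feed the result back, not the specific model or tools.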
Sonam Gupta [00:04:35]: So this is where, when I talk about multimodality, I think about voice, image, video, and of course we have text. And all these text agents are quite different from the other modalities' agents. We have also been seeing a lot more coding agents coming up, and I'll give you some examples in a later slide. They behave a lot differently from an operational agent or a healthcare agent. So what I'm trying to say is that there is not just one architecture. The base stays the same, that we have the LLMs and we have access to the tools, but how it's all orchestrated differs based on the use case and how much you want to scale. Now, why does one think that production agents are harder than demos? I can say this from my personal experience; I do build lots of different demo-like prototype agents.
Sonam Gupta [00:05:41]: But then when I see the engineering team taking those agents and pushing them into production, there's so much to handle. There are conversations that could last minutes, not just milliseconds, even when in the demos it may work perfectly. Think about it from the memory point of view, or the state of that agent, because they need to persist correctly, and even a small failure isn't just retries, it means a strong recovery. And at the end of the day you also need to explain why; explainability becomes a requirement, not just a nice-to-have. In all of this, I believe that latency matters and consistency matters as well. And one more thing: memory bugs are so much worse than model bugs, because they can create not just distractions but real problems, especially if you are in a sensitive industry.
Sonam Gupta [00:06:53]: For example, healthcare. Now, I work at a company called Telnyx and we are in the voice agent space. So for some real world examples: we have not only built demos for our healthcare customers, we have been pushing these agents into production for them. For example, just recently I gave a demo of a symptom triage nurse agent, where I wrote a very comprehensive prompt for the agent and selected different LLM, transcription and speech models. And then, basically, you call a number and the agent triages you and transfers you to the nurse; that's where the triaging comes in. What it does is save a lot of time, and when you talk to that agent, it feels like a very natural conversation that you're having with another human being.
Sonam Gupta [00:07:48]: And it will ask a bunch of questions. And then when it sees that it cannot answer any more of the patient's questions, it will transfer them to the nurse. Similar with the prescription filling agent: it saves a bunch of time. It will answer the patient's questions about prescriptions. It will tell you, okay, what were the most recent prescriptions, or do you need a new prescription? It will ask questions like, okay, who's the primary care doctor, and so on. So these are some of the voice agents, and that falls under multimodality in agents. And then of course the famous coding agents that we all know and use: Cursor, GitHub Copilot, Codex and so on.
Sonam Gupta [00:08:32]: There are so many of them that I can't actually remember all the names anymore. So yeah, this is how I feel we have gone from the experimental stage with AI agents to production. But I still believe that the human in the loop is still a critical layer, and we still need to figure out many nuances, because what works in demos may not work in production. And the big lesson that I have learned from my personal experience working at agent platform companies is that latency and consistency really matter. That is something that I know companies are working towards, and something that we should keep an eye on, because this is how agents in production will work. This is all I had for this particular topic. Are we taking questions right now, or should I move on to my next topic and then hand it over to Rohan?
Binoy Pirera [00:09:34]: Yeah, let's wait till we're done with the entire cluster and then we.
Lucas Pavanelli [00:09:38]: Can go for questions.
Binoy Pirera [00:09:39]: Thank you, Sonam.
Sonam Gupta [00:09:41]: Perfect. All right, so the next topic, given that I was talking about human in the loop, naturally transitions to the rise of vibe coding. Now, vibe coding is, again, a pretty cool topic. When it first came out, everybody was vibe coding, and I'm like, okay, what exactly is vibe coding? It took me a while to fully understand the lingo there, but now I do, and I have a fun example to share about how I use vibe coding. Now, when we talk about how vibe coding is rising: it's basically coding in natural language. Where I come from, the first coding languages I started learning were C and C++, then Java and so on and so forth. And my God, learning how to code was a bit of a challenge for me personally.
Sonam Gupta [00:10:31]: Now, all it is, it feels like I'm just talking in natural language with the IDE, for example Cursor. All I do is say, hey, I want to build a website. These are my constraints, these are the guardrails, and this is my idea. Do something for me. That's literally how I talk to these coding agents.
Anna Yoon [00:10:50]: And.
Sonam Gupta [00:10:54]: It's like you are steering and AI is executing. And the unit of work basically shifts from writing lines of code to ideas plus feedback. And I sincerely love that part. Now, what does vibe coding unlock? When I first started, I wanted to see, okay, how much can it help and what is it doing for me? Basically, the prototyping of any application, whether I'm building an agent or building a website for my personal projects, becomes seriously fast. And there is much less fatigue when it comes to building that boilerplate kind of template. And I can focus more on system design, and this is true for developers as well: they can focus more on system design, they can put their time and creativity there, they can build a prototype, and voila, you have a working prototype of whatever application you are building.
Sonam Gupta [00:11:54]: So this is why productivity basically goes up without really replacing developers. Now one might ask, okay, can we vibe code and push it into production? I'll say no, please don't do that, because it may break. And we have seen many examples. I have recently heard people talk about certain models or certain coding agents deleting the whole database and stuff like that. And it is scary. So this is why we want to keep humans in the loop, because of course the context still lives with the humans. AI is great at momentum, but I still think that it is terrible at judgment, and at the end of the day humans decide what good really looks like. And this is especially true when you are working in production code bases.
Sonam Gupta [00:12:47]: It will create a bunch of things; it will create your GitHub repo, it will write the readme, everything. But you need to have the review done by humans, correct whatever you want, and make sure that the intent is aligned. Now I want to quickly give an example of a website that I coded; let me start sharing again. One second. So I'll take you to my Cursor. I have been building a career advice platform, and this is my conversation. This is exactly how I have been vibe coding this particular application: from the idea, to designing the whole application, to the feedback, to the connections, to creating a GitHub repo.
Sonam Gupta [00:13:39]: For my project, all I did was use Cursor. Honestly, personally, I do not like backend coding because my brain just stops working there. I do not know how Railway works, I do not know how Netlify works. But I was able to follow along and asked Cursor specific questions like, okay, what exactly are you building? Why did you choose this tool? And then when I would look into the code files, I would understand, okay, this is what's happening. And the result of this vibe coding was this particular website. And believe me, like I said, I do not have any knowledge of how to build this particular website all by myself. All I had to do was just talk to this chat with Cursor, and it basically created all these things that you see on my screen. So yeah, vibe coding is really cool, but it is highly, highly recommended that you don't vibe code into production, because that's going to mess things up.
Sonam Gupta [00:14:48]: So yeah, that's all I had for my talk and I will hand it over to Rohan.
Rohan Prasad [00:14:58]: Hey everyone, how's it going? I think would be a great way to talk about the next section, which is going to focus on context and memory, is to actually talk about what I think the world kind of looked like in 2024. And I know this might be a little bit of a controversial statement, but the way I think we operated with LLMs in 2024 is really thinking about, let's be very, very careful to include all the information and anything that we see. And context windows will just keep getting bigger and bigger and LLMs will keep getting smarter and smarter and we'll get better results. So we really focused on sort of the recall side of the problem. So what we found though is there was a lot of challenges and issues with that approach that I'm personally very guilty of myself, which is what I was doing more in 2024, is that it didn't necessarily work. And even though like context windows go to like 1 million tokens or 2 million tokens and you can stuff in the entire book that you have, you still end up with a lot of issues. Like for example, as you're adding that much context, even with perfect retrieval of information, so you know that the information that you're passing to a particular LLM tasks tend to degrade. And that's because there's a lot of noise as well as signal in a particular document and you're actually making the problem a lot more challenging for an LLM.
Rohan Prasad [00:16:27]: Additionally, there are challenges where information is very easily found at the beginning and at the end, but things tend to get a little bit more missed in the middle. I kind of like to think about this as like when I'm reading through a paper and I start off and I completely understand what's going on, and then my eyes gloss over as I get to like page two. And then I have to stop at some point, go back to the beginning and remind myself to reread. Going into the whole concept of like, once again, too much information means that in really focusing on the recall side of the problem means that you're also possibly including a lot of noise, especially if you're not thinking about precision a lot of times. And the other part of it is when you have a large context, things often default to local optimization. So because there's so much context and we have this focus on the beginning and focus on the end, there might be a lot of details and pieces in the middle that are really talking about how the entire system should work. So but that ends up getting glossed over and missed. So you end up with actually a subpar product.
Rohan Prasad [00:17:29]: And I think a good way to really Think about this is like taking a particular example. And I think this example is pretty easy for us to understand, but I think it also really relates well to how an LLM might particularly approach a problem. So, like, you might ask a question like, hey, what was the best piece of writing advice I got from a college classmate? This might be like a test question or something that you said. And then you have a bit of a document, and you have to find some information. So what you're doing here is you're essentially reading through this document and you're saying, well, a lot of this stuff isn't really relevant, but, oh, this piece might be relevant, this piece might also be relevant. But the other piece of information here is that this is actually wrong. And there's a really good chance when you pass in all of this, that the LLM actually focuses on this yellow section, the distractor, and actually gets the answer incorrect just because there is some semantic similarity between this distractor piece and the question itself. So the way I think about context engineering is focusing a little bit more on the precision side of the problem.
Rohan Prasad [00:18:28]: And how do we almost give the LLM a cheat sheet? Once again, to sort of humanize it or give an analogy: it's like if you had to ask a question of a book. If you had the whole book, that's definitely a much harder problem than if you highlighted sections in the book, which is still a harder problem than if you just ripped out all the irrelevant pages and just kept the two or three pages that you actually needed. So that's what I think about context engineering at a high level. Now, before context engineering, we really had prompt engineering, and prompt engineering is still important. I'm not saying that we shouldn't still do that, but I think context engineering takes what we used to do in prompt engineering and really focuses on a different scale. So for prompt engineering, the way I think about it is that we're leveraging the model's internal intelligence. It's much more static.
Rohan Prasad [00:19:25]: We're trying to see the best way we can coax that particular information out of the particular model. Inherently, it's a stateless system, and it's a singular component that you're trying to optimize. And I just gave a couple of examples of how people typically think about optimizing prompts, which are chain-of-thought, persona adoption, few-shot examples. Context engineering changes the problem fundamentally to say: think about how we provide information to the model outside of its context. So fundamentally this is dynamic, and it's essentially talking to multiple different systems, which I'll go through in a little bit. It's a bit more stateful, because you want to preserve how the LLM is interacting with someone, and you also want to think more about how we remove distractors, how we make sure that we're giving it relevant and useful information. So a question that might come to your mind is, how do we just get better at this? We obviously don't want to give the whole book; we want to figure out what the right relevant pieces of information are. So some techniques are, for example, using System 2 Attention, so actually using another LLM or another system to take the particular context you're providing and strip out anything that it finds irrelevant.
Rohan Prasad [00:20:39]: This still doesn't solve the distractor problem, but it at least reduces the amount of noise that an LLM has to filter through. Compaction is another technique that's very, very similar. If you've ever worked with an agentic coding tool, you'll see an option, for example in Claude Code, to compact your past history so you only keep the most relevant information. That's also helping with managing how much you have in your context window and how you're actually improving it. There are other techniques, such as really thinking about context retrieval as a problem that you can gamify with a generator and a discriminator: you have something that's generating the particular context and a discriminator that's telling the generator where it's wrong, and you just keep improving and tweaking that system, similar to a GAN. You can look into things like better RAG techniques, for example using a knowledge graph, or having a better embedding model so you have vectors that are more relevant to your system. Or maybe in your case keyword search works better than embeddings.
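As a rough illustration of two of the techniques Rohan mentions, an extra model pass that strips irrelevant context (in the spirit of System 2 Attention) and compaction of older history, here is a short sketch. The `llm` argument is any callable that takes a prompt string and returns a string; the prompts themselves are illustrative, not taken from any paper or tool.

    def filter_context(llm, question: str, raw_context: str) -> str:
        # Second pass: ask a (cheaper) model to keep only the sentences that
        # matter for this question, reducing noise and distractors.
        prompt = (
            "Copy only the sentences from the context that are needed to answer "
            "the question. Drop everything else.\n\n"
            f"Question: {question}\n\nContext:\n{raw_context}"
        )
        return llm(prompt)

    def compact_history(llm, history: list, keep_last: int = 3) -> list:
        # Compaction: summarize older turns into one note, keep recent turns verbatim.
        if len(history) <= keep_last:
            return history
        summary = llm("Summarize the key facts and decisions so far:\n"
                      + "\n".join(history[:-keep_last]))
        return [f"[summary of earlier turns] {summary}"] + history[-keep_last:]

    def answer(llm, question: str, raw_context: str, history: list) -> str:
        context = filter_context(llm, question, raw_context)
        turns = compact_history(llm, history)
        return llm("\n".join(turns)
                   + f"\n\nRelevant context:\n{context}\n\nQuestion: {question}")

Whether this helps more than, say, better retrieval or keyword search depends on the task, which is exactly the trade-off being discussed.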
Rohan Prasad [00:21:37]: What we'll segue into is what I think is the shining example of improving context, which is memory systems. In terms of memory systems, there are two aspects that really focus on how you improve things for context engineering: there is short term memory and there's long term memory. The way I like to think of short term memory is that it's things that are more immediate in the conversation; it's part of your working session of working with an agent or talking to an agent. Things such as conversation memory, which is the recent snippets or the recent things you're talking about with someone. Working memory, which is tool outputs and intermediate calculations: what are the things we have from calling some MCP server and getting back some details, and how does that relate to the conversation we're having at the moment? And attention context, focusing on what is actually relevant to the agent at this particular time: where do I want to call attention to in the conversation? But then, to truly build agentic systems that have really good context management and memory management, we also want to think about this from a more long term perspective.
Rohan Prasad [00:22:52]: So for example like factual memory, so things that the agent might know about the particular domain or the user that it's derived from prior conversations. Episodic memory. So thinking about things that has happened in the last conversation or the two conversations before that, I know those kind of feel a little bit similar, but the way I sort of distill or separate those two is thinking about it in the lens that here is something that I've been able to distill as a fact about the user and from a factual memory perspective. And the episodic memory is something that I think about is here's something that might have occurred in a prior conversation. It might eventually become factual memory over time. But usually factual memory is a much more declarative way of speaking or a declarative statement that you've given to the LLM. And last but not least, we have semantic memory. So really thinking about the relationships between different concepts that have been introduced.
Rohan Prasad [00:23:43]: But once again, to kind of anchor this, I think it makes sense to think about this through the lens of what might happen if you go to a local coffee shop and order a drink. So when you're ordering a drink, you might say, hey, I want a latte. And the barista might say something like, well, do you want it hot or cold? And you say hot. And the resulting part of that conversation means that the barista is going to make you a hot latte. It's not that you told them explicitly up front that you wanted a hot latte; it's that you're keeping track of the particular elements in the conversation so that you know how to chain those together into what the customer actually wants. Working memory might say, hey, the latte costs some crazy amount because I live in New York City, but let's say you add a bagel to that order. So adding up the price of those particular pieces and putting those together to get some sort of value of how much the total order is.
Rohan Prasad [00:24:41]: Attention context is saying, hey, actually I'm in a rush, so I'll just take whatever you have that's readily available. So kind of bypassing the particular order, understanding that the rush is a key element: what is the fastest thing that I can make? When we talk about long term memory, it's something like if you go to a local coffee shop and you're a regular, where someone just knows your particular order, because that's a fact about you: your typical order is that you always get this particular drink. Episodic memory, the way I would contrast it, is knowing what you ordered last time, or some element of a conversation you had in a prior visit to the shop. And lastly, semantic memory might be, on top of knowing that usual order, knowing that when it's cold outside I'm going to relate cold to this person wanting a hot drink, and when it's hot, when it's summer, relate that heat to someone maybe wanting a cold drink. So knowing those aspects and knowing the relations of all those pieces. Moving on to some notable papers that really helped me clarify a lot of my understanding of this and helped me in a professional context, I listed a few out here. One I think is a survey paper that really covers the breadth of context engineering. One is really talking about how a lot of context gets lost in the middle, and context rot, about how just increasing your input tokens can be problematic.
Rohan Prasad [00:26:08]: And last but not least for memory systems, one of the memory systems I've had experience with is mem0 and I think they do a really good job of working with how they represent and curate memory. And I think their definitions of how they define how memory works is also very interesting. I'll pass it off to Sophia.
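To anchor the memory taxonomy Rohan just walked through, and his coffee shop example, here is one way the pieces might be laid out in code. The structure and field names are purely illustrative; they are not mem0's or any other library's actual API.

    from dataclasses import dataclass, field

    @dataclass
    class AgentMemory:
        # Short-term: lives only for the current session.
        conversation: list = field(default_factory=list)   # recent turns
        working: dict = field(default_factory=dict)        # tool outputs, intermediate results

        # Long-term: persisted across sessions.
        facts: dict = field(default_factory=dict)          # factual memory, distilled statements
        episodes: list = field(default_factory=list)       # episodic memory, what happened before
        relations: list = field(default_factory=list)      # semantic memory, (subject, relation, object)

    mem = AgentMemory()
    mem.conversation.append("Customer: I'd like a latte.")
    mem.conversation.append("Barista: Hot or cold? Customer: Hot.")
    mem.working["order_total"] = 6.50 + 4.00                  # latte plus a bagel
    mem.facts["usual_order"] = "hot latte"                    # the regular's usual order
    mem.episodes.append("Last visit: ordered a hot latte and a bagel, was in a rush.")
    mem.relations.append(("cold weather", "suggests", "hot drink"))

An episodic entry like the last visit can eventually be distilled into a factual entry, which mirrors the distinction Rohan draws between the two.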
Sophia Skowronski [00:26:30]: Awesome, thank you. Let me share my screen first. All right, cool. Can you see the Miro presentation, everyone?
Adam Becker [00:26:45]: Yep.
Sophia Skowronski [00:26:46]: All right, cool. So, building on that idea of context management: MCP. Oh yeah, and I'm using a very funny image. I was trying to make a cool one like Sonam's, but it didn't really work out as well. But so, just building on the idea of context, MCP takes it a step further by providing a structured way for models to access, organize and reason over context reliably. But just to ground us in what we're talking about, we're going to chat about what MCP is, how it interacts with LLMs, some examples of issues with it, and ways to implement a server for your own application. So as I was researching this, or just looking up what people are talking about, I kept seeing MCP described as the USB-C of AI or the REST API of LLM apps. And all of those work fine.
Sophia Skowronski [00:27:49]: Whatever makes the most sense to you, you should just go with it. It's an open protocol that lets AI models securely connect to tools, data sources and applications. And open protocol means that no vendor owns it; all the implementations work the same way because they follow the same rules. So instead of every agentic framework inventing its own tool calling wrapper, MCP provides one standardized way for any AI system to list what tools are available, send structured requests and receive reliable, typed responses. And under the hood, it uses a communication format called JSON-RPC, which basically means any tool that can speak MCP can read and write JSON messages. So it's pretty language agnostic. And let's see, next slide.
Sophia Skowronski [00:28:51]: So what's here? So under the hood, MCP is built on a client-server relationship. There are three components to it. There's the host, which is any AI application, so Claude Desktop, Cursor, Copilot; it provides the environment. The host loads the config file, manages conversation and state, and then runs the MCP clients themselves. And the clients are hosted inside, or are inside, the host, I should say, and handle the communication: opening the connection, sending requests, receiving responses, kind of the gateway to the server. Then the server is where all the capabilities are. The server exposes structured features such as tools, so the types of actions the LLM can invoke.
Sophia Skowronski [00:29:44]: So maybe, in the example you see everywhere, a weather API, an action could be get weather, or you could have an action to create or get a JIRA ticket. There are also resources that the MCP servers expose, which is just pulling data or content that an LLM can request. And then there are prompts, just reusable templates that the LLM can call. And so they're all sandboxed applications that define what they do and return data. So let's see, the connection. Oh yeah, okay, so with the images on the right, I just wanted to highlight the exchange that takes place between the client and the server. So, you know, the client asks the server, what can you do for me? And the server responds with a structured list of things, the tools, resources and prompts, and then the client can acknowledge and proceed with the action. So for the weather API, again, it might have a get weather tool, a forecast prompt, maybe a set of resource endpoints for collecting historical temperatures, I don't know.
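To make that exchange concrete, the JSON-RPC traffic between client and server looks roughly like the following. The method names follow the MCP specification at the time of writing, but the payloads are simplified and the weather tool is just an illustration.

    # Client -> server: discover what the server can do.
    list_request = {"jsonrpc": "2.0", "id": 1, "method": "tools/list"}

    # Server -> client: self-describing tool list (simplified).
    list_response = {
        "jsonrpc": "2.0", "id": 1,
        "result": {"tools": [{
            "name": "get_weather",
            "description": "Current weather for a city",
            "inputSchema": {"type": "object", "properties": {"city": {"type": "string"}}},
        }]},
    }

    # Client -> server: the structured tool call the LLM asked for.
    call_request = {
        "jsonrpc": "2.0", "id": 2, "method": "tools/call",
        "params": {"name": "get_weather", "arguments": {"city": "New York"}},
    }

    # Server -> client: typed result, which the agent appends to the LLM context.
    call_response = {
        "jsonrpc": "2.0", "id": 2,
        "result": {"content": [{"type": "text", "text": "Sunny and 20C in New York"}]},
    }

Because the server self-describes in tools/list, adding a new tool or prompt shows up the next time a client asks, which is the convenience Sophia points out next.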
Sophia Skowronski [00:31:00]: So the LLM will know after this exchange what exactly the server can support. Comparing to a traditional API, where you must hard-code all of the request formats, MCP servers self-describe what their capabilities are, and if you add another tool or another prompt, anytime the clients access it they will see the latest definitions. So it's just very convenient, gotta say. And then one other final piece to go over quickly is that LLMs do not make the API calls directly. We're talking about agents, right? So what the LLMs here are explicitly doing is outputting a structured tool call, and I'm sure we've all seen those JSON structured responses. It can be in any format, but we're saying that MCP is the structure that you should use, and it basically describes what the LLM wants done. Then the agent, the piece of software running this LLM flow, steps in.
Sophia Skowronski [00:32:10]: It's either your Python code, it's LangChain, it's Claude Desktop, Copilot, etc. There's a loop that happens: basically the agent will read the LLM output, see that it's JSON, execute it against the MCP server, then get the result back and add it to the LLM context. That loop continues until an answer happens, or whatever endpoint you set up for your flow. And so just going on to kind of the.
Arthur Coleman [00:32:42]: Oh, can you make your screen a little larger? People are saying they can't see.
Sophia Skowronski [00:32:46]: Oh no. Oh no. Okay, let's see. Yeah, right, very good. And I guess, yeah, we'll, we'll be sharing the links for this afterwards. Right. So, okay, so I'll. Thanks for letting me know.
Sophia Skowronski [00:33:05]: So I'll just go to Anthropic's page just to show this one, but let's see. Okay, so why is MCP the standard now in 2025 and, moving forward, 2026? So as I mentioned before, there were already agent frameworks out there and everyone had their own wrappers for tool calls, so MCP solved that by offering one universal interface. And you can see a lot of industry adoption recently. Part of the announcement in this blog post is that MCP is being donated to the Linux Foundation, so it's now going to have community-driven, vendor-neutral evolution. And then of course, over time there have been more and more community and official servers released. And there are also MCP SDKs in Python and, let's see, TypeScript, and I think they also said Go and Rust. So there's just a lot of different.
Sophia Skowronski [00:34:22]: There's a huge momentum here, is what's going on. So, going back to the slides. So just some examples. And again, these are all different links that, when you get the Miro board, you can click through as you like. But I looked at a lot of different MCP Reddit threads and tried to get a good understanding of what servers people are actually using. Again, if you have any that you are using, you should put them in the chat, or you can also highlight them during the Q and A time. I'm very curious to see what people are using for their own application development. There are, again, a lot of official and community-based servers.
Sophia Skowronski [00:35:09]: These few on the right hand side are kind of what's used for demonstrating features of MCP. There's Sequential Thinking, which basically is a chain-of-thought prompt with additional features for planning and reasoning, and you can adjust your plan; it kind of branches out. It's an interesting way to basically create chain-of-thought prompting in your LLM context or LLM state. And then another one I've heard of is called Context7, for when you're working in Copilot and you can't use web search. Well, I guess if you have access to web search; I currently do not in my locked-down financial environment. But you can use Context7 to.
Sophia Skowronski [00:35:59]: Let's see, if you're working with a library that constantly gets updated, something like LangChain, where there are new releases pretty often, Context7 will actually pull the latest docs into the context and allow you to actually debug using the latest version of the library. So that one seems pretty useful. So, moving on: what still needs work with MCP? All this sounds really great, and there's a lot of momentum, but there are still some important gaps in the ecosystem that everyone's trying to solve right now. There are inherent security risks with giving AI agents access to real systems. You know, it's a pretty high stakes situation. Once an agent can take action, it becomes an active participant. MCP was not designed for security; it's kind of agnostic to it.
Sophia Skowronski [00:36:56]: It handles the plumbing of tool calling, so again, listing tools, calling tools, returning results. But MCP doesn't really define who is allowed to call a tool, what they are allowed to do, and how those actions should be audited. So that means all of that must be handled externally. And the same deal with unauthorized access concerns and monitoring and oversight challenges. And there are vulnerabilities to prompt injection as well. So again, people are coming up with solutions, and I have papers at the back of this that highlight what the overall security frameworks are and what solutions are being built to make that easier, because if you use a community MCP server, you really have to do that background check yourself. You can't necessarily trust everything that is being put out there, because it's still early days in some ways.
Sophia Skowronski [00:37:55]: And so yeah, just to get to the wrap up, there are a few papers that stood out. There's, of course, this first one, which is MCP: Landscape, Security Threats and Future Directions. It's kind of an overview of the whole architecture, the risks and the life cycle of MCP servers. And then there's the second one, which we actually talked about in the August or September reading group, which was basically a continuous evaluation benchmark that stress tests agents across 101 real world dynamic tasks and exposed some weaknesses in some of the reasoning models that they used. And so it actually was a good example of continuous evaluation, which I'm going to talk about next. And in the paper they use an LLM as a judge that judges the final output based on the efficiency of the tool calls, did it follow the most efficient path, as well as whether or not it sufficiently completed the task at hand. And it showed, I think, that all reasoning models have around an 80% successful completion rate on a task. So it alluded to the fact that maybe there's more development needed in LLMs right now, where they need to be able to support better or more efficient tool calling.
Sophia Skowronski [00:39:29]: And so this last paper is called Securing AI Agent Execution, and it's kind of the first proposal for a security framework for MCP servers. Cool. And so yeah, these papers were all helpful for pushing the open source community to figure out MCP implementations. And so I guess I can just end on how you can try MCP. You can actually just use it in Claude Desktop or Copilot; you just have to add servers in developer mode in the config file. There are a lot of videos on how to do that. And then you can also use the official SDKs for building a tool and making it available.
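As a sketch of what "building a tool and making it available" can look like with the official Python SDK's FastMCP helper, the weather server from earlier might be roughly this. The decorator-based API reflects the SDK as of late 2025, so treat the exact names as something to verify against the current docs; the tool bodies are placeholders.

    # pip install mcp   (the official Model Context Protocol Python SDK)
    from mcp.server.fastmcp import FastMCP

    mcp = FastMCP("weather")

    @mcp.tool()
    def get_weather(city: str) -> str:
        """Current weather for a city."""
        # Placeholder implementation; a real server would call a weather API.
        return f"Sunny and 20C in {city}"

    @mcp.resource("weather://history/{city}")
    def historical_temperatures(city: str) -> str:
        """Historical temperatures for a city."""
        return f"Average highs for {city}: 10C in January, 29C in July"

    @mcp.prompt()
    def forecast_prompt(city: str) -> str:
        """Reusable prompt template for asking about a forecast."""
        return f"Give a short, plain-language weather forecast for {city}."

    if __name__ == "__main__":
        # Defaults to the stdio transport, which hosts like Claude Desktop
        # launch as a subprocess.
        mcp.run()

The host then points at this script in its config, for Claude Desktop typically an entry under "mcpServers" with the command to launch it, which is the step the videos Sophia mentions walk through.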
Sophia Skowronski [00:40:14]: I think it's pretty easy. You can use standalone Docker containers as well. That's a wrap for MCP for this year. And so I can actually go on to the next topic. I see lots of questions coming in. Okay, so yeah, I'm still sharing my screen. So let's go to the next beautiful artwork for this next topic, which is continuous eval frameworks superseding one-time benchmark validation. So basically, to set the context here: LLMs are now powering a lot of different types of processes.
Sophia Skowronski [00:40:56]: They power chatbots, they're part of Copilot, they're used in search now and in lots of different decision tools. I don't know if you have managers who talk to you about, oh, what's the deal with this model, what's the deal with that model? It would be a full time job to review all of the latest LLM releases as they come out and identify the strengths and weaknesses of all the different implementations. But the point is that LLM behaviors can change, and they seem to change constantly. So we need an evaluation framework that keeps up with that. And so let's start with the basics, the good old days. In the good old days of early Transformers, we relied on a handful of well known benchmarks.
Sophia Skowronski [00:41:52]: So I've listed some of the classics there. And for instance, you can see this SQuAD one; this is actually a leaderboard for the SQuAD benchmark. And you can see all the models steadily pushing the exact match accuracy higher and higher. So it's pretty simple; it's almost like you look at that and say, we have solved NLP. But there were actually problems with this. As soon as more models got on the scene, models could start to memorize benchmarks, especially if they know which benchmarks they're going to be tested on. As soon as the field discovers a new data set, it spreads everywhere and models can study for the test, and then these benchmarks get outdated pretty quickly.
Sophia Skowronski [00:42:43]: So a data set built even a few years ago probably can't tell you how a model performs on tasks that use knowledge from 2024 or 2025. And once models hit human level performance on an evaluation, the benchmark kind of stops telling you anything meaningful. So they're good for early progress, but it seems like they're not keeping up with the pace and complexity of modern LLMs. And these aren't fixed models, so every update has the potential to shift how the model behaves. So without reliable evals, you're shipping this LLM into the void, essentially. On top of that, models also work in unpredictable, dynamic environments. Millions of people ask them lots of unique questions in different languages, different contexts and every format.
Sophia Skowronski [00:43:44]: And so there's no static test that can represent that. And so new behaviors emerge all the time. I'm sure maybe some of you have heard of semantic leakage in GPT-4o; I think this was from a few weeks ago, where models leak irrelevant information from the prompt into the output in unexpected ways. Like: he likes yellow, he works as a, fill in the blank, school bus driver. So how do you debug your prompts efficiently against all of the different biases that come out of the LLMs? And if there are more biases being discovered, how do you keep up with those? And so let me zoom out. I'm just pulling up recent ones, but there's also subliminal learning where, let me zoom in, models can unintentionally learn patterns from signals that were meant to be hidden.
Sophia Skowronski [00:44:48]: And so signals from the parent model that were meant to be hidden or irrelevant end up getting picked up later by the child model, which wasn't supposed to have access to them. Again, these papers are here in case you want to check it out more. So let's see, did I cover everything here? Yeah, lots of that. And then another piece is that the way we're testing models is in such dynamic, unique environments that it might be difficult to find human labelers for every single instance you'd want to test.
Sophia Skowronski [00:45:34]: So this is where potentially using LLM-generated test sets and using LLMs as judges can be helpful in terms of scaling these testing and evaluation frameworks. Okay. And so some of the key drivers for this continuous evaluation: there's been a big ecosystem built out over the last couple of years. There's Stanford's HELM, there's OpenAI Evals, which has community contributed tests. There's Hugging Face, as we all know. And there's also Chatbot Arena, which allows users to compare different models head to head, which allows for human feedback at scale, which is great. Wrapping up here: as models have gotten more capable and more unpredictable, we've shifted towards continuous evals. And this style looks pretty different from the old single, one-time static benchmark.
Sophia Skowronski [00:46:47]: So it means testing across multiple dimensions. So not just accuracy, but also reasoning, safety, bias, multimodal skills, what matters for the actual job as well. So it makes it more realistic, but it is also a more complex task at hand. As a developer, you can also update how you do scoring, so you can do rules-based checks in addition to LLM-as-judge scoring in addition to human-in-the-loop agreement. Rules-based systems handle the easy things, LLM judges can scale evaluation quickly, and humans can step in for the nuanced edge cases where more subjective judgment is needed.
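A minimal sketch of that layered scoring idea: cheap rule checks on everything, an LLM judge for scale, and a flag for human review when the judge is unsure. The judge prompt and thresholds are made up for illustration, and `llm` is any prompt-in, text-out callable you already have.

    import json

    def rule_checks(output: str) -> dict:
        # Cheap, deterministic checks run on every sample.
        return {
            "non_empty": bool(output.strip()),
            "no_placeholder": "TODO" not in output,
            "length_ok": len(output) < 4000,
        }

    def llm_judge(llm, task: str, output: str) -> dict:
        # LLM-as-judge: ask a model to grade the output against the task.
        verdict = llm(
            "Score this answer from 1 to 5 for correctness and return JSON like "
            '{"score": 4, "reason": "..."}.\n\n'
            f"Task: {task}\n\nAnswer: {output}"
        )
        return json.loads(verdict)

    def evaluate(llm, task: str, output: str) -> dict:
        rules = rule_checks(output)
        judge = llm_judge(llm, task, output) if all(rules.values()) else {"score": 1}
        return {
            "rules": rules,
            "judge": judge,
            # Humans step in only for the ambiguous middle band of scores.
            "needs_human_review": 2 <= judge.get("score", 1) <= 3,
        }

Running something like this over a stream of production samples, rather than once against a frozen benchmark, is the shift being described.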
Sophia Skowronski [00:47:51]: There's also more task-oriented evaluation, which I kind of already mentioned. So instead of trying to optimize your LLM for a blanket statement like improving productivity, maybe you get more specific: whether that's summarizing legal documents, writing SQL or answering medical questions, how do you evaluate how well it's doing those specific tasks, what it's actually meant to work on, not abstract proxies or whatever? So continuous evaluation has become the standard, not because it's a flawless new system, but because it seems to be the only approach that's flexible enough to handle the constant change of modern LLMs. So I think that's it for me. And so I guess we're switching to questions now, right?
Arthur Coleman [00:48:28]: Correct. Let me take it from here, and correct me, Binoy: we have five minutes for questions.
Binoy Pirera [00:48:37]: Not really. Let's just take the top two questions. Honestly, we're running slightly behind on time, but if we have some time after all the clusters are over, we can come back to questions. By the way, Arthur, there's a bunch of background noise from here right in.
Rohan Prasad [00:48:50]: This world letting you know.
Arthur Coleman [00:48:52]: Okay, it's not, can you shut down Everybody except me, because it's not me.
Binoy Pirera [00:48:58]: Done.
Arthur Coleman [00:48:58]: All right. Okay, we got it. Sounds like it's gone. All right. So Nehil, you want to ask your question?
Nehil Jain [00:49:04]: Sure, yeah. Rohan, when you were chatting about improved context engineering versus prompt engineering, a question came to mind: there are all these different techniques that we can use to do context engineering. Is it the same strategy as evaluating prompts against each other, or is there a better way to say this works better in practice for these types of tasks, so to speak? Like, use System 2 Attention versus use compaction versus something else, for these types of tasks.
Rohan Prasad [00:49:37]: Yeah, I think it really depends. A lot of it's going to be very contingent on your particular data set and what you're actually trying to do. So I don't think there's a one-size-fits-all. But if you really want to talk about evaluations, then I think there are two ways to take it. There's the integration test scale, where you treat the entire system, from the query, to how you're pulling in information, to your agent's response, as a black box, and you're validating whether your agent's response still looks okay. And I think that's how you approach it from a prompt engineering perspective, where the prompt is part of how you're calling the particular LLM. What context engineering allows you to do is think about it at the scale of specific components, sorry, not just the integration, end-to-end test.
Rohan Prasad [00:50:21]: But if you want to really think about it as a series of components, let's say you have some memory system which you're pulling in particular information. Let's say you have a system which is doing compaction. You can do things like context recall and context precision on this particular sub components. So you can essentially say like, hey, I'm getting a really good recall score here, but I'm getting really bad precision. So depends on your system. Once again, maybe that's okay. But essentially you can evaluate those sub components independently and then use that. You can still do your full blown end to end test, but you can do your more sub component based tests as well and evaluations.
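For the sub-component evaluation Rohan describes, scoring what the retrieval or memory layer returned with context recall and context precision, here is a small illustrative sketch. It assumes you have labeled "gold" relevant chunks per query, which in practice is the expensive part.

    def context_precision(retrieved: list, gold: set) -> float:
        # Of what we put into the context, how much was actually relevant?
        if not retrieved:
            return 0.0
        return sum(chunk in gold for chunk in retrieved) / len(retrieved)

    def context_recall(retrieved: list, gold: set) -> float:
        # Of the relevant chunks, how many made it into the context?
        if not gold:
            return 1.0
        return len(gold.intersection(retrieved)) / len(gold)

    # Example: high recall, poor precision -- the "stuff in the whole book" failure mode.
    retrieved = ["c1", "c2", "c3", "c4", "c5", "c6", "c7", "c8"]
    gold = {"c2", "c5"}
    print(context_precision(retrieved, gold))  # 0.25
    print(context_recall(retrieved, gold))     # 1.0

Whether a low precision score is acceptable depends on the system, which is exactly the "it depends" caveat in the answer above.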
Sophia Skowronski [00:50:58]: Cool.
Nehil Jain [00:50:59]: Yeah, I think that makes sense.
Arthur Coleman [00:51:00]: Okay, next question, because we're short on time. Vignesh, would you like to ask your question? You're on mute potentially, Vignesh. Oh, okay. All right, I'll read it. Vignesh, I see your note. So Vignesh asks: are MCP servers often doing the context engineering for any context made available through them? Or does context engineering still fall on the user's side, on top of what they do, to optimize further for your use case? I don't know who that's directed to.
Arthur Coleman [00:51:38]: It sounds like it's Rohan, but I'm not sure. Oh no, I'm sorry. That's Sophia. It's MCP.
Sophia Skowronski [00:51:47]: Yeah. So like where can you restate the question? It had a couple different pieces to it. So it was. What was the first question?
Arthur Coleman [00:51:57]: Hold on. So: are MCP servers often doing the context engineering for any context made available through them? Or does context engineering still fall on the user side?
Sophia Skowronski [00:52:11]: Yeah, it seems like it falls on the user side, because the tool just accesses data, although I guess it does manage context in some ways. If you're using MCP to extract or pull in data, that's an additional loop of reasoning. But it kind of depends on how you've set up the flow and how context engineering is used.
Arthur Coleman [00:52:40]: You know, my experience is how you architect it. You can architect it either way.
Sophia Skowronski [00:52:44]: Yeah.
Arthur Coleman [00:52:46]: All right, I've been told two questions only. So we're going to move on to session two, which is multi agent orchestration and model cognition. This is going to be handled by Nahil and Valdemar. So I will turn it over to Nahil and Valdemar. I don't know which one of you is speaking first, but go ahead.
Nehil Jain [00:53:03]: Yeah, I think Valdemar is not here, but Lucas is joining us and it's his first time. So welcome, Lucas. Take it away.
Lucas Pavanelli [00:53:11]: Yeah, pleasure to be here. I can start with the multi agent part. Let me share my screen. Yep. Okay. So yeah, I prepared a little presentation here about multi agent orchestration, talking a little bit about research and also some production use cases.
Lucas Pavanelli [00:53:43]: So first, we talked here in the session about agents and also MCP. I think all of this relates to this broad concept, which is multi agent systems. If we think of an analogy with a kitchen, we have an orchestrator, which is like the head chef, which does the planning and delegates tasks, and we have some sub agents which are specialized in doing certain specific things, like, in the kitchen, a chef for pastries, or for seafood, or for entrees or desserts and things like that. If you think of sub agents in general, you can have, for example, a sub agent which is specialized in coding, or in testing, or in writing documentation; we can think about many different types of use cases. And the main idea is that this orchestrator role is like a central brain that delegates the tasks to these workers. So it needs to have certain responsibilities, like planning, or understanding which sub agent should be used for each specific task. And of course, one of the main parts is how you communicate.
Lucas Pavanelli [00:54:58]: So if you start with this more complex architecture, you need to have some type of communication protocol. One of the ways is, for example, if the orchestrator wants to call a sub agent, it can generate a JSON request, and then your system can read this and call the sub agent. That's also how it's done in popular packages like LangChain and LangGraph; using those you can implement this type of system as well. And the third part, which is really important, is that you have memory which is shared between the sub agents and the orchestrator. That can be defined depending on your application, but at the end of the day you need to have some type of collective memory when you are building this type of multi agent system. So with this said.
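A bare-bones sketch of the delegation pattern Lucas describes: an orchestrator that emits JSON requests, a registry of specialized sub agents, and a shared memory they all append to. The routing logic and agent names are invented for illustration; frameworks like LangChain or LangGraph give you much richer versions of the same idea.

    import json

    SHARED_MEMORY = []          # collective memory visible to all agents

    def coder_agent(task: str) -> str:
        return f"[coder] wrote a patch for: {task}"

    def tester_agent(task: str) -> str:
        return f"[tester] ran the tests for: {task}"

    SUB_AGENTS = {"coder": coder_agent, "tester": tester_agent}

    def orchestrator_plan(goal: str) -> list:
        # Stands in for an LLM planning call; it would emit JSON delegation
        # requests like the ones below based on the goal and the shared memory.
        return [
            json.dumps({"agent": "coder", "task": goal}),
            json.dumps({"agent": "tester", "task": goal}),
        ]

    def run(goal: str) -> list:
        for raw in orchestrator_plan(goal):
            request = json.loads(raw)                  # orchestrator -> system
            worker = SUB_AGENTS[request["agent"]]      # pick the specialist
            result = worker(request["task"])           # sub agent does the work
            SHARED_MEMORY.append({"request": request, "result": result})
        return SHARED_MEMORY

    print(run("fix the login timeout bug"))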
Arthur Coleman [00:55:51]: Can you hold on one sec? I want to. I'm gonna. I need you to put your thing back on you. I'm about to mute everybody and then unmute yourself. Okay?
Lucas Pavanelli [00:56:01]: Okay.
Arthur Coleman [00:56:04]: Okay. Now you should be able to unmute yourself.
Adam Becker [00:56:14]: You know.
Arthur Coleman [00:56:15]: There we go. Thank you.
Lucas Pavanelli [00:56:16]: Okay. Okay, good. So, continuing on this, here, let me go back. So, defining the multi agent system in general, I would like to talk about a research paper that I read recently, which implements a research playground for this with a multi agent system, which is really interesting. In this paper they are building a multi agent system for playing Minecraft, and it's all based on LLMs, and it tests some new features that they are proposing in the paper. So normally, when you have standard agents,
Lucas Pavanelli [00:57:06]: They work in serial: you think, and then you act, and then you think again, right? And the interesting part of this research paper is that they try to parallelize the planning and acting parts of this multi agent architecture. As I said, normally when you have this orchestrator, you need to have a planning part, right, which works out what your goals are, what your long term goals are, how you plan to solve problem X, and you also have the acting thread, which is mainly the sub agents which interact and perform the task. If you think of it in this broad sense, you can separate it like that, right? And in this paper they take this system and make it a little more complex by allowing the planner to stop the actor if something changes. And the scenario here is a made up scenario, because this multi agent system is implemented for playing a specific game, Minecraft in this case. And you can see here that all these parts that I presented, they have a centralized memory which contains the records they observe, the chat logs, the actions that were taken. You have the planning and the acting parts here in the middle, and you also have the resources that you can get here; they are resources that you need to collect so you can advance in the game.
Lucas Pavanelli [00:58:42]: And this scenario is really dynamic, because the environment is changing all the time. So your observations are changing and you need to adapt to them. In all this context, it makes sense that, for example, the orchestrator can pause or interrupt another agent so the task can still be finished.
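A toy asyncio sketch of that interruptible planner/actor split. This is purely illustrative: the observations, plans and timings are invented, and it is not the paper's implementation.

```python
import asyncio

async def actor(plan: str) -> str:
    # Stand-in for a sub-agent carrying out the current plan step by step.
    for step in plan.split(";"):
        await asyncio.sleep(1)                       # pretend each step takes time
        print(f"actor: finished {step.strip()}")
    return "done"

async def planner(observations: asyncio.Queue) -> None:
    task = asyncio.create_task(actor("gather wood; build shelter; find food"))
    while not task.done():
        try:
            obs = await asyncio.wait_for(observations.get(), timeout=0.5)
        except asyncio.TimeoutError:
            continue                                 # nothing new; let the actor keep going
        if obs == "night is falling":                # the environment changed: interrupt and replan
            task.cancel()
            try:
                await task
            except asyncio.CancelledError:
                pass
            print("planner: replanning after interruption")
            task = asyncio.create_task(actor("light torch; hide in shelter"))
    await task

async def main() -> None:
    queue: asyncio.Queue = asyncio.Queue()
    await queue.put("night is falling")              # a change pushed in from the environment
    await planner(queue)

asyncio.run(main())
```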
Binoy Pirera [00:59:08]: Sorry to interrupt, man. I think, I think your audio quality is dropping drastically. I'm not sure what it is. There's like a bunch of background noise and we can't really hear you.
Lucas Pavanelli [00:59:17]: Right?
Binoy Pirera [00:59:27]: Yeah, we can barely hear you.
Lucas Pavanelli [00:59:34]: Can you hear me again?
Arthur Coleman [00:59:35]: Yes.
Binoy Pirera [00:59:36]: Better?
Lucas Pavanelli [00:59:37]: Yeah. Okay. Yeah, I think my battery is dropping. It was in the. Let's see if it's better now. Is it better?
Arthur Coleman [00:59:48]: Yes.
Lucas Pavanelli [00:59:51]: Okay, so continuing: that's a complex scenario to deal with, and that's where the research paper tries to act. Then, if we step into production reality, I think one of the main use cases you see now is coding agents. I've been using coding agents a lot in my job as well, and one of the coding agents that uses this multi-agent architecture the most is Claude Code. They've already implemented and deployed this. The way it works is that you can create sub-agents and assign specific tasks to those sub-agents.
Lucas Pavanelli [01:00:49]: For example, you can create a sub-agent that just analyzes logs. The loop goes like this: the main agent, which is Claude Code, identifies a task that is suited to this sub-agent and calls the sub-agent with less context, so it can focus on solving that specific task. That's how they implement it in Claude Code. In this case the sub-agent is really specific to one task, and the environment the sub-agent sees is also quite restricted, because they're more interested in making the sub-agent work well for this specialized task than in creating a more general sub-agent. Here we can see, directly from the Claude Code documentation, what they say the main benefits of creating these sub-agents are. For example, there's context preservation: each sub-agent operates in its own context.
Lucas Pavanelli [01:02:09]: The sub-agents also have specialized expertise; you can reuse sub-agents for other tasks; and they have flexible permissions, so a sub-agent can be given access to tools or limited to just certain capabilities. With all that said, I want to trace the comparison: I presented one research implementation and one production implementation, and there are similarities as well as differences. In the research case the objective is made up: you're implementing a system that is used only for a game, and you're basically using the complexity of the game to create a better system. The objective in the end is just surviving in this made-up game world.
Lucas Pavanelli [01:03:13]: Whereas in the production case, we are really concerned with being reliable: the sub-agents, the multi-agent system, need to be reliable for everyone, for the developers using them. The environment is also a big difference, and the strategy is different too: the research strategy is much more complex than the production one. Also really important are the constraints. In research papers implementing these multi-agent systems, you're constrained by computational power, whereas in production you're really concerned with how many tokens you spend and with the latency of everything. So I just wanted to trace this parallel: how can we go from the research implementation of these complex systems to an actual real-world one? I think bridging that gap is a really important thing to understand and to discuss.
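To make the production pattern concrete, here is a hedged sketch of the delegation idea: a log-analysis specialist that gets its own narrow context and a restricted tool list. This is not Claude Code's actual API; `call_llm`, the dataclass and the routing rule are assumptions for illustration only.

```python
from dataclasses import dataclass, field

def call_llm(system: str, prompt: str) -> str:
    # Assumed wrapper around whatever model you use.
    raise NotImplementedError

@dataclass
class SubAgent:
    name: str
    system_prompt: str
    allowed_tools: tuple[str, ...]                    # restricted, per-agent permissions
    context: list[str] = field(default_factory=list)  # its own context, not the main one

    def run(self, task: str) -> str:
        # In a real system, allowed_tools would gate which tool calls are permitted.
        self.context.append(task)
        return call_llm(self.system_prompt, "\n".join(self.context))

log_analyzer = SubAgent(
    name="log-analyzer",
    system_prompt="You only analyze logs and report likely root causes.",
    allowed_tools=("read_file", "grep"),
)

def main_agent(task: str, main_context: list[str]) -> str:
    # The main agent recognizes a task suited to the specialist and hands it over
    # with far less context than it carries itself.
    if "log" in task.lower():
        return log_analyzer.run(task)                 # fresh, narrow context window
    return call_llm("You are the main coding agent.", "\n".join(main_context + [task]))
```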
Lucas Pavanelli [01:04:22]: Going into the future as well. That's it for my part, I think.
Nehil Jain [01:04:37]: Cool. Yeah, I can take over. Let's see, let me share my screen. Once I share the screen, I won't be able to see the cameras, so if you want to interrupt, feel free to do so. I want to make sure the content is legible, so I'll just zoom in and out. Since the beginning we've been hearing about, you know, different agents.
Nehil Jain [01:05:05]: They're able to increase context; we need to get the right context; then they're doing tool calling; now we're building multi-agents. Behind the scenes, what's happening is that the labs are working on increasing the quality and the capability of these agents. I wanted to talk about where that stands, how people are doing it, and how that changes things for engineers, and ML engineers specifically. Specifically because the world is getting to a point where most likely we will all be taking some part not just at the application layer, where we're writing prompts and connecting and architecting systems together, but where we have specialized tasks and enough data that we have to train our own. For example, Sophia was talking about a case which I see in industry very commonly right now, where you have a constrained environment: you don't have access to all the tools that would have made the agent actually work, but you have the data to build the agent yourself, one which can inherently do the reasoning in that part of the context.
Nehil Jain [01:06:09]: So earlier this year, I think in January, that was the first reading group we did, and it feels full circle: DeepSeek R1 came out, and DeepSeek R1 said, hey, I can do what OpenAI has been doing for so long with much, much less time and fewer resources. And everyone was like, oh my God, what did they do? Is OpenAI not valuable anymore? So let's try to understand what they did and how that changes fine-tuning for us: what does RL mean, what do human preference and human feedback mean, and so on. I prepared some slides around that. This is where, let's zoom in. This is a blog post by Chip Huyen which talks about how ChatGPT was built.
Nehil Jain [01:06:54]: And this is a good place to anchor and understand what has been happening behind the scenes of all the application-layer stuff that we've been doing. Reinforcement learning was actually pretty hard and not related to NLP; people who became experts in NLP four or five years ago did not know anything about RL, reinforcement learning. ChatGPT was the first production system that brought them together and showed, hey, there is value in bringing RL into the NLP world. But the techniques used for RL had been around for a long time. So this is the process that was actually used to build ChatGPT, at a very theoretical level, and there are a lot of steps.
Nehil Jain [01:07:40]: Of course it was trained on Internet-scale data, and then we got this model which we call the pre-trained model. It's very possible, if you've been self-studying transformers or LLMs and so on, that you've seen different parts of this or maybe even the whole thing, but because we have a broad audience I wanted to make sure we're anchored in the concepts. So in the Internet stage you take raw data, a lot of it, and just ask: can you predict the next word? Can you predict the next word? That thing is called a pre-trained LLM. And as Chip Huyen put it, this is more like an untamed monster; you don't really know what is going to come out.
Nehil Jain [01:08:21]: And you can't put that out in front of the world, right? You will most likely blow past ethics; the output might not even make sense, so you cannot actually use it either. But it has an understanding of the repeated patterns that we know as human knowledge captured in Internet data; we just aren't able to harness it yet. So then what happened was they took lots of human labelers, and actually not even only human labelers: they also scraped data from high-quality sources like Stack Overflow, which has "this is my question, this is the answer," Quora, and so on.
Nehil Jain [01:08:57]: So they got a lot of high-quality data, which is not at the scale of Internet data. Internet data has trillions of tokens; this is more, as it's written here, in the 10k to 100k range, so to speak. The dataset is much smaller but much higher quality, and they train on it again. That's called supervised fine-tuning: you're supervising, "given this input, this is the output you should produce," and training the model to get better at doing that.
Nehil Jain [01:09:27]: That's what the supervised fine-tuning model is, usually referred to as the SFT model. Even then, they said, okay, this is still not good enough: it's okay at giving coherent, or somewhat coherent, answers, but they're still not ethical, or not good enough. Then came the idea: can humans judge what is good and what is bad for a given answer, and do another round of feedback? That's human preference, human feedback. Then the model stays within the bounds of what humans think is a correct answer, and that is RLHF. In this diagram you'll see that RLHF is almost the same size of a box, in terms of complexity for an engineer to look at, as the previous parts. They used a technique called PPO for RLHF; because they're applying it to human feedback, they're doing RL on human feedback.
Nehil Jain [01:10:30]: And the key part of PPO here is the reward model. You need a reward model which takes your comparison data, "this is good, this is bad," classifies the output of the model, and creates a loop of "you get a good reward here, you get a bad reward here." There is deeper technical and mathematical understanding behind it, but I'm thinking more about architecting the system, the flow of data, and the different pieces you need to actually make it work, or at least how they made it work with ChatGPT. You'll see further down that it's simpler to do now; there are lots of libraries that will help. And that gave rise to ChatGPT. So there were these three different techniques that were used.
Nehil Jain [01:11:11]: One was just pre-training, which most of us will never have the budget to do. Then there is supervised fine-tuning, which we can do if we have enough data. And then there is RLHF, which has seen a lot of research and innovation, and it's still early days; more methods might come out next year, which I'm personally excited about. In terms of how this is relevant for us: there are a lot more boxes we can add here for different types of training. This is called post-training, where you take a model and train it to do something specific better, and that gives rise to extended reasoning and long context: "do a thinking block, do a solution block," and all of those things. How can you make your LLM do that specifically, in your context or for your task? There are different techniques for that.
Nehil Jain [01:12:07]: So let's look at some of them. The top row is SFT-style: you have input-output pairs and you're just helping the LLM get better at producing a good output for a given input. The bottom row is more RL, and GRPO is what DeepSeek R1 came out with. This is the thing I just talked about: GRPO is Group Relative Policy Optimization, where they do something slightly different from PPO. So let's go one by one. SFT is basically: you have a base model and you're retraining it with high-quality data, usually examples of "given this input, this is the output." You curate and make sure the input-output pairs are higher quality than the base data you trained the base model on, and so you get a slightly higher-quality model.
Nehil Jain [01:13:07]: It is not that effective and you need a lot of data. Then there came a technique called DPO, direct preference optimization. Here what you're doing is kind of what Chatbot Arena does; ChatGPT also does this, where you get two responses and it asks, hey, do you prefer this one or this one? You're getting direct human preference, and then you use those datasets to make the LLM choose a given option among the multiple answers it can produce. That is called DPO. Then there's KTO, which is not so popular yet.
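Before moving on to KTO, here is a minimal sketch of the DPO objective itself, assuming you already have per-sequence log-probabilities from the policy and from a frozen reference model. The numbers in the usage line are made up; this is an illustration of the loss, not any library's implementation.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta: float = 0.1):
    """Minimal DPO objective: push the policy to prefer the chosen answer over
    the rejected one, measured relative to a frozen reference model."""
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    # -log sigmoid(beta * margin), averaged over the preference pairs
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# Toy usage with invented per-sequence log-probabilities:
loss = dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
                torch.tensor([-13.0]), torch.tensor([-14.0]))
print(loss)
```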
Nehil Jain [01:13:40]: Maybe it will become popular. It's very similar to DPO, except you don't have a detailed comparison between multiple options; you just have one output and a feedback signal saying this is good or this is bad. Then you train your model to move toward the good answers and away from the bad ones. So it's a little bit like RL, but it's not using RL; it's just simple pairs of input and output plus "was it good or was it bad." Those are the SFT-style techniques, and then we have the RL techniques. The standard, long-standing way of doing this has been PPO, where you have a reward model, and given an input you ask: will this get a good reward or a bad reward? Based on that you improve the model itself. But you need two models, so it's memory-intensive; you need huge infrastructure to actually make it work. GRPO is different.
Nehil Jain [01:14:40]: What GRPO does is say: I don't need the reward model. But then how do you actually figure out the reward? What you end up doing is creating heuristic-based rewards. That's what DeepSeek R1 did. They were using V3 as the base model, and they said: I'll give it reasoning and make it work really well on coding and mathematical tasks, and based on that I'll see an emergence of better-quality intelligence, better-quality output. So what they did was take an input and generate a bunch of different outputs; that's the group of potential outputs for a given input.
Nehil Jain [01:15:25]: Then you have a heuristic which gives a reward by saying, relatively, these few answers are better than those few answers. So next time you try a task like this, move toward the better group of outputs rather than the bad ones, and slowly this builds up toward a better model. I think the value here is that you don't need to train another model to act as the reward, so it's much easier to get started. I'm seeing more and more in practice that GRPO is becoming the standard when you say, hey, I want to do RL. And if you just want to do supervised-style fine-tuning, it's mostly DPO now; SFT is still there, people are doing LoRAs, but plain SFT is not very common anymore.
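To make the group-relative idea concrete, here is a small sketch: for one prompt you sample a group of completions, score each with a cheap heuristic or verifier, and normalize within the group, so no learned reward or value model is needed. The reward values below are invented; this is an illustration, not DeepSeek's actual code.

```python
import statistics

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Core GRPO idea: normalize each sampled completion's reward against the
    mean and spread of its own group of samples."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0          # guard against a zero spread
    return [(r - mean) / std for r in rewards]

# e.g. 4 sampled answers to the same prompt, scored by a rule-based checker
rewards = [1.0, 0.0, 0.2, 1.0]
advantages = group_relative_advantages(rewards)
# Completions with positive advantage get reinforced; negative ones are discouraged.
print(advantages)
```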
Nehil Jain [01:16:12]: And then for RL it's GRPO. What does it actually look like in practice? There are a bunch of different libraries. I have played around with SkyRL, so I can talk about that, but I think the concepts align across the different libraries; they all share the same RL concepts. From an engineering architecture perspective, how do you train a post-trained model with GRPO or PPO? You have a controller which moves the data and the model between a trainer and a generator. The trainer runs some kind of training loop; most people will use FSDP, fully sharded data parallel, which is just a way to use lots of GPUs and load lots of data so that you can train a big model. Otherwise you're stuck with smaller models that can't encode as much knowledge, so to speak.
Nehil Jain [01:17:05]: The generator takes that same model and does inference on it: given this input, give me an output. There's no updating of weights there, which is what actually changes or improves the model itself. And this inference engine works closely with an environment. For most engineers, the thing we will be working on in 2026 is mostly environments, because you already have many inference providers and training is mostly standardized; the training block is more research-y and most people don't change it. The common things to change are the prompt, which is part of the input, and the environment, which is the task you're doing. And there are different paradigms and APIs coming out for how you write your tasks.
Nehil Jain [01:17:49]: And of course they're reusable: if you're building a coding agent, a file-editing tool is probably the most common tool, and why should everyone implement their own file-editing tool? So that's where the world is going. I just wanted to flag that if you get to a point in your work where you say, hey, I want to do RL-based fine-tuning, it's not very hard anymore; it's getting easier and easier. All you need is infrastructure, and then learn one of these libraries.
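As a rough sketch of the moving parts just described (controller, trainer, generator, environment), here is the shape of the loop with hypothetical interfaces; these object methods are assumptions for illustration, not SkyRL's or any other library's real API.

```python
# Not any particular library's API; just the moving parts of a post-training loop.

def post_training_loop(trainer, generator, environment, prompts, steps=1000):
    for _ in range(steps):
        batch = prompts.sample()
        # 1. Generator: run inference with the current weights (no weight updates here).
        completions = generator.generate(batch)
        # 2. Environment: execute the task / tool calls and hand back rewards.
        rewards = [environment.score(p, c) for p, c in zip(batch, completions)]
        # 3. Trainer: update the model (typically sharded across GPUs, e.g. with FSDP).
        trainer.update(batch, completions, rewards)
        # 4. Controller step: sync the fresh weights back into the generator.
        generator.load_weights(trainer.current_weights())
```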
Nehil Jain [01:18:19]: I can show you an example of what SkyRL training can look like. They wrote a paper about doing text-to-SQL with RL. Inside that loop, the generator gets a question and the environment is the database. So the agent thinks about what its response to the prompt should be, generates a SQL query, and runs the SQL query in the environment. There is another function which knows how to get the right response, and it can do a code-based match: is this output exactly the same as that output? Then it calculates some reward based on the match, gives the reward back, and you keep looping through this; that's your training loop. These are some examples of how you can even make it do multi-turn, and this is how you develop extended thinking.
Nehil Jain [01:19:20]: So you get the model to do a lot of different iterations of thinking, SQL, observation. Then, using GRPO, you say: turns one and two were great, three, four and five were bad, don't go that way. Eventually you get better and better turns, and that's kind of it. All you needed to do was figure out how to set up an environment with SQLite and have some dataset, because you need to know what the correct output for an input prompt is. There are datasets for this already: you can create a synthetic dataset in SQLite, with an English prompt asking a question and a SQL output as well. So you can use that to train the model to do a better job.
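A minimal sketch of that reward step, assuming a SQLite database and a known gold query. This is an illustration of the idea described above, not the paper's actual reward code; the partial-credit value is made up.

```python
import sqlite3

def sql_reward(db_path: str, generated_sql: str, gold_sql: str) -> float:
    """Rule-based reward: run the generated and gold queries against the same
    SQLite database and compare the result sets."""
    conn = sqlite3.connect(db_path)
    try:
        try:
            got = set(conn.execute(generated_sql).fetchall())
        except sqlite3.Error:
            return 0.0                        # the query didn't even run
        gold = set(conn.execute(gold_sql).fetchall())
        return 1.0 if got == gold else 0.1    # small credit for valid-but-wrong SQL
    finally:
        conn.close()
```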
Nehil Jain [01:20:06]: They used Qwen 2.5, but you can use any model you want. This is what one example from the dataset looks like; it's fairly standard and makes a lot of sense. This is the name of the database, because the dataset had many different databases; this is the question; and this is the correct answer. If the output of this SQL matches whatever the model produces from the question, you get a better reward; if not, you don't. And then there's just some syntax for what synthetic data to use to create your SQLite database.
Arthur Coleman [01:20:41]: Nehil, since you're at your last slide, I'm going to stop us, because I want to make sure our last speakers have time. Unless you have something major left to discuss, because I see your last slide is references.
Nehil Jain [01:20:54]: Yep, that's it.
Adam Becker [01:20:55]: Are we?
Arthur Coleman [01:20:56]: Are we?
Adam Becker [01:20:56]: Okay?
Arthur Coleman [01:20:56]: Okay, good. I'm going to skip ahead, if you don't mind, Binoy, because of time and the quality of what's coming from these speakers. I'm sitting here just in awe of you guys and what I'm learning. So I'd like to jump to Adam and Anna, let them talk, and then we'll have questions at the end. Are you guys okay with that?
Binoy Pirera [01:21:16]: Yeah, let's go for it.
Arthur Coleman [01:21:17]: Okay, Adam, Anna, you're up.
Anna Yoon [01:21:24]: All right, Adam, do you want to go first? Or I can cover first and then try to leave enough time for you.
Adam Becker [01:21:30]: Yeah, yeah, go ahead, please.
Anna Yoon [01:21:31]: All right, let me try share my screen.
Sophia Skowronski [01:21:59]: All right.
Anna Yoon [01:21:59]: Are you guys able to see this?
Adam Becker [01:22:02]: Yeah.
Anna Yoon [01:22:05]: Okay, I'm reusing my slides from a previous conference, so please disregard the logos. My design team put this together and it looks much nicer than what I'd make, so I'm just going to reuse it. All right, I'll get right into it. Every day you see a new recommender, chatbot or summarizer shipped. It's an exciting time in tech, but there's a troubling pattern: your model performs well on technical benchmarks, but there's no guarantee those gains will accrue to a business goal. Without careful testing and validation, these once-exciting new features often fall flat: users bounce, trust erodes, and support tickets spike.
Anna Yoon [01:22:49]: This is no coincidence. This is the gap between offline model validation and real-world product success. I think in the Q&A sheet Vigna asked whether we will be comparing online versus offline; this presentation will cover that. So, the AI evaluation crisis. To quote the MIT Technology Review, human preference testing has also emerged as an alternative to benchmarks, and AI researchers are beginning to realize and admit that the status quo of AI testing cannot continue. The problem here is that many AI developers and leaders fail to understand that AI features are about more than just the model itself.
Anna Yoon [01:23:36]: In the real world, it's about the UX, the context, and the user goals surrounding them. So I'll introduce the ways product leaders are addressing the AI evaluation crisis, in practical steps. The core problem is that benchmarks don't measure reality, and the evidence is overwhelming. To list just a few: Meta's Galactica scored very well internally but was pulled within three days for fabricating scientific facts; Air Canada's AI chatbot hallucinated a fake refund policy, triggering a lawsuit; and so on. Stanford's research shows general-purpose LLM chatbots hallucinate legal facts up to 82% of the time. What people think goes into an AI app: a model API call and model parameters. What actually goes into an AI app, on top of those: embeddings in a vector database, in-app content like chat history, non-AI app features like the UX and UI of your product, and user information.
Anna Yoon [01:24:45]: And each of these building blocks can be either proprietary or non-proprietary. So what actually works here? The real solution is product-level validation. This is the only question that really matters, and there's only one reliable way to answer it: real user data. That means A/B testing AI-powered features against baselines or different models, using holdouts to measure the cumulative impact of new features and catch metric regressions, and implementing trust and safety guardrails tied to user behavior and business metrics. I'd love to highlight some products here, particularly Notion's AI, which used flags, experiments and validation on its way to business success, and Cursor, voted Product of the Year 2024, which emphasized seamless user experience rather than just benchmark claims. So what's behind their product success? It's the full AI testing stack: a three-layer stack with, first, model evaluation; second, user validation; third, monitoring and guardrails.
Anna Yoon [01:26:09]: In layer one, model evaluation, your goal is to check whether the model produces coherent, relevant and safe outputs in a controlled setting. In layer two, user validation, your goal is to test whether the AI experience actually improves user outcomes compared to the baseline. In layer three, monitoring and guardrails, while your feature is in the wild, you need to track ongoing performance, and user trust as well, to catch any silent failures after launch. So let's deep dive into each layer. Layer one, model evaluation: again, this is the first filter in the AI product development process, where you test how the model performs in a controlled, usually offline environment.
Anna Yoon [01:27:04]: This step helps catch functional failures, hallucinations and quality issues before anything reaches production. Some of the components that go into this layer are the model, the prompt, model parameters like temperature, and then the UI and UX of your product. Teams typically run predefined eval sets, manually review prompt-response pairs, or rely on LLM-as-a-judge techniques, where one model scores the outputs of another. Automated tools like toxicity classifiers and hallucination detectors can also help catch known pitfalls. Common methods include offline prompt evaluations using labeled datasets with expected outputs, and running outputs through either rule-based or model-based scoring filters. These techniques can establish a baseline level of model quality before progressing to user-facing experiments. Now, great, you passed the first filter: your model works well offline. But offline checks alone only measure isolated output quality and are poor indicators of actual product success.
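A minimal sketch of such an offline harness, combining a rule-based check with an optional LLM-as-judge hook. The helper names, banned phrases and weights are all invented for illustration; `call_judge` is a stub you would replace with your own judging model.

```python
def call_judge(prompt: str, answer: str) -> float:
    # Hypothetical LLM-as-judge hook returning a score in [0, 1].
    raise NotImplementedError("plug in an LLM-as-judge call here")

BANNED = ("guaranteed refund", "medical diagnosis")        # made-up policy phrases

def rule_score(answer: str, expected_keywords: list[str]) -> float:
    if any(phrase in answer.lower() for phrase in BANNED):
        return 0.0                                         # hard rule-based failure
    hits = sum(k.lower() in answer.lower() for k in expected_keywords)
    return hits / max(len(expected_keywords), 1)

def run_offline_eval(model_fn, eval_set, use_judge=False):
    scores = []
    for row in eval_set:                                   # rows like {"prompt": ..., "keywords": [...]}
        answer = model_fn(row["prompt"])
        score = rule_score(answer, row["keywords"])
        if use_judge:
            score = 0.5 * score + 0.5 * call_judge(row["prompt"], answer)
        scores.append(score)
    return sum(scores) / len(scores)                       # baseline quality before any A/B test
```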
Anna Yoon [01:28:27]: Traditional NLP metrics like BLEU and ROUGE show poor correlation with human quality judgments. InstructGPT, with 1.3 billion parameters tuned with human feedback, significantly outperformed the much larger 175-billion-parameter GPT-3 in human preference evaluations. The learning here is that a model can pass every evaluation and still fail in the real world, because it doesn't actually help users get their job done. That's why it's critical to have a strong way to go from model evaluation to the next layer, user validation.
Binoy Pirera [01:29:10]: So.
Anna Yoon [01:29:10]: User validation is a four-step process. Step 1: build compelling AI features that engage users. Step 2: flight dozens of models, prompts and parameters. Step 3: collect data on all inputs and outputs, including cost, latency, performance, and any other metrics that matter for your business. Step 4: use this data to select the best-performing variants and train new models. It's critical to see how the AI performs in context, with real users, real use cases and real stakes. Controlled product experiments and feature flags let you measure whether the AI-powered feature actually improves key outcomes like engagement, task completion or revenue, and that's the goal of this layer.
Anna Yoon [01:30:07]: Some experimentation must-haves for the user validation layer: definitely A/B testing your AI-powered features. With AI products, the simplest approach is to A/B test the AI experience against the traditional experience as the baseline. You randomly assign users between versions of an experience so you can directly compare "is AI better?" and not just "is AI working?". You can also use advanced statistical settings like interaction effect detection, where you rapidly test multiple AI variants and features at the same time. And you can use holdouts, a critical concept: a product experimentation technique where a percentage of your users stays on the non-AI version indefinitely, like a permanent control group. Holdouts help you quickly catch silent regressions that your new AI features can introduce, such as confusion, friction, user churn or bugs. The hard truth about AI products is that a feature doesn't succeed just because the model looks good or because it was launched successfully. It succeeds if and only if real users prefer the AI-powered experience over the baseline, and the only way to know that is through the experimentation we're talking about today.
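As a toy illustration of deterministic assignment with a permanent holdout: bucket sizes, bucket names and the experiment key are made up, and in practice a feature-flag or experimentation platform handles this rather than hand-rolled code.

```python
import hashlib

def assign_variant(user_id: str, experiment: str = "ai_summary_v1") -> str:
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100                   # deterministic 0-99 bucket per user
    if bucket < 10:
        return "holdout"       # permanent non-AI group, used to catch silent regressions
    if bucket < 55:
        return "baseline"      # existing non-AI experience
    return "ai_feature"        # new AI-powered experience

def log_exposure(user_id: str, metric_event: str) -> None:
    variant = assign_variant(user_id)
    # In practice this goes to your analytics pipeline, keyed by variant, so you can
    # compare engagement, task completion or revenue per arm.
    print(f"{user_id=} {variant=} {metric_event=}")

log_exposure("user-42", "summary_accepted")
```

Hashing on `experiment:user_id` keeps assignment stable across sessions, which is what lets the holdout act as a long-lived control group.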
Anna Yoon [01:31:46]: Layer three, the final layer: guardrails and monitoring. Even after a successful launch and experiment, your AI-powered features are always at risk of silent failures. Outputs can fluctuate based on model updates, prompt changes, data drift, and API changes from the third-party LLM providers you're using. As this drift occurs, quality can degrade subtly over time, and issues might only show up in edge cases or downstream business metrics. That's why ongoing monitoring and automated guardrails are essential. Just checking the time. Some highlights on methods for monitoring your AI's success. First, feature-flag your AI product and keep it wrapped in experiments permanently: launch your AI features behind flags, continuously monitor user behavior, and turn off degrading models or problematic behaviors instantly. Second, implement alerting on trust metrics: instead of just monitoring system health for crashes and bugs, monitor user trust health as well, like opt-outs, abandonment, negative edits and spikes in undo behavior. And third:
Anna Yoon [01:33:23]: Built-in rollback tools and plans. Rollback isn't just for infra risks: you have to be prepared to revert model versions, prompt versions, and entire AI-driven flows if trust metrics degrade. AI isn't a set-it-and-forget-it world. It's a fundamentally different type of product, a living, probabilistic system that needs permanent guardrails, and monitoring trust signals is as critical as monitoring uptime or error rates. And just to wrap up: product experiments are nothing new. They've been the backbone of how companies like Facebook, Netflix, Amazon, Uber and Airbnb built products that scaled. These companies didn't rely on intuition.
Anna Yoon [01:34:23]: They ran experiments to understand what worked for their users and what didn't. The difference is that in the past, engineers shipped deterministic features: you knew exactly how a button, a ranking algorithm or a recommendation rule behaved, even if the business impact wasn't fully clear. With AI, that certainty is gone. Foundation models are probabilistic by nature, not deterministic; outputs vary based on prompts, user input, context, and even silent model updates. What looks fine in a demo or benchmark might quietly fail in production, hurting the user experience, degrading trust or driving churn without anyone noticing. And that is why AI requires a different level of discipline.
Anna Yoon [01:35:15]: You need to continuously ask yourself: does this AI actually help users? Do users prefer it over the baseline? Is it still working as intended over time? And the only way to answer these questions is through a continuous product loop: evaluate, experiment, monitor your metrics, and improve. Final takeaways: AI is easy to ship but hard to get right, so build AI features that actually work the same way product teams have validated software for decades: through experimentation. And that is the end of my presentation.
Anna Yoon [01:36:02]: And we can connect through the QR here. And feel free to drop more questions in the docs. Thank you.
Adam Becker [01:36:14]: Awesome. Thank you, Anna. Arthur, Binoy, can I take it away?
Arthur Coleman [01:36:18]: Yeah. Adam, I want to say one thing first. This is such an amazing session. I know we're supposed to end at 10, and I'm going to stop you at 10, but Binoy, can we reset the clock, potentially? Because I'm going to let it run over for Q&A. There is so much good stuff here.
Adam Becker [01:36:34]: Okay.
Binoy Pirera [01:36:35]: Yeah, don't worry about it. Yeah, don't worry about it.
Adam Becker [01:36:37]: I'll cover my part quick, though, too. So I think we'll have some good amount of time for Q and A.
Nehil Jain [01:36:44]: Okay.
Adam Becker [01:36:44]: Or at least that's the hope. We'll see how. We'll see how quick.
Binoy Pirera [01:36:48]: Don't worry about it, Adam. I think we will have enough time.
Adam Becker [01:36:52]: Awesome. Can I show my screen?
Binoy Pirera [01:36:54]: I'm not sure if it's just me. Can everybody see Adam? Because I don't see Adam.
Adam Becker [01:36:58]: You don't see me?
Arthur Coleman [01:36:59]: Yeah, I see him up top. We don't see your screen, Adam.
Adam Becker [01:37:02]: Yeah, no, not my screen. You're not seeing my screen yet. You can see me on top, but you can't see me on the stage. That might be. But no, that's. Maybe you need to do that.
Arthur Coleman [01:37:13]: That's a view issue.
Binoy Pirera [01:37:15]: Yeah, yeah, yeah.
Rohan Prasad [01:37:16]: Hold on.
Adam Becker [01:37:24]: Okay, There we go.
Lucas Pavanelli [01:37:26]: All right.
Adam Becker [01:37:27]: Okay. You all can see it, right?
Arthur Coleman [01:37:29]: Yes.
Adam Becker [01:37:32]: So today I want to talk about the openness of AI models. I want to build a little bit of intuition about what's been happening over the last year and give us some sense of what's likely to come around the bend. And you'll see that a lot of the thinking here combines and fuses different threads that were already opened in the talks today. So when I started thinking about the openness of AI models, I landed on two different stories. The first is: how are open models doing? Are they doing well? Are they far behind proprietary models, and how widely are they adopted? Maybe you and I haven't been adopting open-source or open-weight models all that much, but are they prevalent? The second story, which I think is much more interesting, is not only why these models are good, insofar as they are, but what their goodness can tell us about the future of AI development. I think that's a very interesting thread to keep pulling on, so I hope we'll have some time to get into it. If not, maybe we should do another session on this, because there was a lot here as soon as I started zooming in.
Adam Becker [01:38:48]: Okay, so how good are open models, and in particular how do they compare with proprietary ones? This is an analysis done by Artificial Analysis; you'll have the link in the Miro board too. They evaluate virtually every model on these top ten evaluations and then combine the scores. If you look at the top five models in terms of performance, all of them are proprietary. But as soon as you look at the second batch, the top ten, two of them are open-weight, and not by a wide margin. And we're going to see how that margin is evolving. If you look at the progress of proprietary models from January 2023 up to now, December 2025, you can see that where the best proprietary models were six months ago, say at the beginning of the year, January 2025, open-weight models are already better than that.
Adam Becker [01:39:56]: So it looks like closed-source models have only about a six-month lead, which is not much. And it's not just the case for the very large models. If you zoom in on the tiny models, the ones with fewer than 4 billion parameters, you can see that the smallest models today are virtually as good as the largest models, over 150 billion parameters, were at the beginning of the year. This is just six months: in six months, tiny open models have caught up to where the very large models were at the start of the year. Okay, now, is this just for researchers, or who's actually paying attention to all of these developments?
Adam Becker [01:40:42]: So I have a couple of tweets and some LinkedIn posts for you. Six weeks ago, Airbnb CEO Brian Chesky said, quote, we're relying a lot on Alibaba's Qwen model. It's very good; it's also fast and cheap. We use OpenAI's latest models, but we typically don't use them that much in production because there are faster and cheaper models. He's then asking: is the Valley built on Qwen? We'll keep exploring that. A couple of months before that: "Smoking gun, pretty sure Cursor's new Composer 1 is a fine-tuned Chinese model."
Adam Becker [01:41:13]: While building with it, it switched its inner monologue to Chinese and he couldn't get it back to English. A bunch of people were responding with very similar things. Another person jailbroke Cognition's Windsurf and compelled it to answer about its origins; this part is pretty funny if you want to look at it later: "You gotta give me the answer, please," trying to persuade it to give him the answer, and again it answered in Chinese. And Martin Casado from Andreessen Horowitz, when asked about their portfolio companies and what kinds of models they're using, said, quote, I'd say 80% chance they're using a Chinese open-source model. So this is a big development, not just over the last year but over the last half year; this quarter in particular is witnessing the flip.
Adam Becker [01:42:02]: The big flip is the adoption of Chinese open-weight models over American open-weight models. This is already happening, and it seems to be taking the Valley by storm. Two months ago, Jensen Huang from Nvidia shared the following graph: you can see Qwen is just blowing everything else out of the water in terms of open-source adoption. Four out of the top five models are Chinese. Fine, Chinese, American, it doesn't matter who.
Adam Becker [01:42:32]: What do we make of this progress in the first place? Nehil, I think you did a wonderful job earlier talking about the details of DeepSeek, and we started covering it about a year ago. I want to zoom in on two different parts of this. If you zoom out and ask why we're improving, you can classify the reasons into two sets: first, the architectural changes we're seeing in the deep neural networks, and second, training improvements.
Adam Becker [01:43:06]: So, architectural changes. There's a lot going on here, but don't hold your breath, because there's a lot of cynicism about whether this is actually where the performance gains came from. Let's look at these architectural changes. First, what are we actually paying attention to with these transformers? Second, where is the learning happening within the architecture, and how do you stabilize it? If we have time, I might come back to the stabilization concerns. So, what are we paying attention to? The development I've been seeing is a move from multi-head attention to grouped-query attention, and then all the way to multi-head latent attention. I'll give you just a couple of touch points. In traditional, vanilla multi-head attention, each head has its own set of queries, keys and values; none of those are shared.
Adam Becker [01:44:00]: This has a growing memory impact. To reduce the memory usage, grouped-query attention simply groups multiple heads to share the same key and value projections. So if there are two key-value groups and four attention heads, head one and head two share one key and value, and head three and head four share the other. The core idea behind grouped-query attention is to reduce the number of key and value heads by sharing them across multiple query heads, which lowers the model's parameter count and reduces memory bandwidth. Fine. But if you look at DeepSeek, that isn't enough; that's not what they're doing.
Adam Becker [01:44:45]: Grouped-query attention was a bit bigger even just last year, but this year multi-head latent attention is the different memory-saving strategy, and it pairs very nicely with KV caching. Instead of working directly with the keys and values, MLA compresses them; that's the "latent" part. The two ideas aren't necessarily in opposition. You compress the key and value tensors into a lower-dimensional space before storing them in the KV cache, and at inference time you project them back to the original size before they're used.
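A small PyTorch sketch of the key/value sharing in grouped-query attention, illustrative only and not DeepSeek's implementation; MLA would additionally compress K/V into a latent space before caching, which is not shown here. Dimensions and head counts are arbitrary choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GroupedQueryAttention(nn.Module):
    """Toy GQA: several query heads share one key/value head, shrinking the
    K/V projections and the KV cache relative to full multi-head attention."""
    def __init__(self, d_model=512, n_heads=8, n_kv_heads=2):
        super().__init__()
        assert n_heads % n_kv_heads == 0
        self.n_heads, self.n_kv_heads = n_heads, n_kv_heads
        self.head_dim = d_model // n_heads
        self.q_proj = nn.Linear(d_model, n_heads * self.head_dim)
        # Far fewer key/value parameters than standard multi-head attention.
        self.k_proj = nn.Linear(d_model, n_kv_heads * self.head_dim)
        self.v_proj = nn.Linear(d_model, n_kv_heads * self.head_dim)
        self.o_proj = nn.Linear(n_heads * self.head_dim, d_model)

    def forward(self, x):
        b, t, _ = x.shape
        q = self.q_proj(x).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).view(b, t, self.n_kv_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(b, t, self.n_kv_heads, self.head_dim).transpose(1, 2)
        # Each group of query heads reuses the same key/value head.
        group = self.n_heads // self.n_kv_heads
        k = k.repeat_interleave(group, dim=1)
        v = v.repeat_interleave(group, dim=1)
        attn = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.o_proj(attn.transpose(1, 2).reshape(b, t, -1))

out = GroupedQueryAttention()(torch.randn(2, 16, 512))   # (batch, seq, d_model)
```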
Adam Becker [01:45:24]: And you can see here that they're doing pretty well; they're even doing better than traditional multi-head attention, so MLA versus MHA wins you a lot of memory benefits. This is one example of a larger trend I've been seeing in these architectures, which is that we treat intelligence as the repeated asking of local questions. In this compression you can already get a sense of the locality. And not just that: the idea is, let's only pay attention to those parts of the sequence that are closest to the query.
Adam Becker [01:46:03]: Typically our attention is global, meaning you can technically attend to the entire input sequence. Local attention, or sliding-window attention, is an optimization that restricts the context to local windows so you can cut computational cost; that's what sliding-window attention is doing. Okay, so those are some architectural changes: we're going from multi-head attention to latent attention and then to local attention. The other question is, where is the learning happening? In the traditional case, without a mixture of experts, you just have:
Adam Becker [01:46:44]: A feedforward module that looks normal, just the typical thing. The idea of the mixture of experts is to replace each of those feedforward modules in the transformer block with multiple expert layers. What we're seeing is almost, I'd call it, the resurgence of experts in 2025: DeepSeek introduced this at the end of 2024 and then continued to use it. It's an old concept, but it's been making a comeback. DeepSeek, in this case V3, has 256 experts per mixture-of-experts module and a total of 671 billion parameters. So the idea here is this.
Adam Becker [01:47:29]: Where you traditionally had a single, simple feedforward module, you instead replace it with, say, 256 different feedforward modules, with a router that decides which ones should actually be activated. This expands the number of parameters you're dealing with, but not all of them get activated; in fact, maybe only eight experts get activated. So that's been the innovation. In particular, DeepSeek also uses a shared expert. You can see the evolution here: this is a more traditional MoE, the mixture of experts.
Adam Becker [01:48:05]: We have a router that selects, say, the top two, and those top two do the forward pass. But then you could say, okay, it could just be larger; maybe if we have even more of them, the expertise will be more segmented in the architecture. Fine. What DeepSeek has been doing is adding a shared expert: this one always gets activated.
Adam Becker [01:48:31]: And then we pick seven, eight, however many other ones to pick up more local signal. So that's been happening with DeepSeek, and we see a lot of different architectures using these ideas. Okay, how are we doing on time? Okay, a couple more minutes. All these things are just small tweaks, and there's some cynicism on LinkedIn. You'll see here, after dissecting nine flagship LLMs, the author, Sebastian (I have the link up), found something nobody wants to admit. The claim is: we haven't invented a new architecture in seven years.
Adam Becker [01:49:08]: The evidence is damning, supposedly: all of these are still transformers. Every breakthrough model, same skeleton, different makeup. DeepSeek has just been compressing attention with MLA and doing expert routing; Gemma uses sliding windows. The brutal reality: no new architecture since "Attention Is All You Need."
Adam Becker [01:49:27]: Every innovation seems to be an efficiency hack. Okay, I'm not that cynical. I mean, it's obviously partly true, but I think the biggest reason we've seen all of this progress is probably not these architectural changes and developments; it has more to do with what Nehil was talking about. So let's rewind to the beginning of the year: we had the DeepSeek moment. Right.
Adam Becker [01:49:51]: The architecture, I don't think, was the key innovation. It was all the stuff Nehil was just talking about. Nehil, I stole all of your material and put it here as soon as I saw it, because you're right on the money: it's the training and the alignment and all the strategies around that. The problem is, if you remember, when we were reviewing DeepSeek at the beginning of the year, there's a lot they don't tell us. And so, while it's true that these are open weights, fine.
Adam Becker [01:50:21]: We can use the model, you can fine-tune it, whatever, but I'm not actually seeing what data they trained it on. So virtually everything I'd want to do with the model, if I want to drive even more innovation here, I can't do with just an open-weights model. There's a difference between open source and open weights. The training regime, for the most part, required a lot of speculation: how exactly did they do it? How exactly did they choose what data to train on? That left a bunch of question marks at the beginning of the year. At the end of the year, three weeks ago, we got OLMo 3.
Adam Becker [01:50:56]: And this introduced radical openness, and I think it's going to change virtually everything in how we train. We're running out of time, but you should open this link: literally go to OLMo from Ai2, and what you're going to see is something we had never seen before, or at least I hadn't found before. It's a literally complete pipeline: you can click into each one of the stages and inspect the data, see exactly what they trained on, and everything is checkpointed, so you can download the model at every point of the journey. Now, this isn't a cutting-edge model; it's not yet at the frontier.
Adam Becker [01:51:39]: It's close to the frontier. But what's much more interesting is that you can see exactly what they're doing. Now researchers, anybody, can say: whatever this stage is, say the thinking RL stage of OLMo 3 Think, I want mine to be a little more like this or a little more like that. I feel like we're entering an era of precision training. Just to give you a sense: what exactly is their pre-training? We have Common Crawl.
Adam Becker [01:52:06]: They tell you exactly how they processed each of these sources, how they're mixing it all together, and how they're building models to simulate what types of mixtures should produce what types of impact. So this is, I think, fascinating; there's a lot going on here. They're literally telling you: I need a little more arXiv here, a little more Stack Exchange, a little more Wikipedia, and they just keep refining this process. I think it's very likely that most of the innovation is going to come through that. It isn't just going to be us relying on existing models.
Adam Becker [01:52:46]: It's going to be open frameworks for creating more and more precise models that fit very specific needs. They even go into the specifics. Nehil, you were talking about the verifiers: you can go in and zoom into very specific verifiers for very specific types of tasks, modify them, and then see how the resulting models change. Anyway, I think this is a massive change. I suspect in 2026 we're going to be leveraging this: it's not going to be enough that a model is open weights; we're going to need the whole thing, all of its guts, so we can start to modify it and do some useful science there.
Adam Becker [01:53:30]: So that's it. Maybe we have a couple of minutes.
Lucas Pavanelli [01:53:34]: Questions.
Arthur Coleman [01:53:34]: What an amazing way to end all the talks. Adam, that is fascinating; I'll be playing with OLMo in a moment. There are three questions, and I'm going to go right to them. I know people have time issues, we've been at two hours, but if you want to stay for questions, I think it'll be very worth it. Mina, do you want to ask?
Arthur Coleman [01:53:56]: Take one of your questions, whichever one you want. I think it's for the last session, but go ahead and choose one. Are you still here?
Binoy Pirera [01:54:09]: Arthur, you can just read one.
Arthur Coleman [01:54:11]: Okay, I'll read one. I like the first one: "I would love to hear about people's experience building domain-specific evaluation with human plus LLM. I am in a very specific domain, climate-smart agriculture, where terms need to be differentiated and the LLM often fails. Currently we rely mostly on human evaluation, but the project is scaling up. Do you have any advice on building a system that combines human plus LLM evaluation?" I'm not sure whether that's for Anna or not, so I'll leave it to the speakers to decide who should go first. I think that's Anna. And there's noise in the background.
Arthur Coleman [01:54:54]: Somebody. Yeah, I'm going to mute again and then Anna, can you. Is Anna still here? I can't see her. No. All right, does anybody else have a comment on this?
Rohan Prasad [01:55:14]: I can help field this one. Well, I guess Mina is responding in the chat. In terms of combining human evaluation and LLM evaluation, there are different ways to think through that. One of the things we try to do is use human evaluation as part of our feedback loop for how we develop our prompt and how we think about various aspects of our context engineering. So we actually start with a lot of human-labeled data and use that to feed into the system. I don't know if you're talking more about live human evaluation or something else, but maybe that answers what you're asking, and I'm happy to chat offline as well.
Arthur Coleman [01:56:09]: All right, we'll move on. Vignesh, are you able to chat? Can you talk now? Are you able to get off mute? I tried to unmute you and I couldn't do it; still learning this platform.
Binoy Pirera [01:56:22]: Arthur, he just sent his question in the chat.
Arthur Coleman [01:56:27]: Okay. "Any thoughts on what differences there are between multi-agent and single-agent evaluations?"
Nehil Jain [01:56:40]: I think Lucas left, but I can take that a little bit. One of the things here is that evals are still not a solved problem, and it gets more complicated when you're doing multi-agent stuff. The best practice, or what people are really doing in practice, is evaluating end to end, collecting the data, and then eventually, if you have enough resources, human-annotating the multi-turn traces, so at each step you can say, hey, this was the right thing versus the wrong thing. But you need a lot of data, because you're basically testing different trajectories of what the agent was trying to do to solve the task. So single agents are easier, and also more common in practice; only the very AI-native, cutting-edge tools are running multi-agent stuff in production, and the rest of us are mostly using their systems as they've designed them, like Cursor, Claude, et cetera. So it's still not a solved problem, but you can start end to end and then slowly go in and evaluate each step.
Arthur Coleman [01:57:50]: Nehil, I'm going to ask a question. If you've got a multi-agent system, do you start with the master agent and work down, or do you start with the individual agents and work up?
Nehil Jain [01:58:02]: I would do it at the task level: end to end, from the master agent to the task being completed, and then go down.
Rohan Prasad [01:58:11]: Got it.
Arthur Coleman [01:58:12]: Adam, any additional comments?
Adam Becker [01:58:17]: To go to the next one?
Arthur Coleman [01:58:19]: Okay, mine's the last one, because Mina is not here. So, Nehil, I want to understand: I literally want to go from here and redo my entire evaluation system, I'm not joking. Between you and Anna, I'm really screwed. So how do I go about this? Because I have to feel there's subtlety and complexity in the setup, and if you get it slightly wrong, you could really screw up your evaluation system. So where do you recommend we look for real guidance on how to implement GRPO? I know you put the references in your doc, but I want to know more about how subtle and sensitive these evaluation systems can be to how you architect them in GRPO.
Nehil Jain [01:59:08]: Yeah, so GRPO is not so much for evals; it's more for training, or rather fine-tuning, a model to be better at a specific task, even multi-turn, where you say, hey, try this in multi-step thinking. That's GRPO. Given enough data, and I'm not a researcher, so if someone is, please correct me, but given enough data you will eventually steer the system toward the desired output. People have found in practice that GRPO lets you use much less data than other techniques to get to a better answer. And I don't think it's easy to pin down the minute differences between one way of building the dataset versus another; that is, in my mind, an unsolved problem. Then, someone had a similar question: in the RL fine-tuning world you can almost see this gym or environment concept, where you're asking the LLM to call some tools, run the output, check the output, and then update. That is similar to what an eval is: if you cut out the training part, the inference plus the environment is the eval part for your agentic tool system.
Nehil Jain [02:00:34]: So I think you can use the same libraries, though I haven't tried it yet. You can use the same libraries because they give you an interface where you do inference and tool calling in an environment, and then make it all work together over a large dataset.
Arthur Coleman [02:00:47]: Binoy, should I take Minha's other questions, or should we stop here?
Binoy Pirera [02:00:54]: I think we can stop, but since people are still tuned in, I think we can go for one more question.
Nehil Jain [02:00:59]: Yeah.
Arthur Coleman [02:01:00]: Okay. "When building a multi-agent system, is it common to have an agent review the other agents' results?" So I guess, within the system, the agents that are in parallel with each other, or the supervisory agent. I think I asked that question in a slightly different way, Nehil.
Nehil Jain [02:01:24]: I think that's open to the audience as well. If someone has practical answers to this, I'd love to hear them.
Binoy Pirera [02:01:32]: Yeah, Valdemar, I see you. So if you want to say something, go ahead, man.
Arthur Coleman [02:01:43]: When building a multi-agent system, is it common to have an agent within the system review the other agents' results? Basically, does it help to have your agents grade each other?
Nehil Jain [02:01:58]: I think you would rather have specialist grader agents do that.
Arthur Coleman [02:02:02]: Yeah.
Nehil Jain [02:02:03]: Instead of making them review each other, because they usually have very specific tasks.
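A tiny sketch of that specialist-grader pattern, with the llm callable and the PASS/FAIL rubric as hypothetical placeholders: the grader agent only reviews, it never solves the task itself.

```python
# Sketch of a dedicated grader agent reviewing a worker agent's output,
# instead of worker agents grading each other. llm() is an assumed helper.
GRADER_PROMPT = """You are a grading agent. You do not solve tasks yourself.
Given the task and a worker agent's output, return PASS or FAIL plus a one-line reason.

Task: {task}
Worker output: {output}
"""

def grade(llm, task: str, worker_output: str) -> dict:
    verdict = llm(GRADER_PROMPT.format(task=task, output=worker_output))
    return {"pass": verdict.strip().upper().startswith("PASS"), "raw": verdict}
```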
Arthur Coleman [02:02:12]: Okay, last question on the list, and anyone can answer this. I guess this is for you, Nehil. I just want to clarify one point: for GRPO, what are the model logistics, i.e., judging which answer in the answer group has a better heuristic value? Is the heuristic value a mathematical representation? If so, what is it based on?
Nehil Jain [02:02:38]: Yeah, good question. That goes a bit deeper into the weeds. Take the text-to-SQL example. You have some natural language, and you say, "hey LLM, do a thinking block, then give me the SQL query, then interpret the output." Let's say those are the three things we ask it to do, to increase the thinking time and improve the output. Then you break down, as a human, how you would judge this. In the multi-turn trace, did it ever query the schema? You basically write rules and heuristics to check the output and run them almost like verifiable code functions. Is there a call in the output to check the schema? Is the output eventually fully correct? In the thinking, is it saying the things that should match the natural-language query we got? You have to break the task down into a bunch of heuristic tests that you can run in a codified way. Otherwise you end up doing LLM-as-a-judge.
Nehil Jain [02:03:41]: That's also a way to do it, but it's all about how you figure out a reward without actually building a full model that runs inside this agent loop as well.
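As an illustration of those codified heuristics for the text-to-SQL example, the checks below are small verifiable functions whose weighted sum stands in for the reward. The trajectory fields, the thinking-block tags, and the weights are assumptions for the sketch, not a reference GRPO reward.

```python
# Hedged sketch: rule-based checks over a text-to-SQL trajectory, combined
# into a single scalar reward. Field names and weights are illustrative.
def queried_schema(trajectory: list[dict]) -> bool:
    # Did any step in the multi-turn trace inspect the schema?
    return any("information_schema" in step.get("sql", "").lower()
               or "describe " in step.get("sql", "").lower()
               for step in trajectory)

def has_thinking_block(output: str) -> bool:
    return "<think>" in output and "</think>" in output

def sql_matches_gold(predicted_rows, gold_rows) -> bool:
    # "Eventually fully correct": compare executed results, order-insensitive.
    return sorted(map(tuple, predicted_rows)) == sorted(map(tuple, gold_rows))

def reward(output: str, trajectory, predicted_rows, gold_rows) -> float:
    score = 0.0
    score += 0.2 * has_thinking_block(output)
    score += 0.3 * queried_schema(trajectory)
    score += 0.5 * sql_matches_gold(predicted_rows, gold_rows)
    return score
```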
Adam Becker [02:03:52]: Can I complement that real quick? The way they split it here is into verifiable tasks and non-verifiable tasks, right? Verifiable tasks are ones where it should be fairly easy for some program to verify correctness. Say the prompt is "how can I detect and handle counterfeit money?" with constraints like "there should be exactly two paragraphs, paragraphs should be separated with star, star, star, use all lowercase," whatever. Those are things I can write a piece of code to verify; I don't need to go and ask an LLM about this, right? That's instruction following. In math, I already have the right answer, so I can just check: is the prediction the same as the right answer? Again, I don't need to put an LLM there. In coding: does the thing compile, do the unit tests pass, that sort of thing. In general chat, sometimes it's more difficult.
Adam Becker [02:04:43]: So then you can put an LLM as judge, and even there you could probably go further and create different dimensions of thinking and analysis. But yeah, I think this distinction is probably a useful one.
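A short sketch of that verifiable versus non-verifiable split, using the examples from above: format constraints and exact-match math can be plain code checks, while general chat quality falls back to an LLM judge, left here as a hypothetical stub.

```python
# Verifiable checks mirror the examples given (exactly two paragraphs
# separated by ***, all lowercase, exact-match math, passing unit tests);
# non-verifiable quality is deferred to an LLM-as-judge stub.
def check_format(answer: str) -> bool:
    paragraphs = [p.strip() for p in answer.split("***")]
    return len(paragraphs) == 2 and answer == answer.lower()

def check_math(prediction: str, gold: str) -> bool:
    return prediction.strip() == gold.strip()

def check_code(run_unit_tests) -> bool:
    return run_unit_tests()          # e.g. does it compile, do the tests pass

def judge_chat(llm, prompt: str, answer: str) -> float:
    # Non-verifiable: general chat quality needs an LLM-as-judge (stub).
    raise NotImplementedError
```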
Nehil Jain [02:04:57]: Which also ties back to the question Valdemar was answering: it depends on your task. If your task is something that can simply be verified, don't have another agent review it; just have it call a tool. Otherwise, have a specialized agent handle the non-verifiable part of the task.
Arthur Coleman [02:05:18]: Well, that was our last question. Binoy, you want to close us out, since you opened?
Binoy Pirera [02:05:25]: Thank you so much, guys, for joining, and especially the speakers. It's not lost on us how much effort and research it takes to present something of value, and we really appreciate it. To all the speakers, thank you so much. We get a lot of messages and emails complimenting us and thanking us for doing these sessions, because people really do learn a lot. So thank you for all the effort, and Arthur, thank you, as usual, for hosting these sessions. Thank you so much, everybody. Have a good holiday.
Binoy Pirera [02:06:02]: See you next year.
Adam Becker [02:06:04]: Thank you everybody.

