Small Language Models are the Future of Agentic AI Reading Group
SPEAKERS

Sonam Gupta is a data scientist turned developer advocate.

I'm Adam Becker, a tech entrepreneur, and I've spent the last decade founding companies that drive societal change.
I am now building Deep Matter, a startup still in stealth mode...
I was most recently building Telepath, the world's most developer-friendly machine learning platform. Throughout my previous projects, I had learned that building machine learning powered applications is hard - especially hard when you don't have a background in data science. I believe that this is choking innovation, especially in industries that can't support large data teams.
For example, I previously co-founded Call Time AI, where we used artificial intelligence to assemble and study the largest database of political contributions. The company powered progressive campaigns from school board races to the Presidency. As of October 2020, we had helped Democrats raise tens of millions of dollars. In April 2021, we sold Call Time to Political Data Inc. Our success was due in large part to our ability to productionize machine learning.
I believe that knowledge is unbounded, and that everything that is not forbidden by laws of nature is achievable, given the right knowledge. This holds immense promise for the future of intelligence and therefore for the future of well-being. I believe that the process of mining knowledge should be done honestly and responsibly, and that wielding it should be done with care. I co-founded Telepath to give more tools to more people to access more knowledge.
I'm fascinated by the relationship between technology, science and history. I graduated from UC Berkeley with degrees in Astrophysics and Classics and have published several papers on those topics. I was previously a researcher at the Getty Villa where I wrote about Ancient Greek math and at the Weizmann Institute, where I researched supernovae.
I currently live in New York City. I enjoy advising startups, thinking about how they can make for an excellent vehicle for addressing the Israeli-Palestinian conflict, and hearing from random folks who stumble on my LinkedIn profile. Reach out, friend!

Hey! I’m Nehil Jain, an Applied AI Consultant in the SF area. I specialize in enhancing business performance with AI/ML applications. With a solid background in AI engineering and experience at QuantumBlack, McKinsey, and Super.com, I transform complex business challenges into practical, scalable AI solutions. I focus on GenAI, MLOps, and modern data platforms. I lead projects that not only scale operations but also reduce costs and improve decision-making. I stay updated with the latest in machine learning and data engineering to develop effective, business-aligned tech solutions. Whether it’s improving customer experiences, streamlining operations, or driving AI innovation, my goal is to deliver tangible, impactful value. Interested in leveraging your data as a key asset? Let’s chat.

Arthur Coleman is the CEO at Online Matters. He has held three previous roles, including VP of Product and Analytics at 4INFO.
SUMMARY
This paper challenges the LLM-dominant narrative and makes the case that small language models (SLMs) are not only sufficient for many agentic AI tasks—they’re often better.
🧠 As agentic AI systems become more common—handling repetitive, task-specific operations—giant models may be overkill. The authors argue that:
- SLMs are faster, cheaper, and easier to deploy
- Most agentic tasks don't require broad general intelligence
- SLMs can be specialized and scaled with greater control
- Heterogeneous agents (using both LLMs and SLMs) offer the best of both worlds
They even propose an LLM-to-SLM conversion framework, paving the way for more efficient agent design.
TRANSCRIPT
Arthur Coleman [00:00:00]: In the chat is the link to the Questions document where you can add your questions as we go. I will explain how that document works in a minute. Our topic today is Small Language Models and the Future of Agentic AI. It's an interesting paper, interesting hypotheses being posited. I have my thoughts about them. It'll be interesting to see how you feel about them. We have three illustrious speakers, as always. Nahil Jain, who is.
Arthur Coleman [00:00:30]: Sorry, Sonam Gupta, who is a senior developer advocate at AI Explained, and we're really happy to have her. Adam Becker, who many of you know because he does a lot of these. He's the founder of Headon, which is a website for political activism, and it's a pretty interesting site. I urge you all to go take a look at it. Adam's doing some pretty important work. I will also tell you that he is a graduate of Berkeley in both astrophysics and classical history, Greek and Roman history. And I did promise him that he's always going to get a question about ancient history, because I love the history of that period. I was going to give him a question, but given limited time.
Arthur Coleman [00:01:12]: Adam, I'll let you off the hook today, but I'll warn you, the next question will be about the 14th Emperor of Rome. So we'll go from there. They made a famous movie about him. And then we have Nahil, who is in a stealth startup right now. He has quite the background. He's ex-McKinsey and ex-QuantumBlack. And by the way, at the bottom you see our LinkedIn addresses.
Arthur Coleman [00:01:33]: If you wish to connect with us, I always urge you to connect with us. We love to be, you know, part of the community. That's why we're here. And we're always happy to have conversations with you one on one. The agenda today is: Sonam's going to kick us off with the introduction and definitions, and that's important. Like, what is a small language model? That's not an obvious definition. And then the problem statement. Then Adam's going to review the arguments that are put out by the authors in favor. Nahil is then going to talk about barriers and the way we have to adapt to them.
Arthur Coleman [00:02:06]: And in between, you're going to see surveys, polling questions, pop up. In particular, one should be up by now (I can't see the screen) about whether people have read the paper in advance. This is important for us because we're trying to find out how often people are actually reading these papers in advance; it's how we prepare to speak. So if you wouldn't mind answering that poll in particular, it'd be great. As always, the guiding principles of the reading group are that these sessions belong to you. They're intended to be interactive. We talk for about 35 minutes, then we all discuss, including you, for the other 25 minutes. That's why we want you to do the questions.
Arthur Coleman [00:02:47]: And of course, the more we all participate, the better it is for all of us. There are no dumb questions. I always like to say this is a no-judgment zone, and you should always feel safe in here to ask any question you want. I was on a call yesterday, a PhD thesis presentation, and I had three questions and I didn't ask them. I later talked to the speaker, who graduated from MIT yesterday, and he was like, well, why didn't you ask those questions? Those are great questions. And I was afraid, not being a PhD physicist, to even ask any questions. So there are no dumb questions. And the Google Doc link where your questions will go is shown.
Arthur Coleman [00:03:28]: The questions will be answered in the order received, so like a stack on a computer chip. And then please, please, for our sake and for the sake of getting better for you in the future, fill out the post-event survey that will come in your mail. We really appreciate it; we read those comments and we will use them to help us prepare how we do things in the future. Now lastly, this is not something I normally do, but I always read the references in preparation for these events. Not all of them, obviously; there were like 80 references on this paper. But one of them, which is 68, 66 from Subramanian, was done in March of this year as a survey of small language models and how they work. And it's basically some of the major research that's quoted in this.
Arthur Coleman [00:04:17]: I wouldn't worry too much about the left side of this diagram, but what I like about this side is it shows all the small language models that they have sort of looked at and categorizes them based on the approach to getting there. And I thought this was an interesting way to frame the discussion and whether, you know, transformer based, reparameterized, additive or hybrid models really work better than others. I don't know that we'll get to that, but it'll be interesting if we can have some conversation about that. And with that, I will turn it over to our first speaker. So are you ready?
Sonam Gupta [00:04:50]: Yes, I am. Thank you, Arthur.
Arthur Coleman [00:04:52]: Let me let go and stop. How do I do that here? And we're being recorded. Yep. Did I let go? Yeah, I let go.
Sonam Gupta [00:05:02]: Thank you, everyone. Hope everyone's doing very well. I'm very happy to be back here, and I'm going to share my screen quickly.
Arthur Coleman [00:05:17]: The results of the poll are interesting. The majority of folks have read the paper or skimmed it; that's about 60% so far. So thank you, that's great feedback for us.
Sonam Gupta [00:05:28]: Well, if we are ready, then I can start. Okay, perfect, go ahead. All right, so as Arthur mentioned, the paper we're discussing today is Small Language Models are the Future of Agentic AI. Now, the QR code that you see on the right side of my screen is my LinkedIn profile's code; if you wish, feel free to connect. I'd be happy to have a chat later, but let's get started now. So, small language models are the future of agentic AI. That's a huge claim the paper makes, which kind of makes sense too if you think about it. Now, the basic overview of this paper: it challenges the idea that bigger is better in AI agent systems, arguing that small language models, not our usual LLMs, are better suited for building more effective and faster AI agents.
Sonam Gupta [00:06:23]: So this research basically demonstrates how smaller, more specialized models can outperform their larger counterparts in AI agents while being more economical and operationally efficient. Now, there are two working definitions of small language models in the paper. The first one says it is a language model small enough to run on constrained compute with low inference latency. As we know, for large language models we need a huge amount of infrastructure to have them run smoothly; they have higher latency, a lot more parameters, and so on. The other working definition they have, which made me chuckle a little bit, is that a large language model is a language model that is not a small language model. So it was a little confusing there, but the first definition made sense to me. Now before we get started, I want to take a quick overview of what AI agents really are.
Sonam Gupta [00:07:31]: AI agents. So the way I define it is that an AI agent is a system where an LLM, a large language model, is the brain, and it uses tools to do more autonomous tasks depending on the query. The best part about agents is the autonomy. It breaks down a complex task into manageable steps. It makes decisions based on whatever context and objectives you provide. It calls external tools and APIs to accomplish these goals. And as I said, the language model is the brain of the agent so that it can do the reasoning part, and then it adapts to feedback and changing conditions.
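To make the "language model as the brain" framing concrete, here is a minimal sketch of an agent loop in Python. It is not from the paper or the talk; the call_model stub, the single search tool, and the JSON action format are hypothetical placeholders.

```python
import json

def call_model(prompt: str) -> str:
    # Stand-in for a real LLM or SLM endpoint; here it always decides to answer directly.
    return json.dumps({"action": "final", "input": "stub answer"})

def search(query: str) -> str:
    # One toy tool the agent can invoke.
    return f"stub results for: {query}"

TOOLS = {"search": search}

def run_agent(task: str, max_steps: int = 5) -> str:
    """Tiny agent loop: the model plans, optionally calls a tool, then answers."""
    history = [f"Task: {task}"]
    for _ in range(max_steps):
        prompt = "\n".join(history) + '\nReply with JSON: {"action": "search"|"final", "input": "..."}'
        decision = json.loads(call_model(prompt))
        if decision["action"] == "final":
            return decision["input"]
        observation = TOOLS[decision["action"]](decision["input"])
        history.append(f"Observation: {observation}")
    return "No answer within the step budget."

print(run_agent("Summarize today's error logs"))
```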
Sonam Gupta [00:08:18]: Now, the authors have positioned their hypothesis in three ways. First, they say that small language models are powerful enough. Basically, they have shared the recent advancements in smaller models; they have sufficient capability for the majority of agentic tasks without really needing or adding the overhead of using large language models. The second position is that small language models are more operationally suitable. Basically, they offer better latency, less memory usage, and a much simpler way to deploy your agents in production. The third one is that small language models are more economical. I mean, with lower latency and, comparatively speaking, fewer parameters, it's of course more economical for people to use them, because it reduces the cost of training, fine-tuning, and inference in production environments.
Sonam Gupta [00:09:24]: So basically, yeah, this is how the authors have positioned the paper, or their argument. As for the evidence they have shown, they have compared a lot of different models. For example, especially to support their first argument, that SLMs are powerful enough, they have given examples such as Microsoft's Phi models: Phi-2, at around 2.7 billion parameters, matches 30-billion-parameter models on common sense reasoning and coding while basically running 15 times faster. Phi-3, on the other hand, matches the performance of models that are around 70 billion parameters on language and reasoning. Similarly with NVIDIA's Nemotron-H family of models. But overall, all these models that you see on the screen are the examples that the authors have given, showing that small language models are quite capable compared to their large language model counterparts. So that's all, and I am going to pass it on to Adam to share the next points. Over to you, Adam.
Adam Becker [00:10:52]: Awesome, thank you very much. Okay, I'm going to share my screen too. And we made a little Miro board, so maybe we can share this one as well in case people want to follow along. Would people like that? If so, do you have the link?
Arthur Coleman [00:11:11]: I've got the link.
Adam Becker [00:11:12]: Okay.
Arthur Coleman [00:11:15]: Yeah, I'm putting it in chat now for everybody. It's there.
Adam Becker [00:11:21]: Awesome. Yeah, if you want to follow along. So we have the paper laid out here page by page, but especially for the purposes of articulating the arguments, we thought it might be nice to just see it all in one view as kind of like a big diagram. So let's start here. So, okay, their major claim is small language models. They're going to be the future of agentic AI. And one theme that you're going to see them continue to come back to is what we can call the LEGO theme. So it's just scaling out by adding small specialized experts instead of scaling up monolithic models.
Adam Becker [00:12:03]: Right? That's the intuition here. Instead of thinking about a very large language model that costs whatever, millions of dollars, just start with very small, very purpose-built expert models. That's going to yield systems that are cheaper, faster to debug, easier to deploy, and better suited to the way we're actually going to use them in the real world. Right? So that's the intuition here. And I think Nahil and I were texting about this. It's almost like they have that type of intuition, and it could have been a blog post, whatever, but they ended up making this into something that's a bit more structured, to really drive home the point and be very precise about the argument.
Adam Becker [00:12:48]: So this is the shape of the argument. The conclusion is they believe small language models are the future of agentic AI. Why do they think that? Sonam already spoke about these three different views. So I put a little brain icon here. You can just consider this the powerful enough view, right? Small language models are sufficiently powerful to handle the language modeling errands of agentic applications. So they're good enough. Even if you want to make the case, and at some point you're going to see this, that the large language models might be even better, these are good enough.
Adam Becker [00:13:23]: So why do you need the better? So that's the first view, powerful enough view. Second one is the better suited view because of how agents operate. And we'll zoom into that. Small language models are inherently more operationally suitable for use in agentic systems rather than large language models. So that's the. It's just they're better suited. And lastly, they're going to be cheaper. So small language models, they're necessarily more economical for the vast majority of language model uses in agentic systems, much more so than their general purpose LLM counterparts.
Adam Becker [00:13:56]: But because they're smaller. So by virtue of them being smaller, they're also going to be cheaper. Okay, so now let's investigate the arguments and then some of the counter arguments to those arguments and then the rebuttal to the counter arguments of the arguments in support of these views. Okay, so powerful enough view, the first is good enough argument. If you look at competitive off the shelf performance of SLMs versus LLMs in those places where it matters. And Sonam spoke about this where it matters, SLMs are just as good, so you really don't need to worry about any of the other stuff. Next, they jump around a little bit because I think they're trying to build a case. So in support of the powerful enough view Sonam showed you on benchmarks, small language models are doing just as well where it matters.
Adam Becker [00:14:43]: But what about the cheaper view? You have the economical argument, which I think is a pretty interesting one here because they break it down into a few different dimensions. The first is inference efficiency, right? They said serving a 7-billion-parameter SLM is 10 to 30 times cheaper than a 70-to-175-billion-parameter LLM. These are insane differences, right? So they're like, of course it's going to work, of course this is ultimately going to win out. It's so much cheaper. Next, because they're small, you don't have to deal with parallelization across GPUs or across different nodes of GPUs. And because of this, it's going to be even cheaper to maintain and to operationalize. Even just the infrastructure, forgetting about the inference itself, the serving and managing, the rest of it is going to be much easier. Next, fine-tuning agility.
Adam Becker [00:15:38]: Because they're small, and especially if you use parameter-efficient methods, but even if you just go for full-parameter fine-tuning, it requires only a few GPU hours. You could just go to sleep, you wake up in the morning, the thing is fine-tuned; you don't need to spend weeks trying to fine-tune these models. Edge deployment: you could just deploy these things on consumer-grade GPUs. You don't need any specialized infrastructure for it. And they even mention and draw references to a lot of people doing all sorts of interesting language modeling on mobile devices. Parameter utilization: they're fundamentally more efficient, because if you look at the percentage of parameters that are actually activated in large language models, it tends to be very, very small. And that isn't the case
Adam Becker [00:16:32]: for small language models, so they're much more efficient. Because they utilize their parameters more efficiently, the whole thing is going to be cheaper. That's the economical argument, right? So the economical argument goes into the cheaper view. All right. Next they make an argument about flexibility. They say small language models possess greater operational flexibility in comparison to LLMs, and the shape of the argument looks like this: because it's cheaper, it's more practical to train and adapt and deploy multiple specialized expert models for different agentic routines.
Adam Becker [00:17:07]: Here they start to slide into slightly different territory. Okay, so first, it's cheaper and therefore they're more flexible, so it supports the cheaper view. But this one also supports the better suited view. Why is that? Because if it is cheaper, well then more people can play with it and more use cases can be solved and addressed. And because of that, you might even introduce less bias, because if somebody has a bias in the deployment of AI, at some point maybe those biases will cancel each other out.
Adam Becker [00:17:39]: You're going to have more perspectives; more people actually have a stake in the success of these systems. So it's more suited for agentic systems and it's cheaper. That's the flexibility argument. Let's now jump back to the powerful enough view. Okay, so if you think about all of the uses, all of the actual skills of a large language model, very few of those skills are actually used during agentic processes. Agentic applications are interfaces to a very limited subset of language model capabilities. So even if you can make the case that large language models are more intelligent, it might be overkill. Why? This is kind of how they see it.
Adam Becker [00:18:24]: If you think about what an agent is, they say, all right, let's say an agent is some tools and some heavy prompting and some choreography. So you dance around like this, you play with the memory and then the tool, you bounce back and forth in different ways, and then you allow a human to interface with it in some way, via text or voice or whatever. Well then, really what you've done is you've squeezed a very large language model's capabilities into a relatively small subset of things. Okay? So that squeezing means that maybe you don't need that large language model in the first place. It's too large if you're doing all this squeezing. So just start with something much smaller. It's so much more fitting. So it's powerful enough.
Adam Becker [00:19:08]: Small language models are good enough for this very small subset of uses. That's the argument. I imagine some people can already foresee some of the counterarguments to this. So this is the intelligence overkill argument. Okay, this one also just means that it's better suited, right? Not just the powerful enough view. So, better suited. Okay, we're pushing along, probably about halfway done. So next is the behavioral alignment argument.
Adam Becker [00:19:36]: So agentic interactions necessitate very close behavioral alignment. I almost call this just the right formatting argument, because I think that's the main argument that they're bringing up, which is this. Let's just imagine that right now you're building the app, right? You have an agentic app. There's a bunch of different agents doing different things, or different parts of an agent that are doing different things. You probably want a single format to communicate between the different parts. Maybe it's JSON or YAML or markdown. And if the model knows too many things, if the model has been trained on lots of different potential formats like an LLM has been, maybe you're risking some hallucination here. But if you have a model that is small, that, let's say, has only been trained or enforced or fine-tuned to really only return JSON, you reduce the risk of it returning something that isn't JSON.
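As a rough illustration of the formatting point, here is a minimal sketch of the kind of validate-and-retry wrapper an agentic app might put around a model call; the call_model stub and the prompt are hypothetical, and the argument above is that a format-tuned SLM should need far fewer of these retries than a general-purpose LLM.

```python
import json

def call_model(prompt: str) -> str:
    # Stand-in for any model endpoint (LLM or SLM).
    return '{"sentiment": "positive"}'

def call_with_json_retry(prompt: str, retries: int = 3) -> dict:
    """Request JSON, validate it, and retry with a corrective hint if parsing fails."""
    for attempt in range(retries):
        hint = "" if attempt == 0 else "\nReturn valid JSON only, no prose."
        raw = call_model(prompt + hint)
        try:
            return json.loads(raw)  # valid JSON: hand it to the next agent step
        except json.JSONDecodeError:
            continue  # malformed output, i.e. the "isn't JSON" failure mode discussed above
    raise ValueError("model never produced valid JSON")

print(call_with_json_retry("Classify the sentiment of 'great latency!' and reply as JSON"))
```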
Adam Becker [00:20:36]: Right? So you can be much more aligned with the intended behavior that you expect if it's small. So that's this argument, right? The behavioral alignment argument. Next is the naturally heterogeneous argument. If you compare it with a non-agentic world, the agentic world naturally allows for a lot of heterogeneity, especially in model selection. Why? Because it's supposed to be doing different things. When an agent invokes a language model, you could choose any language model. So why not choose the language model that's best for the task, and that will probably push you to the smaller one. So different levels of complexity call for models of different sizes.
Adam Becker [00:21:22]: If you have a particular task that requires the LLM, maybe that's fine, but most of them are unlikely to. This is again based on that previous argument. Okay, we have another one, which is the interaction data argument, which I think is the cutest argument they've had in the whole paper. And it kind of looks like this: each invocation is itself producing data that can be used for optimization. So think about it. I think the best way to think about it might just be in the alternative reality.
Adam Becker [00:21:57]: So let's imagine that we say no, no, we're just going to go with LLMs, no SLMs. Fine, but we know, as we said, that we're only really using a particular LLM for a particular task, and we're already getting a bunch of data about the performance and effectiveness on this task. We know about the prompt, we know what the user did, we know whether it was useful downstream or not useful. All of this is data. And I said, okay, fine, I'm going to start with an LLM, I don't want the SLM. But now I have all this data.
Adam Becker [00:22:31]: I have the data about the prompt, I have the data about the effectiveness. Why shouldn't I further optimize just that one piece of the flow? And then I optimize, I optimize, I optimize, and I end up with a small language model. So that's kind of how I see it. If you just look at it: you start with LLMs, you keep tweaking, and you're going to end up with SLMs. Okay, so I think we're done with all the arguments in defense of this claim that SLMs are likely the future of agentic AI. So if people are persuaded, here are some counterarguments. Okay, we have an alternative one. This one is an alternative to the better suited view.
Adam Becker [00:23:15]: Empirical work shows that LLMs follow scaling laws, and some of them speak to this: larger size, better language understanding across tasks. So this is to the point of saying, is it actually overkill? Maybe it isn't. Maybe what you do need is the ability to understand language in a more nuanced way, so that you can even deploy the right agents or the right tasks. Maybe that interface can't really be managed all that well with a small language model. Maybe what's necessary is the large language model here. Okay, so that's an alternative view. There are some rebuttals here.
Adam Becker [00:23:55]: So this is called, I call this, the reasoning rebuttal. Techniques like self-consistency and verifier feedback allow SLMs to trade extra compute at inference time for reliability, without sacrificing their lightweight deployment benefit. So even if it's the case that, when you compare an LLM with an SLM, the LLM still does much better, the SLM isn't just what you get out of the box. Maybe you can add more; you can trade extra compute at inference time, and at that point it could still be doing better. So there are some other tricks that you can do. And again, I think this is a rebuttal that's related to something we'd already covered: simple subtasks just don't require the complex abstractions and understanding.
Adam Becker [00:24:43]: Sure, the LLM is plenty better, but for most simple subtasks, which are the domain of the agents, you don't need that. Okay, so that's one alternative. We'll tackle the other one next. So could it be the case that managing a bunch of different SLMs is going to be expensive, because you need to deploy new specialized infrastructure for it? You need to build a whole new world that supports this. Maybe that's going to be too expensive. I would say they're sympathetic to it, but ultimately they rebut it.
Adam Becker [00:25:25]: So recent improvements in inference scheduling and large inference system modularization offer unprecedented levels of infra system flexibility. So yes, sure, but the trend is going to make it easier and easier; we'll have better and better tools to manage and orchestrate all these different systems or components of a larger system. And then the other rebuttal is that the most recent analyses on the setup costs of inference infrastructure indicate a consistent falling trend due to underlying technological reasons. So again, tech will be able to meet us there and managing the infrastructure is going to be less bad. So this is, I think, a summary of their argument. But I think they also leave one possibility here, to say it could be the case that while SLMs are better suited for the job and they might be cheaper, LLMs have a head start and so maybe that's going to win out. They just started sooner, so even if SLMs might be better and cheaper, LLMs have a head start. They push back on this.
Adam Becker [00:26:34]: They say, I think they're open to it, but they still think it's much more likely that, given all of these attendant benefits of SLMs, that's going to be the future of agentic AI. Okay, with that I think I'm gonna hand it over to Nahil.
Nahil Jain [00:26:51]: Cool.
Nahil Jain [00:26:52]: Yeah, I think that makes sense. I will also go to the same.
Arthur Coleman [00:26:56]: Before we turn it over to you, I want to go to the poll. It's a very important poll. It really has surprised me, actually. We had 66% of people participate, 32 of 48. And any of you who haven't participated yet, go ahead and add to this. But what's interesting is that we have four people who have tried this and found that they can run models as large as 10 billion parameters. But we've had seven people who have tried this and have not been able to make models this large run in small memory environments. And frankly, I should vote because I'm.
Arthur Coleman [00:27:33]: I would be number eight on that score. So that's interesting. And so that's 11 out of 32 who have tried it, but the majority have not yet. So when we get into the conversation, folks, I would like to hear, please, from those who have tried it what their experiences have been. And I'll try to find someone, but just come prepared, if you wouldn't mind sharing your experiences with us. And now I'll turn it over to Nahil.
Nahil Jain [00:28:02]: Cool. Yeah, thanks. Miro board time. Did I share the right one? Nope, not the right one. Ah, this one. Okay. Yeah, so basically I added some red stickies, kind of similar to the alternatives. So one thing they were saying is the powerful enough view, that SLMs are powerful enough.
Nahil Jain [00:28:34]: But then, and I think Samantha was talking about this in chat, they have this point that because LLMs have generalized, now over languages and internationally, they can do multiple languages, multiple modalities. They understand concepts at a much higher abstraction level: maybe how the Asian world looks at a concept is different from the Western philosophy of it, and they understand both, and so they can kind of translate between the two. And that's one argument they have for why LLMs might be better intuitively. Again, I still feel like if you're just going for automation, it doesn't really matter that much; you can fine-tune for whatever your task is. But that's one of the alternatives.
Adam Becker [00:29:18]: And then.
Nahil Jain [00:29:21]: Another counterargument they have for the cheaper thing is something that at least the MLOps community has known for a while: you cannot get full utilization of a small language model endpoint. So even if you have your own SLM, maybe the LLM endpoint will win when you average the cost of ownership, because of the traffic you can drive to the centralized endpoint versus these bespoke endpoints you have for your agents. And so they're like, maybe it doesn't matter and LLMs will still be better because of centralized endpoints. And the second thing is that the talent and cost of managing each endpoint and each inference of an SLM for your agentic task is way more complex than the training of the LLMs, with the large language model teams managing it in a centralized place. So those were two things I wanted to add to the counterarguments. Okay, so how do you solve for the barrier? So how do you, I mean not.
Adam Becker [00:30:25]: Solve for, but... Nahil, are we supposed to see Miro or something else?
Nahil Jain [00:30:31]: Oh, what were you seeing?
Adam Becker [00:30:32]: We're seeing Miro.
Nahil Jain [00:30:36]: Okay, let me restart. I think something is wrong. I think it's the wrong Miro. I've been moving around, but I have two or three Miros open. Give me one sec. Oh, I spoke so well that you guys didn't catch that I was actually showing the wrong Miro board. So nothing moved on the screen. Huh?
Binoy Pirera [00:31:03]: Showing Adam's flowchart from earlier.
Adam Becker [00:31:10]: Tricky, tricky, tricky, tricky.
Nahil Jain [00:31:12]: Okay, is there anything moving?
Adam Becker [00:31:16]: Okay, now you're moving.
Nahil Jain [00:31:18]: Oh, boy. Okay, so the first thing I was talking about was, you know, the powerful enough view, that SLMs are powerful enough, and the multimodality and internationalization of LLMs is a counterargument from their side. So that goes to the intelligence and the power of LLMs over SLMs. And then the other piece I was talking about before was the economical argument, which is, hey, because the models are smaller and you can run them on constrained compute, you have efficiencies. But their counterargument is maybe not, because you have the utilization problem and the talent and cost problems of actually getting them running. So sorry for the mix-up between what I was showing and what I was talking about. Okay, now we can talk about the other barriers. That's also kind of discussed; Adam just touched on it before handing it off: LLMs have a head start.
Nahil Jain [00:32:20]: And so what that means is all the benchmarks, and how you actually evaluate these different things, are more suited to the frontier labs, because they are designing the way to prove to each other that mine is better, et cetera. And those are all LLMs right now. So that's one problem. The other is there's a lot of investment that has gone into the LLM side of things versus the SLMs, and so there's a lot more advertising, a lot more talent being poured into the LLM side of things, pushing them ahead of the curve. And then it's just awareness. You don't have the same level of awareness for SLMs being used in agentic stuff. Most people, if you had done the poll the other way around,
Nahil Jain [00:33:00]: where have you tried LLMs versus SLMs, most people would have tried LLMs, but not many people would have tried SLMs, because there's a slightly higher barrier to entry to trying them on constrained compute, et cetera. So those are the barriers. But they do say that their alternative is still better, and so maybe SLMs are the future. Now, if you have LLMs and you want to move to SLMs, and you're convinced by all the arguments and the points of why SLMs will be the main way to build agentic systems, then they give a kind of playbook for how you would go about it. And in my mind that's very similar to, in older worlds, the MLOps playbook, and now it's the evals and AIOps playbook.
Nahil Jain [00:33:55]: We don't have a formal term for it yet. It's a playbook where you instrument your process: you build the end-to-end loop and then you instrument it, so you have all the LLMs taking the prompts and giving the outputs, your flow is going through, and anything that doesn't have a human in the loop, you instrument all of those calls. And then you collect enough examples. Once you have those examples, you separate them based on the different tasks being done, and then you build SLMs to become experts in those specific tasks. This is the same playbook you would run if you were fine-tuning something or improving the accuracy of a node or the whole agent workflow. That's what they talk about here. Another resource that I follow a lot: there's this course going around on the Internet, at least in my social media, by Hamel and Shreya. I think Shreya has spoken a lot on the MLOps community podcast and even at events.
Nahil Jain [00:34:55]: And so they talk about this eval framework to improve AI apps and agentic systems. And their flow is very similar as well, where they're saying, hey, analyze it first, collect all the data with examples, and then measure based on what you think is accurate versus not accurate. So maybe manually label it, or if you have some other way to figure out the accuracy and correctness of the thing, use that, and then improve it. Improvement can be either prompts or fine-tuning, so it can be either route. I found that the paper's route from LLMs to SLMs looks very similar. So I think that's all. And then they're just asking: let's have more discussion about it.
Nahil Jain [00:35:40]: This is very new, which we are doing right now.
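As a rough sketch of that instrument, collect, separate, specialize loop: the paper describes the conversion algorithm only at a high level, so the field names, the step-name grouping, and the fine_tune_slm placeholder below are hypothetical illustrations, not the authors' implementation.

```python
from collections import defaultdict

# 1. Instrument: log every non-human-in-the-loop model call made by the agent.
call_log = [
    {"agent_step": "extract_fields", "prompt": "...", "output": "...", "ok": True},
    {"agent_step": "summarize_ticket", "prompt": "...", "output": "...", "ok": True},
    {"agent_step": "extract_fields", "prompt": "...", "output": "...", "ok": False},
]

# 2. Collect and separate: group the successful examples by the task they served.
# (A real pipeline might cluster prompts by embedding rather than rely on step names.)
examples_by_task = defaultdict(list)
for record in call_log:
    if record["ok"]:
        examples_by_task[record["agent_step"]].append((record["prompt"], record["output"]))

# 3. Specialize: fine-tune one small model per task on its collected examples.
def fine_tune_slm(task_name: str, examples: list) -> None:
    # Placeholder for whatever fine-tuning stack is actually used (LoRA, full FT, etc.).
    print(f"fine-tuning an SLM for '{task_name}' on {len(examples)} examples")

for task_name, examples in examples_by_task.items():
    fine_tune_slm(task_name, examples)
```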
Arthur Coleman [00:35:44]: That's a good transition. Nahil, thank you so much. You can leave these up. What should we leave up? Should we leave Adam's diagram up? Because it may, you know, prompt people to do some thinking on this stuff, or hopefully everyone's got it on their desktop because it's easier to see. Before we go into this, and this is a topic I want to make sure we cover, recent research shows that memory is a much more important factor in AI performance, in fact, in all computing. It took years to prove this mathematically, although it's been suspected for years; a young man at MIT finally did it. The question I posed to you all here is whether you think memory should be the real focus, you know, the ability to remember things and have a larger context window, or whether you think the emphasis should be on processing.
Arthur Coleman [00:36:39]: We've had limited answers because it's kind of a hard question to read. I'll leave it up for now before we come to a conclusion on it. And before we get to the questions, I'd like to ask, let me just look at who it was here. Samantha, is it? Did I get that right? Samantha Zeitlin, you've been talking about some experience you've had with small models in your application. Could we get you to spend a couple of minutes making this real for us, since you've already sort of opened up? And I'm open to anyone else doing that after we get through the questions.
Arthur Coleman [00:37:15]: But, Sam, would you mind telling us about what you've done and what your experiences were?
Adam Becker [00:37:22]: Yeah.
Samantha Zeitlin [00:37:23]: We built a system to detect when you're having an outage based on your logs. So we needed a large language model that could handle the full context window, so we ended up using Claude with whatever it ended up being, 200,000 tokens. And we needed it to be able to do a holistic analysis and build out basically a graph, right? A chart of how all your services are connected to each other, which ones are having issues, and then make recommendations on what to do about it. And we did do some experiments using Llama for parts of this, and it just didn't work. We even had trouble getting any of the models to do well with tool use for, like, JSON validation. That's gotten better now, but a year ago it was really difficult. So I think the big piece that may be relevant for this group especially was we had to keep refactoring our code, because we were figuring out which parts needed to be modular in which ways.
Samantha Zeitlin [00:38:39]: And deciding to add additional models in would have made it even more complex. Rather than just having, you know, a couple of APIs we were using, we would have needed to add more clients for more APIs, et cetera, et cetera. So we decided to try to keep our architecture simpler. And I mean, another factor for us too is the security aspect, right? Getting different models vetted by the company we were working with, getting them to approve it.
Arthur Coleman [00:39:11]: How much memory were you running in? What was your memory limitation on that?
Samantha Zeitlin [00:39:18]: Well, I was saying we ran into some issues where, like, running things in CI/CD, they only had four gigs of RAM on that system. And we were trying to explain to them that a lot of our use cases needed more memory, even just to handle the data, never mind hosting a model. So I think depending on where you're trying to deploy these things, some companies are really not used to working in the AI space, and they're not even thinking about data engineering at this scale, where we might have data files that are 100 megs each streaming in essentially real time. So you definitely need enough memory to be able to handle all of that, plus whatever code you need to run. So that's another advantage to using a third-party model: you don't have to worry about how to allocate and scale the instances to host the models on top of everything else.
Arthur Coleman [00:40:20]: Thank you for sharing. Adam, you're showing your screen, or whoever's showing their screen right now. That was great. That certainly puts some context around some of the problems, or some of the arguments, that the paper brings up. Let's go to the questions. The first question, and I hope people, if you've already put a question in, could put your name next to it so we know who it is. Whoever asked this first question, would you go ahead and put it out there in the ether? Do you think this argument holds? Who was that?
Chimein [00:41:01]: That was me, I'll chime in. Yeah. So this seems like it's talking about the field or the industry or the state of the art a lot here, and thinking more as a developer trying to build a product: is there any room here still for off-the-shelf or foundation models as a thing? Or is it really fundamentally saying, no, you need to fine-tune something for everything, like this is your YAML model and this is your JSON model and those are different? Do you agree that's what they're arguing? And do you think then that this is a viable proposal? Because people love not having to do that work, and that to me seems like so much of the popularity of an LLM in the first place, that you don't have to train it.
Nahil Jain [00:42:04]: I think you bring up a really great point, which is we shouldn't all be training models for JSON validation. That's just dumb. There should just be a standard of practice that everybody can use.
Nahil Jain [00:42:15]: I also think that a lot of people have empirically found recently that, in general, prompt engineering is better than fine-tuning, which goes back to the same thing: the cost and talent of inferencing on these models, building these models, and keeping them up to date is pretty high. So for that reason I feel like, yeah, we can't go to the extreme of everything being a fine-tuned model for your task. But there might be arguments where maybe what they're saying is that you might just move to an SLM, which might be better at doing that specific task the LLM was doing. And maybe you can think of the SLM as a foundation; like, say, Mistral has a small model which out of the box does pretty well on that specific task, and they're saying that might be the world you live in and you don't even have to fine-tune. But yes, I don't fully buy the future of fine-tuning and creating so many versions of SLMs. Maybe part of it, but not fully.
Adam Becker [00:43:20]: Yeah, I have a perspective on this too. Mostly, when I'm still in founding-a-company mode or trying-to-build-a-new-product mode and I'm chasing product-market fit, the last thing I'm going to do right now is break apart my system; trying to do that would be crazy, right? All I've got to do is just see if users are going to use it. So I wonder if the way they would distinguish this is by number of new experiments run versus actual calls and invocations served. It could be that once you hit some scale, then of course everybody should be using these SLMs in some conjunction with LLMs for their agentic systems. But when you're still experimenting and you're not yet at scale, I would not spend, I think, more than a few minutes trying to experiment on an SLM at this point. This is how I'm likely to run things until shown otherwise, if somebody has a different perspective though. Yeah.
Arthur Coleman [00:44:38]: Other comments? Nahil, anything? Okay, let's move on to the second question. I don't know who this was, and I think questions two and three are from two separate people. So whoever had question two, are you still there? All right, I'll do it. Does the paper mention the suitability of SLMs... Sorry, what?
Guest 1 [00:45:08]: Hey, this is.
Adam Becker [00:45:10]: Yeah, sorry about that.
Guest 1 [00:45:11]: Yeah, so mine is actually three separate parts. But the high level is basically: how does reinforcement learning on agent-specific or application-specific tasks fit into the usefulness of small language models? One point where I think reinforcement learning is interesting for agentic use cases is coding agents or mathematical agents, where you have a result that you can really evaluate as binary, good or bad, but also anywhere you can define an eval with an LLM as judge. You can also do reinforcement learning there for your application-specific tasks. And I think that's something you can do on any of these large language models that have open weights or have fine-tuning interfaces. But I kind of wonder if that's more appropriate for small language models. I actually didn't get to read the paper, but I did a Ctrl-F and didn't really see reinforcement learning mentioned.
Guest 1 [00:46:32]: So I'm curious if anyone has any thoughts or experience with actually using RL for task-specific training.
Nahil Jain [00:46:43]: So I can tell you our experience with RL was that it was too hard to get humans to give feedback. Especially in some use cases where you need like subject matter expertise to evaluate whether the outputs are correct. It's really, really hard to get enough people to look at enough data points to be useful.
Nahil Jain [00:47:06]: And now that makes sense. Gone, it's even harder.
Guest 1 [00:47:11]: Yeah, I think I want to look more into the agent reinforcement trainer, where it tries to get around that, I think by using an LLM as judge. And you know, I think LLM-as-judge is maybe a difficult thing to get right, but maybe there are certain areas where it can actually be pretty accurate and useful.
Nahil Jain [00:47:31]: Yeah. So, Dylan, one of the things, if you see the screen right now, is that they were saying LLMs are more intelligent than SLMs, but the authors argue there are things you can use. They don't necessarily say RL, but because inference is cheap and because you can have expert-type models, you can train a model to be better at judging, so LLM as a judge. Or you can do self-consistency, where you say, hey, I'll run the same prompt n times instead of once, then take the majority vote of the calls and use that as the output. Or you can have another model which verifies the feedback, which is more like LLM as a judge than RL. But they do talk about that a little bit, and they say that because it's cheaper to run them, you can easily do that with SLMs and get the same or even better quality than LLMs.
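For reference, the self-consistency trick described here is easy to sketch: sample the same prompt n times and keep the majority answer. The call_model stub and the vote count below are hypothetical.

```python
import random
from collections import Counter

def call_model(prompt: str) -> str:
    # Stand-in for a small model sampled with nonzero temperature: noisy but usually right.
    return random.choice(["42", "42", "42", "41"])

def self_consistent_answer(prompt: str, n: int = 9) -> str:
    """Run the same prompt n times and return the majority-vote answer."""
    votes = Counter(call_model(prompt) for _ in range(n))
    return votes.most_common(1)[0][0]

print(self_consistent_answer("What is 6 * 7? Answer with a number only."))
```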
Arthur Coleman [00:48:30]: Any other questions?
Adam Becker [00:48:31]: Yeah, thanks.
Guest 1 [00:48:33]: The cost and the latency make a big difference in being able to do that. I think that's all I have on that. Thanks.
Nahil Jain [00:48:40]: I just wrote this in the chat, but I'll just say it out loud in case people aren't following that. We did a lot of work with LLMs as judges, and I just want to point out that it's another whole project in the sense that you have to do another whole round of prompt engineering and validation to make sure that the judge is working correctly. And I think a lot of people are very nervous about that. So our clients at least also want to see human validation on anything that you use an LLM judge for. I think a lot of people are still going to feel that way for the foreseeable future. So that's another thing to keep in mind. It's not that trivial to get a judge that works really well unless you have really deterministic outputs, in which case you probably don't need an LLM.
Arthur Coleman [00:49:33]: Excuse me one second while I let somebody in. All right, question three. I love this question, and it's one that I'll rephrase a little bit. Sergio, do you want to go from here, since it's your question?
Sergio [00:49:47]: Thank you, Arthur.
Adam Becker [00:49:49]: Yeah.
Sergio [00:49:51]: I wasn't quite sure whether the paper was suggesting that SLMs replace any participation of LLMs. And I was very compelled by the argument that language can be complex, and hence whether a hybrid approach, of calling in that LLM, digesting what is happening and orchestrating the whole process, and then being able to possibly summarize in a domain-specific way and hand over the task to the relevant SLM that has been trained for a particular task, is that what you were thinking, Arthur? What you interpreted?
Arthur Coleman [00:50:38]: Yeah. I'll put the question a very specific way: if you have a supervisor agent versus the subsidiary agents running under it, do you have the supervisor agent using an LLM, but the actual agents who are doing the specific tasks using an SLM, as an option? So, Sergio, that's sort of the same question.
Sergio [00:51:00]: Next time I'm going to let you phrase it for me, please.
Arthur Coleman [00:51:04]: No, no, that's fine. It's a version, It's a variant.
Adam Becker [00:51:07]: Yep.
Sergio [00:51:08]: But that's exactly right, yes.
Adam Becker [00:51:14]: Sergio, are you seeing the diagram? I don't know if you can see it now. And I think this is the visual depiction, Arthur, of what you were just saying as well. Right. So there are different ways in which, like, you can imagine the primary language model as the controller could be an LLM, and then all the other ones being SLMs. I think that was my takeaway reading it, that at least the way you should proceed is replacing some things that are a little bit smaller in scope, and then gradually seeing if even your coordinator can become an SLM. But now when I'm going through it, I'm not finding anything explicit. That was just kind of my takeaway.
Adam Becker [00:51:58]: So I don't know if they mentioned it explicitly. Did anybody see an explicit call to it, or did it just kind of become obvious reading it that you wouldn't want to replace this one with an SLM as the first thing to go?
Nahil Jain [00:52:12]: Yeah, I think it's the latter. I do feel like the paper is written from an almost strong viewpoint of, you know, I want to push SLMs. So they probably won't say that explicitly, but that is the intention, even down below when they're talking about, hey, once you have enough data and you're doing evals on your tasks, you can then train an SLM, or figure out if an SLM can do the same job as an LLM. But start with an LLM is what they're saying there. So yeah, I think there's going to be a hybrid. There's no way, which goes back to this other question I think we were discussing, question number two.
Nahil Jain [00:52:49]: Do you train, do you fine-tune, you know, an SLM for every single job? That's a lot of work, and I don't think there's consensus on it at all.
Binoy Pirera [00:53:00]: But do you think using the hybrid approach, where you are using SLMs for smaller tasks versus LLMs for the bigger ones and the agents, will hamper the quality of the results it returns? That's my only concern: the quality might be affected, because when I build agents and try to use different LLMs in a multi-agent system, the results actually become slightly weird, in the sense that the quality gets affected. Whereas when I'm using just one LLM for all the agents, no matter which one, the quality stays pretty consistent. So that's my only concern: how will it affect things when we're using SLMs and LLMs in one multi-agent architecture or system?
Guest 2 [00:53:51]: I'm going to chime in, I just wanted to bring up this point, something we've explored. Hybrid does work for us in high-volume environments where you need to make a pre-judgment on any sort of downstream events which might need, say, LLM processing. And what we've been able to do successfully there is use these SLMs as a pre-filtering layer to sort of adjudicate what needs further processing and what needs further execution at a very complex level. So think of it like the separation between a high-speed lane and a toll lane on a highway, wherein you're just quickly filtering out any sort of irrelevant events and focusing on the main ones. And that has worked really well in a high-volume, high-velocity environment.
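That high-volume pre-filtering pattern might look roughly like the sketch below; the triage rule, the threshold, and the event fields are hypothetical, not details of the speaker's actual system.

```python
from typing import Dict, Iterable, Iterator

def slm_triage(event: Dict[str, str]) -> float:
    # Stand-in for a cheap small-model call that scores whether an event needs deeper analysis.
    return 0.9 if "error" in event["log_line"].lower() else 0.1

def llm_deep_analysis(event: Dict[str, str]) -> str:
    # Stand-in for the expensive large-model call reserved for escalations.
    return f"detailed analysis of: {event['log_line']}"

def process_stream(events: Iterable[Dict[str, str]], threshold: float = 0.5) -> Iterator[str]:
    """Fast lane / slow lane: the SLM filters, the LLM only sees what matters."""
    for event in events:
        if slm_triage(event) >= threshold:
            yield llm_deep_analysis(event)
        # else: drop or archive the irrelevant event cheaply

events = [{"log_line": "heartbeat ok"}, {"log_line": "ERROR: connection refused"}]
print(list(process_stream(events)))
```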
Adam Becker [00:54:39]: Yeah. Okay.
Arthur Coleman [00:54:46]: Any other comments? And there's something I forgot to say up front. So if you're asking a question, if you wouldn't mind, if you're not in your PJs and you don't look like Albert Einstein because you haven't done your hair this morning, which is what I look like when I don't, would you put your camera on so we can see you asking the question? So, Yasine, you are up, if you don't mind.
Yasine [00:55:06]: Yeah, of course. I don't think I've done a very good job on phrasing the question. And I think it also like touches upon what Adam said earlier about building a product and being in that mode where you just select the best model up front.
Adam Becker [00:55:17]: Exactly.
Yasine [00:55:18]: The bitter lesson. Thanks, Arthur. I feel like the authors were arguing very hard in the direction that you can use SLMs: it's better for the environment, it makes sense for very narrow tasks. But wouldn't everybody ultimately just like to have one smart agent, and not have to deal with SLMs, not have to deal with the orchestration, not have to deal with thousands of prompts, et cetera? I just wanted to understand whether the group feels that sentiment or whether I'm missing something. And.
Adam Becker [00:55:48]: Yeah, I have an opinion on this, as a big fan of the bitter lesson. Go ahead. I think, or at least I think you could make a good argument, and hold on, you know, I'm just riffing here, right, I haven't thought about it all that much, but the reason this feels to me like probably the most interesting argument here is that you could go back to the passenger seat. The idea isn't that you are always actively feeding in more data for fine-tuning and you are the one that's always active, in control, but that the system could, at least in principle, automatically, organically evolve to such an extent that it only really uses a small language model.
Adam Becker [00:56:41]: Whether, yeah, that is somehow fine. And if I were to put you in the driver's seat, maybe you wouldn't even know what the small language model should be in the first place, or what would be the ideal place to replace the large language model in your entire agent architecture. And if that's the case, it's perfectly consistent with the bitter lesson, where.
Adam Becker [00:57:03]: Yeah, although I could see how it would be. Yeah, it's like a flavor of it. Right. So the system is continuing to evolve without so much human guidance, unless you feel like you need to step in to introduce guardrails or whatever you need. But on its own, it can already kind of self optimize and the output of that self optimization will be small language models.
Arthur Coleman [00:57:26]: I'm going to stop us here. It's nine o'clock on the dot, and part of my job is to keep us on time. And my whip is out, Adam, so I'm going to stop you. Thank you, everyone. This was a great discussion. I'm sorry, Artem, we didn't get to your question. The participation was wonderful, and thank you, Samantha, for sharing your experiences, and the other folks who did as well. I look forward to catching up with you all offline. Again, you have our links, so please feel free to reach out and do that.
Arthur Coleman [00:57:54]: And we will see you all next month at the reading group. Take care. Be well.
Adam Becker [00:58:00]: Thanks, everybody.
Nahil Jain [00:58:01]: Bye.
Binoy Pirera [00:58:04]: Thank you. Bye.