What is the Role of Small Models in the LLM Era: A Survey
Korri Jones is a Sr Lead Machine Learning Engineer and Innovation Coach at Chick-fil-A, Inc. in Atlanta, Georgia, where he is focused on MLOps. Prior to his work at Chick-fil-A, he worked as a Business Analyst and product trainer for NavMD, Inc., was an adjunct professor at Roane State Community College, and was an instructor for the Project GRAD summer program at Pellissippi State Community College and the University of Tennessee, Knoxville. His accolades are just as diverse: he was in the inaugural 40 Under 40 for the University of Tennessee in 2021, was Volunteer of the Year with the Urban League of Greater Atlanta with over 1,000 hours in a single calendar year, and has received the "Looking to the Future" award within his department at Chick-fil-A, among many others, including best speaker awards in business case competitions. However, the best award he has received so far is being a loving husband to his wife Lydia.
Sophia Skowronski is a Data Scientist at Breckinridge Capital Advisors with previous experience as a Business Analyst at Pledge 1%. Sophia has also worked as a Data Science Intern at Candid, an AI Investigations Intern at Deep Discovery, and held roles at Singularity University and the Global CO2 Initiative. Sophia holds a Bachelor of Arts in Astrophysics and Cognitive Science, as well as a Master's degree in Information & Data Science from the University of California, Berkeley.
Lihu Chen is a Research Associate at Imperial College London. His research focuses on natural language processing and large language models, in particular on developing efficient, reliable, and open-source models and tools, with an emphasis on information extraction and biomedical applications.
The October 2024 MLOps reading group session explores the role and relevance of small language models in an era dominated by large language models (LLMs). The author of a recent survey paper on small models joins to discuss motivations for using smaller models, including resource constraints, efficiency, and the unique capabilities they bring to certain tasks. Key discussion points include the advantages of small models in specific contexts (e.g., edge devices and specialized tasks), their role in complementing large models, and emerging techniques for leveraging small models to improve efficiency and mitigate issues like out-of-vocabulary words. The group also touches on methods for compressing models and the challenges in balancing model size with generalization and task-specific performance.
Binoy Pirera [00:00:00]: I think we're good to go. So once again, welcome everybody to this month's edition of the MLOps reading group. Good to see everybody joining. I see a lot of familiar faces, a lot of new faces. I think it's very, very cool that we have people from all over the world again, covering all the time zones, from all the continents, coming here for one common goal: just to dive deeper into AI and machine learning and demystify this whole thing. So it's good to have you guys here. And today's session. And I see Valdimar's cats making another cameo.
Binoy Pirera [00:00:32]: I love it. Every time. Brilliant. So today's session is a little bit special. Why? Because we have the author of the paper right here with us. Usually how we do things is we shortlist a bunch of cutting-edge technical papers released very recently and send them to the public Slack channels for people to vote and weigh in on what they are interested in talking about. And people usually end up voting for one paper. But this time things were a bit different.
Binoy Pirera [00:01:03]: Lihu, whose paper this is, won by a landslide. A lot of people wanted to talk about it, and I was very intrigued by that, and I wanted to see who the author was. I reached out to Lihu, the author, and he has very generously accepted to join us today, answer our questions, and walk us through it. So that's what makes the session special today. So if you guys have any questions whatsoever at any point, the person that wrote the paper is here with us, so feel free to ask. And as usual, we have amazing hosts today.
Binoy Pirera [00:01:40]: So let me give you guys a quick intro. As usual, we have a familiar face returning: Korri Jones is a Senior Lead ML Engineer at Chick-fil-A. Always a pleasure. And we have Sophia Skowronski. I'm so sorry, I suck at names. I hate it. So Sophia is joining us today as well.
Binoy Pirera [00:02:05]: She's a Data Scientist at Breckinridge Capital Advisors. And as usual, we have Valdimar Eggertsson; he's an AI team lead at Smart Data. And last but not least, we have our amazing resource person for today, the author of the paper. He is a postdoctoral Research Associate at Imperial College London, Mr. Lihu Chen. So without further ado, let me pass the microphone to Korri. Do you want to take over from here?
Korri Jones [00:02:34]: Sure.
Korri Jones [00:02:34]: I will kick things off in a different type of way and just say thank you so much for the love. Like, you're a rock star and champion of this community. My comments will be pretty brief, but I think it's important to just kind of level set, right? We've heard some people say that you're here to learn. This is a safe space. This is a place for us to learn and talk about this content.
Korri Jones [00:02:56]: Some folks are coming with 10, 15, 20 years of experience doing things in a complementary way, and like, you are an expert in your own way. And then there are other people who are like, listen, I have no idea what this was. I read this paper and I understood the first three paragraphs; the rest of it, I have no idea, and I have all these questions. We have a wide range of people, and we want to let you know that that is A-okay.
Korri Jones [00:03:17]: We want to celebrate everyone: the experts, the folks that are just starting, the people in the middle, and then the folks that are just like, I was curious and I saw this was here, I like Demetrios, I've seen him and I've heard about this community, and I want to just kind of join and see what you guys are talking about. And so we want to make sure that this is good. This is a safe spot.
Korri Jones [00:03:36]: Ask questions. If you feel uncomfortable asking verbally, drop it in the chat. We'll be there to observe and listen in. And it is okay if you say nothing. It is okay if you ask a lot of questions. This is a good place for us to come and learn, and that is part of the heartbeat of the reading group for the MLOps community. We get to learn, we get to grow together, we get to have a little bit of fun.
Korri Jones [00:03:58]: And so I wanted to just kind of set this up right there and bring us in. I'm excited about this because I've been thinking about this type of content. And so I'm coming in as the person in the middle, and if you see me ask some crazy questions, that's okay. So let's celebrate that: everyone being able to come in and have some fun. Hopefully that lets you know who we are and how we operate, that energy and heartbeat of this group. And so I'm pumped, I'm excited. But I'm also humble enough to know that I don't know it all, and that's okay.
Korri Jones [00:04:30]: And so we want to say that to each of you: it's okay if you don't know it all, and it's okay if you know almost all of it. So let's come in together and grow together. With that being said, I'm going to kick this over to the man with the master plan, the one who had the courage to write a paper, publish it, and do all of that research and all the things that are there. I'm looking at the screens right to the left of me, and that is the one and only Lihu. So I know you're going to help ground us and give us kind of a summary and perspective. And I'm going to mute my mic because I need to pull out my notepad, because this is about to be good.
Lihu Chen [00:05:05]: So regarding the motivation, I think this is a very interesting question, and when you asked me about the motivations, I started reflecting on why I wrote this paper and why I stick to small-scale language models. I believe there are two motivations. The first motivation is kind of stupid but very direct: I stick to small models, or emphasize small-scale language models, because I don't have many powerful GPUs; I don't have the computing resources. I have to admit that compared to the industrial giants, we have less computing power. And given my personal experience: while I was doing my Ph.D., I remember I had two research projects going at the same time. The first project was about positional encodings, where I wanted to inject word-ordering information through additional modules. As we know, attention itself cannot capture word order; it is position-independent. In this project we needed to study the properties of different positional encodings, and we wanted to pre-train a BERT-based model from scratch. With the GPU cards I had in my lab, it took me more than one month to pre-train the BERT model from scratch, and I needed to repeat these experiments many times. For this reason the project was a real struggle: every time I submitted the paper there were concurrent findings, which meant I needed to include these new findings in my current work, and it was rejected many times, from ACL to ICLR. Finally I got some negative comments regarding the submission from the reviewer side.
Lihu Chen [00:07:35]: You should study much larger models like Llama 7 billion instead of BERT and RoBERTa, since their size is fairly small and it's not the trend. So through this I realized I cannot compete with people that have much more powerful computing resources. That is the lesson I learned from that research project. In another research project, I tried to fix an issue with BERT-based, encoder-only language models: they have a problem when facing out-of-vocabulary words, meaning words they have never seen during pre-training. They tend to make mistakes when facing these out-of-vocabulary words. So I proposed to use a small model, like a plugin, so you don't need to pre-train the original model from scratch. You just...
Lihu Chen [00:08:48]: You need to fit a smaller tiny model that can learn, can impute evaluate words and then you make the two size of model work together. Then you can fix the auto vocabulary issue. And after the paper was accepted by ACR it's an auto paper. And also I got some messages that people are very interested in this direction how to use small models to enhance larger models. And this is the first thing that or lessons I have learned. And recently why I started writing this survey paper actually is from some complaints. Almost every day I can hear people they are talking about how powerful and how great. And at the same time I can also hear from my colleagues or friends or people in this field, they are complaining, they say that the alarm field is dominated by these high tech companies, by Google, by OpenAI and we cannot compete with them.
Lihu Chen [00:10:13]: And I heard this, I kept this in mind. A direct stimulus for this paper is the work from OpenAI: they published a paper on weak-to-strong generalization. The core idea is to use a smaller model, a weak model, to supervise a larger model. From this I realized that small models can also contribute to LLMs. That's the first point. The second: this spring I traveled to Malta for an NLP conference, and I saw a paper where they tried to use small models to enhance in-context learning. For in-context learning, you give a few examples and the model can learn these new things in the context, which can then be applied to unseen or new tasks.
Lihu Chen [00:11:21]: And I saw research where they tried to use small models to generate or to label the input samples, and in this way they can inject knowledge into the context. Their title is kind of catchy, something like small models can dance, or small models can help LLMs. After that I went to my advisor's office. My advisor is kind of an oracle, and I asked him whether we should start this project: to think about how to use small models to enhance LLMs, and what the role of small models is in the era of LLMs. I know that some users don't consider the cost or the expense; they stick to ChatGPT or Claude, the large-scale models. However, for some simple queries, for some specific scenarios, we don't need such large language models. I think maybe we should inform people that for certain tasks we don't need a large language model, we can still use small ones, and this is an optimization of resource usage, of power, for our environment.
Lihu Chen [00:12:58]: And starting from this point, I wrote this paper. I think that's the motivation. Regarding the overview, maybe I can also present this briefly. The basic idea is that LLMs are very powerful; we all know this. But I did a very interesting study: I wanted to know which NLP models on Hugging Face are downloaded the most, so I crawled the data on Hugging Face and did the statistics. The interesting thing is that I found the smaller models are still widely used; they are actually more popular than their larger counterparts. And the most popular model is still BERT, still BERT-based models.
Lihu Chen [00:14:00]: So this raises an interesting question: what is the role of small models in the era of LLMs? To assess the role of small models... oh, nice cat. To assess the role of small models, we need to analyze the strengths and the weaknesses of both small models and larger models. For sure, LLMs can achieve state-of-the-art performance across various tasks, and they can generalize to unseen or new tasks with minimal examples. This is the strength of LLMs. And what is the weakness? What is the flaw? A clear one is that if you want to pre-train, fine-tune, or deploy an LLM, you need computing resources. This is one limitation. Another one is interpretability.
Lihu Chen [00:15:13]: Based on our intuition, smaller or shallower models are more transparent. It's easier to understand their internal process, so they tend to be more interpretable. So we are considering that maybe for certain scenarios, small models have their niche market, for example healthcare or legal question answering, where we need explainable models and we need to know the reasoning process. If the model gives an output, we need to know why it gives that output and not another. So for environments where interpretability is required, we need small models.
Lihu Chen [00:16:06]: Another classic scenario is the computation-limited scenario: edge devices, our cell phones. If we want to deploy a language model there, maybe a tiny model can work. So this is the view from the computation side. Also, small models and larger language models can work together, to optimize resource usage or to get better performance; maybe the two sizes of models can complement each other's shortcomings. I'll use my own studies to showcase this. Another issue with LLMs is that they may generate hallucinated text. A potential solution is to estimate the confidence or the uncertainty of their responses or answers. This means we ask them to express uncertainty in their generated text.
Lihu Chen [00:17:23]: Like saying "I'm 100% sure this is the case," or "I'm not sure I remember this correctly, I'm 50% sure," or "I've never seen this, I don't know, I have no idea." So we can add this uncertainty to their answers. To do this, we can use a small model to measure or estimate the uncertainty of the LLM-generated text. This is one way LLMs and small models can collaborate. The second scenario, regarding computation, I would like to emphasize with my study about short texts. For short texts, like string matching, I guess you all know string matching, comparing the semantic similarity between...
Bruno Lannoo [00:18:22]: But I do wonder: you mentioned that interpretability, or explainability, is clear for small language models. I do know that for small machine learning models that is very clearly there. But for small language models, which are still relatively large neural networks, I don't fully understand what is explainable about them. Can you go into details on that?
Valdimar Eggertsson [00:18:45]: Sure.
Lihu Chen [00:18:46]: You are totally correct. For tree models or linear models, it's very easy to understand them. And to be sure, there's no clear definition or clear evidence that a smaller model has better interpretability than a bigger model. But the intuition is that shallower models have limited layers; imagine we just have one layer. For Llama we have very deep layers, so it's hard to interpret the internal process. Yeah, this is the intuition.
Binoy Pirera [00:19:33]: Great, thanks. We have another one if you don't mind. The question is: would uncertainty be text which is part of the output, or probabilities across all tokens?
Lihu Chen [00:19:47]: It can be both. First, we can ask them to express uncertainty in their text, like humans do. If you ask me a question, maybe one an audience member raised, and I don't know, I just say I don't know. But LLMs are forced to generate answers for any query. So one solution is that, apart from the text generated for the query, you also generate a confidence score or a verbalized phrase to express the uncertainty.
Lihu Chen [00:20:27]: Like "I'm very sure," "I don't know," "I'm only 50% sure." So we can use a prompt like "give me the answer with a confidence score in your response." We can just use this prompt-based method to elicit a confidence score. This is one way. Another way: I like to use this comparison. Imagine we want to check whether a person is lying. You can ask the same query to the person multiple times and check whether their responses are consistent or contradictory. This way we can compute a consistency score.
Lihu Chen [00:21:26]: We use this as a confidence score. So first, we can ask them to express certainty by themselves. Second, we can observe their external behaviors to get these confidence scores. And another way is to use their internal signals: if you want to check whether someone is lying, you can look at their brain activity, their blood pressure, their heartbeat. These internal signals can also be applied to LLMs; we can check their internal activations or the probability distribution over the vocabulary. So there are many ways to get confidence scores.
Lihu Chen [00:22:15]: And what is the role of small models here? The thing is that you can use a small model, a fine-tuned one, to check the consistency, to check whether the responses contradict each other. I hope that's clear.
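(As a concrete illustration of that last point, here is a minimal sketch of consistency-based confidence scoring: sample the LLM several times and let a small NLI model check whether the answers contradict one another. The NLI model choice and the scoring rule are my own assumptions, not code from the paper.)

```python
# A minimal sketch of consistency-based confidence: sample several answers from an LLM,
# then use a small fine-tuned NLI model to check whether the answers contradict each other.
from itertools import combinations
from transformers import pipeline

nli = pipeline("text-classification", model="roberta-large-mnli")

def consistency_score(answers):
    """Fraction of answer pairs the small NLI model judges as non-contradictory."""
    pairs = list(combinations(answers, 2))
    if not pairs:
        return 1.0
    ok = sum(1 for a, b in pairs
             if nli({"text": a, "text_pair": b})[0]["label"] != "CONTRADICTION")
    return ok / len(pairs)

# answers = [llm(query) for _ in range(5)]   # `llm` is a hypothetical callable
# confidence = consistency_score(answers)    # low score suggests a likely hallucination
```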
Binoy Pirera [00:22:32]: Yeah, that was great. Thank you, Lihu. I think Sophia and Valdimar would be able to give us some thoughts as well. But before that, we have one more question from Keith: do certain kinds of research on SLMs generalize better to LLMs? Are there certain kinds that don't generalize well? Your work on positional embeddings seems like it would generalize well to LLMs.
Lihu Chen [00:22:58]: Yeah, I would say that in general, larger models generalize better compared to smaller ones. The benefit of using small models is that for very specific tasks, for example tabular learning, short text reasoning, or domain-specific tasks, if you fine-tune a small model on limited data samples, it can outperform a general-purpose large language model. But regarding generality, in general larger models are better. I guess he also asked about the positional encodings.
Keith Trnka [00:23:49]: I can speak up. I meant the generalizability of the research, not the generalizability of the model. When you learn something about positional embeddings by doing research on small language models, you can expect that research to hold true for a larger language model. But certain kinds of research may not hold when you scale up.
Lihu Chen [00:24:13]: You mean dedicated to the embeddings?
Keith Trnka [00:24:18]: Like the embeddings, or for instance different types of attention mechanisms may work on small models but not on large ones, or work on large but not on small.
Lihu Chen [00:24:29]: Yeah, but my experience is that even for the embeddings, larger models do better. Yeah.
Valdimar Eggertsson [00:24:39]: So the idea is to just look at the paper together, the collaboration parts, and say a few words about each part and maybe spark some discussion. No, I don't hear anything. Can someone say something?
Bruno Lannoo [00:24:57]: Yes, we can see it fine.
Sophia Skowronski [00:25:00]: Sorry.
Valdimar Eggertsson [00:25:01]: All right.
Lihu Chen [00:25:03]: Yes.
Valdimar Eggertsson [00:25:04]: It's a bit different now since we have the author here to present it, but usually we would just go through the paper and discuss things that stood out for us, maybe get some different points of view, and ideally talk about it. Whenever a question or idea pops up, we can maybe have a little chat. So no need to introduce the motivation or the concepts; we can just go right into it. In this paper it's viewed from two different aspects, collaboration and competition, and we're going to focus on collaboration for now. So how can they enhance each other?
Binoy Pirera [00:25:54]: There's a question from Isaac. Right, is that you?
Keith Trnka [00:25:58]: Yes. Just on that very first graph, the Hugging Face downloads: do you think that might be because for larger models people tend to use APIs instead of downloading them and self-hosting them? So it might not necessarily be an indication of model popularity, because we're not seeing the usage of the larger models in that graph.
Valdimar Eggertsson [00:26:24]: That's my intuition. I don't know; what do you think?
Lihu Chen [00:26:28]: It can be. So we overlooked some usage that goes through APIs, but in general, also from my gut feeling, or if you ask people around, for most applications we still stick to BERT-based models. I think this finding is relatively fair.
Valdimar Eggertsson [00:26:56]: Yeah. Was this for like one month, or averaged over a long time, or is it just per month? August or July this year?
Lihu Chen [00:27:06]: Yes, it was collected in August, and it's by month.
Valdimar Eggertsson [00:27:13]: That's interesting. Yeah. For me, I used to use BERT and RoBERTa, like three years ago.
Korri Jones [00:27:23]: Sure.
Lihu Chen [00:27:23]: Yes.
Bruno Lannoo [00:27:23]: OK. I was also wondering at that level: maybe if you're downloading such large files, you're going to keep them on your local machine once, and with the smaller ones you might be like, I'll re-download it every time, it's not that much effort.
Lihu Chen [00:27:40]: Yeah, it can be.
Valdimar Eggertsson [00:27:43]: All right, so we can maybe look at this picture here together. It shows kind of a mind-map tree of the whole paper. I'm going to focus on the first five items here: this is how small models can enhance large language models. There's data curation and the weak-to-strong paradigm that you mentioned earlier, which have to do with how to train a large language model effectively with the help of small models. Then we have inference, evaluation, and domain adaptation, which are maybe of more direct interest to people who are not training the models but just using and applying them. I took some notes.
Valdimar Eggertsson [00:28:32]: I'll have some highlights to point out, and throw out questions whenever you can. So the first topic is data curation, and the motivation for that is that we're already training our models on the entire Internet, and that's proven quite fruitful so far. But is there any more text left? I thought it was super impressive when these smaller models were trained on the whole of Wikipedia and all the books in the world, and then GPT-3 came out, et cetera; it goes on and on. But recent research supports the notion that less is more. This is the main thing here: you can train on a much smaller set of high-quality data; it's the quality, not the amount. I remember one paper this year, or some study, "Textbooks Are All You Need" or something, where they trained only on textbooks to get this kind of smart model instead of just using all the data on the Internet, which is just a bunch of nonsense.
Valdimar Eggertsson [00:29:48]: Selecting it properly is key to getting a powerful yet efficient model; it doesn't have to be the biggest model. There are heuristics for selecting data, for filtering out the nonsense, but we can also use smaller models: a simple classifier can be trained to assess content quality. I think the GPT-3 paper and then subsequent papers that trained large language models did that. So as a pre-filtering step for curating the data, we can run it through a BERT-style model that notices that a tweet is not as useful as a paragraph from a scientific paper. There's also reweighting, which I thought was pretty cool. It's about affecting the sampling of different sources, so we could use a small model specifically for reweighting the samples.
Valdimar Eggertsson [00:31:07]: So I guess we would weight something that comes from Wikipedia higher than something that comes from Twitter. What do you think about this? Any comments? There was a recent line of models, from MIT or something, with one billion or a few billion parameters, which were scoring as high as or higher than much bigger models that came out earlier. So this is one tool we have to compress the models.
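(To make the quality-classifier idea concrete, here is a rough sketch of that kind of pre-filtering step. The reference corpora, features, and threshold are my own illustrative assumptions, not the setup from the GPT-3 paper or the survey; the same scores could drive reweighting instead of hard filtering.)

```python
# A rough sketch of data curation with a small quality classifier: train it to tell
# "reference quality" text from raw web text, then use its score to filter a corpus.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# "High quality" reference docs (e.g. encyclopedia or book paragraphs) vs. raw web text.
reference_docs = ["A paragraph from a scientific paper ...", "An encyclopedia entry ..."]
web_docs = ["buy cheap followers now!!!", "click here to win ..."]

quality_clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
quality_clf.fit(reference_docs + web_docs,
                [1] * len(reference_docs) + [0] * len(web_docs))

def curate(corpus, threshold=0.5):
    """Keep documents the small classifier scores as high quality;
    the scores could also reweight sampling instead of hard filtering."""
    scores = quality_clf.predict_proba(corpus)[:, 1]
    return [doc for doc, s in zip(corpus, scores) if s >= threshold]
```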
Bruno Lannoo [00:31:43]: One thing I think I read in this section somewhere is that you could use it to filter out toxic language and this kind of uninteresting content. And I was wondering, how do we handle the fact that with LLMs, you don't want to train them to create toxic language, but you do want them to be able to handle it if a user uses toxic language against them? Is there a technique we use to make sure that they're still able to do that if we remove the toxic language from their data sets?
Valdimar Eggertsson [00:32:12]: I'm thinking that might have to do with the instruction tuning. I think how they did it with GPT-4, for example, was to have people do reinforcement learning from human feedback: humans reading all the toxic responses and flagging them, and then that's used with some reinforcement mechanism. So that's my idea; I don't really know the answer, but that's what I think. If someone else has some wisdom to share, maybe.
Bruno Lannoo [00:32:44]: I think there might be other ways. This might be the way that it's done, definitely, I don't know. There might be ways of also sub-classifying in there and asking: is this actual toxic generation, or is this responding to toxicity? And maybe training on the responses that are detoxifying would be interesting.
Lihu Chen [00:33:02]: Maybe. I think there is already existing work that trains a small classifier to determine whether a sample is toxic or not, and filters it out this way. I also mentioned this in the future directions part, and one finding reveals that if you remove some toxic content, it can decrease the performance of the original model. But I don't know the exact reason.
Valdimar Eggertsson [00:33:38]: Cool. So the same principle applies to instruction tuning: just like we can select the pre-training data, we can also select the instruction data. The study "Less Is More for Alignment" demonstrates that fine-tuning on just 1,000 carefully curated instruction examples can yield a well-aligned model. So yeah, these are the two main steps to build an LLM, pre-training and instruction fine-tuning, so that's super useful, though maybe not directly useful for engineers except those who are working on developing these systems. And similar to this is the weak-to-strong paradigm. When I read this I was a bit confused at first about how a weaker model could teach a stronger model, but this is about alignment with human feedback, to, I guess, not use toxic language.
Valdimar Eggertsson [00:34:46]: Which we talked about earlier. This is something that has recently been shown, and I think we mentioned earlier that OpenAI did this. This scenario introduces a new paradigm for aligning superhuman models, termed weak-to-strong generalization, which involves using smaller models as supervisors for larger models. So we could have a set of different specialized models, each maybe teaching or trying to align the larger model with different things that are important, and that seems to work. The motivation for this is that you can't just have a human judging the output of a superhuman model; we need some automated way to scale this up and continue into the superhuman era. I don't have much to say about this, but...
Bruno Lannoo [00:35:58]: The traditional approach is the other way around. The first thing that people did was have a stronger model and then try to train smaller models based on the supervision of the larger models. But so this is flipping it around, or...?
Sophia Skowronski [00:36:14]: Yeah, we go into knowledge distillation later on in the paper, which is the large to the small.
Valdimar Eggertsson [00:36:22]: Is it about alignment mainly? So maybe, Lihu, could you share something,
Lihu Chen [00:36:30]: Maybe explain the basic idea of this weak-to-strong paradigm? Yeah, okay. The basic idea is that OpenAI thinks that in the future we can have superhuman models, meaning their capabilities are superior to ours, for some specific tasks, for example code generation. A very powerful model can generate complex, lengthy code. It's hard for a human to do the annotation since it's too long, too complicated, and also contains logic; we don't know whether there are risks, or some privacy or safety problems. So here we need to introduce another model to supervise it. Actually, their idea is kind of easy to understand: they just used a weaker, smaller model to generate pseudo-labels for the training samples, and in this way supervise the larger model, and they proved that it can work.
Lihu Chen [00:37:50]: So that's the basic idea of the weak-to-strong paradigm.
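(Here is a toy analogue of that pseudo-labeling setup, shrunk to classical models so it runs anywhere: a small "weak" model labels an unlabeled pool, and a larger "strong" model is trained only on those pseudo-labels. The data, model choices, and sizes are my own assumptions, not OpenAI's setup.)

```python
# A toy analogue of weak-to-strong supervision: a small weak model annotates data,
# and a larger strong model is then trained on those pseudo-labels.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_small_labeled, y_small_labeled = X[:200], y[:200]   # tiny set the weak supervisor sees
X_unlabeled = X[200:]                                 # large pool with no ground truth

weak = LogisticRegression(max_iter=1000).fit(X_small_labeled, y_small_labeled)
pseudo_labels = weak.predict(X_unlabeled)             # weak supervisor annotates the pool

# The "strong" model never sees ground-truth labels for the pool,
# only the weak model's pseudo-labels.
strong = MLPClassifier(hidden_layer_sizes=(128, 128), max_iter=300, random_state=0)
strong.fit(X_unlabeled, pseudo_labels)
```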
Valdimar Eggertsson [00:37:56]: Yeah, thank you. This is a very interesting, relevant topic related to the alignment problem and where we're going three or four years from now. And a small model doesn't necessarily mean a BERT model in this context; it just means a smaller learner relative to the large one. So we could have GPT-4 assessing, or helping to tune, GPT-5 to make it more aligned with human values. That's the idea. I guess then we can move on to the next part, which is about efficient inference. It's maybe most related to MLOps, this kind of operational side of using these models in production.
Valdimar Eggertsson [00:38:44]: And yeah, I guess the idea is that we can use smaller models when they work, and we can combine them. We don't always need to use the sledgehammer when a hammer will do. Personally, I tend to just use the GPT-4 API, maybe even when it's actually just a classification task for which a simpler model exists, and it doesn't really matter because I'm doing it a hundred or a thousand times and maybe don't care so much about a few cents. But if you do it a million times, it starts to count. So there are different approaches; overall it's ensembling and sampling. Ensemble methods are just combinations of different models: you can have a voting mechanism, maybe see when they agree, and often get better performance than the individual models.
Valdimar Eggertsson [00:39:49]: Like random forests are just ensembles of decision trees. And there are two different categories. There's model cascading, which I thought was pretty cool. It's similar to when you talk to some organization or company: you first talk to customer support, and they can probably answer your question, and otherwise they raise the issue to someone higher up the chain, maybe ending with an expert. Similarly, we could first use a small or medium language model to try to solve the task, but if it's beyond that model's capabilities, we cascade it up to the more powerful one. The critical step in this process is determining whether a given model is capable of addressing the input question. This method effectively optimizes inference speed and reduces financial costs. There are multiple different approaches.
Valdimar Eggertsson [00:40:58]: We have citations here from 2024 for techniques for training a small evaluator to assess the correctness, confidence, or quality of a model's output, if that's possible. I don't know how well these approaches work, but it's great at least to have the confidence. If you are using a prediction to guide some decision, having the confidence as well, whether it's 60% or 95%, changes everything about how valuable the prediction is. I can't go into the details here since I didn't read these papers, but it sounds like an interesting thing. Some employ verification prompts; I guess that's just prompting models, maybe trained specifically for it, to do this kind of evaluation. And the other approach is model routing, which is about routing or forwarding the input to the right model, and then we need to predict which one somehow, preferably without computing the output of every model and deciding afterwards.
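(A minimal sketch of the cascading idea just described, before we get to routing: a small model answers first, a small evaluator scores its confidence, and the query only escalates to the large model when that score is low. The function names, evaluator interface, and threshold are hypothetical.)

```python
# A minimal sketch of model cascading with a small evaluator deciding whether to escalate.
def cascade_answer(query, small_llm, large_llm, evaluator, threshold=0.8):
    """small_llm / large_llm: callables returning an answer string.
    evaluator: a small model returning a confidence score in [0, 1]."""
    draft = small_llm(query)
    confidence = evaluator(query, draft)
    if confidence >= threshold:
        return draft                 # cheap path: the small model was good enough
    return large_llm(query)          # expensive path: escalate to the big model

# Example wiring (all hypothetical components):
# answer = cascade_answer("Is this review positive?", small_llm=phi_mini,
#                         large_llm=gpt4, evaluator=confidence_head)
```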
Valdimar Eggertsson [00:42:15]: We need to predict it somehow beforehand to know which one to choose. I guess you can have some heuristics for it: if it's a really long input, you need a model with a long context window; if it's about reasoning and seems complicated, you can maybe use a chain-of-thought model. And there are different examples here. One cool one, just to mention something: FORC. They propose a meta-model, a cool term; it's just a regression model.
Valdimar Eggertsson [00:42:48]: It's some kind of predictor to assign queries to the most suitable model, without requiring the execution of any large model during the process. The meta-model is trained on existing pairs of queries and model performance scores. So yeah, if you have something like this that can predict which model to use, then you can have this kind of society of different models, maybe more similar to natural intelligence, how it works in us, with different parts responsible as different specialists.
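(Here is a rough sketch of that meta-model routing idea, not the FORC implementation: a small regressor per candidate model is fit on past query/score pairs and then picks, for a new query, the cheapest model predicted to clear a quality bar. The model pool, features, training data, and the 0.8 bar are illustrative assumptions.)

```python
# A rough sketch of meta-model routing: predict each model's score for a query,
# then send the query to the cheapest model expected to do well enough.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge

MODELS = ["small-llm", "medium-llm", "large-llm"]    # hypothetical pool, cheapest first

# Historical data: queries plus the quality score each model achieved on them.
train_queries = ["translate this sentence", "prove this theorem step by step"]
train_scores = np.array([[0.90, 0.92, 0.95],          # easy query: all models fine
                         [0.20, 0.55, 0.90]])          # hard query: only the big one

vec = TfidfVectorizer().fit(train_queries)
meta_models = [Ridge().fit(vec.transform(train_queries), train_scores[:, i])
               for i in range(len(MODELS))]            # one score predictor per model

def route(query, quality_bar=0.8):
    """Pick the cheapest model whose predicted score clears the quality bar."""
    x = vec.transform([query])
    predicted = [m.predict(x)[0] for m in meta_models]
    for name, score in zip(MODELS, predicted):
        if score >= quality_bar:
            return name
    return MODELS[-1]
```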
Lihu Chen [00:43:26]: And.
Valdimar Eggertsson [00:43:26]: Speculative decoding is similar, except it's about having a small model help a larger model generate tokens. Sorry, I don't know if I can explain it well; it's something I read. I think this was all kind of cool. I haven't tried using any of this. Has anyone in the room tried using ensemble methods instead of just going straight to the sledgehammer?
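(For reference, a simplified sketch of the speculative decoding idea under greedy decoding: a small draft model proposes a few tokens cheaply, the large model verifies them, and only the agreeing prefix is kept. The callables below are hypothetical stand-ins for real model APIs, and real implementations verify the whole draft in a single batched pass.)

```python
# A simplified, greedy-decoding sketch of speculative decoding.
def speculative_decode(prompt_tokens, draft_model, target_model, k=4, max_new=64):
    """draft_model / target_model: callables mapping a token list to the next greedy token."""
    tokens = list(prompt_tokens)
    while len(tokens) < len(prompt_tokens) + max_new:
        # 1) The small draft model proposes k tokens autoregressively (cheap).
        draft = []
        for _ in range(k):
            draft.append(draft_model(tokens + draft))
        # 2) The large model checks each drafted position (in practice, one batched pass).
        accepted = 0
        for i in range(k):
            if target_model(tokens + draft[:i]) == draft[i]:
                accepted += 1
            else:
                break
        tokens += draft[:accepted]
        # 3) On a mismatch, take the large model's own token so progress is guaranteed.
        if accepted < k:
            tokens.append(target_model(tokens))
    return tokens
```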
Bruno Lannoo [00:43:59]: I think it is interesting, because I feel like in more traditional machine learning you would often ideally start with a very simple model and then slowly build up. And when you build up, you can maybe start building an ensemble based on all the models you had along your path. But I feel like with LLMs, because the easiest, simplest, quickest approach is to just send it to the big LLM, you end up starting straight off with the most powerful one and then thinking, well, to decrease cost we have to add other models. And this kind of process is a bit flipped around in this evolution, I think.
Keith Trnka [00:44:40]: We're doing a project at the moment using multimodal Gemini models to validate whether images that people might upload to an e-commerce marketplace adhere to certain rules. And interestingly, we have seen that depending on the rule, it's not always the biggest model that gives the best performance; sometimes the smaller model gives better performance for certain rules. But I guess, as always with these LLMs, it's not very transparent or clear why that might be. So I just thought that was interesting. We're evaluating all the different Google Gemini multimodal models, and some of them perform better on some questions compared to others.
Sophia Skowronski [00:45:29]: I'll do my best to run through some of these last five. I actually created a little slide deck because I have an easier time getting through this if there are graphics, because, like Korri, I think I'm somewhere in the middle in terms of expertise. So I'll share some of my takeaways; let me just set up the screen sharing real quick. So, the last five techniques. I wanted to start with this one primarily because it's first, but also I spent the most time here, because a lot of us, if you're experimenting with LLMs, are often running into RAG as a technique. For anyone that doesn't know, it's a technique that enhances large language models by combining them with an external knowledge base: it retrieves relevant information and embeds it in the context of the prompt given to the language model, which allows it to produce more informed and accurate responses.
Sophia Skowronski [00:46:36]: Because as Lihu was mentioning, LLMs have known problems with memorization and also hallucinations. So Lihu, or the authors, categorized the different RAG techniques into three different areas. The first is textual documents. Sparse retrievers typically don't generalize as well, I'd say, because they require specific vocabulary to be present. I think some of us are familiar with things like term frequency-inverse document frequency, or BM25, best match 25, which is a ranking algorithm to determine the relevance of a document to a query. You can see a nice graphic of what sparse retrievers are versus dense ones. Using something like BERT as the encoder for queries and documents and doing some similarity matching is the dense retriever technique, at a very high level.
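(A small illustration of the sparse-versus-dense contrast Sophia describes; the two-document corpus, the TF-IDF stand-in for BM25, and the MiniLM encoder are my own assumptions. Both retrievers here are small models compared to the LLM that will consume the retrieved context.)

```python
# Sparse vs. dense retrieval for RAG, in miniature.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sentence_transformers import SentenceTransformer

docs = ["BM25 ranks documents by term overlap with the query.",
        "Dense retrievers embed queries and documents with a small encoder like BERT."]

# Sparse retrieval: lexical overlap (TF-IDF here; BM25 is the stronger classic choice).
tfidf = TfidfVectorizer().fit(docs)
def sparse_search(query):
    scores = (tfidf.transform([query]) @ tfidf.transform(docs).T).toarray()[0]
    return docs[int(np.argmax(scores))]

# Dense retrieval: cosine similarity between embeddings from a small encoder.
encoder = SentenceTransformer("all-MiniLM-L6-v2")
doc_vecs = encoder.encode(docs, normalize_embeddings=True)
def dense_search(query):
    q = encoder.encode([query], normalize_embeddings=True)[0]
    return docs[int(np.argmax(doc_vecs @ q))]

# context = dense_search("how does a dense retriever work?")
# prompt  = f"Answer using this context:\n{context}\n\nQuestion: ..."
```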
Sophia Skowronski [00:47:44]: So let me just quickly move along, and feel free to interrupt again. I didn't read all the papers, but I wanted to pull in some high-level graphics because they show other implementations of RAG using different LLMs, and in this case some of them are using smaller fine-tuned LLMs as part of the retrieval process. KnowledGPT at the top was one of the highlighted applications. All these papers have their own branded prompting technique; this one is called the program-of-thought prompting technique.
Sophia Skowronski [00:48:30]: And so it generates search queries for knowledge bases, and it has two tasks associated with it: the knowledge retrieval part and the knowledge storage part. And since we're talking about SLMs, oh yeah, I guess I should back up: the small models in this case are the retrievers for RAG systems. Retrieval for KnowledGPT involves three steps. It generates a piece of search code, which you can see in yellow.
Sophia Skowronski [00:48:59]: Then the search code is executed to retrieve the relevant knowledge, and all of this is then fed into the actual LLM to answer the input query. And then, just quickly moving on to T-RAG: it's basically a vector database plus an entities tree. For a given input user query, it searches across the document vector database and pulls in the relevant chunks; if there's any organization mentioned, it then queries the entities tree to add that information to the context. Then it uses a fine-tuned LLM to generate a response. So that's kind of some information. Oh yeah.
Sophia Skowronski [00:49:57]: And so these are using structured knowledge, meaning from knowledge graphs, from tables, or from databases. All right, well, we are running long on time, so I'm just going to move ahead; I'll share this in the Slack if anyone's interested. The other piece that's probably relevant to all of us, especially if you're primarily using the API, is using small models in prompt-based learning. Prompt engineering generally, as we know, is constructing prompts to guide frozen LLMs without parameter updates, and in-context learning is including a few examples within the prompt. We could potentially use small models here to augment the prompt retrieval process, or the prompt selection. So I just wanted to show two examples that were given in the paper.
Sophia Skowronski [00:50:55]: The first one, UPRISE, tunes a lightweight retriever that automatically retrieves prompts from a pre-constructed pool. So it adds a prompt to a given input, yeah, it prepends it, I guess. It then uses the frozen LLM to evaluate the prompt performance, and in order to train the prompt retriever, the evaluation obtained from the frozen LLM is used to tune the retriever in a reverse manner. And then there's DaSLaM, which uses a small generator to decompose a complex prompt into sub-problems that require fewer reasoning steps; the large model then comes in to answer this expanded input, and it's called the solver large language model. So yeah, just kind of moving ahead.
Sophia Skowronski [00:51:57]: You can use small models to fix flaws of large language models. This starts going into contrastive decoding, which Valdimar started talking about, and one of the papers had a very spicy example that I wanted to talk about, but I'll just share it in the Slack. So you can use contrastive decoding, which leverages the differences between two models: a larger, capable model, the expert, and a smaller, less capable model, the amateur. The idea is to choose tokens that maximize the difference in log-likelihood between the expert and the amateur, so we're selecting tokens that the expert model finds highly probable but the amateur does not. A useful point here is that contrastive decoding happens during inference; it's not doing any sort of training or knowledge transfer. It's just using the combined outputs to produce the best next token, the one with the highest contrastive log-likelihood. And where that's different is...
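(A bare-bones sketch of that scoring rule, my own simplification of contrastive decoding rather than code from any of the cited papers: candidate tokens the expert finds plausible are ranked by expert log-probability minus amateur log-probability. The alpha cutoff and the model handles in the usage comment are assumptions.)

```python
# A simplified contrastive decoding step over next-token log-probabilities.
import numpy as np

def contrastive_next_token(expert_logprobs, amateur_logprobs, alpha=0.1):
    """expert_logprobs / amateur_logprobs: arrays of log-probabilities over the vocabulary
    for the next token, produced by the large and small model on the same prefix."""
    # Plausibility mask: only consider tokens reasonably likely under the expert.
    cutoff = np.log(alpha) + expert_logprobs.max()
    candidates = expert_logprobs >= cutoff
    # Contrastive score: reward what the expert likes and the amateur does not.
    scores = np.where(candidates, expert_logprobs - amateur_logprobs, -np.inf)
    return int(np.argmax(scores))

# next_id = contrastive_next_token(expert_model.logprobs(prefix),
#                                  amateur_model.logprobs(prefix))
```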
Sophia Skowronski [00:53:02]: Oh yeah, I think we're at time, so maybe I'll pause here and just see what everyone thinks.
Binoy Pirera [00:53:09]: Yeah, yeah, I think you can just go ahead for a few more minutes. I mean, if you guys can stick around, if you don't have a hard stop right now, feel free to stick around, so Sophia can just take her time, I guess.
Sophia Skowronski [00:53:22]: Yeah, I actually have to run too; I'll leave in a few minutes, but yeah. So contrastive decoding happens during inference, whereas knowledge distillation is a training technique where you're taking what the larger model has learned in its parameters and transferring it to the smaller model. Classically, this is how DistilBERT was created, and there's a nice graphic of DistilBERT here. DistilBERT had full access to the internal architecture and parameters of the teacher model, BERT. The main goal of this is model compression, improving on the efficiency of the larger model, and that was the goal for DistilBERT as well. White-box knowledge distillation means they have full access to the teacher model. To create DistilBERT, during training it initializes its parameters from every second layer of the BERT model.
Sophia Skowronski [00:54:40]: And then it employed this triple loss function in order to update its internal parameters to mimic the output of BERT. I think I've worked with BERT in the past for college projects, but I don't currently use it; again, I'm mostly a large language model API user. And black-box distillation is still about transferring knowledge, but without access to the teacher's internal architecture, so only the outputs are available, and the student model must learn to mimic the teacher's output given the same inputs. You can do this through two techniques, or again, here are just two examples given. There's chain-of-thought, where you're transferring the reasoning abilities, specifically the step-by-step reasoning process; that's kind of what's highlighted in this very tiny example here. The student model, through training, learns to produce similar reasoning chains and final answers.
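(A compact sketch of the white-box side of this, the standard soft-target distillation loss rather than DistilBERT's exact triple loss; the temperature and weighting are assumed values, and the teacher/student handles in the comments are hypothetical.)

```python
# Standard knowledge distillation loss: match the teacher's softened output distribution
# while still fitting the hard labels.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Combine a soft-target KL term (knowledge transfer) with a cross-entropy term."""
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Inside a training loop (teacher frozen, student updated):
# with torch.no_grad():
#     teacher_logits = teacher(batch)
# loss = distillation_loss(student(batch), teacher_logits, batch_labels)
# loss.backward()
```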
Sophia Skowronski [00:55:48]: And then instruction following is more about prompts that demonstrate how the teacher model follows instructions, which the student again learns to mimic. The teacher prompts are typically, I meant to try and get an image here, but teacher prompts are typically more detailed and explicit, asking for, you know, a step-by-step process, and the student prompts are simpler. So during distillation, the student model learns to produce outputs similar to the teacher's even when given simpler prompts. And then the last one, I think this is the last one, right, is that you can use large language models to generate synthetic data for small models to train on. This goes back to the point that human-generated data is finite, which Lihu and Valdimar also brought up.
Sophia Skowronski [00:56:42]: And so there are again, of course, two categories of data synthesis: data generation, with some examples here, as well as data augmentation, where you can possibly get more examples from the same base text just by asking the model to generate different alternatives. I think this was flagged as one you probably wouldn't use in a heavily regulated industry like the medical field or finance, because for the most part you're always trying to get the same thing back; in medical environments, the standard of care is very specific to a timeline, and you don't want to alter any data that might change that information. Anyway, that's all I had. So we got through it all.
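(One last small sketch, of the data augmentation flavor of synthesis just mentioned: a large model paraphrases each labeled example so a small model has more data to train on. The OpenAI client calls are real, but the model name, prompt wording, and seed data are my own assumptions.)

```python
# LLM-based data augmentation for a small model: paraphrase labeled examples.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def augment(text, label, n_variants=3):
    """Ask the large model for paraphrases; each variant keeps the original label."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user",
                   "content": f"Rewrite the following sentence {n_variants} ways, "
                              f"one per line, preserving its meaning:\n{text}"}],
    )
    variants = [line.strip() for line in resp.choices[0].message.content.splitlines()
                if line.strip()]
    return [(v, label) for v in variants]

# seed = [("The battery life is excellent.", "positive")]
# augmented = seed + [pair for text, label in seed for pair in augment(text, label)]
# ...then fine-tune a small classifier on `augmented`.
```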
Valdimar Eggertsson [00:57:45]: Wow.
Binoy Pirera [00:57:46]: Yeah, that was incredible, with slides and all. Incredible. Yeah, that was super insightful. Thank you so much, Sophia. I know we ran over on time a little bit, but honestly, everybody, thank you so much for joining. This was such a great session, and thank you, Lihu, for all your contributions and insights. Very, very well done. And thank you for writing such a kickass paper.
Binoy Pirera [00:58:08]: Honestly, there's so much to go through. So thank you, everybody, for joining, and we'll see you next month, same time, same place, with a brand new paper. Hopefully we'll start with Sophia next time so she can show off her slides, but yeah. Thank you so much, everybody. Thank you, Valdimar, thank you, Sophia, and thank you, Korri, for your contributions. See you all next time. Incredible.
Valdimar Eggertsson [00:58:34]: Thank you.
Bruno Lannoo [00:58:35]: Thank you everyone.
Lihu Chen [00:58:36]: Thank you. Bye.