MLOps Community

Model Merging and Mixtures of Experts

Posted Mar 04, 2024 | Views 219
# Model Merging
# (MoE)
# JP Morgan Chase
Maxime Labonne
Senior Machine Learning Scientist @ --

Maxime Labonne is a seasoned Machine Learning Scientist and a thought leader in the LLM (Large Language Models) community. He is currently working at J.P. Morgan in London and holds a Ph.D. in Machine Learning from the Polytechnic Institute of Paris. An active blogger, he has made numerous contributions to the open-source community, including the LLM Course on GitHub, automated tools such as LLM AutoEval, and many state-of-the-art models and architectures like Phixtral. He is the author of the best-selling book "Hands-On Graph Neural Networks using Python," published by Packt. Connect with him on LinkedIn and Twitter @maximelabonne.

Adam Becker
IRL @ MLOps Community

I'm a tech entrepreneur and I spent the last decade founding companies that drive societal change.

I am now building Deep Matter, a startup still in stealth mode...

I was most recently building Telepath, the world's most developer-friendly machine learning platform. Throughout my previous projects, I had learned that building machine learning powered applications is hard - especially hard when you don't have a background in data science. I believe that this is choking innovation, especially in industries that can't support large data teams.

For example, I previously co-founded Call Time AI, where we used Artificial Intelligence to assemble and study the largest database of political contributions. The company powered progressive campaigns from school board to the Presidency. As of October 2020, we helped Democrats raise tens of millions of dollars. In April of 2021, we sold Call Time to Political Data Inc. Our success, in large part, is due to our ability to productionize machine learning.

I believe that knowledge is unbounded, and that everything that is not forbidden by laws of nature is achievable, given the right knowledge. This holds immense promise for the future of intelligence and therefore for the future of well-being. I believe that the process of mining knowledge should be done honestly and responsibly, and that wielding it should be done with care. I co-founded Telepath to give more tools to more people to access more knowledge.

I'm fascinated by the relationship between technology, science and history. I graduated from UC Berkeley with degrees in Astrophysics and Classics and have published several papers on those topics. I was previously a researcher at the Getty Villa where I wrote about Ancient Greek math and at the Weizmann Institute, where I researched supernovae.

I currently live in New York City. I enjoy advising startups, thinking about how they can make for an excellent vehicle for addressing the Israeli-Palestinian conflict, and hearing from random folks who stumble on my LinkedIn profile. Reach out, friend!


Model merging has recently become extremely popular in the open-source community. The idea of merging several fine-tuned models, or combining them into a Mixture of Experts (MoE), led to new state-of-the-art LLMs. This talk introduces the main concepts around model merging and how to implement it using the mergekit library. It provides a notebook to create your own models and directly upload them on the Hugging Face Hub.



AI in Production


Adam Becker [00:00:05]: We have a fascinating conversation that I really wanted to hear from Maxime. Let's see. Are you here?

Maxime Labonne [00:00:11]: Yeah, I'm here. Hello. Hello, everyone.

Adam Becker [00:00:13]: Hey, Maxime. What are we talking about today?

Maxime Labonne [00:00:17]: Today we're going to talk about model merging and mixture of experts. Like a kind of new fancy technique to do models for cheap.

Adam Becker [00:00:25]: I saw, I saw the abstract and I was like, I have this. I'm going to lose my. Stoked to hear it. Do you need to share your screen?

Maxime Labonne [00:00:34]: Yeah, I'm going to share my screen.

Adam Becker [00:00:37]: Yes, it's right here.

Maxime Labonne [00:00:39]: Cool.

Adam Becker [00:00:39]: I will be back in 10 minutes.

Maxime Labonne [00:00:42]: Okay, perfect. Thank you very much. Hi, everyone. My name is Maxime Labonne. I'm a machine learning scientist, and sometimes I also merge models on the side. And this is what we're going to talk about in this conversation. So why merging? Why should you care about merging models? So here is a good example of why this is actually relevant to the conversation. It's about the Open LLM Leaderboard.

Maxime Labonne [00:01:09]: If you check the 7B-parameter models, and this was like two days ago, all of them are actually merges. So it really tells a story, right? And it says that these models, they become really good. And you might say that's because they overfit the test sets, which is not entirely wrong, but there's more to it. They are actually quite good beyond the fact that most of them are contaminated at this point. To start this conversation, we're going to talk about merge techniques. There are not five, but four of them. I want to talk about SLERP, DARE-TIES, passthrough, and FrankenMoE. There are more in the wild, but I think these four are a good representation of the most interesting ones.

Maxime Labonne [00:01:59]: So let's start with SLERP. It stands for spherical linear interpolation. And it's a very popular technique, very easy to use. Even the idea behind it is quite intuitive because it's basically averaging the weights, right? It's linear interpolation, but spherical. So we conserve some properties in this space. It's limited to merging only two models at a time, but you have a lot of possibilities. You can define the interpolation factor for different types of layers and with different gradients, so you can do really precise work with that. The only problem is that from my experience, it doesn't matter that much.
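
As a rough illustration of the SLERP idea described above, here is a minimal NumPy sketch. It is not mergekit's implementation: the function names `slerp` and `layer_factors`, the fallback threshold, and the way the gradient is spread across layers are assumptions made for this example.

```python
import numpy as np

def slerp(t, v0, v1, eps=1e-8):
    """Spherical linear interpolation between two weight tensors of equal shape."""
    a = v0.ravel().astype(np.float64)
    b = v1.ravel().astype(np.float64)
    # Measure the angle between the two weight vectors on normalized copies.
    a_unit = a / (np.linalg.norm(a) + eps)
    b_unit = b / (np.linalg.norm(b) + eps)
    omega = np.arccos(np.clip(np.dot(a_unit, b_unit), -1.0, 1.0))
    # Nearly parallel vectors: fall back to plain linear interpolation.
    if np.sin(omega) < 1e-6:
        return ((1.0 - t) * a + t * b).reshape(v0.shape)
    s0 = np.sin((1.0 - t) * omega) / np.sin(omega)
    s1 = np.sin(t * omega) / np.sin(omega)
    return (s0 * a + s1 * b).reshape(v0.shape)

def layer_factors(anchors, num_layers):
    """Spread a short list of anchor values into one interpolation factor per layer."""
    xs = np.linspace(0.0, 1.0, len(anchors))
    return np.interp(np.linspace(0.0, 1.0, num_layers), xs, anchors)

# t = 0 returns the first model's weights, t = 1 the second model's.
w_a = np.array([[1.0, 0.0], [0.0, 1.0]])
w_b = np.array([[0.0, 1.0], [1.0, 0.0]])
merged = slerp(0.5, w_a, w_b)
```

With `t = 0.5` and two orthogonal unit vectors, the result keeps unit norm, whereas a plain average would shrink it. That is the kind of property the spherical version conserves.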

Maxime Labonne [00:02:43]: Actually, the base parameters are quite good, and if you tweak them a little, it won't drastically change the performance of the resulting merges. And as an example, here's one that I've made. It's called Beagle14-7B, and you can find it on the Hugging Face Hub. And I share all the configuration that I've used to make it so you can reproduce it. Another popular merge technique is DARE-TIES. DARE-TIES is based on two different techniques. There's TIES and then there's DARE.

Maxime Labonne [00:03:20]: And the main idea, the intuition behind these two techniques is that we want to reduce the redundancy in the model parameters. These parameters, they tend to store the information over and over. So here we are going to use techniques like pruning. Here pruning means that you're going to reset the fine-tuned weights to their original values, so to the values of the base model. And you're also going to only keep the most significant parameters, so the top-k percent most significant parameters. So that's really about cutting the redundancy. Then you're going to add other techniques. You're going to rescale the weights of the different models to make sure that they correspond.

Maxime Labonne [00:04:02]: There's also something about the sign that you need to elect, but I don't want to delve too deep into the technical details. What's really important to know is that you can merge multiple models with this technique, unlike SLERP. And so people really went crazy with this idea and you can find merges with a lot of different models. It's really interesting to see. And the idea behind it is that we want to extract a task vector, so we want to extract a vector that represents the knowledge of these models and we want to combine them in an efficient way so they retain all this knowledge in the final merged model. And as an example, here's one that I've made. It's called Daredevil-7B. And once again you can find the configuration if you're interested in reproducing it.
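
The steps mentioned above (task vectors, pruning, rescaling, sign election) can be sketched in NumPy as follows. This is a simplified illustration under stated assumptions, not the reference DARE or TIES code and not mergekit's implementation. All function names and the `density`, `drop_rate`, and `weight` parameters are hypothetical.

```python
import numpy as np

def dare_prune(delta, drop_rate, rng):
    """DARE-style step: randomly reset entries of a task vector to zero
    (i.e. back to the base model's value) and rescale the survivors."""
    keep = rng.random(delta.shape) >= drop_rate
    return np.where(keep, delta / (1.0 - drop_rate), 0.0)

def ties_trim(delta, density):
    """TIES-style step: keep only the top `density` fraction of entries by magnitude."""
    k = max(1, int(round(density * delta.size)))
    threshold = np.sort(np.abs(delta).ravel())[-k]
    return np.where(np.abs(delta) >= threshold, delta, 0.0)

def elect_and_merge(deltas):
    """Elect one sign per parameter, then average only the entries that agree with it."""
    stacked = np.stack(deltas)
    elected = np.sign(stacked.sum(axis=0))
    agree = (np.sign(stacked) == elected) & (stacked != 0)
    counts = np.maximum(agree.sum(axis=0), 1)
    return (stacked * agree).sum(axis=0) / counts

def merge(base, finetuned, density=0.5, weight=1.0):
    """Merge any number of fine-tuned tensors that share the same base tensor."""
    deltas = [ties_trim(ft - base, density) for ft in finetuned]  # task vectors
    return base + weight * elect_and_merge(deltas)

base = np.zeros(2)
merged = merge(base, [np.array([1.0, 0.0]), np.array([3.0, 0.0])])
```

Swapping `ties_trim` for `dare_prune` inside `merge` gives a DARE-flavoured variant of the same pipeline. Unlike the SLERP case, the list of fine-tuned models can have any length.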

Maxime Labonne [00:04:55]: Then we have the passthrough technique. A similar idea is the depth up-scaling by Kim et al. that made the SOLAR model. And the idea here is that you're going to concatenate layers either from different LLMs or from the same LLM. Both work, actually. It's quite experimental, right? But at the same time it really works in practice because you have more layers, more parameters, and it's been shown over and over again that you get better reasoning abilities. And these frankenmerges, they're just better. They just can answer questions that the original models could not answer.
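
To make the layer concatenation concrete, here is a small sketch that only builds the layer plan, without touching real weights. The helper `passthrough_plan` and the name `model_a` are hypothetical. The slice boundaries follow the depth up-scaling pattern as commonly described for SOLAR (a 32-layer model stacked onto a copy of itself with 8 layers dropped from each side of the seam), which should be read as an assumption here.

```python
def passthrough_plan(slices):
    """Expand (model_name, start, end) slices into an ordered list of
    (model_name, layer_index) pairs; `end` is exclusive."""
    plan = []
    for model, start, end in slices:
        plan.extend((model, i) for i in range(start, end))
    return plan

# Layers 0-23 of the first copy followed by layers 8-31 of the second copy.
plan = passthrough_plan([("model_a", 0, 24), ("model_a", 8, 32)])
print(len(plan))  # 48 layers in the stacked model
```

Passing slices with different model names gives a plan that mixes layers from several LLMs instead of one.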

Maxime Labonne [00:05:30]: So really interesting technique. People tend to make really super big models with them. So you can see like 120B models. Another one from Eric Hartford is called TheProfessor, with 155B parameters. So then the problem is how to run them. You need to have at least a 3090 or 4090. And then we have mixture of experts, or to be precise here, Franken mixture of experts. So the idea behind the mixture of experts, to be very brief about it, is that you're going to improve the efficiency because you're not going to activate the entire network, but only subnetworks.

Maxime Labonne [00:06:09]: So your experts. And you also improve performance, because you have more parameters in general that you can leverage. So these models tend to be more performant, more accurate. The problem is that they are difficult to fine-tune and they also require high VRAM capacity, because even if you only activate a subnetwork, you need to store the entire thing. An interesting technique that was developed, I think by Charles Goddard, the author of mergekit, is to combine the FFN layers of different models. So really, you grab some models, you combine the feed-forward network layers and you add a router that you can initialize in different ways. And this is how I've made the Beyonder model that you can see. You can see it's composed of four different models. Each of them has a specific, I don't know, like expertise.
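
A toy NumPy sketch of the routing idea: each token is scored by a router and only its top experts are run. This is a generic top-k mixture-of-experts layer written for illustration, not the architecture of any particular model. The names `moe_layer` and `router_w`, and the use of plain callables as stand-ins for feed-forward blocks, are assumptions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def moe_layer(x, router_w, experts, top_k=2):
    """Route each token to its top_k experts and mix their outputs by router weight.

    x: (tokens, dim) float array, router_w: (dim, num_experts),
    experts: list of callables standing in for feed-forward blocks.
    """
    logits = x @ router_w                           # (tokens, num_experts)
    top = np.argsort(logits, axis=-1)[:, -top_k:]   # top_k expert indices per token
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        gate = softmax(logits[t, top[t]])           # renormalize over selected experts
        for g, e in zip(gate, top[t]):
            out[t] += g * experts[e](x[t])          # only the selected experts run
    return out

# Four stand-in "experts"; in a FrankenMoE these would come from different fine-tuned models.
experts = [lambda v: 1.0 * v, lambda v: 2.0 * v, lambda v: 3.0 * v, lambda v: 4.0 * v]
router_w = np.array([[0.0, 1.0, 5.0, 5.0], [0.0, 0.0, 0.0, 0.0]])
y = moe_layer(np.array([[1.0, 0.0]]), router_w, experts, top_k=2)
```

Only two of the four experts are evaluated for the token, yet all four have to be held in memory. That is the efficiency and VRAM trade-off mentioned above.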

Maxime Labonne [00:07:01]: And this model works pretty well. So it shows that even if the idea is quite simple, it can be quite efficient in practice. Then we have merge recipes. In this section I want to talk about some recipes for how to do it efficiently. The first one I want to mention is the library that really powers this entire stack at this point. It's called mergekit. It was created by Charles Goddard, and it implements all the merge techniques that we talked about so far. So it's really powerful. We can do a lot of different things with it.

Maxime Labonne [00:07:38]: And as you can see, you have these nice YAML files as configurations for your merges, so you can easily share them and easily iterate over them. This is one that I've recently used to make a model, for example. And then when you merge your models, how do you know if they perform well? It's actually, I think, the most difficult problem and the one that is the most costly, because to make these merges you just need a CPU, you don't even need a GPU, so the cost is really in running them to be able to evaluate them. Unfortunately, we cannot access the LMSYS Chatbot Arena, so we have other benchmarks that can be a good representation of how good the models are to humans. I would say that if you can afford to have more benchmarks, it's good because you can get a better representation of the performance of your model. Here I want to cite the Open LLM Leaderboard, but it has some issues, unfortunately, so it's not enough. I would say you also have the Nous benchmark suite, which is different. It has the excellent AGIEval benchmark that I really like.
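
The slide with the configuration is not reproduced in the transcript, so the following is only a sketch of what a SLERP merge configuration could look like, written as a Python dictionary and serialized to JSON (which YAML parsers generally accept). The key names follow the mergekit configuration layout as I understand it. The model names `org/model-a-7B` and `org/model-b-7B` and all numeric values are hypothetical placeholders, not the speaker's actual settings.

```python
import json

config = {
    "slices": [
        {
            "sources": [
                {"model": "org/model-a-7B", "layer_range": [0, 32]},  # hypothetical model
                {"model": "org/model-b-7B", "layer_range": [0, 32]},  # hypothetical model
            ]
        }
    ],
    "merge_method": "slerp",
    "base_model": "org/model-a-7B",
    "parameters": {
        # Interpolation factor per layer type, given as a gradient across the depth.
        "t": [
            {"filter": "self_attn", "value": [0, 0.5, 0.3, 0.7, 1]},
            {"filter": "mlp", "value": [1, 0.5, 0.7, 0.3, 0]},
            {"value": 0.5},  # default for every other tensor
        ]
    },
    "dtype": "bfloat16",
}

# Serialize the configuration so it can be saved to a file and shared.
config_text = json.dumps(config, indent=2)
print(config_text)
```

Keeping the whole recipe in one small text file is what makes these merges easy to share and iterate on.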

Maxime Labonne [00:08:51]: You can see YALL, Yet Another LLM Leaderboard. This is a leaderboard that I've made and that uses this benchmark suite. Then you have EQ-Bench by Sam, really good benchmark. And you have MT-Bench, more geared toward conversations. So with that you can have a pretty clear picture of the performance of your model. Something that I want to mention too is that after you merge your models, you can do fine-tuning, and a good way of doing it is direct preference optimization, or DPO. It's like a free lunch on top of these merges.

Maxime Labonne [00:09:26]: It's quite effective, it's not too costly. The only problem is that we constantly need new preference datasets. Argilla is doing great work at providing new preference datasets almost every week, and the goal here is to make the models better. It's not to censor them, it's really to make them better. And you can also instill new behaviors. For example, if you have a dataset with a lot of conversations, it can be really helpful if you want to have a model geared toward this kind of task. Finally, as a conclusion, if you're interested and if you want to merge models, I would recommend checking the article I wrote about it, but also the LazyMergekit Colab notebook. It's a nice wrapper.

Maxime Labonne [00:10:09]: You just have to specify a configuration and click one button. So yeah, very cheap way to make models and I hope that you will enjoy it. Happy merging, everyone.

Adam Becker [00:10:22]: Maxime, thank you very much. This was indeed fascinating. It sounds on one hand pretty familiar from just like merging more classical types of models. On the other hand, we just have completely new challenges and opportunities here and it feels like a very active line of research. So it sounds like just over the next year or so we're probably going to see new frameworks and ways of thinking about this emerge, and also how you maintain benchmarks in a reliable way so that you're not just overfitting, because now you just have so many. It sounds like there's just a full jungle of research and discovery to be made there. Absolutely. Thank you very much for walking us through this.

Adam Becker [00:10:59]: And please, the blog that you wrote about this, if we could add it to the chat, that would be cool too.

Maxime Labonne [00:11:07]: Yeah, thank you very much.
