From Research to Production: Fine-Tuning & Aligning LLMs // Philipp Schmid // AI in Production
Philipp Schmid is a Technical Lead at Hugging Face with the mission to democratize good machine learning through open source and open science. Philipp is passionate about productionizing cutting-edge generative AI and machine learning models.
At the moment Demetrios is immersing himself in Machine Learning by interviewing experts from around the world in the weekly MLOps.community meetups. Demetrios is constantly learning and engaging in new activities to get uncomfortable and learn from his mistakes. He tries to bring creativity into every aspect of his life, whether that be analyzing the best paths forward, overcoming obstacles, or building lego houses with his daughter.
Discover the essential steps in transitioning LLMs from research to production, with a focus on effective fine-tuning and alignment strategies. This session delves into how to fine-tune & evaluate LLMs with Supervised Fine-Tuning (SFT), Reinforcement Learning from Human Feedback (RLHF)/Direct Preference Optimization (DPO), and their practical applications for aligning LLMs with production goals.
From Research to Production: Fine-Tuning & Aligning LLMs
AI in Production
Slides: https://drive.google.com/file/d/1CdjUn6kuocUtRiRgtm_Kttunid9zDBeC/view?usp=drive_link
Demetrios [00:00:02]: What up, people? We're back. Another day is here for this AI in production conference. Whoo. This here is our first keynote of the day. And goodness, what a keynote we've got. I'm going to call Philipp to the stage. Where you at, Philipp? How you doing, dude?
Philipp Schmid [00:00:33]: Hi, I'm good. I'm in Germany. It's already like close to six, so super happy to be here. Super excited to be like, the first starting of the day.
Demetrios [00:00:43]: There we go, dude.
Philipp Schmid [00:00:44]: Everyone is ready.
Demetrios [00:00:45]: Kick it off strong. Well, I know you got all kinds of stuff planned for us, and I am a fan who's been following your work for a while. You're doing incredible stuff on all the socials and at just about all the different conferences. So I'm going to let you do your thing. I'll be back in like 20-25 minutes. And all the questions that come through in the chat, I'll make sure to ask them. So if anyone has questions, drop them in the chat. For those people that are interested in introducing themselves and connecting with others on LinkedIn, we've got a whole channel for it.
Demetrios [00:01:25]: Click the channel. You don't need to put it into the chat. Philipp, man, I see you're sharing your screen already. I'm going to hand it right over to you and let you kick us off. Thanks.
Philipp Schmid [00:01:39]: So perfect. Okay, screen is already there. So, yeah, welcome: from research to production, fine-tuning and aligning LLMs. So who am I? I'm Philipp, I'm, as mentioned, based in Germany. I'm currently Tech Lead at Hugging Face, where I lead our partnerships with AWS, Azure, and now, since recently, Google Cloud. So if you are doing anything with Hugging Face on any cloud and it's not working, please ping me and say, yeah, Philipp, it's not working, it's your fault. My motivation is really to teach you how to fine-tune open LLMs and how you can use them for production use cases.
Philipp Schmid [00:02:15]: I started with BERT in 2018, which is, I guess, now very boring, and am now very focused on reinforcement learning from human feedback and LLMs. And I truly believe that only through open source can we use generative AI responsibly and make sure that we are not creating too much harm. And before we go into the details, I want to quickly highlight something which I think has changed over the last two years: generative AI is now in everyday life. We have ChatGPT, we have Snapchat, or Stability AI generating images. We have GitHub Copilot and CodeWhisperer helping us write faster, more, and better code. We have telecommunication companies now creating LLMs for customer support, and many more. So generative AI is already everywhere.
Philipp Schmid [00:03:03]: And what is very interesting is that generative AI is also helping us be more productive. There's a great case study from the Boston Consulting Group, who more or less tested how well their consultants do when given access to ChatGPT. They evaluated 18 realistic tasks and compared consultants with AI and consultants without AI. Consultants with AI were able to finish 12% more tasks, were 25% faster, and also generated 40% higher quality. So using generative AI and LLMs, if they are correctly fine-tuned, will help us be more productive in almost any task. And to quickly look back on the difference between a generative task and an extractive task: extractive is what we might know from previous years and traditional machine learning algorithms, where we are trying to extract some kind of information. So if we have, for example, the sentence "my name is Sarah and I live in London", I can try to extract the entities, with Sarah being a person and London being a location. Another example would be: can we classify the sentence to give it some kind of sentiment? Is it positive? Is it negative? And on the other side, generative tasks really involve creating new data.
Philipp Schmid [00:04:29]: It could be text, it could be images. We already heard earlier that we can now generate super cool songs from text input. So generative AI is all about creating something new, and extractive tasks are more about: okay, what can I do with my data? Can I cluster it, classify it? And before we go really into how this works, we maybe should look at how LLMs even learn. So LLMs mostly refer to decoder-type transformer models. BERT was an encoder, and LLMs are mostly decoders, which learn differently. In super simple words, LLMs are more or less just dumb computers trying to complete a sentence and predict the next word. So if my current input in my text field is "my", the LLM tries to find the next most probable token or word, which could give "my name is". And then it goes on, and "Philipp", and then we have "my name is Philipp". It's really this loop where we have our current input and then try to predict the next token. So if you chat with ChatGPT, it's not like one prediction the model does. It's really: okay, I have all of the input before, what's the next most likely token? And then: okay, I have all of this, what's the next most likely token? And that's how the LLM predicts or generates its text.
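To make that loop concrete, here is a minimal sketch with the transformers library, using a placeholder model and greedy decoding (always taking the most probable next token):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "gpt2"  # placeholder: any decoder-only (causal) language model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

text = "My name"
for _ in range(5):  # append five tokens, one prediction at a time
    input_ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(input_ids).logits
    # the last position holds the distribution over the next token
    next_token_id = int(torch.argmax(logits[0, -1]))
    text += tokenizer.decode(next_token_id)
    print(text)
```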
Philipp Schmid [00:05:53]: And what's super exciting for me, especially at Hugging Face and in the open source community, is that open is really an alternative to closed-source generative AI and LLMs. If we think back one year, GPT-4 was roughly announced, and ChatGPT was announced in November 2022. So closed source was really dominant, was really pushing, and we were all like, okay, wow, that's incredible, that's impossible to create with current open models. But then, in 2023, we really had a rise of open LLMs, starting in February with LLaMA from Meta AI, which more or less released overnight an LLM that was as performant as GPT-3. And then a few days after, Alpaca, a research project from Stanford University, came out, which took the original LLaMA plus some synthetic data generated with GPT-3 and created a model which could follow instructions from inputs. And from all of this we got Vicuna and Koala, which were similarly tuned Llama models.
Philipp Schmid [00:07:02]: In May we got StarCoder and StarChat, which were popular coding models. We got Falcon, and then in July we got Llama 2, which is still one of the best open models we have. Then Mistral AI came out of nowhere with Mistral 7B in September, and it really looks like open is not stopping; we are closing the gap and catching up. And that's true. The LMSYS Chatbot Arena is a leaderboard where you can chat for free with multiple LLMs and rate the responses. In there we have closed-source models like GPT-4 or Claude, but also open models like Mixtral, Llama, Yi, and others. And then the users more or less give the responses a rating, so it's really feedback- and community-driven.
Philipp Schmid [00:07:49]: And in this leaderboard, Mixtral, which was the latest model from Mistral AI, performs equally well as GPT-3.5 Turbo. So we really caught up, not with the biggest closed models, but with very capable closed-source models like GPT-3.5, and we now have open alternatives which I can almost run on my M2 Mac. So it's super fantastic. But of course, with those open models, we need to adjust them and make sure that we can fine-tune them to our own needs, to my own data, and to my use case. But maybe before we go in there, one last slide which is super important: open source might be lagging behind, but we are keeping pace. That's a chart a former colleague, Julien, created at Hugging Face, where we compared how much compute and time was invested into closed-source models and where we are in open source. And there we can definitely see, with the yellow line and the green-bluish line, that open source is keeping pace. So we had GPT-4 in early 2023, and with this assumption we will most likely have a GPT-4-level open model by the summer of this year, which is super exciting for everyone, I guess, since we can then fine-tune the models.
Philipp Schmid [00:09:12]: And when training or fine-tuning models, we typically have three different stages. The first stage is pretraining, then we have fine-tuning, and then alignment. And today we are only going to focus on fine-tuning and alignment. During pretraining, which is also the most cost-intensive stage, you more or less use the algorithm we have seen in the beginning, where we are trying to predict the next token. And this algorithm is used to learn about the world using trillions of different tokens, more or less the whole Internet. And for this we need a lot of compute. So I think GPT-4 is estimated to have cost around $100 million to train. And $100 million is not coming from nothing.
Philipp Schmid [00:09:58]: You need hundreds or thousands of GPUs, and it takes super long. So it's very cool to see that companies like Meta, Mistral, or, since yesterday, Google are releasing those open models, which we as a community or as individuals can take and fine-tune. And fine-tuning is way more accessible than pretraining: we only need one GPU, it costs way less, we need way less data, and we can really use it to adjust the model to our own tasks. So what we are going to do with fine-tuning is use the same algorithm; we are still trying to predict the next token. But what's different to pretraining is that we now have datasets which are not just a web page where we are trying to complete a text.
Philipp Schmid [00:10:44]: We really have those instructions where we have an input and an expected output. On the right side here, I provided an example for a typical story-writing task: the prompt, our input, could be "write a short story about Boston", and the output would be a story about Boston with the spirit of revolution. And what we are going to do during fine-tuning is take those inputs and outputs and try to minimize the difference between what the model generates and the expected output. So we are trying to teach the model to get as close as possible to the ground truth we have provided in our dataset. But what's really challenging with fine-tuning, especially for generative AI compared to typical classification tasks, is that it's way harder to evaluate, since, as you can see on the right side, we have a whole text. It's not only a class. We don't have "the weather is nice", okay, we can classify the sentiment as positive, and then we can create labels, score them, and get an accuracy. That's way more challenging with generative AI, since, okay, we get a story about Boston, but is it a good story? What does a good story mean for me personally? Maybe I'm the type of guy who prefers bullet points instead of written text.
Philipp Schmid [00:12:03]: So evaluating generative models is way harder and super subjective to personal needs, or maybe to what you're trying to optimize for. What are really the benefits? As mentioned already, it's super cheap, especially with techniques like parameter-efficient fine-tuning, which reduces the memory needed for training even more. We can fine-tune an LLM like Llama or Mistral for only around $10 or so. And what's also cool: we don't need millions and trillions of data samples. We can fine-tune a model with just a few thousand samples. And what is even cooler is that we can now use those pretty strong LLMs to generate synthetic data, so they generate those examples for us, maybe from a text I provide about my environment or my company or some kind of FAQ, which I then use to fine-tune the model to answer for my needs.
Philipp Schmid [00:12:58]: So we're already talking about instruction datasets, and what makes a good instruction dataset is as critical as it gets. If we have bad data, our model will be bad. It's the same story as we had for traditional machine learning: garbage in, garbage out. So we really need to make sure we have a good instruction dataset. And those mostly consist of, first, clear instructions. We should be very specific about what the model should do. You might already have experienced this when chatting with ChatGPT: if you are not super specific and clear about what the model should do, it might generate something which is not what you want. So we need clear and specific instructions.
Philipp Schmid [00:13:40]: We need a diverse set of topics and tasks. For example, if you have a dataset of 10,000 instructions and 9,000 of them are about writing stories about Boston and different companies, locations, and cities, and then you want to use the model to do math or question answering, it might not perform as well, since the majority of the dataset was about content writing. We definitely need consistent formatting. This means all of our instructions should be structured the same. An example could be the Alpaca format, where we have, similar to markdown, some hashtags and then the instruction with our text, and the response is structured the same way, with those hashtags and then the response (see the sketch below). And still, human feedback is super critical. If we go back to our story about Boston: maybe we have a story about Boston, but the better or the more differentiated the story is, the better we can tune our model. So if we want to create a question-answering system, and the answer should not only be the answer, maybe we want an answer which acknowledges the question and adds some more detailed information, we can help tune the model to listen to our needs by having good data. An example of a good instruction dataset is Alpaca, which is also the name of the fine-tuned model from Stanford.
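To illustrate the consistent-formatting point, here is roughly what an Alpaca-style template looks like when applied to the Boston example (the wording of the sample itself is made up):

```python
# Alpaca-style formatting: every training sample is rendered into the same
# markdown-like template with "###" section headers before fine-tuning.
ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Response:\n{response}"
)

sample = {
    "instruction": "Write a short story about Boston.",
    "response": "In the spirit of revolution, the city woke before dawn...",
}
print(ALPACA_TEMPLATE.format(**sample))
```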
Philipp Schmid [00:15:07]: And what Stanford did here is basically use text-davinci-003, so the original GPT-3 model, with a method called Self-Instruct to generate instruction data. They used the LLM, seeded with 175 seed instructions, to generate roughly 50,000 instruction-following examples, like "brainstorm creative ideas for designing a conference room" with an appropriate output. A more advanced method than Alpaca, or than that synthetic approach, is WizardLM, or the method Evol-Instruct, where you more or less try to evolve instructions and make them more complicated. You start with a seed instruction, which could be, I hope you can see it, the initial instruction "1+1=?", and then one method from Evol-Instruct is deepening, which could turn it into:
Philipp Schmid [00:16:00]: "In what situation does 1+1 not equal 2?" And those different prompts are then run against a good LLM, and the good LLM, the already strong, big LLM, generates responses, and then we have new prompt/response pairs we can use to fine-tune smaller models. What's really cool about those synthetic instruction methods is that you can apply them to your own domain or your own context. So if you work in the medical domain, in healthcare, pharmacy, or agriculture, you can more or less use a set of initial seed prompts and an already existing LLM like Mistral, Mixtral, or GPT-4, provide your unstructured, domain-specific text, and with the Evol-Instruct method create those seed prompts and then more complex prompts for your domain. And then you end up with a huge set of instructions for your domain, which you can use to fine-tune your model, and which can then generalize to new contexts and new use cases in the same domain (a rough sketch of one such "deepening" step follows below). And how can you do all of that? How can I train my model? At Hugging Face, and I'm pretty sure not every one of you has heard about it, we have TRL, a pretty new library that makes it super easy for you to fine-tune for instruction following and also, later, for alignment, where we provide trainers. From transformers we have the Trainer, which makes it super easy to train transformers in general, and in TRL we have generative AI trainers like SFT, which stands for supervised fine-tuning, which you can use to easily fine-tune models for instruction following.
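Coming back to the synthetic data idea for a moment, here is a rough sketch of what one Evol-Instruct-style "deepening" step could look like. The deepening prompt wording, the choice of Mixtral as the strong LLM, and the use of the Hugging Face Inference API are assumptions for illustration, not the exact WizardLM recipe:

```python
# Sketch: evolve a seed instruction into a harder one with a strong LLM,
# then let the same LLM answer it, yielding a new (prompt, response) pair.
from huggingface_hub import InferenceClient

client = InferenceClient("mistralai/Mixtral-8x7B-Instruct-v0.1")  # assumed teacher model

seed_instruction = "1+1=?"
deepen_prompt = (
    "Rewrite the following instruction into a more complex version that "
    "requires deeper reasoning, without changing the topic. "
    f"Return only the new instruction.\n\nInstruction: {seed_instruction}"
)

evolved_instruction = client.text_generation(deepen_prompt, max_new_tokens=128)
response = client.text_generation(evolved_instruction, max_new_tokens=512)

# (evolved_instruction, response) becomes one synthetic training example
print(evolved_instruction)
print(response)
```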
Philipp Schmid [00:17:43]: And with the SFT trainer, all you need to do is create your dataset, as JSON for example, where you have a prompt and a completion, and then use the SFTTrainer. The trainer abstracts all of the heavy lifting away from you, so it takes care of how to format it correctly, how to tokenize it, how to select the right batch size. What's super cool about the SFT trainer is that it directly supports PEFT. So if I want to use parameter-efficient fine-tuning with LoRA or QLoRA to save more memory, I only need to provide a config (a minimal sketch follows below). And for this we created, first of all, examples in the TRL documentation, and I wrote a nice blog post, I think in early January, about how to fine-tune LLMs in 2024. It's really an end-to-end walkthrough starting from: I have a use case, does the use case even make sense for fine-tuning, and then what steps do I need to create and train the model. So if you are interested in fine-tuning your model, definitely check it out. And what do we want to do after fine-tuning? Normally fine-tuning might be enough for like 80-90% of your use cases. You can get started, you can create a proof of concept, you can collect feedback from your users to improve your dataset and your model, and then iterate on it.
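Here is the minimal SFTTrainer sketch referenced above. The dataset path, model id, and hyperparameters are placeholders, and exact argument names differ a bit between TRL versions, so treat this as a sketch rather than copy-paste:

```python
# Sketch: supervised fine-tuning (SFT) with TRL and a LoRA (PEFT) config.
from datasets import load_dataset
from peft import LoraConfig
from transformers import TrainingArguments
from trl import SFTTrainer

# one JSON record per line, already rendered into a single "text" field,
# e.g. {"text": "### Instruction:\n...\n\n### Response:\n..."}
dataset = load_dataset("json", data_files="train.jsonl", split="train")

peft_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM")

trainer = SFTTrainer(
    model="mistralai/Mistral-7B-v0.1",   # base model to fine-tune
    train_dataset=dataset,
    dataset_text_field="text",           # column holding the formatted sample
    max_seq_length=2048,
    peft_config=peft_config,             # LoRA/QLoRA-style parameter-efficient training
    args=TrainingArguments(
        output_dir="mistral-7b-sft",
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        num_train_epochs=3,
        learning_rate=2e-4,
    ),
)
trainer.train()
```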
Philipp Schmid [00:18:59]: And what we now see in research, and also in the closed-source domain, is that after we have our model, which can generate responses to our instructions, we want to align it. And aligning can have different meanings. For ChatGPT or Gemini, aligning is mostly about preventing harmful outputs. You have probably heard the "as an AI" or "as a language model I cannot" responses, and so on. And this is achieved by using alignment techniques like RLHF. But alignment can also be used to improve the outputs. What I mean by this, we can look at with a practical example. I found this Stack Overflow question where someone asked: is it possible to train a model with XGBoost that has multiple continuous outputs (multi-output regression)? What would be the objective of training such a model? For this we got six submitted answers, and I collected two of them.
Philipp Schmid [00:19:59]: The left one is a text-only answer: you can use linear regression, random forest regressors, and so on to achieve your training. The right one is: my suggestion is to use, and then a link to the documentation of a wrapper, and also a code snippet. And what we see here really comes back to the earlier point that evaluating and training those models is super hard, because theoretically both responses are correct. Both answer the question: yes, you can train a model like this using XGBoost. But both are very different. The right one has links to documentation and a code snippet, and the left one is just text. And me, as a developer, I would always prefer the right one.
Philipp Schmid [00:20:48]: It's way easier to get started: I can look up the documentation, I can maybe copy the code snippet and already try it. And that's where RLHF can really help us. We can try to shift our model towards the output most preferred by humans. Of course, that can differ in terms of what output you expect; maybe it's the bullet points versus non-bullet points. But by using RLHF, we can really try to push the model in a certain direction in how it should generate responses. And for this we need comparison data.
Philipp Schmid [00:21:21]: So as you have seen in the screenshot before, we have two different examples where we need to say which is better. There are different techniques where you have not two but more examples which you score. On the right side we again have our story about Boston, with three different examples. We could rank those and say, okay, that's the best story, the one I want, and the other ones are equally good or less good. And then we can teach the model what the most preferred output is and have it learn to generate that. So we have either those ranking scores or pairwise comparison data, as we had from Stack Overflow, which is also now the most common format we see used in the open community, since it's way easier to get only two responses for something and compare whether answer A or answer B is better (see the example record sketched below). And why are we going to do all of this? To put some numbers behind it: on the left side we have the Elo score from Anthropic, who also used RLHF for their Claude model to see whether they could improve the helpfulness of the model. And with helpfulness they refer to responses which are more helpful for the human.
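As a sketch of what one pairwise comparison record could look like, using the Stack Overflow example from above (the field names follow the common "chosen"/"rejected" convention; the content is paraphrased):

```python
comparison_example = {
    "prompt": (
        "Is it possible to train a model with XGBoost that has multiple "
        "continuous outputs (multi-output regression)?"
    ),
    # the answer humans (here: developers) prefer
    "chosen": (
        "Yes. My suggestion is to use a multi-output wrapper around XGBoost, "
        "see the linked documentation, and here is a short code snippet: ..."
    ),
    # a correct but less helpful answer
    "rejected": (
        "You can use linear regression, a random forest regressor, and so on "
        "to achieve your training."
    ),
}
```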
Philipp Schmid [00:22:39]: So if we stick with our code example, for me the more helpful answer would be the one with a code snippet inside of it. On the right side we have a similar test from the original InstructGPT paper from OpenAI, where they compared PPO, which is the RLHF line here, the yellow one, to SFT, which stands for the supervised fine-tuning method. And in both papers we can see that by applying those techniques we are able to push the model into a space where it more likely generates outputs that are preferred by us humans. Of course, doing this is not super easy, since for PPO you need an additional reward model with more data and a complex infrastructure, because you need to keep multiple models in memory, and only a few labs can use it in a very robust and stable way. Luckily for us, I would say the GPU-poor people, there are alternatives. One is Constitutional AI, which is similar to reinforcement learning from human feedback, but instead of using humans who compare the data and say the left one is better than the right one, we use AI, which is now a common pattern. I would say for most of the comparison data we see in the open source space, we, for example, use GPT-4: we have two outputs, from Llama and a different model, and then ask GPT-4 which of those examples is better.
Philipp Schmid [00:24:06]: And there we have seen that GPT-4's preferences roughly match human preferences 80-90% of the time. So we can substitute humans with AI. And an alternative to PPO, which is the method used by Anthropic and also, if I read the Gemma paper correctly yesterday, by Google, and by OpenAI, is DPO, which stands for Direct Preference Optimization. It's a much easier method where you train directly on the comparison data, so you need one less step in your RLHF pipeline. A good example is the Zephyr model from Hugging Face. The team at Hugging Face trained a Mistral model using DPO on a comparison chat dataset, and they completely open sourced all of the code for how to train it. So if you are interested in what DPO is, that's definitely a great starting point. DPO is cheaper than PPO, since we don't need to train an additional model to do the ranking, and we can use pairwise comparison data.
Philipp Schmid [00:25:12]: And if you want to get started with DPO, similar to SFT or normal fine-tuning, we have a trainer in TRL, the DPOTrainer, where, similar to SFT, you need to create your dataset. For DPO we need this pairwise comparison dataset, and the rest is pretty much what you know from using transformers, so super straightforward (a minimal sketch follows below). And as of this week, similar to the guide for instruction fine-tuning, I created a guide for DPO which walks you through end to end: what is important to keep in mind, how do I need to create my dataset, how can I use TRL, what do I need to do to load my dataset, all the way to training the model. There we use a Mistral model, apply DPO, and at the end we evaluate it. And the result is that the responses from our DPO model are more likely to be preferred than those of the SFT model, which is a great example showing that it really helps with aligning, or moving the model into the more preferred domain. Thanks for listening. I hope you have some questions we can answer now.
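And here is the minimal DPOTrainer sketch referenced above, continuing from a hypothetical SFT checkpoint. Model ids, the dataset path, and beta are placeholders, and newer TRL versions move some of these arguments into a DPOConfig:

```python
# Sketch: direct preference optimization (DPO) with TRL on pairwise data.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import DPOTrainer

model_id = "my-org/mistral-7b-sft"  # hypothetical SFT checkpoint from the previous step
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# columns: "prompt", "chosen", "rejected"
dataset = load_dataset("json", data_files="preferences.jsonl", split="train")

trainer = DPOTrainer(
    model,
    ref_model=None,   # TRL keeps a frozen reference copy of the model internally
    beta=0.1,         # how strongly the policy is tied to the reference model
    train_dataset=dataset,
    tokenizer=tokenizer,
    args=TrainingArguments(
        output_dir="mistral-7b-dpo",
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        learning_rate=5e-7,
    ),
)
trainer.train()
```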
Demetrios [00:26:26]: Oh yeah, dude, excellent. Thank you so much. So there's a few questions coming through, and we'll give it a minute because it takes a little bit of time for people to ask their questions and for the stream to get to the platform. But one question that I saw come through, and there was an answer in the chat, but it's also cool to get your opinion on it: are you using any GUIs for RLHF training?
Philipp Schmid [00:26:58]: For training itself, or for creating this comparison data and looking into the data, saying: is that the better answer, or is that the better answer?
Demetrios [00:27:10]: Either, I would say.
Philipp Schmid [00:27:13]: Training, I'm not sure why you would need a GUI. I think what's definitely helpful if you have those kind of monitoring tools like weights and biases or tensorboard where you can see the performance of your training. And for DPO, you really want to look at the reward margin metrics, which is basically telling you, okay, is the model learning to correctly generate the more accepted one? And for the data preparation method, I think there are multiple ones, with aquila is one, which is pretty popular in the open community. And then of course you can just use the Jupyter notebook and Orc radio where you iterate through your data sample and look at it manually.
Demetrios [00:27:56]: Excellent. So we've got a few more questions that came through the chat. When you're creating synthetic data, how can you ensure that the data is good or useful?
Philipp Schmid [00:28:11]: That's a very good question, I think, and it's still a bit unsolved; currently the people generating the best synthetic data are probably the biggest experts out there. But a good rule, and I think this will never go away in machine learning, is that you need to look at your data. So yes, you can use LLMs and AI to generate data, but you always need to look at it and see: how does it look, is it what I need? But of course you can also use LLMs and AI to validate your data. It's very interesting, since you can use them to generate new data, and then you can prompt LLMs with some kind of criteria for what a good data example is and have them look at all of your examples. So if you can create some kind of constitution or scoring rubric, "please rate this input and output based on these criteria", for example the question should be complex and about math, that can be a way for you to filter out: that's not a math question, that may be something about, I don't know, art (see the sketch after this answer). That can definitely be used. And then, of course, what makes good data? As we had in our instruction dataset section, I need a diverse set of data, and that's where, when using AI or synthetic data, you really need to look at it, because when we are using those techniques we generate a lot of data, thousands or hundreds of thousands of samples.
Philipp Schmid [00:29:32]: And then we need to make sure at the end that we have sections of roughly equal size for the different domains, not that we end up with 990,000 examples on world knowledge when I want to do something with physics.
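A rough sketch of the LLM-based rating and filtering idea mentioned in this answer. The judge model, the rating prompt, and the threshold are assumptions for illustration:

```python
# Sketch: score synthetic samples with an LLM judge against explicit criteria,
# then keep only the samples above a threshold.
import re
from huggingface_hub import InferenceClient

judge = InferenceClient("mistralai/Mixtral-8x7B-Instruct-v0.1")  # assumed judge model

RATING_PROMPT = (
    "Rate the following question/answer pair from 1 to 5 for correctness, "
    "complexity, and whether it is really a math question. "
    "Answer with the number only.\n\nQuestion: {q}\nAnswer: {a}\nScore:"
)

def keep(sample: dict, threshold: int = 4) -> bool:
    verdict = judge.text_generation(
        RATING_PROMPT.format(q=sample["prompt"], a=sample["response"]),
        max_new_tokens=4,
    )
    match = re.search(r"\d", verdict)
    return bool(match) and int(match.group()) >= threshold

synthetic = [{"prompt": "In what situation does 1+1 not equal 2?", "response": "..."}]
filtered = [s for s in synthetic if keep(s)]
print(f"kept {len(filtered)} of {len(synthetic)} samples")
```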
Demetrios [00:29:47]: Yeah, that's not going to be very useful. That's a good call. This is an interesting one: what are the challenges with RLAIF?
Philipp Schmid [00:30:01]: So RLAIF, which uses AI feedback instead of human feedback, more or less also falls into the synthetic data generation category. So I need to really make sure that the feedback I use for the comparison data is what I really expect and want it to be. And then it really depends on where I apply RLAIF. What Anthropic did with Constitutional AI is they really replaced the human feedback in RLHF and PPO with AI feedback. So they trained a reward model on the synthetic data and then used this reward model to rank the regular LLM outputs. But what we see in the open community recently is that we use the AI feedback for these DPO datasets, where we have a good example and a bad example and then train directly using DPO, which is definitely way easier to get started with and where we see really good results.
Demetrios [00:31:00]: Okay, so another cool one here coming from Bret. Red teaming adversarial testing is becoming a popular topic of discussion. It seems that most of these talks are human powered and not most of these tasks, not talks. You can see where my head's at. Most of these tasks are human powered and not automated. What are some automated tools for assessing safety, security, accuracy and quality that could be built or are in use today?
Philipp Schmid [00:31:28]: Yeah, that's a very good question. So it really depends on your use case. If you are creating a public-facing application like ChatGPT or others, safety and ethics and making sure no harmful content is created are very important. But if you are working in a company environment, the situation might be different. A good starting point, of course, is the Hugging Face Hub: we have an ethics team which creates Spaces, content, and libraries on how you can detect biases in datasets. And another good way would be using AI for it again. So Meta published a model called Llama Guard which can help you classify and identify harmful inputs and outputs, so that you basically add some kind of pre- and post-processing in between (a rough sketch follows after this answer).
Philipp Schmid [00:32:16]: But yeah, I think it was correctly mentioned that red teaming is very important, since there humans look at the data and at what the model generates. Of course you can support it with AI to make it more productive, but in the end I still think we need real experts looking at those topics to help us classify what we want to achieve.
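A rough sketch of the Llama Guard-style pre/post-processing mentioned above. The model id, the chat-template handling, and the "safe"/"unsafe" output convention follow the published model card at the time; treat the details as assumptions and check the current card before relying on them:

```python
# Sketch: screen a user message (or a model response) with Llama Guard
# before forwarding it to (or from) the actual LLM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

guard_id = "meta-llama/LlamaGuard-7b"  # gated model, requires access approval
tokenizer = AutoTokenizer.from_pretrained(guard_id)
guard = AutoModelForCausalLM.from_pretrained(
    guard_id, torch_dtype=torch.bfloat16, device_map="auto"
)

def is_safe(user_message: str) -> bool:
    chat = [{"role": "user", "content": user_message}]
    input_ids = tokenizer.apply_chat_template(chat, return_tensors="pt").to(guard.device)
    output = guard.generate(input_ids=input_ids, max_new_tokens=20, pad_token_id=0)
    verdict = tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True)
    # the model answers with "safe" or "unsafe" plus a violated category code
    return verdict.strip().lower().startswith("safe")

if is_safe("How do I fine-tune Mistral 7B on my FAQ data?"):
    pass  # forward the request to the production LLM
```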
Demetrios [00:32:41]: Have you used the self play fine tuning technique? If yes, what is your take on it and does it overfit the LLM?
Philipp Schmid [00:32:49]: That's a super great question. So for those of you who don't know, I think the method refers to SPIN, where you more or less create the comparison dataset, with the chosen and rejected examples, using the model itself, by training iteratively. So you have a good example, you have an LLM, you generate results for your prompt, then you have a bad example from the model, and then you train it again and basically repeat the step to slowly improve the model. Currently at Hugging Face we are working on an integration into TRL so that you can train and test it yourself, since it looks very promising from the paper. But we haven't run experiments yet. So if you stay tuned for, I would say, two to four weeks and follow the TRL team on Twitter, you will definitely see some experiments and whether it is a viable alternative method.
Demetrios [00:33:43]: Method, and I think this will be the last one for you. This has been, well, okay, maybe one more if we can squeeze it in, because we've got another awesome talk coming up. But can the LLM be trained to output a decision tree? Can wizard LM do? Does that make sense to you?
Philipp Schmid [00:34:09]: Yeah, I think kind of. So what we have seen with the WizardLM-like decision tree on the right side, that was more a visualization of how you get from your initial prompt to those different alternative prompts. So that's not one output; you use the initial prompt and an LLM to generate a new prompt, and there are different ways you can go, like deepening or changing the topic. For outputting a decision tree, if it refers to some kind of chain-of-thought prompting, where you tell the model that it should explain, more or less, how it got to the answer, which I guess is some kind of decision tree, then that's probably the method. And there's also something called tree-of-thought prompting, where you use different LLM outputs to create some kind of decision tree to end up with a better generation. So maybe that's what you are looking for.
Philipp Schmid [00:35:05]: So I can try to look up like tree of thought. It's somewhat similar to chain of thought prompting, but different.
Demetrios [00:35:13]: So the last one for you: in your opinion, is DPO better than PEFT, P-E-F-T?
Philipp Schmid [00:35:23]: So PEFT and DPO are not really the same thing. Parameter-efficient fine-tuning helps you fine-tune models using less compute and memory. With PEFT we only tune a few parameters in the model, and you can use DPO and PEFT together. So you can use DPO to align your model using PEFT, where you only fine-tune a few parameters inside your model and therefore need less memory. Like the blog post I had on the slide: if you look it up in a Google search, something like "DPO in 2024 with Hugging Face", you will see how to use PEFT together with DPO to align a Mistral model.
Demetrios [00:36:06]: Which kind of just dovetails into this other question someone's asking, which is to explain the main differences between model alignment and fine-tuning.
Philipp Schmid [00:36:20]: No, I can totally understand why this question comes up. It's super hard, because in the end we are fine-tuning and aligning models, and pretraining is also more or less a fine-tuning method, just on bigger data. With aligning, the research community mostly refers to teaching certain behavior or using reinforcement learning, since we don't really have this input-output structure; it's more that we are trying to move the model into a certain space. And aligning, when we think back to ChatGPT and those closed models, is about making sure that we don't generate harmful content and focus on more harmless and more helpful content. There we try to align the model to be less harmful, which probably makes sense from that point of view, but those techniques can also be used for just making the model more helpful, or for trying to inject some kind of behavior, like: I always want to generate code in my outputs, more or less.
Demetrios [00:37:26]: Excellent, dude. This has been fascinating. There's a lot more questions coming through in the chat, but I'm going to ask everyone to hit up Philipp on social media. The man is a popular dude. You can follow him on Twitter or LinkedIn and get all kinds of goodness every day like I do. Philipp, thanks so much for coming on here and doing this, man. This was an absolute pleasure.
Philipp Schmid [00:37:54]: Great. Have a nice day.