LIMA: Less is More for Alignment
Chunting Zhou is a research scientist at FAIR Labs, Seattle. She completed her Ph.D. at the Language Technologies Institute, Carnegie Mellon University in 2022, where she was advised by Graham Neubig. She received a CMU Presidential Fellowship in LTI and a D. E. Shaw Zenith Fellowship during her Ph.D. Her research focuses on large language models and efficient architectures for training and fine-tuning foundation models. Chunting has served as an Area Chair for EMNLP 2022, ACL 2023, and NeurIPS 2023.
How do you turn a language model into a chatbot without any user interactions?
LIMA is a LLaMa-based model fine-tuned on only 1,000 curated prompts and responses, and it produces shockingly good responses.
- No user data
- No model distillation
- No RLHF
What does this tell us about language model alignment? In this talk, Chunting shares what we have learned throughout the process.
All right, so next up we have Chunting Zhou from research at Meta AI. Very excited for this talk. We will put her on stage. Hello, how are you? Hi. Thank you. Yep. Here are your slides. Take it away. Okay, cool. I'll just start. Hi, and thanks everyone for coming to the talk. I'm Chunting from Meta AI, and I'm very excited to share what we have learned from LIMA with you.
In this talk, I will try to answer three research questions. First, do we need a large amount of annotated data to turn a pretrained model into a chatbot? Second, if so, what are the critical axes when we create the annotated dataset? And third, how well can a model trained on a small number of annotated examples generalize to new tasks?
As we know, large language models are pre-trained on trillions of tokens. We propose a Superficial Alignment Hypothesis: a model's knowledge and capabilities are learned almost entirely during pre-training, while alignment teaches it which subdistribution of formats should be used when interacting with users.
So in the alignment stage, we should feed the model fine-tuning data in the right format, data that teaches the model to act as an AI assistant. With this hypothesis, we conjecture that one could sufficiently tune a pretrained language model with a rather small set of examples.
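To make the conjecture concrete, here is a minimal sketch of what supervised fine-tuning on a small curated set might look like, using the Hugging Face transformers Trainer. The checkpoint, data file, turn separator, and hyperparameters are illustrative assumptions, not the exact recipe from the paper (LIMA fine-tunes a 65B-parameter LLaMa model).

```python
# A minimal sketch of supervised fine-tuning on a small curated set.
# The checkpoint, data file, and hyperparameters below are assumptions.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "huggyllama/llama-7b"  # stand-in; LIMA fine-tunes LLaMa 65B
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# One JSON record per curated prompt/response pair (hypothetical file).
data = load_dataset("json", data_files="lima_1k.jsonl")["train"]

def to_features(example):
    # Concatenate prompt and response into one training sequence. The
    # plain-text separator is an assumption; the paper adds a special
    # end-of-turn token to distinguish speakers.
    text = example["prompt"] + "\n\n" + example["response"]
    return tokenizer(text, truncation=True, max_length=2048)

tokenized = data.map(to_features, remove_columns=data.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="lima-sft",
                           num_train_epochs=15,  # small data, many epochs
                           per_device_train_batch_size=1,
                           learning_rate=1e-5),
    train_dataset=tokenized,
    # Causal LM collator: labels are the input ids, shifted by the model.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```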
So what are the critical axes when creating the annotated data? We found that ensuring high quality and high diversity in the training data are the keys to success. LIMA is trained on 1,000 carefully curated examples, with no model distillation data from existing chatbot models and with only minor human annotation: 200 manually written examples.
So let's take a look at what makes up these 1,000 examples:
- 200 examples from the STEM Stack Exchange sites, and 200 from the other Stack Exchange sites
- 200 wikiHow examples
- 150 examples from the Pushshift Reddit dataset, specifically the r/WritingPrompts subreddit
- 50 NLP examples from Natural Instructions
- 200 examples written by our authors
To control quality for the public datasets listed above, we remove artifacts in the community data. For example, on Stack Exchange a user might refer to the answers from other posts, and we eliminated such answers because we don't want them to appear as chatbot responses in the training data. And we select the data with higher user ratings, i.e., upvotes, where available. For the in-house authored data, we set a uniform tone, and our authors follow the same format of a helpful AI assistant when writing the examples: many examples start with an acknowledgement of the question, then the actual answer, and finally a short conclusion of the answer.
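As a toy illustration of the quality filters just described, here is a sketch that keeps highly rated answers and drops ones that lean on other posts. The field names ("score", "body"), the cross-reference patterns, and the threshold are assumptions; the actual LIMA pipeline may differ.

```python
import re

# Keep highly rated answers and drop ones that reference other posts.
# Field names and threshold are assumptions about the raw community dump.
CROSS_REFERENCE = re.compile(
    r"as (mentioned|stated|pointed out)|other answers?", re.IGNORECASE)

def keep_answer(answer: dict, min_score: int = 10) -> bool:
    """Keep only well-rated answers that stand on their own."""
    if answer["score"] < min_score:
        return False  # too few upvotes
    if CROSS_REFERENCE.search(answer["body"]):
        return False  # leans on other posts; bad as a chatbot response
    return True

answers = [
    {"score": 42, "body": "Use binary search; it runs in O(log n)."},
    {"score": 3, "body": "Try harder."},
    {"score": 51, "body": "As mentioned in the other answer, see above."},
]
print([a["score"] for a in answers if keep_answer(a)])  # -> [42]
```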
To control diversity: for the public datasets, the community websites already contain a variety of topics and domains, and we sample data with a reweighted distribution to rebalance different domains and increase domain diversity. For our in-house authored data, we pay extra attention to task diversity, to cover more user scenarios, for example creating a trip plan or conjecturing on alternative history.
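A small sketch of what reweighted sampling across domains could look like: draw roughly uniformly over domains rather than over raw examples, so large communities don't dominate. Uniform-over-domains is an assumption; the talk only says the distribution is rebalanced.

```python
import random
from collections import defaultdict

# Rebalance across domains: pick a domain first, then an example from it,
# so large domains don't dominate the sampled subset.
def rebalanced_sample(examples, n, seed=0):
    rng = random.Random(seed)
    by_domain = defaultdict(list)
    for ex in examples:
        by_domain[ex["domain"]].append(ex)  # e.g. "cooking", "physics"
    picked = []
    while len(picked) < n:
        nonempty = [d for d in by_domain if by_domain[d]]
        if not nonempty:
            break  # ran out of examples before reaching n
        bucket = by_domain[rng.choice(nonempty)]
        picked.append(bucket.pop(rng.randrange(len(bucket))))
    return picked
```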
We create a test set with 300 prompts covering diverse topics and tasks, and LIMA performs quite well on it compared to some top chatbot models.
One very interesting observation: in a controlled setting, when we scale up the number of examples from the Stack Exchange data while ensuring the same quality, we don't see any improvement in generation quality, because more Stack Exchange data doesn't bring more task diversity or improved quality to the training data. Here, generation quality is measured by having a language model grade the outputs with a Likert score.
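The scoring loop for such an ablation might look like the following sketch. `judge` is a hypothetical callable wrapping a grader language model, and the rubric wording is illustrative, not the exact evaluation prompt.

```python
from statistics import mean

# Grade each (prompt, response) pair on a Likert scale with a judge
# model, then average per system. `judge` and RUBRIC are assumptions.
RUBRIC = ("Rate the helpfulness of the response to the prompt on a "
          "1-6 Likert scale. Answer with a single integer.")

def likert_score(judge, prompt: str, response: str) -> int:
    reply = judge(f"{RUBRIC}\n\nPrompt: {prompt}\n\nResponse: {response}")
    return int(reply.strip())

def average_quality(judge, pairs) -> float:
    """Mean Likert score over (prompt, response) pairs for one model."""
    return mean(likert_score(judge, p, r) for p, r in pairs)
```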
So finally, how well can a model trained on a small number of annotated examples generalize to new tasks? I want to share some fascinating generalization abilities of fine-tuning with a handful of examples. First, by adding just 30 curated dialogue examples, we found that LIMA improves greatly in dialogue conversations.
Feel free to check our examples in the arXiv paper. Second, by adding just six format-constrained examples, we find that the model can generalize to test examples from other domains and can generate long-form, highly structured responses following user instructions.
One example out of these six training examples is: review a paper from the following four aspects: summary, strengths, weaknesses, and potentials. And for the test example, here is a prompt: create a marketing plan with the following elements: marketing goals and objectives, define target audience, research marketing tactics, plan marketing tactics, and develop a timeline and budget.
And here is the example output from LIMA for the prompt we just saw. It can create a very good marketing plan that includes all of the elements the user specified.
So, to summarize the three research questions from the beginning. Do we need large amounts of annotated data to train a competent chatbot? Based on our observations with LIMA, the answer is no. Second, what are the critical axes when creating the annotated data? What we found is quality and diversity, including the domain and task diversity of the annotated data.
Third, how well can a model trained on a small number of annotated examples generalize to new tasks? We observed that after seeing a few annotated examples, the model can generalize pretty well to relevant tasks. In the end, I'd like to point out some limitations and open questions from developing LIMA.
First, we see that LIMA is still weak in coding and math, as are many other recent chatbots. One reason is that we are not building upon a very strong foundation model that has seen sufficient coding and math pre-training examples. The other reason is that the LIMA data doesn't have many well-aligned coding and math examples.
So one open question here is: what is the most effective data format for learning code, math, and other reasoning-intensive tasks?
Second, most of the training examples in LIMA are from public data sources, and these are definitely not the best possible examples. So one question worth investigating is how we can automatically and systematically discover new and diverse scenarios and create a better version of the fine-tuning dataset, even in a lifelong, continual learning setting once the model has been deployed in the wild. It's also interesting to do an apples-to-apples comparison between PPO and SFT in terms of sample and annotation efficiency. Last but not least, evaluation is hard, especially assessing truthfulness, and in particular the truthfulness of domain-specific questions, without expert annotators.
And this will be a constant issue as large language models become stronger and stronger. Okay, this is the end of my talk. Thank you, everyone.
Awesome, thank you so much. That was so insightful, and it definitely surprised me in a couple of directions, especially about needing fewer annotated pieces of data than I think people would expect. I thought that was really interesting. All right, please drop any links you think would be helpful for our attendees in the chat, and thank you so much. This was really wonderful. Thank you, Lily. Okay, I'll just leave, right? Okay. Yeah, talk to you later. Bye.