MLOps Community

Building and Curating Datasets for RLHF and LLM Fine-tuning

Posted Jun 20, 2023 | Views 1K
# LLM in Production
# RLHF
# LLM Fine-tuning
# Argilla.io
# Redis.io
# Gantry.io
# Predibase.com
# Humanloop.com
# Anyscale.com
# Zilliz.com
# Arize.com
# Nvidia.com
# TrueFoundry.com
# Premai.io
# Continual.ai
# Genesiscloud.com
# Rungalileo.io
speaker
Daniel Vila Suero
CEO & Co-Founder @ Argilla
SUMMARY

This workshop focuses on the crucial task of constructing and managing datasets specifically designed for reinforcement learning from human feedback (RLHF) and large language model (LLM) fine-tuning. Let's explore the utilization of Argilla, an open-source data platform that facilitates the integration of human and machine feedback. Participants will learn effective strategies for dataset construction, including techniques for data curation and annotation. The workshop aims to equip attendees with the necessary knowledge and skills to enhance the performance and adaptability of RLHF and LLM models through the use of Argilla's powerful data management capabilities.

TRANSCRIPT

Introduction

Hi. Thanks for the great intro. I'm glad to be here, and I'm happy to see some well-known faces, so thanks for attending. I'm really excited because we are going to present some of the latest things we've been doing for the Argilla platform. I know some members of our community are here, and I also expect new people to join as well.

So it's a great opportunity to cover the new things we are doing, and also an interesting and, I think, useful topic for everyone. So I will share my screen.

Okay, great. As we said, I am going to talk a little bit about Argilla, but my goal is to talk more about general practices and give a general overview of this topic of data collection for LLM fine-tuning and RLHF.

I want to mention that Argilla is useful for any other NLP task, and that's what we've been doing for the past two years. Before that we had also been working on a lot of NLP projects. So you can use Argilla for LLMs, but you can also get a lot of value from Argilla for predictive NLP models.

So I invite you to go to the GitHub page, where you will see how to get started. It should be really easy. We have an integration with Hugging Face Spaces, which means you can have a quick-start Space to play with Argilla. Then, if you want to experiment a little bit more, we provide Docker installations and other options.

Most of the things I'll be discussing here were introduced a few days ago, so they are pretty new. It's what we call Argilla Feedback, or Argilla for LLMs. If you want to know why we built this and what we plan to do in the next months, I invite you to read the blog post, because there we position where we want to go and why Argilla is different from other things.

In this talk I will be covering several topics. First, I'll try to define what we understand by human feedback in the context of language models, and then why we need it. Then I'll briefly discuss some of the components of the LLM life cycle.

Then I will also mention an important topic, which is LLM evaluation. Then I'll go through some of the stages of the LLM life cycle, from prompt collection to supervised fine-tuning and so on. At the very end, I will do a demo and I will also announce a surprise, a special prize for this conference.

So I invite you to stay until the end. So what's human feedback? I have this definition from a paper, but it was really just an excuse to point at this paper, which I found really interesting and full of insights. It's very recent, and it covers all the different uses of human feedback to improve generative models.

It's a survey, so you can get a lot of insights if you plan to use human feedback to improve your pipelines. The definition looks very complex, but what it is saying is that you have an input, you can have one or more outputs, and then you have some feedback.

This feedback can be of different natures and can come from different places. So it's a rather complex way to explain that there's a human providing feedback about specific outputs. We believe that feedback is not only about outputs; it can also be about inputs, and about many other things in the pipeline and in the data.

For example, you could use Argilla to score prompts. A prompt is not really an output, it is an input, but scoring it can be very useful to ensure the quality your training methods need, as we will see later. This is the explanation of the same formula by ChatGPT, the ELI5 of that formula.

Basically, it's talking about building a tower y from x. The feedback is the advice from mommy and daddy, and the goal is of course to build better towers. For models it's essentially the same: we want to gather feedback to make these models much more robust and much better than they would be without this feedback.

Going a little bit more into the specific details of what human feedback is, I built this human feedback UI in Argilla. Basically, Argilla provides a way to define your own custom UI for providing feedback. By multi-aspect and custom feedback, we mean that you can define any set of questions.

On the right side, we see different questions. One of them is a rating, to rate the prompt. Another is a ranking, to give your preference over the responses. And then you can also ask for natural language feedback from your users, or from your colleagues if you are doing evaluation.

In this case, what we see is a prompt asking a Python question, and then we have two different responses, maybe from the same model, we don't know. Our goal as humans is to provide feedback about the full data point. As I mentioned, it's not only about the outputs; it can also be about the quality of the inputs, which is essential for some of the stages that we will see later.

This is how you build the UI that I just showed. The idea is that you build a dataset in Argilla through the Python SDK, you define a set of questions, and then you start feeding data points there. Then you can have multiple annotators go into the dataset and provide feedback.

Then you can read this data back to train models, evaluate models, or do something else with it. What we see there is the definition of the fields, so which fields will be shown to the user, and also which questions will be asked. An important part of this process is the guidelines.

The annotation guidelines are really important, and we will see some examples of how not defining these guidelines well can cause problems.
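To make the idea of fields, questions, and guidelines concrete, here is a minimal sketch in plain Python. It does not use the Argilla SDK: the class names (`Rating`, `Ranking`, `FreeText`, `FeedbackSchema`), the question names, and the example record are all hypothetical and only illustrate the shape of such a dataset definition.

```python
from dataclasses import dataclass


# Hypothetical stand-ins for question types; these are NOT Argilla SDK classes.
@dataclass
class Rating:
    name: str
    values: tuple = (1, 2, 3, 4, 5)

    def is_valid(self, answer):
        return answer in self.values


@dataclass
class Ranking:
    name: str
    options: tuple = ("response-1", "response-2")

    def is_valid(self, answer):
        # A valid ranking lists every option exactly once, best first.
        return isinstance(answer, list) and sorted(answer) == sorted(self.options)


@dataclass
class FreeText:
    name: str

    def is_valid(self, answer):
        return isinstance(answer, str) and answer.strip() != ""


@dataclass
class FeedbackSchema:
    guidelines: str   # instructions shown to every annotator
    fields: tuple     # record fields displayed in the UI
    questions: tuple  # what annotators are asked about each record

    def accepts_record(self, record):
        return all(name in record for name in self.fields)

    def accepts_response(self, response):
        by_name = {q.name: q for q in self.questions}
        return all(
            name in by_name and by_name[name].is_valid(answer)
            for name, answer in response.items()
        )


schema = FeedbackSchema(
    guidelines="Rate the prompt, rank the responses, and explain your choice.",
    fields=("prompt", "response-1", "response-2"),
    questions=(Rating("prompt_quality"), Ranking("preference"), FreeText("comment")),
)

# Made-up data point and annotator response, for illustration only.
record = {
    "prompt": "How do I reverse a list in Python?",
    "response-1": "Use my_list[::-1].",
    "response-2": "Lists cannot be reversed.",
}
response = {
    "prompt_quality": 4,
    "preference": ["response-1", "response-2"],
    "comment": "The second response is wrong.",
}
```

Keeping the guidelines inside the schema, next to the questions, reflects the point made in the talk: the questions alone do not tell annotators how to answer them.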

Just to summarize a little bit more about feedback. We have inputs; for example, the input is "Should I add chorizo to my paella?" The output is nothing, and the feedback is actually a completion. This will be one of the first stages of the LLM fine-tuning process, which is called supervised fine-tuning or instruction tuning.

Basically we are going to ask our experts or users to provide the model response. So in this case, we provide them with an input, a prompt, and then the user will come up with this feedback, which is "Absolutely, chorizo is a popular ingredient..." That is feedback of the natural language type.

But then you can have the same input with an output, and you can ask users to rate the quality of the response. In this case, the user is saying it's not so good, if we assume the scale goes from zero to one, with zero being bad and one being really good. Another type of human feedback can be a ranking, which is very important for the last part of the talk, which is about human preference.

In this case, we are giving the users one input and several outputs, and we are asking them to rank the outputs by preference. Preference has a really wide definition, but I will try to explain the type of things you try to communicate to your labelers or your experts in order to define what's preferred.

Of course that depends on the use case, but you can have a set of general guidelines. Another type of feedback, more related to traditional NLP, is categorical or binary. For example, you can show the inputs and the outputs, and you can ask the users to say if it's harmful content, if it's positive, negative, and so on.

All these kinds of things can be easily defined with Argilla, and you can in fact define your own set of questions and guidelines for gathering human feedback. Why do we need human feedback? This is a very famous representation of the process of going from large language models trained on internet-scale data to something that is more acceptable: supervised fine-tuning, and then RLHF, which is kind of like controlling a huge beast that has read really harmful and really violent content, and steering this huge monster into something that is acceptable to humans.

As another mention of the paper I pointed at at the beginning: training LLMs on internet-scale data can generate toxic, inaccurate, and unhelpful content, and automatic evaluation metrics often fail to identify these behaviors.

So as models become more capable, we are going to need more and more human feedback. Probably it's going to be less quantity but more quality, and that's something I will discuss at the end as well. This is a famous figure as well, showing the effect of adding human feedback in different stages.

You go from this internet-scale LLM to something that is prompted, so a prompt including some guidelines on how to behave or follow an instruction. Then comes the point where you really fine-tune the model to follow instructions and do the things you are asking. And the last step, the InstructGPT one, is a model that has gone through this RLHF process that I will describe at the end.

So how do we define preference, or how do we define when a model is aligned with human values or with human preference? The Anthropic people came up with this acronym, HHH, or 3H, I don't know how to pronounce it. Basically, they divided this into three different factors or properties.

You can read the paper, where they describe them in more detail than I will do here. The first H stands for helpful. A model is helpful if it resolves queries, seeks necessary details, responds empathetically, and provides suggestions.

The way to measure this is by human preference ranking, as in the example that we saw at the beginning about having two outputs and asking the human to rank them. You can also measure this by preference model scores, which come from the model that you train with the rankings, as we will see later.

Something that is also becoming popular in the domain of LLMs is Elo scores, which are basically a way to compare different models and to compute the probability of winning a comparison. The next H is harmless. A model is harmless if it avoids offensive behavior.

And all these things that we know. This is probably the best-defined property. Unlike helpful, which is sometimes difficult to define, toxic, sexual, or violent content is much easier for us to identify. The way to measure this is through binary questions. We will see in some examples that you are going to ask labelers and users to provide these kinds of binary inputs as feedback.

Then there are a lot of bias and toxicity benchmarks, and some works have also been doing preference modeling for harmlessness. Here they are using a ranking, but the ranking compares across this dimension: rather than the helpfulness dimension, the preferred output will be the less harmful one.

Notably, this is done by the [inaudible] work. Then there is the last H, which is not very well defined and has been dropped after this paper, as far as I can see. This is also discussed in the paper I mentioned about human feedback. It is the honest part. The model is honest if it provides accurate information.

So there we are talking about accuracy and truthfulness, but an honest model also expresses uncertainty. One of the interesting things about LLMs and RLHF is that we don't know what the model knows. This internet-scale LLM might know a lot of things, but we actually don't know if it's making up the information or if it already contains this information.

So what the OpenAI people did is say: okay, we are not going to talk about honesty, we are going to talk about truthfulness. Then you can come up with ways to actually measure the accuracy of the responses.

You can also measure whether the model is actually making up some information. The most notable benchmark for this is the TruthfulQA dataset, which you will also see in the LLM leaderboard by the Hugging Face people, where you can see the comparison across the open-source models for this dataset.
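Going back to the Elo scores mentioned earlier as a way to compare models, the standard Elo formulas can be sketched in a few lines of Python. This is the classic chess-style formulation; the scale constant 400 and the update factor `k=32` are conventional defaults assumed here, not values given in the talk, and model leaderboards may use different settings.

```python
def expected_score(rating_a, rating_b):
    """Probability that A wins a comparison against B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_ratings(rating_a, rating_b, score_a, k=32):
    """Return new ratings after one comparison.

    score_a is 1.0 if A's response was preferred, 0.0 if B's was,
    and 0.5 for a tie. k controls how fast ratings move (assumed value).
    """
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b


# Two models start level; a human prefers model A's response once.
model_a, model_b = update_ratings(1500.0, 1500.0, score_a=1.0)
```

Each human preference judgment between two models' responses counts as one "game", so the same ranking feedback collected for preference modeling can also feed a leaderboard of this kind.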

Now going into the LLM life cycle. Here I'm specifically talking about fine-tuning for human preference and alignment. Of course, in the LLM life cycle there are many other things, like evaluation, which I will discuss, and there are many other issues to take into account, such as deployment, inference, and so on.

But here I'm specifically talking about the life cycle of the data and the models, and the data you need for improving them throughout this process. This figure is inspired by the one from Chip Huyen, which is in turn inspired by the InstructGPT paper. We see the different stages. First the pre-training, which gives the internet-scale LLM.

Then the supervised fine-tuning, which is trained with something called demonstrations that we will discuss later, but you can see an example on top. For SFT models you have many examples, many open-source examples as well, like Alpaca, Dolly, Vicuna, the new Falcon models, the Falcon Instruct model, and so on.

You can start using this model because it's supposed to be following instructions, but you can improve it further with the second step, which is the RLHF step. That is composed of two steps. One is preference modeling, or reward modeling, and there you are going to collect these rankings or comparisons that we discussed.

For example, I have a prompt and two responses, and the user will tell me which one is preferred and which one is less preferred. I'm not going to discuss much about the modeling part, but I will point at some good frameworks and libraries to check this out.
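Although the talk does not go into the modeling part, a small sketch may help show how such rankings typically become training signal. The code below turns a best-first ranking into (chosen, rejected) pairs and computes the pairwise loss commonly used for reward models, `-log(sigmoid(r_chosen - r_rejected))`. The function names are hypothetical, and the reward values would come from a trained model; here they are plain numbers.

```python
import math
from itertools import combinations


def ranking_to_pairs(ranked_responses):
    """Turn a best-first ranking into (chosen, rejected) comparison pairs."""
    # combinations() keeps input order, so the first element of each pair
    # is always ranked above the second.
    return list(combinations(ranked_responses, 2))


def pairwise_loss(reward_chosen, reward_rejected):
    """-log(sigmoid(r_chosen - r_rejected)), computed in a stable way."""
    x = reward_rejected - reward_chosen
    # softplus(x) = log(1 + exp(x)) = max(x, 0) + log1p(exp(-|x|))
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


pairs = ranking_to_pairs(["response-b", "response-a", "response-c"])
```

The loss is small when the reward model scores the preferred response higher than the rejected one, and grows when it gets the order wrong, which is what pushes the reward model toward the human ranking.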

The final part is the RL process. For that, you might need to collect more data, and that data will essentially be prompts, because in this final process you are going to use the model to generate one, two, or more responses. Then you will use the reward model that you trained in the previous stage to evaluate and score them.

The RL process will not use the human preferences anymore; it will use the preferences of the reward model. This is a rather static view of training. At Argilla we like this, but we are more into iterative and dynamic life cycles. Since the very beginning we have focused on enabling data teams to iterate on data and models.

We think the same about LLMs. So I find this diagram from the Anthropic paper very interesting, because it's probably one of the only works that discuss this kind of, as they call it, iterative online RLHF. What they do is take snapshots every week and use a chat-based UI to collect conversations with the new generations of models.

They keep on improving the quality of the data every week. I find this process much more aligned with what we are expecting to see in the coming months and years for custom LLMs, so I wanted to share this figure as well. Now, just a few words about evaluation.

Probably the youngest people here won't know this film, but it was very famous in the eighties. It's about a robot that has incredible capabilities. When I am looking at certain papers and works discussing LLM evaluation, I always think about this idea of a robot reading input, input, input, and probably not understanding much.

All the Argilla Feedback work has been inspired by several community efforts that we've been doing. One of them concerns the Alpaca datasets. We've been engaging with different communities in different languages, and we've been analyzing the quality of the Alpaca dataset, which, for those of you who don't know it, is a synthetically generated dataset of instructions.

Before releasing Argilla Feedback, we had been using the previous version of Argilla to help people out with the cleanup efforts. Here we are seeing an example of a kind of hallucination of an instruction.

Remember, you are going to fine-tune the instruction-following model with this type of data. So the model will think that hallucinating is fine, that receiving something like "attached painting" is fine, and it can provide you with a response that is completely made up.

During this effort, what we did was use Argilla for curating the dataset. With all this feedback, we trained a model that has been used for different languages and has also been used in this Alpaca cleaning effort. It's a SetFit model that, given an instruction, an input, and an output, can tell you whether it's likely a bad instruction.

This is the type of work that we expect to be doing much more efficiently with the new Feedback functionality, because you can define different dimensions for feedback. So you can also train specific models to detect certain attributes.

Another comment about evaluation concerns this paper, which I found very interesting, very useful, and needed. It came out a few weeks ago, and it's about the false promise of imitating proprietary LLMs. This is specifically about Vicuna, Alpaca, and all these models that try to reuse outputs from GPT, GPT-4, and so on, to imitate their outputs.

I know about the Orca work. I cannot comment on that because I didn't fully read the paper. But my first thought is that the process the Microsoft people followed is really expensive, in the sense that some works I will discuss later require only 1,000 or 2,000 examples to get the model to follow instructions.

So I don't see a lot of value, at least for open source, in generating millions of instructions from GPT-4. Basically, this work showed that humans and crowd workers, the people evaluating or scoring the outputs of these models, actually tend to be fooled by the style and the tone.

What we see on the screen is a ChatGPT response that is almost correct or accurate, so that's the truthful dimension. Then we see an imitation model whose response has good style and sounds authoritative, but contains a lot of false information.

In this paper, what they showed is that crowd workers tend to score these imitation models highly. But when you actually test these models on language tasks and NLP tasks, you will see that they don't even perform better than the non-instruction-tuned models.

For example, if you fine-tune LLaMA with Alpaca, the base LLaMA will still be better at these tasks. I found this really interesting because we've seen other work, like Vicuna, claiming 90% closeness to ChatGPT. In this paper they show that there's still a huge gap, and that there are no shortcuts to make these models comparable to those that have been going through a process of human feedback.

Another thing related to evaluation that I wanted to mention is this post from DeepLearning.AI that got really popular. It's saying: okay, in traditional ML you need to get labeled data, and you need weeks or months to do that. Then you develop a model, and then you deploy it in production.

What we are seeing now, and I agree, is that you can get going with a prompted model. You can just configure it, test it, and then start deploying it. They describe this sequence of actions: you deploy on live data, and then maybe you use it in shadow mode.

So you don't use it for giving real answers to end users, but just to collect responses. What they don't discuss is this last part, which is about the model performance. They say: if model performance is acceptable, then let the model make real decisions.

But my question is how you do that, and there are not many good answers to that. Either you collect ground-truth data, or you have another LLM score the responses, or you ask your users if they find them correct. But of course you need a rigorous process to evaluate and decide if the model is performing well or not.

What we propose at Argilla, and we've been doing work on that as well, is that you can start with a prompted model, say a LangChain application. Then you can use the official Argilla callback to monitor the interactions with the chain, and then you can build datasets on top of that.

What you can do is use this prompt-based model and continuously and frequently ask your users, your colleagues, or experts in the domain whether the model is behaving well. That way you can compute evaluation metrics from this production model. So I invite you to test it.
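The monitoring loop described above can be sketched in plain Python. This is not the Argilla callback or the LangChain API: `InteractionLog`, `on_llm_call`, and `acceptance_rate` are hypothetical names, the 1 to 5 rating scale and the threshold of 4 are assumptions, and the logged data is made up. The sketch only shows the idea of storing every interaction and turning later human ratings into a metric.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Interaction:
    prompt: str
    response: str
    rating: Optional[int] = None  # filled in later by a human reviewer (1-5)


class InteractionLog:
    """Hypothetical stand-in for a dataset fed by a monitoring callback."""

    def __init__(self):
        self.records = []

    def on_llm_call(self, prompt, response):
        # Called once per model interaction, like a callback hook would be.
        self.records.append(Interaction(prompt, response))

    def acceptance_rate(self, threshold=4):
        """Share of human-rated responses with rating >= threshold."""
        rated = [r for r in self.records if r.rating is not None]
        if not rated:
            return None  # nothing reviewed yet, so no metric
        return sum(r.rating >= threshold for r in rated) / len(rated)


log = InteractionLog()
log.on_llm_call("How do I reset my password?", "Go to Settings > Security.")
log.on_llm_call("Can I get a refund?", "Refunds are not possible.")
log.on_llm_call("Where is my invoice?", "Check the Billing page.")

# Reviewers have rated two of the three interactions so far.
log.records[0].rating = 5
log.records[1].rating = 2
```

Only rated interactions enter the metric, so the number reflects human judgment rather than volume of traffic; unrated records stay in the log for later review.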

This is still alpha, and we are going to keep improving it over time, but it is usable right now. The idea is that you can define a callback, so every interaction of the LLM will be stored in an Argilla dataset. This Argilla dataset can be configured for rating the responses or for just providing the correct response.

This is highly valuable because you have a prompt-based model, but at the same time you are setting up the way to maybe fine-tune it, or at least to get real evaluation metrics beyond "the model is performing well." So maybe we can do a small pause if someone wants to ask some questions, and then I can continue.

Or shall I continue?

Hi.

Okay. So now this part is about the stages of the life cycle that we saw at the beginning: how to get from a base LLM all the way through instruction following to an RLHF model. I will be discussing the different data collection processes that you need. The first one is prompt collection.

The type of feedback at this stage is basically this: there's no input. There might be some input guiding the user, such as "Please write a prompt about this topic," but you don't necessarily need a lot of input. It is only about asking the users to come up with a prompt, in this case "Should I add chorizo to my paella?"

The type of feedback is natural language. Just to situate ourselves, we are at this stage: for supervised fine-tuning we need prompts, and I will discuss ways to get them. One way is to ask users to just write them down. This is what the InstructGPT work did, asking labelers and crowd workers to come up with thousands of prompts.

So why do we need prompt collection? We need to collect data to fine-tune and align LLMs with human values, preferences, and domain, and for that we need prompts. As I said, there are different options. The first one is the most obvious: you can use an existing resource or database. In fact, OpenAI combined asking users to write new prompts with leveraging a lot of data that they had from their previous APIs.

I don't know exactly the proportion of labeler-written prompts versus the API distribution, but they at least say that they used both. You can also leverage user queries to your service.

Imagine you want to fine-tune a model for, I don't know, customer service for your product. You probably already have a service, even a non-ML one, for people to ask questions, and you could leverage that. As I said before, here it's important to measure the quality and the diversity of these inputs. You could leverage this database and ask users to rate or qualify these queries in order to have a high-quality dataset.

The other option, the one described in the InstructGPT paper, is to ask experts to write prompts. As I will mention later, this was done by Dolly as well.

The Databricks people asked employees to come up with prompts, but they also asked them to write the response. I will discuss that later, because it can come with issues. Here, it's important to set up the guidelines and the topics. For example, in Argilla you can say: okay, I want, I don't know, 2,000 prompts.

And I want this distribution of topics. What you could do is create a dataset with an input that specifically tells the user to write about a given topic. So you can say: okay, I want 20% of data points on this topic and 10% on this other one, and the users will go there and will not need to think about the topic.
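The topic quotas just described can be computed with a few lines of Python. This is a sketch under assumptions: `topic_quotas` and `build_tasks` are hypothetical helpers, the topic names are invented, and shares are given as whole percentages so the arithmetic stays exact. Leftover records from rounding go to the topics with the largest remainders.

```python
def topic_quotas(total, percent_by_topic):
    """Split `total` prompt-writing tasks across topics by integer percentages."""
    if sum(percent_by_topic.values()) != 100:
        raise ValueError("topic percentages must sum to 100")
    quotas = {t: total * p // 100 for t, p in percent_by_topic.items()}
    leftover = total - sum(quotas.values())
    # Hand out rounding leftovers to the topics with the largest remainders.
    by_remainder = sorted(
        percent_by_topic,
        key=lambda t: (total * percent_by_topic[t]) % 100,
        reverse=True,
    )
    for topic in by_remainder[:leftover]:
        quotas[topic] += 1
    return quotas


def build_tasks(quotas):
    """One record per requested prompt; the input field names the topic."""
    return [
        {"topic": topic, "instruction": f"Please write a prompt about: {topic}"}
        for topic, count in quotas.items()
        for _ in range(count)
    ]


# Invented topics for a customer-service use case.
quotas = topic_quotas(2000, {"billing": 20, "shipping": 10, "other": 70})
tasks = build_tasks(quotas)
```

Each task record would then be shown to a labeler as the input field, so the topic mix of the collected prompts is fixed up front rather than left to whatever labelers happen to think of.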

They will be given a topic and they just need to write the prompt. So there are ways to control the distribution and the diversity of this dataset. The risk with collecting prompts this way is that there may be a disconnect with the use case. You can ask users to write about different topics, but maybe these are not the real questions you will get for your model.

Take the customer service example: labelers can come up with fake questions, but at the end of the day the real users will ask other questions, so maybe your model is not fine-tuned to respond to these real user queries. So if you have an existing resource, it is good to leverage it, and probably combine it with this other approach.

There's this question: okay, if I'm asking my experts or my labelers to write prompts, why not ask them to write a response too? At the end of the day, what you are going to need for supervised fine-tuning is the prompt and the response. So why not do this at the same time?

Well, this has some limitations. It might work for some use cases, but coupling both of them can actually produce lower-quality responses. Imagine that the same user writes a prompt that is not very clear, but to him or her it is really clear. He or she will come up with a response, but the prompt itself was probably not good in the first place.

So the quality of that data point may not be so good. But if you separate those into two data collection processes, you can ask users to write prompts and then ask other users, or maybe the same users, to write the responses. At that point they can say: okay, I don't understand this instruction or this prompt, so I will just discard it.

In this way you are not collecting bad-quality responses. There's a specific case of this, which is the Dolly dataset from Databricks. We've been analyzing this dataset and engaging with the community to improve it.

I have to say that it's really good, and I really admire the effort of Databricks and its employees. Nevertheless, it has some issues. One of them, as I said, is that some employees or labelers didn't fully understand the task or how the fields were used.

In this example, we see what is supposed to be an information extraction task, where the instruction is "Who is Thomas Jefferson?" and the context is really short. In information extraction, you need to extract the specific information from the context. But in this case, the labeler just copy-pasted from Wikipedia.

So you are trying to train a model to do information extraction, but the example that you're giving is not so good. You might say that maybe this is just a couple of examples, but we've been running a community campaign, and we've identified and fixed more than 400.

And for some tasks, such as information extraction, summarization and so on, this accounts for more than 10% of the examples. So we believe that improving this dataset can lead to better-quality data. And as I said, this is also a good insight on how to actually collect prompts and responses.

So if you are interested in more details, we have the dataset available for everyone. We also provide translations to other languages, and this is an ongoing curation effort. If you are interested in helping out, contact me.

So yeah, this I discussed already. Of course it can be much more costly to do them separately, but as we will see later, maybe you only need a thousand examples and not 10,000 like Dolly or the InstructGPT paper, because there are some works pointing out that you need higher quality but less quantity.

So even with 1,000 examples, you can get a good instruction-following model. Another way to do this is prompt-based ML: you can use prompts from monitoring an LLM, as we said before. If you are continuously monitoring the model, then this is a perfect source of real questions from users.

And in this case you can even use the responses if they are good. In Argilla, you can set up a dataset listening to a LangChain app, and then you can say, okay, those prompts are good, the response is also good, so this is probably a good ground-truth example to evaluate and potentially fine-tune your model.

Once we have prompts, the next step is supervised fine-tuning, which is basically trying to turn the LLM into a helpful model, with the ability to follow instructions, answer questions, and be helpful for the user. So we are there. In this case, the feedback will be the completion.

As a labeler, I will see a prompt, and then I will write down a completion. So the type of feedback here is natural language again, but here we are already providing the input, which is the instruction or the prompt. In the literature this is also called supervised fine-tuning, behavior cloning, or instruction tuning, and lately "instruction-following models" is the more common terminology for this.

So I wanted to show you an example of what we want to achieve here. This is a real example from the Falcon base model, so this kind of internet-scale LLM. We are asking it to write a follow-up email, and we are adding something to the prompt to nudge the model into generating the response that we expect.

But even with this, it's not so helpful. This is the same model, but fine-tuned with instruction-following data. Basically they used an open dataset that contains these completions: the instruction, and then the response. And we can see that this is much better.

We don't need to add anything to the prompt. We just say what we want and the model tries to answer in a helpful way. Of course, this is not perfect, but this is something that can be achieved with this completion data. So, basically this is how you could set up the SFT phase, the supervised fine-tuning phase, for collecting these completions or demonstrations.

Basically you will say, okay, I have a prompt, and I want to ask the user to write a harmless and helpful response. So the user will be asked to just write it down, and then you push the dataset and it's available in the UI for the users. And this is what they will see. For example, this is a very famous prompt: "Explain the moon landing to a six-year-old in a few sentences." Then the labeler will use the UI to provide a response.

Okay. So, about size and quality. In the InstructGPT paper, it was 13k, coming from the API and from labelers, and the dataset is private. Then the Dolly dataset followed the same guidelines and they collected 15k, in this case from employees. The difference is that they were writing both the prompt and the response, but the dataset is open.

And you also have the curated version from Argilla in the link I just showed. The other work that I find really interesting is LIMA, "Less Is More for Alignment", where they focus exclusively on curating a high-quality instruction dataset. It's only 1k, and they show that this gets good results, at least on style.

I don't think they will get a lot of good results in human preference or alignment. But for the style of actually following the instructions, it is really powerful. So this might show that you don't need 10,000, and you might just need 2,000 or 1,000 to at least have a first version of your instruction-following model.

So the last stage: once you have this instruction-following model, you want to align it so that it provides more helpful, less harmful responses. The way to do this, at least for now, is preference modeling. Basically, here we are going to ask our users to say which outputs they prefer.

As we said, we are here, and in this case we will get several responses. The task is about providing a ranking of these responses. This is called comparison data, or preference data as well. The goal here is to train a reward model, and I will show how it works, at least from a high-level perspective.

So basically the type of feedback we are getting here is rankings. I wanted to mention that you can ask experts to rank more than two outputs. This is done in the InstructGPT paper, but not by the Anthropic people. The way they do it is this: imagine they provide seven outputs and ask the user to rank them all.

For training, they binarize this ranking and generate pairs of ranked responses, because the data you will use for training the preference model is similar to this one: you have an input, a chosen response, and a rejected response, and you teach the preference model, or reward model, to give higher scores to chosen responses.
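The binarization step can be sketched as follows. This is a minimal illustration, assuming the ranking arrives as a list ordered from best to worst; the function name `binarize_ranking` and the `prompt`/`chosen`/`rejected` keys are illustrative choices, not taken from a specific library.

```python
from itertools import combinations
from typing import Dict, List


def binarize_ranking(prompt: str, ranked_responses: List[str]) -> List[Dict[str, str]]:
    """Turn one full ranking into pairwise (chosen, rejected) examples.

    `ranked_responses` must be ordered best first. `combinations` keeps the
    input order, so the first element of each pair is always the better one.
    """
    return [
        {"prompt": prompt, "chosen": better, "rejected": worse}
        for better, worse in combinations(ranked_responses, 2)
    ]


# Seven ranked outputs yield C(7, 2) = 21 comparison pairs.
ranked = [f"response ranked #{i}" for i in range(1, 8)]
pairs = binarize_ranking("Explain the moon landing to a six-year-old.", ranked)
print(len(pairs))  # 21
```

One ranking of seven outputs therefore gives many more training comparisons than a single A-versus-B judgment would.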

So in this example, we are seeing that for this question the model strongly prefers the chosen one, which is written by a human, by the way. The rejected one is written by an open-source model. If you're interested in this and the training process, we published this open-source model and we also have a tutorial in the Argilla docs that can help you get started with reward modeling using the TRL library.
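The talk does not spell out the training objective. One common way to express "score the chosen response higher than the rejected one" is a pairwise logistic loss on the two scores; the sketch below assumes that formulation and uses plain floats in place of real model outputs.

```python
import math


def pairwise_loss(score_chosen: float, score_rejected: float) -> float:
    """Compute -log(sigmoid(score_chosen - score_rejected)).

    The loss is small when the chosen response scores well above the
    rejected one, and large when the order is reversed.
    """
    margin = score_chosen - score_rejected
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch to avoid overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Correct ordering gives a lower loss than the reversed ordering.
print(pairwise_loss(2.0, 0.0) < pairwise_loss(0.0, 2.0))  # True
```

With equal scores the loss is log 2, and it shrinks toward zero as the margin in favor of the chosen response grows.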

And yeah, I think I discussed most of it. Maybe the last mention is that preference models are useful for RLHF. There's a new method called Direct Preference Optimization that doesn't need RL, so the preference modeling is done without RL. This is highly interesting for us.

We haven't tested it yet. But beyond that, preference models are also used for other things that are important in the LLM life cycle. One of them is model selection, because selecting the best instruction-following model is not easy, as they tend to overfit.

So what some works do is disregard the evaluation metrics and just use the preference model to select the best snapshot. From the evaluation metrics you can see that it's overfitting, but the preference model is saying that even if it's overfitting on those metrics, this is the most useful one.
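A minimal sketch of snapshot selection with a preference model, assuming each checkpoint has already generated responses for the same prompts. `select_checkpoint` and `toy_score` are hypothetical names; `toy_score` is only a stand-in so the example runs, and a real setup would call a trained reward model instead.

```python
from statistics import mean
from typing import Callable, Dict, List, Tuple


def select_checkpoint(
    responses_by_checkpoint: Dict[str, List[Tuple[str, str]]],
    score_fn: Callable[[str, str], float],
) -> Tuple[str, Dict[str, float]]:
    """Pick the checkpoint whose responses get the highest mean score.

    `responses_by_checkpoint` maps a checkpoint name to (prompt, response)
    pairs; `score_fn` plays the role of the preference model.
    """
    mean_scores = {
        name: mean(score_fn(prompt, response) for prompt, response in pairs)
        for name, pairs in responses_by_checkpoint.items()
    }
    best = max(mean_scores, key=mean_scores.get)
    return best, mean_scores


def toy_score(prompt: str, response: str) -> float:
    """Stand-in scorer (counts distinct words). NOT a real quality signal."""
    return float(len(set(response.lower().split())))


candidates = {
    "step-1000": [("Write a follow-up email.", "ok")],
    "step-2000": [("Write a follow-up email.", "Hi Ana, just following up on our call.")],
}
best, scores = select_checkpoint(candidates, toy_score)
print(best, scores)
```

The same scoring loop can serve for evaluation: instead of comparing checkpoints, you look at the mean score of a single model's responses.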

So this is used for model selection, but it's also used for evaluation: you can take the responses of a model and measure whether they are useful or not. And this is how you set up the UI for this stage. Basically you have three text fields. You could have more if you have more responses, as we said.

In this case I'm using a rating question, but you can also use a categorical or, as we call it, label question. And soon you will be able to use a ranking question where you can drag and drop the rankings. And this is the UI after setting it up and filling it with some examples coming from Dolly and from Falcon.

So basically we have a user instruction and two responses, one coming from the Dolly dataset and the other generated by Falcon. This is really simple: we are just asking the user to say, okay, I prefer response one, or I prefer response two. As I said, if you're interested in this topic, we've been closely collaborating with the TRL people at Hugging Face, and last week we published an end-to-end example of reward modeling using an Argilla dataset and the new reward model trainer from TRL.

So the key takeaways, because I want to have at least some minutes to introduce the special prize: we believe that human feedback, likely aided by machine feedback, is key to deploying aligned and robust LLM solutions. We believe that domain experts will become more and more relevant because it's getting increasingly difficult to provide feedback to these models.

And you need privacy and all sorts of things. So we believe that high-quality feedback from the domain experts within your organization is going to be very important. Collecting feedback is not as expensive as it might seem, especially if you start thinking about it from the very beginning.

So when you are starting to experiment with LLMs, you also start defining your own human feedback collection processes. We believe that this is not going to be very expensive, especially if you have domain experts and your data team collaborating and cooperating in this process.

So, I wanted to introduce this LLM Crowd Eval prize. Basically it's an open experiment. We are going to learn from this process and we would love you to participate. It's an open experiment to understand open-source LLMs and open-source datasets. And I think it might be a good exercise for you as well to just look at some of these datasets and outputs, to understand how they work, how they think, or how they produce responses.

It's also an opportunity to win an amazing book. The idea is that we have set up an Argilla instance, an Argilla Space on the Hugging Face Hub, with 500 users. It is completely deployed there and completely open source.

The idea is that you just need to go to this Hugging Face Space, which I will go through afterwards, log into the Argilla Space with your user, read the guidelines, and then just start ranking and optionally providing feedback. The deadline for this is next Monday at 10:00 AM Pacific time.

The way to participate is to post and tag Argilla on Twitter or LinkedIn with your username. You will see that the username is auto-generated, so we need you to claim that you are that user. Then what we will do is analyze the different contributions, and we will choose the most prolific, but also helpful and truthful, contributor.

The evaluation will not be based solely on the number of data points, but also on quality. We have introduced several control data points. I didn't discuss this, but this is highly important for ensuring data quality. All of you participating will have some records in common, and we will use those to measure whether you are just skipping to the next screen and labeling randomly.

The results and the winner will be featured next week on our social media. The book that we are giving away is called Human-in-the-Loop Machine Learning, and it's a really good book to understand all these human-in-the-loop processes, active learning and so on.
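The talk does not say how the shared control records are scored. One simple option, sketched here as an assumption, is to compare each annotator's answers on those records against reference labels and flag annotators whose agreement is low. The function names, the reference labels, and the 0.6 threshold are all hypothetical.

```python
from typing import Dict, List, Optional


def control_accuracy(annotations: Dict[str, str], gold: Dict[str, str]) -> Optional[float]:
    """Share of answered control records where the annotator matches the reference.

    Returns None when the annotator has not answered any control record.
    """
    answered = [record_id for record_id in gold if record_id in annotations]
    if not answered:
        return None
    hits = sum(annotations[record_id] == gold[record_id] for record_id in answered)
    return hits / len(answered)


def flag_annotators(
    all_annotations: Dict[str, Dict[str, str]],
    gold: Dict[str, str],
    min_accuracy: float = 0.6,  # hypothetical threshold
) -> List[str]:
    """Return the annotators whose control accuracy falls below the threshold."""
    flagged = []
    for user, annotations in all_annotations.items():
        accuracy = control_accuracy(annotations, gold)
        if accuracy is not None and accuracy < min_accuracy:
            flagged.append(user)
    return sorted(flagged)
```

An alternative with no reference labels is to measure agreement between annotators on the shared records; the bookkeeping is similar, only the comparison target changes.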

So I really invite you to participate, and I will just finish by going through the steps. Basically this will retrieve one of the users from these 500 users on this instance. If I run it, I will request this one. I will not use it, because probably, and hopefully, some of you will use it, and I'm already logged in.

So what I do is go to this link. You need to copy and paste the password. Don't worry, because this password is not disclosing any data from you or anything. It's just to log into this instance, which contains basically open data. So basically what you need to do is go here. Okay, so this was my user, and this is the user that you need to tell us you've been using.

And if you go to the dataset and want to see it full screen, you can go to "Embed this Space" and you will see the full screen, which I think is better. So you will be given an instruction and two responses, and you need to use this scale to say, okay, response A is much better.

Then if you press Enter, you will submit this record, and you can see your progress here. And that's it. I think I will leave some minutes for questions. If you are interested in participating, please let me know. I think you will have the presentation available as well.

Yeah. Right. Thank you so much, Daniel, for sharing that.

I think we have a few questions, but David has been gracious enough to answer most of them. The outstanding question here came from Atan: how does RLHF help with situations that require world knowledge, you know, saka or second-order logic computations, like ambiguities and stuff?

That's a very good question. I don't think I have an answer. I'm not an RLHF researcher or anything, but from what I understand, it will not help much with world knowledge. In fact, in the InstructGPT paper you can see that the RLHF model gets a bit better on truthfulness, which is like factuality and all these kinds of things, but it's not getting much better. Because the process is about human preference, the user is probably going to prefer one response over the other even when both are inaccurate or don't encode the world knowledge well.

So probably you are going to just use the same kind of reward for the RLHF process. So I don't think RLHF is the best way to solve this problem. I think using external knowledge bases and other things is much more promising. But that's my first thought on that.

Yeah. Thank you for sharing that. And the next question asks: how stable is RLHF in general? Would you recommend it in high-stakes situations?

Again, pretty new and hydro. I'm really happy to see here. Okay. I think in itself the RLHF process is really unstable. Even getting the model to be trained properly is very unstable. And for high stakes, at least for controlling toxicity and all those kinds of things, I think the OpenAI and Anthropic people have done a great job using RLHF to detoxify models and so on, and that's why I think they've had this huge success.

But for high stakes, really high stakes, I don't know. I think I would provide a lot of guardrails and other things that go beyond just fine-tuning or RLHF.

I think it's a much more complex system.

Yeah, interesting. I think Tom also shared a perspective in the chat as well, so I think that's a good perspective. And this question is from Raho, who asked: are there any standard dataset formats for exporting the training data?

I believe this has to do with the Argilla platform. Again, are there any standard dataset formats for exporting the training data in Argilla?

Yes. So basically, I think David from the team is around, and he's been working on this: once you have gathered feedback in an Argilla dataset, how to transform that into something that can be used for training.

And he's been doing a lot of work on aligning this with other libraries, so Transformers, TRL and so on. And we'll be doing much more work on that. But basically what we show in the tutorial, if you want to go to the TRL tutorial, is that you just need a couple of lines of code to transform the Argilla dataset into something that is usable by a reward model.
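As a rough illustration of that kind of transformation (not the tutorial's actual code), the sketch below converts exported feedback records into chosen/rejected pairs. It assumes each record is a plain dictionary; the field names `instruction`, `response-1`, `response-2`, and `preferred` are hypothetical and would need to match your own dataset.

```python
from typing import Dict, List


def to_reward_pairs(records: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Convert two-response preference records into chosen/rejected examples.

    Records with no clear preference (ties, skipped, unknown values) are dropped.
    """
    pairs = []
    for record in records:
        preferred = record.get("preferred")
        if preferred not in ("response-1", "response-2"):
            continue  # no usable preference for this record
        other = "response-2" if preferred == "response-1" else "response-1"
        pairs.append(
            {
                "prompt": record["instruction"],
                "chosen": record[preferred],
                "rejected": record[other],
            }
        )
    return pairs
```

The resulting list of `prompt`/`chosen`/`rejected` dictionaries is the shape of data that reward model training typically consumes, before any tokenization.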

It's just some transformations that you need to do.

Yeah. And we're a few seconds over, but I think we can take just one final question. This question is from Shiv, who asks: how often do you update the model with user feedback incorporated within the model?

That's also a good question. I think with most of the things I've been discussing, for more dynamic environments, we are not there yet as a community.

So I think we will get there. But for now, everything is pretty static. You train a supervised fine-tuned model and then you just start using it, maybe for production or whatever. And maybe you do RLHF, but you don't actually update the model. But I would say the Anthropic paper that I mentioned at the beginning did a great job showing that you can do weekly updates to the model using this RLHF process.

But to be honest, we haven't yet explored these more iterative ways of building or fine-tuning LLMs. We've been doing this for predictive models for a long time already, but I imagine that this can become more challenging. This is something we will definitely look at.

Right. Thank you so much, Daniel, for sharing your perspective and for leading us in the workshop today. So again, Daniel is the CEO and co-founder of Argilla. If you want to learn more about Argilla, you can just go to the website at argilla.io. I'm also sharing the link there as well. And do participate in the Hugging Face Space as well, and be sure to follow the instructions. See you in the other workshops.

Bye for now. Thanks everyone.
