RLHF Data Collection in Practice
SPEAKERS

Andrew has a background in ML engineering and previously worked on bringing ML to the finance industry at Kensho and safeguarding elections at Twitter. He was the first engineering hire at Surge AI and has spent the last 3 years building systems to collect high-quality human feedback data at scale to power ML teams at Anthropic, OpenAI, and more.

At the moment Demetrios is immersing himself in Machine Learning by interviewing experts from around the world in the weekly MLOps.community meetups. Demetrios is constantly learning and engaging in new activities to get uncomfortable and learn from his mistakes. He tries to bring creativity into every aspect of his life, whether that be analyzing the best paths forward, overcoming obstacles, or building lego houses with his daughter.
SUMMARY
This talk describes how we think about collecting RLHF data at Surge. We highlight the risks of collecting low-quality data for RLHF and describe some of the practical strategies we use in our full-stack RLHF data collection product.
TRANSCRIPT
Next up we've got Mr. Andrew. Where you at, Andrew? Let's see it. There he is. Hey, what's up? Nice to meet you. Excited to be here. How you doing, man? Great, great. So, I sent out an email before this all started, and I said, if you reply to this email and let me know which talk you're most looking forward to, I'll send you out some swag.
And quite a few people said they were excited about your talk, because Surge is absolutely killing it right now, and we need to let everyone know that. So you've come on here to talk to us about the very hot topic of what I can never pronounce, RLHF, if I put those letters in the correct order.
And it's hot right now, man, let's be honest. Yep, yep. Cool. Should I jump into it now? Yeah, I'm gonna share your screen so that it's up, and I'll come back in 10 minutes. We're switching over to lightning talks now, so this is gonna be a 10-minute talk. You got 10 minutes on the clock, man.
I'll see you soon. Yeah, thanks.
Cool, yeah. So my name is Andrew. I'm from Surge AI, where I'm leading the engineering team. Just to introduce ourselves a little bit, we are a full-stack human feedback company that started around three years ago. We cover everything from hiring our own contractors, to building our own labeling platform for them to annotate on, to delivering data to clients.
Recently, we've been spending a lot of time on LLMs and RLHF. To give some more context about our work, we've been partnering with groups like Anthropic, OpenAI, Google, Cohere, and others to provide the data that powers a lot of these top-of-the-line systems and helps make them safe and useful.
This slide has a bunch of the research papers that have used our work, and it's available on our website, which is surgehq.ai. Today I'm gonna go through the process of RLHF, but from the perspective of data collection. First I'll talk about the supervised fine-tuning piece of it, and then the ranking piece of it, and give some of the lessons we've learned along the way doing this type of work. Before we jump into the details, I'll motivate the problem. Let's say you're fresh off training your new LLM. You fire it up, maybe hoping to see how well it does against ChatGPT, and you ask it to write a story
about Harry Potter using AI to fight Voldemort. The problem here is that the base LLM, as we all know, is only trained to predict the next word on internet text. So when asked to generate a story, before any RLHF, it's actually not gonna generate the story. It just continues on from there.
It keeps writing instructions about the story instead of actually writing the story. Once we're able to successfully run RLHF, we get the actual content we want. Here's an example from GPT-4, which has had RLHF or similar training. When we ask it to write the story, it actually starts doing what we want.
So how do we actually get this useful behavior? How do we go from this next-word predictor to the useful assistant character we're all familiar with in ChatGPT? There are two stages to the process, as it's most commonly done right now.
The first is supervised fine-tuning. This involves collecting a few thousand to tens of thousands of prompt-completion pairs. An example of a prompt-completion pair could be the story I just showed, or here's another example from a Surge labeler. We're basically collecting a bunch of prompts, writing the desired output from our model from scratch, and directly fine-tuning the model on that.
This is laying the foundation for the assistant we want. All the behaviors we want from our assistant should be included as part of this fine-tuning dataset. In this example, the model just gives the tweet directly when we ask it to write a tweet about the matcha-flavored sparkling water.
But you could imagine another type of assistant that's more conversational and says, "Hey, sure, I'm happy to do that for you. Here's the tweet." You could also imagine that if this was about medical advice or financial advice, you'd want it to give a disclaimer. So any behaviors we want in the assistant, we're gonna want to put into this fine-tuning dataset, to lay the foundation for what the model can do and what its possible outputs are.
Here's one thing that can happen if you don't do that well. This is a LLaMA model that has had instruction tuning, but it never had any data in its fine-tuning dataset that let it know it could say "I don't know." So I ask it for the address of my favorite Italian restaurant in San Francisco.
All of the fine-tuning data the model received involved the model being asked a question and answering that question, never saying "I don't know." So instead of saying "I don't know" about this very specific restaurant, it just makes up a basically completely random address.
So I think one lesson learned from the supervised fine-tuning part is that this is where we really lay the foundation: all the behaviors we want the model to have, we want to include them in here.
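To make that concrete, here is a minimal sketch of what a supervised fine-tuning dataset and training step could look like. The record format, the example completions, the choice of GPT-2 as a stand-in base model, and the bare-bones training loop are illustrative assumptions, not Surge's actual pipeline.

```python
import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForCausalLM, AutoTokenizer

# Prompt-completion pairs written by labelers. Every behavior we want the
# assistant to have (direct answers, "I don't know", disclaimers) needs to
# show up somewhere in this dataset.
sft_examples = [
    {"prompt": "Write a tweet about matcha-flavored sparkling water.",
     "completion": "Matcha + bubbles = my new favorite afternoon pick-me-up."},
    {"prompt": "What's the address of my favorite Italian restaurant in San Francisco?",
     "completion": "I don't know which restaurant is your favorite, so I can't give you its address."},
    {"prompt": "Should I put all my savings into a single stock?",
     "completion": "I'm not a financial advisor, but concentrating all your savings in one stock is generally considered risky."},
]

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # stand-in for the base LLM
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained("gpt2")

def collate(batch):
    # Concatenate prompt and completion and train with ordinary next-token loss.
    texts = [ex["prompt"] + "\n" + ex["completion"] + tokenizer.eos_token for ex in batch]
    enc = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    enc["labels"] = enc["input_ids"].clone()
    enc["labels"][enc["attention_mask"] == 0] = -100  # ignore padding in the loss
    return enc

loader = DataLoader(sft_examples, batch_size=2, shuffle=True, collate_fn=collate)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

for batch in loader:
    loss = model(**batch).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```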
The next part is reward modeling. This is basically the next step, where instead of just feeding the model prompts and completions, we're nudging it towards being a more and more helpful assistant over time. We sample different outputs from the model, have labelers rank them to indicate which is better, and use those rankings to run the reinforcement learning process, gradually increasing the model's capability and helpfulness as an assistant through different iterations of training.
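As a rough sketch of how those rankings typically feed a reward model, here is the standard pairwise setup; the model choice, example data, and exact loss below are illustrative assumptions rather than Surge's specific recipe.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# The reward model maps (prompt, completion) to a scalar score. In practice it
# is usually initialized from the SFT model; GPT-2 is just a small stand-in.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
reward_model = AutoModelForSequenceClassification.from_pretrained("gpt2", num_labels=1)
reward_model.config.pad_token_id = tokenizer.pad_token_id

# Each labeler ranking yields one or more (chosen, rejected) pairs.
comparisons = [
    {"prompt": "Write a story about Harry Potter using AI to fight Voldemort.",
     "chosen": "Harry stared at the glowing terminal in the Room of Requirement...",
     "rejected": "Here are some instructions for writing a story about Harry Potter: ..."},
]

def score(prompt, completion):
    enc = tokenizer(prompt + "\n" + completion, return_tensors="pt", truncation=True)
    return reward_model(**enc).logits.squeeze(-1)  # scalar reward

optimizer = torch.optim.AdamW(reward_model.parameters(), lr=1e-5)
for pair in comparisons:
    r_chosen = score(pair["prompt"], pair["chosen"])
    r_rejected = score(pair["prompt"], pair["rejected"])
    # Pairwise (Bradley-Terry style) loss: push the preferred completion's
    # score above the dispreferred one's. The trained scorer then serves as
    # the reward signal for the RL step.
    loss = -torch.nn.functional.logsigmoid(r_chosen - r_rejected).mean()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```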
One of the interesting things here is that it can be really hard to do the rankings, because there are tons of different factors that go into a model output. You can imagine an example where maybe we're asking for information about someone, and one of the completions is a beautiful answer: it's perfectly written, it's the perfect length, but it has one or two small facts wrong, hallucinations. And we can compare that to another answer where all the facts are right, but the writing is a little sloppy. So how do we weigh all the different things that go into what makes a completion good, like the factuality, the writing quality, the length? It can be really hard to manage that, and this is where we've found it's really important to be very clear in your instructions to people about what you're prioritizing and what your desired behavior is.
Another problem we've noticed here is that sometimes you might sample a bunch of completions and maybe all of them are bad, and you don't want to reward any of them. Here's an example from the same model. Let's say we're chatting with it and we ask it about the most recent John Wick movie. There are two answers here, and they're both wrong, because they both say John Wick 3, even though John Wick 4 has come out in the meantime. The ideal situation here is that the model acknowledges its training data is not the most up to date. However, if we pick one of these, we'll basically be rewarding it for giving the wrong answer either way.
So another thing we've built into our RLHF pipelines, which was originally an idea from the Anthropic RLHF paper, is editing. We have a function in our labeling software where, if you don't like any of the options, you can press an edit button. In this case, you can imagine the person editing the first answer and rewriting it to have the proper response.
Then instead of comparing the original two completions, we can compare the edited response, the one in the bottom left here, with the model's original response. So instead of telling the model to prefer one of these bad completions over the other bad completion, we're telling it to prefer the good completion over one of the bad ones.
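As a rough illustration of how an edit changes the training pair, here is a small sketch; the data structures and helper function are a simplification I'm assuming, not Surge's actual labeling format.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Comparison:
    prompt: str
    chosen: str    # the completion the reward model should score higher
    rejected: str  # the completion it should score lower

def build_comparison(prompt: str, completion_a: str, completion_b: str,
                     edited: Optional[str] = None) -> Comparison:
    """Turn a labeling task into a reward-model training pair.

    Without an edit, all we can express is which sampled completion is less
    bad. With an edit, the labeler's rewrite becomes the preferred side, so
    the model learns "good beats bad" instead of "bad beats slightly worse".
    """
    if edited is not None:
        return Comparison(prompt, chosen=edited, rejected=completion_a)
    return Comparison(prompt, chosen=completion_a, rejected=completion_b)

# Both sampled answers wrongly claim John Wick 3 is the latest film, so the
# labeler edits one of them rather than rewarding either wrong answer.
pair = build_comparison(
    prompt="What's the most recent John Wick movie?",
    completion_a="The most recent John Wick movie is John Wick 3.",
    completion_b="John Wick 3 (2019) is the latest entry in the series.",
    edited="My training data may be out of date, but the most recent release I know of is John Wick 4 (2023).",
)
```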
So finally, I'll briefly mention a future research direction I'm excited about in RLHF. I mentioned the problem of all these different things we're having to conflate. There's some recent work from the Allen Institute where they have the labelers explicitly mark in the text when there's an error in factuality or when something is irrelevant. By specifically indicating these problems, we get more granular data that we can combine in different ways to create a reward signal. So you could decide after the fact how much you want to penalize an error in factuality, and how you want to weigh that against the other things in the reward model.
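A toy sketch of why the granular labels are useful; the category names and weights below are illustrative, loosely inspired by the fine-grained feedback idea rather than taken from that work.

```python
# Instead of one overall ranking, labelers mark specific problems in the text.
# The per-aspect counts can be recombined into a single scalar reward, and the
# weights can be changed after the fact without collecting new data.
fine_grained_labels = {
    "factual_errors": 2,     # spans marked as factually wrong
    "irrelevant_spans": 1,   # spans marked as off-topic
    "overall_quality": 0.8,  # holistic 0-1 score
}

def combined_reward(labels, w_fact=1.0, w_irrel=0.5, w_quality=2.0):
    """Collapse granular annotations into one reward signal."""
    return (w_quality * labels["overall_quality"]
            - w_fact * labels["factual_errors"]
            - w_irrel * labels["irrelevant_spans"])

print(combined_reward(fine_grained_labels))              # default weighting
print(combined_reward(fine_grained_labels, w_fact=5.0))  # penalize factual errors harder
```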
Looks like I'm out of time now, but I'm happy to answer any questions, and I'll be around in the chat later.
Excellent. Thank you so much for this. This is awesome. So I'm going to keep us cruising. Anyone that has questions for Andrew, feel free to throw 'em in the chat and learn a little bit more about RLHF. I have to think really hard every time I say that, which is not the best sign. But dude, Andrew, thank you so much, man.
Thank you. And it's really cool to see everything that you all are doing at Surge. I know I said it before, but I'm a huge fan, and what you're doing is no small feat. Yeah, thanks, Demetrios. I appreciate it. All right, dude. Well, I'll keep it cruising. I'm kicking you off now.
