Explaining ChatGPT to Anyone in 10 Minutes
Cameron earned his Ph.D. in Computer Science from Rice University (advised by Dr. Anastasios Kyrillidis) in Houston, TX. His research interests are related to math and machine/deep learning, including non-convex optimization, theoretically-grounded algorithms for deep learning, continual learning, and practical tricks for building better systems with neural networks. Cameron is currently the Director of AI at Rebuy, a personalized search and recommendations platform for D2C e-commerce brands. He works with an amazing team of engineers and researchers to investigate topics such as language model agent systems, personalized product ranking, search relevance, and more.
At the moment Demetrios is immersing himself in Machine Learning by interviewing experts from around the world in the weekly MLOps.community meetups. Demetrios is constantly learning and engaging in new activities to get uncomfortable and learn from his mistakes. He tries to bring creativity into every aspect of his life, whether that be analyzing the best paths forward, overcoming obstacles, or building lego houses with his daughter.
Over the past few years, we have witnessed a rapid evolution of generative large language models (LLMs), culminating in the creation of unprecedented tools like ChatGPT. Generative AI has now become a popular topic among both researchers and the general public. Now more than ever before, it is important that researchers and engineers (i.e., those building the technology) develop an ability to communicate the nuances of their creations to others. A failure to communicate the technical aspects of AI in an understandable and accessible manner could lead to widespread public skepticism (e.g., research on nuclear energy went down a comparable path) or the enactment of overly-restrictive legislation that hinders forward progress in our field. Within this talk, we will take a small step towards solving these issues by proposing and outlining a simple, three-part framework for understanding and explaining generative LLMs.
Explaining ChatGPT to Anyone in 10 Minutes
AI in Production
Demetrios [00:00:00]: Next up, I am very excited to bring onto the stage Cameron. What's up, dude?
Cameron Wolf [00:00:07]: How's it going?
Demetrios [00:00:09]: So I don't think people know this, but I have to school them real fast on who you are. I follow your newsletter quite religiously. I actually follow a few newsletters, I'm going to be honest with you. One is TLDR, which is like a partner in the conference. And that's for stuff where it's like the latest news, right? So you get TLDR AI. But then when I want to learn stuff, I open your newsletter, because you go into such depth that I am so appreciative. Every time that I read through what you are cooking up, it is like, this is incredible.
Demetrios [00:00:50]: And so you were kind enough, when I just replied to your newsletter and asked, do you want to talk about some of this that you're creating in the newsletter? You want to give us a talk? You were like, dude, I got it. Let's try and explain ChatGPT. And you originally said what, like 20 minutes? And then I was like, can you do it in ten? You're like, oh, challenge accepted. So 10 minutes on the clock. I will leave a link to your newsletter in the chat for anybody else that wants to subscribe and get smarter, and let you start talking about ChatGPT in 10 minutes, my man.
Cameron Wolf [00:01:28]: All right, sounds good. Let me try and share my screen here.
Demetrios [00:01:32]: There we go. That's the fun part. Yeah, it is on. All right, sweet. See you in 10 minutes.
Cameron Wolf [00:01:40]: Awesome. So, hey, everyone, my name is Cameron Wolf, and thanks for attending my talk today. Basically, we're going to hear probably a lot about large language models in today's conference. So the idea behind this lightning presentation is to just give a brief and understandable overview of how LLMs work. And this presentation has gone through a lot of iterations. The first one I gave was like 2 hours long, then it got cut to 1 hour, then 25 minutes, and now we're doing it in 10 minutes. So this is going to be pretty high level, but we're going to try and get a good level set on how ChatGPT and language models work today. And for anybody who's interested, there's a written form of this talk.
Cameron Wolf [00:02:23]: If you scan the QR code here, there should be a link to a written version of this talk for anybody who's looking to dive into more details. So the first thing is, why is this topic interesting? And the reason is that we've seen a massive breakthrough in LLM quality, right? So if, a couple of years ago, I asked GPT-2 to give me a description of what I should cover in a presentation on LLMs, I would get a response that's pretty repetitive and uninteresting. But if I ask the same question to ChatGPT, it gives a super detailed and accurate answer. So the question here is basically just, what happened? And why is ChatGPT so good? What we're going to do today is answer this question in 10 minutes, while still maintaining some of the depth. So how did we go and make this a lot better? What happened from GPT-3 to ChatGPT? To explain this, I use a three part framework, which just includes transformers, pretraining, and alignment. And what we'll do is walk through each one of these one by one, and then conclude by talking about how they fit together. So the first thing we'll talk about is the transformer.
Cameron Wolf [00:03:41]: At a high level, the transformer is just a type of neural network architecture, and it's used by nearly all modern large language models. The model takes text as input, and it produces text as output. And it has two parts, the encoder and the decoder. Typically, what the encoder does is it takes the text that you give it as input and builds an understanding of this text. And then the decoder is in charge of producing textual output based on the representation produced by the encoder. The only catch is that modern generative LLMs are solely generating text, so they only use the decoder, forming a decoder-only architecture. So we're really just using this part of the transformer, which is really good at generating text.
Cameron Wolf [00:04:28]: And within each layer of the decoder-only transformer, there are two things that happen. First, for each word within the textual input, we're going to allow that word to look at all of the prior tokens (tokens are basically words). This is masked self-attention: we're allowing each word to look at all the words that come before it. And then after that, we do a feed forward transformation, which is just individually transforming each token, right? So by combining these two things together, we can craft a rich representation, because we're both looking at other words in the sequence as well as looking at each individual word. So we can understand the words in a sequence and their relationship, allowing the transformer to produce a useful output.
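To make those two operations concrete, here is a minimal sketch of a single decoder-only transformer block, written in PyTorch. This is not the code behind ChatGPT or any specific model; the layer sizes, normalization placement, and the choice of `nn.MultiheadAttention` are all illustrative assumptions.

```python
# A minimal sketch of one decoder-only transformer block in PyTorch.
# Layer sizes and normalization placement are illustrative assumptions,
# not the internals of any particular LLM.
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        # Masked self-attention: each token may only look at prior tokens.
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Feed-forward transformation applied to each token individually.
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):  # x: (batch, seq_len, d_model)
        seq_len = x.size(1)
        # Causal mask: position i cannot attend to any position after i.
        causal_mask = torch.triu(
            torch.full((seq_len, seq_len), float("-inf")), diagonal=1
        )
        attn_out, _ = self.attn(x, x, x, attn_mask=causal_mask)
        x = self.norm1(x + attn_out)    # look at prior words, then residual + norm
        x = self.norm2(x + self.ff(x))  # transform each token, then residual + norm
        return x
```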
Cameron Wolf [00:05:12]: So that's the transformer. Next is pretraining. And luckily, the pretraining process is pretty easy to understand. All of LLM pretraining is based upon next token prediction, which just trains a language model over a bunch of raw text, teaching the model to predict the next word given the previous words as input. So we can see an example here. With the decoder-only transformer, we give it some input, which is just "LLMs are cool", and then we train it to predict what comes after that sequence. In this case, what comes after it is just a period, because this is a standalone sequence. But that's literally all that next token prediction is.
Cameron Wolf [00:05:50]: So we're taking some text as input and trying to predict what comes next. And the cool thing is that this is a self-supervised objective, meaning that we can just download tons of raw text from the Internet, and the label that we're trying to predict is already present in the text. It's just whatever word comes next, right? So we can download a massive corpus of raw text from the Internet, and then just perform this pretraining objective with really big models over that massive corpus. And that is a huge key to getting these models to perform well. It's kind of what made the difference between GPT and GPT-2 versus GPT-3: we use big models and big datasets and then train them with next token prediction. And interestingly, just a little side note, this is actually how we generate text with an LLM as well, right? We still use next token prediction. What we do is we just autoregressively predict the next token, add that token to the input sequence, predict the next token, and so on. So as we see in this figure here, we're just continually adding another word by performing next token prediction.
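As a rough illustration of next token prediction and the autoregressive generation loop described above, here is a minimal sketch, assuming a stand-in `model` that maps token ids of shape (batch, sequence) to logits of shape (batch, sequence, vocabulary). It is not tied to any specific library or to the exact training setup of the GPT models.

```python
# A minimal sketch of next-token-prediction training and greedy autoregressive
# decoding. `model` is a hypothetical decoder-only LLM: token ids -> logits.
import torch
import torch.nn.functional as F

def next_token_loss(model, tokens):
    # Inputs are every token except the last; targets are the same tokens shifted
    # by one, so the label at each position is simply "the word that comes next".
    inputs, targets = tokens[:, :-1], tokens[:, 1:]
    logits = model(inputs)  # (batch, seq_len - 1, vocab_size)
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)), targets.reshape(-1)
    )

@torch.no_grad()
def generate(model, prompt_tokens, max_new_tokens=20):
    tokens = prompt_tokens
    for _ in range(max_new_tokens):
        logits = model(tokens)                                   # predict over the whole prefix
        next_token = logits[:, -1].argmax(dim=-1, keepdim=True)  # greedy pick of the next word
        tokens = torch.cat([tokens, next_token], dim=1)          # append it and repeat
    return tokens
```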
Cameron Wolf [00:06:56]: So the last step in understanding something like ChatGPT is the alignment process. Basically, during alignment, we train the LLM to produce text that aligns with the desires of a human user. And there are a lot of ways we could talk about what it means to align an output with a human user. But in the AI community, typically what this looks like is that we define a set of alignment criteria, like helpful, harmless, truthful, and so forth, and then we have humans judge outputs against those criteria, right? So they could write outputs from scratch, or they could rank outputs and say, this one is better than this other output because it's more helpful, and so forth. But we're always acting on these alignment criteria and trying to train the model so that it actually captures these alignment criteria within its output. To make this happen, though, there are two types of fine-tuning that we perform. One of them is supervised fine-tuning, which is pretty simple.
Cameron Wolf [00:07:57]: We just collect a dataset of good examples that we think align well with what we want and then fine-tune the model on these examples. And the other one is RLHF, which basically has the model produce outputs and then has human users rank which of two outputs in a pair is better. And then we just fine-tune the model to mimic whatever happens within the output that we think is better. And we might ask, why do we need two different alignment techniques? The basic reasoning is that, for SFT, it's kind of hard to annotate data. If we want good examples of outputs that the model should reproduce, we have to ask humans to write those outputs, which can be quite difficult. Writing good examples of highly aligned text from scratch can take a lot of time, whereas with RLHF it's actually pretty easy. We just have the model generate two outputs, and the human can simply choose which one of them is better. And the model can learn from that simple ranking process, which, from a cognitive perspective, is way easier for a human to annotate than writing text from scratch.
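As one concrete piece of the RLHF recipe described above, here is a minimal sketch of the pairwise comparison loss commonly used to train a reward model from those human rankings. `reward_model` is a hypothetical scorer that maps a tokenized response to a single scalar, and this Bradley-Terry style objective is an assumption about the general setup, not the exact recipe behind ChatGPT.

```python
# A minimal sketch of the pairwise comparison loss for a reward model in RLHF.
# `reward_model` is a hypothetical scorer: token ids -> one scalar score per response.
import torch.nn.functional as F

def comparison_loss(reward_model, chosen_tokens, rejected_tokens):
    r_chosen = reward_model(chosen_tokens)      # scores for the responses humans preferred
    r_rejected = reward_model(rejected_tokens)  # scores for the responses humans rejected
    # Push the preferred response's score above the rejected one's.
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```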
Cameron Wolf [00:09:03]: Right? So typically what we see with a lot of modern LLMs is that they get a smaller dataset for SFT, maybe like 20 to 30,000 examples. We fine-tune the model over that, which gives a good starting point for RLHF, and then we collect a bunch of comparison data and further align the model using RLHF, because the annotation process is a little less expensive and time consuming. And with that, that actually completes the three part framework, right? So the LLM revolution started when we began to use the transformer architecture. Before that, we had things like ULMFiT and RNN-based language models. Once we started to use the transformer architecture, we got things like GPT and GPT-2. By using bigger models and pretraining with next token prediction over more data, we get all the way to GPT-3. And then the big difference with ChatGPT is that we got really good at this alignment process, right? So SFT and RLHF: we're teaching the model to produce outputs that we like, and we define what we like by laying out some alignment criteria that describe what a good output would be. So thank you for listening to this talk. Pretty cool. Finished it in 9 minutes, which is a personal record, but feel free to scan this QR code if you want more details. There's a written version of this overview that you can take a look at.
Demetrios [00:10:27]: Excellent dude. Awesome. Wow. Powered through that one. Congratulations.
Cameron Wolf [00:10:34]: Not bad. Personal record.
Demetrios [00:10:36]: Personal record. He PR'd it on this special day. That is so cool. So thank you so much, Cameron. I mentioned it before, I'll mention it again: subscribe to this guy's newsletter, you will get smarter for it, and I think a few people already said in the chat that they will. If you want to ask Cameron any questions, throw them in the chat. I think you're on the platform too, so he can answer some questions that are coming through there, and I will be seeing you in my inbox.
Demetrios [00:11:07]: Man, thanks so much.
Cameron Wolf [00:11:08]: Sounds good. Thanks, Demetrios.