The Next Million AI Apps: Customization From the Ground Up Using Fine-Tuning and Self-Refinement.
Mark is a co-founder and Chief Architect at Gradient, a platform that helps companies build custom AI applications by making it extremely easy to fine-tune foundational models and deploy them into production. Previously, he was a tech lead in machine learning teams at Splunk and Box, developing and deploying production systems for streaming analytics, personalization, and forecasting. Prior to his career in software development, he was an algorithmic trader at quantitative hedge funds where he also harnessed large-scale data to generate trading signals for billion-dollar asset portfolios.
Access to foundational models is at every developer's fingertips through commercial solutions or open source. However, these models are often not competent enough on their own to perform specialized tasks, and differentiation becomes more challenging in this world. We'll walk you through how to develop custom models using fine-tuning and data-driven techniques such as self-refinement to create differentiated AI products that solve problems that were previously out of reach.
We're glad to have you on board, and of course to have Mark on board. Mark is the co-founder of Gradient, and he's going to be talking about fine-tuning of LLMs, basically everything you need to know about fine-tuning, especially with Gradient of course.

Before we go on, there's a GitHub repository linked in the chat. It will help you with the walkthrough, so open it in another tab. During the workshop it's going to be the main reference repository as we walk through things with Mark.

Mark is going to share a bit more about what Gradient does during his talk, and we're excited to have him on board again. A few housekeeping rules: if you have any questions or concerns during the workshop, please leave them in the chat, and we're happy to respond.

Mark will also check in occasionally, or toward the end of the workshop, to see that your questions get answered. If you have any concerns at all, just put them in the chat as well, and we'll be glad to support you as you go through the workshop.
Thank you, and have a great time. Great, thanks Steven for the introduction. Today's workshop will primarily involve figuring out how to get the data to actually fine-tune on, with two important parts: the quantity and the quality of the data, and then actually running a fine-tuning job

to fine-tune a fairly large model. I want to mention a couple of things about the workshop itself. Everybody should be able to run through the prompt engineering portion, where we dig into a dataset, manipulate it, and synthesize more data. However, actually fine-tuning the model may not be accessible to everyone, because, given the assumptions I'm making about your runtime, you're going to need some pretty beefy GPUs to do that.
For that section, you won't be able to run all of the scripts I'm going to show, mostly because you'll need at least 40 gigabytes of GPU VRAM to do so. That being said, if everybody could clone the GitHub repo so you have it locally.

The first thing you should do is run poetry install inside the repository to get all of the library dependencies you need for the workshop. Before that: I'm going to be going back and forth between slides and my code, interleaving the two sections, presenting some of the motivation and the theoretical foundations you need, and then digging into the code to deliver actual value out of the fine-tuning exercises.
That being said, maybe I'll start off by introducing myself. I'm Mark, co-founder of Gradient, where I lead the AI practice. I think we're in a very interesting time, particularly with all of the language models being open sourced, as well as all of the new commercial language models coming out from the larger enterprises we see today:

OpenAI, Cohere, et cetera. In this talk, what we're really going to do is think about how to actually leverage language models, in particular open source language models, and customize them for our own purposes. The motivation is: how do we launch the next million AI apps? Gradient, the company I founded, is trying to democratize access to language models and allow developers to embed language models into their production applications.

We're going to be launching our product in the summer, so everybody should have access to it, and hopefully what we want to present is the easiest solution to fine-tune and serve your models. So, to start, some motivation on the different strategies and approaches you want to take in terms of which models to choose.
Typically, the modeling approaches you have are: closed source models, which everybody knows, like GPT-4 and Google's competitor PaLM 2. You can also take open source models that are fine-tuned, such as StarCoder and Replit; those are two coding examples. And then we have the smaller specialized models, the traditional models that showed up prior to 2018 that everybody is pretty familiar with.

We're not going to go into those, mostly because what we really want to be able to do is generative AI. Typically, when customers come to us, they ask about wanting to fine-tune and customize their own model. OpenAI is best in class, and certainly you can go to it and think, hey, why don't I just have my own GPT-4-for-X model: GPT-4 for legal, GPT-4 for finance, all those different domains.

That's actually not possible. Fine-tuning is really only available for their base models, so most of those are the text-davinci flavors, and you're able to hit their fine-tuning APIs. But these are not necessarily the best-of-breed models; they're just the models they've made openly available. Oh, sorry, I didn't realize I wasn't screen sharing.
Sorry. Going back to the slides, apologies for that. We have a few modeling approaches here, as I showed. In the backdrop we have OpenAI, and we have a lot of the open source fine-tuned models. We're going to be focusing on Flan-UL2 today. It's a pretty powerful model, 20 billion parameters, openly available with an open license, and it has been shown to be quite useful for NLP tasks.

So I've presented why you can't really use OpenAI for this: you can only use their older models and fine-tune those. GPT-4 is just a black-box model that, at best, you can do prompt engineering around to get the behavior you want, but it's not customizable in the way you actually want for enterprise purposes.
With respect to why you want custom language models versus closed source models, the important aspects are model ownership, having your intellectual property encompassed in the modeling pipeline, as well as meeting customer SLAs. You have to meet an SLA for your customer, and your customer doesn't really care that you're relying on OpenAI if OpenAI is down for two hours.

You also want domain expertise in your custom models. Everybody wants to be a master of their domain and be an expert there, so you're seeing a lot of language models pop up that are specifically designed for a particular set of downstream tasks. And finally, you have security and privacy.

We work with a lot of clients who want models governed within their virtual private cloud or on-prem, and they have fairly restrictive policies around where data can be moved in and out of. That basically precludes them from ever using OpenAI or the closed source foundational models.
Just to preface what we're actually going to be focusing on in terms of supervised fine-tuning: for the rest of the workshop we're going to be doing prompt-based fine-tuning. There are other types, like multitask, few-shot, and domain-specific, but the recent literature and the recent set of released technology have shown that instruction fine-tuned models tend to be the best, because they can be zero-shot learners for tasks and they understand what you want them to do.

In motivating this, I'd like to delve into what we need to think about. Can we actually just take a model and immediately fine-tune it on anything? What you actually need to do is start off by selecting your task properly, in order to figure out what proxies and what types of datasets you need.

The task framework we use at Gradient is to think about things as either knowledge-based tasks or reasoning tasks. For the knowledge-based tasks, you can think of wanting a custom model that's really proficient at the tasks you'd consider pretty basic and necessary, such as named entity recognition or basic Q&A, or one that's really well positioned for complex tasks such as those you'd expect from a knowledge worker.
That's actually one of the more interesting use cases appearing today. On the reasoning side, we usually break things down between coding and math; on the coding side, you either want to generate new code or explain a set of code. I'd encourage everybody to look at HELM, an open source initiative from Stanford University that's trying to create a living, breathing set of

evals for comparing across multiple different language models. It's becoming the standard for benchmarking these models and figuring out how they actually do. We can go to the website right now and take a look. In terms of the results for question answering, you can look at the top, and it's no surprise that text-davinci is the best in the few-shot and zero-shot results.
You can use that as a proxy for the things you actually want to accomplish with your language model. Now, what are the actual challenges that come up? This is the important part, and we're going to jump into the notebook in a second. Consider the scenario where you have your base model.

You typically have two different directions you can go, and usually you're choosing between quantity and quality in terms of your data. On the quantity side, you can reach for synthetic data; on the quality side, you can have humans annotate and label the data.

On the quantity side, the main challenges are having the wrong format, data that isn't instruction data, or having poor quality data that's misaligned with your task. On the quality side, you'll usually see that it's expensive, so either you have too few examples or you've spent way too much money on a lot of examples.
So how do we get around that? What you actually need is a data synthesis pipeline that generates both high quantities of data and high quality data. To motivate the synthesis, I want to present a prototype of a data synthesis pipeline that my research team has worked on.

The diagram I'm showing is what we're going to walk through in the notebook. We start with a prompt generator, which generates prompts; a prompt is just a string that's sent into a large language model. The inference module itself is a language model.

What the language model does is produce completions, or responses, as they're also called. The top section, from prompt generator to inference module to completions, is what we'd typically call prompt engineering. Done by hand, that's an iterative process with a human tinkering with the language, format, or template of the prompt.
To make this scale, to make it an actually automated process, you send the completions into a response post-processor. One example is self-consistency: you generate a number of completions, and the post-processor filters the responses based on a majority vote,

where the majority vote just asks which answer is the most common. That doesn't involve verification yet, but what we want is to get at the reasoning path. Given the filtered responses, we can move them into an evaluator, which is a reward model that decides whether to run the loop again through the language model to improve the data, or

whether the set of generated data is already useful enough to add to the synthetic dataset. So it's a recursive loop. If you want to do further reading on data synthesis, we have a really good blog post from one of our scientists that covers a wide range of the research and how to delve into it.
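For reference, here is a minimal sketch of the loop just described, not the workshop's actual code: the four callables stand in for the prompt generator, the inference module, the self-consistency post-processor, and the evaluator, and all of their names are hypothetical.

```python
# A minimal sketch (not the workshop code) of the data-synthesis loop described above.
# The four callables mirror the diagram: prompt generator, inference module (the LLM),
# self-consistency post-processor, and evaluator / reward model.
from typing import Callable, Iterable, List

def synthesize(
    questions: Iterable[str],
    build_prompt: Callable[[str], str],
    generate: Callable[[str], List[str]],        # returns several completions per prompt
    majority_filter: Callable[[List[str]], List[str]],
    is_diverse_enough: Callable[[List[str]], bool],
    max_rounds: int = 3,
) -> List[dict]:
    dataset = []
    for question in questions:
        for _ in range(max_rounds):
            completions = generate(build_prompt(question))   # prompt generator -> inference module
            kept = majority_filter(completions)              # drop non-majority answers
            if is_diverse_enough(kept):                      # evaluator: accept or loop again
                dataset.append({"question": question, "completions": kept})
                break
    return dataset
```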
That said, let's move over into the notebook itself. If everybody can go into the prompt engineering notebook, I'll run through some of the code with you and explain what's going on in each step. With that said, if anybody has had any problems

installing the dependencies, you can send that into the chat and we can figure it out. If not, I'll gauge from the hands up or thumbs up whether everybody has gotten their environment installed correctly.
It sounds like some people are having problems installing dependencies. The first thing you have to make sure of in your Poetry environment is that you change the Python version to the version you have running, so one thing you can do is just check your Python version.

I've tested this on Python 3.9 and 3.10; I'm not exactly sure whether 3.8 will work in this environment.

Cool. If that's the case, you can create some issues in the public repo and we can improve the set of dependencies or figure out what's going on. Just for the sake of time, I think we can move on.
And I appreciate everybody who's interacting in the chat and helping other folks in the workshop figure out how to install dependencies. Typically, what I like to say is that the only thing standing in the way of generative AI is Python dependency resolution.
So, that being said, let's go into the prompt engineering notebook. I've already uploaded the set of data you can work with here. You may have to change your constants file, because your home directory or the path where you cloned the repo may not be set up this way.

But if you have that all running, you should be able to download the dataset. Cool, it seems like most people are getting the dependencies installed correctly; my bad on the Python versioning. So remember, the motivation here is that we want a prompt generator to generate prompts and feed them into a language model.
We're on that step right now. Just run some of the dependency imports here. What I have is a set of math questions and answers, and the entire goal is for the language model to produce answers to questions that we don't already have answers for.

I'll repeat that: this file is question-answer pairs, and we want the language model to generate answers for another file in the datasets that contains math questions only. To set expectations, I'll motivate this by looking at one of these problems. I'm choosing the easiest problem, because reading out the other problems starts to make me as confused as I was during the SAT.

It says: James decides to run three sprints three times a week. He runs 60 meters each sprint. How many meters does he run a week? The answer is three sprints, three times a week, which equals nine sprints, and then nine times 60 for the 60 meters, so 540 meters. You'll notice that the answer actually provides a reasoning path.

So, back to what we wanted: we wanted the answers to be provided step by step. A lot of the literature talks about chain of thought; you can look up all the papers written about that, but that is basically what we want. We want to get the rationales, and have the language model learn the rationales, to produce the correct answer on an unseen question.

The next part: if you run this cell, we're going to sample and create our pool of exemplars to feed into a prompt. At the software level, it's all just string concatenation, and it produces a list of exemplars.
We chose eight exemplars because that tends to be the standard in the literature; if you choose too many, you start to... Sorry to interrupt, can you hear me? Yes. Can you zoom in a bit so it's easier to see your screen? Sure, sure.

Perfect, sorry about that. So this should be a list of eight now. Interesting, let me see if somebody has questions here.

Looks like a Poetry install issue. For those who have the environment installed, sorry if some folks don't, you'll get a list of eight question-answer pairs here. In the next cell, if everybody runs it, we take the math questions we want answers for and pair them with a meta-prompt.
Basically, we're telling the language model what we want it to do: solve the math problem step by step and provide an answer. What do I expect coming out of it? The actual set of instructions at the top for the language model, then a set of eight question-answer (or problem-answer) pairs,

with an answer left blank at the very end. The question we actually want answered is this one: Dax went to their farm to pick some apples and found half as many bugs as ants in the garden. There are 50 ants; calculate the total number of insects in the garden. We expect the answer to be produced from here.
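As a rough illustration of the string concatenation being described, here is one way the few-shot prompt could be assembled: an instruction, eight sampled exemplars, and the new question with its answer left blank. The function and variable names are assumptions, not the notebook's exact code.

```python
# Rough sketch of building the few-shot prompt by plain string concatenation:
# instruction + sampled exemplars + the new question with its answer left blank.
import random

INSTRUCTION = "Solve the following math problem step by step."

def build_prompt(exemplars, new_question, k=8, seed=0):
    # exemplars: list of {"question": ..., "answer": ...} dicts
    rng = random.Random(seed)
    sampled = rng.sample(exemplars, k)
    blocks = [f"Question: {ex['question']}\nAnswer: {ex['answer']}" for ex in sampled]
    blocks.append(f"Question: {new_question}\nAnswer:")
    return INSTRUCTION + "\n\n" + "\n\n".join(blocks)
```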
Here I'd ask everybody to change the base model if you're running locally; I'd suggest google/flan-t5-small, which is probably one of the smaller models you can download and actually run locally. In my case, I want to show you why you need a really powerful model.

It would be interesting to see; maybe some folks can post the responses for what gets generated. Here I'm just loading Flan-UL2 into memory, and to be frank, it's going to take a little bit of time. It's a 20-billion-parameter model, so at half precision, which is two bytes per parameter, it's going to occupy about 40 gigabytes of memory.

If you're running on CPUs, it won't be as much of a problem. The main problem is that if you're running on GPUs, you're going to get CUDA errors if you don't have enough memory to support this model. While it's loading the shards, I'll look at the next set of cells, the code to generate the completions.

What I'm doing there is providing the configurations for the model to break down the prompt string I have and then generate the set of responses I want. As you can see, the shards are loaded, and I'm going to run the inference module part of our pipeline.
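For those following along, this is roughly what the loading-and-generation step looks like with Hugging Face Transformers; the exact generation settings in the notebook may differ, and the prompt string below is a placeholder. Swapping the model name to google/flan-t5-small should let it run on modest hardware.

```python
# Approximate shape of the inference-module step with Hugging Face Transformers.
# At half precision (~2 bytes/parameter), Flan-UL2's ~20B parameters need ~40 GB for weights.
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "google/flan-ul2"          # use "google/flan-t5-small" on modest hardware
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,         # half precision
    device_map="auto",                  # spread the shards across available GPUs
)

prompt = "Solve the following math problem step by step.\n\nQuestion: ...\nAnswer:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```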
The inference module is the language model itself, so give it a little bit of time; LLM inference speed is a known problem today, so it will probably take a few seconds. Cool. If we look at the response: if there were half as many bugs as ants in the garden, and there are 50 ants, there are about 25 bugs, and the total number of insects

equals 25 plus 50, answer 75. Let's go back to the question; I just want to double-check that it's sensible. You're trying to get the number of insects in the garden, and that's actually fairly sensible, right? It not only provides the correct answer, 75, it also provides, step by step, the intuition or rationale for figuring out the answer.

Just to bring us back to where we are in the pipeline: we've created a prompt, we sent it into the inference module, and we've generated responses. In our case, we're just generating one response. In actuality, when we at Gradient are generating new data internally, we have to run this language model 32 times per question to generate a set of around 40,000 data points.

That would take a while, and we don't really have hours to spend on this particular exercise, but this gives you a sense of what you need to run in this pipeline to get a single answer with rationales out of it. Next, we have a contrived example, mostly to demonstrate the self-consistency element.
Suppose we had four answers generated from the pipeline above. Here is a set of four answers with their reasoning paths. The whole point is to take a majority vote: which answer is most common? That involves pre-processing the strings and then filtering out the non-majority answers.

You get the majority answer, which is typically the right answer if you have a powerful enough language model. It would be interesting to see, when you run this with Flan-T5 small, whether it just outputs nonsensical completions because it's not powerful enough; that would be my presumption.

So here is the list of completions we're working with, and you'll notice that 24 is clearly the correct, or at least the majority, answer. All we're doing is taking the list of completions and building a dictionary that separates the completions with an answer of 24

from the completion with an answer of 34, so it should be a three-to-one ratio in the dictionary produced here. That's the step where we finally move into self-consistency: we have the set of answers and we just want to filter them. Looking back at the post-processor, we get the filtered responses.
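A simplified version of that majority-vote step might look like the following; the answer-extraction regex and the example completions are illustrative assumptions, not the notebook's parsing.

```python
# Simplified self-consistency filter: pull the final number out of each completion,
# take a majority vote, and keep only the completions that agree with it.
import re
from collections import Counter
from typing import List, Optional

def extract_answer(completion: str) -> Optional[str]:
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)
    return numbers[-1] if numbers else None        # assume the last number is the answer

def self_consistency_filter(completions: List[str]) -> List[str]:
    answers = [extract_answer(c) for c in completions]
    counts = Counter(a for a in answers if a is not None)
    if not counts:
        return []
    majority, _ = counts.most_common(1)[0]
    return [c for c, a in zip(completions, answers) if a == majority]

# Illustrative completions for the garden problem above (three agree on 75, one does not).
completions = [
    "Half of 50 is 25 bugs, so 25 + 50 = 75. The answer is 75.",
    "50 / 2 = 25 bugs, and 25 + 50 = 75. The answer is 75.",
    "There are 50 + 50 = 100 insects. The answer is 100.",
    "25 bugs plus 50 ants gives 75. The answer is 75.",
]
print(self_consistency_filter(completions))        # keeps the three completions answering 75
```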
In this case, we began with four responses and end up with three. That's the whole point: how do we filter out the nonsensical completions whose reasoning paths we don't want, or that just aren't useful for us? Now, the final part of the pipeline is the evaluator,

and this is the trickiest part. The important intuition is that we want a diversity of responses to increase the complexity of the dataset. ROUGE-L is a statistical measure of the amount of overlap between strings. What you'll see when I run this cell is that we're taking a candidate and a reference set of sentences.

They both semantically mean the same thing, but we're comparing them to figure out which one has the least amount of content for the most amount of semantic meaning. The point is that language models want more diversity in their set of reasoning paths in order to learn better.
This is the evaluation step, where we're choosing the set of completions we actually want to keep in our synthetic dataset. If you run this cell, we get the average pairwise ROUGE-L scores. Each element of the list here is the ROUGE-L score of that sentence compared against every other sentence, averaged,

and then there's the average of the averages, so it's pretty meta. Finally, running this cell, we get what we expect, and there's an important distinction here. This sentence says, "Hi, it's nice to meet everyone." What are we comparing it against, and why did we keep that particular sentence?

There's "Hi, it's nice to meet everyone" versus "Hi everyone, it's nice to meet." It's a little hard to catch the difference, but the candidate sentence here has a bit more fluff than the reference one, which is shorter. The reference one actually contains more meaning per token, which is "it's nice to meet,"

whereas the one above says "it's nice to meet you as well." So you're taking the substrings, figuring out which parts overlap between the two, and creating a metric that captures diversity between them. In this case, we're filtering out the ROUGE scores that are too high, because a really high ROUGE score means everything is clustered together and we're not getting good diversity in our dataset.
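One way to reproduce that diversity check, assuming the rouge-score package, is to compute each sentence's average ROUGE-L F-measure against the others and drop the ones that score too high (too similar); the threshold here is a made-up value.

```python
# Average pairwise ROUGE-L as a diversity signal: high scores mean the completions
# overlap heavily, so we filter those out to keep the dataset diverse.
from typing import List
from rouge_score import rouge_scorer

def avg_pairwise_rouge_l(sentences: List[str]) -> List[float]:
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    averages = []
    for i, candidate in enumerate(sentences):
        others = [s for j, s in enumerate(sentences) if j != i]
        scores = [scorer.score(ref, candidate)["rougeL"].fmeasure for ref in others]
        averages.append(sum(scores) / len(scores) if scores else 0.0)
    return averages

sentences = [
    "Hi, it's nice to meet everyone.",
    "Hi everyone, it's nice to meet.",
    "It's nice to meet you as well.",
]
scores = avg_pairwise_rouge_l(sentences)
THRESHOLD = 0.7                                    # hypothetical cutoff
kept = [s for s, r in zip(sentences, scores) if r <= THRESHOLD]
print(scores, kept)
```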
That's what we're doing here, and with the ROUGE evaluator we can now see the ROUGE scores for the whole dataset. I'll move on, because for the sake of time I want to get into the fine-tuning part. Everybody can use this notebook to play around with and send your own completions to; it's particularly well suited for math right now.

Before everybody got here, I also already generated a bunch of GSM8K synthetic data: I take the original GSM8K data, synthesize more data on top of it, and improve it for the Flan-UL2 model.

That brings us into the next step of the workshop. I think everybody across the entire community is very interested in supervised fine-tuning. We've now demonstrated that we can take data and get both larger quantities of data and higher quality data.
But I want to motivate our discussion by presenting an example of why it's important to count the amount of memory you have at your disposal. When you do naive data parallelism, let's do some back-of-the-envelope calculations. If you have a 15-billion-parameter model, you have to hold two bytes

per parameter for the model in half precision. You also have to keep a master copy of the parameters in full precision, which is four bytes per parameter. Then, if you're using Adam, which is the workhorse optimizer for basically all deep learning gradient descent training, you have to keep its two state variables, the momentum and the running average of the squared gradients, in memory in full precision as well.

And then you need another four bytes per parameter for all the gradients. This doesn't even include activations, but you're already running at a total of 270 gigabytes of VRAM. You can look up the cost of A100s online right now; they're pretty expensive. So we need a way of really reducing the memory footprint of these models.
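For concreteness, the back-of-the-envelope arithmetic behind that 270 GB figure looks like this (activations excluded):

```python
# Naive mixed-precision fine-tuning with Adam, per-parameter byte counts as quoted above.
params = 15e9
half_weights   = 2 * params    # fp16/bf16 working copy of the model
master_weights = 4 * params    # fp32 master copy of the parameters
adam_states    = 8 * params    # fp32 momentum + running average (4 bytes each)
gradients      = 4 * params    # fp32 gradients

total_gb = (half_weights + master_weights + adam_states + gradients) / 1e9
print(f"{total_gb:.0f} GB")    # -> 270 GB, before counting any activations
```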
What does the breakdown of memory actually look like? You can expect roughly 25% in the activations, 25% in the model parameters, and 50% in the optimizer. I can illustrate that in a second, but why does it matter? You have a huge language model where all of these things are sitting in memory, and what's the whole point of attention?

Attention is trying to provide more context. If you remember how attention is defined, it's like a cross product over all of the context you pass in. The context is just the huge string, the prompt you're giving the language model, and you're relating every token in that prompt to every other token.

That's why sequence length is a real driver of memory usage: even though you can increase your batch size, the moment you increase your sequence length, the memory footprint goes up quadratically. So I took the math GSM8K dataset and plotted a distribution of the token lengths here.
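A plot like that could be produced along these lines; this sketch pulls the public GSM8K dataset from the Hugging Face Hub with an arbitrary compatible tokenizer, whereas the plot in the talk was over the synthesized variant, so the exact shape will differ.

```python
# Sketch: tokenize each GSM8K question+answer and plot a histogram of token lengths.
import matplotlib.pyplot as plt
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-small")   # any compatible tokenizer
dataset = load_dataset("gsm8k", "main", split="train")

lengths = [
    len(tokenizer(example["question"] + " " + example["answer"]).input_ids)
    for example in dataset
]
plt.hist(lengths, bins=50)
plt.xlabel("tokens per example")
plt.ylabel("count")
plt.show()        # most examples fall under roughly 256 tokens
```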
Luckily, we're okay in our case, because most of the examples, as you can see, are less than about 256 tokens long. That's going to come back to us in a second. Let's take a look at the dataset itself: open up the notebook for the GSM8K pre-processing. If everybody is able to go to this notebook, I think it's useful to dig into the data itself.

If we run this cell, you'll see... in this instance you'll have to change the model path. If you want to use a different model, such as the Flan-T5 small that you maybe downloaded previously, this is basically the line you'll have to change.

I already have the data downloaded, but if you don't, you can change the dataset in the data files to a different data path. In this case, it's the math GSM8K dataset; if you look it up on Hugging Face Datasets, you basically just point to that path to download the dataset into memory.

But let's take a look at the data itself, and then consider why I want to cut off the token length in order to make processing and memory usage manageable. You can run it here; I'm going to download the dataset. It's already cached in my instance, so there's not much I need to do. Take a close look at the preprocess function.
The most important parts to note in here are the max lengths. Generally, in matrix multiplication you have to have a consistent length in your tensor; each row can't be a different length, otherwise it's a weirdly shaped matrix.

In our case, I want to cut each example down to 256 tokens at most. For the ones that are too short, that's why I have padding: I'm going to pad them, and you're going to see a bunch of, I think, ones for the padding token, or negative 100s in the labels. For the ones that are too long, I'm just going to cut them off.

You could use a different pre-processor where you pack the tokens into one long sequence instead of cutting them off, but that's not really within the scope of this discussion. So I'll just run this, and using Hugging Face Datasets I can tokenize the dataset really quickly.
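The preprocess function being described is roughly of this shape, reusing the tokenizer and dataset objects from the sketch above; the exact field handling in the notebook may differ.

```python
# Approximate pre-processing: truncate to 256 tokens, pad shorter examples, and
# replace padding positions in the labels with -100 so the loss ignores them.
MAX_LENGTH = 256

def preprocess(batch):
    model_inputs = tokenizer(
        batch["question"], max_length=MAX_LENGTH, padding="max_length", truncation=True
    )
    labels = tokenizer(
        batch["answer"], max_length=MAX_LENGTH, padding="max_length", truncation=True
    )
    labels["input_ids"] = [
        [tok if tok != tokenizer.pad_token_id else -100 for tok in seq]
        for seq in labels["input_ids"]
    ]
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = dataset.map(preprocess, batched=True, remove_columns=dataset.column_names)
```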
Let's take a look at a question. Again, this is the math GSM8K dataset that we synthesized, and we have a bunch of questions here; there are numerous others if you want to look at the first five. That's the raw dataset. Now let's look at the tokenized dataset and see what's going on.

In the tokenized dataset, as you can see, this first question is shorter than the 256-token length. How do I know that? Because, as you can see at the end, in our case the padding token is just 1, so it's adding all the padding into the tensor so that when I pass it into the large language model it

treats those positions as null tokens. If you've heard of end-of-string tokens or padding tokens, that's the intuition behind all of it: you're mapping empty space, or out-of-vocabulary tokens, to a particular numerical embedding or number.
I'm going to go back to the slides for a second to talk about what's actually going on and how we're going to get around the CUDA out-of-memory problem. I'll go over to another part of the workshop where I have a script. You don't need to know the exact details of every part of the script, but if you look at it, it just runs the pre-processing and then runs exactly what I showed on the slide:

naive data parallelism, without LoRA. I have to run this as a script because I need a launcher to do it, but take a look; I want to show you the number of GPUs I have running on this virtual machine. I'm running on four 40-gigabyte A100 cards.

This is an extremely beefy runtime environment that most people just don't have access to, and it's still not going to be enough for me to run full fine-tuning on our synthesized dataset. If you look in the README I provided, you'll see the script you need to run to do this.

Wherever you have the model path, at the end here, that's going to be the local path of the model you want to run; in our case, it's Flan-UL2. I'm going to be using naive parallelism to run this, and as you can see it's going to kick off the job. Oh, sorry, I actually kicked off the correct job, the one that will not have an out-of-memory error; I spelled that incorrectly.
All right, so in this case I'm actually going to run the job. There's a really useful command you can run to profile your GPUs, nvtop, if you have that tool installed. Right here you can already see that it's loading the model shards into memory.

The library I use is Hugging Face Accelerate, to launch this job with DeepSpeed using naive data parallelism, and it pre-allocates the memory I need to accomplish the job. You'll see that it's almost at a hundred percent already while it's just loading in the model, and it's going to hit an out-of-memory error fairly soon.

Someone put in a question: is it using all four GPUs without having to code for parallelism? It is actually using all four GPUs. Accelerate makes that very easy. I'm not going to go into the exact code, but what everybody should note is that the training loop over here already takes care of all the parallelism you need.
The main thing you need to change locally, if you want to run this job, is the number of machines: if you set it to one process, it's just going to run in a single process, assuming you can hold the entire model in memory in your local environment or whatever runtime you're using.

We're using DeepSpeed in this case; I'll just put it back to four processes. Let me check that all the memory from the previous process has been freed. I'm restarting the kernel, so what I want is for this to all go down to zero. Cool.
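If you don't have nvtop handy, a quick way to do that check is to query CUDA directly from Python; a small sketch, assuming a recent PyTorch:

```python
# Print free vs. total memory on every visible GPU (an alternative to watching nvtop).
import torch

for i in range(torch.cuda.device_count()):
    free, total = torch.cuda.mem_get_info(i)      # bytes
    print(f"cuda:{i}: {free / 1e9:.1f} GB free of {total / 1e9:.1f} GB")
```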
We're getting zeros across the board in terms of memory, except for a small amount occupied on each card, which is literally because CUDA itself takes a little bit of memory. Now, if we look at the stack trace of what just happened, you should see a CUDA out-of-memory error.
This is what the deep learning community is always extremely upset about, and why we effectively have a GPU shortage. There wasn't enough memory: it tried to allocate three gigabytes when only one gigabyte was free. And that's the whole story once you change everything here from 15 billion parameters to 20 billion parameters:

you get out-of-memory errors because you're using naive data parallelism with mixed precision and you're not using parameter-efficient fine-tuning. So what's our solution? I'm sure everybody has heard about parameter-efficient fine-tuning in the community these days, but the specific technique we want to use is LoRA, a low-rank adaptation method.

Basically, it assumes you can dimensionally reduce the weight matrix and learn a low-rank incremental update to it, while freezing all of the previous weights in Flan-UL2. I have a previous presentation that goes into this in a bit more detail.
We're going to approach it from a practitioner's standpoint and run the script that does LoRA fine-tuning, so for the purposes of this demonstration I don't need to motivate all the other parts of it. The main thing you need to know is that in the loss function we're optimizing, you can separate out the frozen set of parameters from the low-rank parameters.

You have the pre-trained weights, which we freeze, and we learn the new, low-rank set of weights on the training data that we're going to send to Flan-UL2. A few caveats I want to get across: mostly, this is only applied to the attention layers.
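Written out, the standard LoRA decomposition from the original paper is roughly the following, where the pre-trained weight W_0 stays frozen and only the low-rank factors A and B are trained:

```latex
% LoRA update applied to a frozen pre-trained weight matrix W_0:
h = W_0 x + \Delta W\, x = W_0 x + \tfrac{\alpha}{r}\, B A\, x,
\qquad B \in \mathbb{R}^{d \times r},\quad A \in \mathbb{R}^{r \times k},\quad r \ll \min(d, k)
```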
You may have heard of a newer technique called QLoRA that does four-bit quantization and applies LoRA to every layer; I'm not going to get into that today, it's outside the scope of this demonstration. But for LoRA at least, it's been shown that you can use this technique to fine-tune and perform even better than full fine-tuning.

Full fine-tuning is the case we just showed on the command line, where you hit out-of-memory errors: you don't freeze any weights, you allow all the weights to move around while you optimize, and you have to hold everything in memory. So what should we expect?

I'm going to show you the memory profile of what happens when we do LoRA parameter-efficient fine-tuning, but let's keep a back-of-the-envelope number in mind for the amount of memory that will be occupied using LoRA as opposed to full fine-tuning: it's going to be about a 68% reduction in the memory footprint.
This is when we load the model; as you noticed, even just loading the model's weights gets you toward 80% of the available memory. If we go in here, you'll see we don't have anything in memory yet, and we're going to run the script for the correct LoRA fine-tuning.

In the fine-tune folder, you'll see that you can run the DeepSpeed Accelerate script that does parameter-efficient fine-tuning. What's great about it is that we're just leveraging Hugging Face's PEFT library, and it's essentially a one-line change: if you diff the two files, you'll see that all you actually need to do is add this part.

So you have PEFT. In our case, r is a parameter where the larger the r, the more weights you have to learn and the more memory you use. Dropout is a parameter I don't really change much, and it doesn't make much of a difference for our fine-tuning. Then you just need to call get_peft_model, and what it actually does

is unwrap the object and make sure that torch.no_grad is applied to the set of parameters you don't want to fine-tune. I'll just launch the script, because it'll take a little bit of time. It's actually kind of fun for me, because whenever I start getting angry about CUDA out-of-memory errors, I can memory-profile the run.
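For reference, that change, applied to the base model loaded earlier, looks roughly like this with Hugging Face's PEFT library; the specific r, alpha, and dropout values here are illustrative, not the workshop's settings.

```python
# Wrap the base model with LoRA adapters via PEFT; the base weights are frozen and
# only the low-rank adapter weights become trainable.
from peft import LoraConfig, TaskType, get_peft_model

lora_config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,   # Flan-UL2 is an encoder-decoder model
    r=16,                              # larger r -> more trainable weights, more memory
    lora_alpha=32,
    lora_dropout=0.05,                 # rarely worth tuning much, per the talk
    target_modules=["q", "v"],         # apply LoRA to the attention projections
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()     # only a tiny fraction of the 20B parameters is trainable
```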
So that's useful here. I'm running DeepSpeed with the launcher configs here; this is specific to Hugging Face Accelerate, and I want to note that I'm not using CPU offloading. CPU offloading means that when you're about to run out of memory on the GPU, you move the weights to the CPU, but you take a significant performance hit doing that.

Cool, it seems like things are happening now: weights are getting loaded and occupying all that space. Let's see if it runs successfully.

It's going to take a little bit of time. It has already loaded all the shards into memory and pre-allocated that memory, and then it's going to run a bunch of data through it.

For those who have the luxury of getting more GPUs than I was able to get here, all you have to do again is change the number of processes here; that presumes you have one virtual machine, and in this case I'm running on GCP. While it's going through this whole process, I'll go back and compare the chart of the memory footprint we'll actually see.
When you use LoRA, you're significantly reducing the optimizer's memory footprint. We have different batch sizes here, but these are proxies. The activations again take up a lot of the memory; the activations are the context and tokens you pass through the model in the forward pass.

It's the same thing as saying, "I want the model to generate completions": that accounts for a lot of the memory being taken up by the activations. And then you have about 45% of the memory taken up by the parameters themselves, which is why you see this huge chunk of memory allocated for the parameters.

My GPU utilization is actually not that high, which is not ideal; I should have optimized this a little more. But some tooling and advice for everyone out there, now that it's running: it's at 15 out of 748 iterations in the training cycle, and I'm running DeepSpeed ZeRO Stage 3, which is a type of parallelism that shards the model parameters, the optimizer states, and the gradients across GPU memory.

Even doing that, what we see is that the yellow line here is the memory footprint, how much memory is being occupied at this exact point in time, and the blue line is the GPU computation. The ideal state is for the yellow line to be almost at a hundred percent and the blue line to be tracking it almost at a hundred percent, which means all the memory being occupied at every given moment is being used in a calculation.

That's the whole point of everything you hear about parallelism and parameter-efficient fine-tuning: fit everything on the set of cards you have, and keep all the computation running in memory. If you want to read further on memory usage tricks, we have a blog post out there that gives a really good set of back-of-the-envelope calculations and lets you heuristically determine how much memory you need to train certain models.
That's basically it for what I wanted to chat about. I'm going to kill this process now, but I've set up one more notebook for everybody to look at as well. This is just the code that enables you to take what we just fine-tuned: suppose we finish this fine-tuning job and then load the set of parameters into memory.

In this code, if you run these cells, it loads Flan-UL2 first in this step, and then it loads the LoRA parameter weights that were learned in our fine-tuning job and attaches them to the model. So Flan-UL2 had its original knowledge, and now we're teaching it to perform specifically well on the math reasoning dataset we synthesized before.
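The loading step in that notebook is roughly the following: load the frozen base model first, then attach the saved LoRA adapter with PEFT. The adapter path is a placeholder, not the real output directory.

```python
# Load the frozen Flan-UL2 base, then attach the LoRA adapter weights saved by the
# fine-tuning job. "path/to/lora-checkpoint" is a placeholder for the real output dir.
import torch
from peft import PeftModel
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

base = AutoModelForSeq2SeqLM.from_pretrained(
    "google/flan-ul2", torch_dtype=torch.bfloat16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("google/flan-ul2")
model = PeftModel.from_pretrained(base, "path/to/lora-checkpoint")

prompt = "Solve the following math problem step by step.\n\nQuestion: ...\nAnswer:"
inputs = tokenizer(prompt, return_tensors="pt").to(base.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=256)[0], skip_special_tokens=True))
```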
That's the entire exercise we're going through. With that said, I've rushed through the end of the workshop a bit, but I'd like to open it up for any questions people have.

Hi, Mark. I think it's been a really insightful workshop so far. I don't think there's a question you haven't answered yet; a lot of it has been troubleshooting in the chat.

Seems like it's mostly runtime errors and environment errors, unfortunately, right? Yeah. Okay, I think a few questions are coming in. This question is from Petra: are there limitations on using different GPUs together, say if you're using an NVIDIA H100 and an A100? Is that bad practice?
Using heterogeneous hardware is not bad practice, it's just difficult. A lot of the frameworks like DeepSpeed just use heuristics to move the memory around for your model, but they presume that every single card, every GPU chip, is the same.

So you may have to roll your own or monkey-patch some of the code inside the framework to facilitate that. All right, fair enough. And yes, there's a Slack channel; I'm going to post the link in the chat so you can join the MLOps Community Slack.

I think there's a channel there for the LLMs in Production conference too, so you can join and ask all your questions; I think most of the speakers will be there. And of course, next question: can we get more hardware? Yeah, there are third-party providers out there, like Lambda Labs and CoreWeave, that you can use.
But at this point it's so saturated, honestly, that it's kind of hard to get hardware no matter who you are. Yeah, that makes sense. All the other comments are, in fact, comments on the workshop; I think it was executed really well. And I think that's it.

If you have any other questions, please join the MLOps Community Slack. If you have comments as well, Mark is also in the community, so you can tag him, and if he's available he'd be willing to get to your question. And Mark, could you tell us a bit more about Gradient, if that's something you'd like to do?
Yeah, sure. Specifically, our company is going to be offering APIs for folks to fine-tune and serve models; basically a managed platform. The main point of all that is to show that you probably don't have to go through all this trouble, per model, per instance, per whatever, in order to develop your applications on top of it.

My experience with all this is that the last-mile effort to productionize applications tends to be hard. Most people just want to leverage the models and the completions and know that inference is going to happen really quickly. You noticed that even when we were doing the synthetic generation, it took a little while for Flan-UL2 to generate one set of answers for our prompt.

Now imagine trying to do that 50,000 times; it's kind of hard. So yeah, we're going to try to democratize access to all of these models for everyone and have developers be able to use them. That's awesome. Also, what are the trends with self-hosted
LLMs these days? I think most of the tools are coming out in that space. What's your take on self-hosted LLMs versus using a platform that really streamlines the deployment process? Obviously there's a new stack emerging out of the entire landscape, but personally, it's not my greatest joy to set up all the machines and get exactly the right batch size or sequence length.

There was one more optimization we could have done on Flan-UL2 that I didn't show: you can monkey-patch the attention layers to use FlashAttention, and that will help reduce the memory footprint so you can train with larger batch sizes. But the open source stack, I think, is something the community really needs, because people need to be able to create their own custom LLMs.
And then it's just being blocked by the fact that the infrastructure is too hard to spin up, or you just can't get GPUs at all. Right, absolutely. Thank you so much for sharing. Well, I think that brings today's workshop to a close. Again, if you have any questions, please ping Mark in the MLOps community.

Just chime in at mlops.community; you should be able to see more details there on joining the community, the podcast, and everything else. Otherwise, we wish you a great time for the rest of the conference. Right. Bye. Cool, thank you. Thanks, Mark. Yeah, thanks to everyone for joining. Bye.