MLOps Community

Building RedPajama

Posted Jul 21, 2023 | Views 385
# LLM in Production
# RedPajama
# Together
SPEAKER
Vipul Ved Prakash
Co-founder & CEO @ Together

Vipul Ved Prakash is the co-founder and CEO of Together. He was previously the co-founder of Topsy, a social media search company, and Cloudmark, an anti-spam company. After Topsy's acquisition by Apple, Vipul led efforts around search, federated learning, differential privacy, and AI/ML systems at Apple. Vipul was named to MIT Tech Review's list of Top Young Innovators in the World.

SUMMARY

Creating a new LLM is a difficult and expensive process, and there are several aspects that we need to get right: (1) a broad training dataset, (2) a strong base model, (3) a well-aligned instruction dataset, (4) a carefully designed moderation subsystem, and (5) cost-effective training infrastructure coupled with an efficient software stack. Together's central thesis is that these processes can be open-sourced, and we can harness the power of the community to build and improve models, in the same way great open-source software has been built for decades. In this talk, I will introduce RedPajama, an open-source effort driven by Together and collaborators, and show how to build an LLM with the power of community.

TRANSCRIPT

Next one is about building RedPajama, and I will bring Vipul to the stage. Hello, how's it going? Hi, it's going well. Did you do any dancing on your end? Oh no, but I was quite inspired. Very talented. And I think you could see that he is a father of two, with the unicorn and the bassinet in the background.

Well, here are the slides. Really pumped to hear about building RedPajama. Thanks so much. Yeah, thanks for having me. Good afternoon everyone, I'm Vipul Ved Prakash, co-founder and CEO of Together. We are focused on making AI more accessible and efficient, and today I'll be talking about building RedPajama and open-source LLMs.

So we've seen this amazing progress in machine learning and AI over the last few years. We have these incredible models like Stable Diffusion and OpenAI's GPT that can generate coherent images and text, and it really feels like a step function has happened.

But there has really been continuous progress in the field, and we have principled ways of measuring this progress with benchmarks. If you look at the ImageNet benchmark, we've gained 30 points of accuracy in the last decade. If you look at the SQuAD benchmark, which is an NLP benchmark, we've gained 25 points of accuracy in the last seven years.

And this is pretty incredible. These systems are hitting high-nineties accuracy, and they're opening up a lot of applications for AI. We have reasons to believe that this is a trend that is going to continue for a long time to come. What's behind this? There are a lot of factors, from research to experimentation to the capital that has gone into AI.

But really, the driving forces behind progress in AI can be boiled down to two things: data and compute. There are a couple of ways of thinking about scale here. One of them is OpenAI's scaling laws, published in a paper a couple of years ago, which showed that as the data size gets bigger, the model gets better.

There's almost a linear relationship here, and the counterpart of this on the right is the compute graph, which shows that as we pump more data through the training framework of these models, we need more compute. We are now using prodigious amounts of compute to train these models. But this is not wasted compute.
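For context, the relationship the slide refers to is a power law, which looks linear on log-log axes. A rough sketch of the form from the Kaplan et al. (2020) OpenAI scaling-laws paper, with approximate exponents quoted from that paper rather than from this talk:

```latex
% Approximate power-law fits from Kaplan et al. (2020), "Scaling Laws for Neural Language Models".
% L = test loss, D = dataset size in tokens, C = training compute; D_c, C_c are fitted constants.
L(D) \approx \left(\frac{D_c}{D}\right)^{\alpha_D}, \qquad \alpha_D \approx 0.095
L(C) \approx \left(\frac{C_c}{C}\right)^{\alpha_C}, \qquad \alpha_C \approx 0.050
% On log-log axes these are straight lines, which is the near-linear trend shown on the slide.
```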

Every flop is producing some proportional gain in the quality of the models. And this is reflected in the datasets we've had and their sizes. The largest dataset we had 10 years ago that was available to the community in the open was the ImageNet dataset, with 1.3 million images.

And today, image models are trained on 3 billion images or more. That's a three-orders-of-magnitude increase. The counterpart on the compute side: to train a model like GPT-3, you need on the order of a thousand A100 GPUs running continuously for a month, and that's not even the most expensive model you can train today.

So it's really the combination of scale in data and scale in compute that is giving us the amazing AI models we have today, and this is the reason we think it will continue to happen. From a practitioner's and researcher's perspective, this manifests as a tension between three things.

And again, this has been the case for the last decade and will likely be the case for the next decade. One of the components is data: as data volume and complexity go up, data quality invariably drops, and the cost of acquiring and cleaning that data goes up. To contend with higher-volume, higher-complexity data, we need models that are bigger and more complex.

Both of these things then put pressure on the infrastructure side, where we need more flops to train these models and more memory to hold them for inference. As we've seen, this has led to specialization in GPU hardware: we have systolic arrays and tensor cores that improve the speed of matrix multiplications by four times.

And just as importantly, scale-out, which is using many GPUs in a distributed setting and doing that effectively. It's in this setup that the LLaMA moment happened. If you look at the LLaMA paper, they used a trillion tokens from a diverse set of data sources to train the LLaMA model.

It used tens of thousands of GPU hours to do this and produced the really high-quality model that we have. We were inspired by this, partly because if you look at the dataset, it is built from open data. This is not just available to one of the largest companies in the world.

It is available to everyone here. And our objective with RedPajama is to see if we can create a fully reproducible open model with LLaMA quality and, more importantly, go beyond that quality.

When we talk about open models, open weights are obviously important, but we think openness along four dimensions is critical for monotonic progress in the quality of open models. We need open weights, so the models can be used anywhere: they can be fine-tuned, and they can be used in fully private on-prem settings or on devices. We also want an open license, so these models can be used in commercial applications.

And we want open data. This gives us a way to think about what has gone into the model; there's transparency associated with it, and the data can then be refined, filtered, and processed by the community to develop better models or to adapt them to specific applications. And if there are concerns about some of the data, it can be removed from the dataset.

And just as importantly, we think the recipe that creates the data should also be open, because this lets us, one, reproduce the model from scratch, but it also gives the community the tools to improve the data recipes or to create the same model in a different language. Models that fit these four pillars can really contribute to monotonic progress, which is what we are looking for.

So, a bit of detail about RedPajama. For the RedPajama dataset, we followed the LLaMA paper. It has seven different slices of data, and we also carefully followed the process described in the paper to filter and transform this data. All of that code is available along with the dataset.

We tuned the hyperparameters to select roughly the same number of tokens from each slice that the LLaMA paper described using. For the model, we took a trillion tokens from this dataset and used the Pythia architecture, which is one of the most well-documented open transformer architectures available.
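For anyone who wants to inspect those slices directly, here is a minimal sketch of streaming the published data with the Hugging Face datasets library; the repository id and the config name are assumptions about how the dataset is hosted, not details given in the talk:

```python
# Minimal sketch: stream one slice of the RedPajama dataset.
# Assumptions (not from the talk): the Hugging Face repo id and the "arxiv" config name.
from datasets import load_dataset

ds = load_dataset(
    "togethercomputer/RedPajama-Data-1T",  # assumed Hugging Face repo id
    "arxiv",                               # one of the seven slices (assumed config name)
    split="train",
    streaming=True,                        # the corpus is on the order of a trillion tokens, so stream it
)

for i, example in enumerate(ds):
    print(example["text"][:200])           # each record is assumed to carry raw text plus source metadata
    if i >= 2:
        break
```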

We also instruction-tuned and chat-tuned these models to align them to applications. That's really important because it brings out the quality in the latent space of these models. A lot of this, in fact I think all of it, was done with open-source tools. We used Slurm and Spark for data processing.

We used DeepSpeed for pre-training, and Together's OpenChatKit fine-tuning system for instruction-tuning and chat-tuning these models. We used the HELM benchmark, which is a comprehensive benchmark around the generative capabilities of LLMs, as well as the LM Evaluation Harness published by EleutherAI, which focuses more on the log-prob side.

The model was built on the Summit supercomputer at Oak Ridge National Laboratory, with a generous grant from the INCITE program, and used 3,000 V100 GPUs. Everything, from the data to the final model checkpoints to intermediate checkpoints, was published on Hugging Face under the Apache 2.0 license.
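Since the checkpoints are on Hugging Face under Apache 2.0, picking one up is straightforward with transformers. A minimal sketch, assuming the checkpoint id below (the exact name is not given in the talk):

```python
# Minimal sketch: load a published RedPajama checkpoint and generate a completion.
# Assumption (not stated in the talk): the Hugging Face model id below.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "togethercomputer/RedPajama-INCITE-Base-3B-v1"  # assumed checkpoint name

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")

inputs = tokenizer("The RedPajama dataset consists of", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=40, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```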

And the model works well. One of the more surprising things was that we built a 3-billion-parameter model and a 7-billion-parameter model in the first run, and the 3B base model is just six points behind LLaMA 7B, while the 3B instruct model is just one point behind LLaMA 7B on HELM. This is fantastic, because you can quantize this model further. We've worked with the llama.cpp folks and the MLC folks to bring it to CPUs and to the iPhone; you can download the MLC Chat app today to try out the chat version of this model.

And it's fantastic because you can use this model in few-shot mode to develop applications, in contexts where this was just not possible before this model existed.

The 7B model is also interesting. A couple of things to note about it. One is that we have made progress over the last generation of open models: the base model is 2.4 points ahead of GPT-J, which is an excellent open model from EleutherAI. You'll also note that open models, including MPT, Falcon, and RedPajama, are lagging 3.3 to 4 points behind the LLaMA model.

With instruction tuning, the RedPajama open model becomes two points ahead of LLaMA. So it really provides an alternative for few-shot applications like sentiment analysis, classification, and data extraction, and for those applications it is probably the best model that exists in open source today.

So it's super promising, and we still have a long way to go to get to LLaMA and to increase the quality of these models beyond LLaMA. One of the interesting things about open data is that it really lets you do principled analysis of model quality.

One of the things we spent a lot of time on is contamination analysis, especially on the instruct dataset, because it's a really important part of model quality, and we aggressively decontaminated it against the benchmarks in HELM. This is important because if there's contamination in the data, it can give you a false sense of the performance of these models.

Having the open data allows us to do this sort of effort, and it allows anyone else who's using it and applying it to downstream tasks to verify that the data has been properly decontaminated. But as I said, we still have a lot of work to do, and I want to touch on some of the avenues for that.
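The talk doesn't describe the exact decontamination procedure, but a common approach is to drop training examples that share long n-grams with benchmark prompts. A minimal sketch of that idea follows; the 13-gram window and helper names are illustrative assumptions, not Together's actual pipeline:

```python
# Minimal sketch of n-gram-overlap decontamination (assumed approach, not Together's exact pipeline).
# Training examples that share any n-gram with a benchmark prompt are dropped.
from typing import Iterable, List, Set, Tuple

def ngrams(text: str, n: int = 13) -> Set[Tuple[str, ...]]:
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(max(0, len(toks) - n + 1))}

def decontaminate(train_texts: Iterable[str], benchmark_texts: Iterable[str], n: int = 13) -> List[str]:
    # Build the set of benchmark n-grams once, then filter the training set against it.
    bench_grams: Set[Tuple[str, ...]] = set()
    for t in benchmark_texts:
        bench_grams |= ngrams(t, n)
    return [t for t in train_texts if not (ngrams(t, n) & bench_grams)]

# Toy usage: n=4 only because the example strings are short; 13-grams are typical for real corpora.
kept = decontaminate(
    ["label the sentiment of this movie review as positive or negative",
     "summarize the following support ticket in one sentence"],
    ["label the sentiment of this movie review as positive or negative"],
    n=4,
)
print(kept)  # only the second training example survives
```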

A big part of monotonic progress is evaluation; we need to know how well these models are performing. Hugging Face, for instance, has an Open LLM Leaderboard, which is really fun and evaluates lots of models against a set of benchmarks. It shows Falcon, LLaMA, and RedPajama as roughly the same on the MMLU benchmark.

But a few days ago there was a technical report from a great set of researchers who tried to reproduce MMLU and found very different results, shown on the right, where the open models are lagging behind LLaMA; the instruct model does well there. This discrepancy is really problematic, because we want these benchmarks to act as north stars for model development. We have to find better ways of doing these evaluations, making sure they're reproducible and unbiased, and possibly use other strategies like human evaluation, which can be fairly hard to scale.

So we need to do a lot of work around this, because once we have effective evaluation methods as a community, we can really drive progress in model development. Another interesting area is data quality. One of the things we did for the RedPajama dataset was to use the mixture described in the LLaMA paper, and it's pretty clear that if you weight these slices of data differently, you get more quality out of the final artifact.

We also likely need to not just weight these slices but take sub-slices of the bigger ones, like Common Crawl, and weight those, and we have to do this in a principled manner. There's a lot of value in this. Deduplication is another interesting area. Cerebras just published a dataset called SlimPajama, which is a more aggressively deduplicated version of the dataset.

It's about half the size. Again, we have to figure out what the right strategy is here, what the sweet spot for deduplication is, and how it interacts with the data mixtures that we design. And then there are problems of data transformation, de-biasing the data, and detoxifying the data.
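To make the deduplication question concrete, here is a minimal sketch of fuzzy dedup with MinHash locality-sensitive hashing using the datasketch library; the shingle size and similarity threshold are illustrative knobs, not the settings used for SlimPajama:

```python
# Minimal sketch of fuzzy deduplication with MinHash LSH (illustrative settings only).
from datasketch import MinHash, MinHashLSH

def minhash(text: str, num_perm: int = 128) -> MinHash:
    m = MinHash(num_perm=num_perm)
    for shingle in {text[i:i + 5] for i in range(max(1, len(text) - 4))}:  # 5-character shingles
        m.update(shingle.encode("utf-8"))
    return m

docs = {
    "doc1": "the quick brown fox jumps over the lazy dog",
    "doc2": "the quick brown fox jumped over the lazy dog",        # near-duplicate of doc1
    "doc3": "an entirely different document about supercomputers",
}

lsh = MinHashLSH(threshold=0.5, num_perm=128)  # Jaccard threshold is an assumed knob
kept = []
for key, text in docs.items():
    m = minhash(text)
    if lsh.query(m):        # something similar is already indexed: treat this doc as a duplicate
        continue
    lsh.insert(key, m)
    kept.append(key)

print(kept)  # with these toy strings, doc2 is likely dropped as a near-duplicate of doc1
```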

And we think this is really going to be, in many ways, the cornerstone of progress toward better models over the next few years. Another interesting problem to solve is GPU utilization. As you scale out and use a larger number of GPUs to train these models, it becomes a challenge to use all of those GPUs effectively.

This graph shows an ideal run on the Summit supercomputer and the real run for RedPajama, and they look fairly close. But once you get to 3,000-GPU scale, the difference means that 500 GPUs are sitting idle, which is, yeah, not ideal, and over a year probably represents a few million dollars' worth of computing resources.

There's a lot of work happening in this area around communication optimization, and Together is particularly focused on this field. There are open problems around how to take these techniques and apply them in training harnesses, so that you can use them for training different kinds of models without doing custom work for each type of training task.

Some of the work we've done here is now being used by us for instruct-training models. It's an algorithm called CocktailSGD, which is able to do training either over low-bandwidth network conditions in distributed settings, or to get maximum utilization in these high-performance-computing, supercomputer-type training setups. This is a talk in its own right, so I'm not going to dig into the details; it's a paper at ICML this year, and a source-code release is coming. As I mentioned, we are now instruct-tuning all the models at Together with this approach, and we're looking forward to pre-training models with it soon.

So hopefully this gives you a bit of a window into the process of training these models, and how we are using open datasets, open software, and, in the case of RedPajama, open public infrastructure to build them. That said, the models we have today, the artifacts we have in hand, are very powerful and can be used effectively for real-world applications.

There are a few strategies that are particularly effective. One of them is few-shot prompting; it's the most lightweight way to boost model quality. You provide one or two examples of your task along with your prompt, and the RedPajama 3B and 7B instruct models are specifically optimized for this.
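A minimal sketch of what few-shot prompting looks like against one of the instruct checkpoints; the model id and the toy sentiment task are assumptions for illustration, not details from the talk:

```python
# Minimal sketch of few-shot prompting with a RedPajama instruct checkpoint.
# The model id and the toy examples are assumptions for illustration.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "togethercomputer/RedPajama-INCITE-Instruct-3B-v1"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# One or two worked examples of the task, then the actual input.
prompt = (
    "Label the sentiment of each review as positive or negative.\n\n"
    "Review: The battery lasts all day and the screen is gorgeous.\n"
    "Sentiment: positive\n\n"
    "Review: It broke after a week and support never replied.\n"
    "Sentiment: negative\n\n"
    "Review: Setup took five minutes and it just works.\n"
    "Sentiment:"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=3, do_sample=False)
# Print only the newly generated tokens, i.e. the predicted label.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```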

And if you want to go further, you can create your own instruct data, start with a base model, and instruct-tune that model with your own data. We've seen several examples where people have instruct-tuned with their own data and then applied few-shot prompting on top of that, to get quality from 3B and 7B models that they can't get from GPT-4.

So this is really an effective way to get high quality and high accuracy for your tasks with these models. And if you have a large amount of data, you can continue to pre-train these models, and you can mix the original dataset, which is an advantage of open data, into your data mixture so you don't have problems like catastrophic forgetting.
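A minimal sketch of that mixing step with the datasets library; the repository id, the 90/10 ratio, and the assumption that both corpora expose a "text" field are illustrative, and the point is simply to keep some original pre-training data in the continued-training mixture:

```python
# Minimal sketch: mix your own corpus with a sample of the original RedPajama data for continued
# pre-training, so the model keeps seeing its original distribution (guards against catastrophic forgetting).
# The repo id, file name, and 90/10 ratio are illustrative assumptions.
from datasets import load_dataset, interleave_datasets

my_data = load_dataset("json", data_files="my_corpus.jsonl", split="train", streaming=True)
redpajama = load_dataset("togethercomputer/RedPajama-Data-1T-Sample", split="train", streaming=True)

mixed = interleave_datasets(
    [my_data, redpajama],
    probabilities=[0.9, 0.1],  # mostly your data, with a slice of the original mixture
    seed=42,
)

for example in mixed.take(5):
    print(example["text"][:80])  # assumes both corpora expose a "text" field
```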

In this way you can prepare a model that's really designed for your tasks. There are many deployment options now. You can quantize these models further to fit your infrastructure, application, and performance needs, and because the weights are open and the licenses are permissive, this really gives you a lot of different options.
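As one example of a quantized deployment path, here is a minimal sketch of loading a checkpoint with 8-bit weights via bitsandbytes through transformers; the model id is an assumed name, and llama.cpp and MLC (mentioned above) offer alternative CPU and on-device routes:

```python
# Minimal sketch: load a RedPajama checkpoint with int8 weights to roughly halve memory versus fp16.
# Requires the bitsandbytes package and a CUDA GPU; the model id is an assumed name.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "togethercomputer/RedPajama-INCITE-Chat-3B-v1"  # assumed checkpoint name

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    load_in_8bit=True,   # weight-only int8 quantization via bitsandbytes
    device_map="auto",
)

print(model.get_memory_footprint() / 1e9, "GB")  # rough check of the quantized footprint
```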

And finally, I want to mention that we have taken a lot of the optimization work we've done and created an AI-specific cloud service, which has thousands of GPUs and efficient support for inference, model hosting, fine-tuning, and training. We are supporting many users in beta today.

If you're focused on building your application, you can use a service like Together's AI cloud to manage all the production needs. And I'm happy to announce that anyone watching this can go to this link and contact us for early access. That's all I have. Awesome.

Thank you so much. And let's take a moment and see if anybody has questions in the chat. Otherwise, what's the best way for people to reach you and follow your work? Is it LinkedIn or Twitter? Yeah, LinkedIn is great. You can also send me an email at vipul@together.xyz.

And write to us using the contact form; we store that, and we'll definitely contact anyone who writes there. Awesome, that's great. It's been so interesting to learn about the proprietary models and the open-source models, and the different pros and cons that people need to weigh when considering one or the other.

I think some of the value we see in open models is a higher ability to customize these models on your own data. You can also achieve better privacy if you are running them in your own environment, or in a cloud instance that only you have access to.

No one is able to see that data; you can be assured that your data is private. And also control: what's happening with these models is that often fairly important, strategic business processes are being encoded into prompts, and there's customer data going against those prompts.

And I think as this becomes a more important part of organizational process, having control of those models will become more important to more people. Yeah, definitely. And the explainability aspects of it as well, I imagine. Right, transparency and explainability are also, I think, important ideas that you can achieve better with open models today than you can with closed ones.

Yeah, that's very cool. That's awesome. Well, thank you so much, and I think we have to cut over to our next and final speaker, but it was awesome having you on today. All of these talks are being recorded and we're going to share them with the community. Oh man, okay, of course now the questions are streaming in.

Great. A quick question from Cesar: he's asking, why do you suggest mixing the datasets? Can't we just fine-tune RedPajama with our private dataset? You can, and I think that's a great first thing to try. There are cases, if your dataset is large, then, especially for smaller-scale models, you sometimes get this catastrophic forgetting, where they forget some of the things they've learned from the original dataset. So the data-mixing practice is often used in this context, where you don't want the model to forget what it has learned from the previous data. But it's probably not a concern if it's a few thousand examples.

And you can get an incredible amount of customization with a few thousand examples. That's great, Cesar, I hope that answered your question. We have another one from Rit. He says: these LLaMA-based models don't distribute the final weights, but differential ones, and the repos provide instructions to create the final model weights.

What is the rationale behind that pattern? What are the implications on licensing, since the model weights do require the original weights? Yeah, so the reason people are distributing diffs is that they don't have the rights to distribute the model; the model itself is under a restrictive, research-only license. That was also one of the reasons for us to create RedPajama: it would be a permissively licensed base model that does not have those restrictions, and you could easily release models that are fine-tuned on top of it. So it's really a way of getting around the restrictive license that LLaMA was under.

Oh cool, great question, I'm pretty glad you brought that up. I'm not going to pronounce this name right and I apologize. Xinkun is asking: does the AI cloud service work for private data, or should companies with private data host their own open-source models? So, we have designed our AI cloud to be very aggressively private.

We think that one of the big benefits of open-source models is that you can run them in private instances that you control. That said, these models can really run anywhere: they can run in your VPC, they can run on Together's cloud, and they can even run on-prem and on devices.

Great, I hope that answers your question, and if you have others, maybe ask in the chat or at some of the places he mentioned earlier. So thank you so much, this was awesome. Thank you so much. Thanks for having me. Of course.
