MLOps Community

Efficiently Scaling and Deploying LLMs

Posted Apr 23, 2023 | Views 2.1K
# LLM
# LLM in Production
# LLM Deployments
# MosaicML
# Rungalileo.io
# Snorkel.ai
# Wandb.ai
# Tecton.ai
# Petuum.com
# mckinsey.com/quantumblack
# Wallaroo.ai
# Union.ai
# Redis.com
# Alphasignal.ai
# Bigbraindaily.com
# Turningpost.com
speakers
Hanlin Tang
CTO @ MosaicML

Hanlin is the CTO & Co-founder of MosaicML, an ML infrastructure startup that enables enterprises to easily train large-scale AI models in their secure environments. Hanlin was previously the Director of the Intel AI Lab, responsible for research and deployment of deep learning models. He joined Intel from its acquisition of Nervana Systems. Hanlin has a Ph.D. from Harvard University, and has published in leading journals and conferences such as NeurIPS, ICLR, ICML, Neuron, and PNAS.

Demetrios Brinkmann
Chief Happiness Engineer @ MLOps Community

At the moment Demetrios is immersing himself in Machine Learning by interviewing experts from around the world in the weekly MLOps.community meetups. Demetrios is constantly learning and engaging in new activities to get uncomfortable and learn from his mistakes. He tries to bring creativity into every aspect of his life, whether that be analyzing the best paths forward, overcoming obstacles, or building lego houses with his daughter.

SUMMARY

Hanlin discusses the evolution of Large Language Models and the importance of efficient scaling and deployment. He emphasizes the benefits of a decentralized approach of many small specialized models over one giant AGI model controlled by a few companies. Hanlin explains the advantages of companies training their own custom models, such as data privacy concerns, and provides insights into when it is appropriate to build your own models and the available tooling for training and deployment.

TRANSCRIPT

Link to slides

Yeah, thanks a lot for having me. I'm really excited to be here. My name's Hanlin Tang. I'm the CTO and co-founder at MosaicML, and today I'll be talking through what it takes to efficiently scale and deploy these large language models that have taken the world by storm in the last couple of months. We've all seen the headlines, and this entire amazing virtual conference has been put together specifically for this topic.

What we observed early on is that people used to think: OK, LLMs are here, and you need one giant AGI model for every use case, very centrally controlled by a few individual companies. Over the last month or two, that thinking has really evolved inside the ecosystem to saying: hey, you know what, the world doesn't actually look like one AGI model.

It looks like many small, specialized models that are used to solve very specific use cases, owned by many companies. I'll provide some examples of what we're seeing in the ecosystem today with this type of thinking, and it's not just us.

You can see here articles from both luminaries in the field and venture capitalists saying that we're going to be in a much more decentralized world, where the power of LLMs will be put into every individual company's or startup's world. And you'll see these very distinct general AI models start to come into place.

On the right, you can see an article from folks in venture capital arguing that companies with unique data stores will see clear advantages from training their own models as moats. So we're really seeing this future world where companies will do both: they'll use these amazing external APIs that have come onto the scene to build some of their applications, but they will also undertake the effort to build their own custom large language models. What I'll do today is talk through exactly what that second bullet point means.

Is it too expensive? Is it too hard? When should you do it? When shouldn't you do it? And what tooling is out there to help you train and deploy your own custom language models, trained on your own data?

So why build your own models? One of the key reasons we see out there, and you see these headlines here, is data privacy: you don't want to be outsourcing and sending your private data, your core IP, to an external service. But if we dig deeper, there's actually a lot more to it than that.

We see many reasons that companies now cite for wanting to train their own models. We already talked a little about data ownership. The other piece is really understanding where the model's data is coming from.

If you want to control the bias and the outputs of these models, which many people have already talked about as a big problem for large language models, part of that is understanding what data went in. If you're building a financial model for enterprise, do you really want to be using a model that was trained on, say, Reddit's WallStreetBets channel? Or if you're building a health care specific model, you want to make sure you're excluding certain data sources from the model itself.

Part of the challenge in controlling model output is that we're applying all these downstream hacks or modifications to get the models to align, when many of the problems actually come from where the source data came from. So controlling what data went into training the model can play a huge role in this data provenance problem.
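
As a rough illustration of controlling what data goes in, the sketch below drops documents from blocked sources before they reach a training corpus. It is a minimal sketch and not tooling from the talk: the `Document` type, the `filter_by_source` helper, and the source names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    source: str  # hypothetical provenance label, e.g. a dataset or site name
    text: str


def filter_by_source(docs, blocked_sources):
    """Keep only documents whose source is not on the blocklist."""
    blocked = {s.lower() for s in blocked_sources}
    return [d for d in docs if d.source.lower() not in blocked]


# Hypothetical corpus for a financial model; the source names are made up.
corpus = [
    Document("sec-filings", "Annual report for the fiscal year"),
    Document("reddit/wallstreetbets", "to the moon"),
    Document("internal-research", "Credit risk memo"),
]

kept = filter_by_source(corpus, blocked_sources=["reddit/wallstreetbets"])
print([d.source for d in kept])
```

Real pipelines usually need more than a blocklist (deduplication, quality filters, licensing checks), but recording a source on every document is what makes this kind of provenance control possible at all.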

The other piece is controlling the content filters that make sense for your business. Some businesses need a very long sequence length, or have very particular filters that they want to apply, and you want to be able to exercise that control.

The third piece that we see is model ownership. Companies want to own their weights, either for portability, so that you aren't beholden to a particular service to run the deployment side, or for better introspection and explainability. We see even more reasons why customers want to train their own models inside this ecosystem, which is growing very quickly.

Of course, I mentioned the data advantage, where your data is your moat. But the other underappreciated piece is inference economics. Originally, I think the field's view when these models came onto the scene was: OK, we need to train these 100 billion parameter models, 175 billion parameter models, 500 billion parameter models, and play that kind of scaling game.

But what's really emerged is that for many of the individual use cases you may be solving, you don't need a 175 billion parameter model that can do a lot of magic and AGI. You have a very specific ML or NLP problem that you want to solve, and for that, training smarter, smaller models becomes much more interesting.

Also, in many applications that are very domain specific, we see a lot of folks starting to train their own language models as well. In genomics, for example, you're dealing with sequences of proteins and DNA instead of natural text, or with electronic health records in medicine, or with in-vehicle applications.

And then lastly, I already mentioned the need for data and model ownership from a privacy and a regulatory standpoint. We've heard from insurance companies that if you're in a highly regulated environment, model and data ownership is critical to building more explainable and better models. Also, you don't want to be relying on external services that may, for example, discontinue or obsolete a model that you've been relying on for your particular application.

So we're really seeing this world where customers, and you all, may want to start training and deploying your own models. When we first say that, a lot of what we hear from folks is: hey, that sounds hard, right? You need to find the right number of GPUs, you need to figure out the software stack to run these things, and you need to figure out how to deploy them. Thankfully, as I'll walk through during today's talk, there are many amazing tools out in the open source and in the community today that make training and deploying your models a lot easier than you may think it would be from first principles, and I'll show you some examples of how this is being done today.

So this is an example: BioMedLM.
It is a domain-specific large language model that was trained just on PubMed, so just on a large corpus of biomedical literature, and I'll make a few interesting points here. First, this was trained by Stanford University, jointly with us but primarily by Stanford, and by a team of just two or three ML engineers.

So despite the size and scale of these models, the tooling has gotten to a point where even small teams are able to train very interesting and powerful language models. This is a very simple example of a three billion parameter model, which is fairly small in the large language model world. But we've seen that models like these really punch above their weight.

This model was able to hit state-of-the-art accuracy at the time of release on the US medical licensing exam. That's this MedQA USMLE example on the lower left here, where the model gets a very long prompt about a patient's history and the symptoms being presented, and it has to answer a multiple-choice question about the right test to order for this particular patient.

On the right is a different benchmark that Stanford used to measure this model, PubMedQA, where you're given a question and some context, and you're supposed to answer a particular medical problem.

What's interesting here, if you look at the chart on the right, is this BioMedLM model in the first row (it was originally called PubMedGPT, but we renamed it to BioMedLM): it hit state-of-the-art accuracy at the time on this MedQA benchmark.

What's also interesting is that this three billion parameter model can reach similar performance to Galactica, a model that's about 40 times larger but was trained on much less specific data.

So this really drives home the point: we're seeing growing evidence that by training domain-specific large language models, in the medical space, the legal space, genomics, health care, and insurance, you can actually build models that are economical to deploy at scale and to run inference on across a large set of customers.

The other point I'll make here is that this is actually pretty accessible. There's a lot of data that already exists out on the web for particular domains, for you to take in and use to train these models.

So smaller, fairly specialized models, three billion to seven billion parameters, can actually have very strong business value, as we've seen in the ecosystem.

We've seen applications everywhere from the financial space, where folks want to classify or summarize loan documents, to code generation and helping with programming tasks, where these small models bring a lot of value. In most cases, you don't need the power of that large AGI system that's been trained on a lot of data and is very expensive to serve.

The other example, which just came out last week, was Bloomberg, who trained a 50 billion parameter large language model on a combination of web data and internal Bloomberg data.

This is also fairly interesting because traditionally we've thought of building large language models as requiring either the training-from-scratch piece or the fine-tuning piece. But what's happening here is also this continued pretraining piece, where you take a combination of web data that's already out there and internal data. That's what Bloomberg did, and they found that it outperformed existing open source models on a large set of financial tasks.

This really drives home the point that your internal proprietary data matters and can be used to better solve tasks for your business.
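
To make the idea of combining web data with internal data concrete, here is a minimal sketch of weighted sampling from two corpora. The `sample_mixture` helper, the example documents, and the 50/50 ratio are hypothetical; the talk does not say what mix Bloomberg used.

```python
import random


def sample_mixture(sources, weights, n, seed=0):
    """Draw n (source_name, document) pairs, choosing the source by weight."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    names = list(sources)
    picks = rng.choices(names, weights=[weights[name] for name in names], k=n)
    return [(name, rng.choice(sources[name])) for name in picks]


# Hypothetical corpora and a hypothetical 50/50 mix; neither comes from the talk.
sources = {
    "web": ["public news article", "public filing"],
    "internal": ["proprietary research note", "proprietary market commentary"],
}

batch = sample_mixture(sources, {"web": 0.5, "internal": 0.5}, n=8)
for name, doc in batch:
    print(name, "->", doc)
```

In practice the mixing ratio is a tuning decision: weighting internal data too lightly wastes its value, while weighting a small corpus too heavily means repeating it many times.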

Now, when we first say this, there are a few myths that come up. The first one is: oh my gosh, Hanlin, that's great, I want to build my own models. They give me data privacy, I get control over where my IP goes, but it's too darn expensive. If you read the press out there, GPT-3 took somewhere from $10 million to $12 million to train. That's nowhere near feasible for a prototype system.

The other myth that we've seen out there is that it's just too difficult. You have to figure out all these different individual components in order to train your large language models. How do I get everything to connect together? How do I figure out the right GPU? How many GPUs should I use? How do I get multi-node orchestration to work? How do I deal with node failures? How do I get the data streamed in? How do I fit the model into memory? What type of model should I train?

So these are two very common myths that I wanted to bust in this talk, to show you where the tooling is today for making training and building your own models very accessible.

For the first one, we put out a Twitter poll, I think sometime in September, asking folks: how much do you think it actually costs to train a GPT-3 quality model from scratch? And just as reflected in a lot of the articles I just showed, about 60% believed that a GPT-3 quality model costs somewhere from $1 million to $5 million or more per run. And you do have to train these models a few times in order to get it just right.

But the reality is that training large language models is actually fairly accessible with the tooling that's available from us and from other people in the community. Here is a chart of different model sizes, how long they take to train on about 128 GPUs, and the approximate cost.
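
A chart like this can be approximated with back-of-the-envelope arithmetic. The sketch below uses the common rule of thumb that training takes roughly 6 × parameters × tokens floating-point operations. The 128-GPU default comes from the talk; everything else is an assumption made for illustration: the per-GPU peak throughput (312 TFLOP/s, the published BF16 peak of an NVIDIA A100), 40% hardware utilization, $2 per GPU-hour, and about 20 training tokens per parameter. It is not the method behind the chart in the slides.

```python
def training_cost(n_params, n_tokens, n_gpus=128,
                  peak_flops_per_gpu=312e12,  # assumption: A100 BF16 peak
                  utilization=0.4,            # assumption: fraction of peak achieved
                  usd_per_gpu_hour=2.0):      # assumption: cloud price
    """Estimate wall-clock days, GPU-hours, and dollars for one training run."""
    total_flops = 6.0 * n_params * n_tokens  # rule-of-thumb training compute
    cluster_flops_per_s = n_gpus * peak_flops_per_gpu * utilization
    seconds = total_flops / cluster_flops_per_s
    gpu_hours = n_gpus * seconds / 3600.0
    return {
        "days": seconds / 86400.0,
        "gpu_hours": gpu_hours,
        "usd": gpu_hours * usd_per_gpu_hour,
    }


for billions in (1, 3, 7, 30):
    n_params = billions * 1e9
    n_tokens = 20 * n_params  # assumption: ~20 tokens per parameter
    est = training_cost(n_params, n_tokens)
    print(f"{billions:>3}B params: {est['days']:.2f} days, ${est['usd']:,.0f}")
```

Because cost grows with parameters × tokens, scaling both together makes cost grow roughly quadratically with model size, which is why the step from a few billion parameters to tens of billions is so expensive.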

And suddenly, if you combine the tooling that's available now with the idea that a one billion to seven billion parameter model starts to become very attractive for business use cases, you can see that training these models is no longer that expensive. You can train a GPT-3 quality model for about half a million dollars. And for many business use cases, a $30,000 model can already go pretty far toward being deployed into production.

How is this being done? There's a lot of great work happening right now on good tooling to scale these models really efficiently, and additionally a lot of work in the open source and from us on ways to train these models efficiently and stably.

So, how do you efficiently scale these large language models for training?
One of the challenges is that these models are so large that they can't fit into the memory of a single GPU.

PyTorch has released Fully Sharded Data Parallel (FSDP), a pretty powerful and very simple strategy that essentially splits the model across many GPUs and is very flexible and easy to use. It's native to PyTorch, and so it's integrated for everybody who's using PyTorch these days to train these models.

In Fully Sharded Data Parallel, essentially what happens is that you split the model and the optimizer state across all the different GPUs and fetch them just in time during training.

This saves a ton of memory, but it's also very flexible, because you don't have to deal with the more exotic parallelism strategies in order to train your models. We've found that coupling Fully Sharded Data Parallel with our own Composer library, and all the optimizations baked into it, makes training these large language models fairly efficient and easy to use.
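
The memory argument can be made concrete with simple arithmetic. The sketch below assumes mixed-precision training with Adam, where a commonly used estimate is about 16 bytes of training state per parameter (2 for 16-bit weights, 2 for 16-bit gradients, 12 for the 32-bit master weights, momentum, and variance). It ignores activations and the temporary full-parameter copies that sharded training gathers layer by layer, so treat it as a lower bound and not as a description of FSDP internals.

```python
# Assumption: mixed-precision Adam; activations and temporary buffers are ignored.
BYTES_PER_PARAM = 2 + 2 + 12  # 16-bit weights + 16-bit grads + 32-bit optimizer state


def training_state_gib(n_params, n_gpus=1):
    """GiB of weights, gradients, and optimizer state held by each GPU
    when that state is sharded evenly across n_gpus."""
    return n_params * BYTES_PER_PARAM / n_gpus / 2**30


for n_gpus in (1, 8, 128):
    per_gpu = training_state_gib(7e9, n_gpus)
    print(f"7B params on {n_gpus:>3} GPUs: {per_gpu:.1f} GiB per GPU")
```

Under these assumptions a 7 billion parameter model needs on the order of 100 GiB of training state, more than a single 80 GB accelerator holds, while sharding the same state across 8 GPUs brings the per-GPU share down to roughly an eighth of that.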

The other thing that's evolved in the research space in the last couple of months is that you don't actually need to train a large model for it to be efficient. You can see here a plot from Meta's LLaMA models: training loss on the Y axis, and how many tokens of data have been crunched through on the X axis. In blue is the LLaMA 7 billion parameter model, and in red is the LLaMA 65 billion parameter model. What's interesting is that for the 7 billion model, the smaller one, the loss continues to go down as you train more and more.

Interestingly, when this set of models was released, the 7 billion parameter model was the one that got everybody's attention, because it's small, you can sometimes run it on your laptop for inference, and it's very cheap to deploy.
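
The deployment-cost side of this can be sketched with two standard approximations: training costs about 6 × parameters FLOPs per token, and inference costs about 2 × parameters FLOPs per generated token. The token counts below (1 trillion training tokens for the 7B model, 1.4 trillion for the 65B model, and 10 trillion tokens served) are illustrative assumptions and are not figures from the talk. The sketch compares compute only and says nothing about model quality.

```python
def lifetime_flops(n_params, train_tokens, served_tokens):
    """Approximate total compute: one training run plus all inference served."""
    train = 6.0 * n_params * train_tokens   # rule of thumb: forward + backward pass
    serve = 2.0 * n_params * served_tokens  # rule of thumb: forward pass only
    return train + serve


# Assumed, illustrative workloads (not from the talk).
small = lifetime_flops(7e9, train_tokens=1.0e12, served_tokens=1e13)
large = lifetime_flops(65e9, train_tokens=1.4e12, served_tokens=1e13)

print(f"7B  lifetime compute: {small:.2e} FLOPs")
print(f"65B lifetime compute: {large:.2e} FLOPs")
print(f"ratio (65B / 7B): {large / small:.1f}x")
```

The more tokens a model serves, the more the inference term dominates, so spending extra training tokens on a small model can pay for itself at deployment time.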
And if you continue training it and you use it, it can actually solve many of the business problems that you need to deploy these large language models for. So internally we have this mantra: train longer, not larger. And this has really changed the accessibility of these large language models, for everyone to be able to use.

The other myth that we've seen out there is, hey, training these models is just too hard. I need to deal with challenges: Am I finding the right kernels? How do I do my parallelism? How do I stream my data in? Monitoring these runs becomes challenging because nodes can fail. In our experience, nodes can fail almost every other day or every few days. How do you recover from those well? And you have to deal with challenges such as loss spikes. How am I training more efficiently? How do I orchestrate all these multiple nodes and get them to talk to each other in the right way? All these different challenges, at first blush, get in the way between the model and the dataset and your underlying hardware.
And there is some evidence to substantiate this when you're training models at very large scales. These are the training logs from when Meta trained their OPT model. You can see nodes fail every so often. When you resume training, it takes about 50 minutes during which the whole cluster is idle, waiting for the model to be reloaded and for the data loader to fast-forward. Or sometimes your provider may actually delete your entire cluster when trying to reprovision the nodes. And so all of these combined do feed the sentiment that, hey, maybe training these models is just too hard.
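One reason a resume can be slow is that the data loader replays every batch it had already served. A minimal sketch of the alternative, assuming a dataset that can be addressed by index: make the shuffle deterministic per epoch and jump straight to the saved position. The function names are hypothetical, and this is not how the OPT run or any particular library implements it.

```python
import random


def epoch_order(num_samples: int, seed: int, epoch: int) -> list[int]:
    """Deterministic shuffle: the same (seed, epoch) always gives the same order."""
    order = list(range(num_samples))
    random.Random(f"{seed}:{epoch}").shuffle(order)
    return order


def batches(num_samples: int, batch_size: int, seed: int, epoch: int,
            start_batch: int = 0):
    """Yield batches of sample indices, starting directly at start_batch.

    A trailing partial batch is dropped to keep the example short.
    """
    order = epoch_order(num_samples, seed, epoch)
    for b in range(start_batch, num_samples // batch_size):
        yield order[b * batch_size:(b + 1) * batch_size]


if __name__ == "__main__":
    full = list(batches(1000, 10, seed=17, epoch=0))
    # Pretend a checkpoint recorded that 40 batches had been consumed.
    resumed = list(batches(1000, 10, seed=17, epoch=0, start_batch=40))
    print(resumed == full[40:])
```

The resumed stream matches the tail of the original one without reading any of the skipped samples. Only the index list is rebuilt, which is cheap compared with loading and tokenizing data.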
Fortunately, that's not actually the case. With tooling that we've built, and that other folks in the community have built, we have an LLM stack that sort of just works, where we provide optimized configurations across many model scales to be able to realize the cost and the model training that I spoke about previously. It's very fast and scalable, and essentially you provide your data and just go.

So if you're out there and you have some interesting internal data that you want to train or fine-tune these models on, to deploy for yourself, using a lot of the models that are out there in the open source: we and other folks have built a great set of stacks, many in the open source, that make training these models very easy and sort of just work right out of the box.

And so what we observe in enterprise is that many folks may say, hey, I've trained a BERT model, BERT-Base or BERT-Large, and deployed it into production. That's great. Now I want to scale up.
And so using these tools, you can now break this multi-node barrier and start training smaller, say 1 billion parameter, models and deploying those into production, seeing if that brings you business ROI, and then continuing to scale up from there to derive more value from training larger models that are maybe more accurate.

I spoke previously about many of the unique infrastructure challenges that come with training large models on large systems. To some extent, many of these are now starting to be solved by many platforms that are out there. There's the ability to detect out-of-memory errors that occur when you have a very large model, and to dynamically adjust memory usage on the fly to prevent them. There's the ability to resume instantly, or near instantly, so that if there is a node failure, the system automatically catches it and then restarts.
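The out-of-memory handling described above can be sketched roughly as follows. This is a simplified illustration, not any platform's actual implementation: `run_batch` and `fake_step` are hypothetical names, and Python's built-in `MemoryError` stands in for whatever out-of-memory exception a real framework raises.

```python
def run_batch(batch, step_fn, microbatch_size):
    """Run step_fn over batch in chunks, halving the chunk size on MemoryError.

    Returns (list of per-chunk results, microbatch size that finally worked).
    """
    while True:
        try:
            results = [step_fn(batch[i:i + microbatch_size])
                       for i in range(0, len(batch), microbatch_size)]
            return results, microbatch_size
        except MemoryError:
            if microbatch_size == 1:
                raise  # cannot shrink further; surface the error
            microbatch_size //= 2  # retry the whole batch with smaller chunks


def fake_step(chunk, capacity=3):
    """Hypothetical stand-in for a training step with limited memory."""
    if len(chunk) > capacity:
        raise MemoryError("simulated out-of-memory")
    return sum(chunk)


if __name__ == "__main__":
    results, size = run_batch(list(range(16)), fake_step, microbatch_size=16)
    print(size, sum(results))
```

The batch as a whole is still processed, just in smaller pieces, so in a real training loop the gradients from the pieces would be accumulated before the optimizer step.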
There's also the ability to gracefully resume from node failures and loss spikes, so that when things happen, you don't have to be monitoring these runs 24/7. Under the hood, people have built systems that are able to automatically find the bad node and restart. So from your perspective, you just provide the data, and training just happens. And of course, there's a lot of efficiency work happening in the ecosystem, from FlashAttention from the Stanford folks to various efficiency algorithms that we're developing, to bring down the cost of training these large language models.

And so we end up with this ecosystem for the large language model training stack, where there's this full stack that's now developing to allow you to easily train large language models on your data and in your secure environment.
Everything from tracking tools, to tools from us and others like WebDataset for streaming the data in, to various distributed training frameworks, the underlying deep learning libraries, deployment and orchestration, and also various device drivers and toolkits. So this is leading to a world where, with what we've built, like the MosaicML platform, and others, it is very straightforward to start training these 1 billion and 7 billion parameter large language models for your particular use cases.

And we really see this as positive for the ecosystem, right? We want to be in the position, and this is what the MLOps community has shown, where we're decentralizing these capabilities, so that we're not beholden to just very particular API providers to be able to train these models.
I know I'm running close on time here, so I'll close by saying there are many reasons to build your own models. Training 7 billion parameter large language models, and lower, on your own data is now very cost-effective and straightforward. The community continues to push out many great open source models for you to build off of, and we've sort of busted two myths here.

The first myth is, hey, it's just too expensive, which is not actually true. It's now fairly cost-effective to train these models. And the second myth is that it's too hard. Not true at all. There's a lot of great open source tooling and full stacks out there, including ones that we've built, to make it very easy to train these models. And so I would really encourage everyone: don't be too scared by the "large" part of large language model, especially if you've trained or fine-tuned BERT models in the past, or computer vision models.
The tooling is there now to start your large language model journey, and to start scaling up efficiently and deploying these models into production. Of course, we at MosaicML offer some of these platforms as well, and so you can work with us or many others to get started. So, I'm very excited to see, over the next 6 to 12 months, the plethora of different models dedicated to legal or medicine or your particular business that's going to emerge because of all the great work that's happening in the community and in the open source.

So with that, I'll close, and thanks again to the community for inviting me to come and share what we're seeing out in the enterprise ecosystem.

