
End-to-end Modern Machine Learning in Production

Posted Jul 14, 2023 | Views 391
# RLHF
# LLM in Production
# Hugging Face
SPEAKERS
Omar Sanseviero
Machine Learning Lead @ Hugging Face

Omar Sanseviero is a lead machine learning engineer at Hugging Face, where he works at the intersection of open source, community, and product. Omar leads multiple ML teams that work on topics such as ML for Art, Developer Advocacy Engineering, ML Partnerships, Mobile ML, and ML for Healthcare. Previously, Omar worked at Google on Google Assistant and TensorFlow.

Demetrios Brinkmann
Chief Happiness Engineer @ MLOps Community

At the moment Demetrios is immersing himself in Machine Learning by interviewing experts from around the world in the weekly MLOps.community meetups. Demetrios is constantly learning and engaging in new activities to get uncomfortable and learn from his mistakes. He tries to bring creativity into every aspect of his life, whether that be analyzing the best paths forward, overcoming obstacles, or building lego houses with his daughter.

SUMMARY

In this quick talk, Omar covers RLHF, one of the techniques behind ChatGPT and other successful ML models, as well as efficient training techniques (PEFT), on-device ML, and optimizations.

TRANSCRIPT

Who do we have up next? None other than Omar. Hey, how are you? I'm great, man. It's great to have you here. What is it, a Friday afternoon for you? Almost Friday night, so yeah, it's 8:00 PM, but it's very sunny, so it's good. Well, you can handle it, and I appreciate it. I appreciate you taking the time out of your Friday evening to come and talk to us.

Yeah, thanks a lot for the invitation. Excited to be here. Dude, so I'm gonna kick it off to you and I'll be back in 10 minutes. I'll get this little ad off the screen and I'll share your screen, and then we're gonna keep it rocking. Go ahead, man. Awesome. Perfect. So, hi everyone. I'm Omar. I work at Hugging Face as a machine learning lead.

And today I will be talking about end-to-end modern machine learning. This is an extremely complex topic. It's changing very fast, and there is no way that in 10 minutes I can talk about everything here. So my goal today is really to raise awareness of the existing tools out there that can make it easier for you to use state-of-the-art ML models in your products and your services.

So I will divide this talk into two parts. In the first part, I will talk about how to do inference with these models, and in the second part, I will talk about how to adjust these models for your own very particular use cases. So if you have been looking at Twitter this year, every few weeks there are new exciting models.

Here in this slide I put three examples of popular recent models from the last few months. StarCoder, at the left, is an open source, community-led Copilot replication. It's a model that can generate code for many, many different programming languages. There is LLaMA by Meta, which has spawned a great number of tools and research around it.

So from llama.cpp, which is a C++ version of it, to many other applications and tools around it. And two weeks ago there was Falcon. It's a very, very recent model, and it's right now considered the best open source LLM, large language model. Each of these has a slightly different license, as you can see here.

All of these are quite large, and they go from a few billion parameters, or 7 billion parameters, up to 65 billion parameters in the case of LLaMA. So there are many challenges in using these models for inference, and here are some common ones. For example, as I mentioned before, the models are huge.

Falcon, for example, requires a lot of GPU memory, 90 gigabytes of GPU memory. That means that not even an A100, which is considered one of the best GPUs to use, can hold it; you cannot put that model in a single A100. So it makes it very hard to use, as the model won't fit in a single machine.

But it's not just the model size. It's also the evaluation of these large language models. There are many benchmarks right now with which you can evaluate these models, but the benchmarks are not necessarily representative of real-world usage, or the real-world use cases in which you will put these models.

And most likely, most users or companies will want to adjust or tune these models for their own data, their own use cases. And then you will expect very fast latency. So there are many things that have been popping up in the ecosystem in the last couple of months, the last half year.

So, for example, there are techniques such as loading models in eight-bit or four-bit mode that allow you to use less memory to load these models. There are open source libraries such as bitsandbytes or Accelerate. With four-bit mode, for example, you can load the larger Falcon model with just 27 gigabytes of RAM, which is still a lot, but much less than the 90 gigabytes I was talking about before.
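(A rough sketch of what four-bit loading can look like with the transformers, accelerate, and bitsandbytes libraries; the checkpoint name and generation settings below are illustrative, not from the talk.)

```python
# Sketch: loading a large model in 4-bit with transformers + bitsandbytes.
# Checkpoint and settings are illustrative; adjust to your hardware.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "tiiuae/falcon-40b"  # example checkpoint; any causal LM works

# Weights are stored in 4-bit; compute runs in bfloat16.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",         # accelerate spreads layers across GPUs (and CPU if needed)
    trust_remote_code=True,    # Falcon shipped custom modeling code at the time of the talk
)

inputs = tokenizer("def fibonacci(n):", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=30)[0]))
```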

And you can even do some interesting things such as putting part of the computation on the CPU, which of course means that it will be much slower, but you can use very large models. Then there are also tools, such as Text Generation Inference, that are optimized entirely for LLMs. I would like to talk a bit more about Text Generation Inference, which is our open source library.

You can go to GitHub and find it. There are many features that this library has, but I put a couple of the interesting ones here. So, for example, tensor parallelism allows you to use multiple GPUs for a single model. Pretty much what it means is that you can split a tensor into slices, and each slice is processed on a different GPU.

If you have used ChatGPT, for example, when you're chatting with it, you won't receive the full response at once. You will be receiving characters, or tokens, one at a time. That's called token streaming, and it's essential for fast latency. Rather than waiting for the full generation, what you want is to have the server answering as soon as it starts to generate tokens.
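(As a small sketch of consuming that stream, assuming a Text Generation Inference server is already running at a placeholder local URL; the huggingface_hub InferenceClient is used here, and the prompt is just an example.)

```python
# Sketch: consuming token streaming from a Text Generation Inference server.
# The URL below is a hypothetical local endpoint, not a real deployment.
from huggingface_hub import InferenceClient

client = InferenceClient("http://localhost:8080")

# stream=True yields tokens as soon as the server generates them,
# instead of waiting for the full completion.
for token in client.text_generation(
    "Explain tensor parallelism in one sentence:",
    max_new_tokens=60,
    stream=True,
):
    print(token, end="", flush=True)
```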

It provides a much faster UX and a nicer experience. There is metrics monitoring, quantization with tools such as bitsandbytes, which is what I was mentioning before, and there are many other optimizations, such as Flash Attention for fast attention mechanisms, and many other things.

So this tool, Text Generation Inference, is actually being used right now in a couple of different places in the open source ecosystem. There is HuggingChat, which is an open source UI, like ChatGPT, but for open source LLMs. There is an effort called Open Assistant, which is a community project around producing and training very large LLMs.

It launched a couple of months ago. There is also the LLM Playground. So all of these are examples that are powered by Text Generation Inference. It has been battle tested, and it's, again, a fully free open source tool. So this was the part on how to use these models to do inference: a research lab or the community shares a model, and you put these models in production. But most likely, as I was mentioning before, you will want to adjust or tune this model for your own use case.

Most people here are probably familiar with some of this. In the classical ML setup, you train a model from scratch. You will require lots of data, lots of compute, as well as expertise on how to train these models. That's usually quite expensive. So in the last five, six years, many people have been doing fine-tuning, and fine-tuning requires much less data.

In fine-tuning, you pick a base model, a very large model that was most likely very expensive to train, shared by a research lab that has lots of compute. And then you adjust or fine-tune it to your own domain or data. It can be your own company data or your own personal data. And here you need much less data, much less compute.

You can train models much faster. But now, with very large LLMs, it's becoming very hard to just do fine-tuning, and that's where PEFT, or parameter-efficient fine-tuning, is quite interesting. It enables fun use cases such as fine-tuning Whisper or Falcon, the model I was talking about before, in a free Google Colab instance.

So without having to pay for expensive GPUs. The idea of PEFT, in 20 seconds, is that rather than tuning the full model, you freeze the model and add some adapter, some additional parameters, and those are the ones that you train. You will maybe have a bit of a quality hit, but even then, the performance will be almost on par and inference will be just the same, but the training will be extremely fast.
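(A rough sketch of that idea with the peft library; the base checkpoint and LoRA hyperparameters below are illustrative, not a recommendation from the talk.)

```python
# Sketch: parameter-efficient fine-tuning with a LoRA adapter via the peft library.
# The base model stays frozen; only the small adapter matrices are trained.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("bigscience/bloom-560m")

lora_config = LoraConfig(
    r=8,                  # rank of the low-rank update matrices
    lora_alpha=16,
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

# Freezes the base model and injects small trainable adapter matrices.
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # typically well under 1% of all parameters
```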

So I will talk about two or three quick examples of PEFT. For example, there's Whisper. Whisper is an automatic speech recognition model, which means that it can transcribe audio files to text, and fine-tuning Whisper in a Google Colab would usually crash your memory; you will get an out-of-memory error.

But with LoRA, which is a parameter-efficient technique, a PEFT technique, you can just add a small adapter, which will be much smaller than the original model, and it will enable much, much faster training with similar quality and without requiring extremely large additional models.

PEFT can also be used for things such as Stable Diffusion. Stable Diffusion, if you don't know, is an image generation model, and you can have different adapters for different concepts, for example. But in the context of LLMs, PEFT enables things such as fine-tuning Falcon even if you don't have that much compute power.

And there is a very recent technique from about a month ago called QLoRA, which allows you to fine-tune a model with four-bit quantization and adapter tuning. Long story short, it requires much less memory to train models, and that enables very interesting things.
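(Roughly, QLoRA combines the four-bit loading shown earlier with a LoRA adapter on top. The sketch below assumes transformers, bitsandbytes, and peft; the Falcon checkpoint, target module name, and hyperparameters are illustrative.)

```python
# Sketch of the QLoRA idea: load the base model in 4-bit, then train a LoRA adapter on top.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # the 4-bit data type used by QLoRA
    bnb_4bit_compute_dtype=torch.bfloat16,
)

base = AutoModelForCausalLM.from_pretrained(
    "tiiuae/falcon-7b",
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)
base = prepare_model_for_kbit_training(base)  # makes the quantized model ready for training

adapter_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["query_key_value"],  # Falcon's attention projection layer
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, adapter_config)
# `model` can now be trained with a regular training loop; only the adapter updates.
```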

So, for example, in the context of RLHF, reinforcement learning from human feedback, which was the topic of the previous presentation, what you can do is have a single base model and then multiple adapters. For example, for the preference model you will have a different adapter, and for each of the components you will have different adapters, but you will always keep a single base model.
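(One way this might look with the peft library: a single frozen base model with several named adapters loaded on top, switching between them as needed. The adapter repository names below are hypothetical placeholders, not real checkpoints.)

```python
# Sketch: one frozen base model with multiple LoRA adapters loaded on top,
# e.g. a policy adapter and a preference/reward adapter in an RLHF setup.
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained(
    "tiiuae/falcon-7b", device_map="auto", trust_remote_code=True
)

# Load a first adapter and give it a name (path is a placeholder).
model = PeftModel.from_pretrained(base, "my-org/falcon-policy-adapter", adapter_name="policy")
# Load a second adapter into the same wrapped model.
model.load_adapter("my-org/falcon-reward-adapter", adapter_name="reward")

model.set_adapter("policy")  # generations now use the policy adapter
model.set_adapter("reward")  # ...and now the reward/preference adapter
```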

This is a very recent area, just from the last couple of weeks or months, but it enables some very interesting things. So again, what I wanted to do today was just give a very high-level overview of the current state of the ecosystem, and I hope this was useful.

Everything is on GitHub. Everything is open source, so feel free to check it out and give it some stars. Thanks. Excellent. Incredible. Omar, thank you. Yeah, thanks. And I'm listening to your accent right now, and I get the feeling you're from some Spanish-speaking country, maybe.

Yeah, so I'm from Peru originally, and I grew up in Mexico. Yeah, yeah. All right, there we go, man. So if anyone wants to continue the conversation with Omar, throw it in the chat and he will cruise on over there. Thank you so much, Omar. This is awesome, man. I really appreciate you coming on here and joining us.

Yeah, thanks a lot for the invitation. Thanks, everyone. See you.

