MLOps Community

Helix - Fine Tuning for Llamas // Kai Davenport AI in Production Lightning Talk

Posted Feb 22, 2024 | Views 252
# Finetuning
# Open Source
# AI
SPEAKERS
Kai Davenport
Software Engineer @ HelixML

Software person, mainly infrastructure and APIs, lives in Bristol, and enjoys a simple life.

Demetrios Brinkmann
Chief Happiness Engineer @ MLOps Community

At the moment Demetrios is immersing himself in Machine Learning by interviewing experts from around the world in the weekly MLOps.community meetups. Demetrios is constantly learning and engaging in new activities to get uncomfortable and learn from his mistakes. He tries to bring creativity into every aspect of his life, whether that be analyzing the best paths forward, overcoming obstacles, or building lego houses with his daughter.

SUMMARY

A quick rundown of Helix and how it helps you fine-tune text and image AI, all using the latest open source models. Kai discusses some of the issues that cropped up when creating and running a fine-tuning-as-a-service platform.

TRANSCRIPT

Helix - Fine Tuning for Llamas

Slides: https://docs.google.com/presentation/d/1UKryRg3bUcyL30WX6ZMCpx-RGGEvLaB8/edit?usp=drive_link&ouid=112799246631496397138&rtpof=true&sd=true

Demetrios 00:00:01: Here we go, here we go. Keep it rocking, keep it moving. It's been too long.

Kai Davenport 00:00:07: Bless you, sir. Nice to see your face. I also heard the song; I thought it was very cool, especially when it kicked in with the lyrics. I'm like, go, Demetrios. He's got some production skills over here. I like it.

Demetrios 00:00:19: That is a lot of what we would call Auto-Tune, and so there has been no AI that was used in this. Like, I didn't generate any of the beats with AI, except you could potentially argue that Auto-Tune is a little bit of machine learning AI, so I don't know.

Kai Davenport 00:00:45: I think it was a bona fide classic. That's what I think.

Demetrios 00:00:48: There we go. That's what I'm talking about, Kai. That's why I bring you on here, man. I knew there was a good reason that we had you as a speaker.

Kai Davenport 00:00:56: Anyway. Yes. I'm going to speak a little bit about Helix, which is a project, you know, my co founder Luke.

Demetrios 00:01:02: I've heard of that guy.

Kai Davenport 00:01:03: We've worked before in the past together.

Demetrios 00:01:06: Should we tell everyone? Real fast, let's just tell them the quick story, because there is a story to be told, and I want to let everyone know. So Kai and I worked together, and basically we worked with Luke, who we're talking about. Luke was the CEO of this company called Dotscience, and when Covid hit, all of a sudden the company went out of business, and so then, boom, we were like, oh no, we don't have jobs anymore. Except it turned out to be a blessing in disguise for, like, nine out of ten of us, because, Kai, you went on to do great things with your life. The community went on to be created because of that, and here we are, man.

Kai Davenport 00:01:54: I think definitely that moment can be attributed as the forging place of this great community, and, I mean, it's down to you, what you might say, the forge master. But, yes, we just discussed the origins of the MLOps community, and it was an exciting time, man, like I said, and well done.

Demetrios 00:02:13: I applaud you, dude. Couldn't have done it without you. So, all right, talk to us about Helix and what you and Luke have been up to.

Kai Davenport 00:02:21: Yeah, Helix is an exciting project. I was working with Luke last year. We were actually doing, and I'm just going to not go into details, a decentralized marketplace for AI, right? So it kind of had that Venn diagram of, like, crypto plus AI, woohoo. It was a cool project. And then this "We Have No Moat" memo came out of Google, right? And they were like, oh no, open source is going to come for our lunch.

Kai Davenport 00:02:52: And it was an interesting debate. Now I put a question mark there because who knows, and especially today, if you've seen OpenAI's video model, then they clearly still have a bit of a moat, right? But who knows how this is going to go. Me and Luke were discussing this: oh, open source stands a chance, especially if you remember this thing called Windows versus Linux and what is going to end up running the Internet, question mark. It feels like a similar thing. And Linux, in my head, won that, because open source beats out all the edge cases you could possibly handle inside a company. Anyway, there's a big debate about who's going to win this race. Then we started talking about, regardless of who wins the big fat model race, what does it look like if you take domain-specific knowledge and you fine-tune a smaller model with fewer parameters? And that sounded like an interesting idea, so let's park that for a bit. And then I'm not quite sure what happened, but we decided to just start a company and build the thing.

Kai Davenport 00:03:51: And then Mistral 7B, which I'm sure a lot of you know, but if you don't, it's an open source model, and it performs very well. Mixtral, which is the mixture-of-experts architecture, comes out of the same guys that made Mistral, and it performs very well. And you can fine-tune it on a 4090, it turns out, which is kind of exciting for me because I'm sat there with a 4090. I don't have a big rack of GPUs to play around with. So we got to work and we thought to ourselves, what would it look like if you put a UI on it, just like ChatGPT? That kind of aha moment where you ask it a question, it comes back with the answer, and you're like, whoa, for a guess-the-next-word machine, this is actually really good at what it does. It's impressed me, but you can't install it on your own computers, was the hypothesis we were playing with. And hence, let's build Helix, which is a UI-deployable platform for running best-of-breed open source models, in a nutshell, but with a distinct lean towards making fine-tuning easier. And that's basically the core tenet of Helix.

Kai Davenport 00:04:57: Now, it's not to say fine-tuning is better than RAG; we are actually building RAG (retrieval-augmented generation) features into Helix as well, because it's not like we're trying to bang the drum that fine-tuning is the way forward. Clearly one is a chisel and the other is a hammer: pick the right tool, maybe combine them. That's also possibly going to yield better results. We were speaking about evaluation just now, so let's use these evaluation frameworks and find out which approach is better. The key thing with Helix we're trying to do, though, is make it really easy to do the fine-tuning right. And so you click and drag your documents. You have to make these question-answer pairs, which is, I had no idea about this, but it turns out to be a core tenet, because large language models are really good at guessing the next word, but they don't really understand the concept of a question.

Kai Davenport 00:05:46: Right. They're just like using math. I'm going to guess the next word would be what we have to do to fine tune nicer, make good fine tuned models as a service is also have the question answer generation step very much part of the workflow. So ideally the workflow is drag your document in, it cracks on with a nice UI, as you can see in the background, and once it's finished, it tells you here it is. Now, as part of that process, though, we need to kind of go from this. Like here is some text, which is the documents you uploaded, your pdfs or your word documents. It's just a stream of tokens, and we kind of need to convert those into the sorts of questions that the end users of this model might ask. Right? And so we actually use a large language model in that step as well, which is we kind of go through the documents and we chunk down into smaller bits, and then we get a large language model to say, give me a list of questions that represent this text.

Kai Davenport 00:06:41: And we do that in lots of different prompts. And it's actually that step that leads to the higher quality of fine tunes. And Demetrius mentioned how we got fine tuning Mistral not to suck. And it was a blog post that Luke put out last week, which is basically kind of like the culmination of our efforts in making significant leaps forward with the quality of our fine tuned because we're doing the question answer step much more rigorously might be the way I say it, as in we're doing a lot more of it and different. But here on the screen here, it's just this kind of process of saying like given the raw text, let me generate a question answer pair, which then the fine tuned model would know how to complete if asked that question. So a quick diagram to kind of explain our workflow. Underneath the hood is upload your documents, PDFs, HTML document, word documents or URLs, essentially unstructured or llama index. These sorts of tools that will do the extract bit for you.

Kai Davenport 00:07:41: We're actually using them under the hood, right? So get me the plain text out of all of the stuff you've put in. Convert that plain text into question answer pairs using a large language model. We actually use Mixtral to do that because it turns out to be much better, producing quality QA pairs that gives you your data set and then you can fine tune from that data set. We use Axolotl under the hood, but we're also poking around with all sorts of different projects that will allow you to fine tune various different open source models. So our UI and our control plane is very agnostic when it comes to what actual model we use. At the moment it's mistral and Mixtral to do the QA pair generation, and this is one of my favorite slides because it's like what happens once you've done all of that is like without really knowing it, the AI now knows something completely new, and that's really useful, it turns out, for specific use cases. So it's like neo from the Matrix. I don't know how, but now I know kung fu.

Kai Davenport 00:08:40: And that's the Lora file, the low rank adaptation file. It's essentially kind of like a boost for the weights. It's beyond my pay grade to tell you any more about how that stuff actually works. I'm sure there's lots of people in the community that can explain it happily, but our job is to produce those lower files by clicking and dragging and then get inference sessions to happen that use those lower files. And also to talk about some other use cases, like looking at time here. So not to go on and on, but obviously code is a really important one. Fine tuning a language model on your code base, that could be interesting, especially when it comes to like hey, write me a thing, just like I've written 20 other things, but with some fuzziness thrown in. That could be an incredibly useful version of copilot, which is single handedly the most productive tool I've ever installed on my computer anyway.

Kai Davenport 00:09:33: But getting it to know my code base would be the reason to do fine tuning on code voice would be another one. A use case we've been toying around with has been what happens if your star salesperson leaves, and it's just like their kind of tone and character was actually what made them a star salesperson. And we could fine tune our language model on all of their communication and such forth. That might be very interesting. How am I doing for time? Just looking at a couple more minutes. So let me crack on for a couple of minutes. Here's a really important part of this whole thing is because it's open source models, you can run it on your own infrastructure. And I think that's one of our distinct hypothesis, is that companies who have strong regulatory concerns with where they can send their data, like they would love to use GBT four, right? They can see the value, but they can't be sending their data off to the US.

Kai Davenport 00:10:32: And they would love to get their hands on similar quality models. And so if you can say, here is a platform you could run on your infrastructure, running open source models, that's one of our hypotheses. And I think the other one is like, can we actually be the go to SaaS platform for open source models, right. Which is clearly a different play altogether. And who knows? We're still finding out which we think might be the best thing. We're getting some signal in both places, is what I would say last part I will talk about, because this is running AI in production is like we've done some tricks when it comes to using GPU memory efficiently, because there's a lot of I o latency when it comes to taking lower files and base foundation model weights and sticking them into VRAM and then running inference on them. So you need a kind of multitenant solution to this. There's projects like VLLM and other things that are like, let's run inference at scale in production.

Kai Davenport 00:11:30: Not all of them will support every format of Lora file. So you have these interesting, let me call it compatibility matrixes, where you come to run all of these different things on servers. But the key thing we've realized is boot a model and its weight into memory, and then run multiple inference sessions in a multitenant way on that same model instance. That's been something we've had to solve, because obviously we're running in production clusters of gpus. But also when we install this for people on premise, they would also want to connect it to clusters of gpus and use the same management layer that we've got for helix. So I'm going to throw that up in the air and say, that's my quick ten minute, probably, hopefully that was about ten minute brain dump of helix and say to Demetrios back to you, do you have any questions?

Demetrios 00:12:19: We ain't got time for questions. Man. What are you crazy? We front loaded this with a story.

Kai Davenport 00:12:26: We did, didn't we? I was, like, watching the clock going, no, hurry up, hurry up, dude.

Demetrios 00:12:35: It was awesome, though. And I will say two things. Jump in the chat if you have any questions, and there are some people mentioning that they have questions, so yes, there are questions coming through the chat. You did say one thing that I think is just perfect timing: you said Copilot is one of the most productive tools that you have. Guess what we've got.

Demetrios 00:13:02: Brian coming on. Where you at, Brian? Brian's coming.

Brian 00:13:06: Me?

Demetrios 00:13:06: What do you do, Brian? What kind of stuff do you work on?

Brian 00:13:09: I work on GitHub Copilot, as well as any other GitHub product in general.

Demetrios 00:13:16: There we go. Yeah, Kai is very happy with that. All right, Kai, thanks so much, man. You always brighten up my day. I appreciate it. And we'll let you go to the chat to answer all the questions.
