MLOps Community

Pitfalls and Best Practices — 5 lessons from LLMs in Production

Posted Jun 20, 2023 | Views 1K
# LLM in Production
# Best Practices
# Humanloop.com
# Redis.io
# Gantry.io
# Predibase.com
# Anyscale.com
# Zilliz.com
# Arize.com
# Nvidia.com
# TrueFoundry.com
# Premai.io
# Continual.ai
# Argilla.io
# Genesiscloud.com
# Rungalileo.io
speakers
Raza Habib
CEO and Co-founder @ Humanloop

Raza is the CEO and Co-founder at Humanloop. He was inspired to work on AI as “the most transformative technology in our lifetimes” after studying under Prof. David MacKay while doing Physics at Cambridge. Before Humanloop, Raza was the founding engineer of Monolith AI, applying AI to mechanical engineering, and has built speech systems at Google AI. He has a Ph.D. in Machine Learning from UCL.

Demetrios Brinkmann
Chief Happiness Engineer @ MLOps Community

At the moment Demetrios is immersing himself in Machine Learning by interviewing experts from around the world in the weekly MLOps.community meetups. Demetrios is constantly learning and engaging in new activities to get uncomfortable and learn from his mistakes. He tries to bring creativity into every aspect of his life, whether that be analyzing the best paths forward, overcoming obstacles, or building lego houses with his daughter.

SUMMARY

Humanloop has now seen hundreds of companies go on the journey from playground to production. In this talk, we’ll share case studies of what has and hasn’t worked. Raza shares what the common pitfalls are, emerging best practices, and suggestions for how to plan in such a quickly evolving space.

TRANSCRIPT

Introduction

And I am going to bring on a good friend, Mr. Raza from Humanloop. Where you at? There he is. Pleasure to be here. How are you doing? I'm good, man. It is great to have you here. I know that the Humanloop team is helping coordinate the live meetup that's happening in London right now. So shout out to everybody that is watching from London.

We've got a London local giving a talk here, and Raza and Humanloop are doing some incredible stuff. I want to let everyone know that, in case they want to go deeper into Humanloop, we've got the solutions tab on the left-hand side of your screen. You can click on that and find all kinds of good details about Humanloop.

And speaking of live, in-person events: we are tuning in right now to Amsterdam. That's a live view of people having some drinks and pizza by the bar and watching us live, 20 seconds later. So they're not gonna realize it, but we are putting them on screen and they're gonna see it in just a moment, and by the time they see it, they're not on screen anymore.

So Raza, man, I'm going to hand it over to you. So you've got your screen shared. There it is. All right, cool. Now I'm giving it to you and I'll be back in 20, 25 minutes, man. Talk to you soon. Awesome. Thanks very much, man. And nice to meet everyone who's virtually out there joining us.

What I wanted to chat about today was some of the pitfalls and best practices that we've seen from putting LLMs in production, or helping others do that, at Humanloop. So maybe just to start off with: why are we in a position to talk about this? Humanloop is a developer platform that makes it easier for people to build reliable applications on top of large language models.

We've been doing this for over a year now, so, you know, not just post-ChatGPT. And we've seen several hundred projects both succeed and not succeed in production. Through that we've gotten a sense of what some of the things are that really work, and also what some of the things are that seem like they might work or feel like good ideas, but actually end up being pitfalls or things to avoid.

So that's what I'm gonna try and talk about today: what are some of those lessons that we've seen, and how can you apply them to get to more reliable, robust LLM applications in production? Before I dive into the pitfalls and best practices, I think it's helpful for all of us to be on the same page about what we're talking about.

What are the components of an LLM app, and how do I think about this? The way I think about it is that LLM apps are composed of traditional software plus what I call an LLM block, which is the combination of some kind of base model: could be Anthropic, could be OpenAI, could be a custom fine-tuned open-source model.

And then some sort of prompt template, which is a set of instructions or a template into which data is gonna be fed, and some selection strategy for getting that data. So it's becoming increasingly common to do things like retrieval augmentation followed by generation, where a user query comes in, you search for something, and that gets pulled into the context.

GitHub Copilot grabs code from around your code base and puts that into a prompt template. In ChatGPT, it's the history of the conversation that's there. But in either case, you've always got these three components. And then, if we have agents like we just discussed in the previous talk, you'd be chaining these together in some way or putting this in a while loop.

But fundamentally, in order to get something to work well, you've gotta get all three of these pieces right. And so we're thinking about: what is the right base model? How do I come up with the right prompt template? What data goes into it? What's the selection strategy? Where is that coming from?
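As a rough illustration of the three components just described (base model, prompt template, selection strategy), here is a minimal sketch. The names `LLMBlock`, `fake_model` and `last_turns` are hypothetical and not from the talk, and the model call is a stand-in so the example runs offline.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class LLMBlock:
    """Hypothetical container for the three pieces named in the talk."""
    model: Callable[[str], str]               # base model: prompt in, completion out
    template: str                             # prompt template with named slots
    select: Callable[[Dict], Dict[str, str]]  # selection strategy: request -> slot values

    def run(self, request: Dict) -> str:
        slots = self.select(request)            # choose the data that goes in the prompt
        prompt = self.template.format(**slots)  # fill the template
        return self.model(prompt)               # call the base model


def fake_model(prompt: str) -> str:
    """Stand-in for a real model API call (assumption, for offline use)."""
    return f"[completion for {len(prompt)} chars of prompt]"


def last_turns(request: Dict) -> Dict[str, str]:
    """Chat-style selection: keep only the three most recent turns."""
    history: List[str] = request["history"][-3:]
    return {"history": "\n".join(history), "question": request["question"]}


chat_block = LLMBlock(
    model=fake_model,
    template="Conversation so far:\n{history}\n\nUser: {question}\nAssistant:",
    select=last_turns,
)

reply = chat_block.run(
    {"history": ["User: hi", "Assistant: hello"], "question": "What is an LLM block?"}
)
print(reply)
```

Swapping any one of the three fields (a different model callable, a reworded template, a retrieval-based `select`) changes behaviour independently of the other two, which is why each piece needs to be evaluated on its own.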

And the challenges to building strong LLM apps are that getting each of these three parts to be good is still difficult. Prompt engineering, which is the art of coming up with that template or set of instructions to the model, has a big impact on the performance of the model. Small changes can make quite large differences in performance, but it's difficult to know what those changes are unless you do quite a lot of experimentation.

And then hallucinations, or the bullshit problem as I call it, is something that comes up quite a lot: how do you stop LLMs confidently answering things incorrectly? Evaluation is harder than in traditional software because it's often very subjective.

And then if you're using the largest models, they can quickly become very expensive, or they may not be appropriate for situations where you need very low latency. So how can you go beyond some of those things? And related to these challenges, there's just a host of questions that you have to start thinking about once you get into the weeds of building this.

You start off with: okay, what model should I use? Should I be going open or closed? How do I find good prompts? Should I be using chains or agents? Do I use tools? Do I stick to prompting? Should I fine-tune, should I not fine-tune? It goes on and on.

And actually there's a lot of small decisions you've gotta get right in order to get good performance. So what I wanna talk about is how you can get the right structure in place to get those decisions right, and what the differences are between the people who succeed and the ones who don't.

So that naturally leads me into talking about the pitfalls and best practices for building LLM applications. There's really five, or sorry, four that I'm gonna talk about here. There's a fifth one I'll sneak in later, but four that I want to talk about. And I'll go into detail on each of these across the next slides.

So I won't dwell here for too long, but I've left this slide in so that you guys will have a summary at the end and we can come back to it. The first one I want to talk about is objective evaluation, and the necessity for objective evaluation. One of the pitfalls that we see is that, for some reason, when people come to build LLM applications, especially because prompting feels different to traditional software, they don't put in place as systematic a process for measuring things as they might otherwise.

So they start off in the OpenAI playground and they're eyeballing a couple of examples. They're trying things out. If they have a retrieval-augmented system, then even if they do have an evaluation in place for the whole thing, they maybe don't look at evaluating the quality of the retrievals separately from things that are results of the prompt, or from things that come from the embedding.

And the mindset for some reason seems to be a little bit different to how you might go about doing traditional software. I think it's very natural to end up in this, because it feels like you're just programming in natural language. And then, related to this, people don't plan ahead for when things are in production.

What are the feedback signals they're gonna need to be able to monitor things or measure performance? They ultimately end up trying to shoehorn this into traditional analytics tools like Mixpanel, or they dump logs to a database with the intention of looking at those later for driving performance improvements.

And they end up not looking at them, or not really having a sense for how things are working. The consequence of this is that either people give up: they try something and they think, oh, it doesn't work. So we've seen different companies running in parallel, some of them achieving a goal and others mistakenly reaching the conclusion that it's not possible with LLMs.

Or you see people just trying a lot of different things, one after another, changing the retrieval system, trying chains, trying agents, doing different prompting strategies, and they don't really have a sense of whether or not they're making progress. So that's the risk here. And I also wanted to talk about: okay, what are examples of getting it right?

Oh, and before I do that, it's worth bearing in mind that evaluation matters at various different stages of building these applications. Some people have it in some of these, but not in all. The place where it's most often missing is actually during prompt engineering and iteration.

You're tweaking things very quickly. It's very easy to try a lot of different changes without versioning them or seeing the history. But it's also important to have good evaluation once you're in production, so you can monitor what's happening and whether or not your users are actually having a good outcome.

And then finally, if you make any changes, it's hard to avoid regressions. These are the same problems you have with traditional software. But now, if you're going in and changing a prompt, or you're upgrading from one model to the next or something like that, how do you know that you aren't causing regressions?

It's not as easy as it is for traditional software. So maybe to look at one example of an application that really gets this right: I think that's GitHub Copilot. And it's an interesting one to look at because they took evaluation really, really seriously when they did this. GitHub Copilot, which most people are now familiar with, is a coding assistant.

It's one of the most successful applications of LLMs out there, with well over a million users, and it serves a quite critical audience: software engineers. One of the things that they do for evaluation is rely very heavily on end-user feedback in production. So when you get a suggestion from GitHub Copilot, they're looking at: was that suggestion accepted?

But not only was it accepted: does it stay in your code at various time intervals later? So they're really getting a very strong signal of whether the code they generated was actually useful to people. And that allows them both to monitor the application well and to take steps to improve it later.
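The "accepted, and still there later" signal can be sketched roughly as follows. This is an illustration only, not how GitHub implements it: the function name, the checkpoint times, and the exact-substring check are all assumptions, and a real system would need fuzzier matching to tolerate small edits.

```python
from typing import Dict


def retention_signal(suggestion: str, accepted: bool, snapshots: Dict[int, str]) -> Dict[str, bool]:
    """Return the accept flag plus, for each checkpoint (seconds after the
    suggestion), whether the suggested text is still present in the file."""
    signal = {"accepted": accepted}
    for seconds, file_text in sorted(snapshots.items()):
        # Crude check: the suggestion counts as kept only if it appears verbatim.
        signal[f"kept_after_{seconds}s"] = accepted and suggestion in file_text
    return signal


# Hypothetical file contents captured at three checkpoints.
snapshots = {
    30: "def add(a, b):\n    return a + b\n",
    300: "def add(a, b):\n    return a + b\n",
    600: "def add(x, y):\n    return sum([x, y])\n",  # the user rewrote the line
}

signal = retention_signal("return a + b", accepted=True, snapshots=snapshots)
print(signal)
```

The point of the later checkpoints is that an accept alone is a weak signal: a suggestion that is accepted and then deleted within minutes probably was not useful.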

And the amount of thought and engineering that's gone into this, I think, speaks to the importance of having appropriate evaluation tools in place. And it's not just GitHub Copilot that does this. I think the best apps that we've seen out there, and I just put a handful here (ChatGPT, Phind, psda, write, others), try to capture both explicit and implicit sources of user feedback in production, in addition to human feedback during development, and use that for monitoring, improvement and development.

And so one pattern that we've seen work really well is trying to capture these three types of feedback. Votes are the simplest one that you'll see in applications, things like thumbs up, thumbs down. But also actions, so implicit signals of what is and isn't working.

And then finally, corrections. If users are able to edit a generation, then that's a very useful thing to be able to capture. So if you take the case of GitHub Copilot, being able to capture any edited text is extremely helpful. Okay, so that's the first pitfall, but also best practice, right?
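The three feedback types just listed (votes, actions, corrections) could be logged against each generation along these lines. This is a minimal sketch: the `Feedback` record, its field names, and the `summarize` helper are hypothetical, not any product's schema.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Feedback:
    generation_id: str
    kind: str   # "vote", "action" or "correction"
    value: str  # "up"/"down", an action name, or the user's edited text


def summarize(events: List[Feedback]) -> Dict:
    """Aggregate raw feedback events into a few monitoring numbers."""
    votes = Counter(e.value for e in events if e.kind == "vote")
    n_votes = votes["up"] + votes["down"]
    return {
        "upvote_rate": votes["up"] / n_votes if n_votes else None,
        "actions": dict(Counter(e.value for e in events if e.kind == "action")),
        "corrections": {e.generation_id: e.value for e in events if e.kind == "correction"},
    }


events = [
    Feedback("gen-1", "vote", "up"),
    Feedback("gen-2", "vote", "down"),
    Feedback("gen-3", "vote", "up"),
    Feedback("gen-1", "action", "copied"),
    Feedback("gen-3", "action", "copied"),
    Feedback("gen-2", "correction", "Hi Sam, thanks for your time today."),
]

summary = summarize(events)
print(summary)
```

Keeping the `generation_id` on every event is the important design choice here: it lets each signal be joined back to the exact prompt and model version that produced the output.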

Which is the importance of having objective evaluation. If you can't measure things, then you can't improve them, and all the effort you pour into your applications can go to waste. The next one that I think is underappreciated is the importance of having good infrastructure around prompt management.

We have all of these new artifacts floating around that affect the performance of our applications, but it's easy to not give them the same seriousness as you would your normal code. And so it's very common to see people starting off in something like the OpenAI playground and then using Jupyter to push the experimentation a little further.

Maybe prompt templates end up being stored in Excel, or if they're collaborating with non-technical teammates, they maybe put them in Google Docs as well. And the problem with this is it's really easy to lose histories of experimentation. So you're trying lots of stuff. You're implicitly learning a lot in that process that your teammates then don't have access to.

So you're not accumulating learnings. We've seen companies actually run the same set of evaluations with external annotators on the same prompts multiple times, because they hadn't realized that, across teams, they had basically tried the same thing. And so a lot of effort gets spent trying to solve this.

And then I think people also underestimate how hard it is to get the right tooling in place for this. It feels like maybe you can hook together a Streamlit app and Jupyter notebooks with spreadsheets and get something that works quite well. But the challenge is that the maintenance burden becomes really high, and it's difficult to have the right level of access control over who can deploy things to production.

So you often either have a lot of friction with getting things from that system into code and into production, or you end up in a different situation if you do what might feel like the most natural solution, which is: let's just keep everything in code and use Git for versioning and management, right?

Prompt templates are fundamentally just text artifacts. They're like code. The problem that we've seen with that is that, unlike traditional software, there's typically a lot more collaboration going on between domain experts, who might have an ability to contribute to prompt engineering, and software engineers.

And so if you put everything into Git, and you should definitely do that, but you don't also have a way for the non-technical experts to contribute to that or change it, you end up creating a lot of friction in how well the team can work together. And so for the solutions here, there are many options.

Humanloop works on this. You don't need to use Humanloop, but I think there are three things that you want from a system like this, whatever system you choose, to make it work well. You want it to record a history of everything you're trying throughout experimentation, right the way through from when you're playing in the playground to when you're doing more quantitative evaluation. You should be storing that history alongside the model config and making it genuinely, easily accessible so you can draw conclusions from it.

And it needs to be accessible to both non-technical and technical team members, because if you separate this into code, then you end up alienating a really key stakeholder in this process. So that's where I wanna leave that one. The next one I want to talk about is avoiding premature optimization.
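Going back to the prompt-management requirements above (record the history, store it alongside the model config, keep it readable by everyone), a bare-bones version could look like the sketch below. `PromptRegistry` and its methods are hypothetical names, this is not Humanloop's API, and a real system would also need persistence and access control.

```python
import hashlib
import json
from typing import Dict, List


class PromptRegistry:
    """In-memory history of prompt templates stored with their model config."""

    def __init__(self) -> None:
        self.history: List[Dict] = []

    def commit(self, template: str, model_config: Dict, author: str, note: str = "") -> str:
        # The id is a content hash, so identical template + config pairs
        # always map to the same version id, whoever commits them.
        payload = json.dumps({"template": template, "model_config": model_config}, sort_keys=True)
        version_id = hashlib.sha256(payload.encode("utf-8")).hexdigest()[:8]
        self.history.append(
            {"id": version_id, "template": template, "model_config": model_config,
             "author": author, "note": note}
        )
        return version_id

    def get(self, version_id: str) -> Dict:
        for entry in self.history:
            if entry["id"] == version_id:
                return entry
        raise KeyError(version_id)

    def export(self) -> str:
        """Human-readable dump that non-engineers can review."""
        return json.dumps(self.history, indent=2)


registry = PromptRegistry()
v1 = registry.commit("Summarize: {text}", {"model": "example-model", "temperature": 0.7}, "domain-expert")
v2 = registry.commit("Summarize: {text}", {"model": "example-model", "temperature": 0.2}, "engineer",
                     note="lower temperature for consistency")
print(v1, v2)
print(registry.export())
```

Hashing the template together with the config means two teams who try the same thing end up with the same id, which makes duplicated experiments visible.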

And this relates really strongly to the previous talk that we just listened to. I think the phrase used was: my friends on Twitter are gaslighting me with how well these AI agents work. And actually, in practice, that's what we've seen happening with people who've tried to put this in production as well.

We've had a number of customers, across a wide range of use cases, try to build with agents or chains, sometimes moving really quickly to quite complicated sets of chains or agents to try and get things to work, and later on having to rip these things out. And the problem is doing that too soon. I'm not saying that chains of models or agents don't work in certain circumstances; they definitely do.

And we've seen positive examples of it. But if you do it too soon, it makes it harder to evaluate what is and isn't working, because you have multiple different places where you're changing things. And so you have this combinatorial complexity: I have five different prompts in a chain, so which one of those is affecting outcomes?

It becomes harder to maintain things over time because, as much as you might try to modularize this, changes in one place often affect things down the line. And it's also harder to evaluate. So the advice on this one is actually to start with the best model that you can get your hands on.

So typically this is one of the largest LLMs: GPT-4, Claude from Anthropic, whatever it might be. And don't over-optimize cost and latency. Don't over-optimize the complexity of stuff early on. Focus on the limits of what you can do with prompt engineering initially. Push that as far as you humanly can.

And then once you've done that, the next thing that I might consider would be fine-tuning smaller models to try and improve performance or latency. And I would only go to agents or more complex chains if you're in a situation where reasoning is really important, and where you have explored these other alternatives first and found them to be wanting. With one caveat to that, which is there is a very common chain.

There are two caveats, I guess. One is chat: you can think of chat as the most primitive agent, and that obviously works well. And the other is retrieval-augmented generation, where you first call a database of some kind, put that into a prompt template, and then call the model. Those chains work.

But more complicated ones, I would discourage people from starting there unless they've already explored alternatives and seen how well they work. And so when people start off with prompt engineering and they hit the limits, the default today feels like it's to go to something like chaining or agents.
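The simple retrieval chain described above (look something up, put it into a prompt template, call the model) can be sketched in a few lines. Everything here is illustrative: the word-overlap retriever stands in for a real search or embedding index, and `echo_model` is a placeholder for an actual model call.

```python
from typing import Callable, List

# Hypothetical document store.
DOCS = [
    "Refunds are available within 30 days of purchase.",
    "Support is open Monday to Friday, 9am to 5pm.",
    "Shipping to Europe takes three to five business days.",
]

TEMPLATE = "Answer using only this context:\n{context}\n\nQuestion: {question}\nAnswer:"


def retrieve(query: str, docs: List[str], k: int = 1) -> List[str]:
    """Toy retriever: rank documents by how many query words they share."""
    query_words = set(query.lower().split())
    ranked = sorted(docs, key=lambda d: len(query_words & set(d.lower().split())), reverse=True)
    return ranked[:k]


def answer(question: str, model: Callable[[str], str], docs: List[str] = DOCS) -> str:
    context = "\n".join(retrieve(question, docs))                  # step 1: retrieval
    prompt = TEMPLATE.format(context=context, question=question)   # step 2: fill the template
    return model(prompt)                                           # step 3: call the model


def echo_model(prompt: str) -> str:
    """Stand-in model that returns the prompt, to show what would be sent."""
    return prompt


print(answer("How long does shipping to Europe take?", echo_model))
```

Because there is a single prompt and a single retrieval step, it stays clear which part to inspect when an answer is wrong, which is the contrast the talk draws with longer chains.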

But actually I think this is another common mistake, which is to underestimate the power of fine-tuning. People have gotten so used to the power of these very large models that there's a tendency to assume that fine-tuning smaller models will either require more data than they can get their hands on, or that anything smaller than GPT-3.5 won't be effective, so it's not worth trying.

And actually, I think this is a really common misconception. We quite regularly see customers successfully fine-tuning smaller models, often without relying on that much annotated data. Sometimes hundreds of data points is enough for them to get started, and thousands of data points can get you really good performance.

And when you've got a smaller model, you're getting the benefits of, well, lower cost and latency. The consequences of not doing this, that's what my slide says, are increased cost and latency, reliance on models that are larger than you need, and the fact that you don't benefit from proprietary data.

And by that I mean that if you are fine-tuning and you're capturing feedback data in production, then this can give you a bit of a data flywheel, where you are able to improve the performance of your applications quicker than your competitors because you can get better at the task for your specific use case.

The very largest models are very general in their capabilities, but if you are trying to generate a sales email or answer questions about a legal document, you don't need GPT-4 to be able to do all of the other things that it can do, like write poetry or answer questions about sports, whatever it might be.

Those are not relevant tasks. So fine-tuning can be very effective. We see this often underestimated, and people are very surprised when they try it. We had a customer who spent, I think, three weeks doing different forms of prompt engineering. They tried retrieval augmentation, embedding their history of data, and we asked them: if you have all this historical data to embed and it's not changing very quickly, have you tried fine-tuning?

And they hadn't. And when they did, a much smaller model outperformed GPT-4 on their task quite quickly. I think they had three and a half thousand data points. Another concrete example of this in practice is one of my favorite startups, one I use pretty frequently, which is a company called Phind, a search engine for developers.

So it's LLM-based search. They do, like we've discussed, retrieval first; they put that into a template and then they generate an answer to your question. But they're very much focused on questions that are relevant to developers and software engineers. So their model needs to be much better at code and those things, but it doesn't need to be good in general.

And they started with GPT-4. They gathered a lot of user feedback in production, and you can see that on the right-hand side: I just grabbed a snippet of all of the feedback buttons that they have in the app. And as a result, they've been able to fine-tune a custom open-source model.

(On the slide I've written "open Soro" model; I'll fix that after.) And that model is now better in performance in their niche than anything that you could get from the closed model providers. It's not better in general, but it is better for their specific use case. I think it's a really telling case study, and as a result they have lower costs, lower latency, and better performance.

And I asked them if they had any recommendations on what the best-performing open-source models are, and they recommended the T5 and Flan-UL2 family of models from Google. They also said they're exploring Falcon right now, but they've had a lot of success with those models. And I mentioned this in passing, but fine-tuning builds you a level of defensibility that it's maybe difficult to get otherwise.

And a very common pattern for doing this that we've seen work well is that people will generate data from their existing model. They'll filter that data based on some success criteria, and those criteria might be explicit user feedback, or it might actually be another LLM scoring those data points for how good they are.

And then they'll fine-tune. Often they can run a few cycles of this while still getting improved performance over time. So I come to the end of my talk, and I really want to leave you guys with a message, which is that we need to draw the right lessons from traditional software with respect to building applications for large language models.
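The generate, filter, fine-tune loop mentioned above mostly comes down to a filtering step over logged generations. The sketch below shows that step only, under assumptions: the log record fields (`vote`, `score`), the 0.8 threshold, and the prompt/completion JSONL layout are illustrative, not a specific provider's required format.

```python
import json
from typing import Callable, Dict, List


def good_feedback(record: Dict) -> bool:
    """Success criterion: an explicit user vote wins; otherwise fall back
    to a score from an automatic grader (for example, another LLM)."""
    if "vote" in record:
        return record["vote"] == "up"
    return record.get("score", 0.0) >= 0.8


def build_finetune_set(logs: List[Dict], keep: Callable[[Dict], bool]) -> List[Dict]:
    """Keep only the generations that met the success criterion."""
    return [{"prompt": r["prompt"], "completion": r["completion"]} for r in logs if keep(r)]


# Hypothetical production logs.
logs = [
    {"prompt": "Write a follow-up email to Sam.", "completion": "Hi Sam, ...", "vote": "up"},
    {"prompt": "Write a follow-up email to Ana.", "completion": "Dear customer, ...", "vote": "down", "score": 0.9},
    {"prompt": "Write a follow-up email to Lee.", "completion": "Hi Lee, ...", "score": 0.85},
]

dataset = build_finetune_set(logs, good_feedback)
jsonl = "\n".join(json.dumps(row) for row in dataset)
print(jsonl)
```

The resulting file would then be passed to whatever fine-tuning tooling is in use, and the cycle repeated with the new model's logs.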

In the beginning I talked about people having maybe a more casual or less rigorous attitude toward things like prompt engineering than they would toward normal, traditional software, because it feels like it's in natural language and doesn't feel the same as code.

I'm not advocating that we should copy all of the same things from traditional software, though. I think that we need, in this case, our own set of tools that are appropriate for the job and that have been designed from the bottom up, from first principles, with LLMs in mind. And I jotted down some of the traditional software principles that you might have, but thought about why they're different in the case of LLMs.

Right. So the first point I spoke about was having prompt management and a good way of handling that. We have solved this problem for traditional software: we have Git, we have version control. So why do we need something different? In this case, the big difference with LLM applications is that there tends to be a lot more collaboration between technical and non-technical team members.

Prompt engineering is something that you often want a domain expert to be involved in. And also the speed of experimentation and updating is just a lot higher. For traditional software, you write a spec, you can write tests, you get something going, and then you deploy it. And yeah, you're gonna change and update it over time, but not at the frequency that you're gonna be doing this for LLM-based models, whether that's fine-tuning or updating prompts.

And then when it comes to evaluation and testing: obviously we do evaluation and testing for traditional software, but now we're in a situation where the outcomes are non-deterministic. There are a lot of different criteria for success because everything is so much more subjective. So it's much harder to write down an objective metric like accuracy, or even just write a unit test, than it would be even in traditional machine learning, where at least, if you're doing classification or NER or something like that, there is a correct answer.

Here it's not so clear-cut. And then when it comes to something like CI/CD, we obviously want something similar for LLMs, right? We wanna avoid regressions, we want to be able to make changes without worrying about it, but we need a different type of tool, partly because we want a much faster iteration cycle than we might for traditional software, right?

You can go in and make changes in natural language to a prompt and have an impact from that almost immediately. We wanna be learning from data, so having the ability to co-locate feedback data and use it as part of this process is different. And it's just much harder to write unit tests.
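One way to approximate a regression check for prompt or model changes, given that exact-match unit tests rarely fit, is to score both versions on a fixed set of cases and compare averages. This is a minimal sketch under assumptions: the function names, the tolerance, and the substring scorer are hypothetical, and the scorer could be replaced by human ratings or an automatic grader.

```python
from statistics import mean
from typing import Callable, Dict, List

Scorer = Callable[[Dict, str], float]


def evaluate(generate: Callable[[str], str], scorer: Scorer, cases: List[Dict]) -> float:
    """Average score of one prompt/model version over the test cases."""
    return mean(scorer(case, generate(case["input"])) for case in cases)


def check_regression(baseline, candidate, scorer: Scorer, cases: List[Dict],
                     tolerance: float = 0.02) -> Dict:
    base = evaluate(baseline, scorer, cases)
    cand = evaluate(candidate, scorer, cases)
    # Flag a regression only if the candidate is worse by more than the tolerance.
    return {"baseline": base, "candidate": cand, "regressed": cand < base - tolerance}


def contains_expected(case: Dict, output: str) -> float:
    """Loose scorer: 1.0 if the expected phrase appears anywhere in the output."""
    return 1.0 if case["expected"].lower() in output.lower() else 0.0


cases = [
    {"input": "capital of France", "expected": "Paris"},
    {"input": "capital of Japan", "expected": "Tokyo"},
]


# Stand-ins for two prompt versions calling a model (canned outputs, for offline use).
def prompt_v1(text: str) -> str:
    return {"capital of France": "Paris.", "capital of Japan": "Tokyo."}[text]


def prompt_v2(text: str) -> str:
    return {"capital of France": "It is Paris.", "capital of Japan": "I am not sure."}[text]


report = check_regression(prompt_v1, prompt_v2, contains_expected, cases)
print(report)
```

Because model outputs vary between runs, a real check would likely sample each case several times and use a tolerance rather than demanding identical scores.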

You can maybe use LLMs as part of this, because they're quite good at evaluating their own outputs, but there isn't a correct answer for a lot of the things we want LLMs to do, which is why we need to have something that's slightly different. So I would encourage people to be rigorous in their development of LLM applications, especially if they want to see success in production, but also to acknowledge the differences in what they're doing now compared to traditional software.

And that's really what we've been thinking about as we've been developing applications at Humanloop. We're trying to build the right tooling for this paradigm from the ground up. So thanks very much, and if there are any questions, I'd be very happy to answer them.

You saved me. Your timing is impeccable, and I just have to say that you make my job much easier. So I'm very glad. Demetrios, I had a little stopwatch up in front of me. I was like, if I can get the same stuff in, then I'll accelerate in the places where I can. Yes. So now we have time for questions.

There are some awesome questions that came through in the chat, and I get to ask them to you directly. First things first: how do you manage these so-called data leaks that you spoke of? Or leaks, I think, is what you said; maybe I threw in the data there. I missed it. Did I mention leaks?

Does someone mind just clarifying what they mean? Yeah, in the meantime, tell us what exactly you mean by leaks, and I'll go on to the next question, which is a bit of an ethical one. It's a beefy one, so put on your ethicist hat. What are your views on the ethics of GitHub Copilot sending telemetry data back to Microsoft?

Is this an example of opting into alignment, or something more sinister? I think as long as people know upfront what they're signing up for, and you always get permission from users to capture telemetry, then I think it's fine, right? People have to give it willingly.

And as long as it is willingly given, and everyone's clear on what's going on, then actually it improves the service for everybody and everyone benefits together. But I think where it would be a bit sinister is if they were capturing these things without permission. So for me, that's the ethical line that I wouldn't wanna cross.

But as long as you make it clear to your users what you're doing, I actually think this benefits them. So basically, if you say it, it's cool. If not, not cool. No bueno. Yeah, let people choose. That makes sense. All right, so do you have examples of tools we can leverage for prompt management?

Aside from general solutions like Google Docs, et cetera. I think you may know a tool that helps with that. In this case, part of the reason I've seen this so much and thought about it so much is because this is part of the Humanloop platform, right? So part of what Humanloop does is give you a place where your team can iterate on prompts, find out what works, evaluate performance, and manage the CI/CD.

The reason that I'm able to talk so deeply about some of these problems is we've been building tooling specifically to solve them. So I'd encourage you to come check it out. You can sign up and try it for free. And yeah, that's my main recommendation.

A few questions came through about fine-tuning. So basically there was one that said, ah, every time somebody asks a question, the chat pulls me back down. So: can I use a 16-gigabyte laptop without GPUs to fine-tune Falcon? And I guess they're talking about the small Falcon, I would assume.

Yeah, I would assume so. Honestly, I would assume not, but I don't know. As in, I'd be very surprised if it's the case. We've been getting Falcon up and running to try and include it as one of the models in the Humanloop playground, working with a company called Mystic AI that I'll give a quick shout-out to.

Cause they've been super helpful to us. And it's been hard work to actually get those models to be served in a performant way, with low latency, and to fine-tune them. And that's operating on a lot of GPUs. So I'd be surprised if you can get the Falcon model, you know, even the medium-sized one, there's two, right?

The smaller and larger, running on a laptop without GPUs at 16 gigs. But, you know, people are very creative, and maybe you can quantize it or do something smart. So I don't wanna say it can't be done, but I'd be surprised. Yeah, maybe next week somebody's gonna come out with something cool, like, run it on your phone.
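As a rough illustration of the memory arithmetic behind this answer, here is a minimal sketch. It is not from the talk: the function names are hypothetical, the parameter counts are the nominal 7B and 40B figures for the two Falcon sizes, and the bytes-per-parameter constants are common rules of thumb (mixed-precision Adam for full fine-tuning, 4-bit quantized weights) that ignore activations and framework overhead.

```python
# Back-of-the-envelope memory estimates. All constants are rule-of-thumb
# assumptions, not measurements, and activations/overhead are ignored.

GB = 1e9  # decimal gigabytes

# Assumed cost of full fine-tuning with Adam in mixed precision:
# fp16 weights (2) + fp16 gradients (2) + fp32 master weights (4)
# + two fp32 Adam moment buffers (4 + 4) = 16 bytes per parameter.
FULL_FINETUNE_BYTES_PER_PARAM = 16

# Assumed cost of just holding weights quantized to 4 bits.
INT4_BYTES_PER_PARAM = 0.5


def full_finetune_gb(num_params: int) -> float:
    """Approximate memory for weights, gradients and optimizer state."""
    return num_params * FULL_FINETUNE_BYTES_PER_PARAM / GB


def int4_weights_gb(num_params: int) -> float:
    """Approximate memory for 4-bit quantized weights only."""
    return num_params * INT4_BYTES_PER_PARAM / GB


# Nominal (approximate) parameter counts for the two Falcon sizes.
FALCON_SIZES = {"falcon-7b": 7_000_000_000, "falcon-40b": 40_000_000_000}
LAPTOP_RAM_GB = 16

for name, n in FALCON_SIZES.items():
    fits = int4_weights_gb(n) < LAPTOP_RAM_GB
    print(
        f"{name}: full fine-tune ~{full_finetune_gb(n):.0f} GB, "
        f"4-bit weights ~{int4_weights_gb(n):.1f} GB, "
        f"4-bit weights fit in {LAPTOP_RAM_GB} GB: {fits}"
    )
```

Under these assumptions, full fine-tuning of even the smaller model needs on the order of 112 GB before activations, which is consistent with the skepticism above. The 4-bit weights of the smaller model come to roughly 3.5 GB, which is why quantization is the plausible route if it can be done at all.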

It's all good. So, going to the, uh, Going to the question about, and I'm just checking time. This might be our last one because I wanted to get in a quick meditation during the break before we have Wale's, um, Wale's talk coming up there is an awesome question here from Matt, specific version of the data leak question.

How do you and Humanloop think about the model exposing private information that it has been pre-trained or fine-tuned on? If you fine-tune on docs that not everyone at the company has access to and someone prompt-injects it to get the output that... Yeah. So this is a really interesting question, and maybe I can expand a little bit on just explaining what Matt means for people out there.

Because I don't know if everyone's come across this issue before, but the situation you would be in is: you go and gather a bunch of data and you fine-tune a model. Some of that data may be private to, you know, maybe a subsection of your company, but now it's stored in the weights of your model.

Could other people at the company somehow exfiltrate data that they shouldn't have access to? And how can you mitigate that or have that control? The honest answer is, beyond fine-tuning separate models or not using instruction-tuned models, I don't think there is a very clear-cut solution to this yet. I think this is something that people are still working on.

I haven't actually seen a situation where people have shared a fine-tuned model across people who wouldn't all have had access to the training data. It's either been in cases where the training data was publicly available, like where they were fine-tuning on, you know, search results, or it's been within a company department where everyone has data access.

If you were trying to control that, my suspicion is you might want to have something like adapters. These are lightweight things, basically low-rank changes to the model that you can fine-tune and swap out, so that you could have multiple different versions of a fine-tuned model, and you could control access to each for certain subsets of your company. But I think it's an unsolved problem.
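The adapter idea can be sketched in a few lines. This is only an illustration under stated assumptions, not Humanloop's implementation or any library's API: `Adapter`, `AdapterRegistry`, and the group names are hypothetical, a single small weight matrix stands in for a whole model, and the low-rank update is written as W' = W + B·A in plain Python.

```python
from dataclasses import dataclass

Matrix = list[list[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain-Python matrix product of a (n x k) and b (k x m)."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def add(a: Matrix, b: Matrix) -> Matrix:
    """Element-wise sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


@dataclass(frozen=True)
class Adapter:
    """Hypothetical low-rank adapter: the weight update is B @ A."""
    name: str
    allowed_groups: frozenset[str]  # groups cleared to see the training data
    B: Matrix  # d_out x r
    A: Matrix  # r x d_in


class AdapterRegistry:
    """Holds one shared base weight plus access-controlled adapters."""

    def __init__(self, base_weight: Matrix) -> None:
        self.base_weight = base_weight
        self._adapters: dict[str, Adapter] = {}

    def register(self, adapter: Adapter) -> None:
        self._adapters[adapter.name] = adapter

    def weights_for(self, user_groups: set[str], adapter_name: str) -> Matrix:
        """Return base + low-rank update, only if the user may use the adapter."""
        adapter = self._adapters[adapter_name]
        if not (adapter.allowed_groups & user_groups):
            raise PermissionError(f"no access to adapter {adapter_name!r}")
        return add(self.base_weight, matmul(adapter.B, adapter.A))


# Toy example: a 2x2 base weight and a rank-1 adapter for a "finance" group.
registry = AdapterRegistry(base_weight=[[1.0, 0.0], [0.0, 1.0]])
registry.register(
    Adapter(
        name="finance-docs",
        allowed_groups=frozenset({"finance"}),
        B=[[1.0], [2.0]],
        A=[[0.5, 0.5]],
    )
)
finance_weights = registry.weights_for({"finance"}, "finance-docs")
```

The design point is that the base weights never see the restricted data. Only the adapter does, so gating who can load the adapter gates who can query a model that has memorized that data. This does not address leakage between users who share the same adapter.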

Dude, awesome. There are a few more incredible questions coming through in the chat, and so I am going to direct you there. If anyone wants to talk with Raza, keep the questions coming. I'll go to the chat and I'll answer people there if people wanna... Here we go. There were some awesome ones.

So, Raza, my man. Thank you so much. And if anybody wants to know more about Humanloop, you just gotta click on the solutions tab on your left-hand sidebar and you can go down and explore the booth, talk with some people. And also, if you're in London and you're at the watch party that's happening, what's happening?

And thank you Raza for making that happen. Thank you for sponsoring this event. It is so cool to have you here. And of course, thank you for the wisdom and thank you. Thanks so much for having me, and thanks for organizing an awesome conference. There we go. All right, man. I'll see you later.
