Building Defensible Products with LLMs

Posted Apr 27, 2023 | Views 1.3K
# LLM
# LLM in Production
# Humanloop
# Rungalileo.io
# Snorkel.ai
# Wandb.ai
# Tecton.ai
# Petuum.com
# mckinsey.com/quantumblack
# Wallaroo.ai
# Union.ai
# Redis.com
# Alphasignal.ai
# Bigbraindaily.com
# Turningpost.com
speaker
Raza Habib
CEO and Co-founder @ Humanloop

Raza is the CEO and Co-founder at Humanloop. He was inspired to work on AI as “the most transformative technology in our lifetimes” after studying under Prof David MacKay while doing Physics at Cambridge. Before Humanloop, Raza was the founding engineer of Monolith AI, applying AI to mechanical engineering, and built speech systems at Google AI. He has a Ph.D. in Machine Learning from UCL.

SUMMARY

LLMs unlock a huge range of new product possibilities but with everyone using the same base models, how can you build something differentiated? In this talk, we'll look at case studies of companies that have and haven't got it right and draw lessons for what you can do.

TRANSCRIPT


All right. Well, thanks everyone for having me. We're talking a little bit today about what we've been seeing at Humanloop with people building LLM applications. I'll give a short introduction to what Humanloop is and why we have an insight into this question. Then I'll talk about some of the more successful LLM applications that have been built and how they've done that: what makes them particularly good, and what makes them defensible as well.

I'll chat through some of the challenges to building with LLMs that we've seen, and then talk a little bit at the end about defensibility and how you can think about making your LLM applications something that you can defend over time. So that's the plan. If I could have the next slide, please? Thank you. So, at a high level:

What is Humanloop? We build developer tools to help you build useful applications with LLMs, focused on prototyping, then on understanding how well your apps are working in production, and on being able to use that evaluation data to improve them over time.

Because of that, we've seen a lot of people go on this journey from idea to deployed application, and we've started to see emerging best practices and what is and isn't working for people. That's what I'm going to use to inform the talk today. I'm going to pick up on a couple of specific instances of applications that have been very successful, and talk through how they've managed to achieve that success and the components of an LLM application.

Then I'll build into that the best practices that we're seeing, and a little bit of a discussion about how you can improve your own applications. If I may have the next slide, please?

The first app I want to talk about today is GitHub Copilot, because I think this is by far the most successful and most used large language model application that's got mass adoption, maybe other than ChatGPT. It's got over a million users, I think, now, and it's growing pretty fast. And it's been able to win over a historically quite challenging audience. I think developers are particularly picky about the tools they use, especially when it comes to having an AI pair programmer.

So what I want to dig into is the anatomy of GitHub Copilot as an application: how is it built, and what makes it so good? Because in some senses it's built on the same base models that we all have access to. Everyone now has access to the GPT-3 model suite, which is what it's based on. And yet somehow they've been able to make an application that, I think, is significantly better than a lot of the competitors out there, and it's not just distribution that's allowed them to do this. So I want to chat through that. That's the first thing I want to talk about, if I may have the next slide. Sorry, back one. OK, can you go forward? I think we're missing a slide. Here we go. Actually, sorry, it was the previous slide. Thank you. Sorry for not having control of this.

So the first thing I want to talk about, before I dive into the specific instance of GitHub Copilot as an app, is the components of a large language model application. I think of LLM apps as being composed of what I call LLM blocks.

You might have many of these repeated in sequence or put together by an agent. But at its core, each piece has the same three components. The first is some kind of base model. That could be GPT-3, or it could be an open-source model like Llama, but it's a pre-trained language model that is generic and can be used for many different tasks. The second is a prompt template, which is the structure of the message: an instruction to the model, with gaps where you're going to have either user-defined input or data that's fed in. And the third is a data selection strategy for how you're going to populate those gaps at test time. If we look at a specific instance of this, in the case of GitHub Copilot, if I could have the next slide: GitHub Copilot has as its base model, that first component, a 12-billion-parameter GPT model. It's a pre-trained model trained on code, but it's significantly smaller than, say, GPT-3, which was maybe 10x the size.
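
As a rough illustration of the three components just described, the sketch below wires a base model, a prompt template, and a data selection strategy into one object. The class name `LLMBlock`, its fields, and the stand-in `echo_model` are all hypothetical names chosen for this example; they are not taken from the talk or from any library.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class LLMBlock:
    """One 'LLM block': base model + prompt template + data selection strategy."""
    base_model: Callable[[str], str]              # hypothetical: any function mapping a prompt to a completion
    prompt_template: str                          # instruction with named gaps such as {context} and {question}
    select_data: Callable[[str], Dict[str, str]]  # decides what fills the gaps at request time

    def build_prompt(self, user_input: str) -> str:
        # Fill the template's gaps with whatever the selection strategy returns.
        return self.prompt_template.format(**self.select_data(user_input))

    def run(self, user_input: str) -> str:
        return self.base_model(self.build_prompt(user_input))


def echo_model(prompt: str) -> str:
    # Stand-in model so the sketch runs without any API access.
    return f"[completion for a {len(prompt)}-character prompt]"


block = LLMBlock(
    base_model=echo_model,
    prompt_template="Answer using the context.\nContext: {context}\nQuestion: {question}",
    select_data=lambda q: {"context": "Copilot uses a 12B code model.", "question": q},
)
print(block.build_prompt("How big is the model?"))
```

Keeping the three parts separate means each one can be swapped or tuned on its own, which is the point the talk goes on to make about Copilot.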

The reason for that is, in this instance, they wanted something custom fine-tuned for code, but also small enough to have low latency, so that when you're trying to get suggestions from it you're not having to wait for a long time. That's a critical component: they've carefully chosen an appropriate base model for this use case, where latency concerns guide what's possible. You would probably get better code generated from a larger model, but at the expense of much higher latency. Then they have an interesting and quite complicated strategy for figuring out what to feed into the prompt template. They look at where your cursor is and at neighboring files that you've recently touched. Based on the code that's just behind your cursor, they compare the edit distance or Jaccard similarity of different parts of code in other files, and then use that to extract those sections and put them into a prompt that they feed to this 12-billion-parameter model.
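
To make the selection step concrete, here is a minimal sketch of ranking snippets from other files by Jaccard similarity to the code near the cursor. It is not Copilot's implementation: the tokenizer, the function names, and the example snippets are assumptions made purely for illustration.

```python
import re
from typing import List, Set


def tokens(code: str) -> Set[str]:
    # Crude tokenizer (an assumption): identifiers and numbers only.
    return set(re.findall(r"[A-Za-z_]\w*|\d+", code))


def jaccard(a: Set[str], b: Set[str]) -> float:
    # |A ∩ B| / |A ∪ B|, defined here as 0.0 when both sets are empty.
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def top_snippets(cursor_context: str, candidates: List[str], k: int = 2) -> List[str]:
    """Rank candidate snippets from other files by similarity to the code near the cursor."""
    query = tokens(cursor_context)
    ranked = sorted(candidates, key=lambda s: jaccard(query, tokens(s)), reverse=True)
    return ranked[:k]


near_cursor = "def total_price(items): return sum(item.price for item in items)"
other_files = [
    "def total_weight(items): return sum(item.weight for item in items)",
    "class Logger: pass",
    "def parse_date(text): return datetime.strptime(text, '%Y-%m-%d')",
]
# The best-matching snippet would be placed into the prompt as extra context.
prompt_context = "\n".join(top_snippets(near_cursor, other_files, k=1))
```

Set-based similarity like this is cheap to compute on every keystroke, which fits the low-latency constraint described above.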

After that, they also have a system set up for very careful evaluation capture. The way they do this is they look at the acceptance rate of your suggested code, but they don't just look at whether you accepted a suggestion from GitHub Copilot. They also look at whether that code stays in the code base: whether it's still there after a few seconds, whether it's still there after a few minutes. And so they're able to understand whether the suggestion was good in production across millions of users.
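
A minimal sketch of this kind of retention check follows. It assumes you can snapshot the file at fixed times after a suggestion is accepted; the checkpoint values mirror the intervals mentioned later in the talk (15 seconds, 30 seconds, two minutes, 10 minutes), while the function name and the exact-substring test are simplifying assumptions, not a description of Copilot's telemetry.

```python
from typing import Dict, Iterable

CHECK_SECONDS = (15, 30, 120, 600)  # assumed checkpoints, in seconds after acceptance


def retention(suggestion: str, snapshots: Dict[int, str],
              checkpoints: Iterable[int] = CHECK_SECONDS) -> Dict[int, bool]:
    """For each checkpoint, report whether the accepted suggestion is still in the file.

    `snapshots` maps seconds-since-acceptance to the file contents at that time.
    Exact substring matching is a simplification; a real system would presumably
    tolerate small edits.
    """
    return {t: suggestion in snapshots[t] for t in checkpoints if t in snapshots}


snapshots = {
    15: "x = 1\nreturn a + b\n",
    30: "return a + b",
    120: "return a - b",   # the user rewrote the line
    600: "pass",
}
survival = retention("return a + b", snapshots)
```

A suggestion that is accepted but gone two minutes later is a much weaker positive signal than one that survives, which is why acceptance alone is not enough.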

These components taken together start to give them both an excellent application and also something that becomes defensible over time, because they can run this regular loop of putting something into the hands of millions of developers and watching them use it. And because they have excellent evaluation data, they're able to work out what worked well and retrain and fine-tune on a regular basis, so the model can get better and better at this task over time. So I think Copilot is a really excellent example of what you can do here. They've made a very careful decision about an appropriate base model that makes sense for their use case. They've iterated and done a lot of experimentation on the right strategy for what to include in the prompt: what data should I be pulling, and where should it be coming from? They're deeply integrating with a particular user's code base, so it's not just generic but is actually adapted to that user. And then they're using evaluation to improve models over time.

And so if we could step a slide forward. Thank you. If you think about those three pieces that make up an LLM app block, or make up GitHub Copilot, we need to find a way to make each of these excellent. We have to have an appropriate base model that's the right size and that's been fine-tuned for the task or that has good performance. We need a way to get the prompt engineering to work well. And on this slide, I've tried to put together some of the challenges that you face when you're doing that.

And then finally, you need a way to measure performance so that you can improve things over time. What we've seen when people come to do this in production is that there are a few challenges that come up again and again. The first is that prompt engineering is still a bit of an art. Small changes in prompt templates can have a surprisingly big impact on performance. That means it has to be very iterative: you have to have a fast feedback loop and you need to be able to experiment a lot. You can get a first version very quickly, but getting to something that's truly excellent takes time.

Another problem that we see pretty consistently, and I think others will have spoken about this as well, is the need to make LLMs more factual and to find ways to overcome the fact that they hallucinate and make things up.

Evaluation is another one that's particularly challenging for LLM apps. I think this is different for LLMs than for most traditional software, because we're beginning to use these things for applications that are much more subjective than we might have in the past. If you're generating marketing copy or you're sending an email, then there isn't a ground-truth answer that you can just look at and say, OK, that's the correct thing. You need some way to measure performance based on what your users think is the right answer.

Latency and cost are something we have to figure out how to manage for the particular app. And then the one that we'll talk about a little bit when we come to defensibility: if you've just got GPT plus a simple prompt, then it's a very thin moat, and you need some way to overcome that. So I'll just step through some of these one by one. If we go to the next slide, please. I just wanted to explain a little bit more about why prompt engineering is so important and the kinds of small changes that make a big difference.

Last year there was a paper that captured a lot of attention, and that has become very commonly known amongst the community, showing that chain-of-thought prompting had a huge impact on the performance of question-answering models and other reasoning models. If you ask a model not just to answer a question but to provide a reasoning trace, suddenly you get not just slightly better performance but significantly better: many accuracy points better than you would get from a basic question prompt.

This is now well known amongst the community. But the surprising thing is that there are still many more tricks like this out there to be discovered. People are constantly finding new ways, whether it's asking the model to format its output in certain ways or to take on a particular role. There are a lot of changes or tweaks you can make that have a surprisingly large impact on performance.
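
To show how small the change between a direct prompt and a chain-of-thought prompt can be, here is a minimal sketch. The exact wording of both prompts and the `Answer:` convention are illustrative assumptions, not the prompts used in the paper mentioned above.

```python
def direct_prompt(question: str) -> str:
    # Baseline: ask for the answer only.
    return f"Q: {question}\nA:"


def chain_of_thought_prompt(question: str) -> str:
    # Ask for a reasoning trace before the final answer.
    return (
        f"Q: {question}\n"
        "A: Let's think step by step, then give the final answer on a line "
        "starting with 'Answer:'."
    )


def extract_answer(completion: str) -> str:
    # Pull out the final answer line; fall back to the whole completion.
    for line in reversed(completion.splitlines()):
        if line.startswith("Answer:"):
            return line[len("Answer:"):].strip()
    return completion.strip()
```

Because the two templates differ by a single sentence, comparing them on the same set of questions is a cheap experiment, which is the kind of fast iteration the talk argues for.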

And so one of the things that we've seen, when it comes to building defensible apps, is that having a very fast way to iterate on your prompt templates, get feedback, and tweak and change them is actually critical to getting good performance.

The next thing I wanted to chat about when it comes to prompt engineering, if I could have the next slide, is something that people have started to have patterns around but that has historically been challenging: finding ways to get factual information into large language models. One thing that we think is going to be super critical here, and again we're seeing examples of this in practice, is giving LLMs access to tools. There have been a few other talks touching on this today, but a common emerging pattern is to take the documents that you want to give the model access to, split them into pieces, embed them with a large language model, and then make those embeddings available to your model when it's doing generations.
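
The split, embed, and retrieve pattern can be sketched in a few lines. To stay runnable without any external service, the example below swaps the embedding model for a toy bag-of-words vector; that substitution, along with every function name and the sample documents, is an assumption for illustration only.

```python
import math
import re
from collections import Counter
from typing import List


def split_into_chunks(text: str, max_words: int = 40) -> List[str]:
    # Split a document into pieces of at most `max_words` words.
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def embed(text: str) -> Counter:
    # Toy stand-in for a real embedding model: bag-of-words counts.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(question: str, chunks: List[str], k: int = 1) -> List[str]:
    # Return the k chunks most similar to the question.
    q = embed(question)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


docs = [
    "Refunds are issued within 14 days of purchase.",
    "Support is available by email on weekdays.",
]
chunks = [c for d in docs for c in split_into_chunks(d)]
question = "How many days do I have to get a refund?"
context = retrieve(question, chunks)[0]
prompt = f"Use the context to answer.\nContext: {context}\nQuestion: {question}"
```

In a real system the chunk size, the embedding model, and the number of retrieved chunks are all knobs worth experimenting with, in the same way as the prompt template itself.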

What we've been looking at is finding ways to make this accessible in an interactive environment, so that you can experiment with it in much the same way that you would with your prompt templates themselves. If I could have the next slide. Oh, actually, sorry, can you go back one second to the previous slide?

So this is a common pattern. Again, if we think about the pieces of, say, the GitHub Copilot app, which is one of the more successful ones, this is another area where they've clearly spent a lot of time thinking about the right strategy for doing retrieval. There are different methods for doing retrieval into your prompt template to make it factual, and they have a big impact on performance.

Question from the chat: What technique can be used to feed back the evaluation data into the model, as with Copilot? Potentially a form of RLHF or something else? OK, great.

That's a great question. If you can take me to the next slide, I'll expand on this a little bit. The third component of what I think makes a really good LLM app, once you've figured out good prompt engineering and maybe found an appropriate base model, is having a way both to measure feedback and understand how well it's doing, and then to use that feedback for continuous improvement.

There are basically three things that we've seen people use successfully to do this. The first, which is a very common workflow, is to use the feedback data they're collecting. All of the best-in-class apps now capture end-user feedback in some way.

I mentioned how GitHub Copilot looks at whether suggested code is still accepted at regular intervals: 15 seconds, 30 seconds, two minutes, 10 minutes. ChatGPT has this thumbs up, thumbs down, followed by various forms of natural-language feedback.

In general, the best practice that we're seeing is that people capture three different types of feedback: actions (what does the user do after they see my generation in the application?), issues, and votes. That's become a common framework for the types of feedback we see people collecting. Once you've collected this feedback, there are a few things we see people do to improve their applications. One is to look at the cases that are failing, form a hypothesis about why, and then try prompt engineering to improve that.
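
One way to picture the actions, issues, and votes framework is as a small record attached to each generation, which can then be filtered for failing cases. The field names, the example values, and the rule for what counts as a failure are all assumptions made for this sketch.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Feedback:
    generation_id: str
    action: Optional[str] = None  # what the user did next, e.g. "copied", "edited", "discarded"
    issue: Optional[str] = None   # a flagged problem, e.g. "hallucination"
    vote: Optional[int] = None    # +1 for thumbs up, -1 for thumbs down


def is_failure(fb: Feedback) -> bool:
    # Which signals count as a failure is an assumption; tune it per application.
    return fb.vote == -1 or fb.issue is not None or fb.action == "discarded"


def failing_cases(log: List[Feedback]) -> List[str]:
    # IDs of generations worth inspecting by hand.
    return [fb.generation_id for fb in log if is_failure(fb)]


log = [
    Feedback("g1", vote=1),
    Feedback("g2", vote=-1),
    Feedback("g3", action="discarded"),
    Feedback("g4", issue="hallucination", vote=1),
    Feedback("g5", action="copied"),
]
to_inspect = failing_cases(log)
```

The output is just a shortlist for manual inspection; forming a hypothesis about why those cases failed is still a human step.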

That might be realizing that your retrieval step is failing and it's not giving the model the right section of the code. It might be realizing that you need to encourage the model to be less repetitive, or that you need to tweak it a little bit in some way. So we see a very common loop of people putting something in production, filtering the data to see the failure cases, inspecting them manually, and then doing some round of prompt engineering to improve that.

The second step, beyond that, is when people actually come to fine-tune their models. We see two forms of fine-tuning being used, if I could have the next slide.

The first is actually pretty straightforward: supervised fine-tuning. The idea here is that you're just filtering the data set down to things that have worked well for other customers, or that you have some reason to believe have worked well, fine-tuning, and then repeating that process. And so there's this cycle that we see really commonly, which is to generate data from a model in production.

So you're running GPT-3, or in the case of Copilot this 12-billion-parameter model, in production for some time, capturing all this feedback data, filtering down to some subset that's worked well, then fine-tuning the model on that and repeating the process. As you do that, you can get better and better performance on your specific subset of tasks. That's something we're seeing in production, but it's also been demonstrated in the academic literature. There was a paper called STaR that looked at doing this for reasoning. They took a set of reasoning tasks, used a model to generate chain-of-thought reasoning, filtered it, and retrained the model, and they were able to show that models get better at reasoning. There are quite a few instances like that.

The third way of doing this, as someone asked about, is RLHF. We see fewer people doing RLHF in the wild because it's more complicated to do. I can think of a few startups that have done it.

But actually, the gap between doing no fine-tuning and even just supervised fine-tuning is really large. You can get significant reductions in cost and latency if you're able to fine-tune a smaller model, and you also get an application that's more customized for your specific use case. So when it comes to defensibility, we've seen that fine-tuning for performance can be a very significant advantage.

One thing I wanted to chat about, if I could have the next slide, is a common question we get, which is when to use prompt engineering versus fine-tuning. I think there's some skepticism about the benefits of fine-tuning: it requires a lot of extra work, and you can get quite far just by adjusting prompts. So when should you do prompt engineering, and when should you fine-tune?

What I would say is that the advantages of prompt engineering are that it's very fast to adjust your prompts and there's less work needed. You're not having to host models, figure out how to fine-tune yourself, or even munge data into the right format for the fine-tuning APIs.

After experimentation, you can get good performance. And if what you're trying to do is get fast-changing factual knowledge into the models, then prompt engineering and retrieval is the right way to go. I don't think fine-tuning is something you should be doing to try to get factual information into your models. What fine-tuning does allow you to do is get smaller models to have similar performance on your specific use case, and to get performance on your task that might be better than any other model out there.

So if you have access to some kind of special data set, whether that's private data or feedback data from running a model in production for some time, then fine-tuning is a way of building something significantly more differentiated than what anyone else can have. It can also allow you to bring down latency significantly. In the case of Copilot, we saw that they're using a 12-billion-parameter model, and that's primarily a latency concern. And because you're training smaller models, you can get lower cost.

It also opens the door to doing local deployments, or private models in the tone of voice of a particular company. So there are significant advantages to fine-tuning even though it might be harder. The journey that we've seen most customers go on is that almost everyone starts with prompt engineering. They get applications to a certain level of performance, and then they start to fine-tune later, because prompt engineering is so much faster for getting to a first version of a model.

So in terms of recommended best practice, I would say push prompt engineering as far as you possibly can, and then think about how to fine-tune to optimize performance, but don't optimize prematurely. May I have the next slide?
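
When the time to fine-tune does come, the filter step of the supervised fine-tuning cycle described earlier could start from something like the sketch below. The log fields (`prompt`, `completion`, `score`), the 0.8 threshold, and the prompt/completion JSONL layout are assumptions for illustration; check the format your fine-tuning provider actually expects.

```python
import json
from typing import Dict, Iterable, List


def select_training_examples(logs: Iterable[Dict], min_score: float = 0.8) -> List[Dict]:
    """Keep only logged generations whose feedback score clears a threshold."""
    return [
        {"prompt": row["prompt"], "completion": row["completion"]}
        for row in logs
        if row.get("score", 0.0) >= min_score  # rows without a score are dropped
    ]


def to_jsonl(examples: Iterable[Dict]) -> str:
    # One JSON object per line, a common layout for fine-tuning data sets.
    return "\n".join(json.dumps(e) for e in examples)


logs = [
    {"prompt": "p1", "completion": "c1", "score": 0.9},
    {"prompt": "p2", "completion": "c2", "score": 0.2},
    {"prompt": "p3", "completion": "c3"},
]
examples = select_training_examples(logs)
training_file_contents = to_jsonl(examples)
```

Each round of the cycle would regenerate the logs from the newly fine-tuned model and run the same filter again.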
And so in terms of recommended best practice, I would say push prompt engineering as far as you possibly can, and then think about how to fine-tune to optimize performance, but don't optimize prematurely. May I have the next slide? Cool. OK. So we've spoken a little bit about GitHub Copilot. We've spoken about the anatomy of an LLM app: the three parts of the base model, a template for a prompt structure, and a strategy for getting data into it, plus a strategy for evaluation, and the fact that those things get chained together. But the title of this talk was how do you build defensible apps with LLMs? And so I want to chat a little bit about how you can actually get defensibility and differentiation. And obviously fine-tuning, as we have hinted at, is part of that.
Before I dive into this, I do want to say that I think this has become a hot-button topic of conversation amongst people in the builder community who are building with LLMs, and I guess amongst investors as well: how do I build applications that are differentiated and that are going to be defensible over time? And I think that we should be careful not to over-index on this. And so that's why I put this quote here before I chatted about it, from YC, a great investor in a lot of startups, where they say you should ignore your competitors because you're more likely to die of suicide than homicide. And I think we shouldn't lose sight of the fact that if we're building LLM applications, the number one thing we should be thinking about is how do we solve a real user need, and do that quickly and successfully. But that said, defensibility is still something that people will worry about on a longer time horizon, and there are things we can do to make our apps more defensible.
And so, if I may have the next slide. So the first thing I'll say when it comes to defensibility is that I don't think that large language model applications, or companies built on large language models, are fundamentally different to other software businesses. Right? All the things that you would be thinking about if you were trying to make a company defensible as a software business still hold. You're still thinking about scale, you're still thinking about switching costs, network effects, brand, et cetera, the things that you would always think about. But maybe there are specific things that matter, or that you can do with LLMs, that you might not be able to do in other circumstances.
And I think it's instructive to consider the example of two different companies that have been quite successful, amongst the first applications of LLMs in production, which are Jasper and Writer. These are both companies that have developed writing assistants for marketers. They both had significant success early, and were amongst the first LLM companies to really get to scale. But they've taken really different approaches to becoming defensible companies. Jasper focused on scaling really, really quickly. They had quite a high marketing spend and they captured a large fraction of the market. Their main approach was to build on closed-source models from OpenAI, primarily building initially on GPT-3, but to scale very, very fast and try to get defensibility that way.
Whereas Writer took a very different approach, and actually focused on building with custom fine-tuned models. That allowed them to counter-position against OpenAI and also their competitors like Jasper, because they were able to promise their customers that they wouldn't store any of their data, that they would be able to give them a customized tone of voice, and that everything could run on their machines, which allowed them to access an audience that just wasn't accessible to Jasper or to some people who are building on OpenAI. And so I think there's an illustrative lesson to take from all three of these applications, Copilot, Jasper and Writer, about things that we can be doing to make LLM applications more defensible.
So in the case of Copilot, beyond just building an excellent product, I think they have this incredible data flywheel built in, where they're able to capture feedback data, use that feedback data to improve a model, and get it better at that specific task, so that over time it becomes harder for others to catch up to them. In the case of Jasper, they really went for some form of blitzscaling: by getting to a size and brand awareness very quickly, ahead of others, they were able to, I think, establish themselves with a very large share of the market. Versus Writer, who counter-positioned against everybody else and was able to offer something to customers that others couldn't without changing their products significantly. And if I could have the next slide, please. I tried to jot down different things that we've seen in terms of strategies for making LLM apps defensible, and how they map onto maybe some of the more traditional views about what makes apps defensible in general.
And so one of these, and I think a lot of software applications have this, and I think this will be particularly true of large language model applications, is having high switching costs from integrating deeply with private knowledge sources. So if you are able to get private customer information, a particular company's knowledge base if you're doing customer service, or code if you're indexing someone's code base, or something like that, then that can be a big advantage, because the base large language models don't know anything that wasn't available on the public web. So if you're going to be able to answer questions or deliver a service that requires private information, you will have to build a lot of integrations. And with those integrations come high switching costs, and that's a form of defensibility. A second form of defensibility that we've discussed is the ability to build a flywheel through feedback capture. So this is the GitHub Copilot example. But we see this a fair amount with people fine-tuning through Humanloop, where you gather a lot of data in production, you get feedback on how well it's working, you filter, fine-tune, repeat, and are therefore able to start getting an application, or model, that is better than what others can train.
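The "filter" step of that gather, filter, fine-tune, repeat loop can be sketched roughly as follows in Python. This is only an illustration under stated assumptions: the log schema (`prompt`, `completion`, `feedback`), the rating scale, and the helper names are hypothetical and are not Humanloop's API, and the prompt/completion JSONL shape is just one plausible layout, since each fine-tuning service defines its own format.

```python
import json

def select_training_examples(logs, min_rating=1):
    """Keep only logged generations whose user feedback meets the threshold.

    Logs with no feedback are treated as rating 0 and dropped by default.
    """
    return [
        {"prompt": log["prompt"], "completion": log["completion"]}
        for log in logs
        if log.get("feedback", 0) >= min_rating
    ]

def to_jsonl(examples):
    """Serialize examples as one JSON object per line."""
    return "\n".join(json.dumps(e) for e in examples)

# Hypothetical production logs: +1 = accepted by the user, -1 = rejected.
logs = [
    {"prompt": "Write a tagline for a bakery", "completion": "Fresh every morning.", "feedback": 1},
    {"prompt": "Write a tagline for a gym", "completion": "Gym.", "feedback": -1},
    {"prompt": "Write a tagline for a cafe", "completion": "Your daily pause."},  # no feedback yet
]

examples = select_training_examples(logs)
jsonl = to_jsonl(examples)
print(jsonl)
```

Each pass through the loop would add newly logged generations, re-run the filter, and fine-tune on the larger filtered set.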
And because it's getting better and better over time, you have this sort of data defensibility through this network effect. And the final one I've spoken about is counter-positioning. So can you find ways to do things that maybe people who are building on other models can't? Writer was a good example: through privacy and having their own models fine-tuned on customer data, they were able to do things that weren't accessible to their competitors.
And then the final one is obviously, in some sense, less theoretically sound. It doesn't map onto any of these traditional ones, but it is just to focus on building a really great product that solves a real problem and has a distinct UX. I think GitHub Copilot is another example of this, where they thought about building something that had fault-tolerant UX, because large language models, we know, are not completely reliable. And so they thought about how can we provide something really useful to people whilst knowing that it can't get everything right all the time, and completion and suggestion, with your code in context, works really, really well for that. So I think that's my last slide and I will end there. I don't know if there are any questions. Can you just hit the next slide, just to make sure? Yeah, that's it for me. So, thank you very much, Lee. I don't know if I'm supposed to be able to hear you, but I can't hear you, man. It is a lesson that I refuse to learn. All right, let me look real quick if there are some questions. So I think that there are questions in the chat, but I do need to bring up our next speaker. So, Raza, if you wouldn't mind checking out the chat after this and maybe answering some of those questions.
That would be wonderful. Yeah, I will jump into the chat and do that there. Thanks.
