Integrating LLMs Into Products
Emmanuel Ameisen is a Research Engineer at Anthropic. Previously, he worked as an ML Engineer at Stripe, where he built new models, improved existing ones, and led ML Ops efforts on the Radar team. Prior to that, he led an AI education program where he oversaw over a hundred ML projects. He's also the author of Building ML-Powered Applications.
Learn about best practices when integrating Large Language Models (LLMs) into product development. We will discuss the strengths of modern LLMs like Claude and how they can be leveraged to enable and enhance various applications. The presentation will cover simple prompting strategies and design patterns that facilitate the effective incorporation of LLMs into products.
Emmanuel Ameisen [00:00:10]: Hi, everyone. Yeah, thanks for coming here. If you're wondering why I'm special and I have this little laptop, it's because I finished my slides at the last minute. So you get to see something real fresh. Yeah, I guess as a little check in before I get started, who would say that part of their job is to integrate LLMs into products like who's building? Okay, it's about half. That's pretty good. All right. So for that half, I am looking forward to your hot takes, and you can tell me whether you agree or not with me.
Emmanuel Ameisen [00:00:41]: And for the other half, hopefully this is informative and new. So who am I? Data scientist. I've been working in ML for a while. I was doing data science. I did ML education for a bit. I've been thinking for a while about how to build practical ML applications, applications that actually use ML in something that enhances the user experience, as opposed to just as an add-on. I was working at Stripe, where I built ML models, and right now I work at Anthropic, where I worked both on fine-tuning models, like our Claude 3 models, and currently I'm helping out with our interpretability efforts.
Emmanuel Ameisen [00:01:19]: So, one quick observation. Using LLMs means becoming an ML team. I think this is something that maybe is underappreciated by most, because LLMs are just an API you can call and you get text back. But they are fundamentally machine learning models. And so the challenges that come with using them are actually very similar to the challenges that come with building your own machine learning models and then shipping them. There's some good and some bad. The good part is that there's a lot of resources about how to build ML products well. This chart on the right is the chart I was going to use for this presentation.
Emmanuel Ameisen [00:02:00]: I was actually in the process of remaking it when I had a flashback and I was like, wait a minute, I made this chart in 2018, like before LLMs existed, because basically it describes the same iteration loop that I would recommend. Like, once you use LLMs, you are in an experimentation loop where you're trying a prompt, seeing how it works, trying some other retrieval method, seeing how it works, and you need to have sort of like ML chops. The other good thing is there are a lot of books about ML best practices. I've listed a few. Selfishly, I've added mine, but there's a lot of stuff online as well. There's a lot of blog posts. And so this talk is absolutely not about that. We're going to talk only about LLMs, specifically how to use LLMs that you don't train yourself.
Emmanuel Ameisen [00:02:47]: So no fine tuning and specifically focusing on how you integrate them into products. I think it's worth just taking a moment to appreciate that we live in unprecedented times. This is my rough graph based on vibes only of like out of all of the people that do machine learning, how many trained models from scratch, how many fine tune existing models, and how many prompt existing models over time? And you can see that it used to be the case that everybody trained models. There was no such thing as fine tuning a model. Then fine tuning became something you could do, but it still looked a lot, and looks a lot like training models. You're taking an existing model and you're training it from there. It's still training. And now prompting is something you can do.
Emmanuel Ameisen [00:03:34]: You can take some model that somebody else trained, not think about machine learning at all, and just get outputs out of it and try to build something on top of it. And that's really never happened before. And so you have the power of machine learning without having to do a lot of the work of machine learning. And so that opens a lot of cool product opportunities. All right, so here's roughly what I'm going to talk about. This will be pretty quick. My hope is that we can finish pretty early, and then we can just have a discussion. We'll see, famous last words, but basically we'll go through, I think, six iterative, deeper and deeper integrations of LLMs.
Emmanuel Ameisen [00:04:11]: Let's start with chatbots, the primordial form of an LLM. They're conversational AI interfaces, usually stateless, meaning state between chats doesn't persist, and usually multi-turn: chatgpt.com, claude.ai, those kinds of things. They're really nice. Here is an example of me asking Claude to write a pitch for my new startup, an ostrich egg subscription service. It writes something completely reasonable, certainly better than what I would write. I'm not good at marketing. And I would say that the way I would describe them, these chat UIs, is that they're super versatile, super easy to use. This is literally the same interface you use to text your friends or email them. Super customizable.
Emmanuel Ameisen [00:04:55]: You can do whatever you want with it. You can ask it math questions, you know, therapy. You can ask it who's going to win the next NBA championship. It can be helpful in a range of domains. The drawback, though, is as a product, it's not very differentiated. You already have websites where you can do this, so building a new one is not really a clear way to win users. Why would they go to yours instead of the existing ones? I think another drawback of these products is that they lead to the perception that sometimes people have that LLMs give you bland output, because they're so general that if you go to claude.ai and you ask it, oh, give me the five most important truths about life, it'll give you some cartoonish nonsense that's kind of like, okay, this is, I guess, a truism. This is fine, because it'll give you what probably the average person wants, but you're not the average person, and certainly your users aren't going to be the average sample or a random subsample of the population.
Emmanuel Ameisen [00:05:56]: And so you should think about like, how do I make this less bland? And so here comes the first thing, personalization. Enhancing chatbots with contextual information. In this example, I asked the exact same question, but I paste the Harvard library style guide, which has a bunch of guides about just general writing, but also apparently, entrepreneurship. And here's what it gives us. We have like a way more formatted output on the right where it's like, ah, our mission, how it works, a little like, ooh, this is unbeatable quality. And of course, a call to action at the bottom to get cracking. All right, great. So if you add context, you can get increased helpfulness for a specific task.
Emmanuel Ameisen [00:06:37]: You can get improved accuracy in a given domain, and you can tailor your user experience. So far, that's hopefully something that people are familiar with. And I would say you can take this as far as you want. So here on the left, I ask it for the same pitch for the same startup, but I give it a comedy guide. And on the right, I tell it to give this pitch, but I give it information about my background and I say, ah, sell me as the CEO of this startup. And it's like, ah, as a machine learning engineer, I've spent years optimizing complex systems to deliver maximum efficiency, so I can sell you eggs, et cetera. So a lot of papers that you read about LLMs, and just a lot of companies and a lot of blog posts, basically help you do this.
Emmanuel Ameisen [00:07:19]: I would put most of the sort of valuable work that you can do in this category, which is improving your prompting, whether it's a system prompt or an actual prompt, adding examples, making it clear. You know, anything that goes with making the LLM perform better on the task by just making that task more crisp is part of that category of improvements. But then there's things like RAG, like using proprietary data that has information about your product or whatever you're doing, or information about your users. So if I can talk to a model, but then it knows that actually I like crisp answers, I have a background in STEM, I don't like the Lakers, I don't know, it can kind of personalize itself to me. So I would say that's like v1 of personalizing a model at all and of integrating it at all. Cool.
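To make that concrete, here is a minimal sketch of what this kind of context injection might look like, assuming the Anthropic Python SDK. The style guide, the retrieval helper, the user facts, and the model name are all illustrative assumptions rather than anything from the talk; the point is just that the same request gets wrapped in product and user context before it reaches the model.

```python
import anthropic

client = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY is set in the environment


def retrieve_user_context(user_id: str) -> str:
    """Hypothetical RAG step: look up facts about this user from your own data."""
    # In a real product this would hit your database or a vector index.
    return "Background: machine learning engineer. Prefers crisp, concrete answers."


def personalized_pitch(user_id: str, style_guide: str, request: str) -> str:
    # Fold proprietary context into the system prompt so a generic request
    # produces output tailored to this product and this user.
    system_prompt = (
        "You are a writing assistant for our product.\n\n"
        f"Follow this style guide:\n{style_guide}\n\n"
        f"What we know about the user:\n{retrieve_user_context(user_id)}"
    )
    response = client.messages.create(
        model="claude-3-5-sonnet-20240620",  # illustrative model name
        max_tokens=1024,
        system=system_prompt,
        messages=[{"role": "user", "content": request}],
    )
    return response.content[0].text


print(personalized_pitch(
    "user-42",
    "Keep it short. Avoid superlatives.",
    "Write a pitch for my ostrich egg subscription startup.",
))
```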
Emmanuel Ameisen [00:08:15]: So then you might say, okay, well that's great, but that's still just a chatbot. I'm still just talking to it. I mean, again, how often do I just want to open a chat window and chat for a while? You know, that's not really part of my job. I actually have stuff to do, help me do the stuff. And so I think this gets interesting when you think of LLMs as application components. And so instead of a chat window, can you use an LLM to power some functionality of what you're building? I'm going to pause here, just to make the point that fundamentally LLMs are text-in, text-out black boxes, basically. And so yeah, you can chat with them and get an answer, but you can also imagine that they can read any text that you can give them and produce any text back. And so the design space for LLMs is much broader than just a chat application.
Emmanuel Ameisen [00:09:07]: As a small little toy example, I actually had our most recent model, Sonnet 3.5, build a little web app for me, a really crude one, where instead of asking it, hey, can you give me a pitch for this startup, I pasted the pitch it gave me and then I made a little functionality where I can highlight parts of the output and ask it for specific edits. So it's a very simple improvement, but all of a sudden you're beyond question, response, question, response, and you're like, ah, now this is an AI-powered edit feature. And so I can highlight a bunch of parts of the text, like the title or various qualifications, and ask it to reword them and to change them. And here I highlight an example where it does it well and an example where it does it poorly, to show that you can do this n times and I can keep using this model as sort of an editor. I think the benefit of things like this is that you have much more UX versatility. So you can do way more things, but also you have way more quality control. In the first model, you were just supposed to have a model that can answer any question and give you an answer that's reasonable.
Emmanuel Ameisen [00:10:14]: Whereas here you can say, ah, my model is now an editor. It reads documents and it provides edits. I can make an evaluation for that. I can clearly define what success looks like, and that means that you can take whatever your current model is and improve your prompt and see if your product is actually getting better, because you can now measure, calling back to the talk two talks ago, the impact of the model on the product experience you've integrated it into. The drawback is that this costs more tokens, usually, because you are repeating this prompt for each of the edits, for example. There's a lot of integrations in this category. I would say you can use an LLM to write comments, edit comments, edit content.
Emmanuel Ameisen [00:10:56]: You can have it classify text, classify emails, classify tasks. You can have it do structured data extraction, where it takes in some set of unstructured data and gives you a JSON object that represents what's in it. You can use it for autocompletes, copilot-like applications, and suggestions. This is where I think the space of products starts to open up quite a bit. But it does still seem like a lot of work, because this is great and all, but it still means that in a lot of these examples I have to ask the model, oh, here's the specific piece of text, can you edit it for me? Why do I have to do that? Ideally, the model could do that for me. And so you can build product experiences where the model indeed does that for you. And I think of this as pull versus push. So pull is like, I go and ask the model, hey, can you specifically reword the title? I don't like the title.
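A pull-style edit feature like this can be a thin wrapper around a single call: the user highlights a span, the span plus an instruction goes to the model along with the document for context, and the rewrite comes back. This is a rough sketch assuming the Anthropic Python SDK, not the actual demo app; the prompt wording and function name are made up.

```python
import anthropic

client = anthropic.Anthropic()


def rewrite_selection(document: str, selection: str, instruction: str) -> str:
    """Pull model: the user explicitly asks for an edit to a highlighted span."""
    prompt = (
        "You are an editor. Here is the full document for context:\n\n"
        f"{document}\n\n"
        f"Rewrite only this highlighted passage:\n\"{selection}\"\n\n"
        f"Instruction: {instruction}\n"
        "Return only the rewritten passage, nothing else."
    )
    response = client.messages.create(
        model="claude-3-5-sonnet-20240620",  # illustrative
        max_tokens=512,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text.strip()


# The UI would splice the returned text back into the document, e.g.:
# new_title = rewrite_selection(pitch, "Ostrich Eggs, Delivered", "Make the title punchier")
```

Note that the whole document rides along on every call, which is exactly the token-cost drawback mentioned above.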
Emmanuel Ameisen [00:11:51]: Push, by contrast, is like, the LLM might proactively nudge me: hey, your title sucks, let's rewrite it. And so there's sort of an activation energy I need to overcome to use the LLM at all, and as much as we can, why don't we remove it? And so here I made another little mock-up where instead I ask the model to just continuously read the text as I write it and flag any part that's stereotypical or full of cliches. And so as I'm editing this document, the model will just highlight some parts and be like, oh, this part, that's a cliche; this part, you should rewrite it for this; this part, you should rewrite it in that way. And now it takes no effort for me.
Emmanuel Ameisen [00:12:31]: I just write like I always do, and I get the benefits from these suggestions, which I can accept or refuse based on whether I actually agree with them or not. So the advantage is that these methods don't require a user to take the first step, and they allow for strict quality filtering, because you can also only show the suggestion to the user if your model is really confident. You can ask the LLM to make some suggestions, and then decide whether these suggestions are really valuable and only show the most valuable ones. So as a user, I get fewer suggestions and they're higher quality. It lowers the barrier to entry in that way. The drawback, of course, is that now you require some orchestration. It used to be that, in the pull model, I just have to handle, oh, what happens if the user clicks on this button? In the push model, I have to have something that runs in the background that decides when I'm going to make a suggestion of a given kind or when I'm going to perform an action, and if I actually need to do it, perform the action, and then potentially have a feedback loop for the model.
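The push version might look like a small background loop: scan the current draft on a timer or on save, ask the model for suggestions with a self-reported confidence score, and only surface the ones above a threshold. Everything here (the JSON shape, the prompt, the threshold, the model name) is an assumption sketched for illustration, not the system described in the talk.

```python
import json
import anthropic

client = anthropic.Anthropic()


def scan_for_cliches(draft: str, min_confidence: float = 0.8) -> list[dict]:
    """Push model: runs in the background as the user writes; no explicit request needed."""
    prompt = (
        "Read this draft and flag passages that are stereotypical or full of cliches.\n"
        "Respond with only a JSON array of objects with keys "
        '"passage", "suggestion", and "confidence" (a number from 0 to 1).\n\n'
        f"Draft:\n{draft}"
    )
    response = client.messages.create(
        model="claude-3-haiku-20240307",  # illustrative; a fast, cheap model suits background calls
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    try:
        suggestions = json.loads(response.content[0].text)
    except json.JSONDecodeError:
        return []  # if the model doesn't return clean JSON, surface nothing rather than noise
    # Strict quality filtering: only show the user suggestions the model is confident about.
    return [s for s in suggestions if s.get("confidence", 0) >= min_confidence]
```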
Emmanuel Ameisen [00:13:25]: One thing I'll call out here is that a lot of these complex prompt chains and orchestration systems are actually also a pro, because replicating them is complex, because building them is complex, and so integrating them deeply into your experience could be a real differentiator. And so I like to think of this as a double-edged sword, where it's like, yes, it's a huge investment to do this well, but that also means it's much harder for others to do it as well as you do. Another downside of models, even though this is getting better and LLMs are getting faster, is that they can be slow. And especially if you have a big prompt with many examples, you can get time to first token in the range of half a second to a second for the bigger models. And so for a lot of UI/UX interactions, that's a really high cost. I want to press a button and get an answer now. So the second-to-last thing I want to talk about here is synchronous versus asynchronous, and this is slightly related to push versus pull, but it's actually a little bit different. Synchronous is, you process my request in real time and you give me a response immediately. Asynchronous is, as much as you can, in the background.
Emmanuel Ameisen [00:14:35]: You're going to think about what I'm doing, and you're going to give me more formatted results. So examples here for our editor would be asynchronous document review: instead of giving me edits as I write, maybe I write something, I go to bed, and the next day you've looked up a bunch of sources, given me some things that I could change, and made a nice little concise suggestion. You can imagine generating a report on a topic; that's something that, if you wait for the LLM to do it, is probably going to be at least a dozen calls, and so it'll be a full minute, which is not worth waiting on. Coding agents writing pull requests are also in that category. I've seen some example products that try coding agents that write pull requests live, but that's actually a slow process with current models, and so you're much better served by saying, cool, here's a bug, come back in an hour, we'll run our agents in a loop, try a bunch of things, and then come back to you with a pull request. And finally, large batch jobs, if you're of course running many, many examples. I think the benefits of this are severely underestimated, in that once you go asynchronous, you can use the most powerful LLM that you want. You're not bound by serving latency anymore.
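As a flavor of what "write something, go to bed, wake up to a review" might look like, here is a minimal sketch of an overnight background job that chains a few calls: generate candidate improvements, grade each one with a second pass, and keep only the strong ones. The steps, prompts, and threshold are hypothetical, and the job is assumed to run from a queue worker or cron rather than in the request path.

```python
import anthropic

client = anthropic.Anthropic()
MODEL = "claude-3-5-sonnet-20240620"  # illustrative; async jobs can afford the most capable model


def ask(prompt: str) -> str:
    response = client.messages.create(
        model=MODEL,
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text


def overnight_review(draft: str) -> list[str]:
    """Multi-step review chain run in the background, so nobody waits on the latency."""
    # Step 1: one pass generates candidate improvements.
    suggestions = ask(f"List concrete ways to improve this draft, one per line:\n\n{draft}")
    kept = []
    for suggestion in suggestions.splitlines():
        if not suggestion.strip():
            continue
        # Step 2: a second pass grades each suggestion.
        grade = ask(
            f"Draft:\n{draft}\n\nSuggestion: {suggestion}\n"
            "Rate from 1 to 10 how much this suggestion would improve the draft. "
            "Reply with the number only."
        )
        # Step 3: filter, keeping only the strongest suggestions for the user.
        if grade.strip().isdigit() and int(grade.strip()) >= 7:
            kept.append(suggestion.strip())
    return kept
```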
Emmanuel Ameisen [00:15:50]: You can also use complex orchestrations, such as prompt chains, meaning you can imagine a ten-step process where some model generates suggestions, another model grades them, a third model investigates each and writes it up, and you have a filtering step. It enables what some of you have called the unhobbling of models: the ability of models to basically exceed their intelligence threshold and do much better than just a base-model, one-prompt approach. The drawback of this is that it requires more tokens. And yeah, we're doing pretty well. Last but not least, this is maybe my favorite extension of models, one that I wish we'd see much more of. I worked on tool use for the Claude 3 models, so I'm a little bit biased, but it's giving a model tools. Generally, what we mean by tools here is letting the model, for example, call a Python function or call a calculator or call an API. You can imagine asking a model, hey, how long would it take me to get from here to the Anthropic office? And it'll call the Google Maps API and actually get me an answer, rather than just trying to reason from first principles.
Emmanuel Ameisen [00:17:10]: This actually has huge benefits, because you can basically use the LLM only for what it's good at, which is understanding the user intent and making a plan, and then leverage existing code and existing tools to do everything else. It enables agentic behavior, meaning longer-horizon, multi-step behavior, where the model will go, first let me look up this email in the database, and then based on this email, let me apply a coupon, that sort of stuff. The drawback is that with great power comes great responsibility. And once you start giving API access to your models, then there's a chance that they call these APIs and that they make mistakes. So you want to be very careful as you integrate these tools and try to think, hmm, if my model made the worst possible mistake, how bad could this get? For example, if you're going to give it database access, I would give a model read access. I think write access is a little scary. If you give it write access, you'd have to think about, okay, how do I mitigate the model deleting every single table? An example of the impact of tools, to go back to our ostrich egg subscription service, is you take the exact same model that gave us the ostrich egg pitch and you're like, cool, but now on top of the pitch, also make me a website, and you give it a tool to render websites, and you get this. Look how cute that is.
Emmanuel Ameisen [00:18:30]: It just made a little egg website. This is just the same model, but now it can render HTML, and so it makes you a little website. I think that there are many, many examples where basically the response the model gives you in text is just three API calls away from being the full thing that you need. Like, oftentimes I'll use Claude to draft my emails, and it's like, well, why not just give it an email tool? It's like, okay, here's a draft of the email. Cool, I like it. Send the email, you're done. Write the website, write the code and run it. So I think tool use in general is sort of underinvested in, and I would recommend exploring it as well.
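To make the tool idea concrete, here is a rough sketch of the loop using the Anthropic Messages API's tool-use support: define a tool with a JSON schema, let the model decide to call it, run the call yourself, and hand the result back for a final answer. The get_travel_time function is a hypothetical stand-in for something like a maps lookup; only the general request and response shape follows the API.

```python
import anthropic

client = anthropic.Anthropic()
MODEL = "claude-3-5-sonnet-20240620"  # illustrative

# A hypothetical tool; in a real product this might wrap the Google Maps API.
tools = [{
    "name": "get_travel_time",
    "description": "Estimate driving time in minutes between two addresses.",
    "input_schema": {
        "type": "object",
        "properties": {
            "origin": {"type": "string"},
            "destination": {"type": "string"},
        },
        "required": ["origin", "destination"],
    },
}]


def get_travel_time(origin: str, destination: str) -> int:
    return 25  # stand-in for a real maps lookup


messages = [{"role": "user",
             "content": "How long would it take to get from the conference venue to the Anthropic office?"}]
response = client.messages.create(model=MODEL, max_tokens=1024, tools=tools, messages=messages)

# If the model chose to call the tool, run it and send the result back for a final answer.
if response.stop_reason == "tool_use":
    call = next(block for block in response.content if block.type == "tool_use")
    result = get_travel_time(**call.input)
    messages.append({"role": "assistant", "content": response.content})
    messages.append({
        "role": "user",
        "content": [{"type": "tool_result", "tool_use_id": call.id, "content": str(result)}],
    })
    response = client.messages.create(model=MODEL, max_tokens=1024, tools=tools, messages=messages)

print(response.content[0].text)
```

In the read-versus-write spirit above, this tool only looks things up; anything that mutates state deserves a much more careful review of worst-case behavior.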
Emmanuel Ameisen [00:19:11]: One thing that I want to call out here is that, you know, tokens are sort of the commodity of LLMs, right? You pay per input and output token for most use cases. You reason in terms of, how much is this user request going to cost me in tokens, which roughly correspond to how many words your LLM reads and writes. And a lot of the improvements I've suggested tend to require more tokens, right? The chat window is sort of the most minimalistic approach: tokens in, tokens out, ask your question, get an answer. But if you want the model to read and edit a whole document, and as it reads every comment it has to remember the guidelines you gave it and remember the examples, then you need a lot more tokens. And so the reason why these approaches are worth pursuing now is that, wherever you think we are on this chart, like if you think right now you can spend 100 tokens per user for your product, token prices are dropping exponentially. This plot is the price of a model at fixed intelligence, basically a GPT-3-level model, from 2021 to 2024, for a million of what I call blended tokens, which is, I think, four tokens of input to one token of output.
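For back-of-the-envelope planning, the blended price is just a weighted average of the input and output rates under that 4:1 mix. A tiny sketch; the prices below are placeholders, not any model's actual rates.

```python
def blended_price_per_million(input_price: float, output_price: float,
                              input_ratio: float = 4 / 5) -> float:
    """Price of one million 'blended' tokens, assuming 4 input tokens per 1 output token."""
    return input_ratio * input_price + (1 - input_ratio) * output_price


# Placeholder per-million-token prices, not real rates:
print(blended_price_per_million(input_price=0.25, output_price=1.25))  # -> 0.45
```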
Emmanuel Ameisen [00:20:33]: But you can pick whatever ratio you want. The price has gone from over $50 to under $0.50; this is Claude Haiku here. And so product applications where you could only afford 100 tokens in 2021, you can now afford 10,000 for. And so I think that's a meaningful thing to keep in mind as you build these applications: the trend is towards tokens becoming more cost efficient, context size increasing, and latency decreasing, and so you should build for the future rather than for the present situation. Cool. So in summary, as you figure out your use cases, specialization benefits from deeper integrations. And as you integrate deeper and deeper, there's this trade-off between the flexibility that you lose as you integrate more deeply and the ability to do quality control that you gain. And finally, use more tokens.
Emmanuel Ameisen [00:21:30]: Models are getting cheaper, faster, and better. I think that generally it's worth investing in a really good product experience at the cost of more tokens, and I'm happy to talk about more methods to do so. Before doing a lot of this, I was actually using LLMs heavily and spent a bunch of time prompting them, so happy to talk about prompting as well. Cool. Thank you very much, Emmanuel. And now we have time for a few questions.
Emmanuel Ameisen [00:21:59]: Let's raise your hand and we'll bring you the mic.
Q1 [00:22:02]: Thank you for the excellent talk. My question is about context windows, because we're seeing models now going from a 100k context window to a million-token context window. With the increase in context window, obviously you fit more information, but there are, you know, trade-offs with that. Curious whether the larger context windows are seeing kind of a proportional trade-off, like it's forgetting only 10% of what you say when you pack it full of information, or are you seeing other types of trade-offs, like it's starting to confuse or hallucinate even facts that are synthesized from that context?
Emmanuel Ameisen [00:22:40]: Yeah, so the question is about trade-offs in context windows. I think actually more recent models, if you look at their performance over large context windows, are just extremely good at not getting confused. It used to be the case that you kind of didn't want to use too much context because of this lost-in-the-middle effect, where the model would lose information in the middle. Now, that's much less true. But there are still downsides to stuffing the context window. I would say there are a few that come to mind.
Emmanuel Ameisen [00:23:07]: One is, oftentimes the approach of stuffing the context window comes with a kind of throw-everything-and-the-kitchen-sink-at-it mindset. And by its nature, if you retrieve every version of your documentation, including conflicting versions, then the model will get confused, because version three says this but version four says that, and it doesn't know how to answer your question. The other thing is just latency and cost. Cost, as I mentioned, is coming down pretty quickly, and latency is coming down as well. But at any given point in time, you're paying a latency cost as you add things to the context window. So the more you can pare down what you put in the context window, usually through things like RAG or just better selection of what you ingest, the better it is.