Ensuring Accuracy and Quality in LLM-driven Products
Adam is the CTO and co-founder at Autoblocks. Autoblocks provides enterprise-grade features to build, deploy, and monitor LLMs at scale. Previously he was an engineering leader at Front and project44, where he designed and built large-scale event-driven systems.
Adam will highlight potential negative user outcomes that can arise when adding LLM-driven capabilities to an existing product. He will also discuss strategies and best practices that can be used to ensure a high-quality user experience for customers.
Thanks, Lily. Hey, everybody. I'm super excited to talk to you all today about ensuring accuracy and quality in LLM-driven products. I'm going to walk through a couple of examples of how you might use LLMs in your products, a simpler example as well as a more advanced one, and hopefully leave you with some questions and thoughts that you can bring into your own products.

A quick introduction on myself: I'm the CEO and co-founder of Autoblocks. We're building debugging and monitoring tools for companies that are building LLM-driven products. Before Autoblocks, I was a software engineering consultant at West Monroe and also spent time on the engineering teams at project44 and Front, both platforms dealing with large amounts of unstructured data.

We'll kick things off by defining what an LLM-driven product is. Of course, I went to ChatGPT with GPT-4, and it said an LLM-driven product refers to a product or service that is powered or significantly enhanced by a large language model like GPT-4, from OpenAI of course. Large language models are advanced artificial intelligence systems that excel at understanding and generating human-like text based on the context and input provided. That makes sense, and I'm sure you're all really familiar with it. It also gave a handful of examples: virtual assistants, content generation, translation, sentiment analysis, text summarization, coding, customer support, and educational tools. I'm sure you're all familiar with a lot of those text-based use cases, and there are more and more emerging, including ones that run behind the scenes. We'll actually look at an example of one of those, and you've probably seen some across all of the talks today.
A lot of folks are really thinking about where we go from here, past just the regular text box.

First, we'll look at a product support chatbot: simple question and answer. You probably have this hosted somewhere; people can come in and ask questions, it'll look over your knowledge base and help your users out, and it also saves time for your support folks.

So someone comes in and asks, "How do I set up a workflow?" The bot says, "I'm sorry, workflows are not a feature that we support," and then asks, "How did I do?" Very simple. You've probably seen this a thousand times: thumbs up, thumbs down, give me feedback. This is really the basic way of incorporating feedback. Whenever users give a thumbs down, you can start to look for similarities across the inputs where they're having negative interactions, dig in, and understand how you might be able to improve. For thumbs up, you can start to use RLHF to reinforce and go from there. But a lot of the time, especially if you're like myself, users may not even click the thumbs up or thumbs down. So it can be pretty basic and not always super reliable. We'll come back to this example towards the end and think a little bit about how we might improve the quality here and how we incorporate feedback.
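To make that concrete, here is a minimal sketch of what capturing that thumbs up, thumbs down signal alongside the conversation context might look like, so you can group the negative interactions later and look for similar inputs. It's written in Python, and the event fields and storage are assumptions for illustration, not the schema of any particular product.

from dataclasses import dataclass, asdict
from datetime import datetime, timezone
import json

@dataclass
class ChatFeedbackEvent:
    conversation_id: str
    user_query: str       # what the user asked, e.g. "How do I set up a workflow?"
    model_response: str   # what the bot answered
    rating: str           # "thumbs_up" or "thumbs_down"
    timestamp: str

def record_feedback(conversation_id, user_query, model_response, rating):
    event = ChatFeedbackEvent(
        conversation_id=conversation_id,
        user_query=user_query,
        model_response=model_response,
        rating=rating,
        timestamp=datetime.now(timezone.utc).isoformat(),
    )
    # In a real product this would go to your analytics or logging pipeline;
    # here we just print the JSON payload.
    print(json.dumps(asdict(event)))
    return event

record_feedback(
    "conv-123",
    "How do I set up a workflow?",
    "I'm sorry, workflows are not a feature that we support.",
    "thumbs_down",
)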
Next, we'll jump into an example where you might have a large language model working behind the scenes, where it's not going to be so obvious to the end user. In this example, we're ticketing software, think a Jira or a Linear that you're all familiar with, and we've added a bunch of AI capabilities. Our AI can help us autocomplete the title and description as we type them out.

It can also auto-assign. Here you can see this ticket is assigned to Wade: the model can use what you've typed in the title and description, look at all the historical context, and pick the right assignee for you, so you don't even have to think about it.

It can also add a label for you. In this example you can see it's labeled as engineering. It can take all of that historical context and auto-label for you, which really just makes you more efficient, because it's always a pain trying to figure out how to label all your tickets and keep everything organized. And last, which is really cool, it can also assign a due date. It knows how long it usually takes Wade to work on these kinds of features, and it can suggest a due date of next week.

So we've added all these features in, and we want to really understand: how well are they performing? How can we improve them? How can we get better? Well, if we took the chat approach, we would just paste a bunch of thumbs up, thumbs down everywhere, which obviously is ridiculous and would lead to a pretty poor user experience. We wouldn't want to do this, and frankly it would be really difficult to understand how the thumbs up, thumbs down on all the different parts actually performed. So obviously, we need to think a little bit about how we can understand and improve in these examples.
Another way would be a single thumbs up, thumbs down where the user can type in feedback directly, but then you have to manually go through and read all of that feedback yourself, and it's not going to be super insightful.

So what we really need to start thinking about gets back to more product-analytics-type questions. We need to ask ourselves: what do we really want to know? Did they update the title or description after we autocompleted it? If so, what did they update it to? Is it close to what we provided, or is it very different? Those are all really important questions that we need to understand.

For the assignee: did they update the assignee, and who did they update it to? Did the person they updated it to actually end up working on it, or did it get switched back a week later? If we started retraining our large language model and all of its context based on who they updated it to in the moment, we might actually update it incorrectly if it always goes back to Wade later on; there may be some other user training we need to do instead.

Same thing with the label. Did they update the label? If so, is the new label similar? Maybe there are too many similar labels, and we can actually assist the user in using our product better by consolidating their labels. If we start asking these kinds of questions, then even when users do make updates, we can provide a better user experience and also improve the quality of our models.
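As a rough sketch of how you might answer those questions, you could log what the model suggested for each field and compare it with what the ticket looked like once the user finished editing. The field names, the similarity measure, and the example values below are all hypothetical; the point is simply recording "did they keep our suggestion, and if not, how far off were we?"

from difflib import SequenceMatcher

def text_similarity(a, b):
    # Crude similarity between the suggested text and the final text.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def evaluate_suggestions(suggested, final):
    # Compare what the AI suggested with what the ticket looked like once
    # the user finished editing it.
    results = {}
    for field in ("title", "assignee", "label", "due_date"):
        s, f = suggested.get(field), final.get(field)
        if field == "title":
            results[field] = {"accepted": s == f,
                              "similarity": round(text_similarity(s or "", f or ""), 2)}
        else:
            results[field] = {"accepted": s == f,
                              "changed_to": None if s == f else f}
    return results

suggested = {"title": "Add CSV export to reports", "assignee": "wade",
             "label": "engineering", "due_date": "2023-06-19"}
final = {"title": "Add CSV export to the reports page", "assignee": "wade",
         "label": "eng", "due_date": "2023-06-19"}
print(evaluate_suggestions(suggested, final))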
The same with the due date: how much did they update it by, and how accurate was our suggestion? Maybe they updated it, but our original due date actually turned out to be correct, so we don't want to take the change at face value; we really just want to know, when the ticket moved to done, was our original due date right? These are all questions we should be asking and thinking about as we build these LLM-driven products.

If we circle back to the chat example I showed early on, what we actually want to know is: did they solve their issue? Do we actually not support workflows, or was that a hallucination? What did the user do next? Think with me for a second: if I came in and asked, "How do I set up a workflow?" and the chatbot spit out a list of steps, and we're able to measure in our product whether the user completed those steps like we intended and whether that was a successful experience, that's going to give us way more information than just the thumbs up, thumbs down.

Which brings me to this: to truly ensure accuracy and quality in your LLM-driven products, you need to understand human behavior beyond the thumbs up, thumbs down.

So how do you do this? Well, like I said before, you need to sit and think deeply about the failure modes that can occur in your product. What's most likely to cause a negative user experience? What are the positive outcomes? Once you have all of this, you can add tooling to measure those human outcomes and how users are actually using the product, then use that to optimize your AI models. You end up in a feedback loop of presenting the AI features to the user, collecting analytics, and continuing to optimize, and you want to make this feedback loop as tight as possible.
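As one possible sketch of that measurement, assuming the chatbot's suggested steps can be mapped to product events you already track (the event names below are made up), you could check how many of the suggested steps actually show up in the user's subsequent activity.

def steps_completed(suggested_steps, user_events):
    # Which of the steps the chatbot suggested actually show up in the user's
    # subsequent product events?
    completed = [step for step in suggested_steps if step in user_events]
    return {
        "completed": completed,
        "completion_rate": len(completed) / len(suggested_steps) if suggested_steps else 0.0,
        "solved_issue": len(completed) == len(suggested_steps),
    }

# Hypothetical mapping of chatbot steps to product events we already track.
suggested = ["open_settings", "create_workflow", "add_trigger", "save_workflow"]
# What the user actually did in the app after the chat session.
observed = ["open_settings", "create_workflow", "viewed_docs"]
print(steps_completed(suggested, observed))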
So at Autoblocks, we're here to help. We're thinking deeply about these problems and about how we can help you understand the entire user journey through your app and incorporate that back into your AI features and optimization. We'd love to chat with you if you're building LLM-driven products or even just thinking about it. Feel free to sign up for our private beta on our website, or shoot me an email directly; I'd love to chat.

I really appreciate all your time today and look forward to chatting.

Awesome. Thank you so much, Adam.