Building a Product Optimization Loop for Your LLM Features
Jeremy is an engineer with expertise building production ML/AI solutions. He's currently focused on building tooling to help product and dev teams ship reliable, value-generating, LLM-based products.
A collaborative, self-reinforcing product optimization loop for your LLM features that empowers you to ship with confidence. If that sounds less believable than full AGI at this point, we hear you. Freeplay aims to change that by building a platform to enable this exact kind of optimization cycle: a platform that facilitates collaboration between Product, Eng, and Ops so a whole team can monitor, experiment, test, and deploy their LLM products together in a repeatable and reliable way, all while equipping the team with what they need to build expertise around LLM development and learn how to move their metrics in the right direction.
Jeremy Silva [00:00:10]: Hello, everyone. My name is Jeremy. I'm an AI engineer at Freeplay. Freeplay is a product that helps product teams ship better LLM features: we give them the power to experiment, test, monitor, and deploy their LLM features, hence the name of this talk, building a product optimization loop for your LLM features. I want to start with a little background on myself and how that informs my view of this space. I think we're all collectively trying to figure out what AI engineering really means, and my career path is kind of a manifestation of that.
Jeremy Silva [00:00:42]: I started out as a data scientist building custom NLP models in the medical field, then moved into machine learning engineering, building out model scaling pipelines, and now I'm an AI engineer helping productize AI features. So I've seen the development lifecycle in all of those areas, and I've felt how the LLM development lifecycle is, on one hand, an amalgamation of all those things, and on the other, an entirely new development cycle that product teams are still trying to find their footing in. That's what I want to talk about today: helping people find that footing. So let's start with why a product optimization loop? Data science and software engineering have always been iterative endeavors, but LLM development cranks that up a few notches. Due to the stochasticity of the underlying models and the breadth and dynamic nature of the resulting customer-facing UXs, your product quality becomes purely a function of your ability to iterate and experiment quickly, and of having a platform that supports that. This is what we call the product optimization loop, and it has three broad stages: monitoring, experimentation, and testing and deployment.
Jeremy Silva [00:01:52]: So let's look at what that looks like in practice. Everything starts with phase one, which is capturing logs; this is just table stakes for understanding how your system is actually performing. That feeds into phase two, which is human review and labeling, and this is critical. Some product teams think they're beyond this, but actually getting eyes on your data and understanding its nuances is one of the most important parts of the process. The best product teams we know spend hours and hours every week actually reviewing data. The two artifacts we want to come out of this stage are eval optimization (building your eval suite) and dataset curation. Those two things will ultimately feed your ability to launch experiments that you can truly validate.
Jeremy Silva [00:02:37]: From there, we get our experiment results and ultimately move into the testing and deployment phase. So let's jump into what this looks like in each of these phases. Like I said, capturing logs is table stakes. When people think of logs, they think of stack traces and cryptic error messages, things that are the domain of engineering, but a lot of that changes with LLMs: the I/O of your system is actually plain text. So it becomes really important to have logging that is accessible to people who don't understand, and don't need to understand, the full internals of the system. For a product optimization loop to work, there are three things that need to be true of your logs. One, they should be clear.
Jeremy Silva [00:03:20]: Like I said, they should be understandable by people who don't know the full inner workings of the system. Two, they should be immediately actionable in some sort of integrated platform. You can see how we realize this in Freeplay: the I/O is very clear, and it's also actionable. You can do your curation and review right there, pull it into datasets, or even pull it into an interactive prompt editor. And three, they should be informative: there's more than just the input and output of the system, right? There are things like cost, latency, and tracking all the way through to model version. Once we have logs like that feeding the human review process, we're actually seeing what's happening with our system in production. And this is where we get our ops people and our product people on board, and the collaboration really kicks in.
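To make "informative" concrete, here's a rough sketch of the kind of structured log record I mean. The field names are illustrative, not any particular product's schema:

```python
# A sketch of a structured LLM call log: plain-text I/O plus the context
# (cost, latency, model version) that makes it useful beyond engineering.
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class LLMCallLog:
    session_id: str        # ties the call back to a user session or trace
    prompt_template: str   # which prompt template produced this call
    prompt_version: str    # so prompt changes can be tracked over time
    model: str             # exact model version used for the call
    inputs: dict           # the rendered template variables, in plain text
    output: str            # the model completion, in plain text
    latency_ms: float      # end-to-end latency
    prompt_tokens: int
    completion_tokens: int
    cost_usd: float        # computed from token counts and model pricing
    labels: dict = field(default_factory=dict)  # human review labels added later
    created_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )
```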
Jeremy Silva [00:04:09]: Some teams think of this as something the engineers start out doing, and it is really important for engineers to be doing human review and labeling; they need a feel for the nuances of the data. But often they're not actually the SMEs with the full domain expertise to evaluate quality. You need to pull in product, ops, QA, all these different folks, and get them on a single platform collaborating on this to really assess the quality of the system. Once you get that cross-functional collaboration, that's when this optimization loop can really kick in. And, like I said, part of this is also the process of eval optimization. It is important at the outset to set broad metrics for how you're going to evaluate your system, but ultimately a lot of that evaluation suite is going to be created during the review process, because you're going to see edge cases and points of failure, and those are going to determine what evals you need to create and how you need to update certain evals.
Jeremy Silva [00:05:08]: So your ability to iterate on your evals becomes really important. One of the ways we approach that in Freeplay is with an eval playground that lets you align your SMEs' reviews with your auto-evaluators, the things you can actually run at scale, so that once you have an eval suite you can trust, it can feed the later stages. It's also during this human review process that data curation comes into play. As you're reviewing data, you don't want it to be one-off; you want to be building datasets that you can use in your experimentation, testing, and validation phases later. And again, having a platform that enables that kind of review and dataset building is important, because it's not just about having a single dataset.
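On the eval alignment piece, here's a rough sketch, with made-up names rather than any particular product's SDK, of the kind of agreement check you'd want before trusting an auto-evaluator at scale:

```python
# Measure how often an auto-evaluator agrees with SME labels on reviewed examples.
from typing import Callable

def evaluator_agreement(
    reviewed_examples: list[dict],
    auto_eval: Callable[[dict], str],
    label_key: str = "sme_label",
) -> float:
    """Fraction of human-reviewed examples where the auto-evaluator matches the SME label."""
    matches = sum(1 for ex in reviewed_examples if auto_eval(ex) == ex[label_key])
    return matches / len(reviewed_examples)

# Toy example: a naive groundedness check compared against human review labels.
examples = [
    {"output": "Paris is the capital of France.",
     "context": "Paris is the capital of France.",
     "sme_label": "grounded"},
    {"output": "The warranty lasts 10 years.",
     "context": "The warranty lasts 2 years.",
     "sme_label": "hallucinated"},
]
naive_eval = lambda ex: "grounded" if ex["output"] in ex["context"] else "hallucinated"
print(evaluator_agreement(examples, naive_eval))  # 1.0 on this tiny set
```

If agreement is low, you iterate on the evaluator (or the labeling rubric) before you let it stand in for your SMEs.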
Jeremy Silva [00:05:52]: On the dataset side, what we find a lot of the best teams do is keep one broad set, what we call a golden set, with thousands of labeled examples that they test against, but that becomes impractical for quick experiments. So they also curate smaller sets that target specific key risk areas and failure modes, so that when you're addressing, say, hallucinations, you can test against just your hallucination set and understand how that's performing before moving into broader validation of the experiment (a rough sketch of that curation pattern follows below). Now that we have our eval suite in place and we have our datasets, we can really experiment with ease. There are two levels of experimentation, and we manifest that in the product a couple of different ways. What you're seeing in the background there is the concept of a prompt playground: we let you take your prompt and model config, pull them into a playground, and run them against real production data coming from your datasets. This is where that quick iteration happens.
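Here's that curation pattern as a sketch: one broad golden set plus small targeted sets keyed by failure mode. The tag and field names are just assumptions for illustration:

```python
# Every reviewed log feeds the broad golden set; failure-mode tags feed small
# targeted sets you can run quick, focused experiments against.
from collections import defaultdict

def curate_datasets(reviewed_logs: list[dict]) -> dict[str, list[dict]]:
    datasets: dict[str, list[dict]] = defaultdict(list)
    for log in reviewed_logs:
        datasets["golden_set"].append(log)
        for tag in log.get("failure_tags", []):   # e.g. "hallucination", "tone"
            datasets[f"targeted/{tag}"].append(log)
    return dict(datasets)
```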
Jeremy Silva [00:06:51]: That quick iteration is where you work out kinks, do the vibe check, all that good stuff. But ultimately you want to validate that experiment against some broader swath of data. So once you have a change you think might be good, you might run it against, like I said, one of your smaller datasets to target the specific thing you're trying to address. We happen to manifest that in things called test runs, which are batch tests you can kick off either via the app, if they're model or prompt changes, or via an SDK, if you're testing and exercising your whole RAG pipeline. You can see what that looks like there, where we're comparing Sonnet against GPT-4, and we're using our eval suite against one of our datasets to quantify that experiment. Now, once we have that experiment in hand, we want to move to the test and deploy phase. This is when we take our golden set, or whatever that broader set of data is; we've probably been addressing some key risk area in the experimentation phase.
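To give a sense of the mechanics, here's a rough sketch of what a batch test run boils down to. The helper names are assumptions, not any specific SDK:

```python
# Score candidate configs against the same dataset with the same eval suite.
from statistics import mean
from typing import Callable

def run_test(
    dataset: list[dict],
    generate: Callable[[dict], str],  # wraps a prompt + model config, or a whole RAG pipeline
    eval_suite: dict[str, Callable[[dict, str], float]],  # eval name -> score in [0, 1]
) -> dict[str, float]:
    scores: dict[str, list[float]] = {name: [] for name in eval_suite}
    for example in dataset:
        output = generate(example)
        for name, eval_fn in eval_suite.items():
            scores[name].append(eval_fn(example, output))
    return {name: mean(vals) for name, vals in scores.items()}

# Hypothetical usage: compare two candidates on a targeted set before broader validation.
# results_a = run_test(hallucination_set, generate_with_sonnet, eval_suite)
# results_b = run_test(hallucination_set, generate_with_gpt4, eval_suite)
```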
Jeremy Silva [00:07:48]: But as we all know with LLMs, everything has unintended consequences. So now we want to validate that change and do regression testing: hopefully I improved on this key risk area, but did I actually improve my system broadly? Are there any regressions, anything I'm missing by making this change? This is where you do the final testing to say, okay, did I truly improve my product? And then ultimately you want a system that lets you deploy those changes easily, ideally in a way that empowers product, ops, and these other folks and isn't always entirely locked into engineering. The key thing here is the cycle, right? This cycle never ends. Now that we've tested and deployed, it's right back to the top, constantly looking for new areas of improvement. The LLM development cycle really never ends; you're just always looking for new ways to make the product better. So having a repeatable platform and a repeatable framework to iterate quickly, run these kinds of tests, and work through your cycle becomes really critical for ensuring product quality. So, to review: that's the product optimization loop and what it looks like for LLMs. But really, the key thing I want to leave you all with is that building great AI products takes a lot more than just capturing logs and creating some evals.
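Circling back to the regression-testing step for a moment, here's a rough sketch of a simple gate you might put in front of a deploy. The tolerance value and names are assumptions, not a prescribed approach:

```python
# Block a deploy if any eval score on the golden set drops by more than a tolerance.
def passes_regression_gate(
    baseline: dict[str, float],   # eval scores for the currently deployed config
    candidate: dict[str, float],  # eval scores for the proposed change
    tolerance: float = 0.02,
) -> bool:
    for eval_name, baseline_score in baseline.items():
        candidate_score = candidate.get(eval_name, 0.0)
        if candidate_score < baseline_score - tolerance:
            print(f"Regression on {eval_name}: {baseline_score:.2f} -> {candidate_score:.2f}")
            return False
    return True

# Hypothetical usage, reusing run_test from the earlier sketch:
# if passes_regression_gate(run_test(golden_set, current_config, eval_suite),
#                           run_test(golden_set, proposed_config, eval_suite)):
#     promote(proposed_config)  # hypothetical deploy step
```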
Jeremy Silva [00:09:13]: Building great AI products is about creating a repeatable framework to test and validate across your LLM development cycle. So thanks for tuning in. I'm Jeremy. Check out freeplay.ai for more info.