Beyond Benchmarks: Measuring Success for Your AI Initiatives
Salma is a co-founder of Remyx AI, leading the development of agent-guided MLOps. Previously, she worked at Databricks, where she helped customers architect their ML infrastructure. She also runs a research blog, smellslike.ml, where she shares and open-sources experiments in applied ML.
Join us as we move beyond benchmarks and explore a more nuanced take on model evaluation and its role in the process of specializing models. We'll discuss how to ensure that your AI model development aligns with your business objectives and results, while also avoiding common pitfalls that arise when training and deploying. We'll share tips on how to design tests and define quality metrics, and provide insights into the various tools available for evaluating your model at different stages in the development process.
Slide deck: https://docs.google.com/presentation/d/1vcbGzbCP4Obr4X4W4Fxz9gvwfqklIFs8o88b973oMI0/edit?usp=drive_link
Salma Mayorquin [00:00:09]: All right, well, thank you everyone. It's a pleasure to be among such great company. I'm here presenting Beyond the Benchmarks. I represent Remyx AI, I'm one of the co-founders, and we are building an AI agent-guided ML development platform. So if you want to, in your spare time, check this out. But I think the topic for today should be kind of a culmination of all the stuff that's been happening in this awesome conference, and hopefully this helps wrap some of those ideas together. So if you pay no more attention to the rest of this talk, these are the key takeaways I'd love to leave with you.
Salma Mayorquin [00:00:46]: Some of those are: quality and evaluation are essentially two sides of the same coin. They're very much related, and you want to keep them in mind throughout the entire ML development process. You also want tailored evaluations at all stages of the development process. So not just post-training, once you've fine-tuned a model or anything like that; you want different kinds of systems that can evaluate the process all the way through data curation, model selection, training, and deployment. And then another sad but very real truth is that offline metrics often don't line up with online metrics when you're deploying an AI application. So what you want to do is, over time, shorten that gap between the two so that you have a good way to measure early on whether you're going to be successful in your AI development. All right, so some of the lessons that we've learned in the past in ML development and other industries are, I think, very much applicable now to developing GenAI applications, and that's following the scientific method. And what does that look like? You want to start by identifying what problem you want to go solve, or what solution you have in mind that could help lift the KPIs for your business.
Salma Mayorquin [00:02:11]: One of the big themes in this talk will be how to align what you're building, your goals in development, with the goals of the business overall. You want to make sure that you bring value and that the applications you're developing are very much aligned with that, both to make a sustainable AI product over the long run and obviously to help out the business. Right. Everybody is looking for that. So once you identify a potential solution, you want to form a hypothesis about what will happen if you develop and deploy that application. It might be: hey, if we make this chatbot more empathetic, we think fewer people are going to bounce off of our application and we're going to lift traffic. So, a hypothesis about what you think will happen that will then impact your downstream KPIs. In the offline stage, you want to make sure that you gather different kinds of attributes, skills, or qualities about your application that you think could impact those downstream metrics, right? You want to make sure that those are aligned over time. And the idea here is, because online metrics are a bit expensive, you kind of want to justify why you're developing what you're developing and why you think it's going to impact the downstream tasks.
Salma Mayorquin [00:03:25]: Once that's justified, then you can move forward to an online evaluation, expose it to your users, and then hopefully you get a validation, or even a rejection, of your hypothesis. Either way, you get to learn how to improve that application. All right, so how do you actually design these metrics? This is also very, very vague in the GenAI world, right? We can't rely on simple metrics like accuracy or ROC curves like you can with image classifiers. So what's a good way to ground your development process in a positive change? Offline metrics can include different types, like benchmarks. You can ground your model selection on benchmarks like the LMSYS leaderboard or the Chatbot Arena. You could also use LLMs themselves as judges to identify skills, or gaps in skills, in your application. And these can be used in model selection, model training, and data curation. Emphasis on all of those processes.
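[Editor's sketch] To make the LLM-as-judge idea concrete, here is a minimal sketch of scoring one attribute (empathy, from the earlier chatbot hypothesis) over an offline eval set. The judge model name, rubric, and 1-5 scale are illustrative assumptions, not something prescribed in the talk.

```python
# Minimal LLM-as-judge sketch (illustrative; judge model, rubric, and
# 1-5 scale are assumptions, not from the talk).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are grading a chatbot reply for empathy.
Question: {question}
Reply: {reply}
Score the reply from 1 (cold) to 5 (very empathetic).
Answer with a single integer."""

def judge_empathy(question: str, reply: str) -> int:
    """Ask a judge model to score one response against the empathy rubric."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # any capable judge model works here
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(question=question, reply=reply)}],
        temperature=0,
    )
    return int(resp.choices[0].message.content.strip())

# Average the judge scores over an offline eval set to get one offline metric.
eval_set = [("My order is late", "I'm sorry, that's frustrating. Let me check on it.")]
scores = [judge_empathy(q, r) for q, r in eval_set]
print(sum(scores) / len(scores))
```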
Salma Mayorquin [00:04:33]: You could use these solutions throughout instead of just at the very end. And online metrics, these are kind of the big-time metrics, the kinds of behaviors that you only see once you deploy an application. For example, whether your app increases the viewing time on a streaming platform, or whether it increases the ability to close a sale if you're an e-commerce kind of company. This may look like human evaluation, so folks who are actually going into the app and manually reviewing: is this good or is this bad? Or A/B testing, where you're separating your users into cohorts, exposing them to different versions of your model or your application, and seeing how they respond. So again, this is more expensive, so you want to save it for the end. You essentially want to use those offline metrics to justify running an online test.
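[Editor's sketch] One minimal way to set up the A/B cohorts described here: hash each user into a deterministic bucket, then compare the online KPI between variants once results come in. The hashing scheme, experiment name, and conversion numbers are assumptions for illustration only.

```python
# Deterministic cohort assignment plus a simple two-proportion comparison.
import hashlib
from statsmodels.stats.proportion import proportions_ztest

def assign_cohort(user_id: str, experiment: str = "empathetic-bot-v2") -> str:
    """Split users 50/50 so each user always sees the same variant."""
    bucket = int(hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest(), 16) % 100
    return "treatment" if bucket < 50 else "control"

# Later, compare the online KPI (e.g. sale closed yes/no) between cohorts.
conversions = [412, 380]    # treatment, control successes (made-up numbers)
exposures = [5000, 5000]    # users per cohort
stat, p_value = proportions_ztest(conversions, exposures)
print(f"z={stat:.2f}, p={p_value:.3f}")
```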
Salma Mayorquin [00:05:26]: And how do you actually go about designing offline metrics? I think you want to start, again, by closely aligning with the KPIs. What are you actually driving towards? If it's increased viewing time on a streaming platform, maybe you need to improve the recommendations of the videos or the content that you're showcasing, and then you have hypotheses about what makes good content, maybe content that is exciting or interesting or exotic, all kinds of other skills or attributes that you can start brainstorming and then put to the test. And then you also want to define requirements for your application. So not just how it will behave once deployed, but maybe even constraints on the engineering side about how your application should be designed. Say that you have an LLM RAG-based system, and maybe that needs to run on-prem or locally because you're GPU poor and can't afford to deploy very large models. So maybe some of your criteria would be finding models that are performant in the 3B to 8B parameter range.
Salma Mayorquin [00:06:34]: Maybe you want to make sure you can use peft so parameter efficient fine tuning techniques like Lora, and possibly want to see if you can also quantize and then also make sure you can use that model permissively. So I think theme of this time and time again, even though in classical machine learning and also beyond in generative AI, we still haven't moved away from the main components, just data and data quality over Trump's quantity. So you want to make sure that the quality is as high as you can get it, and you may not get it right the first time. When you deploy an application that'll take time over iterations and deployments, you'll get more clues. Oh, did I lose my screen? I messed up. Sorry about that. Are we back? We're back. All right, so you want to make sure that essentially you get great metrics offline and then online, so you can learn what makes a good data sample that you want to then use to improve your application.
Salma Mayorquin [00:07:36]: And some hacks that you could use too, which are materializing in a bunch of different works, are data synthesis is a big area to explore. I highly recommend that. And then also indexing. So there's a way to be able to take a really large data lake of resources and then filter that down to the best resources that you can think of, or the best samples, and they may be samples that you synthesize because you understand exactly what that data should look like. Or maybe it's stuff that you've learned from the wild after you deployed your application. All right, I think I'm also losing track of time. Almost done. Um, and generally speaking, when you're developing this, you also want to make sure that you build trust with your organization.
Salma Mayorquin [00:08:17]: As an engineer and a developer, you probably know all the ins and outs of the systems you're developing, but it's kind of hard to translate that to folks who are more the decision makers at the business level. So that's another great reason why trying to align your offline metrics to be as predictive of your online metrics as possible makes sense here. Essentially, you can show how your offline metrics are evidence points that help you say that the online metrics we actually care about, like that streaming or viewing time, are actually going to go up. That means more dollars, or maybe productivity is going to be higher, so less time spent processing something, so lower TCO. All right, so again, hammering on this: measuring at every stage is super useful and helps you catch problems early on. You can design your system with the most evidence that you might have and then, over time, iterate and feed that back to the very beginning of the process. So feed that back to your data curation, to your model selection once you understand how the system should behave, to fine-tuning, and then also deployment.
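[Editor's sketch] One lightweight way to build that trust is to check, across past model versions or A/B variants, how well the offline metric tracks the online KPI. A minimal sketch with made-up numbers for illustration:

```python
# Does the offline metric (e.g. mean judge empathy score) predict the online KPI
# (e.g. average minutes viewed)? One point per past model version or variant.
from scipy.stats import spearmanr

offline_scores = [3.1, 3.4, 3.9, 4.2, 4.5]       # mean judge score (made up)
online_kpi = [21.0, 22.5, 24.1, 23.8, 26.3]      # avg. minutes viewed (made up)

rho, p = spearmanr(offline_scores, online_kpi)
print(f"Spearman rho={rho:.2f} (p={p:.2f})")
# A high rho is the evidence point: improving the offline metric should move the KPI.
```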
Salma Mayorquin [00:09:24]: And also big fan of version control, hence why I'm super big fan of Docker. Big advocate for being able to use tools like this to essentially capture everything about the environment that you use to develop your applications and then also to deploy, deploy them, the data assets. There's been cases where I've had customers that I've worked with that something in their data lake changed and so now they can't tell why their application doesn't work anymore. So that's a loss of time and also loss of money, as well as just the typical things that you might log about in ML experiment, like the parameters, the architectures you're using, the versions of those checkpoints, all that good stuff. All right? And that's me. I'd love to connect with all of you all. Center one is for the remix site and on the left hand side is me and then right hand side is my co founder. So love to chat more with you all.