Evaluation of ML Systems in the Real World
Mohamed is passionate about advancing AI capabilities, leading a career marked by pivotal roles in some of the world's top tech companies. He is the CTO and Co-founder of Monta AI, a product studio for global AI solutions.
Formerly, as the Head of Alexa Speaker Recognition at Amazon, he led teams that delivered groundbreaking developments in voice AI and redefined customer interaction with the technology. Earlier at Cisco, he was Director of AI, following the acquisition of Voicea (a company he co-founded) to unlock insights and actions otherwise lost in enterprise conversations.
At LinkedIn and Microsoft, Mohamed led AI and big data teams, serving hundreds of millions of users through his work on LinkedIn's newsfeed personalization, Outlook.com, and Visual Studio. His academic pursuits at Stanford University as a student and a part-time Teaching Assistant for Machine Learning (CS229) and Deep Learning (CS230) reflect his commitment to nurturing the next generation of AI experts.
Mohamed co-authored Computing with Data, a guidebook to wrangling big data effectively, and is currently working on another book titled Machine Learning in the Real World, set to hit bookshelves in 2024. Through his career and writings, Mohamed is driven by a mission to empower people around the globe to collaborate more effectively using software that provides great experiences.
Evaluation seeks to assess the quality, reliability, latency, cost, and generalizability of ML systems, given assumptions about operating conditions in the real world. That is easier said than done! This talk presents some of the common pitfalls that ML practitioners ought to avoid and makes the case for tying model evaluation to business objectives.
Mohamed El-Geish [00:00:10]: So I wrote a haiku for you using ChatGPT, and I wanted to make sure that it actually works, so I validated the meter. But today it's all about the issues with validation, so try to spot whether it's actually a haiku or not. Here it goes: GenAI or not, evaluation is a maze that borders minefields. Today I will share lessons we learned out there, evading the mines. It works.
Mohamed El-Geish [00:00:43]: 5-7-5, I checked. All right, so today I want to share with you a few stories, real hard lessons that we learned applying evaluation, especially at big companies, while keeping anonymity. I'm going to share the details but change some of them, just to make sure you cannot tell exactly which one is which. The first principle I want to uphold here is to hold a mirror to myself first, before anyone else. That mirror will allow us to reflect on the biases, the issues that happen, the bugs we have in our own thinking that we inflict upon our code. So the first principle, as Richard Feynman said, is that you must not fool yourself, and you are the easiest person to fool. Human judgment has a lot of flaws. We'd like to think of ourselves as rational beings that make decisions given the data that we have.
Mohamed El-Geish [00:01:55]: But the reality is, it's not just issues in the data or issues in the environment that we do not know about that cause us to make the wrong decisions for evaluation. It's also the bugs in our own thinking. After we have checked everything and double-checked it, we are still susceptible to these errors in human judgment. And we make a ton of decisions, whether in model development or model evaluation. So given that we cannot really fully trust our own judgment, how trustworthy are our evaluation protocols and the decisions we made to arrive at them? We have a lot of errors that come from our incentives, whether it's the incentive structure itself or misalignments in incentives, primarily the principal-agent dilemma: what works for the employee might not be what's best for the company. How can companies deter these kinds of incentive structures and the faults in them? There are also the value systems.
Mohamed El-Geish [00:02:58]: Companies have different value systems. It's the collective personality of each company that drives how people make decisions. We take actions, and we make decisions, based on our own values. And evaluation is a reflection of what we care most about, whether it's performance, ethics, or latency. All of these are reflections of our own value systems that we project onto our code. We also have cognitive biases that affect our own judgment. And we have unwanted variance, noise in our thinking.
Mohamed El-Geish [00:03:33]: You still want some divergence. You want a level of variance in the way people think around the team, around the organization, but you don't want them to be so divergent that you have a team in a boat that's rowing in opposite directions. For these kinds of issues, I highly recommend two books that I found very useful: Thinking, Fast and Slow, and Noise. Let's dive in. Here's a common tale that happened, and it's always going to happen. You will get some directive from an SVP at some XYZ company who wants to ship an LLM-based application ASAP. I tried to jam as many abbreviations into this as possible. And then what's the top priority of the value system that's being emphasized here? It's moving fast.
Mohamed El-Geish [00:04:28]: Don't get me wrong, moving fast is a virtue, especially in a startup or a big company that wants to reduce time to market and time to value. But there's also a notion of moving fast by having smooth operations and by avoiding mistakes that will lead you to repeat or redo work, especially in evaluation. Who is watching the watchman? That's usually a question we ask about evaluation. It's your last line of defense in making go or no-go decisions, like launching a model to production to millions of users. So with the mounting pressure to deliver value, you lean on the default choice. There's a bias here besides the bandwagon effect, which we're going to talk about in a second: the default choice bias. If you're doing classification, you're probably going to go for the standard classification metrics. If you're doing something that looks like some known task, you're probably going to do a literature search and figure out the standard benchmarks, the standard datasets and standard metrics, that you will use to get to value as fast as possible.
Mohamed El-Geish [00:05:35]: You want to check that box that says you've done evaluation, and you're now able to say yes, you have the green light to ship this. But there is also novelty and the bandwagon effect. There are a lot of applications these days, especially with LLMs and GenAI, that are using the latest and greatest, the shiniest ways of developing evaluation methods. So what could go wrong? When the time comes to evaluate the system, you'll probably do something like this: you will go and synthesize a dataset, using an LLM to judge another LLM. So what could really be a problem here? It's insufficient, probably. So you move on to the next step, which is asking your staff to label some data.
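As a rough illustration of that pattern, here is a minimal sketch of an LLM-as-judge loop. The `call_llm` function is a hypothetical stand-in for your provider's API (the stub just returns a fixed score so the sketch runs end to end), and the grading rubric is illustrative only.

```python
# Minimal LLM-as-judge sketch. `call_llm` is a hypothetical stand-in for your
# provider's client; the stub returns a fixed score so the example runs.
from statistics import mean

def call_llm(prompt: str) -> str:
    return "3"  # stub; replace with a real model call

JUDGE_PROMPT = (
    "You are grading an answer to a question.\n"
    "Question: {question}\n"
    "Answer: {answer}\n"
    "Reply with a single integer score from 1 (bad) to 5 (excellent)."
)

def judge(question: str, answer: str) -> int:
    reply = call_llm(JUDGE_PROMPT.format(question=question, answer=answer))
    return int(reply.strip().split()[0])  # naive parsing; harden in practice

def evaluate(system_under_test, questions: list[str]) -> float:
    # One LLM grades another's answers: useful as a first signal,
    # but, as this lesson argues, insufficient on its own.
    return mean(judge(q, system_under_test(q)) for q in questions)

print(evaluate(lambda q: "Some synthesized answer to: " + q,
               ["What is our refund policy?", "How do I reset my password?"]))
```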
Mohamed El-Geish [00:06:19]: Whether you're using human annotation through HIT workers or your own staff to label that data, the idea is that you don't know yet exactly how this is going to be used. You're still in the synthesis stage. You're coming up with the standard, expected, in-domain use cases according to your own way of thinking. But the question is, if you have high scores now, is that sufficient? Can you move on to production with this level of confidence? You want to ship it, but the reality is that's not sufficient. The reason is that you still want to link this to your business KPIs and your business metrics. How do you align what you have in the lab with what you will have in production? How do you align your proxy metrics with your North Star metrics? What usually happens is that after you've hypothesized how your customers are going to use this, you end up with a huge variance, or drift, in the way customers are actually using it. Whether it's concept drift or a data distribution shift, what's going to happen eventually is that the operating conditions, the environment in which you're deploying your system, are bound to change. They're never going to stay the same.
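To make the shift point concrete, here is a minimal sketch of a distribution-shift check on one numeric feature using a two-sample Kolmogorov-Smirnov test. The feature (prompt length), the synthetic data, and the significance threshold are all assumptions for illustration.

```python
# Minimal drift check: compare a feature's distribution at development time
# against what production traffic actually looks like. Values are illustrative.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
dev_lengths = rng.normal(loc=20, scale=5, size=2_000)    # e.g., prompt lengths in the lab
prod_lengths = rng.normal(loc=35, scale=12, size=2_000)  # what customers actually send

stat, p_value = ks_2samp(dev_lengths, prod_lengths)
if p_value < 0.01:  # assumed significance threshold
    print(f"Distribution shift detected (KS={stat:.3f}, p={p_value:.2e}): re-check the eval set.")
else:
    print("No significant shift detected on this feature.")
```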
Mohamed El-Geish [00:07:38]: So how do you keep that feedback loop going? What are the things that you could do to avoid falling into that trap? The lesson here is that evaluation is about trusting the system. It's about making decisions with confidence. It's about taking insights from the output of the evaluation and feeding them back into your development, back to the whiteboard, understanding where the issues and shortcomings of your system are, and making decisions based on that. So you could trust it in that sense, the same way you would trust a pilot to make decisions about your flight, or the same way you would trust a baker in a bakery to deliver the best pastries. However, you wouldn't trust them cross-domain. If you take that person and put them in another operating environment, you have to understand the different out-of-domain issues that could happen. So in the previous example, what happened is we synthesized a dataset that we thought was in-domain, and this is the assumption that usually comes with a standard evaluation: it's IID, independent and identically distributed, with the emphasis on the identically distributed part.
Mohamed El-Geish [00:08:53]: However, you also want to have out-of-distribution (OOD) data, but to what extent? First you want to check the box that says, yes, I've gone and done the work to cover the mainstream, standard way my customers are going to use this. That's the IID part. Your evaluation set, your testing, should reflect what the customers are going to be doing in production. And ideally your training and all the other processes, like model selection and hyperparameter tuning on your dev set, also reflect the same distribution. But then you also want to have an OOD set. And that's the problem from the previous lesson: we started with OOD data and relied on it too much. You still want that. It's sufficient to have both, but it's insufficient to only have one.
Mohamed El-Geish [00:09:40]: Why just one aspect? The IID aspect alone is not sufficient because you don't know how the model is going to react out of domain, like when you put a pilot in a bakery or the other way around. And if you only have out-of-distribution data, you haven't really tested the main way you're going to be using your model, so just getting a high score doesn't mean much. It's really about understanding the gap. Where are the errors? Have you done error analysis or not? This will tell us whether you are falling into the trap of the Type III error. The Type III error is solving the wrong problem the right way, or giving a very good answer to the wrong question. So in this lesson, the high accuracy we achieved on the out-of-distribution data was a mirage. It wasn't actually representative of the customer use cases.
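A minimal sketch of reporting the two slices side by side, so neither number is mistaken for the whole story. The model, the synthetic data, and the crude noise-based "OOD" shift are placeholders, not a recipe for building a real OOD set.

```python
# Report in-domain (IID) and out-of-distribution (OOD) accuracy side by side.
# Data and the noise-based shift are synthetic and purely illustrative.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=3000, n_features=10, random_state=0)
X_train, y_train = X[:2000], y[:2000]
X_iid, y_iid = X[2000:], y[2000:]                                      # same distribution as training
X_ood = X_iid + np.random.default_rng(0).normal(0, 2.0, X_iid.shape)   # crude simulated shift
y_ood = y_iid

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
for name, (Xs, ys) in {"IID": (X_iid, y_iid), "OOD": (X_ood, y_ood)}.items():
    print(f"{name} accuracy: {accuracy_score(ys, model.predict(Xs)):.3f}")
```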
Mohamed El-Geish [00:10:38]: There are also issues you could explore when you have incorrect labels in your dataset. That happens almost always, guaranteed to be the case; the question is to what extent. I also would like to point you to a nice paper from the University of Washington on this, on inoculation by fine-tuning. It's a way to figure out whether you have a problem with the dataset or a problem with your model's capability or capacity to find a solution: is there an inherent deficiency in your model that it cannot overcome? Second lesson: the streetlight effect. Looking for what is hidden where it's convenient to look is what we call the streetlight effect. It happens a lot in our industry, and for different reasons.
Mohamed El-Geish [00:11:36]: Here's another anecdote that happened in real life. A team had access to some data sources, and instead of surveying all the data sources they needed to survey for representativeness, one of the core pillars of evaluation (representative data points have to be there for statistical coverage, the other pillar being that you have enough confidence in the inferences you're making), they sampled from the data sources they already had access to. They didn't want to go through the pain of requesting access to the rest of the warehouses or the data lakes they required to actually be representative. Don't get me wrong, the assumption here, and this is the invalid assumption, is that the data is identically distributed. You keep seeing these issues a lot with assumptions about IID or OOD, and we have to be very careful about when to do what. So here the data was insufficiently representative.
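Here is a minimal sketch of the kind of coverage check that can surface that gap: compare where the eval data actually came from against the known share of each source. The source names, shares, and counts are invented for illustration.

```python
# Compare where the eval data came from against the known share of each data
# source, to catch convenience sampling. All proportions are illustrative.
from collections import Counter

known_share = {"warehouse_a": 0.50, "lake_b": 0.30, "warehouse_c": 0.20}
eval_sources = ["warehouse_a"] * 880 + ["lake_b"] * 120   # note: no warehouse_c at all

counts = Counter(eval_sources)
total = sum(counts.values())
for source, expected in known_share.items():
    observed = counts.get(source, 0) / total
    flag = "  <-- under-represented" if observed - expected < -0.10 else ""
    print(f"{source:12s} expected {expected:.0%}, observed {observed:.0%}{flag}")
```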
Mohamed El-Geish [00:12:36]: The data was assumed to be IID, but the reality was that a little bit of data dredging was happening accidentally; no malice required, but sometimes it happens. So how do you go from the streetlight to sunlight? It's okay to start with a stopgap solution. Ship something first, but admit and declare that you're lacking representativeness, and make sure that you're working towards paying down that tech debt. You're bootstrapping an evaluation process. You're shipping gradually and very carefully, dogfooding first within the company, then shipping to beta for early adopters, and trying to collect as much representative data as possible. Meanwhile, you're also building other out-of-distribution datasets that will help you understand the upper bound of your errors and the things that could go very horribly wrong. And you're building guardrails for those.
Mohamed El-Geish [00:13:37]: So you're not shipping blindly. You're shipping with guardrails, with a safety net, so you understand how bad this could go. And then you use that mechanism iteratively to collect more representative data from early adopters. As we mentioned earlier, it's super important to analyze errors, because for any metric you only get a number; let's say you got 97% accuracy. Where is that 3%? Which categories of error do you actually need to uncover? That's error analysis and its value. And then you also need to refine and realign. As you are learning more about the problem statement and how customers are changing or drifting away, you need to keep refining your navigation apparatus. Your metrics are your compass.
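As a sketch of what "where is that 3%?" looks like in practice: bucket the failures by a hypothesized category and look at the counts before trusting the headline number. The records and category names below are invented for illustration.

```python
# Error analysis sketch: slice failures by category instead of stopping at the
# headline accuracy. Records and categories are invented.
from collections import Counter

results = [
    {"correct": True,  "category": "in_domain"},
    {"correct": True,  "category": "in_domain"},
    {"correct": False, "category": "out_of_scope_question"},
    {"correct": True,  "category": "in_domain"},
    {"correct": False, "category": "ambiguous_label"},
    {"correct": False, "category": "out_of_scope_question"},
    # ... the rest of your eval set
]

failures = [r for r in results if not r["correct"]]
print(f"Accuracy: {1 - len(failures) / len(results):.1%}")
for category, count in Counter(r["category"] for r in failures).most_common():
    print(f"  {category}: {count} failures")
```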
Mohamed El-Geish [00:14:29]: Your metrics will lead you to the next decision you have to take. So measure the internal ones that you have. Measure the standard ones for the task that come with the literature search you've done, but also have some sort of user feedback loop, whether it's click-through rate or some other implicit or explicit feedback. Try to link as much as possible to the downstream metrics, and try to optimize for those metrics during evaluation and development. And finally, nothing stays constant. Revisit all your beliefs, all your assumptions, the data selection process, your labeling instructions. We've seen, more often than not, drift between the labeling instructions and how customers expect the system to work, and it leads to catastrophes. And there is one more thing: backtesting. When you backtest your assumptions, you will find, more often than not, that you made the wrong assumptions in the prior iteration.
Mohamed El-Geish [00:15:32]: And that's a good thing, because if you hadn't assumed some distribution or some level of knowledge about how the system was going to work, you wouldn't have been able to ship the first iteration. But in return, now that you've shipped this, you know the previous iteration was bad, and you're improving. So it's a good thing to learn from that failure and to sharpen the apparatus that you have. The third and last lesson was a lesson of perverse incentives. In another anecdote, the team wanted to formulate a goal that they could control. The value system here is more about impactful results; not moving fast, as in the previous one, but shipping something that will get you promoted. That's basically the incentive for the agent in the principal-agent dilemma.
Mohamed El-Geish [00:16:26]: Your goal is to launch to millions of users if the performance of your model improved, even if it's on a stale test set, disregarding any A/B test. A/B tests require a lot of customer interactions and some time; they're lagging, and you cannot control them. So how do you balance control, achieving something that will get you promoted, against an A/B test that is more powerful because the sample size you work with is much bigger? With the A/B test you have higher confidence in making decisions; you can observe smaller effects, smaller sways or changes in the metric. However, the online metrics themselves are not something you can see in the lab at the time of development and shipping. So you need to treat this very carefully, as a textbook example of Goodhart's law.
Mohamed El-Geish [00:17:26]: When a measure becomes the goal, it ceases to be a good measure; it becomes a bad one. So here's a very quick recipe for aligning the proxy metrics that you can use in the lab with your North Star metrics, the ones you cannot measure in the lab because they're lagging and costly, sometimes prohibitively hard, to measure. You need to keep tabs on them. Every six months or so, maybe using a holdout in your A/B test that measures the long-term impact, you see how all of the collective changes you've been making are affecting those metrics. At the same time, using intuition and knowledge from prior experiments, you model these metrics on ones that you can control. However, I would highly suggest that your compass for success comes from metrics that align very well with the business metrics; that's where the real incentives are. Backtesting again.
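A minimal sketch of one way to check that alignment, assuming you keep a log of past experiments with both the offline proxy delta and the online North Star delta; the experiment numbers below are fabricated for illustration.

```python
# Check how well offline proxy-metric deltas track online North Star deltas
# across past experiments. The experiment log below is fabricated.
import numpy as np
from scipy.stats import spearmanr

# (offline proxy delta, online North Star delta) per past launch
experiments = np.array([
    [+0.020, +0.8],
    [+0.015, +0.2],
    [+0.030, -0.1],   # offline win, online loss: the cautionary case
    [-0.010, -0.5],
    [+0.005, +0.1],
])

rho, p = spearmanr(experiments[:, 0], experiments[:, 1])
print(f"Rank correlation between proxy and North Star deltas: {rho:.2f} (p={p:.2f})")
# A weak or negative correlation means the proxy is not a trustworthy compass.
```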
Mohamed El-Geish [00:18:32]: So we noticed in backtesting that one of our A/B testing experiments showed that the previous models launched based on offline metrics on the stale test set were directionally degrading the North Star metrics we collected online. And that's the cautionary tale: always try to see how your metrics are aligned with the North Star. After that, you can start improving your metrics by changing the actual formulation of the goals, by changing the weights. If you have multiple segments and you're doing some sort of weighted sum, you can focus more on, let's say, launching in a new market, so you give it more weight; a strategic direction you're pivoting to is another thing you can use to change the weights. Once you start drilling into the subcomponents of your metric, you will see that there are some exciting things you can highlight and align with. And then beware of the pseudo-certainty effect.
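A minimal sketch of a segment-weighted goal metric where the weights encode strategy, for example up-weighting a new market; the segment names, scores, and weights are all assumptions for illustration.

```python
# Segment-weighted goal metric: the weights encode strategic priorities,
# e.g. a new market you are pivoting to. All values are illustrative.
segment_scores = {"existing_market": 0.92, "new_market": 0.71, "enterprise": 0.88}

baseline_weights = {"existing_market": 0.6, "new_market": 0.2, "enterprise": 0.2}
pivot_weights    = {"existing_market": 0.3, "new_market": 0.5, "enterprise": 0.2}

def weighted_goal(scores: dict, weights: dict) -> float:
    assert abs(sum(weights.values()) - 1.0) < 1e-9  # weights must sum to 1
    return sum(scores[s] * weights[s] for s in scores)

print(f"Goal under baseline weights: {weighted_goal(segment_scores, baseline_weights):.3f}")
print(f"Goal under pivot weights:    {weighted_goal(segment_scores, pivot_weights):.3f}")
```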
Mohamed El-Geish [00:19:36]: On that pseudo-certainty effect: a lot of times when we launch models, they come with a high level of certainty that is not warranted by the lab results. You can say, we've done evaluation offline and we picked the top five models, or the top three. The reality is there is not a single point estimate here for performance on these metrics. It's an estimate; it's a range with error bars. So there's a level of uncertainty that you have to account for in the next stage of deployment, which is maybe a beta or an A/B test. Don't assume that just because these are the top three offline, they are the ones that are going to have the best performance in the online bake-off.
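A minimal sketch of putting error bars on an offline metric with a bootstrap, so a ranking of candidates carries its uncertainty into the next stage; the per-example correctness data is simulated.

```python
# Bootstrap error bars for an offline accuracy estimate, so model rankings are
# treated as ranges rather than point estimates. Data is simulated.
import numpy as np

rng = np.random.default_rng(0)
correct = rng.random(500) < 0.90   # per-example correctness on a 500-item eval set

boot_accs = [rng.choice(correct, size=correct.size, replace=True).mean()
             for _ in range(2_000)]
low, high = np.percentile(boot_accs, [2.5, 97.5])
print(f"Accuracy {correct.mean():.3f}, 95% CI [{low:.3f}, {high:.3f}]")
# If two candidates' intervals overlap heavily, the offline ranking alone
# should not decide which one ships.
```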
Mohamed El-Geish [00:20:19]: Leave some room for your uncertainty, and try to measure it, so that the number of candidates you put into production isn't just a number you come up with at random. There are also binding constraints: you cannot just put all of your models into production in an A/B test. If you can, that's great, but usually there are traffic limitations, and what could go wrong could be harder to measure. So leave some room for the uncertainty that you have in prior stages before deploying to the next ones. And of course, ramp up gradually.
Mohamed El-Geish [00:20:52]: So feature flags are your friend. You can have a list of early adopters that you give that experience to. Smoke-test the models to see how the changes behave, and monitor the effects. We've seen in a lot of companies that you can automatically roll back if you set up guardrails and you trip them. It's very important to know how to put these safety nets in place and make sure that you do not do something whose consequences you cannot predict to the fullest extent. This also goes back to adversarial testing. We talked about out-of-distribution testing before; there's also cross-bias testing, cross-domain testing, and adversarial testing.
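Before unpacking those three, here is a minimal sketch of the feature-flag and guardrail mechanics just described: monitored metrics with thresholds, a flag for gradual ramp-up, and an automatic rollback when a guardrail trips. The metric names, thresholds, and flag store are hypothetical.

```python
# Guardrail sketch: ramp up behind a feature flag and roll back automatically
# when a monitored metric trips a threshold. Names and thresholds are hypothetical.
GUARDRAILS = {
    "error_rate": {"max": 0.02},
    "p95_latency_ms": {"max": 1500},
}

feature_flag = {"new_model_rollout_pct": 5}   # start with early adopters

def check_guardrails(live_metrics: dict) -> list[str]:
    return [name for name, limit in GUARDRAILS.items()
            if live_metrics.get(name, 0) > limit["max"]]

def monitor_and_act(live_metrics: dict) -> None:
    tripped = check_guardrails(live_metrics)
    if tripped:
        feature_flag["new_model_rollout_pct"] = 0   # automatic rollback
        print(f"Rolled back: guardrails tripped -> {tripped}")
    else:
        feature_flag["new_model_rollout_pct"] = min(
            100, feature_flag["new_model_rollout_pct"] * 2)  # ramp up gradually
        print(f"Ramped to {feature_flag['new_model_rollout_pct']}% of traffic")

monitor_and_act({"error_rate": 0.01, "p95_latency_ms": 900})   # ramps up
monitor_and_act({"error_rate": 0.05, "p95_latency_ms": 900})   # rolls back
```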
Mohamed El-Geish [00:21:34]: Cross-bias testing is basically when you want to see whether the features the model is relying on are the ones you actually want it to rely on. The standard evaluation is behavioral testing: here's some input, here's some output, and you see how well the model matched the expected output. Then you have cross-bias testing, where you start looking into the structure of the model. Does it really understand the concept of a word, the concept of an adjective, or maybe the concept of some high-level abstraction in an image? If you can see that it relies on the right things in an image detection problem, or on the right things in a language problem, then you have more confidence that it understands these kinds of compositionality. There's also cross-domain testing, which we talked about; that's basically the baker being a pilot. You want to see what happens if the operating conditions change, and what is the worst thing that could happen. And then there's adversarial testing, which is basically trying to trick the system.
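A minimal sketch of simple behavioral checks in that spirit: invariance to changes that should not matter and sensitivity to changes that should, as a probe of whether the model relies on the right features. The toy keyword classifier below is a stand-in for the system under test, and it deliberately fails the negation check to show what this kind of testing exposes.

```python
# Behavioral perturbation checks: invariance to changes that should not matter,
# sensitivity to changes that should. `classify` is a toy stand-in.
def classify(text: str) -> str:
    """Stub classifier relying on a single keyword; replace with the real system."""
    return "positive" if "great" in text.lower() else "negative"

checks = [
    ("invariance: casing",    "The food was great.", "The food was GREAT.",     "same"),
    ("invariance: filler",    "Great service!",      "Great service, by the way!", "same"),
    ("sensitivity: negation", "The food was great.", "The food was not great.", "different"),
]

for name, a, b, expected in checks:
    same = classify(a) == classify(b)
    passed = same if expected == "same" else not same
    print(f"{name:24s} {'PASS' if passed else 'FAIL'}")
# The keyword stub fails the negation check: it relies on the wrong feature,
# which is exactly what this kind of testing is meant to expose.
```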
Mohamed El-Geish [00:22:37]: Adversarial testing doesn't have to be malicious, and it doesn't have to be security-oriented, but you're trying to see: what if this happens? How bad would it be if the system is tricked? And finally, premortem the things that could go wrong, because you don't know how it's going to play out. We ran out of time because of the delay, so we're good. Thank you. Thanks so much.