The machine learning monitoring landscape is evolving fast. You may be tempted to use the latest tool and hope that it works out of the box. However, this can lead to many false alerts or to missed issues.
I present an easy-to-implement prioritization approach that you can use with either your own backend monitoring tools or a vendor monitoring tool. It is based on more than 30 large-scale models I have run in production over the last ten years.
Note: As the image below shows, machine learning monitoring should be added on top of typical backend monitoring. Engineers and data scientists without production experience should start with the gentle introduction to backend monitoring.
When you apply only traditional backend monitoring to machine learning applications, you will experience silent failures. These failures have a massive negative impact on the quality of your application’s responses, which adversely affects user experience or the company’s revenue.
Some examples of silent failures I’ve personally observed are:
These examples are by no means exhaustive, but they do highlight the need for additional monitoring. To make matters worse, many of these bugs are permanent, degrading our performance by as much as 10 or 15%, while only massive problems (>50% worse) will be detected by users or stakeholders.
“We introduced a machine learning observability tool and now we get several alerts per week that an input field’s distribution changed. The reasons are mostly upstream business changes or unexplained changes in the input data. We did not take any action on the alerts.”
Source: Data scientist working at a market-leading financing platform
Alerts that don’t prompt any clear actions will be ignored in the not-so-distant future. Many tools and vendors offer input data monitoring as a major component. While input data monitoring is valuable, I advise against it as a first step or a standalone measure.
Instead, I advocate taking a page out of site reliability engineering’s book and recommend focusing on customer impact-based metrics. You prioritize backward from the output:
I will cover the top priorities in this article. For a longer form of this article with all the steps, watch my PyData talk.
For some machine learning applications, you get to know the true value of your prediction, usually with a delay. For example: Predict the delivery time of food. After the food arrives, you can compare your prediction to the actual observed value. The metrics are then calculated over many examples. You can compare them to metrics measured on historical data during model development.
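To make this concrete, here is a minimal sketch of delayed evaluation for the delivery-time example. It assumes a hypothetical logging schema (request_id, predicted_minutes, actual_minutes); the MAE metric and the 20% tolerance are placeholders you would replace with the baseline measured during your own model development.

```python
# Minimal sketch (assumed schema and thresholds): join predictions logged at
# serving time with the ground truth that arrives later, compute the rolling
# evaluation metric, and compare it against the offline baseline.
import pandas as pd

def evaluate_recent_window(predictions: pd.DataFrame,
                           actuals: pd.DataFrame,
                           baseline_mae: float,
                           tolerance: float = 0.20) -> dict:
    """predictions: columns [request_id, predicted_minutes]
    actuals:        columns [request_id, actual_minutes] (arrives with a delay)
    """
    joined = predictions.merge(actuals, on="request_id", how="inner")
    mae = (joined["predicted_minutes"] - joined["actual_minutes"]).abs().mean()
    return {
        "n_matched": len(joined),
        "mae": float(mae),
        # Alert when the production MAE drifts more than `tolerance`
        # above the baseline from model development.
        "alert": mae > baseline_mae * (1 + tolerance),
    }
```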
To monitor the evaluation metrics in production, take the following steps:
You should also monitor your application’s response distribution. The response is the return value after all postprocessing steps and business rules. For classification models, this can be a prediction score; for regression models, it is the predicted numerical value.
The response value is an excellent proxy for quality monitoring. It does not measure how well the model fits its target function, as evaluation metrics do. However, it does change when quality goes down (e.g., an aggressive filter removes many good-quality predictions, or an important input variable shifts the output score drastically).
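Here is a minimal sketch of what response-distribution monitoring could look like, assuming you can collect the numerical response values per time window. The summary statistics and the 15% relative tolerance are illustrative, not a prescription.

```python
# Minimal sketch (illustrative thresholds): summarize the distribution of the
# service's response values per time window and flag large shifts against a
# reference window, e.g. the previous week.
import numpy as np

def response_distribution_stats(responses: np.ndarray) -> dict:
    return {
        "mean": float(np.mean(responses)),
        "p10": float(np.percentile(responses, 10)),
        "p50": float(np.percentile(responses, 50)),
        "p90": float(np.percentile(responses, 90)),
    }

def drifted(current: dict, reference: dict, rel_tolerance: float = 0.15) -> bool:
    # Flag the window if any summary statistic moved by more than
    # `rel_tolerance` relative to the reference window.
    return any(
        abs(current[k] - reference[k]) > rel_tolerance * max(abs(reference[k]), 1e-9)
        for k in current
    )
```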
Measuring the response distribution offers many significant benefits:
How you collect the response value:
Bonus tip: Monitor negative user experiences as a separate metric, e.g. your service returns an empty result, a low-certainty response, or a fallback. Brainstorm proxies for your use case and alert on the percentage of bad responses. Downside monitoring is very important; don’t wait until your stakeholders or users notify you.
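A minimal sketch of such downside monitoring, assuming a hypothetical response object with returned items, a certainty score, and a fallback flag; the 0.5 certainty cut-off and the 5% alert threshold are placeholders for whatever counts as a bad response in your use case.

```python
# Minimal sketch (hypothetical response fields): count negative user
# experiences, e.g. empty, low-certainty, or fallback responses, and alert on
# the share of bad responses in a time window.
from dataclasses import dataclass

@dataclass
class Response:
    items: list
    certainty: float
    is_fallback: bool

def is_bad_response(r: Response, min_certainty: float = 0.5) -> bool:
    return len(r.items) == 0 or r.certainty < min_certainty or r.is_fallback

def bad_response_rate(window: list) -> float:
    if not window:
        return 0.0
    return sum(is_bad_response(r) for r in window) / len(window)

# Example: alert when more than 5% of responses in the window are bad.
# if bad_response_rate(window) > 0.05: page_the_on_call()
```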
It is worth mentioning that today’s machine learning monitoring methods will not alert you to every individual bad prediction. Instead, they work on the whole traffic or on segments of it. If you run anything where even a single failure is potentially catastrophic, like health-related predictions, consider measures such as easy-to-find objection mechanisms for end users, partial automation over full automation, and humans in the loop.
Did I forget anything? Drop me a line at [email protected].
Thank you to @bocytko, Eval Simpson, and Vlad Minzatu for their great reviews.