## Monitoring Regression Models Without Ground-Truth

### Deploying a machine learning model to production is just the first step in the model’s lifecycle

November 22, 2022

Deploying a machine learning model to production is just the first step in the model’s lifecycle. After the go-live, we need to continuously monitor the model’s performance to make sure the quality of its predictions stays high. This is relatively simple if we know the ground-truth labels of the incoming data. Without them, the task becomes much more challenging. In one of my previous articles, I showed how to monitor classification models in the absence of ground truth. This time, we’ll see how to do it for regression tasks.

## It sounds like magic, but it’s pretty simple.

**Performance monitoring, again**

You might have heard me use this metaphor before, but let me repeat it once again since I find it quite illustrative. Just like in financial investment, the past performance of a machine learning system is no guarantee of future results. The quality of machine learning models in production tends to deteriorate over time, mainly because of data drift. That’s why it is essential to have a system in place to monitor the performance of live models.

**Monitoring with known ground-truth**

When ground-truth labels are available, performance monitoring is no rocket science. Think about demand prediction. Each day, your system predicts the sales for the next day. Once the next day is over, you have the actual ground truth to compare against the model’s predictions. This way, with just a 24-hour delay, you can calculate whichever performance metrics you deem essential, such as the Mean Absolute Error (MAE) or Root Mean Squared Error (RMSE). Should the model’s quality start to deteriorate, you will be alerted the following day.
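As a minimal sketch of that daily check, the two metrics can be computed with nothing but the standard library. The sales figures and the names `predicted_sales` and `actual_sales` are made up for illustration; they stand for yesterday's forecasts for four products and the sales observed a day later.

```python
import math

def mae(y_true, y_pred):
    """Mean Absolute Error: the average size of the errors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root Mean Squared Error: like MAE, but large errors weigh more."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical numbers: forecasts made yesterday and the sales observed today.
predicted_sales = [100.0, 150.0, 200.0, 250.0]
actual_sales = [110.0, 140.0, 230.0, 250.0]

daily_mae = mae(actual_sales, predicted_sales)    # 12.5
daily_rmse = rmse(actual_sales, predicted_sales)  # sqrt(275), roughly 16.58
```

Tracking these two numbers day by day is all the monitoring this scenario needs; the rest of the article is about what to do when `actual_sales` never arrives.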

**Monitoring with direct feedback**

In other scenarios, we might not observe the ground-truth directly; instead, we receive some other form of feedback on the model’s performance. In this case, performance monitoring is still a relatively easy task. Think about a recommender system that you use to suggest to your users the content they would like best. In this case, you don’t know whether users enjoyed each piece of content that was suggested to them. Measuring the concept of enjoyment is quite a challenge on its own. But you can easily monitor how often the users consume the suggested content. If this frequency stays constant over time, the model’s quality is likely stable. As soon as the model breaks, you can expect the users to start ignoring the suggested content more often.

**Monitoring without ground-truth: classification**

Then, there are situations when ground truth is unavailable, or at least not for a long time. In one of my previous projects, my team and I predicted users’ locations to present them with relevant marketing offers they could take advantage of on the spot. Some business metrics could be computed based on how often the users found the offers interesting, but there were no ground-truth targets for the model: we never actually *knew* where each user was.

Performance monitoring in this scenario has long been a challenge. NannyML, an open-source library for post-deployment data science, recently proposed a clever method called Confidence-Based Performance Estimation (CBPE) for classification tasks. Based on the assumption that the classifier is calibrated, it delivers reliable performance metrics even when no ground-truth labels are available. I have explained CBPE in detail in a previous article — do check it out if you have missed it.

**Monitoring without ground-truth: regression**

Unfortunately, the CBPE approach is specific to classification problems. It works thanks to the fact that calibrated classifiers provide a probability distribution (or a predictive posterior, as a Bayesian would have it). In other words, we know all possible outcomes and the associated probabilities. Most commonly used regression models don’t provide such insights. That’s why regression problems require a different approach.

**Direct Loss Estimation**

To use CBPE for regression, the question to be answered is how to obtain the probability distribution of the prediction in a regression task. An obvious approach that comes to mind is to use Bayesian methods. NannyML’s developers have tested this approach, but it turned out it had some convergence issues and, as is typical for Bayesian posterior sampling, it took a long time to produce the results. Finally, one would be limited to only using Bayesian models should one want to do performance estimation, which is quite restrictive. Luckily, they came up with a much simpler, faster, and more reliable approach which they dubbed Direct Loss Estimation or DLE.

**The DLE algorithm**

The DLE method is brilliant in its simplicity. It boils down to training another model, referred to as a *nanny model*, to predict the loss of the model being monitored, or the *child model*. If this brings gradient boosting to mind, you are quite right. The idea is similar, but with one twist. The nanny model predicts the *loss* of the child model rather than its *error*, as boosting methods would. It will become clear why this is the case shortly. But first, let’s go through the algorithm step by step.


We need three subsets of data. First, there is *training data* on which the child model is trained. Then, there is *reference data*, which we will use for training the nanny model. Both training and reference data have targets available. Finally, there is *analysis data*, which is the data fed into the child model in production. For these, there are no targets.

Only analysis data have no targets. Image by the author.

First, we train the child model on the training data. The child can be any model solving a regression task. You can think of linear regression, gradient-boosted decision trees, and whatnot.

Step 1: Train the child model on the training data. Image by the author.

Then, we pass reference features to the child model to get the predictions for the reference set.

Step 2: Get the child’s predictions for the reference data. Image by the author.

Next, we train the nanny model. It can be any regression model. In fact, it can even be the same type of model as the child, e.g. linear regression or gradient-boosted decision trees. As training features, we pass the features from the reference set as well as the child’s predictions for the reference set. The target is the child’s loss on the reference set, which can be expressed as the absolute or squared error, for instance. Notice that as a result, the nanny model is able to predict the child’s loss based on its predictions themselves and the features used to generate them.

Step 3: Train the nanny model to predict the child’s reference data loss. Image by the author.

Once the analysis features are available in production, they are passed to our child model and we receive the predictions. We would like to know how good these predictions are, but there are no targets to compare them against.

Step 4: Pass production features to the child model and obtain predictions. Image by the author.

In the final step, we pass the child’s predictions for the analysis data and the analysis features to the nanny model. What we obtain as output is the predicted loss for the analysis data. Notice that even though we don’t know y_analysis, we can predict the monitored model’s loss. Magic!

Step 5: Get predictions of analysis data loss from the nanny. Image by the author.
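The five steps can be sketched end to end on toy data. This is an illustration under stated assumptions, not NannyML's implementation: the data-generating process, the helper names `fit_line`, `predict_line`, and `make_data`, and the choice of a one-feature least-squares line for both child and nanny are all made up here. Because the child is a straight line in a single feature, its prediction carries the same information as the feature itself, so the nanny below takes only the prediction as input; in general it would receive the features as well.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares with one input; returns (intercept, slope)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def predict_line(model, xs):
    intercept, slope = model
    return [intercept + slope * x for x in xs]

def make_data(rng, n, x_low, x_high):
    """Toy data: y = 2 + 3x plus noise whose spread grows with x."""
    xs = [rng.uniform(x_low, x_high) for _ in range(n)]
    ys = [2.0 + 3.0 * x + rng.gauss(0.0, 0.5 * x) for x in xs]
    return xs, ys

rng = random.Random(0)
x_train, y_train = make_data(rng, 4000, 0.0, 10.0)
x_ref, y_ref = make_data(rng, 4000, 0.0, 10.0)
# The analysis inputs have drifted, but the x -> y relationship is unchanged.
# In production y_analysis would be unknown; it is simulated here for checking.
x_analysis, y_analysis = make_data(rng, 4000, 5.0, 15.0)

# Step 1: train the child model on the training data.
child = fit_line(x_train, y_train)

# Step 2: get the child's predictions for the reference data.
pred_ref = predict_line(child, x_ref)

# Step 3: train the nanny to predict the child's absolute loss on reference data.
loss_ref = [abs(y - p) for y, p in zip(y_ref, pred_ref)]
nanny = fit_line(pred_ref, loss_ref)

# Step 4: get the child's predictions for the analysis data.
pred_analysis = predict_line(child, x_analysis)

# Step 5: the nanny predicts the loss per observation; its mean estimates the MAE.
estimated_mae = sum(predict_line(nanny, pred_analysis)) / len(pred_analysis)

# Sanity checks that are only possible because the toy targets were simulated.
realized_mae = sum(abs(y - p) for y, p in zip(y_analysis, pred_analysis)) / len(y_analysis)
reference_mae = sum(loss_ref) / len(loss_ref)
```

In this setup the estimated MAE should land close to the realized one and clearly above the reference MAE, since the analysis inputs sit in the noisier region of the feature space.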

All the algorithm steps are straightforward, except for Step 3, training the nanny model. How come this model is able to accurately predict the child’s loss? And how is it possible that we don’t need the nanny to be a more complex model than the child to predict it? Let’s find out!

**Why the loss and not the error**

The key trick of the DLE approach is the realization that as long as we are using absolute or squared performance metrics such as MAE, MSE, RMSE, and the like, we are not interested in the model’s *error* at all but rather in its *loss*.


The error is simply the difference between the ground-truth target value and the model’s prediction. It is signed: positive when the target is larger than the prediction and negative when it is smaller. The loss, on the other hand, is unsigned. Absolute loss metrics such as the MAE remove the sign by taking the absolute value of the error, while squared loss metrics such as the MSE or RMSE raise the error to the second power, which always yields a non-negative result.
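In symbols, the definitions above read as follows. The notation is introduced here for illustration and does not come from the original article: $y_i$ is the target, $\hat{y}_i$ the prediction, and $n$ the number of observations.

```latex
\begin{aligned}
e_i &= y_i - \hat{y}_i
  && \text{(signed error)} \\
\ell_i^{\mathrm{abs}} &= |e_i|, \qquad \ell_i^{\mathrm{sq}} = e_i^2
  && \text{(unsigned losses)} \\
\mathrm{MAE} &= \frac{1}{n}\sum_{i=1}^{n} |e_i|, \qquad
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n} e_i^2, \qquad
\mathrm{RMSE} = \sqrt{\mathrm{MSE}}
  && \text{(aggregated metrics)}
\end{aligned}
```

Every metric in the last row depends on the errors only through $|e_i|$ or $e_i^2$, which is why the sign never needs to be known.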

Predicting the unsigned loss is a lot easier than predicting the signed error. Here is one way to think about it. Being able to accurately predict a model’s error is the same as predicting the ground truth: since the error is the ground truth minus the prediction, if we knew the error, we could add it to the prediction and voilà, there we have the targets. No need to have two models at all. Usually, however, predicting the error is not possible.

On the other hand, predicting the loss only requires guessing *how wrong* the model was, not *in which direction* it was wrong. The error provides more information about the model’s performance than the loss, but this information is not needed for RMSE-based or MAE-based performance estimation.

**Go nanny yourself**

As we have said, loss prediction can be achieved by the same type of model as the one used initially. Let’s illustrate how it works with a simple example.

Take a look at these randomly generated data. We have one feature, x1, and the target, y. The data were generated in such a way that there is a linear relationship between the feature and the target, but the larger the feature’s value, the stronger the noise. We can fit a linear regression model to these data and it captures the linear trend well. Notice, however, that the model’s errors are small for small values of x1 and larger when x1 is large.

Regression fitted to generated data with heteroskedastic noise. Image source: NannyML.

Imagine we would like to predict these *errors* using another linear regression model, based on the predictions (the red regression line) and x1. This would not be possible (go ahead and verify it yourself!). For instance, for an example where x1 is 1, and the model’s prediction is around 2, the ground truth can be as different as 5 or -1. There is no way to predict the error.

Now think about predicting the *loss* instead, say, the absolute error that the MAE averages. It’s as though all the blue dots below the red line have been mirrored above it. This task is pretty simple! The larger x1, the higher the loss, and the relationship is linear; a regression line would fit quite well. If you’re interested in proof, you will find it in NannyML’s tutorial, where they actually generate the data and fit two linear regression models to show how the latter can easily predict the former’s loss.
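A quick way to check this claim yourself is to measure how much of the error and of the loss a straight line in x1 could explain. The sketch below uses made-up data (a linear trend with noise whose standard deviation is assumed to be 0.5 times x1), not the data from the figure, and evaluates everything on the same sample the child was fitted on. With a single input, the squared Pearson correlation equals the R² of a linear fit, and since the child's prediction is itself a linear function of x1, checking x1 alone is enough.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation between two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(42)
x1 = [rng.uniform(0.0, 10.0) for _ in range(4000)]
# Assumed data-generating process: linear trend, noise growing with x1.
y = [2.0 + 3.0 * x + rng.gauss(0.0, 0.5 * x) for x in x1]

# Child model: least-squares line fitted to (x1, y).
n = len(x1)
mean_x, mean_y = sum(x1) / n, sum(y) / n
slope = (sum((x - mean_x) * (t - mean_y) for x, t in zip(x1, y))
         / sum((x - mean_x) ** 2 for x in x1))
intercept = mean_y - slope * mean_x
pred = [intercept + slope * x for x in x1]

errors = [t - p for t, p in zip(y, pred)]  # signed
losses = [abs(e) for e in errors]          # unsigned

# Share of variance a linear nanny using x1 could explain in each target.
r2_error = pearson(x1, errors) ** 2
r2_loss = pearson(x1, losses) ** 2
```

Here `r2_error` is zero up to floating-point noise, because least-squares residuals are uncorrelated with the input by construction, while `r2_loss` is clearly positive: the sign of the error is noise, but its magnitude follows x1.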

To sum up, any regression model can “nanny itself”; that is: the same type of model can be used both as the child model (the one in production, to be monitored) and the nanny model (the one predicting the child’s loss).


**Danger zone: assumptions**

You know there are no free lunches, right? Just like most other statistical algorithms, the DLE comes with some assumptions that need to hold for the performance estimation to be reliable.

First, the algorithm works as long as there is no *concept drift*. Concept drift is the data drift’s even-more-evil twin. It’s a change in the relationship between the input features and the target. When it occurs, the data patterns learned by the model are no longer applicable to new data.

DLE only works in the absence of concept drift.

Second, predictive models are often more accurate for some combinations of feature values than for others. If this is the case for our child model, then DLE will need enough data in the reference set for the nanny model to learn this pattern.

DLE only works when the nanny has enough data to learn the feature combinations for which the child is more and less accurate.

For example, when predicting house prices based on square footage, say our model works better for small houses (where each square foot adds much to the value) than for very large ones (where the exact square footage is a weaker price driver). We need enough examples of small and large houses of various prices in the reference set so that the nanny model can learn the relationship between the house’s square footage and the loss in the child model’s price prediction.

**DLE with NannyML**

Let’s get our hands dirty with some data and models to see how easy it is to perform the DLE with the nannyml package!

For this demonstration, we will be using the Steel Industry Energy Consumption dataset available freely from the UCI Machine Learning Repository. The dataset contains more than 35,000 observations of electricity consumption in a Korean steel-industry company. The electricity consumption in kilowatt-hours (kWh) is measured every 15 minutes in the year 2018. The task is to predict it with a set of explanatory features such as reactive power indicators, the company’s load type, day of the week, and others.

Let’s start with the boring but necessary part: loading and cleaning the data. We will use the first nine months for training, then two months for reference, and treat the last month of December as the analysis set.
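The chronological split described above can be sketched with the standard library alone. The rows, the column name `usage_kwh`, and the helper `assign_period` are hypothetical stand-ins rather than the real dataset or the article's original code, and the function assumes every timestamp falls in 2018.

```python
from datetime import datetime

def assign_period(timestamp):
    """Map a 2018 timestamp to the data subset it belongs to."""
    if timestamp.month <= 9:
        return "training"   # January to September
    if timestamp.month <= 11:
        return "reference"  # October and November
    return "analysis"       # December

# Hypothetical rows; the real dataset has one row per 15-minute interval.
rows = [
    {"timestamp": datetime(2018, 3, 14, 8, 15), "usage_kwh": 3.2},
    {"timestamp": datetime(2018, 10, 1, 0, 0), "usage_kwh": 4.1},
    {"timestamp": datetime(2018, 12, 24, 23, 45), "usage_kwh": 2.7},
]

subsets = {"training": [], "reference": [], "analysis": []}
for row in rows:
    subsets[assign_period(row["timestamp"])].append(row)
```

Splitting by time rather than at random matters here: the reference and analysis sets must come after the training period, just as they would in production.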

Now we can fit our child model to the training data and make predictions for the reference and analysis sets. We’ll use linear regression for simplicity.

Finally, we can fit NannyML’s DLE estimator. By default, it will use LightGBM as the nanny model. As arguments, we need to pass the original feature names, the features holding ground-truth and predicted values for the reference set, the metrics we are interested in (let’s go with RMSE), and optionally the time feature, which will be used for plotting.
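The call described above might look roughly like the sketch below. It is not the article's original code: the wrapper `estimate_rmse_with_dle` and the column names `y_pred`, `y_true`, and `timestamp` are assumptions, and the keyword arguments follow the DLE constructor as documented by NannyML at the time of writing, so they should be checked against the installed version.

```python
def estimate_rmse_with_dle(reference, analysis, feature_names):
    """Fit NannyML's DLE on the reference set, then estimate RMSE on analysis.

    Both arguments are assumed to be pandas DataFrames that contain the
    feature columns plus 'y_pred' and 'timestamp'; `reference` must also
    contain the ground truth in 'y_true'.
    """
    # Imported lazily so that defining this function does not require the package.
    import nannyml as nml

    estimator = nml.DLE(
        feature_column_names=feature_names,  # original features of the child model
        y_pred="y_pred",                     # assumed name of the prediction column
        y_true="y_true",                     # assumed name of the target column
        timestamp_column_name="timestamp",   # optional, used for plotting
        metrics=["rmse"],
    )
    estimator.fit(reference)
    return estimator.estimate(analysis)
```

A call such as `results = estimate_rmse_with_dle(reference, analysis, features)` would return the estimation results object; no nanny model is specified, so the library's default is used.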

Once the estimator has been fitted, we can neatly visualize the estimated performance with its plot method.

Estimated performance for the analysis period. Image by the author.

As we can see from the plot, the estimated performance in the analysis period is quite similar to what the model has shown before. Actually, it even depicts a slight improvement trend (recall that the lower the RMSE, the better). Okay, but how good is this performance estimation? Let’s find out!

We can do it quite simply in this case since, in reality, we do have the ground-truth target values for the analysis period; we just didn’t use them. So we can calculate the actual, realized RMSE and plot it against the DLE estimation.

To compute the realized performance, we can use nannyml’s PerformanceCalculator. The plotting part is more involved; it requires us to write some glue code, but I expect the package developers to make it easier in future releases.

Realized versus predicted performance. Image by the author.

As we can see, the DLE performance estimation is pretty decent. The algorithm has even correctly predicted the performance improvement at the end of the analysis period.

*This article was originally published on **the author’s Medium blog**.*

Thanks for reading!


**Author’s Bio:** Michał Oleszak is a Machine Learning Engineer with a statistics background. He has worn all the hats, having worked for a consultancy, an AI startup, and a software house. He is a traveler, polyglot, data science blogger and instructor, and lifelong learner.