Shipping Your NLP Sentiment Classification Model With Confidence
This code-along walks through ingesting embedding data and looking at embedding drift
October 13, 2022This code-along walks through ingesting embedding data and looking at embedding drift.
This blog was written in partnership with Nate Mar, Founding Engineer at Arize
by Francisco Castillo Carrasco
Increasingly, companies are turning to natural language processing (NLP) sentiment classification to better understand and improve customer experience. From call centers to loyalty programs, these models inform important business decisions across a widevariety of customer touch-points. Unfortunately, many ML teams today do not have reliable ways to monitor these models in production and stay on top of things like embedding drift. This post is designed to help them get started.
Let’s say you are in charge of maintaining a sentiment classification model. This simple model takes online reviews of products from your U.S.-based store as the input and predicts whether the reviewer’s sentiment is positive, negative, or neutral. You trained your sentiment classification model on English reviews. However, once the model was released into production, your colleagues start to notice that the performance of the model has degraded over a period of time.
Monitoring is critical to surfacing the reason for this performance degradation. In this example, maybe a sudden influx of reviews written in Spanish impacts the model’s performance. You can surface and troubleshoot this issue by analyzing the embedding vectors associated with the online review text. By inspecting embedding drift, you can also surface problems with your data before it causes performance degradation.
This blog covers how to start from scratch. We will:
- Download the data
- Preprocess the data
- Train the model
- Extract embedding vectors and predictions
- Log the inferences
For the purposes of this example, we will be using Hugging Face’s open source libraries and Arize for monitoring (full disclosure: I work for Arize).
In particular, we will use:
- Datasets: a library for easily accessing and sharing datasets and evaluation metrics for Natural Language Processing (NLP), computer vision, and audio tasks.
- Transformers: a library to easily download and use state-of-the-art pre-trained models. Using pre-trained models can lower your compute costs, reduce your carbon footprint, and save you time from training a model from scratch.
- Monitoring: if this is your first time with Arize, it’s worth signing up for a free account or consulting this tutorial on sending data before continuing.
Let’s get started!
Step 0: Setup and Getting the Data
The preliminary step is to install datasets
and transformers
libraries mentioned above. In addition, we will import some metrics from scikit-learn
.
We’ll explain each of the imports below as we use them.
Install Dependencies and Import Libraries 📚
Check if GPU Is Available
Here we use Pytorch to check whether a GPU is available or not. When appropriate we will use PyTorch’s nn.Module.to()
method to ensure that the model will run on the GPU if we have one.
🌐 Download the Data
The easiest way to load a dataset is from the Hugging Face Hub. There are already thousands of datasets in over 100 languages on the Hub. At Arize, we have crafted the arize-ai/ecommerce_reviews_with_language_drift
dataset for this example notebook.
Thanks to Hugging Face 🤗 datasets, we can download the dataset in one line of code. The Dataset
object comes equipped with methods that make it very easy to inspect, pre-process, and post-process your data.
Inspect the Data
It is often convenient to convert a Dataset
object to a Pandas DataFrame
so we can access high-level APIs for data visualization. 🤗 Datasets provides a set_format()
method that allows us to change the output format of the Dataset
. This does not change the underlying data format, an Arrow table. When the DataFrame
format is no longer needed, we can reset the output format using reset_format()
.
Step 1: Developing Your Sentiment Classification Model
Pre-Processing the Data
Before being able to input the data into the model for fine-tuning, we need to perform an important step: tokenization.
Transformer models like DistilBERT cannot receive raw strings as input. We need to tokenize and encode the text as numerical vectors. We will perform Subword Tokenization, which is learned from the pre-training corpus. Its goal is to allow tokenization of complex words (or misspellings) into smaller units that the model can learn, and keep common words as unique entities, keeping the length of the input to a reasonable size.
Hugging Face Transformers provides the AutoTokenizer
class, which allows us to quickly download the tokenizer required by the pre-trained model of our choosing.
In this case, we will use the following checkpoint: distilbert-base-uncased
.
Next, let’s define a processing function to tokenize the examples in the dataset. The padding
and truncation
options are added to keep the inputs to a consistent length. Shorter sequences are padded and longer ones are truncated. We can apply said processing function to entire dataset objects by using the map()
method.
Two columns have appeared in each dataset:
input_ids
: A numerical identifier to which each token has been mapped.attention_mask
: Array of 1s and 0s, allowing the model to ignore the padded parts of the inputs.
We can display the dataset changes as it was shown above:
Build the Model
Similar to how we obtained the tokenizer, Hugging Face Transformers provides the AutoModelForSequenceClassification
class, which allows us to quickly download a pre-trained transformer model with a classification task head on top. The pre-trained model to use in this tutorial is DistilBERT. The weights of the classification task head will be randomly initialized.
It is important to pass output_hidden_states = True
to be able to compute the embedding vectors associated with the text (explained below). First, let’s download the pre-trained model.
We then use the TrainingArugments
class to define the training parameters. This class stores a lot of information and gives you control over the training and evaluation.
Next, define a metrics calculation function to evaluate the model.
Finally, fine-tune the model using the Trainer
class.
Step 2: Post-Processing Your Data
Here, we will extract the prediction labels and the text embedding vectors. The latter are formed from the hidden states of the pre-trained (and then fine-tuned) model.
Step 3: Prepare Your Data To Be Sent for Monitoring
From this point forward, it is convenient to use Pandas DataFrames. This can be done easily using the format methods already covered.
Update the Timestamps
The data that you are working with was constructed in April of 2022. Hence, we will update the timestamps so they are current at the time that you’re sending data to Arize.
Map Labels To Class Names
We want to log the inferences with the corresponding class names (for predictions and actuals) instead of the numeric label. Since we used 🤗 Datasets to download the dataset, it came equipped with methods to do this.
The dataset we downloaded defined the label to be an instance of the datasets.ClassLabel
class, which has the convenient method int2str
(visit Hugging Face documentation for more information).
Add Prediction IDs
The Arize platform uses prediction IDs to link a prediction to an actual. Visit the Arize documentation for more details. You can generate prediction IDs as follows:
Step 4: Sending Data Into Arize 💫
The first step is to set up the Arize client. After that we will log the data.
Import and Setup Client
Copy the Arize API_KEY
and SPACE_KEY
from your Space Settings page (shown below) to the variables in the cell below. We will also be setting up some metadata to use across all logging.
Now that our Arize client is setup, let’s go ahead and log all of our data to the platform. For more details on how arize.pandas.logger
works, visit our documentation.
Define the Schema
A Schema instance specifies the column names for corresponding data in the DataFrame. While we could define different Schemas for training and production datasets, the DataFrames have the same column names, so the Schema will be the same in this instance.
To ingest non-embedding features, it suffices to provide a list of column names that contain the features in our DataFrame. Embedding features, however, are a little bit different.
Arize allows you to ingest not only the embedding vector, but the raw data associated with that embedding, or a URL link to that raw data. Therefore, up to three columns can be associated to the same embedding object*. To be able to do this, Arize’s SDK provides the EmbeddingColumnNames
class, used below.
*NOTE: This is how we refer to the 3 possible pieces of information that can be sent as embedding objects:
- Embedding
vector
(required) - Embedding
data
(optional): raw text associated with the embedding vector - Embedding
link_to_data
(optional): link to the data file (image, audio, …) associated with the embedding vector
Learn more here.
Log Data
Step 5: Confirm Data Is In There and Get Started ✅
While the model should appear immediately, the data will not show up until the indexing is complete.
You will be able to see the predictions, actuals, and feature importances that have been sent in the last 30 minutes, day, or week.
An example view of the Data Ingestion tab from a model, when data is sent continuously over 30 minutes, is shown in the image below.
Check the Embedding Data
First, set the baseline to the training set that we logged before.
If your model contains embedding data, you will see it in your Model Overview page.
Click on the Embedding Name or the Euclidean Distance value to see how your embedding data is drifting over time. In the picture below, Arize represents the global euclidean distance between your production set (at different points in time) and the baseline (which we set to be our training set). We can see there is a period of a week where suddenly the distance is remarkably higher. This shows that during that time text data sent to our model that was different than what it was trained on (English). This is the period of time when reviews written in Spanish were sent alongside the expected English reviews.
In addition to the drift tracking plot, you can also find the Uniform Manifold Approximation and Projection (UMAP) visualization of your data under the point in time selected. Notice that the production data and our baseline (training) data are superimposed, which is indicative that the model is seeing data in production similar to the data it was trained on.
Next, select a point in time when the drift was high and select a UMAP visualization in two dimensions (2D). We can see that both training and production data are superimposed for the most part, but another cluster of production data has appeared. This indicates that the model is seeing data in production qualitatively different to the data it was trained on, and in this case causing performance degradation.
For further inspection, select a three-dimensional (3D) UMAP view and click Explore UMAP to expand the view. With this view, you can interact in 3D with the dataset. Zoom, rotate, and drag to see the areas of the dataset that are most interesting.
In the UMAP display, there are several coloring options that are worth using in different circumstances:
- By Dataset: The coloring distinguishes between production data versus baseline data (training in this example). This is specifically useful to detect drift. In this example, we can see that there is some production data far away from any training data, giving an indication of severe dataset drift. We can identify exactly what datapoints our baseline is missing so that re-train effectively.
- By Prediction Label: This coloring option gives an insight on how a model is making decisions. Where are the different classes located in the space? Is the model predicting one class in regions where it should be predicting another?
- By Actual Label: This coloring option is great for identifying labeling issues. If other colors are visible inside an orange cloud, for instance, it is a good idea to check and see if the labels are wrong. Further, we can use the corrected labels for re-training.
- By Correctness: This coloring option offers a quick way of identifying where the bulk of your model’s mistakes are placed, giving you an area to pay attention to. In this example, we can see that the Spanish reviews are almost all red.
- By Confusion Matrix: This coloring option allows you to select a
positive class
and color the data-points asTrue Positives
,True Negatives
,False Positives
,False Negatives
.
Final Note
If you want to remove this example model from your account, just click Models -> NLP-reviews-demo-language-drift -> config -> delete
Conclusion
As teams deploy more NLP sentiment classification models into production, having model monitoring in place to track embedding drift and root cause issues that arise is critical for staying ahead of potential performance degradation in the real world. By taking a few simple steps outlined in this blog, your team can have an easier and more automated way to tackle these challenges head-on!
Questions? Reach out on the Arize community. For additional Colabs, check out the Arize Docs.