MLOps Lab #3: Continuous Delivery of a TensorFlow Model on Red Hat OpenShift (OKD) with SAS Model Manager and Workflow Manager


October 27, 2020

Ivan Nardini

Premise: This article summarizes a proof of concept I have been working on over the last month. It involves several concepts and technologies. Depending on your interest, I would consider a virtual session to answer all your questions. So feel free to leave comments.

This is the third article of my MLOps Lab series.

But compared to the previous ones,

MLOps Lab #1 : Batch scoring with Mlflow Model (Mleap flavor) on Google Cloud Platform

MLOps Lab #2 : Deploy a Recommendation System as Hosted Interactive Web Service on AWS

this time it’s a bit different, for at least three reasons.

First, I feel more involved. It lets me show what I do at work every day, and what it means to be a passionate person who develops analytics applications with open source at SAS.

Also, because I worked with my partner in crime Artjom Glazkov, it’s a real example of how collaboration is one of the most powerful values at the company.

Last but not least, it’s my first article for the MLOps Community. Thanks, Demetrios Brinkmann.

That said, here is the table of contents:

1. The Scenario

2. The Project: Business Case, Process, Environments and Tools

3. SAS® for Continuous Delivery Machine Learning Models

3.1 SAS® Model Manager as Model Registry

3.2 SAS® Workflow Manager for Automation

4. Final Considerations

5. Summary

6. References

So, let’s jump into the scenario.

1. The Scenario

Since I started working on ModelOps, customers have been asking to integrate Machine Learning environments.

As far as I’m concerned, ModelOps should solve Machine Learning system integration challenges!

No matter where they are (on-premises or in the cloud) or what technologies are involved (free and open source or proprietary software), the conversation goes kind of like this:

Assume that I orchestrate Model Training at scale with <Orchestrator> in <Training environment>…
Now Model Deployment is in <Production environment>…
But, I do need a Model Governance framework to manage the entire model lifecycle automatically between <Training environment> and <Production environment>…

In particular, in this scenario the customer says:

...But I do need a Model Registry to version TensorFlow models and deploy the associated serving Docker images on OpenShift. And once in production, I want to monitor them as well.

So the question is: can SAS operationalize TensorFlow models on OpenShift?

Let’s see if we can do that 😉

2. The Project: Business Case, Process, Environments and Tools

For POC purposes, I consider a credit scoring business application in which the consumer credit department of a bank wants to automate the decision-making process for approving home equity lines of credit. The model is based on HMEQ data collected from recent applicants who were granted credit through the current loan underwriting process. The data set consists of 12 predictor (or input) variables and a response (or target) variable (BAD) that indicates whether an applicant defaulted.

Below is the high-level architecture of the solution I propose.

Then, I assume that:

A Data Scientist runs TensorFlow model experiments in the Development environment and tracks them using MLflow.

He/she registers the Champion candidate in SAS® Model Manager with the SAS pzmm and sasctl libraries. The Champion model is subjected to a validation process. If it passes, the model is deployed on Red Hat OpenShift (OKD) by SAS Workflow Manager, using Google’s TensorFlow Serving image, in an OKD project previously created by the IT Cluster Admin.

For the demo, IT deploys a client application stack to simulate scoring requests. It includes a dedicated sidecar container that pushes logs directly to a backend. The logs are stored in a PostgreSQL database.

Then, the logs are consumed by a performance monitoring service that sends a notification if the model underperforms.

As time goes by and the model starts to underperform, SAS® Workflow Manager triggers automated retraining based on the field data and sends a message in Microsoft Teams.

The Data Scientist receives the notification and starts a new training process.
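To make the "simulate scoring requests" step a bit more concrete, here is a minimal sketch of how a client could build a request for the TensorFlow Serving REST predict endpoint. The host name, the model name hmeq_model and the sample values are my assumptions for illustration only, and the exact input format depends on the signature the model was exported with.

```python
import json
from urllib import request


def build_predict_request(host, model_name, rows):
    """Build (but do not send) a TensorFlow Serving REST predict request."""
    # TensorFlow Serving exposes POST /v1/models/<name>:predict on its REST port.
    url = f"http://{host}/v1/models/{model_name}:predict"
    body = json.dumps({"instances": rows}).encode("utf-8")
    return request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical host and a made-up applicant using two HMEQ columns.
req = build_predict_request(
    "tf-serving.example.local:8501",
    "hmeq_model",
    [{"LOAN": 1100.0, "DEBTINC": 34.8}],
)
print(req.full_url)
```

Sending the request is then a matter of passing req to urllib.request.urlopen once a serving endpoint is actually reachable.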

Now that you know more about the project, we can dive into the role of SAS® Model Manager and SAS® Workflow Manager.

3. SAS® for Continuous Delivery Machine Learning Models

In a famous article published on Martin Fowler’s blog, the authors state that

A Continuous Delivery orchestration tool…governs how models and applications are deployed to production

At SAS, we have two guys that do that job

SAS® Model Manager is a ModelOps platform to register, validate, deploy to production, monitor and retrain your models.

SAS® Workflow Manager is the ModelOps orchestrator. It provides task automation (for example, you can send email notifications or execute specific jobs) through workflow definitions. These can represent both directed acyclic graphs (DAGs) and cyclic graphs, which matters for making parts of the process repeatable (such as the continuous review of model performance).

The SAS Persuaders for CD4ML!

And, thanks to their integration, they cover the end-to-end Continuous Delivery for Machine Learning process described on Fowler’s blog.

3.1 SAS® Model Manager as Model Registry

In our scenario, because the TensorFlow model was previously validated, SAS® Model Manager acts just as a Model Registry.

To version the Champion model, I create a zip package of the model with the minimum requirements (model variables and model properties) using the SAS® pzmm module, and then I register it with SAS® sasctl, a package that enables easy communication between the SAS® Viya platform and a Python runtime.
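To give an idea of what "a zip package with minimum requirements" means, here is a small standard-library sketch that bundles a model file with two JSON metadata files. It is not the pzmm API: the function name, the file names and the metadata fields are my assumptions for illustration.

```python
import io
import json
import zipfile


def build_model_package(model_bytes, input_vars, properties):
    """Return an in-memory zip holding the model plus its JSON metadata."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        # File names below are illustrative, not a documented contract.
        archive.writestr("model.bin", model_bytes)
        archive.writestr("inputVar.json", json.dumps(input_vars, indent=2))
        archive.writestr("ModelProperties.json", json.dumps(properties, indent=2))
    return buffer.getvalue()


# Made-up metadata for a credit scoring model with target BAD.
package = build_model_package(
    b"serialized-model-placeholder",
    [{"name": "LOAN", "type": "decimal"}, {"name": "DEBTINC", "type": "decimal"}],
    {"name": "Tensorflow_BoostedTreesClassifier", "targetVariable": "BAD"},
)
print(len(package), "bytes")
```

The resulting bytes are what a registration call would upload to the model repository.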

Here is the register function I wrote:

Below is the Tensorflow_BoostedTreesClassifier model versioned in the sas_modelops_tensorflow_openshift project.

3.2 SAS® Workflow Manager for Automation

Once I version the model, the rest of the demo is “orchestrated” by SAS® Workflow Manager.

In fact, for continuous delivery, we need to automate a process that:

Allows the user to validate the model as Champion.

Builds a serving image with the validated Champion model using Google’s TensorFlow Serving base image.

Deploys the image by registering it in the Red Hat OpenShift (OKD) Docker registry.

Monitors the model in production.

Retrains the model if it starts underperforming.

And, of course, sends email and Microsoft Teams notifications for each step to the Data Scientist and IT people.
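As an illustration of the notification step, here is a minimal sketch that assumes the Teams messages go through an incoming webhook, which accepts a simple JSON body with a "text" field. The function names and the message wording are hypothetical, and the webhook URL has to come from your own Teams channel.

```python
import json
from urllib import request


def build_teams_payload(model_name, step, status):
    """Build the JSON body for a simple Teams incoming-webhook message."""
    text = f"Model '{model_name}': step '{step}' finished with status {status}."
    return json.dumps({"text": text})


def post_to_teams(webhook_url, payload):
    """POST the payload to the webhook (not called in this sketch)."""
    req = request.Request(
        webhook_url,
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with request.urlopen(req, timeout=10) as response:
        return response.status


payload = build_teams_payload(
    "Tensorflow_BoostedTreesClassifier", "Build TF Serving Image", "SUCCESS"
)
print(payload)
```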

Below you can see the workflow definition that we have built to cover all these steps.

Just to give you some elements, the workflow is composed of

sequence flows (arrows) between workflow diagram elements, which indicate the order in which tasks are executed.

processes, which are collections of activities designed to produce a specific output for a particular objective, potentially involving both human interactions (user tasks, gray boxes with a little man) and system interactions (service tasks, gray boxes with a gear). In particular, service tasks invoke external actions such as a REST web service or a job execution.

subprocesses (yellow boxes), which are compound activities or workflows used to deal with complexity.

All of them are finally controlled by

gateways (diamonds with an X), which control the execution path through an instance. In our case, we have exclusive gateways for “if-then-else” logic.

For each project, you can start a workflow instance of a workflow definition, which executes each task.

Now assume that we start the workflow above in our project.

To keep this article from becoming tedious, let me focus on its core components.

Those are:

Pre-Build TF Serving Image task

Build TF Serving Image task

Deploy TF Serving on Openshift (OKD) task

Production stage subprocess

STEP 1: Pre-Build TF Serving Image task

The “Pre-Build TF Serving Image” task executes 0_prebuild.sh, which is paired with the prebuild.py custom package, using a SAS Job. The package downloads the model artifact to the server using the ModelRepository REST service and a configuration file, environment.yaml.

Below you can find the SAS Job

that wraps 0_prebuild.sh.

And this is a view of the “Pre-Build TF Serving Image” service task in SAS® Workflow Designer.

As you might have guessed, all service tasks exploit the SAS® Viya capability to control calls to OS executables from SAS Job code (the XCMD property).

So I will not spell it out again for the next tasks.

STEP 2: Build TF Serving Image task

The “Build TF Serving Image” task executes 1_build.sh, which builds the image in the local registry based on the model artifact name and a temporary TensorFlow Serving container.
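The "temporary container" approach follows the pattern described in the TensorFlow Serving documentation: start a base container, copy the model in, commit it as a new image, then discard the container. The sketch below only assembles those Docker commands in Python; it is not the 1_build.sh script itself, and the function names, paths and image tag are assumptions for illustration.

```python
import subprocess


def build_serving_image_commands(model_dir, model_name, image_tag,
                                 base_image="tensorflow/serving",
                                 container="serving_base"):
    """Return the docker commands that bake a model into a serving image."""
    return [
        # 1. Start a temporary container from the TensorFlow Serving base image.
        ["docker", "run", "-d", "--name", container, base_image],
        # 2. Copy the exported model into the container's model directory.
        ["docker", "cp", model_dir, f"{container}:/models/{model_name}"],
        # 3. Commit the container as a new image that serves this model.
        ["docker", "commit", "--change", f"ENV MODEL_NAME {model_name}",
         container, image_tag],
        # 4. Stop and remove the temporary container.
        ["docker", "kill", container],
        ["docker", "rm", container],
    ]


def run_all(commands):
    """Run each command in order, stopping at the first failure (not called here)."""
    for command in commands:
        subprocess.run(command, check=True)


commands = build_serving_image_commands(
    "./models/hmeq_model", "hmeq_model", "hmeq_model_serving:1"
)
for command in commands:
    print(" ".join(command))
```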

Here is the content of the 1_build.sh script.

STEP 3: Deploy TF Serving on Openshift (OKD) task

The “Deploy TF Serving on Openshift (OKD)” task executes 2_deploy.sh, which pushes the new Champion model serving image to the OpenShift Container Platform remotely.
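Pushing to the OKD registry mostly comes down to naming the image correctly: the registry host, then the project, then the image name and tag. Here is a minimal sketch of that naming plus the tag and push commands. It is not the 2_deploy.sh script; the registry host, the project name and the function names are hypothetical, and I assume you are already logged in to the registry.

```python
def image_reference(registry_host, project, image_name, tag="latest"):
    """Compose the <registry>/<project>/<image>:<tag> reference for OKD."""
    return f"{registry_host}/{project}/{image_name}:{tag}"


def deploy_commands(local_image, registry_host, project, image_name, tag="latest"):
    """Return the docker commands that tag and push a local image."""
    remote = image_reference(registry_host, project, image_name, tag)
    return [
        ["docker", "tag", local_image, remote],
        ["docker", "push", remote],
    ]


# Hypothetical registry route and project name.
steps = deploy_commands(
    "hmeq_model_serving:1",
    "registry.apps.okd.example.local",
    "modelops-demo",
    "hmeq_model_serving",
    tag="1",
)
for step in steps:
    print(" ".join(step))
```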

That’s an example of a particular registration.

Below are the notification I receive on Teams and the image registered in OKD’s Docker registry.

So,

The TensorFlow model is in production on OKD!
Thanks, SAS Workflow Manager =)

STEP 4: Production stage subprocess

Now the model is in production. It scores, and the logs are stored in a PostgreSQL container on OKD.

We should be happy, right?

Hell, NO! =)

Indeed, the customer challenges us by asking whether SAS can monitor the model and trigger retraining if needed.

It’s time to monitor the model and retrain if needed…

This sounds a bit complex, right?

But, fortunately for us, we have subprocesses, which are useful when you have to deal with complexity in a workflow.

In our case, we define a subprocess to handle the production stage operations, which you can see below.

In particular,

It runs a model performance monitoring job.

If the job succeeds, it stores the value of one particular statistic (KS in this case).

Finally, if the KS value is below a minimum threshold (0.45 in the example), the model is underperforming. A notification is then sent and model retraining is automatically triggered.
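The gateway logic of this subprocess can be sketched in a few lines of plain Python. This is only an illustration of the idea, not the SAS monitoring job: the function names are mine, and KS is computed here as the maximum gap between the cumulative score distributions of defaulters and non-defaulters.

```python
def ks_statistic(scores, labels):
    """Kolmogorov-Smirnov statistic for binary labels (1 = event, 0 = non-event)."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes are needed to compute KS")
    pairs = sorted(zip(scores, labels))
    cum_pos = cum_neg = 0
    best = 0.0
    i = 0
    while i < len(pairs):
        # Consume all observations sharing the same score before comparing.
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            if pairs[j][1] == 1:
                cum_pos += 1
            else:
                cum_neg += 1
            j += 1
        best = max(best, abs(cum_pos / n_pos - cum_neg / n_neg))
        i = j
    return best


def needs_retraining(ks_value, threshold=0.45):
    """Exclusive gateway: retrain only when KS falls below the threshold."""
    return ks_value < threshold


# Made-up scores and outcomes, just to exercise the logic.
scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
labels = [0, 0, 0, 1, 0, 1, 1, 1]
ks = ks_statistic(scores, labels)
print(ks, needs_retraining(ks))
```

With these made-up numbers the model separates the two classes well enough, so the retraining branch is not taken. A KS value of, say, 0.30 would take it.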

Of course, the “run retraining” task trains the model on the data collected in the PostgreSQL database for performance monitoring and registers a new version of the model once retraining ends.

Here is an example of the Performance Monitoring dashboard we get.

Below are the build_train_pipeline function of train.py

and some code from register.py

executed by the run retraining task through the associated job.

In the end, because our model underperforms, retraining is triggered and the new version of the model is registered in SAS® Model Manager. And of course, the production stage ends successfully and a new model life cycle may start again!

4. Final Considerations

Honestly, I’m speechless when it comes to final considerations this time, so I’ll jump to the summary.

Just let me say one thing:

What a hell of a project!

5. Summary

I started the article with the customer’s question:

Can SAS operationalize TensorFlow models on OpenShift?

Well, the answer is yes.

Indeed, I showed how the integration between SAS® Model Manager and SAS® Workflow Manager allows us to cover the Continuous Delivery for Machine Learning end-to-end process.

As always, I personally learned a lot of new things: