MLOps Community

A Gentle Introduction to Backend Monitoring


April 4, 2023
Lina Weichbrodt

Everything you need to know if you don’t have production experience

If you are a developer or data scientist without production experience, this article is for you. I will give you a hands-on introduction to the foundations of backend monitoring based on the best practices of IT-first companies like Google. You will learn about metrics, logging, dashboards, and alerting. If you prefer to watch a video instead, check the first half of my PyData talk.

If you are also interested in machine learning monitoring, check the second part of this series: A Practitioner’s Guide to Monitoring Machine Learning Applications.

You cannot avoid problems, so focus on detection

If you are without production experience, you might wonder why monitoring is important. On-call engineers in companies of all sizes would tell you that everyone experiences problems on a regular basis. Need proof? Even highly professional companies like Google Cloud and AWS provide status dashboards for their users to track problems (Google, AWS).

No matter how hard you try (and you should add tests and CI/CD pipelines!), you cannot eliminate bugs or other human errors. Instead of striving for zero live problems, we focus on identifying problems in real time, as soon as possible. We then categorize the problems based on their severity and fix them accordingly.

Measure the Four Golden Signals

Fortunately, established guidelines from Site Reliability Engineering (SRE) exist (read Google’s SRE book online), so we don’t need to reinvent the wheel. A best practice is to keep an eye on the four golden signals:

  1. Latency: the time it takes to serve a request
  2. Traffic: the number of requests per second/minute
  3. Errors: the number of requests that fail
  4. Saturation: the load on your network and servers

By prioritizing these four signals, we focus on end-user pain. These metrics show you that something is wrong: your service is answering slowly, responding with errors, or suddenly receiving fewer requests than normal (which indicates a problem with the application that calls you).
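To make the signals concrete, here is a minimal sketch that computes three of them from a batch of request records. The `Request` record, the field names, and the choice of the 95th percentile are my assumptions for illustration, not a standard. Saturation is missing on purpose: it usually comes from host metrics such as CPU or memory usage, not from request records.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    """One served request (hypothetical record layout)."""
    duration_seconds: float
    status_code: int


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


def golden_signals(requests: list[Request], window_seconds: float) -> dict[str, float]:
    """Summarize latency, traffic, and errors for one time window."""
    if not requests:
        return {"latency_p95_seconds": 0.0, "traffic_per_second": 0.0, "error_rate": 0.0}
    durations = [r.duration_seconds for r in requests]
    # Assumption: only 5xx responses count as failed requests.
    failed = sum(1 for r in requests if r.status_code >= 500)
    return {
        "latency_p95_seconds": percentile(durations, 95),
        "traffic_per_second": len(requests) / window_seconds,
        "error_rate": failed / len(requests),
    }
```

In practice your metric collection service does this aggregation for you; the sketch only shows what the numbers on the dashboard mean.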

Ok, now you might be wondering how to actually measure these signals. If you work in a company or team with existing backend services, there is a good chance that there is already a metric collection service in place that you can use. If that is not the case, I recommend that you use a hosted metric collection service. You pay a bit more but can get to production quicker and you don’t have to maintain a metric collection service. A popular tool for collecting metrics is Prometheus (you can use it as a hosted service on AWS or other vendors).

Metric collection is implemented in two steps:

  1. Add metrics to your code: Your metric collection provider typically offers a library in the programming language of your choice. You create metric objects that you call on each request. Here is a Python example:
from prometheus_client import Histogram
h = Histogram('request_latency_seconds', 'Description of histogram')
h.observe(4.7) # Observe 4.7 (seconds in this case)

  2. Store the metrics: If you choose Prometheus as your metric service, it calls your application every n seconds on the API endpoint /metrics, which lists the current metric values. Other metric collection services, like AWS CloudWatch, are push-based: you install the CloudWatch agent on the machine that runs your application, and the agent regularly sends the observed metrics to the metric service, e.g. every minute.
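The two steps can be sketched together with the Prometheus Python client. The handler, the metric names, and the `outcome` label are hypothetical and chosen for illustration. I use a separate `CollectorRegistry` so the example is self-contained; a real service would often rely on the default registry.

```python
from prometheus_client import CollectorRegistry, Counter, Histogram, generate_latest

# A dedicated registry keeps this sketch isolated from other metrics.
registry = CollectorRegistry()

REQUEST_LATENCY = Histogram(
    "request_latency_seconds",
    "Time spent serving a request",
    registry=registry,
)
REQUEST_COUNT = Counter(
    "requests_total",
    "Number of requests served, by outcome",
    ["outcome"],
    registry=registry,
)


def handle_request(payload: dict) -> dict:
    """Hypothetical request handler that records latency, traffic, and errors."""
    # The timer observes the duration even if the block raises.
    with REQUEST_LATENCY.time():
        if "user_id" not in payload:
            REQUEST_COUNT.labels(outcome="error").inc()
            raise ValueError("missing user_id")
        REQUEST_COUNT.labels(outcome="success").inc()
        return {"greeting": f"hello {payload['user_id']}"}


def metrics_payload() -> str:
    """Text that a /metrics endpoint would return to the Prometheus scraper."""
    return generate_latest(registry).decode("utf-8")
```

After a few calls to `handle_request`, `metrics_payload()` contains lines such as `requests_total{outcome="success"} 1.0`. In a running service you would typically not build the endpoint yourself: `prometheus_client.start_http_server(8000)` serves the metrics over HTTP (the port number is just an example).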

Build dashboards and automated alerts

Once you have collected metrics, you create dashboards. The dashboards allow you to observe your metrics over time and to debug the service behavior once you notice a problem.

Next, create automated alerts. The alerts will notify you if your metric is below or above a defined threshold. You can be notified on a communication channel of your choice: email, SMS, a messenger, or an app that is used by the on-call engineer.
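Alerting tools let you configure such rules without writing code, but the underlying logic is simple. Here is a minimal sketch of a threshold check; the function name, the message format, and the example thresholds are assumptions for illustration.

```python
from typing import Optional


def check_threshold(
    metric_name: str,
    value: float,
    lower: Optional[float] = None,
    upper: Optional[float] = None,
) -> Optional[str]:
    """Return an alert message if value is outside the allowed range, else None."""
    if upper is not None and value > upper:
        return f"ALERT: {metric_name}={value} is above threshold {upper}"
    if lower is not None and value < lower:
        return f"ALERT: {metric_name}={value} is below threshold {lower}"
    return None


# Example: alert on a high error rate and on unusually low traffic.
alerts = [
    check_threshold("error_rate", 0.07, upper=0.05),
    check_threshold("requests_per_minute", 120, lower=50),
]
triggered = [message for message in alerts if message is not None]
```

Here only the error-rate rule fires. The lower bound on traffic matters as much as the upper bounds: a sudden drop in requests is often the first sign that a caller is broken.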

What technologies should you use to create dashboards and alerts? As with metric collection, it makes sense to use the existing dashboard and alerting tools of your company. If you are selecting these components yourself, check if your metric collection service also supports dashboards and alerting (many of them do). If you later want better dashboards, I can recommend the separate dashboard service Grafana, which can read from many metric collection services. But it’s ok to start simple.

Collect Logs

If you get an alert that your application responds with errors, wouldn’t it be nice to know why? Logs help you to answer that question.

Logs are text lines that print intermediate information during the processing of the request.

The printed lines are typically sent to a central location where you can query them. A nice service provider for centralized logging is DataSet.

How to implement logging:

  1. Add logs to your application: log errors (add diagnostic information to an exception), log warnings (for business logic problems), and log at the “info” level (progress updates with extra information, to later follow a request through the code)
  2. Store logs at a central, searchable location, e.g. AWS CloudWatch, DataSet
  3. Define alerts for too many warnings or errors in the log
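The first step can be sketched with Python's standard `logging` module. The service name, the order-handling function, and the limit of 100 are hypothetical; the point is which level goes with which kind of event.

```python
import logging

logging.basicConfig(format="%(asctime)s %(levelname)s %(name)s %(message)s")
logger = logging.getLogger("order_service")  # hypothetical service name
logger.setLevel(logging.INFO)


def handle_order(order_id: str, raw_quantity: str) -> int:
    """Hypothetical handler that logs at the info, warning, and error levels."""
    # Info: progress updates that let you follow a request through the code.
    logger.info("order=%s received quantity=%r", order_id, raw_quantity)
    try:
        quantity = int(raw_quantity)
    except ValueError:
        # Error: logger.exception adds the stack trace to the log entry.
        logger.exception("order=%s has a non-numeric quantity=%r", order_id, raw_quantity)
        raise
    if quantity > 100:
        # Warning: a business logic problem that does not stop the request.
        logger.warning("order=%s quantity=%d exceeds the usual maximum of 100", order_id, quantity)
    logger.info("order=%s accepted", order_id)
    return quantity
```

Putting an identifier like the order ID into every line is what makes the central log store useful: you can search for one ID and see the whole path of that request.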

Need more observability?

The sections above are usually enough for most companies. If you need more advanced observability, e.g. to find bottlenecks in calls across multiple services, you can look into the topic “distributed tracing” with APM tools like AWS X-Ray. However, most of you will not need it.

Key Takeaways

  1. Monitor the four golden signals (latency, traffic, errors, and saturation) with automated alerts
  2. Add logging to your application code and store the logs in an easily searchable location
  3. If possible, use the existing metrics, logging, dashboarding, and alerting tools of your company. Otherwise, prefer managed solutions.

Did I forget anything? Leave your comment below or drop me a line at [email protected].
