A Gentle Introduction to Backend Monitoring
April 4, 2023
Everything you need to know if you don’t have production experience
If you are a developer or data scientist without production experience, this article is for you. I will give you a hands-on introduction to the foundations of backend monitoring based on the best practices of IT-first companies like Google. You will learn about metrics, logging, dashboards, and alerting. If you prefer to watch a video instead, check the first half of my PyData talk.
If you are also interested in machine learning monitoring, check the second part of this series: A Practitioner’s Guide to Monitoring Machine Learning Applications.
You cannot avoid problems, so focus on detecting them
If you have no production experience, you might wonder why monitoring is important. On-call engineers in companies of all sizes will tell you that everyone experiences problems on a regular basis. Need proof? Even highly professional providers like Google Cloud and AWS offer status dashboards so that their users can track problems (Google, AWS).
No matter how hard you try (and you should add tests and CI/CD pipelines!), you cannot eliminate bugs or other human errors. Instead of striving for zero live problems, we focus on identifying problems as soon as possible, ideally in real time. We then categorize the problems based on their severity and fix them accordingly.
Measure the Four Golden Signals
Fortunately, established guidelines from Site Reliability Engineering (SRE) exist (read Google’s SRE book online), so we don’t need to reinvent the wheel; a best practice is to keep an eye on the four golden signals:
- Latency: the time it takes to serve a request
- Traffic: the number of requests per second/minute
- Errors: the number of requests that fail
- Saturation: the load on your network and servers
By prioritizing these four signals, we focus on end-user pain. These metrics show you that something is wrong: your service answers slowly, responds with errors, or suddenly receives fewer requests than normal (which often indicates a problem in the application that calls you).
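To make the four signals concrete, here is a minimal sketch that summarizes one observation window of requests. The record type `RequestRecord`, the function `golden_signals`, and the choice of the 95th percentile and of CPU utilization as the saturation measure are assumptions for illustration; a real metric service computes these values for you.

```python
import math
from dataclasses import dataclass


@dataclass
class RequestRecord:
    """One served request (hypothetical record type for this sketch)."""
    duration_s: float  # time it took to serve the request
    status: int        # HTTP status code of the response


def golden_signals(records, window_s, cpu_utilization):
    """Summarize one observation window into the four golden signals."""
    if not records:
        # No traffic at all is a signal in itself: there is nothing to time.
        latency_p95 = None
        error_rate = 0.0
    else:
        durations = sorted(r.duration_s for r in records)
        # Nearest-rank 95th percentile of the request durations.
        latency_p95 = durations[math.ceil(0.95 * len(durations)) - 1]
        # Assumption: only 5xx responses count as failed requests.
        failed = sum(1 for r in records if r.status >= 500)
        error_rate = failed / len(records)
    return {
        "latency_p95_s": latency_p95,
        "traffic_rps": len(records) / window_s,
        "error_rate": error_rate,
        "saturation": cpu_utilization,  # measured on the host, passed in here
    }
```

Calling `golden_signals(records, window_s=60, cpu_utilization=0.4)` once per minute would give you one data point per signal and minute.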
Ok, now you might be wondering how to actually measure these signals. If you work in a company or team with existing backend services, there is a good chance that there is already a metric collection service in place that you can use. If that is not the case, I recommend that you use a hosted metric collection service. You pay a bit more, but you get to production quicker and you don’t have to maintain a metric collection service yourself. A popular tool for collecting metrics is Prometheus (available as a hosted service from AWS and other vendors).
Metric collection is implemented in two steps:
- Add metrics to your code: Your metric collection provider typically offers a library in the programming language of your choice. You create metric objects that you update on each request.
- Store the metrics: If you choose the metric service Prometheus, it calls your application every n seconds on an API endpoint /metrics. The endpoint lists the current metric values. Other metric collection services like AWS CloudWatch push the metrics actively. This is done by installing the CloudWatch agent on the machine that runs your application. The agent regularly sends the observed metrics to the metric service, e.g. every minute.
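The two steps can be sketched in Python as follows. This is a standard-library-only illustration, not a real client: the `Metrics` class, the `monitored` decorator, the metric names, and the `predict` handler are all hypothetical. In practice you would use your provider's library instead (for Prometheus, the official `prometheus_client` package), which also takes care of serving the /metrics endpoint.

```python
import functools
import time
from collections import defaultdict


class Metrics:
    """Tiny in-memory metric store (hypothetical, for illustration only)."""

    def __init__(self):
        self.counters = defaultdict(float)

    def inc(self, name, amount=1.0):
        self.counters[name] += amount

    def render(self):
        """List current values as 'name value' lines, as a /metrics endpoint would."""
        lines = [f"{name} {value}" for name, value in sorted(self.counters.items())]
        return "\n".join(lines) + "\n"


METRICS = Metrics()


def monitored(handler):
    """Update traffic, error, and latency metrics on every call of handler."""
    @functools.wraps(handler)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        METRICS.inc("requests_total")
        try:
            return handler(*args, **kwargs)
        except Exception:
            METRICS.inc("request_errors_total")
            raise
        finally:
            # Sum of durations; divide by requests_total for the mean latency.
            METRICS.inc("request_duration_seconds_sum", time.perf_counter() - start)
    return wrapper


@monitored
def predict(x):
    """Hypothetical request handler."""
    if x < 0:
        raise ValueError("x must be non-negative")
    return x * 2
```

After a few calls to `predict`, `METRICS.render()` returns the text that a pull-based collector would read on each scrape, or that an agent would push to the metric service.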
Build dashboards and automated alerts
Once you have collected metrics, you create dashboards. The dashboards allow you to observe your metrics over time and to debug the service behavior once you notice a problem.
Next, you create automated alerts. An alert notifies you if a metric is below or above a defined threshold. You can get notified on a communication channel of your choice: email, SMS, or messenger. You can also send the alerts to an app that is used by the on-call engineer.
What technologies should you use to create dashboards and alerts? Similar to the metric collection, it makes sense to use the existing dashboard and alerting tools of your company. If you are selecting these components yourself, check if your metric collection service also supports dashboards and alerting (many of them do). If you later want better dashboards, I can recommend the separate dashboard service Grafana, which can read from many metric collection services. But it’s ok to start simple.
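The threshold logic behind such an alert is simple. Here is a minimal sketch, assuming hypothetical names (`AlertRule`, `check_alerts`) and a `notify` callable that stands in for your email, SMS, or on-call integration; alerting tools let you configure the same thing without writing code.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class AlertRule:
    """Fires when a metric leaves the range [lower, upper] (hypothetical type)."""
    metric: str
    lower: Optional[float] = None  # alert if the value drops below this
    upper: Optional[float] = None  # alert if the value rises above this

    def evaluate(self, value: float) -> Optional[str]:
        if self.upper is not None and value > self.upper:
            return f"ALERT: {self.metric}={value} is above {self.upper}"
        if self.lower is not None and value < self.lower:
            return f"ALERT: {self.metric}={value} is below {self.lower}"
        return None


def check_alerts(rules: List[AlertRule], values: Dict[str, float],
                 notify: Callable[[str], None]) -> None:
    """Evaluate every rule against the latest metric values and notify on breaches."""
    for rule in rules:
        if rule.metric in values:
            message = rule.evaluate(values[rule.metric])
            if message is not None:
                notify(message)
```

A rule such as `AlertRule("requests_per_minute", lower=10)` covers the case of a sudden drop in traffic, while `AlertRule("error_rate", upper=0.05)` covers too many failing requests. The threshold values here are placeholders that you would tune for your service.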
Collect Logs
If you get an alert that your application responds with errors, wouldn’t it be nice to know why? Logs help you to answer that question.
Logs are text lines that print intermediate information during the processing of a request.
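As an illustration, here is a small sketch using Python's built-in `logging` module. The service name `order_service`, the `PRICES` table, and the functions `lookup_price` and `process_order` are hypothetical.

```python
import logging

logging.basicConfig(format="%(asctime)s %(levelname)s %(name)s: %(message)s")
logger = logging.getLogger("order_service")  # hypothetical service name
logger.setLevel(logging.INFO)

PRICES = {"A-1": 9.99}  # hypothetical price table


def lookup_price(order_id: str) -> float:
    return PRICES[order_id]


def process_order(order_id: str, quantity: int) -> float:
    """Hypothetical request handler that logs its intermediate steps."""
    logger.info("order_id=%s received, quantity=%d", order_id, quantity)
    if quantity > 100:
        # Business logic problem: unusual, but the request can still be served.
        logger.warning("order_id=%s unusually large quantity=%d", order_id, quantity)
    try:
        total = quantity * lookup_price(order_id)
    except KeyError:
        # logger.exception logs at ERROR level and appends the traceback.
        logger.exception("order_id=%s has no price entry", order_id)
        raise
    logger.info("order_id=%s done, total=%.2f", order_id, total)
    return total
```

With this format, each line starts with a timestamp, followed by the level, the logger name, and the message. Putting the same `order_id` into every message lets you follow a single request through the code later.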
The printed lines are typically sent to a central location where you can query them. A nice service provider for centralized logging is DataSet.
How to implement logging:
- Add logs to your application: log errors (add diagnostic information to an exception), log warnings (for business logic problems), and log at the “info” level (progress updates with extra information, to later follow a request through the code)
- Store logs at a central, searchable location, e.g. AWS CloudWatch, DataSet
- Define alerts for too many warnings or errors in the log
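Central log services usually let you define the last step, alerts on log volume, in their own configuration. To show the idea in code, here is a minimal sketch of a custom `logging.Handler` that counts warnings and errors inside the application; the names `LevelCountingHandler` and `too_many_problems` and the limits are assumptions for illustration.

```python
import logging
from collections import Counter


class LevelCountingHandler(logging.Handler):
    """Counts log records per level so an alert rule can look at the totals."""

    def __init__(self):
        # Only records at WARNING level or higher reach this handler.
        super().__init__(level=logging.WARNING)
        self.counts = Counter()

    def emit(self, record):
        self.counts[record.levelname] += 1


def too_many_problems(counts, max_warnings, max_errors):
    """Return True if either count exceeds its limit (limits are placeholders)."""
    return counts["WARNING"] > max_warnings or counts["ERROR"] > max_errors
```

You would attach the handler with `logger.addHandler(LevelCountingHandler())`, check `too_many_problems` at a regular interval, and reset the counts after each check.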
Need more observability?
The sections above are usually enough for most companies. If you need more advanced observability, e.g. to find bottlenecks in calls across multiple services, you can look into the topic “distributed tracing” with APM tools like AWS X-Ray. However, most of you will not need it.
Key Takeaways
- Monitor the four golden signals (latency, number of requests, number of errors, and saturation) with automated alerts
- Add logging to your application code and store the logs in an easily searchable location
- If possible, use the existing metrics, logging, dashboarding, and alerting tools of your company. Otherwise, prefer managed solutions.
Did I forget anything? Leave a comment below or drop me a line by email.