On-call Manual: Measuring the quality of the on-call
If you can't measure it, you can't improve it.
Reasonable on-call is no accident. Getting there requires a lot of hard work. But how can you tell if you’re on the right track if the experience can completely change from one shift to another? One answer to this question is monitoring.
How does monitoring help?
At a high level, monitoring can tell you whether the on-call duty is improving, staying the same, or deteriorating over a longer period. Understanding the trend is important for deciding whether the current investment in keeping the on-call reasonable is sufficient.
At a more granular level, monitoring helps identify the areas that need the most attention, such as:
noisy alerts
problematic dependencies
features causing customers’ complaints
repetitive tasks
Continuously addressing the top issues will gradually improve the overall on-call experience.
What metrics to monitor
There is no single correct answer to what metrics to monitor. It depends a lot on what the team does. For example, frontend teams may choose to monitor the number of tickets opened by customers, while backend teams may want to focus more on the time spent fixing broken builds or failing tests. Here are some metrics to consider:
outages of the products the team owns
external incidents impacting the products the team owns
the number of alerts, broken down by urgency
the number of alerts acted on versus ignored
the number of alerts outside working hours
time to acknowledge alerts
the number of tickets opened by customers
the number of internal tasks
build breaks
test failures
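As an illustration, two of the metrics above (time to acknowledge alerts, and alerts acted on versus ignored) could be computed from exported alert records roughly like this. This is a minimal sketch, not how any particular alerting system works: the record layout and the field names `fired`, `acked`, and `acted_on` are assumptions made for the example.

```python
from datetime import datetime
from statistics import median

# Hypothetical alert records; the field names are assumptions for this sketch.
alerts = [
    {"fired": datetime(2024, 3, 4, 10, 0), "acked": datetime(2024, 3, 4, 10, 4), "acted_on": True},
    {"fired": datetime(2024, 3, 4, 23, 30), "acked": datetime(2024, 3, 4, 23, 50), "acted_on": True},
    {"fired": datetime(2024, 3, 5, 14, 0), "acked": None, "acted_on": False},
]


def minutes_to_acknowledge(alert):
    """Minutes between firing and acknowledgement, or None if never acknowledged."""
    if alert["acked"] is None:
        return None
    return (alert["acked"] - alert["fired"]).total_seconds() / 60


def summarize(alerts):
    """Median time to acknowledge plus counts of acted-on and ignored alerts."""
    ack_times = [m for m in map(minutes_to_acknowledge, alerts) if m is not None]
    acted = sum(1 for a in alerts if a["acted_on"])
    return {
        "median_minutes_to_ack": median(ack_times) if ack_times else None,
        "acted_on": acted,
        "ignored": len(alerts) - acted,
    }


summary = summarize(alerts)
print(summary)
```

The median is used here instead of the mean so that a single very slow acknowledgement does not dominate the number; that is a design choice for the sketch, and either could be tracked.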
How to monitor?
On-call monitoring is difficult because there isn’t a single metric that can reflect the health of the on-call. My team uses both quantitative metrics (data) and qualitative metrics (opinions).
Quantitative metrics
Quantitative metrics can usually be collected from alerting systems, bug trackers, and task management systems. Here are a few examples of quantitative metrics we are tracking on our team:
the number of alerts
the number of tasks
the number of alerts outside working hours
the noisiest alerts, tracked by alert ID
As quantitative metrics are collected automatically, we built a dashboard to show them in an easy-to-understand way. Keeping historical data allows us to track trends.
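To make the idea concrete, here is a rough sketch of how the quantitative metrics listed above might be aggregated for one shift before they are fed to a dashboard. It is not our actual implementation: the `(alert_id, fired_at)` event format, the alert IDs, and the Monday-to-Friday 09:00-17:00 working hours are all assumptions made for the example.

```python
from collections import Counter
from datetime import datetime

# Assumed working hours: Monday to Friday, 09:00 to 17:00 local time.
WORK_START_HOUR = 9
WORK_END_HOUR = 17


def is_outside_working_hours(ts):
    """True for weekends and for weekday times outside the assumed working hours."""
    is_weekend = ts.weekday() >= 5  # Monday is 0, so 5 and 6 are Saturday and Sunday
    return is_weekend or not (WORK_START_HOUR <= ts.hour < WORK_END_HOUR)


def shift_metrics(alert_events, top_n=3):
    """Aggregate (alert_id, fired_at) pairs into per-shift metrics."""
    events = list(alert_events)
    per_alert = Counter(alert_id for alert_id, _ in events)
    return {
        "total_alerts": len(events),
        "outside_working_hours": sum(
            1 for _, ts in events if is_outside_working_hours(ts)
        ),
        # The noisiest alerts, tracked by alert ID, most frequent first.
        "noisiest": per_alert.most_common(top_n),
    }


# Hypothetical events for a single shift.
events = [
    ("disk-full", datetime(2024, 3, 4, 2, 15)),     # Monday night
    ("disk-full", datetime(2024, 3, 4, 11, 0)),     # Monday, working hours
    ("high-latency", datetime(2024, 3, 9, 13, 0)),  # Saturday
    ("disk-full", datetime(2024, 3, 5, 18, 30)),    # Tuesday evening
]
metrics = shift_metrics(events)
print(metrics)
```

Storing one such result per shift is enough to draw trend lines later.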
Qualitative metrics
Qualitative metrics are opinions about the shift from the person ending the shift. Using qualitative metrics in addition to quantitative metrics is necessary because numbers are sometimes misleading. Here is an example: handling a dozen tasks that can be closed almost immediately without much effort is easier than collaborating with a few teams to investigate a hard-to-reproduce customer report. However, considering only how many tasks each on-call got during their shift, the first shift appears heavier than the second.
On our team, each person going off-call fills out an On-call survey that is part of the On-call report. Here are some of the questions from the survey:
Rate your on-call experience from 1 to 10 (1: easy, 10: horrible)
Rate your experience with resources available for resolving on-call issues (e.g., runbooks, documentation, tools, etc.) from 1 to 10 (1: no resources or very poor resources, 10: excellent resources that helped solve issues quickly)
How much time did you spend on urgent activities like alerts, fire fighting, etc. (0%-100%)?
How much time did you spend on non-urgent activities like non-urgent tasks, noise, etc. (0%-100%)?
Additional comments (free flow)
We’ve been conducting this survey for a couple of years now. One interesting observation is that a shift one person finds horrible is often decent for someone else. Experienced on-calls usually rate their shifts as easier than developers who have just finished their first shift. This is understandable. We still treat all opinions equally, because improving the on-call quality for one person improves it for everyone.
The Additional comments question is my favorite as it provides insights no other metric can capture.
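The survey ratings can also be turned into a simple trend signal. The sketch below is one possible approach, not the way our report works: it compares the average experience rating of the most recent shifts with the shifts before them. The `SurveyResponse` type, the shift labels, the window of three shifts, and the 0.5-point tolerance are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class SurveyResponse:
    shift: str          # hypothetical shift label, e.g. an ISO week
    experience: int     # 1 (easy) to 10 (horrible), as in the survey
    resources: int      # 1 (very poor) to 10 (excellent), as in the survey
    comments: str = ""  # free-form additional comments


def trend(responses, window=3, tolerance=0.5):
    """Compare the mean experience rating of the last `window` shifts
    with the `window` shifts before them. Lower ratings are better."""
    scores = [r.experience for r in responses]
    if len(scores) < 2 * window:
        return "not enough data"
    previous = mean(scores[-2 * window:-window])
    recent = mean(scores[-window:])
    if recent < previous - tolerance:
        return "improving"
    if recent > previous + tolerance:
        return "deteriorating"
    return "staying the same"


# Made-up ratings for six consecutive shifts, oldest first.
ratings = [(8, 3), (7, 4), (8, 4), (6, 5), (5, 6), (4, 7)]
responses = [
    SurveyResponse(shift=f"2024-W{week:02d}", experience=e, resources=r)
    for week, (e, r) in enumerate(ratings, start=1)
]
print(trend(responses))
```

Because ratings vary a lot from person to person, a signal like this is best read together with the free-form comments, not on its own.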
Call to Action
If being on-call is part of your team’s responsibilities and you don’t monitor it, I highly encourage you to start doing so. Even a simple monitoring system will tell you a lot about your on-call and allow you to improve it by addressing the most annoying issues.
Thanks for reading!
-Pawel