On-call Manual: Boost your career by improving your team's on-call
Kill two birds with one stone.
I have yet to find a team maintaining critical systems that is happy with its on-call. Most engineers dread their on-call shifts and want to forget about on-call as soon as their shift ends. For some, hectic on-call shifts are the reason to leave the team or even the company.
But this is great news for you. All these factors make improving on-call a great career opportunity. Here are a few reasons:
Team-wide impact. Making the on-call better increases work satisfaction for everyone on the team.
Finding work is easy. No on-call is perfect. There's always something to fix.
No competition. Most engineers consider work related to on-call uninteresting, so you can fully own the entire area. As a result, your scope might be bigger than that of any other development work you own.
Getting started
It is difficult to propose meaningful improvements to your team's on-call before your first shift. You need to become familiar with your team's on-call responsibilities and problems before trying to make it better.
Once you have a few shifts under your belt, you should know the most problematic areas. Come up with a few concrete actions to remedy the biggest issues. This list doesn't have to be complete to get started. Some examples include tuning (or deleting) the noisiest alerts, refactoring fragile code, or automating time-consuming manual tasks.
Talk to your manager about the improvements you want to make. No manager who cares about their team would refuse the offer to improve the team's on-call. If the timing is not right (e.g., your team is closing a big release), ask your manager when a better time would be. Mention that you may need their help to ensure the participation of all team members.
Set your expectations right. Despite the improvements, don't expect your team members to suddenly start loving their on-call. It's a win if they stop dreading it.
Execution
In my experience, the two most effective ways to improve the on-call are regular (e.g., twice a year) fixathons and ongoing maintenance, used together.
During a fixathon, the entire team spends a few days fixing the biggest on-call issues. In most cases, these will be issues that have started occurring since the previous fixathon but weren't taken care of by on-calls during their shifts. You may need to work closely with your manager to ensure the entire team's participation, especially at the beginning.
Ongoing maintenance means fixing problems as they arise, usually by the person on call. Because some shifts are heavier than others, the on-call may not always be able to address every issue.
Your role
Before talking about what your role is, let's talk about what your role isn't.
Your role isn't to single-handedly fix all on-call issues.
This approach doesn't scale. If you try it, you will eventually burn out, struggling to do two full-time jobs simultaneously: your regular responsibilities and fixing on-call issues. The worst part is that your team members won't feel responsible for maintaining the on-call quality. They might even care less because now somebody is fixing issues for them.
While you should still participate in fixing on-call issues, your main role is to:
organize fixathons - identify the most pressing issues and distribute issues for the team to work on, track progress, and measure the improvement
ensure on-calls are addressing issues they encountered during their shifts
build tools - e.g., dashboards to monitor the quality of the on-call or queries that let you identify the biggest problems quickly
If you do this consistently, your team members will eventually find fixing on-call issues natural.
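As an illustration of the "queries" idea above, here is a minimal sketch that ranks alerts by how often they fired without requiring any action. It assumes you can export alert history as simple records; the Alert type, its fields, and the alert names are hypothetical and not tied to any particular monitoring tool.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    # Hypothetical record shape; adapt it to whatever your alerting tool exports.
    name: str         # alert identifier
    actionable: bool  # did the on-call actually have to do something?


def noisiest_alerts(alerts, top_n=3):
    """Return (name, times_fired, times_non_actionable) tuples.

    Alerts with the most non-actionable firings come first; ties are broken
    by total firings, then by name so the order is deterministic.
    """
    fired = Counter(a.name for a in alerts)
    noise = Counter(a.name for a in alerts if not a.actionable)
    ranked = sorted(fired, key=lambda n: (-noise[n], -fired[n], n))
    return [(n, fired[n], noise[n]) for n in ranked[:top_n]]


if __name__ == "__main__":
    # Made-up sample data, for illustration only.
    history = (
        [Alert("disk_usage_warning", False)] * 5
        + [Alert("high_latency", True)] * 3
        + [Alert("high_latency", False)]
        + [Alert("cert_expiry", True)]
    )
    for name, total, non_actionable in noisiest_alerts(history):
        print(f"{name}: fired {total}x, {non_actionable} non-actionable")
```

A ranking like this can help pick candidates for a fixathon: alerts near the top are the ones most worth tuning or deleting. Running it before and after a fixathon is one rough way to measure the improvement.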
Skills you will learn
Driving on-call improvements will help you hone a few skills that are key for successful senior and even staff engineers:
leading without authority - as the owner of the on-call improvement area you're responsible for coming up with the plan and leading its execution
scaling through others - because you involve the entire team, you can get much more done than if you did it yourself
influencing the engineering culture of the team - ingraining a sense of responsibility for the on-call quality in team members is an impactful change
holding people accountable - making sure everyone does their part is always a challenge
identifying problems worth solving - instead of being told what problems to solve, you are responsible for finding these problems and deciding if they are worth solving
Expanding your scope
Once you start seeing the results of your work, you can take it further to expand your scope.
You can become the engineer who manages the on-call rotation for your team. This work doesn't take a lot of time but can save a lot of headaches for your manager. The typical responsibilities include:
managing the on-call schedule
organizing the onboarding of new team members to the on-call rotation
helping figure out shift swaps and substitutions
Another way to increase your scope is to share your experience with other teams. You can organize talks showing what you did, the results you achieved, and what worked and what didn't. You can also generalize the tools you built so that other teams can use them.
Thanks for reading!
-Pawel