One (selfish) reason to celebrate a new team member is that they will eventually join the on-call rotation. And when they do, the existing shifts will move farther apart. However, adding an unprepared engineer to the on-call rotation can be a disaster. This post describes what on-call onboarding looks like on our team.
The on-call onboarding process is the same for each new team member. It consists of the following steps:
Regular ramp-up
On-call overview
Shadow shift
Reverse shadow shift
First solo shift
Let's look into each of these steps in more detail.
Regular ramp-up
The regular ramp-up aims to help new team members familiarize themselves with the problems the team is solving and teach them how to work effectively in the team's codebase. We want new colleagues to work on the code they will be responsible for when they are on call later. This approach allows them to acquire basic context that will be useful for maintaining this code and troubleshooting issues.
On-call overview
Regular ramp-up is rarely sufficient for new people to grasp the entire infrastructure the team is responsible for. And knowing this infrastructure is just the tip of the iceberg. There is much more an effective on-call needs to be familiar with, for instance:
what the dependencies are, and what impact their failures have
how to find dashboards and use them for debugging
where to find the documentation (e.g., runbooks)
expectations, e.g., whether the on-call is responsible for alerts raised outside working hours
how to do deployments and rollbacks
tools used to troubleshoot and fix issues
standard operating procedures
and more
On our team, we organize knowledge-sharing sessions that give new team members an overview of all these areas. We record these sessions to make revisiting unclear topics easy.
Shadow shift
During the shadow shift, the on-call-in-training (a.k.a. secondary on-call) shadows an experienced on-call (a.k.a. primary on-call). Both on-calls are subscribed to all tasks and alerts, but resolving issues is the primary on-call's responsibility. The primary on-call is expected to show the secondary on-call how to deal with outages. This hands-on guidance is usually limited to problems that occur during working hours. Finally, the primary on-call can ask the secondary on-call to handle non-critical tasks, providing guidance as needed.
Reverse shadow shift
After the shadow shift, things get real: the on-call-in-training becomes the primary on-call. They are now responsible for handling all alerts, tasks, deployments, etc. However, they are not alone: an experienced on-call has their back during the entire shift.
We schedule shadow and reverse shadow shifts back-to-back. This way, everything the on-call-in-training learned during the first shift is fresh when they become the primary on-call.
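The two training shifts differ only in who holds the primary role. Here is a minimal Python sketch of that swap. The names `Shift`, `training_shifts`, and `responder` are hypothetical, and the sketch assumes the experienced engineer fills the secondary slot during the reverse shadow shift; it is an illustration, not a description of our actual tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Shift:
    kind: str       # "shadow" or "reverse shadow"
    primary: str    # responsible for resolving issues
    secondary: str  # follows along or backs up the primary


def training_shifts(trainee: str, mentor: str) -> list[Shift]:
    """Return the two back-to-back training shifts, in order."""
    return [
        # Shadow shift: the experienced on-call leads, the trainee observes.
        Shift("shadow", primary=mentor, secondary=trainee),
        # Reverse shadow shift: the trainee leads, the experienced on-call backs them up.
        Shift("reverse shadow", primary=trainee, secondary=mentor),
    ]


def responder(shift: Shift) -> str:
    """Whoever is primary owns alerts, tasks, and deployments for the shift."""
    return shift.primary


shifts = training_shifts(trainee="new_engineer", mentor="experienced_engineer")
```

In this sketch, `responder(shifts[0])` is the experienced engineer and `responder(shifts[1])` is the new one, which mirrors the handover between the two shifts.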
First solo shift
Once shadowing is complete, we add the new team member to the on-call rotation. We add them to the end of the queue, which gives them additional time to learn more about our systems and infrastructure.
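To make the effect of joining at the end of the queue concrete, here is a small Python sketch. It assumes a simple round-robin rotation with one-week shifts; the engineer names, the start date, and the `schedule` helper are all hypothetical.

```python
from collections import deque
from datetime import date, timedelta


def schedule(rotation, start, shifts, shift_days=7):
    """Return (shift_start, engineer) pairs by cycling through the rotation."""
    queue = deque(rotation)
    plan = []
    for i in range(shifts):
        plan.append((start + timedelta(days=i * shift_days), queue[0]))
        queue.rotate(-1)  # the engineer who just served moves to the back
    return plan


rotation = ["ana", "ben", "chen"]  # existing on-calls (hypothetical)
rotation.append("dana")            # the new team member joins at the end

plan = schedule(rotation, start=date(2024, 1, 1), shifts=5)
for shift_start, engineer in plan:
    print(shift_start, engineer)
```

Under these assumptions, the new team member's first solo shift starts only after everyone already in the rotation has served once, and each existing engineer is now on call every four weeks instead of every three.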
In addition to training new on-calls, our team maintains a chat to discuss on-call problems and get help when resolving issues. Both new and experienced on-calls regularly use this chat when they are stuck because they know someone will be there to help them.
Thanks for reading!
-Pawel