This was my first solo on-call shift on my new team. I was almost ready to go home when a Critical alert fired. I acknowledged it almost instantly and started troubleshooting. But this was not going well. Wherever I turned, I hit a roadblock. The alert runbook was empty. The dashboards didn't work. And I couldn't see any logs because logging was disabled.
Some team members were still around, and I turned to them for help. I learned that the impacted service had shipped just a week earlier, and barely anyone knew how it worked. The person who wrote and shipped it was on sick leave.
It took us a few hours to figure out what was happening and to mitigate the outage. This work made one thing apparent: this service was not ready for prime time.
In the week following the incident, we filled the gaps we had found during the outage. Our main goal was to ensure that future on-calls wouldn't have to scramble when encountering issues with this service.
But the bigger question remained unanswered: how can we avoid similar issues with any new service or feature we ship in the future?
The idea we came up with was the Service Readiness Checklist.
What is the Service Readiness Checklist?
The Readiness Checklist is a list of requirements that each service (or larger feature) must meet to be considered ready to ship. It serves two purposes:
to guarantee that none of the aspects related to operating the service have been forgotten
to make it clear who is responsible for ensuring that requirements have been reviewed and met
When we are close to shipping, we create a task that contains a copy of the readiness checklist and assign it to the engineer driving the project. They become responsible for ensuring all requirements on the checklist are met.
Having one engineer responsible for the checklist helps avoid situations where some requirements fall through the cracks because everyone thought someone else was taking care of them. The primary job of this engineer is to ensure all checkboxes are checked. They may do the work themselves, or they may assign items to other people involved in the project and coordinate the work.
Occasionally, the checklist owner may decide that some requirements are inapplicable. For example, the checklist may call for setting up deployment, but there is nothing to do if the existing deployment infrastructure automatically covers it.
The checklist will usually contain more than ten requirements. Each one is obvious on its own, but it is easy to miss some simply because there are so many.
Example readiness checklist
There is no single readiness checklist that would work for every team because each team operates differently. They all follow different processes and have their own ways of running their code and detecting and troubleshooting outages. There is, however, a common subset of requirements that can be a starting point for a team-specific readiness checklist:
☐ Has the service/feature been introduced to the on-call?
☐ Has sufficient documentation been created for the service? Does it contain information about dependencies, including the on-calls who own them?
☐ Does the service have working dashboards?
☐ Have alerts been created and tested?
☐ Does the service/feature have runbooks (a.k.a. playbooks)?
☐ Has the service been load tested?
☐ Is logging for the service/feature enabled at the appropriate level?
☐ Is automated deployment configured?
☐ Does the service/feature have sufficient test coverage?
☐ Has a rollout plan been developed?
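If your task tracker doesn't support checklists, the process described above is simple enough to model in a few lines of code. The sketch below is only an illustration, not the tooling our team used: the names `Requirement`, `ReadinessChecklist`, and `new_checklist`, as well as the service and engineer names in the usage example, are all made up. It copies the example checklist into a new task with a single owner, lets the owner mark a requirement as done or not applicable, and treats the service as ready only when nothing is left open.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Status(Enum):
    OPEN = "open"
    DONE = "done"
    NOT_APPLICABLE = "n/a"  # the owner decided the requirement does not apply


# The example checklist from this post; a real team would adapt it.
TEMPLATE = (
    "Has the service/feature been introduced to the on-call?",
    "Has sufficient documentation been created for the service?",
    "Does the service have working dashboards?",
    "Have alerts been created and tested?",
    "Does the service/feature have runbooks?",
    "Has the service been load tested?",
    "Is logging for the service/feature enabled at the appropriate level?",
    "Is automated deployment configured?",
    "Does the service/feature have sufficient test coverage?",
    "Has a rollout plan been developed?",
)


@dataclass
class Requirement:
    question: str
    status: Status = Status.OPEN
    note: str = ""  # e.g. why the requirement was marked not applicable


@dataclass
class ReadinessChecklist:
    service: str
    owner: str  # the one engineer responsible for the whole checklist
    requirements: List[Requirement] = field(default_factory=list)

    def resolve(self, index: int, status: Status, note: str = "") -> None:
        """Mark one requirement as done or not applicable."""
        self.requirements[index].status = status
        self.requirements[index].note = note

    def outstanding(self) -> List[Requirement]:
        """Requirements that still need attention."""
        return [r for r in self.requirements if r.status is Status.OPEN]

    def is_ready(self) -> bool:
        """Ready to ship only when no requirement is left open."""
        return not self.outstanding()


def new_checklist(service: str, owner: str) -> ReadinessChecklist:
    """Create a fresh copy of the template for a service about to ship."""
    return ReadinessChecklist(service, owner, [Requirement(q) for q in TEMPLATE])


if __name__ == "__main__":
    # Hypothetical service and owner names, for illustration only.
    checklist = new_checklist("example-service", owner="alice")
    checklist.resolve(2, Status.DONE)
    checklist.resolve(7, Status.NOT_APPLICABLE,
                      note="covered by existing deployment infrastructure")
    print("Ready to ship:", checklist.is_ready())
    for requirement in checklist.outstanding():
        print("[ ]", requirement.question)
```

Keeping "not applicable" as its own status, rather than just ticking the box, leaves a record of which requirements the owner skipped on purpose and why.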
Success story
Our team was tasked with solving a relatively big problem on a tight timeline. The solution required building a pipeline of a few services. Because we didn't have enough people to implement this infrastructure within the allotted time, we asked for help. Soon after, a few engineers temporarily joined our team. We were worried, however, that this partnership might not work out because of the differences in our engineering cultures. The Service Readiness Checklist was one of the things (others included coding guidelines, interface-based programming, etc.) that helped set clear expectations. With both teams on the same page, the collaboration was smooth, and we shipped the project on time.
Thanks for reading!
-Pawel