Incident reviews
Nobody enjoys it when things go wrong, but we accept that it happens. When it does, we try to hold an incident review so that we can learn from our mistakes. Incident reviews are not about attributing blame, and in doing them we follow the Prime Directive:
“Regardless of what we discover, we understand and truly believe that everyone did the best job they could, given what they knew at the time, their skills and abilities, the resources available, and the situation at hand.”
Norm Kerth, Project Retrospectives: A Handbook for Team Reviews
Our reviews take one of two forms, depending on how serious the incident was:
An ‘Oops’ blog post
If the problem was minor and had only a small effect, we should write an ‘Oops’ blog post about it and share it on the tech blog. The post should have three sections:
- What happened?
- Why did it happen?
- How are we making sure it doesn’t happen again?
A full incident review
If the problem was more significant, we can choose to conduct a full incident review. We must always conduct a full review for security incidents.
A full incident review should be conducted with all relevant members of the team present, and have the following broad structure:
- What happened?
- What were the effects of it happening?
- Why did it happen?
- What would have prevented it happening?
- What are we changing to make sure it doesn’t happen again?
The resulting incident report should then be circulated to the Technology Team for comments, with actions assigned where necessary.