A curated list of awesome tech postmortem resources, inspired by and templated on awesome-python.
<!-- prettier-ignore-start -->
<!-- START doctoc generated TOC please keep comment here to allow auto update -->
<!-- DON'T EDIT THIS SECTION, INSTEAD RE-RUN doctoc TO UPDATE -->
Table of Contents generated with DocToc
<!-- END doctoc generated TOC please keep comment here to allow auto update -->
<!-- prettier-ignore-end -->
Awesome Tech Postmortems
The postmortems, studies, and resources curated here are those which enable learning from incidents.
PSA: Incident Analysis is completely different than the standard postmortem process that you see written about in the Google SRE book and other incident marketing materials. It is a whole field of study and practice on extracting valuable data from incidents focusing on how.
The postmortems linked here generally do not meet the bar that Jones suggests for incident analysis, because very few (if any) tech organizations publish analyses of that kind. However, many of the studies and resources linked do line up with the incident analysis perspective.
Postmortems
- Incident Report: Running Dry on Memory Without Noticing (2019-11-06, Honeycomb)
- I sat in on a Honeycomb incident review and you won't believe what we learned next (2019-11-08, Jacob Scott)
- The Consul outage that never happened (2019-08-07, GitLab)
- We had issues with Monzo on 29th July (2019-07-29, Monzo)
- Details of the Cloudflare outage on July 2, 2019 (2019-07-02, Cloudflare)
- What We Learned from the Recent Mandrill Outage (2019-03-26, Mailchimp)
- Postmortem: Azure DevOps Service Outages in October 2018 (2018-10-16, Azure DevOps Service)
- Incident review: API and Dashboard outage on 10 October 2017 (2017-10-10, GoCardless)
- Postmortem of database outage of January 31 (2017-01-31, GitLab)
Studies
- Approaching Overload: Diagnosis and Response to Anomalies in Complex and Automated Production Software Systems (2018, Marisa Grayson)
- Report from the SNAFUcatchers Workshop on Coping With Complexity (2017-03-16, SNAFUcatchers)
- Counterfactual Thinking, Rules, and The Knight Capital Accident (2013-10-29, John Allspaw)
Bulk incident analysis
- What bugs cause production cloud incidents (2019-05-13, various)
- Incidents — Trends from the Trenches (2019-02-26, Subbu Allamaraju/Expedia)
How to approach postmortems
Other lists of postmortems
- danluu/post-mortems on GitHub
- lorin/major-incidents on GitHub
Contributing
Your contributions are always welcome! Please take a look at the contribution guidelines first.
I will keep some pull requests open if I'm not sure whether those resources are truly awesome tech postmortems; you can vote for them by adding :+1: to them.
If you have any questions about this opinionated list, do not hesitate to contact me @jhscott on Twitter or open an issue on GitHub.