We all know that the beautiful world of software is like a basketball balanced on a Jenga game. It's fine while it's fine but all it takes is a semi colon in the wrong place, a bad database failover or some unlucky road worker to cut a cable and the whole world comes crashing down. Fortunately we're all good people and when the pieces are on the floor we pull together and get it all balanced again. But then what?
It’s not good enough to just forget about it. Sweep it under the run and pretend it never happened. Production incidents are part of life. More than that, they are great learning opportunities. By examining what went wrong and identifying the contributing factors, we can not only learn how to prevent this happening again, but also pre-empt other issues within our system.
This talk looks at the strategies for Post Incident Review starting with the common place practices including the Fishbone Diagrams, Sticky Note Brainstorming and the Five Whys methods of root cause analysis. This talk will go beyond these practices to show how leveraging modern monitoring practices along side ChatOps we can the contributing factors, not just a single cause.
These contributing factors can be used to find actionable outcomes that not only help prevent the same thing happening again but feed back into the incident life cycle to make the team better prepared to Detect, Respond to, Remediate and Analyse the next incident.