Blameless Postmortem: Learning from Failure minus the BLAME!
Post-mortem (Latin for "after death") is short for 'post-mortem examination', or autopsy: an examination of a corpse to determine the cause of death.
What does this mean for those of us working in the technology space, and how did this term and process get etched into our books?
"We can't blame the technology when we make mistakes" - Tim Berners-Lee
Have you ever stumbled upon this quote? It's true that we can't blame the technology, but we also must not blame the people working on that technology. Yes, you heard that right!
Incidents and mistakes happen. They just do. As our systems grow in scale and complexity, failures are inevitable, and as IT professionals we understand that. What matters is how we respond to failure when it occurs.
An engineer or team who thinks they're going to be reprimanded has no incentive to give a realistic, accurate account of the problem. Not understanding how an accident occurred all but guarantees that it will happen again, if not with the original engineer or team then with someone else.
Blameless Postmortem to the rescue!
Organizations that practice DevOps want to view mistakes and errors with a goal of learning. Having blameless postmortems on outages and accidents is part of that goal.
Having a just culture means that you're making an effort to balance safety and accountability. It means that by investigating mistakes in a way that focuses on the situational aspects of a failure and on the decision-making process of individuals close to the failure, an organization can come out safer than it would if it had punished the people involved.
A blameless postmortem means that engineers whose actions have contributed to an accident can give a detailed account of:
What actions they took at what time.
What effects they observed.
What expectations they had.
What assumptions they made.
Their understanding of the timeline of events as they occurred.
It's important that they can give this detailed account without fear of punishment or retribution.
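One way to make these points concrete is to capture each account in a small structured record. The following is only a minimal sketch in Python, not a standard or prescribed format: the names TimelineEntry, Account, and merged_timeline are hypothetical and chosen purely for illustration. Each entry records what was done, when, what was observed, what was expected, and what was assumed, and the helper combines several people's accounts into one chronological timeline.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class TimelineEntry:
    """One step in an engineer's account (hypothetical structure)."""
    at: datetime       # when the action was taken
    action: str        # what the engineer did
    observed: str      # what effect they saw
    expected: str      # what they expected to happen
    assumption: str    # what they believed to be true at the time


@dataclass
class Account:
    """One person's account of the incident, in their own words."""
    author: str
    entries: list[TimelineEntry] = field(default_factory=list)


def merged_timeline(accounts: list[Account]) -> list[tuple[datetime, str, str]]:
    """Combine several accounts into one chronological (time, author, action) list."""
    rows = [(e.at, a.author, e.action) for a in accounts for e in a.entries]
    return sorted(rows, key=lambda row: row[0])


# Illustrative data only; not taken from a real incident.
on_call = Account("on-call engineer", [
    TimelineEntry(datetime(2024, 1, 1, 10, 5), "restarted the cache",
                  "latency stayed high", "latency would drop",
                  "the cache was the bottleneck"),
])
deployer = Account("deploying engineer", [
    TimelineEntry(datetime(2024, 1, 1, 10, 0), "rolled out a config change",
                  "no alerts fired", "no user impact",
                  "the change was backward compatible"),
])

for at, author, action in merged_timeline([on_call, deployer]):
    print(at.isoformat(), author, action)
```

The record deliberately has no field for fault or severity of the mistake. It only asks what happened and why it made sense at the time, which is the point of the blameless format.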
Allow engineers to own their own stories
A funny thing happens when engineers make mistakes and feel safe giving details about them. They're not only willing to be held accountable, they're also enthusiastic about helping the rest of the company avoid the same error in the future. They are, after all, the ones with the most expertise when it comes to the error. They ought to be heavily involved in coming up with the remediation.
How do I enable a "just culture"?
Encourage learning by having blameless postmortems on outages and accidents.
Remind yourself that the goal is to understand how an accident could have happened, in order to better equip ourselves to prevent it from happening in the future.
Gather details from multiple perspectives on failures, and don't punish people for making mistakes.
Instead of punishing engineers, give them the requisite authority to improve safety by allowing them to give detailed accounts of their contributions to failures.
Enable and encourage people who do make mistakes to be the experts on educating the rest of the organization on how not to make them in the future.
Accept that there is always a discretionary space where humans can decide to act or not act, and that the assessment of those decisions lies in hindsight.
Accept that hindsight bias can cloud our assessment of past events, so work hard to eliminate it.
Accept that the fundamental attribution error is also difficult to escape, so focus on the environment and circumstances people are working in when investigating accidents.
Strive to make sure that the blunt end of the organization (for example, boards or senior leadership) understands how work is actually getting done on the sharp end (for example, by engineers and technology), as opposed to how they imagine it's getting done through Gantt charts and procedures.
The sharp end must inform the organization where the line is between appropriate and inappropriate behavior. This isn't something that the blunt end can come up with on its own.
Failure happens. In order to understand how failures happen, we first have to understand our reactions to failure.
Fundamental Attribution Error
I learned this concept while preparing for the AZ-400 exam: https://docs.microsoft.com/en-us/learn/modules/manage-site-reliability/6-blameless-postmortem