

Post-incident review best practices

The way you approach a post-incident review is just as important as the tasks that need to be ticked off. Tensions can run high in the wake of an incident. The key to getting people to come to the process engaged and ready to tackle a difficult problem is to give them a sense of psychological safety.

Best practices for post-incident reviews

Establish a blameless culture – Allow people involved in an incident to account for all their actions, their impact, and what they knew and when, without fear of punishment or retribution. This approach is key to making sure your teams openly share information and get to the root cause of an incident. If anyone fears rebuke they may hold back information or try to redirect blame. When this happens, people lose trust in each other.

Avoid pointing fingers – In your post-incident review meeting—and in the subsequent write-ups—avoid language that singles out individuals as personally responsible for the incident. Instead, focus on actions, results, and impact.

Keep critique constructive – While it’s important to keep the conversation safe and objective, getting to the root cause of the incident is critical to resolving it. Make sure the room doesn’t steer away from an uncomfortable truth or settle for an easy consensus. You can use a technique called ‘The 5 Whys’ in your meeting to uncover the deeper factors contributing to the problem. Learn how to run a ‘5 Whys Analysis’ with the Atlassian playbook.

Review every post-incident review – An unreviewed post-incident review might as well not have been written. Once a post-incident review has been drafted, it’s important to review it to close out any unresolved issues, capture ideas to consider in the future, and finalize the report. It’s a good idea to schedule a recurring meeting with engineering (and anyone else who may have an interest, like customer support or account managers), at least monthly, to review your post-incident reviews. You can choose to look over recent reviews or older reports and share any relevant lessons.

Creating a post-incident review plan

In order for post-incident reviews to be effective—and allow you to build a culture of continuous improvement—you want to implement a simple, repeatable process that everyone can participate in. How you do this will depend on your culture and your team, but the key to conducting post-incident reviews that improve your team and systems is to have a process and stick to it. Learn how Atlassian runs its post-incident review process.

Here are some tips to get started:

1. Decide which incidents need review

Incidents in your organization should have clear and measurable severity levels. These severity levels can be used to trigger the post-incident review process. For example, any incident rated Sev-1 or higher triggers a post-incident review, while reviews can be optional for less severe incidents. Consider letting team leads or managers request a post-incident review for any incident they feel warrants it.
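The trigger rule above can be sketched in a few lines of Python. This is only an illustration, not part of any Atlassian product: the `Incident` class, the `REVIEW_THRESHOLD` value, and the incident keys are hypothetical, and the sketch assumes a scale where a lower number means a more severe incident.

```python
from dataclasses import dataclass

# Assumption: a lower number is more severe (Sev-1 is the most severe).
# Flip the comparison in needs_review() if your scale runs the other way.
REVIEW_THRESHOLD = 1  # hypothetical policy: Sev-1 always gets a review


@dataclass
class Incident:
    key: str                        # hypothetical identifier, e.g. "INC-101"
    severity: int                   # 1 = most severe
    review_requested: bool = False  # set when a team lead or manager asks for a review


def needs_review(incident: Incident) -> bool:
    """Return True when a post-incident review should be opened."""
    return incident.severity <= REVIEW_THRESHOLD or incident.review_requested


print(needs_review(Incident("INC-101", severity=1)))                         # True
print(needs_review(Incident("INC-102", severity=3)))                         # False
print(needs_review(Incident("INC-103", severity=3, review_requested=True)))  # True
```

Keeping the manual request as a separate flag means a low-severity incident can still get a review without changing its severity rating.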

2. Draft your review within two days of the incident

It’s important to take a break and get some rest after an incident. But don’t delay writing the post-incident review. Wait too long and important details might be lost or forgotten. Ideally, draft it immediately after a meeting with the incident team, within 24 to 48 hours of the incident being resolved and no later than five business days after.
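If you want to put dates on those limits, a small helper can compute both from the resolution time. This is a rough sketch under stated assumptions: the function names are made up for illustration, business days are taken to be Monday to Friday, and public holidays are ignored.

```python
from datetime import datetime, timedelta


def add_business_days(start: datetime, days: int) -> datetime:
    """Step forward one calendar day at a time, counting only Monday to Friday."""
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday to Friday
            remaining -= 1
    return current


def draft_deadlines(resolved_at: datetime) -> tuple[datetime, datetime]:
    """Return (target, latest): 48 hours and five business days after resolution."""
    target = resolved_at + timedelta(hours=48)
    latest = add_business_days(resolved_at, 5)
    return target, latest


# Example: an incident resolved on Friday, 12 January 2024 at 15:00.
target, latest = draft_deadlines(datetime(2024, 1, 12, 15, 0))
print(target)  # 2024-01-14 15:00:00
print(latest)  # 2024-01-19 15:00:00
```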

3. Assign roles and owners

Have a meeting to hash out the details that will be recorded in the review. It’s a good idea to delegate drafting the review to a specific person, ideally someone familiar with the incident who has the technical and organizational knowledge to understand the causes and mitigations.

4. Work from a template

A template can keep you from leaving out key details. It’s also a great way to build consistency across your post-incident reviews. Check out this example post-incident review template to get started.

5. Include a timeline

A timeline is a very helpful aid in incident documentation. Often it’s the first place your readers’ eyes jump to when trying to quickly size up what happened. You can use the activity feed of an incident to help you see what happened when. Try to be as clear and specific as possible. For example, “11:14 am Pacific Standard Time,” not “around 11.”

Important times to include:

  • First alert or ticket

  • First comms announcement (internal and/or external)

  • Times of status page updates

  • Time of any remediation attempts (code rollbacks, etc.)

  • Time of resolution
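One way to keep timeline entries clear and specific is to store each one as a timezone-aware timestamp and print it with an explicit zone label. The sketch below is only an illustration: the `TimelineEntry` class, the fixed `PST` offset (which ignores daylight saving time), and the sample times are assumptions, not data from a real incident.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Assumption: a fixed UTC-8 offset labelled "PST"; it does not handle daylight saving.
PST = timezone(timedelta(hours=-8), "PST")


@dataclass
class TimelineEntry:
    at: datetime  # must be timezone-aware
    event: str


def format_timeline(entries, tz=PST):
    """Sort entries by time and render each with an explicit time zone."""
    lines = []
    for entry in sorted(entries, key=lambda e: e.at):
        local = entry.at.astimezone(tz)
        lines.append(f"{local:%Y-%m-%d %H:%M} {local.tzname()}  {entry.event}")
    return lines


# Hypothetical entries recorded in UTC, deliberately out of order.
entries = [
    TimelineEntry(datetime(2024, 1, 15, 19, 42, tzinfo=timezone.utc), "Time of resolution"),
    TimelineEntry(datetime(2024, 1, 15, 19, 14, tzinfo=timezone.utc), "First alert or ticket"),
]
for line in format_timeline(entries):
    print(line)
# 2024-01-15 11:14 PST  First alert or ticket
# 2024-01-15 11:42 PST  Time of resolution
```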

6. Add as many details as possible

Leaving out details is a quick path to writing post-incident reviews that are unhelpful and unclear. Add as many details as possible about what happened and what was done during the incident. Instead of “then public comms went out,” say “We sent the initial public comms announcing the incident on our public status page and Twitter account.” Include as many links as possible to issues, status updates, documentation and monitoring charts, and don’t be afraid to attach relevant screenshots.

7. Capture incident metrics

When you capture metrics in your post-incident reviews, you apply hard data to the issues and their impact. These data points help you determine whether your team is headed in the right direction: reducing the number of incidents, their severity, and downtime. When you measure the same metrics consistently, you can take a step back and look at incident trends over time.

Some metrics to consider:

  • The number of minutes of downtime, so you can track whether this number is going up or down.

  • The severity of the incident, so you can determine the relative reliability of your systems.

  • Mean Time to Resolution (MTTR), which measures the average time it takes to resolve an incident, from when it was initially reported.
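Downtime minutes and MTTR are simple arithmetic on two timestamps per incident. The following is a minimal sketch, assuming each incident is recorded as a (reported, resolved) pair; the function names and sample times are hypothetical.

```python
from datetime import datetime, timedelta


def downtime_minutes(reported: datetime, resolved: datetime) -> float:
    """Minutes between the initial report and the resolution of one incident."""
    return (resolved - reported).total_seconds() / 60


def mean_time_to_resolution(incidents) -> timedelta:
    """Average of (resolved - reported) across a list of (reported, resolved) pairs."""
    if not incidents:
        raise ValueError("need at least one incident to compute MTTR")
    durations = [resolved - reported for reported, resolved in incidents]
    return sum(durations, timedelta()) / len(durations)


# Hypothetical incidents lasting 30 and 90 minutes.
incidents = [
    (datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 10, 30)),
    (datetime(2024, 2, 1, 9, 0), datetime(2024, 2, 1, 10, 30)),
]
print(downtime_minutes(*incidents[0]))     # 30.0
print(mean_time_to_resolution(incidents))  # 1:00:00
```

Computing these from the same two timestamps in every review keeps the numbers comparable from one incident to the next.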

 
