Root Cause Analysis (RCA) / Post-mortem Template

While the Cloud team strives to keep Sourcegraph Cloud running smoothly, outages do occasionally happen. When they do, we should respond with a clear post-mortem.

This is an industry norm, and clear, high-quality post-mortems can improve a company’s trustworthiness.

Template

Consider your audience when writing a post-mortem. Is the document internal-facing, or will it be delivered to customers?

  1. Incident Summary - briefly describe the incident and its impact on customers. Include aspects such as scope, duration, and severity.
  2. Brief timeline of events
  3. Impact Analysis - describe the impact of the incident on customers, including downtime or disruptions to the service.
  4. Root Cause - explain the underlying cause of the incident. Optionally include the technical, operational, or process issues that led to the incident. The best post-mortems usually contain a clear explanation of what happened.
  5. Remediation - describe the actions taken to fix the incident. Optionally include short-term workarounds.
  6. Follow-up actions & lessons learned - describe either the steps that will be taken to prevent the issue from recurring or the changes that have already been made.
  7. Closure - communicate to the customer that we take service disruptions seriously and that we encourage them to reach out with any further questions.

Further steps for documents

If any security concerns arose during the incident, have the #security-team approve the document.

If the document is intended to be viewed by customers, request a review from our communications team (#ask-internal-comms).

High-quality or interesting post-mortems should also be considered for publication, in line with Sourcegraph’s Open And Transparent Culture.

Prior Art