Postmortems
Learn from incidents with structured, blameless postmortems that drive meaningful improvements.
What Is a Postmortem?#
A postmortem is a structured review conducted after an incident is resolved. Its purpose is to understand what happened, why it happened, and what changes will prevent it from happening again. Postmortems are the bridge between responding to incidents and actually improving your systems.
JustAnalytics automates much of the postmortem process -- pre-populating timelines, linking relevant data, and tracking action items to completion.
Blameless Culture#
Postmortems in JustAnalytics are designed to be blameless. The goal is to improve systems and processes, not to find someone to punish.
Principles#
- Assume good intent -- people made the best decisions they could with the information available
- Focus on systems -- if a human error caused an outage, ask why the system allowed that error to have such impact
- No counterfactuals -- avoid "if only X had done Y." Focus on what actually happened and what systemic changes prevent recurrence
- Celebrate detection -- the person who found the bug is a hero, not a suspect
- Share openly -- postmortems should be visible to the whole engineering org
What Blameless Looks Like in Practice#
Instead of:
"Developer X pushed a bad config change without testing it."
Write:
"A configuration change was deployed that had not been validated in the staging environment. Our deploy pipeline does not currently enforce staging validation for config changes."
The second framing naturally leads to a systemic fix (enforce staging validation) rather than a punitive action.
Postmortem Workflow#
Step 1: Auto-Generation#
When an incident is resolved, JustAnalytics automatically creates a postmortem draft if auto-generation is enabled for the project.
The draft includes:
- Incident metadata -- title, severity, duration, affected services
- Timeline -- all entries from the incident timeline, formatted chronologically
- Linked alerts -- which alerts fired and when
- Linked error groups -- error details and stack traces
- Linked releases -- deploys that occurred around the incident window
- Metric snapshots -- error rate and latency charts for affected services during the incident
Navigate to the postmortem from the resolved incident page, or find it under Dashboard > Incidents > Postmortems.
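If you want the draft programmatically (for example, to post a link into a chat channel), a fetch along these lines could work. This is a sketch: the `/api/dashboard/postmortems` query endpoint and its `incidentId` parameter are assumptions modeled on the link API shown later under Linking Related Data, not confirmed API surface.

```typescript
// Hypothetical: retrieve the auto-generated postmortem draft for an incident.
// The endpoint shape and `incidentId` query parameter are assumptions.
function draftUrl(incidentId: string): string {
  return `/api/dashboard/postmortems?incidentId=${encodeURIComponent(incidentId)}`;
}

async function fetchDraft(incidentId: string): Promise<unknown> {
  const res = await fetch(draftUrl(incidentId));
  if (!res.ok) throw new Error(`Failed to fetch draft: ${res.status}`);
  return res.json();
}
```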
Step 2: Fill in the Template#
The auto-generated draft provides the structure. Your team fills in the analysis:
- Summary -- high-level description of the incident and its impact
- Root cause -- the underlying issue that caused the incident
- Contributing factors -- additional conditions that enabled or worsened the incident
- Action items -- concrete tasks to prevent recurrence
Step 3: Review#
Schedule a postmortem review meeting within 3-5 business days of the incident. During the review:
- Walk through the timeline together
- Discuss the root cause analysis
- Agree on action items and assign owners
- Identify systemic improvements
Step 4: Publish#
After review, publish the postmortem to make it visible to the team. Published postmortems are searchable and linked from the incident record.
Step 5: Track Action Items#
Action items from the postmortem are tracked to completion. Each item has:
- Description -- what needs to be done
- Owner -- who is responsible
- Due date -- when it should be completed
- Status -- open, in progress, completed
- Link -- optional link to a Jira/GitHub/Linear issue
JustAnalytics shows a dashboard-wide view of all open postmortem action items so nothing falls through the cracks.
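The fields above map naturally onto a typed payload. The sketch below assumes a `POST .../action-items` endpoint shaped like the link API shown later under Linking Related Data; treat the path and field names as illustrative, not confirmed API.

```typescript
// Illustrative action-item payload mirroring the fields listed above.
type ActionItemStatus = 'open' | 'in_progress' | 'completed';

interface ActionItem {
  description: string;      // what needs to be done
  owner: string;            // who is responsible, e.g. '@alice'
  dueDate: string;          // ISO date, e.g. '2026-03-22'
  status: ActionItemStatus;
  ticketUrl?: string;       // optional Jira/GitHub/Linear link
}

// Hypothetical endpoint; the path is an assumption, not confirmed API.
async function createActionItem(postmortemId: string, item: ActionItem) {
  const res = await fetch(`/api/dashboard/postmortems/${postmortemId}/action-items`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(item),
  });
  if (!res.ok) throw new Error(`Failed to create action item: ${res.status}`);
  return res.json();
}
```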
Template Structure#
Every postmortem follows a consistent template:
Header#
Title: [Incident Title]
Incident ID: INC-2026-0042
Severity: SEV1
Duration: 2h 14m (14:32 - 16:46 UTC)
Affected Services: payment-service, checkout-api
Commander: Jane Smith
Date: 2026-03-15
Status: Draft | In Review | Published
Summary#
A 2-3 paragraph description of the incident written for a broad audience. Answer:
- What happened?
- Who was affected and how?
- What was the business impact?
## Summary
On March 15, 2026, the payment processing system experienced a 2-hour partial
outage affecting approximately 30% of checkout attempts. Users encountered
"Payment Failed" errors when attempting to complete purchases.
The incident was caused by a database migration that added a NOT NULL column
without a default value to the payments table. This caused INSERT operations
to fail for any payment that did not include the new field, which had not yet
been added to the API payload.
Estimated revenue impact: ~$12,000 in delayed transactions. All transactions
were recovered after the fix was deployed.
Timeline#
A chronological record of events. The auto-generated timeline includes all incident timeline entries. Add any additional context discovered during the review.
## Timeline (all times UTC)
| Time | Event |
|-------|-------|
| 14:30 | Deploy v2.3.1 rolls out to production (contains migration) |
| 14:32 | First 500 errors appear in payment-service |
| 14:35 | Error rate alert fires (payment-service error rate > 5%) |
| 14:37 | Incident INC-2026-0042 declared (SEV2) |
| 14:38 | Jane Smith assigned as incident commander |
| 14:42 | Escalated to SEV1 after confirming 30% of checkouts affected |
| 14:45 | Root cause identified: migration added NOT NULL column |
| 14:50 | Decision: rollback migration rather than hotfix API |
| 15:02 | Rollback migration deployed |
| 15:05 | Error rates begin declining |
| 15:15 | Error rates return to baseline, incident mitigated |
| 16:46 | Confirmed all queued payments processed, incident resolved |
Root Cause#
A detailed technical explanation of the underlying cause. Be specific enough that someone unfamiliar with the service could understand it.
## Root Cause
The database migration in commit `abc123` added a `fraud_score` column to the
`payments` table with a `NOT NULL` constraint and no default value:
ALTER TABLE payments ADD COLUMN fraud_score FLOAT NOT NULL;
The payment-service API had not yet been updated to include `fraud_score` in
its INSERT statements. PostgreSQL rejected all INSERT operations that omitted
the column, causing 500 errors for ~30% of checkout attempts (those routed to
database nodes where the migration had already been applied).
Contributing Factors#
Other conditions that enabled or amplified the incident:
## Contributing Factors
1. **No staging validation for migrations** -- the migration ran successfully
in CI against an empty test database but was never tested against staging
with live traffic patterns.
2. **Canary deploy did not catch the issue** -- the canary received only 2%
of traffic, and the 10% rollback threshold was evaluated against the overall
error rate rather than the canary slice alone. The canary's 30% failure rate
therefore contributed under 1% to the overall rate, well below the threshold.
3. **Alert delay** -- the error rate alert had a 3-minute evaluation window,
adding 3 minutes between first errors and notification.
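A quick back-of-the-envelope calculation shows why the canary failure stayed invisible, assuming the 10% rollback threshold was evaluated against overall traffic rather than the canary slice alone:

```typescript
// With 2% of traffic on the canary and a 30% failure rate inside that slice,
// the contribution to the overall error rate is only 0.6% -- far below a 10%
// threshold evaluated on total traffic.
function overallErrorRate(canaryShare: number, canaryFailureRate: number): number {
  return canaryShare * canaryFailureRate;
}

const contribution = overallErrorRate(0.02, 0.3); // ~0.006, i.e. 0.6%
const threshold = 0.10;                            // rollback threshold
const wouldRollBack = contribution >= threshold;   // false -- no rollback
```

A threshold scoped to canary-only traffic would have seen the full 30% failure rate and triggered a rollback at either 10% or 5%.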
Action Items#
Concrete, assignable tasks that address the root cause and contributing factors:
## Action Items
| # | Action | Owner | Due Date | Status | Ticket |
|---|--------|-------|----------|--------|--------|
| 1 | Add staging migration validation to CI pipeline | @alice | 2026-03-22 | Open | ENG-1234 |
| 2 | Lower canary error rate threshold to 5% | @bob | 2026-03-19 | Open | ENG-1235 |
| 3 | Add NOT NULL migration lint rule to prevent columns without defaults | @alice | 2026-03-25 | Open | ENG-1236 |
| 4 | Reduce alert evaluation window from 3m to 1m for payment-service | @carol | 2026-03-18 | Completed | ENG-1237 |
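Action item 3 (a lint rule for `NOT NULL` columns without defaults) could start as something like the sketch below. The function name and regexes are illustrative; a production rule should use a real SQL parser.

```typescript
// Flags ADD COLUMN ... NOT NULL migrations that lack a DEFAULT, since they
// break INSERTs from services that have not yet been updated to send the new
// column. The safe pattern: add the column as nullable (or with a DEFAULT),
// backfill, then tighten the constraint in a later migration.
function violatesNotNullRule(sql: string): boolean {
  const addsNotNullColumn = /ADD\s+COLUMN\s+\S+\s+\S+\s+NOT\s+NULL/i;
  const hasDefault = /\bDEFAULT\b/i;
  return addsNotNullColumn.test(sql) && !hasDefault.test(sql);
}
```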
Lessons Learned#
What went well, what didn't, and where you got lucky:
## Lessons Learned
### What went well
- Alert fired within 3 minutes of the first error
- Root cause was identified quickly (8 minutes after declaration)
- Rollback procedure worked as expected
### What didn't go well
- Migration was not tested against staging with realistic traffic
- Canary thresholds were too loose to catch the issue
- Initial severity was SEV2; should have been SEV1 from the start
### Where we got lucky
- All failed payments were queued and retried successfully after the fix
- The incident occurred during low-traffic hours (14:30 UTC)
Linking Related Data#
Postmortems can link to any observability data in JustAnalytics:
Alerts#
Link the alerts that detected the incident. The postmortem shows the alert configuration, when it fired, and the metric values at the time.
Error Groups#
Link error groups that were part of the incident. Stack traces and error counts are embedded in the postmortem view.
Traces#
Link specific trace IDs that demonstrate the failure path. The postmortem renders the trace waterfall inline.
Releases#
Link the release that caused the incident and the release that fixed it. This creates a clear causal chain from code change to impact to resolution.
// Link a release to a postmortem via API
const res = await fetch(`/api/dashboard/postmortems/${postmortemId}/links`, {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    type: 'release',
    targetId: 'rel_v2.3.1',
    relationship: 'caused_by', // 'caused_by' | 'fixed_by'
  }),
});
if (!res.ok) throw new Error(`Failed to link release: ${res.status}`);
Automated Postmortem Generation#
Enable auto-generation in Dashboard > Settings > Incidents > Auto-generate postmortems.
When enabled, resolving an incident automatically creates a postmortem draft with:
- All timeline entries formatted as a table
- Linked alerts, error groups, and releases pulled in
- Metric snapshots (error rate, latency) for affected services during the incident window
- Empty sections for root cause, contributing factors, and action items
You can customize the template under Dashboard > Settings > Incidents > Postmortem Template.
Template Variables#
The auto-generated template supports these variables:
{{incident.title}} -- Incident title
{{incident.id}} -- Incident ID (e.g., INC-2026-0042)
{{incident.severity}} -- Severity level
{{incident.duration}} -- Human-readable duration
{{incident.services}} -- Comma-separated affected services
{{incident.commander}} -- Commander name
{{incident.timeline}} -- Formatted timeline table
{{incident.alerts}} -- Linked alerts summary
{{incident.errors}} -- Linked error groups summary
{{incident.releases}} -- Linked releases summary
{{metrics.errorRate}} -- Error rate chart embed
{{metrics.latency}} -- Latency chart embed
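For example, a customized template might combine these variables like so (every variable below is from the list above; the section layout is up to you):

```
## {{incident.title}} ({{incident.id}})

Severity: {{incident.severity}} -- Duration: {{incident.duration}}
Services: {{incident.services}} -- Commander: {{incident.commander}}

## Timeline

{{incident.timeline}}

## Detection

{{incident.alerts}}

{{metrics.errorRate}}
{{metrics.latency}}

## Root Cause

_To be filled in during review._
```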
Postmortem Dashboard#
List View#
View all postmortems under Dashboard > Incidents > Postmortems:
- Status -- draft, in review, published
- Incident -- linked incident ID and title
- Severity -- incident severity
- Date -- when the incident occurred
- Action items -- count of open vs. completed items
Action Item Tracker#
The action item tracker aggregates all postmortem action items across your organization:
- Filter by status (open, in progress, completed)
- Filter by owner
- Filter by due date (overdue, due this week, upcoming)
- Sort by priority (SEV1 action items first)
This ensures postmortem follow-through. Action items without owners or due dates are flagged.
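The flagging described above can be sketched as a simple filter. The field names here are illustrative, not the JustAnalytics schema:

```typescript
// Surface action items that cannot be tracked to completion: anything not
// completed that is missing an owner or a due date gets flagged.
interface TrackedItem {
  description: string;
  owner?: string;
  dueDate?: string; // ISO date, e.g. '2026-03-22'
  status: 'open' | 'in_progress' | 'completed';
}

function flagUntrackable(items: TrackedItem[]): TrackedItem[] {
  return items.filter((i) => i.status !== 'completed' && (!i.owner || !i.dueDate));
}

// Overdue filter: ISO dates in UTC compare correctly as plain strings.
function overdue(items: TrackedItem[], today: string): TrackedItem[] {
  return items.filter((i) => i.status !== 'completed' && !!i.dueDate && i.dueDate < today);
}
```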
Best Practices#
Timing#
- Generate the draft immediately after resolution while details are fresh
- Hold the review meeting within 3-5 business days
- Publish within 1 week of the incident
Writing#
- Write for an audience that wasn't involved in the incident
- Be specific about times, percentages, and user impact
- Avoid jargon -- a product manager should be able to understand the summary
- Include "where we got lucky" to surface hidden risks
Action Items#
- Every postmortem should have at least one action item
- Action items should be SMART: Specific, Measurable, Achievable, Relevant, Time-bound
- Assign an owner to every item -- unowned items don't get done
- Track completion -- schedule a follow-up review 2 weeks after the meeting
Review Meetings#
- Keep it to 30-60 minutes
- The incident commander facilitates
- Focus on systems, not individuals
- End with clear agreement on action items and owners
- Record the meeting or share detailed notes for those who couldn't attend
Organizational Habits#
- Make postmortems mandatory for SEV1 and SEV2 incidents
- Make postmortems optional but encouraged for SEV3
- Share published postmortems in a team channel or newsletter
- Review postmortem action item completion rates monthly
- Celebrate thorough postmortems -- they make your systems better