Postmortems
Learn from incidents with structured, blameless postmortems that drive meaningful improvements.
What Is a Postmortem?#
A postmortem is a structured review conducted after an incident is resolved. Its purpose is to understand what happened, why it happened, and what changes will prevent it from happening again. Postmortems are the bridge between responding to incidents and actually improving your systems.
JustAnalytics automates much of the postmortem process -- pre-populating timelines, linking relevant data, and tracking action items to completion.
Blameless Culture#
Postmortems in JustAnalytics are designed to be blameless. The goal is to improve systems and processes, not to find someone to punish.
Principles#
- Assume good intent -- people made the best decisions they could with the information available
- Focus on systems -- if a human error caused an outage, ask why the system allowed that error to have such impact
- No counterfactuals -- avoid "if only X had done Y." Focus on what actually happened and what systemic changes prevent recurrence
- Celebrate detection -- the person who found the bug is a hero, not a suspect
- Share openly -- postmortems should be visible to the whole engineering org
What Blameless Looks Like in Practice#
Instead of:
"Developer X pushed a bad config change without testing it."
Write:
"A configuration change was deployed that had not been validated in the staging environment. Our deploy pipeline does not currently enforce staging validation for config changes."
The second framing naturally leads to a systemic fix (enforce staging validation) rather than a punitive action.
Postmortem Workflow#
Step 1: Auto-Generation#
When an incident is resolved, JustAnalytics automatically creates a postmortem draft if auto-generation is enabled for the project.
The draft includes:
- Incident metadata -- title, severity, duration, affected services
- Timeline -- all entries from the incident timeline, formatted chronologically
- Linked alerts -- which alerts fired and when
- Linked error groups -- error details and stack traces
- Linked releases -- deploys that occurred around the incident window
- Metric snapshots -- error rate and latency charts for affected services during the incident
Navigate to the postmortem from the resolved incident page, or find it under Dashboard > Incidents > Postmortems.
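If you want the draft programmatically (for example, to post a link into a chat channel), a fetch along these lines could work. This is a sketch: the `/api/dashboard/postmortems` query endpoint and its `incidentId` parameter are assumptions modeled on the link API shown later under Linking Related Data, not confirmed API surface.

```typescript
// Hypothetical: retrieve the auto-generated postmortem draft for an incident.
// The endpoint shape and `incidentId` query parameter are assumptions.
function draftUrl(incidentId: string): string {
  return `/api/dashboard/postmortems?incidentId=${encodeURIComponent(incidentId)}`;
}

async function fetchDraft(incidentId: string): Promise<unknown> {
  const res = await fetch(draftUrl(incidentId));
  if (!res.ok) throw new Error(`Failed to fetch draft: ${res.status}`);
  return res.json();
}
```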
Step 2: Fill in the Template#
The auto-generated draft provides the structure. Your team fills in the analysis:
- Summary -- high-level description of the incident and its impact
- Root cause -- the underlying issue that caused the incident
- Contributing factors -- additional conditions that enabled or worsened the incident
- Action items -- concrete tasks to prevent recurrence
Step 3: Review#
Schedule a postmortem review meeting within 3-5 business days of the incident. During the review:
- Walk through the timeline together
- Discuss the root cause analysis
- Agree on action items and assign owners
- Identify systemic improvements
Step 4: Publish#
After review, publish the postmortem to make it visible to the team. Published postmortems are searchable and linked from the incident record.
Step 5: Track Action Items#
Action items from the postmortem are tracked to completion. Each item has:
- Description -- what needs to be done
- Owner -- who is responsible
- Due date -- when it should be completed
- Status -- open, in progress, completed
- Link -- optional link to a Jira/GitHub/Linear issue
JustAnalytics shows a dashboard-wide view of all open postmortem action items so nothing falls through the cracks.
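The fields above map naturally onto a typed payload. The sketch below assumes a `POST .../action-items` endpoint shaped like the link API shown later under Linking Related Data; treat the path and field names as illustrative, not confirmed API.

```typescript
// Illustrative action-item payload mirroring the fields listed above.
type ActionItemStatus = 'open' | 'in_progress' | 'completed';

interface ActionItem {
  description: string;      // what needs to be done
  owner: string;            // who is responsible, e.g. '@alice'
  dueDate: string;          // ISO date, e.g. '2026-03-22'
  status: ActionItemStatus;
  ticketUrl?: string;       // optional Jira/GitHub/Linear link
}

// Hypothetical endpoint; the path is an assumption, not confirmed API.
async function createActionItem(postmortemId: string, item: ActionItem) {
  const res = await fetch(`/api/dashboard/postmortems/${postmortemId}/action-items`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(item),
  });
  if (!res.ok) throw new Error(`Failed to create action item: ${res.status}`);
  return res.json();
}
```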
Template Structure#
Every postmortem follows a consistent template:
Header#
Title: [Incident Title]
Incident ID: INC-2026-0042
Severity: SEV1
Duration: 2h 14m (14:32 - 16:46 UTC)
Affected Services: payment-service, checkout-api
Commander: Jane Smith
Date: 2026-03-15
Status: Draft | In Review | Published
Summary#
A 2-3 paragraph description of the incident written for a broad audience. Answer:
- What happened?
- Who was affected and how?
- What was the business impact?
## Summary
On March 15, 2026, the payment processing system experienced a 2-hour partial
outage affecting approximately 30% of checkout attempts. Users encountered
"Payment Failed" errors when attempting to complete purchases.
The incident was caused by a database migration that added a NOT NULL column
without a default value to the payments table. This caused INSERT operations
to fail for any payment that did not include the new field, which had not yet
been added to the API payload.
Estimated revenue impact: ~$12,000 in delayed transactions. All transactions
were recovered after the fix was deployed.
Timeline#
A chronological record of events. The auto-generated timeline includes all incident timeline entries. Add any additional context discovered during the review.
## Timeline (all times UTC)
| Time | Event |
|-------|-------|
| 14:30 | Deploy v2.3.1 rolls out to production (contains migration) |
| 14:32 | First 500 errors appear in payment-service |
| 14:35 | Error rate alert fires (payment-service error rate > 5%) |
| 14:37 | Incident INC-2026-0042 declared (SEV2) |
| 14:38 | Jane Smith assigned as incident commander |
| 14:42 | Escalated to SEV1 after confirming 30% of checkouts affected |
| 14:45 | Root cause identified: migration added NOT NULL column |
| 14:50 | Decision: rollback migration rather than hotfix API |
| 15:02 | Rollback migration deployed |
| 15:05 | Error rates begin declining |
| 15:15 | Error rates return to baseline, incident mitigated |
| 16:46 | Confirmed all queued payments processed, incident resolved |
Root Cause#
A detailed technical explanation of the underlying cause. Be specific enough that someone unfamiliar with the service could understand it.
## Root Cause
The database migration in commit `abc123` added a `fraud_score` column to the
`payments` table with a `NOT NULL` constraint and no default value:
ALTER TABLE payments ADD COLUMN fraud_score FLOAT NOT NULL;
The payment-service API had not yet been updated to include `fraud_score` in
its INSERT statements. PostgreSQL rejected all INSERT operations that omitted
the column, causing 500 errors for ~30% of checkout attempts (those routed to
database nodes where the migration had already been applied).
Contributing Factors#
Other conditions that enabled or amplified the incident:
## Contributing Factors
1. **No staging validation for migrations** -- the migration ran successfully
in CI against an empty test database but was never tested against staging
with live traffic patterns.
2. **Canary deploy did not catch the issue** -- the canary received only 2%
of traffic, and the 10% rollback threshold was evaluated against the overall
error rate rather than the canary slice alone. The canary's 30% failure rate
therefore contributed under 1% to the overall rate, well below the threshold.
3. **Alert delay** -- the error rate alert had a 3-minute evaluation window,
adding 3 minutes between first errors and notification.
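A quick back-of-the-envelope calculation shows why the canary failure stayed invisible, assuming the 10% rollback threshold was evaluated against overall traffic rather than the canary slice alone:

```typescript
// With 2% of traffic on the canary and a 30% failure rate inside that slice,
// the contribution to the overall error rate is only 0.6% -- far below a 10%
// threshold evaluated on total traffic.
function overallErrorRate(canaryShare: number, canaryFailureRate: number): number {
  return canaryShare * canaryFailureRate;
}

const contribution = overallErrorRate(0.02, 0.3); // ~0.006, i.e. 0.6%
const threshold = 0.10;                            // rollback threshold
const wouldRollBack = contribution >= threshold;   // false -- no rollback
```

A threshold scoped to canary-only traffic would have seen the full 30% failure rate and triggered a rollback at either 10% or 5%.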
Action Items#
Concrete, assignable tasks that address the root cause and contributing factors:
## Action Items
| # | Action | Owner | Due Date | Status | Ticket |
|---|--------|-------|----------|--------|--------|
| 1 | Add staging migration validation to CI pipeline | @alice | 2026-03-22 | Open | ENG-1234 |
| 2 | Lower canary error rate threshold to 5% | @bob | 2026-03-19 | Open | ENG-1235 |
| 3 | Add NOT NULL migration lint rule to prevent columns without defaults | @alice | 2026-03-25 | Open | ENG-1236 |
| 4 | Reduce alert evaluation window from 3m to 1m for payment-service | @carol | 2026-03-18 | Completed | ENG-1237 |
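Action item 3 (a lint rule for `NOT NULL` columns without defaults) could start as something like the sketch below. The function name and regexes are illustrative; a production rule should use a real SQL parser.

```typescript
// Flags ADD COLUMN ... NOT NULL migrations that lack a DEFAULT, since they
// break INSERTs from services that have not yet been updated to send the new
// column. The safe pattern: add the column as nullable (or with a DEFAULT),
// backfill, then tighten the constraint in a later migration.
function violatesNotNullRule(sql: string): boolean {
  const addsNotNullColumn = /ADD\s+COLUMN\s+\S+\s+\S+\s+NOT\s+NULL/i;
  const hasDefault = /\bDEFAULT\b/i;
  return addsNotNullColumn.test(sql) && !hasDefault.test(sql);
}
```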
Lessons Learned#
What went well, what didn't, and where you got lucky:
## Lessons Learned
### What went well
- Alert fired within 3 minutes of the first error
- Root cause was identified quickly (8 minutes after declaration)
- Rollback procedure worked as expected
### What didn't go well
- Migration was not tested against staging with realistic traffic
- Canary thresholds were too loose to catch the issue
- Initial severity was SEV2; should have been SEV1 from the start
### Where we got lucky
- All failed payments were queued and retried successfully after the fix
- The incident occurred during low-traffic hours (14:30 UTC)
Linking Related Data#
Postmortems can link to any observability data in JustAnalytics:
Alerts#
Link the alerts that detected the incident. The postmortem shows the alert configuration, when it fired, and the metric values at the time.
Error Groups#
Link error groups that were part of the incident. Stack traces and error counts are embedded in the postmortem view.
Traces#
Link specific trace IDs that demonstrate the failure path. The postmortem renders the trace waterfall inline.
Releases#
Link the release that caused the incident and the release that fixed it. This creates a clear causal chain from code change to impact to resolution.
// Link a release to a postmortem via API
const res = await fetch(`/api/dashboard/postmortems/${postmortemId}/links`, {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    type: 'release',
    targetId: 'rel_v2.3.1',
    relationship: 'caused_by', // 'caused_by' | 'fixed_by'
  }),
});
if (!res.ok) throw new Error(`Failed to link release: ${res.status}`);
Automated Postmortem Generation#
Enable auto-generation in Dashboard > Settings > Incidents > Auto-generate postmortems.
When enabled, resolving an incident automatically creates a postmortem draft with:
- All timeline entries formatted as a table
- Linked alerts, error groups, and releases pulled in
- Metric snapshots (error rate, latency) for affected services during the incident window
- Empty sections for root cause, contributing factors, and action items
You can customize the template under Dashboard > Settings > Incidents > Postmortem Template.
Template Variables#
The auto-generated template supports these variables:
{{incident.title}} -- Incident title
{{incident.id}} -- Incident ID (e.g., INC-2026-0042)
{{incident.severity}} -- Severity level
{{incident.duration}} -- Human-readable duration
{{incident.services}} -- Comma-separated affected services
{{incident.commander}} -- Commander name
{{incident.timeline}} -- Formatted timeline table
{{incident.alerts}} -- Linked alerts summary
{{incident.errors}} -- Linked error groups summary
{{incident.releases}} -- Linked releases summary
{{metrics.errorRate}} -- Error rate chart embed
{{metrics.latency}} -- Latency chart embed
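For example, a customized template might combine these variables like so (every variable below is from the list above; the section layout is up to you):

```
## {{incident.title}} ({{incident.id}})

Severity: {{incident.severity}} -- Duration: {{incident.duration}}
Services: {{incident.services}} -- Commander: {{incident.commander}}

## Timeline

{{incident.timeline}}

## Detection

{{incident.alerts}}

{{metrics.errorRate}}
{{metrics.latency}}

## Root Cause

_To be filled in during review._
```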
Postmortem Dashboard#
List View#
View all postmortems under Dashboard > Incidents > Postmortems:
- Status -- draft, in review, published
- Incident -- linked incident ID and title
- Severity -- incident severity
- Date -- when the incident occurred
- Action items -- count of open vs. completed items
Action Item Tracker#
The action item tracker aggregates all postmortem action items across your organization:
- Filter by status (open, in progress, completed)
- Filter by owner
- Filter by due date (overdue, due this week, upcoming)
- Sort by priority (SEV1 action items first)
This ensures postmortem follow-through. Action items without owners or due dates are flagged.
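The flagging described above can be sketched as a simple filter. The field names here are illustrative, not the JustAnalytics schema:

```typescript
// Surface action items that cannot be tracked to completion: anything not
// completed that is missing an owner or a due date gets flagged.
interface TrackedItem {
  description: string;
  owner?: string;
  dueDate?: string; // ISO date, e.g. '2026-03-22'
  status: 'open' | 'in_progress' | 'completed';
}

function flagUntrackable(items: TrackedItem[]): TrackedItem[] {
  return items.filter((i) => i.status !== 'completed' && (!i.owner || !i.dueDate));
}

// Overdue filter: ISO dates in UTC compare correctly as plain strings.
function overdue(items: TrackedItem[], today: string): TrackedItem[] {
  return items.filter((i) => i.status !== 'completed' && !!i.dueDate && i.dueDate < today);
}
```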
Best Practices#
Timing#
- Generate the draft immediately after resolution while details are fresh
- Hold the review meeting within 3-5 business days
- Publish within 1 week of the incident
Writing#
- Write for an audience that wasn't involved in the incident
- Be specific about times, percentages, and user impact
- Avoid jargon -- a product manager should be able to understand the summary
- Include "where we got lucky" to surface hidden risks
Action Items#
- Every postmortem should have at least one action item
- Action items should be SMART: Specific, Measurable, Achievable, Relevant, Time-bound
- Assign an owner to every item -- unowned items don't get done
- Track completion -- schedule a follow-up review 2 weeks after the meeting
Review Meetings#
- Keep it to 30-60 minutes
- The incident commander facilitates
- Focus on systems, not individuals
- End with clear agreement on action items and owners
- Record the meeting or share detailed notes for those who couldn't attend
Organizational Habits#
- Make postmortems mandatory for SEV1 and SEV2 incidents
- Make postmortems optional but encouraged for SEV3
- Share published postmortems in a team channel or newsletter
- Review postmortem action item completion rates monthly
- Celebrate thorough postmortems -- they make your systems better