Incident Management

Declare, investigate, and resolve incidents with a structured lifecycle and clear communication.

Overview#

Incidents are unplanned disruptions or degradations to your services. JustAnalytics provides a structured incident management workflow that takes you from detection through resolution and learning. Every incident follows a consistent lifecycle so your team knows exactly what to do when things break.

Incident Lifecycle#

Every incident moves through four phases:

DECLARED → INVESTIGATING → MITIGATED → RESOLVED
    │           │              │            │
    │           │              │            └─ Root cause fixed, postmortem written
    │           │              └─ User impact reduced/eliminated
    │           └─ Team actively diagnosing
    └─ Incident created, team notified

Phase 1: Declared#

An incident is declared when a disruption is detected. This can happen:

  • Automatically -- a critical alert fires and triggers incident creation via a workflow
  • Manually -- an engineer notices something wrong and declares an incident from the dashboard

At declaration, the incident gets a unique ID (e.g., INC-2026-0042), a severity level, and initial metadata.
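The ID format can be parsed client-side if you need to group or sort incidents. A minimal sketch, assuming the `INC-<year>-<sequence>` format shown in the example above is stable (treat it as illustrative, not a guaranteed contract):

```javascript
// Hypothetical helper: parse an incident ID of the form INC-<year>-<sequence>.
// The pattern is inferred from the example ID above (INC-2026-0042).
function parseIncidentId(id) {
  const match = /^INC-(\d{4})-(\d{4})$/.exec(id);
  if (!match) return null;
  return { year: Number(match[1]), sequence: Number(match[2]) };
}

parseIncidentId('INC-2026-0042'); // → { year: 2026, sequence: 42 }
```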

Phase 2: Investigating#

The team is actively diagnosing the issue. During this phase:

  • An incident commander is assigned
  • Related traces, errors, logs, and metrics are linked
  • Timeline updates are posted as investigation progresses
  • Severity may be adjusted as the scope becomes clearer

Phase 3: Mitigated#

The user-facing impact has been reduced or eliminated, but the root cause may not be fully fixed. Common mitigation actions:

  • Rolling back a bad deploy
  • Scaling up infrastructure
  • Enabling a feature flag to bypass the broken path
  • Redirecting traffic away from the affected service

Phase 4: Resolved#

The incident is fully resolved. The root cause has been identified and fixed (or a permanent workaround is in place). After resolution:

  • Duration is calculated
  • A postmortem is created (automatically or manually)
  • Action items are tracked to completion
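The four phases above can be sketched as a client-side transition guard. This is an illustrative strict check, assuming strictly forward, one-step-at-a-time movement; the actual platform may be more permissive (e.g. resolving a false alarm straight from declared):

```javascript
// The four lifecycle phases, in order, from the diagram above.
const PHASES = ['declared', 'investigating', 'mitigated', 'resolved'];

// Allow only a single forward step through the lifecycle.
// Assumption: the real API's transition rules may differ.
function canTransition(from, to) {
  const fromIdx = PHASES.indexOf(from);
  const toIdx = PHASES.indexOf(to);
  return fromIdx !== -1 && toIdx === fromIdx + 1;
}

canTransition('investigating', 'mitigated'); // → true
canTransition('mitigated', 'declared');      // → false
```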

Severity Levels#

JustAnalytics uses four severity levels. Assign severity based on user impact, not technical complexity.

| Level | Name | Description | Response Time | Example |
|-------|------|-------------|---------------|---------|
| SEV1 | Critical | Complete outage or data loss for all users | < 15 minutes | API returning 500 for all requests |
| SEV2 | Major | Significant degradation for many users | < 30 minutes | Checkout failing for 40% of users |
| SEV3 | Minor | Limited impact, workaround available | < 2 hours | Search results loading slowly |
| SEV4 | Low | Minimal impact, cosmetic or edge case | Next business day | Dashboard chart rendering glitch |

Escalation#

Severity can be changed at any time during an incident. If a SEV3 turns out to have broader impact than initially thought, escalate it to SEV2 or SEV1. All severity changes are recorded in the timeline.
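An escalation can be performed programmatically. This sketch reuses the PATCH pattern shown elsewhere on this page; whether `severity` is accepted as a PATCH field is an assumption:

```javascript
// Escalate (or downgrade) an incident's severity via the PATCH endpoint.
// Assumption: "severity" is a patchable field, mirroring the commanderId
// example later in this page.
async function setSeverity(incidentId, newSeverity) {
  const response = await fetch(`/api/dashboard/incidents/${incidentId}`, {
    method: 'PATCH',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ severity: newSeverity }),
  });
  return response.json();
}
```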

Creating Incidents#

From the Dashboard#

Navigate to Dashboard > Incidents and click Declare Incident.

Fill in:

  • Title -- short description of the impact (e.g., "Checkout API returning 500 errors")
  • Severity -- SEV1 through SEV4
  • Affected services -- select one or more services
  • Environment -- production, staging, etc.
  • Description -- what you know so far

From an Alert#

When an alert fires, you can create an incident directly from the alert detail page by clicking Create Incident. The alert context (service, environment, metric values) is automatically populated.

From an Error Group#

In the Error Tracking view, click the overflow menu on any error group and select Create Incident. The error group is automatically linked to the incident.

Via the API#

Create incidents programmatically for integration with external systems:

const response = await fetch('/api/dashboard/incidents', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    title: 'Payment processing failures',
    severity: 'SEV1',
    status: 'declared',
    services: ['payment-service', 'checkout-api'],
    environment: 'production',
    description: 'Stripe webhook processing failing since 14:32 UTC. ~30% of payments affected.',
    declaredBy: 'api',
  }),
});

const incident = await response.json();
// { id: 'INC-2026-0042', status: 'declared', createdAt: '2026-03-15T14:35:00Z' }

Via Workflow Automation#

Configure a workflow to automatically create incidents when specific conditions are met:

Trigger:    alert_fired
Condition:  severity = critical AND service in [payment-service, checkout-api]
Action:     create_incident with severity SEV1
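The rule above might be expressed as a structured definition along these lines. Field names here are purely illustrative; see Workflow Automation for the actual schema:

```javascript
// Hypothetical object form of the workflow rule above.
// The real workflow schema may use different field names.
const workflow = {
  trigger: 'alert_fired',
  condition: {
    severity: 'critical',
    service: { in: ['payment-service', 'checkout-api'] },
  },
  actions: [{ type: 'create_incident', severity: 'SEV1' }],
};
```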

See Workflow Automation for details.

Incident Commander#

Every incident should have an Incident Commander (IC) -- the person responsible for coordinating the response. The IC:

  • Coordinates -- assigns tasks, manages communication
  • Communicates -- posts timeline updates, notifies stakeholders
  • Decides -- makes calls on mitigation actions and severity changes
  • Does not debug -- the IC focuses on coordination, not hands-on troubleshooting

Assigning the IC#

The IC can be:

  • Auto-assigned via ownership rules (the on-call for the affected service)
  • Manually assigned when declaring the incident
  • Reassigned at any time during the incident

// Update incident commander via API
await fetch(`/api/dashboard/incidents/${incidentId}`, {
  method: 'PATCH',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    commanderId: 'user_abc123',
  }),
});

Timeline and Updates#

The incident timeline is the single source of truth for what happened. Key actions are recorded automatically, and you can post manual updates alongside them:

Automatic Timeline Entries#

  • Incident declared
  • Severity changed
  • Status changed (investigating, mitigated, resolved)
  • Commander assigned or changed
  • Services added or removed
  • Alerts linked
  • Error groups linked
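These entries can also be read back programmatically. A hedged sketch, assuming the same `/timeline` path that accepts POST (shown in the next section) also supports GET and returns an array of entries; the actual response shape may differ:

```javascript
// Read back an incident's timeline entries.
// Assumption: GET is supported on the /timeline path and returns JSON.
async function getTimeline(incidentId) {
  const response = await fetch(`/api/dashboard/incidents/${incidentId}/timeline`);
  if (!response.ok) throw new Error(`Timeline fetch failed: ${response.status}`);
  return response.json();
}
```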

Manual Timeline Updates#

Post updates to keep your team informed:

await fetch(`/api/dashboard/incidents/${incidentId}/timeline`, {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    message: 'Identified bad deploy (v2.3.1) as root cause. Rolling back to v2.3.0.',
    type: 'update',       // 'update' | 'action' | 'resolution'
  }),
});

From the dashboard, click Add Update on the incident detail page. Updates support Markdown formatting.

Timeline Best Practices#

  • Post updates at least every 15 minutes during SEV1/SEV2
  • Include what you know, what you don't know, and what you're doing next
  • Note when customer-facing impact starts and stops
  • Record all mitigation actions taken, even failed ones

Linking Related Data#

Incidents become more useful when connected to the relevant observability data. Link the following to any incident:

Alerts#

Alerts that contributed to the incident detection. Linked alerts show their metric values and threshold breach details on the incident timeline.

Error Groups#

Error groups that are part of the incident. This connects the incident to specific stack traces, affected users, and error frequency data.

Traces#

Specific trace IDs that demonstrate the failure. Useful for showing the exact request path that failed.

Deploys / Releases#

The release that caused the incident (or the release that fixed it). This creates a causal link between code changes and incidents.

// Link an alert to an incident
await fetch(`/api/dashboard/incidents/${incidentId}/links`, {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    type: 'alert',
    targetId: 'alert_event_xyz789',
  }),
});

Managing Incident Status#

Transitioning to Mitigated#

When user impact is reduced, move to mitigated:

await fetch(`/api/dashboard/incidents/${incidentId}`, {
  method: 'PATCH',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    status: 'mitigated',
    mitigationSummary: 'Rolled back to v2.3.0. Error rates returned to baseline.',
  }),
});

Resolving the Incident#

When the root cause is fixed and the incident is fully resolved:

await fetch(`/api/dashboard/incidents/${incidentId}`, {
  method: 'PATCH',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    status: 'resolved',
    resolutionSummary: 'Root cause was a missing null check in payment handler (commit abc123). Fix deployed in v2.3.2.',
    resolvedAt: new Date().toISOString(),
  }),
});

On resolution, the incident duration is calculated and a postmortem is automatically generated if enabled for the project.

Incident Dashboard#

List View#

The incident list shows all incidents with:

  • Status badge -- color-coded by current phase
  • Severity -- SEV1 (red), SEV2 (orange), SEV3 (yellow), SEV4 (blue)
  • Title and affected services
  • Duration -- time since declaration (or total duration if resolved)
  • Commander -- assigned IC
  • Last update -- most recent timeline entry

Filter by status, severity, service, environment, or date range.
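When driving the list programmatically, the filters above can be serialized into a query string. The endpoint path follows the pattern used elsewhere on this page, but these exact parameter names are an assumption, not a documented contract:

```javascript
// Build an incident-list URL from a filters object, skipping empty values.
// Assumption: parameter names (status, severity, service, environment)
// are illustrative.
function buildIncidentQuery(filters) {
  const params = new URLSearchParams();
  for (const [key, value] of Object.entries(filters)) {
    if (value !== undefined && value !== null) params.set(key, String(value));
  }
  return `/api/dashboard/incidents?${params.toString()}`;
}

buildIncidentQuery({ status: 'investigating', severity: 'SEV1' });
// → '/api/dashboard/incidents?status=investigating&severity=SEV1'
```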

Detail View#

The incident detail page provides:

  • Header -- title, severity, status, duration, commander
  • Timeline -- chronological feed of all events and updates
  • Linked data -- alerts, error groups, traces, releases
  • Metrics -- error rate and latency charts for affected services during the incident window
  • Postmortem -- link to the postmortem (if created)

Metrics#

The incidents overview shows aggregate metrics:

  • MTTD (Mean Time to Detect) -- average time from impact start to incident declaration
  • MTTR (Mean Time to Resolve) -- average time from declaration to resolution
  • Incident frequency -- incidents per week/month by severity
  • Top affected services -- which services are involved in the most incidents
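MTTR, as defined above, is straightforward to compute yourself from incident records. The field names (`declaredAt`, `resolvedAt`) are assumed for illustration:

```javascript
// Mean time to resolve, in milliseconds, over resolved incidents only.
// Assumption: each incident carries ISO-8601 declaredAt/resolvedAt timestamps.
function meanTimeToResolveMs(incidents) {
  const resolved = incidents.filter((i) => i.resolvedAt);
  if (resolved.length === 0) return 0;
  const total = resolved.reduce(
    (sum, i) => sum + (Date.parse(i.resolvedAt) - Date.parse(i.declaredAt)),
    0,
  );
  return total / resolved.length;
}

meanTimeToResolveMs([
  { declaredAt: '2026-03-15T14:35:00Z', resolvedAt: '2026-03-15T16:35:00Z' }, // 2h
  { declaredAt: '2026-03-16T09:00:00Z', resolvedAt: '2026-03-16T10:00:00Z' }, // 1h
]); // → 5400000 (1.5 hours in milliseconds)
```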

Postmortems#

When an incident is resolved, JustAnalytics can automatically generate a postmortem template pre-populated with timeline data, linked alerts, and metric snapshots. See Postmortems for the full workflow.

Integration with Alerts#

Incidents and alerts are tightly connected:

  1. Alert fires -- notification sent, optional auto-incident creation
  2. Incident declared -- alert linked, context carried over
  3. Investigation -- alert metric charts embedded in incident view
  4. Resolution -- alert auto-resolves when the metric recovers

Configure auto-incident creation in Dashboard > Monitoring > Alerts > [Alert Rule] > Advanced > Auto-create incident.

Best Practices#

Declare Early#

It's better to declare an incident and close it quickly than to delay and let impact grow. If you're not sure whether something is an incident, declare it at SEV3 and adjust.

Communicate Clearly#

  • Use plain language in timeline updates
  • State impact in user terms ("30% of checkout attempts are failing") not technical terms ("OOM on pod-3")
  • Include estimated time to resolution when possible

Separate Roles#

During SEV1/SEV2 incidents, have separate people for:

  • Incident Commander -- coordinates
  • Technical Lead -- debugs
  • Communicator -- updates stakeholders

Review and Improve#

Track MTTD and MTTR trends over time. If MTTD is high, improve your alerting. If MTTR is high, improve your runbooks and tooling.