Incident Management

Declare, investigate, and resolve incidents with a structured lifecycle and clear communication.

Overview#

Incidents are unplanned disruptions or degradations to your services. JustAnalytics provides a structured incident management workflow that takes you from detection through resolution and learning. Every incident follows a consistent lifecycle so your team knows exactly what to do when things break.

Incident Lifecycle#

Every incident moves through four phases:

DECLARED → INVESTIGATING → MITIGATED → RESOLVED
    │           │              │            │
    │           │              │            └─ Root cause fixed, postmortem written
    │           │              └─ User impact reduced/eliminated
    │           └─ Team actively diagnosing
    └─ Incident created, team notified

Phase 1: Declared#

An incident is declared when a disruption is detected. This can happen:

  • Automatically -- a critical alert fires and triggers incident creation via a workflow
  • Manually -- an engineer notices something wrong and declares an incident from the dashboard

At declaration, the incident gets a unique ID (e.g., INC-2026-0042), a severity level, and initial metadata.
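The ID format can be parsed client-side if you need to group or sort incidents. A minimal sketch, assuming the `INC-<year>-<sequence>` format shown in the example above is stable (treat it as illustrative, not a guaranteed contract):

```javascript
// Hypothetical helper: parse an incident ID of the form INC-<year>-<sequence>.
// The pattern is inferred from the example ID above (INC-2026-0042).
function parseIncidentId(id) {
  const match = /^INC-(\d{4})-(\d{4})$/.exec(id);
  if (!match) return null;
  return { year: Number(match[1]), sequence: Number(match[2]) };
}

parseIncidentId('INC-2026-0042'); // → { year: 2026, sequence: 42 }
```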

Phase 2: Investigating#

The team is actively diagnosing the issue. During this phase:

  • An incident commander is assigned
  • Related traces, errors, logs, and metrics are linked
  • Timeline updates are posted as investigation progresses
  • Severity may be adjusted as the scope becomes clearer

Phase 3: Mitigated#

The user-facing impact has been reduced or eliminated, but the root cause may not be fully fixed. Common mitigation actions:

  • Rolling back a bad deploy
  • Scaling up infrastructure
  • Enabling a feature flag to bypass the broken path
  • Redirecting traffic away from the affected service

Phase 4: Resolved#

The incident is fully resolved. The root cause has been identified and fixed (or a permanent workaround is in place). After resolution:

  • Duration is calculated
  • A postmortem is created (automatically or manually)
  • Action items are tracked to completion
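The four phases above can be sketched as a client-side transition guard. This is an illustrative strict check, assuming strictly forward, one-step-at-a-time movement; the actual platform may be more permissive (e.g. resolving a false alarm straight from declared):

```javascript
// The four lifecycle phases, in order, from the diagram above.
const PHASES = ['declared', 'investigating', 'mitigated', 'resolved'];

// Allow only a single forward step through the lifecycle.
// Assumption: the real API's transition rules may differ.
function canTransition(from, to) {
  const fromIdx = PHASES.indexOf(from);
  const toIdx = PHASES.indexOf(to);
  return fromIdx !== -1 && toIdx === fromIdx + 1;
}

canTransition('investigating', 'mitigated'); // → true
canTransition('mitigated', 'declared');      // → false
```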

Severity Levels#

JustAnalytics uses four severity levels. Assign severity based on user impact, not technical complexity.

| Level | Name | Description | Response Time | Example |
|-------|------|-------------|---------------|---------|
| SEV1 | Critical | Complete outage or data loss for all users | < 15 minutes | API returning 500 for all requests |
| SEV2 | Major | Significant degradation for many users | < 30 minutes | Checkout failing for 40% of users |
| SEV3 | Minor | Limited impact, workaround available | < 2 hours | Search results loading slowly |
| SEV4 | Low | Minimal impact, cosmetic or edge case | Next business day | Dashboard chart rendering glitch |

Escalation#

Severity can be changed at any time during an incident. If a SEV3 turns out to have broader impact than initially thought, escalate it to SEV2 or SEV1. All severity changes are recorded in the timeline.
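An escalation can be performed programmatically. This sketch reuses the PATCH pattern shown elsewhere on this page; whether `severity` is accepted as a PATCH field is an assumption:

```javascript
// Escalate (or downgrade) an incident's severity via the PATCH endpoint.
// Assumption: "severity" is a patchable field, mirroring the commanderId
// example later in this page.
async function setSeverity(incidentId, newSeverity) {
  const response = await fetch(`/api/dashboard/incidents/${incidentId}`, {
    method: 'PATCH',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ severity: newSeverity }),
  });
  return response.json();
}
```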

Creating Incidents#

From the Dashboard#

Navigate to Dashboard > Incidents and click Declare Incident.

Fill in:

  • Title -- short description of the impact (e.g., "Checkout API returning 500 errors")
  • Severity -- SEV1 through SEV4
  • Affected services -- select one or more services
  • Environment -- production, staging, etc.
  • Description -- what you know so far

From an Alert#

When an alert fires, you can create an incident directly from the alert detail page by clicking Create Incident. The alert context (service, environment, metric values) is automatically populated.

From an Error Group#

In the Error Tracking view, click the overflow menu on any error group and select Create Incident. The error group is automatically linked to the incident.

Via the API#

Create incidents programmatically for integration with external systems:

const response = await fetch('/api/dashboard/incidents', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    title: 'Payment processing failures',
    severity: 'SEV1',
    status: 'declared',
    services: ['payment-service', 'checkout-api'],
    environment: 'production',
    description: 'Stripe webhook processing failing since 14:32 UTC. ~30% of payments affected.',
    declaredBy: 'api',
  }),
});

const incident = await response.json();
// { id: 'INC-2026-0042', status: 'declared', createdAt: '2026-03-15T14:35:00Z' }

Via Workflow Automation#

Configure a workflow to automatically create incidents when specific conditions are met:

Trigger:    alert_fired
Condition:  severity = critical AND service in [payment-service, checkout-api]
Action:     create_incident with severity SEV1
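The rule above might be expressed as a structured definition along these lines. Field names here are purely illustrative; see Workflow Automation for the actual schema:

```javascript
// Hypothetical object form of the workflow rule above.
// The real workflow schema may use different field names.
const workflow = {
  trigger: 'alert_fired',
  condition: {
    severity: 'critical',
    service: { in: ['payment-service', 'checkout-api'] },
  },
  actions: [{ type: 'create_incident', severity: 'SEV1' }],
};
```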

See Workflow Automation for details.

Incident Commander#

Every incident should have an Incident Commander (IC) -- the person responsible for coordinating the response. The IC:

  • Coordinates -- assigns tasks, manages communication
  • Communicates -- posts timeline updates, notifies stakeholders
  • Decides -- makes calls on mitigation actions and severity changes
  • Does not debug -- the IC focuses on coordination, not hands-on troubleshooting

Assigning the IC#

The IC can be:

  • Auto-assigned via ownership rules (the on-call for the affected service)
  • Manually assigned when declaring the incident
  • Reassigned at any time during the incident

// Update incident commander via API
await fetch(`/api/dashboard/incidents/${incidentId}`, {
  method: 'PATCH',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    commanderId: 'user_abc123',
  }),
});

Timeline and Updates#

The incident timeline is the single source of truth for what happened. Key actions are recorded automatically, and you can post manual updates alongside them:

Automatic Timeline Entries#

  • Incident declared
  • Severity changed
  • Status changed (investigating, mitigated, resolved)
  • Commander assigned or changed
  • Services added or removed
  • Alerts linked
  • Error groups linked
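These entries can also be read back programmatically. A hedged sketch, assuming the same `/timeline` path that accepts POST (shown in the next section) also supports GET and returns an array of entries; the actual response shape may differ:

```javascript
// Read back an incident's timeline entries.
// Assumption: GET is supported on the /timeline path and returns JSON.
async function getTimeline(incidentId) {
  const response = await fetch(`/api/dashboard/incidents/${incidentId}/timeline`);
  if (!response.ok) throw new Error(`Timeline fetch failed: ${response.status}`);
  return response.json();
}
```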

Manual Timeline Updates#

Post updates to keep your team informed:

await fetch(`/api/dashboard/incidents/${incidentId}/timeline`, {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    message: 'Identified bad deploy (v2.3.1) as root cause. Rolling back to v2.3.0.',
    type: 'update',       // 'update' | 'action' | 'resolution'
  }),
});

From the dashboard, click Add Update on the incident detail page. Updates support Markdown formatting.

Timeline Best Practices#

  • Post updates at least every 15 minutes during SEV1/SEV2
  • Include what you know, what you don't know, and what you're doing next
  • Note when customer-facing impact starts and stops
  • Record all mitigation actions taken, even failed ones

Linking Related Data#

Incidents become more useful when connected to the relevant observability data. Link the following to any incident:

Alerts#

Alerts that contributed to the incident detection. Linked alerts show their metric values and threshold breach details on the incident timeline.

Error Groups#

Error groups that are part of the incident. This connects the incident to specific stack traces, affected users, and error frequency data.

Traces#

Specific trace IDs that demonstrate the failure. Useful for showing the exact request path that failed.

Deploys / Releases#

The release that caused the incident (or the release that fixed it). This creates a causal link between code changes and incidents.

// Link an alert to an incident
await fetch(`/api/dashboard/incidents/${incidentId}/links`, {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    type: 'alert',
    targetId: 'alert_event_xyz789',
  }),
});

Managing Incident Status#

Transitioning to Mitigated#

When user impact is reduced, move to mitigated:

await fetch(`/api/dashboard/incidents/${incidentId}`, {
  method: 'PATCH',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    status: 'mitigated',
    mitigationSummary: 'Rolled back to v2.3.0. Error rates returned to baseline.',
  }),
});

Resolving the Incident#

When the root cause is fixed and the incident is fully resolved:

await fetch(`/api/dashboard/incidents/${incidentId}`, {
  method: 'PATCH',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    status: 'resolved',
    resolutionSummary: 'Root cause was a missing null check in payment handler (commit abc123). Fix deployed in v2.3.2.',
    resolvedAt: new Date().toISOString(),
  }),
});

On resolution, the incident duration is calculated and a postmortem is automatically generated if enabled for the project.

Incident Dashboard#

List View#

The incident list shows all incidents with:

  • Status badge -- color-coded by current phase
  • Severity -- SEV1 (red), SEV2 (orange), SEV3 (yellow), SEV4 (blue)
  • Title and affected services
  • Duration -- time since declaration (or total duration if resolved)
  • Commander -- assigned IC
  • Last update -- most recent timeline entry

Filter by status, severity, service, environment, or date range.
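When driving the list programmatically, the filters above can be serialized into a query string. The endpoint path follows the pattern used elsewhere on this page, but these exact parameter names are an assumption, not a documented contract:

```javascript
// Build an incident-list URL from a filters object, skipping empty values.
// Assumption: parameter names (status, severity, service, environment)
// are illustrative.
function buildIncidentQuery(filters) {
  const params = new URLSearchParams();
  for (const [key, value] of Object.entries(filters)) {
    if (value !== undefined && value !== null) params.set(key, String(value));
  }
  return `/api/dashboard/incidents?${params.toString()}`;
}

buildIncidentQuery({ status: 'investigating', severity: 'SEV1' });
// → '/api/dashboard/incidents?status=investigating&severity=SEV1'
```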

Detail View#

The incident detail page provides:

  • Header -- title, severity, status, duration, commander
  • Timeline -- chronological feed of all events and updates
  • Linked data -- alerts, error groups, traces, releases
  • Metrics -- error rate and latency charts for affected services during the incident window
  • Postmortem -- link to the postmortem (if created)

Metrics#

The incidents overview shows aggregate metrics:

  • MTTD (Mean Time to Detect) -- average time from impact start to incident declaration
  • MTTR (Mean Time to Resolve) -- average time from declaration to resolution
  • Incident frequency -- incidents per week/month by severity
  • Top affected services -- which services are involved in the most incidents
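MTTR, as defined above, is straightforward to compute yourself from incident records. The field names (`declaredAt`, `resolvedAt`) are assumed for illustration:

```javascript
// Mean time to resolve, in milliseconds, over resolved incidents only.
// Assumption: each incident carries ISO-8601 declaredAt/resolvedAt timestamps.
function meanTimeToResolveMs(incidents) {
  const resolved = incidents.filter((i) => i.resolvedAt);
  if (resolved.length === 0) return 0;
  const total = resolved.reduce(
    (sum, i) => sum + (Date.parse(i.resolvedAt) - Date.parse(i.declaredAt)),
    0,
  );
  return total / resolved.length;
}

meanTimeToResolveMs([
  { declaredAt: '2026-03-15T14:35:00Z', resolvedAt: '2026-03-15T16:35:00Z' }, // 2h
  { declaredAt: '2026-03-16T09:00:00Z', resolvedAt: '2026-03-16T10:00:00Z' }, // 1h
]); // → 5400000 (1.5 hours in milliseconds)
```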

Postmortems#

When an incident is resolved, JustAnalytics can automatically generate a postmortem template pre-populated with timeline data, linked alerts, and metric snapshots. See Postmortems for the full workflow.

Integration with Alerts#

Incidents and alerts are tightly connected:

  1. Alert fires -- notification sent, optional auto-incident creation
  2. Incident declared -- alert linked, context carried over
  3. Investigation -- alert metric charts embedded in incident view
  4. Resolution -- alert auto-resolves when the metric recovers

Configure auto-incident creation in Dashboard > Monitoring > Alerts > [Alert Rule] > Advanced > Auto-create incident.

Best Practices#

Declare Early#

It's better to declare an incident and close it quickly than to delay and let impact grow. If you're not sure whether something is an incident, declare it at SEV3 and adjust.

Communicate Clearly#

  • Use plain language in timeline updates
  • State impact in user terms ("30% of checkout attempts are failing") not technical terms ("OOM on pod-3")
  • Include estimated time to resolution when possible

Separate Roles#

During SEV1/SEV2 incidents, have separate people for:

  • Incident Commander -- coordinates
  • Technical Lead -- debugs
  • Communicator -- updates stakeholders

Review and Improve#

Track MTTD and MTTR trends over time. If MTTD is high, improve your alerting. If MTTR is high, improve your runbooks and tooling.