Product · March 31, 2026 · 10 min read

SLOs, Incidents, and Escalation: Enterprise Observability Features

Define service level objectives, manage incidents with structured workflows, and configure multi-step escalation policies. Enterprise-grade reliability engineering in JustAnalytics.

Reliability Is a Feature

Your users don't care about your architecture. They care about whether your application works when they need it. Service Level Objectives (SLOs), incident management, and escalation policies are the tools that help engineering teams deliver on that expectation.

Until now, getting these capabilities meant adopting Datadog, PagerDuty, and a runbook tool -- three more subscriptions, three more dashboards, three more integration points. JustAnalytics brings SLOs, incident management, and escalation into the same platform where you already track errors, traces, logs, and uptime.

Service Level Objectives (SLOs)

An SLO defines a measurable target for your service's reliability. Instead of vague promises like "we aim for high availability," an SLO gives you a concrete number: 99.9% of API requests should succeed within 500ms over any 30-day rolling window.

Creating an SLO

In JustAnalytics, creating an SLO takes less than a minute:

Name: API Latency SLO
Description: 99.9% of API requests complete within 500ms
Type: Latency
Target: 99.9%
Window: 30 days (rolling)
Data Source: Spans where service = "api-gateway"
Good Event: Duration < 500ms
Total Event: All spans
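Under the hood, an SLO like this reduces to a ratio of good events to total events. A minimal sketch of that calculation (the span records and function name are illustrative, not the JustAnalytics API):

```python
# Hypothetical span records; in JustAnalytics the data source would be
# spans where service = "api-gateway".
spans = [
    {"service": "api-gateway", "duration_ms": 120},
    {"service": "api-gateway", "duration_ms": 480},
    {"service": "api-gateway", "duration_ms": 950},  # exceeds the threshold
    {"service": "api-gateway", "duration_ms": 300},
]

def slo_attainment(spans, threshold_ms=500):
    """Fraction of good events (duration < threshold) over total events."""
    total = len(spans)
    good = sum(1 for s in spans if s["duration_ms"] < threshold_ms)
    return good / total if total else 1.0

print(f"{slo_attainment(spans):.1%}")  # 3 of 4 spans under 500ms -> 75.0%
```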

JustAnalytics supports three SLO types:

| SLO Type | Good Event Definition | Example |
| --- | --- | --- |
| Availability | HTTP status < 500 | 99.95% of requests succeed |
| Latency | Duration < threshold | 99.9% of requests under 500ms |
| Custom | Any boolean expression | 99% of payments process correctly |

Error Budgets

Every SLO comes with an error budget -- the amount of unreliability you can tolerate before violating your objective. JustAnalytics calculates and visualizes your error budget in real time.

For a 99.9% availability SLO over 30 days:

Total budget:     43.2 minutes of downtime
Budget consumed:  12.8 minutes (29.6%)
Budget remaining: 30.4 minutes (70.4%)
Burn rate:        0.98x (healthy)

The burn rate tells you how fast you're consuming your error budget relative to the window length. A burn rate of 1.0x means you'll exhaust your budget exactly at the end of the window; anything sustained above 1.0x means you're on track to miss the SLO.
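The numbers above fall out of simple arithmetic. A sketch of the budget and burn rate calculation for an availability SLO (function names are illustrative, not the platform's API):

```python
def error_budget_minutes(target, window_days=30):
    """Allowed downtime for the window: (1 - target) * window length."""
    return (1 - target) * window_days * 24 * 60

def burn_rate(consumed_minutes, budget_minutes, elapsed_days, window_days=30):
    """Budget consumed so far, relative to the budget 'earned' by elapsed
    time. 1.0x = on pace to exhaust the budget exactly at the window's end."""
    expected = budget_minutes * (elapsed_days / window_days)
    return consumed_minutes / expected

budget = error_budget_minutes(0.999)               # 43.2 minutes for 99.9%/30d
rate = burn_rate(12.8, budget, elapsed_days=9.07)  # ~0.98x, matching the example
print(f"budget={budget:.1f} min, burn rate={rate:.2f}x")
```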

Burn Rate Alerts

JustAnalytics implements the multi-window burn rate alerting strategy recommended in Google's Site Reliability Workbook:

| Alert Severity | Burn Rate | Long Window | Short Window | Action |
| --- | --- | --- | --- | --- |
| Page (critical) | 14.4x | 1 hour | 5 minutes | Wake someone up |
| Page (high) | 6x | 6 hours | 30 minutes | Investigate now |
| Ticket (medium) | 3x | 1 day | 2 hours | Fix this week |
| Ticket (low) | 1x | 3 days | 6 hours | Track and trend |

Multi-window alerting prevents false positives. A brief spike that resolves quickly won't page you at 3 AM, but a sustained degradation will.
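The multi-window condition only fires when both windows agree. A sketch of the check (burn-rate inputs are assumed to come from your metrics store):

```python
def should_page(long_burn, short_burn, threshold):
    """Fire only if BOTH windows exceed the threshold. The long window
    proves sustained impact; the short window proves the problem is
    still happening, so a spike that already resolved won't page."""
    return long_burn >= threshold and short_burn >= threshold

# Critical page: 14.4x over both 1 hour and 5 minutes
print(should_page(long_burn=16.0, short_burn=2.0, threshold=14.4))   # False: spike resolved
print(should_page(long_burn=16.0, short_burn=15.1, threshold=14.4))  # True: still burning
```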

SLO Dashboard

The SLO dashboard provides a single view of all your service level objectives:

  • SLO status grid -- green/yellow/red indicators for each SLO
  • Error budget timeline -- how budget has been consumed over the window
  • Budget forecast -- projected budget remaining at window end based on current burn rate
  • Incident correlation -- which incidents impacted which SLOs
  • Historical compliance -- SLO attainment over the past 3, 6, and 12 months

Incident Management

When things go wrong, you need a structured process -- not a Slack thread. JustAnalytics incident management provides a clear lifecycle for every incident, from detection to resolution to retrospective.

Incident Lifecycle

Triggered → Acknowledged → Investigating → Identified → Monitoring → Resolved
                                                                        ↓
                                                                   Retrospective

Each state transition is logged with a timestamp and the team member who made the change, creating a complete audit trail.

Creating Incidents

Incidents can be created in three ways:

1. Automatic (from alert rules)

When an alert rule fires, it can automatically create an incident:

Alert Rule: API Error Rate > 5%
Action: Create incident
  Severity: High
  Title: "API error rate spike: {{value}}%"
  Assign to: Backend On-Call

2. Automatic (from SLO burn rate)

When an SLO burn rate exceeds your configured threshold, an incident is created automatically:

SLO: API Latency SLO
Burn Rate Alert: 14.4x over 1 hour
Action: Create incident
  Severity: Critical
  Title: "API Latency SLO burn rate critical"
  Assign to: Platform On-Call

3. Manual

Any team member can create an incident from the dashboard:

POST /api/incidents
{
  "title": "Payment processing delays",
  "severity": "high",
  "description": "Multiple customers reporting slow checkout",
  "assignee": "backend-team"
}
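The same call is easy to script. A sketch that builds and validates the request body (the severity levels, base URL, and auth scheme are assumptions, so this stops short of actually sending):

```python
import json

def build_incident(title, severity, description, assignee):
    """Assemble the POST /api/incidents body shown above."""
    allowed = {"critical", "high", "medium", "low"}  # assumed severity levels
    if severity not in allowed:
        raise ValueError(f"severity must be one of {sorted(allowed)}")
    return {"title": title, "severity": severity,
            "description": description, "assignee": assignee}

payload = build_incident(
    "Payment processing delays", "high",
    "Multiple customers reporting slow checkout", "backend-team")
print(json.dumps(payload, indent=2))
# Send with any HTTP client (urllib.request, requests, ...) to
# POST /api/incidents with your project's auth header.
```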

Incident Timeline

Every incident has a timeline that records all activity:

[14:32:01] INCIDENT CREATED - Alert: API error rate > 5%
[14:32:02] NOTIFICATION SENT - Email to backend-oncall@company.com
[14:32:15] ACKNOWLEDGED - by Sarah Chen
[14:33:00] STATUS CHANGED - Investigating
[14:33:45] NOTE ADDED - "Seeing increased latency on database queries"
[14:35:12] LINKED - Trace ID: abc123def456
[14:38:00] STATUS CHANGED - Identified
[14:38:30] NOTE ADDED - "Root cause: connection pool exhaustion after deploy"
[14:42:00] ACTION - Rollback initiated
[14:45:00] STATUS CHANGED - Monitoring
[14:55:00] STATUS CHANGED - Resolved
[14:55:00] DURATION - 22 minutes 59 seconds

Linking Incidents to Observability Data

This is where JustAnalytics incident management differs from standalone tools like PagerDuty or OpsGenie. Because incidents live in the same platform as your traces, logs, errors, and metrics, you can link directly:

  • Link to traces -- attach the specific trace that shows the failure
  • Link to error groups -- associate the incident with the error that caused it
  • Link to log queries -- save the log search that helped diagnose the issue
  • Link to metric charts -- embed the dashboard panel that shows the impact
  • Link to session replays -- show the user experience during the incident

Incident Metrics

JustAnalytics tracks incident metrics over time:

| Metric | Definition |
| --- | --- |
| MTTA | Mean Time to Acknowledge |
| MTTI | Mean Time to Identify (root cause) |
| MTTR | Mean Time to Resolve |
| Incident frequency | Incidents per week/month |
| Severity distribution | Breakdown by severity level |
| SLO impact | Which SLOs were affected |

These metrics feed into team performance dashboards and retrospective reports.
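These metrics are derived from the timeline timestamps. A sketch computing time-to-acknowledge and time-to-resolve from the example timeline above (same-day timestamps assumed; the event shape is illustrative):

```python
from datetime import datetime

def parse(t):
    return datetime.strptime(t, "%H:%M:%S")

# Key events from the example incident timeline
created      = parse("14:32:01")
acknowledged = parse("14:32:15")
resolved     = parse("14:55:00")

tta = (acknowledged - created).total_seconds()  # time to acknowledge
ttr = (resolved - created).total_seconds()      # time to resolve

print(f"TTA: {tta:.0f}s, TTR: {int(ttr // 60)}m {int(ttr % 60)}s")
# MTTA/MTTR are simply these values averaged across incidents.
```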

Escalation Policies

Not every alert should go to the same person. Escalation policies define who gets notified, when, and in what order -- so critical issues always reach someone who can act on them.

Multi-Step Escalation

An escalation policy defines a series of notification steps:

Escalation Policy: Production Critical
Steps:
  - Step 1 (0 min):
      Notify: Current on-call (backend-rotation)
      Channels: Email, Push Notification
  - Step 2 (5 min, if not acknowledged):
      Notify: Backend team lead
      Channels: Email, Push Notification, SMS
  - Step 3 (15 min, if not acknowledged):
      Notify: Engineering manager
      Channels: Email, SMS, Phone call
  - Step 4 (30 min, if not acknowledged):
      Notify: CTO
      Channels: All channels
Repeat: After all steps, restart from Step 1
Max repeats: 3

On-Call Rotations

JustAnalytics supports on-call rotation schedules:

  • Weekly rotation -- rotate the primary on-call every week
  • Daily rotation -- for teams that prefer shorter shifts
  • Custom schedule -- define specific date ranges for each team member
  • Override -- temporarily assign on-call to someone outside the rotation
  • Handoff time -- configure when rotations switch (e.g., Monday 9 AM)

On-Call Rotation: Backend Team
Schedule: Weekly, Monday 9:00 AM UTC
Members:
  - Week 1: Sarah Chen
  - Week 2: Marcus Johnson
  - Week 3: Priya Sharma
  - Week 4: Alex Rodriguez
Override: March 31 - April 7: James Park (covering for Sarah)
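Resolving "who is on call right now" from a weekly rotation with overrides can be sketched as follows (the anchor date and helper are illustrative assumptions):

```python
from datetime import date

MEMBERS = ["Sarah Chen", "Marcus Johnson", "Priya Sharma", "Alex Rodriguez"]
ROTATION_START = date(2026, 3, 2)  # a Monday; assumed anchor for Week 1
OVERRIDES = [(date(2026, 3, 31), date(2026, 4, 7), "James Park")]

def on_call(today):
    """An override wins; otherwise rotate weekly from the anchor Monday."""
    for start, end, person in OVERRIDES:
        if start <= today <= end:
            return person
    weeks = (today - ROTATION_START).days // 7
    return MEMBERS[weeks % len(MEMBERS)]

print(on_call(date(2026, 4, 1)))   # James Park (override in effect)
print(on_call(date(2026, 4, 13)))  # back on the normal weekly rotation
```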

Notification Channels

Escalation steps can use any combination of notification channels:

| Channel | Configuration |
| --- | --- |
| Email | Team member's email address |
| Push notification | JustAnalytics mobile app |
| SMS | Phone number (Twilio integration) |
| Slack | Channel or DM via Slack webhook |
| Webhook | Custom HTTP endpoint |

Routing Rules

Route incidents to different escalation policies based on conditions:

Rule 1: IF service = "payment-service" AND severity = "critical"
  → Escalation Policy: Payment Critical
  → Additional: Notify finance-team Slack channel

Rule 2: IF service = "api-gateway" AND error_rate > 10%
  → Escalation Policy: Production Critical

Rule 3: IF source = "uptime-monitor"
  → Escalation Policy: Infrastructure On-Call

Default: → Escalation Policy: General Engineering
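Rule evaluation is first-match-wins with a fallback. A sketch mirroring the rules above (the incident record's field names are illustrative):

```python
def route(incident):
    """First matching rule wins; otherwise fall through to the default."""
    if incident.get("service") == "payment-service" and incident.get("severity") == "critical":
        return "Payment Critical"  # plus a finance-team Slack notification
    if incident.get("service") == "api-gateway" and incident.get("error_rate", 0) > 0.10:
        return "Production Critical"
    if incident.get("source") == "uptime-monitor":
        return "Infrastructure On-Call"
    return "General Engineering"

print(route({"service": "api-gateway", "error_rate": 0.22}))  # Production Critical
print(route({"service": "frontend"}))                         # General Engineering
```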

Workflow Automation

Beyond escalation, JustAnalytics supports automated workflows that trigger when incidents are created, updated, or resolved.

Auto-Remediation

Workflow: Auto-scale on High CPU
Trigger: Incident created where metric = "cpu_usage" AND value > 90%
Actions:
  - POST webhook to auto-scaler API
  - Add note to incident: "Auto-scale triggered"
  - Wait 5 minutes
  - Check if CPU < 70%
  - If yes: Resolve incident with note "Auto-scale resolved the issue"
  - If no: Escalate to Step 2
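The trigger-wait-check-resolve flow above can be sketched as a function; the auto-scaler call, metric getter, and incident actions are stand-ins, and a real workflow engine would run this asynchronously:

```python
def auto_remediate(trigger_scale, get_cpu, escalate, resolve, target=70.0):
    """Trigger the auto-scaler, then resolve or escalate based on a
    follow-up CPU check. In the real workflow, a 5-minute wait sits
    between triggering and re-checking (omitted in this sketch)."""
    trigger_scale()
    if get_cpu() < target:
        resolve("Auto-scale resolved the issue")
    else:
        escalate()

# Stubs standing in for the webhook, metrics store, and incident API:
events = []
auto_remediate(
    trigger_scale=lambda: events.append("scaled"),
    get_cpu=lambda: 55.0,
    escalate=lambda: events.append("escalated"),
    resolve=lambda note: events.append(f"resolved: {note}"),
)
print(events)  # ['scaled', 'resolved: Auto-scale resolved the issue']
```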

Incident Communication

Workflow: Status Page Update
Trigger: Incident severity = "critical" AND status = "investigating"
Actions:
  - Create status page incident
  - Post update: "We are investigating reports of {{incident.title}}"
  - On resolution: Post update "This issue has been resolved"

Retrospective Automation

Workflow: Post-Incident Review
Trigger: Incident resolved where severity IN ("critical", "high")
Actions:
  - Wait 24 hours
  - Create retrospective document from incident timeline
  - Assign to incident owner
  - Schedule review meeting (calendar integration)
  - Notify team channel

How It Compares to Datadog

Datadog's incident management and SLO features are the benchmark for enterprise observability. Here's where JustAnalytics stands:

| Feature | JustAnalytics | Datadog |
| --- | --- | --- |
| SLO creation | Yes | Yes |
| Error budget tracking | Yes | Yes |
| Burn rate alerts | Yes | Yes |
| Incident lifecycle | Yes | Yes |
| Incident timeline | Yes | Yes |
| Escalation policies | Yes | Via PagerDuty integration |
| On-call rotations | Yes | Via PagerDuty integration |
| Multi-channel notifications | Yes | Yes |
| Auto-remediation workflows | Basic | Advanced |
| Retrospective templates | Yes | Yes |
| Status page integration | Built-in | Separate product |
| Pricing | Included in plan | $23/host/month + add-ons |

The key difference: Datadog charges separately for incident management, SLOs, and on-call -- and for escalation, you typically need PagerDuty as well. JustAnalytics includes all of these in a single plan.

Getting Started with SLOs

Step 1: Define Your First SLO

Start with your most critical user-facing endpoint. A good first SLO is availability:

  1. Go to Monitoring > SLOs > Create SLO
  2. Select Availability as the type
  3. Choose your data source (spans from your API service)
  4. Set the target (start with 99.9%)
  5. Set the window (30 days rolling)

Step 2: Set Up Burn Rate Alerts

JustAnalytics will suggest default burn rate alert thresholds when you create an SLO. Review them and adjust based on your team's tolerance for alerts.

Step 3: Create an Escalation Policy

Define who should be notified when the SLO is at risk:

  1. Go to Settings > Escalation Policies > Create
  2. Add notification steps with increasing urgency
  3. Assign team members and channels to each step

Step 4: Connect Incidents to Your Workflow

Configure automatic incident creation from your SLO burn rate alerts, and link your escalation policy to route incidents to the right people.

Within an hour, you'll have a complete reliability management workflow -- SLO tracking, burn rate alerting, incident management, and escalation -- all in one platform.

Start your 7-day free trial and start managing reliability like the best engineering teams in the world.

JustAnalytics Team, Engineering Team

The engineering and product team behind JustAnalytics. We're on a mission to make web observability simpler, faster, and more private.
