Product · March 31, 2026 · 10 min read

SLOs, Incidents, and Escalation: Enterprise Observability Features

Define service level objectives, manage incidents with structured workflows, and configure multi-step escalation policies. Enterprise-grade reliability engineering in JustAnalytics.

Reliability Is a Feature

Your users don't care about your architecture. They care about whether your application works when they need it. Service Level Objectives (SLOs), incident management, and escalation policies are the tools that help engineering teams deliver on that expectation.

Until now, getting these capabilities meant adopting Datadog, PagerDuty, and a runbook tool -- three more subscriptions, three more dashboards, three more integration points. JustAnalytics brings SLOs, incident management, and escalation into the same platform where you already track errors, traces, logs, and uptime.

Service Level Objectives (SLOs)

An SLO defines a measurable target for your service's reliability. Instead of vague promises like "we aim for high availability," an SLO gives you a concrete number: 99.9% of API requests should succeed within 500ms over any 30-day rolling window.

Creating an SLO

In JustAnalytics, creating an SLO takes less than a minute:

Name: API Latency SLO
Description: 99.9% of API requests complete within 500ms
Type: Latency
Target: 99.9%
Window: 30 days (rolling)
Data Source: Spans where service = "api-gateway"
Good Event: Duration < 500ms
Total Event: All spans
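Under the hood, an SLO like this reduces to a ratio of good events to total events. A minimal sketch of that calculation (the span records and function name are illustrative, not the JustAnalytics API):

```python
# Hypothetical span records; in JustAnalytics the data source would be
# spans where service = "api-gateway".
spans = [
    {"service": "api-gateway", "duration_ms": 120},
    {"service": "api-gateway", "duration_ms": 480},
    {"service": "api-gateway", "duration_ms": 950},  # exceeds the threshold
    {"service": "api-gateway", "duration_ms": 300},
]

def slo_attainment(spans, threshold_ms=500):
    """Fraction of good events (duration < threshold) over total events."""
    total = len(spans)
    good = sum(1 for s in spans if s["duration_ms"] < threshold_ms)
    return good / total if total else 1.0

print(f"{slo_attainment(spans):.1%}")  # 3 of 4 spans under 500ms -> 75.0%
```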

JustAnalytics supports three SLO types:

| SLO Type | Good Event Definition | Example |
| --- | --- | --- |
| Availability | HTTP status < 500 | 99.95% of requests succeed |
| Latency | Duration < threshold | 99.9% of requests under 500ms |
| Custom | Any boolean expression | 99% of payments process correctly |

Error Budgets

Every SLO comes with an error budget -- the amount of unreliability you can tolerate before violating your objective. JustAnalytics calculates and visualizes your error budget in real time.

For a 99.9% availability SLO over 30 days:

Total budget:     43.2 minutes of downtime
Budget consumed:  12.8 minutes (29.6%)
Budget remaining: 30.4 minutes (70.4%)
Burn rate:        0.98x (healthy)

The burn rate tells you how fast you're consuming your error budget relative to the window length. A burn rate of 1.0x means you'll exhaust your budget exactly at the end of the window; anything sustained above 1.0x means you're on track to miss the SLO.
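The numbers above fall out of simple arithmetic. A sketch of the budget and burn rate calculation for an availability SLO (function names are illustrative, not the platform's API):

```python
def error_budget_minutes(target, window_days=30):
    """Allowed downtime for the window: (1 - target) * window length."""
    return (1 - target) * window_days * 24 * 60

def burn_rate(consumed_minutes, budget_minutes, elapsed_days, window_days=30):
    """Budget consumed so far, relative to the budget 'earned' by elapsed
    time. 1.0x = on pace to exhaust the budget exactly at the window's end."""
    expected = budget_minutes * (elapsed_days / window_days)
    return consumed_minutes / expected

budget = error_budget_minutes(0.999)               # 43.2 minutes for 99.9%/30d
rate = burn_rate(12.8, budget, elapsed_days=9.07)  # ~0.98x, matching the example
print(f"budget={budget:.1f} min, burn rate={rate:.2f}x")
```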

Burn Rate Alerts

JustAnalytics implements the multi-window burn rate alerting strategy recommended in Google's Site Reliability Workbook:

| Alert Severity | Burn Rate | Long Window | Short Window | Action |
| --- | --- | --- | --- | --- |
| Page (critical) | 14.4x | 1 hour | 5 minutes | Wake someone up |
| Page (high) | 6x | 6 hours | 30 minutes | Investigate now |
| Ticket (medium) | 3x | 1 day | 2 hours | Fix this week |
| Ticket (low) | 1x | 3 days | 6 hours | Track and trend |

Multi-window alerting prevents false positives. A brief spike that resolves quickly won't page you at 3 AM, but a sustained degradation will.
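The multi-window condition only fires when both windows agree. A sketch of the check (burn-rate inputs are assumed to come from your metrics store):

```python
def should_page(long_burn, short_burn, threshold):
    """Fire only if BOTH windows exceed the threshold. The long window
    proves sustained impact; the short window proves the problem is
    still happening, so a spike that already resolved won't page."""
    return long_burn >= threshold and short_burn >= threshold

# Critical page: 14.4x over both 1 hour and 5 minutes
print(should_page(long_burn=16.0, short_burn=2.0, threshold=14.4))   # False: spike resolved
print(should_page(long_burn=16.0, short_burn=15.1, threshold=14.4))  # True: still burning
```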

SLO Dashboard

The SLO dashboard provides a single view of all your service level objectives:

  • SLO status grid -- green/yellow/red indicators for each SLO
  • Error budget timeline -- how budget has been consumed over the window
  • Budget forecast -- projected budget remaining at window end based on current burn rate
  • Incident correlation -- which incidents impacted which SLOs
  • Historical compliance -- SLO attainment over the past 3, 6, and 12 months

Incident Management

When things go wrong, you need a structured process -- not a Slack thread. JustAnalytics incident management provides a clear lifecycle for every incident, from detection to resolution to retrospective.

Incident Lifecycle

Triggered → Acknowledged → Investigating → Identified → Monitoring → Resolved
                                                                        ↓
                                                                   Retrospective

Each state transition is logged with a timestamp and the team member who made the change, creating a complete audit trail.

Creating Incidents

Incidents can be created in three ways:

1. Automatic (from alert rules)

When an alert rule fires, it can automatically create an incident:

Alert Rule: API Error Rate > 5%
Action: Create incident
  Severity: High
  Title: "API error rate spike: {{value}}%"
  Assign to: Backend On-Call

2. Automatic (from SLO burn rate)

When an SLO burn rate exceeds your configured threshold, an incident is created automatically:

SLO: API Latency SLO
Burn Rate Alert: 14.4x over 1 hour
Action: Create incident
  Severity: Critical
  Title: "API Latency SLO burn rate critical"
  Assign to: Platform On-Call

3. Manual

Any team member can create an incident from the dashboard:

POST /api/incidents
{
  "title": "Payment processing delays",
  "severity": "high",
  "description": "Multiple customers reporting slow checkout",
  "assignee": "backend-team"
}
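The same call is easy to script. A sketch that builds and validates the request body (the severity levels, base URL, and auth scheme are assumptions, so this stops short of actually sending):

```python
import json

def build_incident(title, severity, description, assignee):
    """Assemble the POST /api/incidents body shown above."""
    allowed = {"critical", "high", "medium", "low"}  # assumed severity levels
    if severity not in allowed:
        raise ValueError(f"severity must be one of {sorted(allowed)}")
    return {"title": title, "severity": severity,
            "description": description, "assignee": assignee}

payload = build_incident(
    "Payment processing delays", "high",
    "Multiple customers reporting slow checkout", "backend-team")
print(json.dumps(payload, indent=2))
# Send with any HTTP client (urllib.request, requests, ...) to
# POST /api/incidents with your project's auth header.
```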

Incident Timeline

Every incident has a timeline that records all activity:

[14:32:01] INCIDENT CREATED - Alert: API error rate > 5%
[14:32:02] NOTIFICATION SENT - Email to backend-oncall@company.com
[14:32:15] ACKNOWLEDGED - by Sarah Chen
[14:33:00] STATUS CHANGED - Investigating
[14:33:45] NOTE ADDED - "Seeing increased latency on database queries"
[14:35:12] LINKED - Trace ID: abc123def456
[14:38:00] STATUS CHANGED - Identified
[14:38:30] NOTE ADDED - "Root cause: connection pool exhaustion after deploy"
[14:42:00] ACTION - Rollback initiated
[14:45:00] STATUS CHANGED - Monitoring
[14:55:00] STATUS CHANGED - Resolved
[14:55:00] DURATION - 22 minutes 59 seconds

Linking Incidents to Observability Data

This is where JustAnalytics incident management differs from standalone tools like PagerDuty or OpsGenie. Because incidents live in the same platform as your traces, logs, errors, and metrics, you can link directly:

  • Link to traces -- attach the specific trace that shows the failure
  • Link to error groups -- associate the incident with the error that caused it
  • Link to log queries -- save the log search that helped diagnose the issue
  • Link to metric charts -- embed the dashboard panel that shows the impact
  • Link to session replays -- show the user experience during the incident

Incident Metrics

JustAnalytics tracks incident metrics over time:

| Metric | Definition |
| --- | --- |
| MTTA | Mean Time to Acknowledge |
| MTTI | Mean Time to Identify (root cause) |
| MTTR | Mean Time to Resolve |
| Incident frequency | Incidents per week/month |
| Severity distribution | Breakdown by severity level |
| SLO impact | Which SLOs were affected |

These metrics feed into team performance dashboards and retrospective reports.
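These metrics are derived from the timeline timestamps. A sketch computing time-to-acknowledge and time-to-resolve from the example timeline above (same-day timestamps assumed; the event shape is illustrative):

```python
from datetime import datetime

def parse(t):
    return datetime.strptime(t, "%H:%M:%S")

# Key events from the example incident timeline
created      = parse("14:32:01")
acknowledged = parse("14:32:15")
resolved     = parse("14:55:00")

tta = (acknowledged - created).total_seconds()  # time to acknowledge
ttr = (resolved - created).total_seconds()      # time to resolve

print(f"TTA: {tta:.0f}s, TTR: {int(ttr // 60)}m {int(ttr % 60)}s")
# MTTA/MTTR are simply these values averaged across incidents.
```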

Escalation Policies

Not every alert should go to the same person. Escalation policies define who gets notified, when, and in what order -- so critical issues always reach someone who can act on them.

Multi-Step Escalation

An escalation policy defines a series of notification steps:

Escalation Policy: Production Critical
Steps:
  - Step 1 (0 min):
      Notify: Current on-call (backend-rotation)
      Channels: Email, Push Notification
  - Step 2 (5 min, if not acknowledged):
      Notify: Backend team lead
      Channels: Email, Push Notification, SMS
  - Step 3 (15 min, if not acknowledged):
      Notify: Engineering manager
      Channels: Email, SMS, Phone call
  - Step 4 (30 min, if not acknowledged):
      Notify: CTO
      Channels: All channels
Repeat: After all steps, restart from Step 1
Max repeats: 3

On-Call Rotations

JustAnalytics supports on-call rotation schedules:

  • Weekly rotation -- rotate the primary on-call every week
  • Daily rotation -- for teams that prefer shorter shifts
  • Custom schedule -- define specific date ranges for each team member
  • Override -- temporarily assign on-call to someone outside the rotation
  • Handoff time -- configure when rotations switch (e.g., Monday 9 AM)

On-Call Rotation: Backend Team
Schedule: Weekly, Monday 9:00 AM UTC
Members:
  - Week 1: Sarah Chen
  - Week 2: Marcus Johnson
  - Week 3: Priya Sharma
  - Week 4: Alex Rodriguez
Override: March 31 - April 7: James Park (covering for Sarah)
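Resolving "who is on call right now" from a weekly rotation with overrides can be sketched as follows (the anchor date and helper are illustrative assumptions):

```python
from datetime import date

MEMBERS = ["Sarah Chen", "Marcus Johnson", "Priya Sharma", "Alex Rodriguez"]
ROTATION_START = date(2026, 3, 2)  # a Monday; assumed anchor for Week 1
OVERRIDES = [(date(2026, 3, 31), date(2026, 4, 7), "James Park")]

def on_call(today):
    """An override wins; otherwise rotate weekly from the anchor Monday."""
    for start, end, person in OVERRIDES:
        if start <= today <= end:
            return person
    weeks = (today - ROTATION_START).days // 7
    return MEMBERS[weeks % len(MEMBERS)]

print(on_call(date(2026, 4, 1)))   # James Park (override in effect)
print(on_call(date(2026, 4, 13)))  # back on the normal weekly rotation
```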

Notification Channels

Escalation steps can use any combination of notification channels:

| Channel | Configuration |
| --- | --- |
| Email | Team member's email address |
| Push notification | JustAnalytics mobile app |
| SMS | Phone number (Twilio integration) |
| Slack | Channel or DM via Slack webhook |
| Webhook | Custom HTTP endpoint |

Routing Rules

Route incidents to different escalation policies based on conditions:

Rule 1: IF service = "payment-service" AND severity = "critical"
  → Escalation Policy: Payment Critical
  → Additional: Notify finance-team Slack channel

Rule 2: IF service = "api-gateway" AND error_rate > 10%
  → Escalation Policy: Production Critical

Rule 3: IF source = "uptime-monitor"
  → Escalation Policy: Infrastructure On-Call

Default: → Escalation Policy: General Engineering
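Rule evaluation is first-match-wins with a fallback. A sketch mirroring the rules above (the incident record's field names are illustrative):

```python
def route(incident):
    """First matching rule wins; otherwise fall through to the default."""
    if incident.get("service") == "payment-service" and incident.get("severity") == "critical":
        return "Payment Critical"  # plus a finance-team Slack notification
    if incident.get("service") == "api-gateway" and incident.get("error_rate", 0) > 0.10:
        return "Production Critical"
    if incident.get("source") == "uptime-monitor":
        return "Infrastructure On-Call"
    return "General Engineering"

print(route({"service": "api-gateway", "error_rate": 0.22}))  # Production Critical
print(route({"service": "frontend"}))                         # General Engineering
```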

Workflow Automation

Beyond escalation, JustAnalytics supports automated workflows that trigger when incidents are created, updated, or resolved.

Auto-Remediation

Workflow: Auto-scale on High CPU
Trigger: Incident created where metric = "cpu_usage" AND value > 90%
Actions:
  - POST webhook to auto-scaler API
  - Add note to incident: "Auto-scale triggered"
  - Wait 5 minutes
  - Check if CPU < 70%
  - If yes: Resolve incident with note "Auto-scale resolved the issue"
  - If no: Escalate to Step 2
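The trigger-wait-check-resolve flow above can be sketched as a function; the auto-scaler call, metric getter, and incident actions are stand-ins, and a real workflow engine would run this asynchronously:

```python
def auto_remediate(trigger_scale, get_cpu, escalate, resolve, target=70.0):
    """Trigger the auto-scaler, then resolve or escalate based on a
    follow-up CPU check. In the real workflow, a 5-minute wait sits
    between triggering and re-checking (omitted in this sketch)."""
    trigger_scale()
    if get_cpu() < target:
        resolve("Auto-scale resolved the issue")
    else:
        escalate()

# Stubs standing in for the webhook, metrics store, and incident API:
events = []
auto_remediate(
    trigger_scale=lambda: events.append("scaled"),
    get_cpu=lambda: 55.0,
    escalate=lambda: events.append("escalated"),
    resolve=lambda note: events.append(f"resolved: {note}"),
)
print(events)  # ['scaled', 'resolved: Auto-scale resolved the issue']
```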

Incident Communication

Workflow: Status Page Update
Trigger: Incident severity = "critical" AND status = "investigating"
Actions:
  - Create status page incident
  - Post update: "We are investigating reports of {{incident.title}}"
  - On resolution: Post update "This issue has been resolved"

Retrospective Automation

Workflow: Post-Incident Review
Trigger: Incident resolved where severity IN ("critical", "high")
Actions:
  - Wait 24 hours
  - Create retrospective document from incident timeline
  - Assign to incident owner
  - Schedule review meeting (calendar integration)
  - Notify team channel

How It Compares to Datadog

Datadog's incident management and SLO features are the benchmark for enterprise observability. Here's where JustAnalytics stands:

| Feature | JustAnalytics | Datadog |
| --- | --- | --- |
| SLO creation | Yes | Yes |
| Error budget tracking | Yes | Yes |
| Burn rate alerts | Yes | Yes |
| Incident lifecycle | Yes | Yes |
| Incident timeline | Yes | Yes |
| Escalation policies | Yes | Via PagerDuty integration |
| On-call rotations | Yes | Via PagerDuty integration |
| Multi-channel notifications | Yes | Yes |
| Auto-remediation workflows | Basic | Advanced |
| Retrospective templates | Yes | Yes |
| Status page integration | Built-in | Separate product |
| Pricing | Included in plan | $23/host/month + add-ons |

The key difference: Datadog charges separately for incident management, SLOs, and on-call -- and for escalation, you typically need PagerDuty as well. JustAnalytics includes all of these in a single plan.

Getting Started with SLOs

Step 1: Define Your First SLO

Start with your most critical user-facing endpoint. A good first SLO is availability:

  1. Go to Monitoring > SLOs > Create SLO
  2. Select Availability as the type
  3. Choose your data source (spans from your API service)
  4. Set the target (start with 99.9%)
  5. Set the window (30 days rolling)

Step 2: Set Up Burn Rate Alerts

JustAnalytics will suggest default burn rate alert thresholds when you create an SLO. Review them and adjust based on your team's tolerance for alerts.

Step 3: Create an Escalation Policy

Define who should be notified when the SLO is at risk:

  1. Go to Settings > Escalation Policies > Create
  2. Add notification steps with increasing urgency
  3. Assign team members and channels to each step

Step 4: Connect Incidents to Your Workflow

Configure automatic incident creation from your SLO burn rate alerts, and link your escalation policy to route incidents to the right people.

Within an hour, you'll have a complete reliability management workflow -- SLO tracking, burn rate alerting, incident management, and escalation -- all in one platform.

Start your 7-day free trial and start managing reliability like the best engineering teams in the world.

JustAnalytics Team, Engineering Team

The engineering and product team behind JustAnalytics. We're on a mission to make web observability simpler, faster, and more private.
