SLOs, Incidents, and Escalation: Enterprise Observability Features
Define service level objectives, manage incidents with structured workflows, and configure multi-step escalation policies. Enterprise-grade reliability engineering in JustAnalytics.
Reliability Is a Feature
Your users don't care about your architecture. They care about whether your application works when they need it. Service Level Objectives (SLOs), incident management, and escalation policies are the tools that help engineering teams deliver on that expectation.
Until now, getting these capabilities meant adopting Datadog, PagerDuty, and a runbook tool -- three more subscriptions, three more dashboards, three more integration points. JustAnalytics brings SLOs, incident management, and escalation into the same platform where you already track errors, traces, logs, and uptime.
Service Level Objectives (SLOs)
An SLO defines a measurable target for your service's reliability. Instead of vague promises like "we aim for high availability," an SLO gives you a concrete number: 99.9% of API requests should succeed within 500ms over any 30-day rolling window.
Creating an SLO
In JustAnalytics, creating an SLO takes less than a minute:
Name: API Latency SLO
Description: 99.9% of API requests complete within 500ms
Type: Latency
Target: 99.9%
Window: 30 days (rolling)
Data Source: Spans where service = "api-gateway"
Good Event: Duration < 500ms
Total Event: All spans
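The good-event/total-event definition above boils down to a simple ratio. Here is a minimal sketch of that computation; the function names are illustrative, not the JustAnalytics API:

```python
# Hypothetical sketch: deriving the service level indicator (SLI) from
# good/total event counts, as defined in the SLO above.

def sli(good_events: int, total_events: int) -> float:
    """Return the service level indicator as a percentage."""
    if total_events == 0:
        return 100.0  # no traffic counts as compliant
    return 100.0 * good_events / total_events

def meets_slo(good_events: int, total_events: int, target_pct: float) -> bool:
    """True if the SLI meets or exceeds the SLO target."""
    return sli(good_events, total_events) >= target_pct

# 999,100 of 1,000,000 spans completed under 500ms -> SLI = 99.91%
print(meets_slo(999_100, 1_000_000, 99.9))  # True
```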
JustAnalytics supports three SLO types:
| SLO Type | Good Event Definition | Example |
|---|---|---|
| Availability | HTTP status < 500 | 99.95% of requests succeed |
| Latency | Duration < threshold | 99.9% of requests under 500ms |
| Custom | Any boolean expression | 99% of payments process correctly |
Error Budgets
Every SLO comes with an error budget -- the amount of unreliability you can tolerate before violating your objective. JustAnalytics calculates and visualizes your error budget in real time.
For a 99.9% availability SLO over 30 days:
Total budget: 43.2 minutes of downtime
Budget consumed: 12.8 minutes (29.6%)
Budget remaining: 30.4 minutes (70.4%)
Burn rate: 0.98x (healthy)
The burn rate tells you how fast you're consuming your error budget relative to the window. A burn rate of 1.0x means you'll exhaust your budget exactly at the end of the window; a rate sustained above 1.0x means you'll exhaust it early and miss the SLO.
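The arithmetic behind these numbers is straightforward. This sketch reproduces the figures above; the helper names are illustrative, not part of any product API:

```python
# Error budget arithmetic for a 99.9% SLO over a 30-day rolling window.

WINDOW_MINUTES = 30 * 24 * 60  # 43,200 minutes in 30 days

def error_budget_minutes(target_pct: float) -> float:
    """Total allowed downtime in the window for a given target."""
    return WINDOW_MINUTES * (1 - target_pct / 100)

def burn_rate(consumed_minutes: float, elapsed_fraction: float,
              target_pct: float) -> float:
    """Budget consumed relative to what a 1.0x pace would have consumed."""
    budget = error_budget_minutes(target_pct)
    return (consumed_minutes / budget) / elapsed_fraction

print(error_budget_minutes(99.9))  # 43.2 minutes, as above

# 12.8 minutes consumed about 30% of the way through the window:
print(round(burn_rate(12.8, 0.302, 99.9), 2))  # ~0.98x: healthy
```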
Burn Rate Alerts
JustAnalytics implements the multi-window, multi-burn-rate alerting strategy recommended in Google's Site Reliability Workbook:
| Alert Severity | Burn Rate | Long Window | Short Window | Action |
|---|---|---|---|---|
| Page (critical) | 14.4x | 1 hour | 5 minutes | Wake someone up |
| Page (high) | 6x | 6 hours | 30 minutes | Investigate now |
| Ticket (medium) | 3x | 1 day | 2 hours | Fix this week |
| Ticket (low) | 1x | 3 days | 6 hours | Track and trend |
Multi-window alerting prevents false positives. A brief spike that resolves quickly won't page you at 3 AM, but a sustained degradation will.
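The core of the multi-window check is that an alert fires only when both windows exceed the threshold. A minimal sketch of that logic, mirroring the Google SRE approach rather than the exact product implementation:

```python
# Multi-window burn rate check: the long window confirms the problem is
# significant; the short window confirms it is still happening.

def should_alert(long_window_rate: float, short_window_rate: float,
                 threshold: float) -> bool:
    return long_window_rate >= threshold and short_window_rate >= threshold

# Sustained outage: both windows hot -> page
print(should_alert(15.0, 16.2, threshold=14.4))  # True

# Old spike that has since recovered: short window cool -> no page
print(should_alert(15.0, 0.4, threshold=14.4))   # False
```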
SLO Dashboard
The SLO dashboard provides a single view of all your service level objectives:
- SLO status grid -- green/yellow/red indicators for each SLO
- Error budget timeline -- how budget has been consumed over the window
- Budget forecast -- projected budget remaining at window end based on current burn rate
- Incident correlation -- which incidents impacted which SLOs
- Historical compliance -- SLO attainment over the past 3, 6, and 12 months
Incident Management
When things go wrong, you need a structured process -- not a Slack thread. JustAnalytics incident management provides a clear lifecycle for every incident, from detection to resolution to retrospective.
Incident Lifecycle
Triggered → Acknowledged → Investigating → Identified → Monitoring → Resolved → Retrospective
Each state transition is logged with a timestamp and the team member who made the change, creating a complete audit trail.
Creating Incidents
Incidents can be created in three ways:
1. Automatic (from alert rules)
When an alert rule fires, it can automatically create an incident:
Alert Rule: API Error Rate > 5%
Action: Create incident
Severity: High
Title: "API error rate spike: {{value}}%"
Assign to: Backend On-Call
2. Automatic (from SLO burn rate)
When an SLO burn rate exceeds your configured threshold, an incident is created automatically:
SLO: API Latency SLO
Burn Rate Alert: 14.4x over 1 hour
Action: Create incident
Severity: Critical
Title: "API Latency SLO burn rate critical"
Assign to: Platform On-Call
3. Manual
Any team member can create an incident from the dashboard:
POST /api/incidents
{
"title": "Payment processing delays",
"severity": "high",
"description": "Multiple customers reporting slow checkout",
"assignee": "backend-team"
}
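The same request can be made from a script using only the Python standard library. The endpoint and payload come from the example above; the base URL and bearer-token auth header are assumptions for illustration:

```python
# Hedged sketch: creating an incident via the API shown above.
import json
import urllib.request

payload = {
    "title": "Payment processing delays",
    "severity": "high",
    "description": "Multiple customers reporting slow checkout",
    "assignee": "backend-team",
}

req = urllib.request.Request(
    "https://app.justanalytics.example/api/incidents",  # assumed base URL
    data=json.dumps(payload).encode(),
    headers={
        "Content-Type": "application/json",
        "Authorization": "Bearer <api-token>",  # assumed auth scheme
    },
    method="POST",
)

# Uncomment to actually send the request:
# with urllib.request.urlopen(req) as resp:
#     incident = json.load(resp)
```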
Incident Timeline
Every incident has a timeline that records all activity:
[14:32:01] INCIDENT CREATED - Alert: API error rate > 5%
[14:32:02] NOTIFICATION SENT - Email to backend-oncall@company.com
[14:32:15] ACKNOWLEDGED - by Sarah Chen
[14:33:00] STATUS CHANGED - Investigating
[14:33:45] NOTE ADDED - "Seeing increased latency on database queries"
[14:35:12] LINKED - Trace ID: abc123def456
[14:38:00] STATUS CHANGED - Identified
[14:38:30] NOTE ADDED - "Root cause: connection pool exhaustion after deploy"
[14:42:00] ACTION - Rollback initiated
[14:45:00] STATUS CHANGED - Monitoring
[14:55:00] STATUS CHANGED - Resolved
[14:55:00] DURATION - 22 minutes 59 seconds
Linking Incidents to Observability Data
This is where JustAnalytics incident management differs from standalone tools like PagerDuty or Opsgenie. Because incidents live in the same platform as your traces, logs, errors, and metrics, you can link directly:
- Link to traces -- attach the specific trace that shows the failure
- Link to error groups -- associate the incident with the error that caused it
- Link to log queries -- save the log search that helped diagnose the issue
- Link to metric charts -- embed the dashboard panel that shows the impact
- Link to session replays -- show the user experience during the incident
Incident Metrics
JustAnalytics tracks incident metrics over time:
| Metric | Definition |
|---|---|
| MTTA | Mean Time to Acknowledge -- average time from incident creation to first acknowledgement |
| MTTI | Mean Time to Identify -- average time from creation to root-cause identification |
| MTTR | Mean Time to Resolve -- average time from creation to resolution |
| Incident frequency | Incidents per week/month |
| Severity distribution | Breakdown by severity level |
| SLO impact | Which SLOs were affected |
These metrics feed into team performance dashboards and retrospective reports.
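These aggregates fall straight out of the timeline timestamps. A minimal sketch, using the 22m59s incident above (times in seconds since creation) plus one made-up incident for the average:

```python
# Computing MTTA and MTTR from incident timeline timestamps.
from statistics import mean

incidents = [
    {"created": 0, "acknowledged": 14, "resolved": 1379},  # incident above
    {"created": 0, "acknowledged": 60, "resolved": 600},   # hypothetical
]

mtta = mean(i["acknowledged"] - i["created"] for i in incidents)
mttr = mean(i["resolved"] - i["created"] for i in incidents)
print(mtta, mttr)  # 37.0 989.5
```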
Escalation Policies
Not every alert should go to the same person. Escalation policies define who gets notified, when, and in what order -- so critical issues always reach someone who can act on them.
Multi-Step Escalation
An escalation policy defines a series of notification steps:
Escalation Policy: Production Critical
Steps:
- Step 1 (0 min):
Notify: Current on-call (backend-rotation)
Channels: Email, Push Notification
- Step 2 (5 min, if not acknowledged):
Notify: Backend team lead
Channels: Email, Push Notification, SMS
- Step 3 (15 min, if not acknowledged):
Notify: Engineering manager
Channels: Email, SMS, Phone call
- Step 4 (30 min, if not acknowledged):
Notify: CTO
Channels: All channels
Repeat: After all steps, restart from Step 1
Max repeats: 3
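The escalation logic above can be modeled as a lookup over ordered (delay, target) steps: given how long the incident has gone unacknowledged, the last step whose delay has elapsed is active. A sketch with illustrative names:

```python
# Which escalation step is active after N unacknowledged minutes?
# Step delays mirror the "Production Critical" policy above.

STEPS = [
    (0,  "Current on-call (backend-rotation)"),
    (5,  "Backend team lead"),
    (15, "Engineering manager"),
    (30, "CTO"),
]

def active_step(minutes_unacknowledged: float) -> str:
    current = STEPS[0][1]
    for delay, target in STEPS:
        if minutes_unacknowledged >= delay:
            current = target  # this step's delay has elapsed
    return current

print(active_step(0))   # Current on-call (backend-rotation)
print(active_step(7))   # Backend team lead
print(active_step(45))  # CTO
```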
On-Call Rotations
JustAnalytics supports on-call rotation schedules:
- Weekly rotation -- rotate the primary on-call every week
- Daily rotation -- for teams that prefer shorter shifts
- Custom schedule -- define specific date ranges for each team member
- Override -- temporarily assign on-call to someone outside the rotation
- Handoff time -- configure when rotations switch (e.g., Monday 9 AM)
On-Call Rotation: Backend Team
Schedule: Weekly, Monday 9:00 AM UTC
Members:
- Week 1: Sarah Chen
- Week 2: Marcus Johnson
- Week 3: Priya Sharma
- Week 4: Alex Rodriguez
Override: March 31 - April 7: James Park (covering for Sarah)
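Resolving the current primary from a weekly rotation is simple modular arithmetic over weeks since a handoff anchor. The members match the example; the anchor date is an assumption for illustration (overrides would be checked before this lookup):

```python
# Who is primary on-call for a weekly rotation with Monday 09:00 UTC handoff?
from datetime import datetime, timezone, timedelta

MEMBERS = ["Sarah Chen", "Marcus Johnson", "Priya Sharma", "Alex Rodriguez"]
# Assumed anchor: the Monday 09:00 UTC when Week 1 began.
ANCHOR = datetime(2025, 1, 6, 9, 0, tzinfo=timezone.utc)

def on_call(now: datetime) -> str:
    weeks = int((now - ANCHOR) // timedelta(weeks=1))
    return MEMBERS[weeks % len(MEMBERS)]

print(on_call(datetime(2025, 1, 8, 12, 0, tzinfo=timezone.utc)))   # Sarah Chen
print(on_call(datetime(2025, 1, 14, 12, 0, tzinfo=timezone.utc)))  # Marcus Johnson
```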
Notification Channels
Escalation steps can use any combination of notification channels:
| Channel | Configuration |
|---|---|
| Email | Team member's email address |
| Push notification | JustAnalytics mobile app |
| SMS | Phone number (Twilio integration) |
| Slack | Channel or DM via Slack webhook |
| Webhook | Custom HTTP endpoint |
Routing Rules
Route incidents to different escalation policies based on conditions:
Rule 1: IF service = "payment-service" AND severity = "critical"
→ Escalation Policy: Payment Critical
→ Additional: Notify finance-team Slack channel
Rule 2: IF service = "api-gateway" AND error_rate > 10%
→ Escalation Policy: Production Critical
Rule 3: IF source = "uptime-monitor"
→ Escalation Policy: Infrastructure On-Call
Default: → Escalation Policy: General Engineering
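Routing rules evaluate top to bottom, first match wins, with a fallback default. A sketch of the rules above as ordered (predicate, policy) pairs; the incident field names are illustrative:

```python
# First-match routing over the rules above.

RULES = [
    (lambda i: i.get("service") == "payment-service"
               and i.get("severity") == "critical", "Payment Critical"),
    (lambda i: i.get("service") == "api-gateway"
               and i.get("error_rate", 0) > 10,     "Production Critical"),
    (lambda i: i.get("source") == "uptime-monitor", "Infrastructure On-Call"),
]
DEFAULT = "General Engineering"

def route(incident: dict) -> str:
    for predicate, policy in RULES:
        if predicate(incident):
            return policy
    return DEFAULT

print(route({"service": "payment-service", "severity": "critical"}))  # Payment Critical
print(route({"source": "uptime-monitor"}))   # Infrastructure On-Call
print(route({"service": "search"}))          # General Engineering
```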
Workflow Automation
Beyond escalation, JustAnalytics supports automated workflows that trigger when incidents are created, updated, or resolved.
Auto-Remediation
Workflow: Auto-scale on High CPU
Trigger: Incident created where metric = "cpu_usage" AND value > 90%
Actions:
- POST webhook to auto-scaler API
- Add note to incident: "Auto-scale triggered"
- Wait 5 minutes
- Check if CPU < 70%
- If yes: Resolve incident with note "Auto-scale resolved the issue"
- If no: Escalate to Step 2
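Read as plain control flow, the workflow above is a branch on a post-wait health check. This sketch models the actions as injected callables (all hypothetical stand-ins for the webhook and incident APIs), which also makes the flow easy to dry-run:

```python
# Auto-remediation workflow as plain control flow. Every callable is a
# hypothetical stand-in, not a product API.

def remediate(cpu_usage, trigger_autoscale, add_note, wait, resolve, escalate):
    trigger_autoscale()                        # POST webhook to auto-scaler API
    add_note("Auto-scale triggered")
    wait(300)                                  # wait 5 minutes
    if cpu_usage() < 70:                       # did CPU recover?
        resolve("Auto-scale resolved the issue")
    else:
        escalate(step=2)

# Dry run with fakes: CPU recovered, so the incident is resolved.
log = []
remediate(
    cpu_usage=lambda: 55,
    trigger_autoscale=lambda: log.append("scale"),
    add_note=log.append,
    wait=lambda seconds: None,
    resolve=log.append,
    escalate=lambda step: log.append(f"escalate {step}"),
)
print(log)  # ['scale', 'Auto-scale triggered', 'Auto-scale resolved the issue']
```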
Incident Communication
Workflow: Status Page Update
Trigger: Incident severity = "critical" AND status = "investigating"
Actions:
- Create status page incident
- Post update: "We are investigating reports of {{incident.title}}"
- On resolution: Post update "This issue has been resolved"
Retrospective Automation
Workflow: Post-Incident Review
Trigger: Incident resolved where severity IN ("critical", "high")
Actions:
- Wait 24 hours
- Create retrospective document from incident timeline
- Assign to incident owner
- Schedule review meeting (calendar integration)
- Notify team channel
How It Compares to Datadog
Datadog's incident management and SLO features are the benchmark for enterprise observability. Here's where JustAnalytics stands:
| Feature | JustAnalytics | Datadog |
|---|---|---|
| SLO creation | Yes | Yes |
| Error budget tracking | Yes | Yes |
| Burn rate alerts | Yes | Yes |
| Incident lifecycle | Yes | Yes |
| Incident timeline | Yes | Yes |
| Escalation policies | Yes | Via PagerDuty integration |
| On-call rotations | Yes | Via PagerDuty integration |
| Multi-channel notifications | Yes | Yes |
| Auto-remediation workflows | Basic | Advanced |
| Retrospective templates | Yes | Yes |
| Status page integration | Built-in | Separate product |
| Pricing | Included in plan | $23/host/month + add-ons |
The key difference: Datadog charges separately for incident management, SLOs, and on-call -- and for escalation, you typically need PagerDuty as well. JustAnalytics includes all of these in a single plan.
Getting Started with SLOs
Step 1: Define Your First SLO
Start with your most critical user-facing endpoint. A good first SLO is availability:
- Go to Monitoring > SLOs > Create SLO
- Select Availability as the type
- Choose your data source (spans from your API service)
- Set the target (start with 99.9%)
- Set the window (30 days rolling)
Step 2: Set Up Burn Rate Alerts
JustAnalytics will suggest default burn rate alert thresholds when you create an SLO. Review them and adjust based on your team's tolerance for alerts.
Step 3: Create an Escalation Policy
Define who should be notified when the SLO is at risk:
- Go to Settings > Escalation Policies > Create
- Add notification steps with increasing urgency
- Assign team members and channels to each step
Step 4: Connect Incidents to Your Workflow
Configure automatic incident creation from your SLO burn rate alerts, and link your escalation policy to route incidents to the right people.
Within an hour, you'll have a complete reliability management workflow -- SLO tracking, burn rate alerting, incident management, and escalation -- all in one platform.
Start your 7-day free trial and start managing reliability like the best engineering teams in the world.
The engineering and product team behind JustAnalytics. We're on a mission to make web observability simpler, faster, and more private.