Workflow Automation
Automate responses to alerts, incidents, and errors with configurable triggers, conditions, and actions.
What Is Workflow Automation?
Workflow automation lets you define rules that automatically take actions when specific events occur in your system. Instead of manually creating incidents, assigning issues, or sending notifications, workflows handle repetitive operational tasks for you.
A workflow has three parts:
- Trigger -- the event that starts the workflow (e.g., an alert fires)
- Conditions -- optional filters that narrow when the workflow runs (e.g., only for production, only for SEV1)
- Actions -- what the workflow does (e.g., create an incident, send a webhook)
Triggers
Triggers define which events start a workflow. Each workflow has exactly one trigger.
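The sketch below shows one way an incoming event could be routed to the workflows it should start. The workflow fields (`id`, `trigger`, `customName`, `enabled`) and the `selectWorkflows` function are hypothetical and are not the product's API. Matching custom workflows on the event `name` is an assumption based on the canary example further down.

```javascript
// Hypothetical sketch: route an incoming event to the workflows whose single
// trigger matches it. Field names are illustrative, not the product's API.
const TRIGGERS = new Set([
  'alert_fired',
  'incident_declared',
  'error_new',
  'error_regression',
  'deploy_completed',
  'slo_budget_warning',
  'custom',
]);

function selectWorkflows(workflows, event) {
  if (!TRIGGERS.has(event.type)) {
    throw new Error(`Unknown trigger type: ${event.type}`);
  }
  return workflows.filter((wf) => {
    // Disabled workflows and workflows with a different trigger never run.
    if (!wf.enabled || wf.trigger !== event.type) return false;
    // Assumption: custom workflows additionally match on the event name.
    if (wf.trigger === 'custom') return wf.customName === event.name;
    return true;
  });
}

// Usage
const workflows = [
  { id: 'auto-incident', trigger: 'alert_fired', enabled: true },
  { id: 'canary', trigger: 'custom', customName: 'canary_deploy_failed', enabled: true },
  { id: 'paused', trigger: 'alert_fired', enabled: false },
];
console.log(selectWorkflows(workflows, { type: 'alert_fired' }).map((wf) => wf.id));
// prints [ 'auto-incident' ]
```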
alert_fired
Runs when any alert rule transitions to the firing state.
Available context:
- alert.name -- name of the alert rule
- alert.severity -- critical, warning, info
- alert.service -- affected service name
- alert.environment -- production, staging, etc.
- alert.metric -- the metric that breached the threshold
- alert.value -- the current metric value
- alert.threshold -- the configured threshold
incident_declared
Runs when an incident is created (manually or automatically).
Available context:
- incident.title -- incident title
- incident.severity -- SEV1, SEV2, SEV3, SEV4
- incident.services -- list of affected services
- incident.environment -- environment
- incident.commander -- assigned commander (if any)
error_new
Runs when a new error group is created (first occurrence of a unique error).
Available context:
- error.message -- error message
- error.type -- error type (TypeError, RangeError, etc.)
- error.service -- service where the error originated
- error.environment -- environment
- error.file -- source file path
- error.count -- occurrence count (always 1 for new errors)
error_regression
Runs when a previously resolved error group reappears in a new release.
Available context:
- error.message -- error message
- error.service -- service name
- error.resolvedInRelease -- the release where it was resolved
- error.regressedInRelease -- the release where it reappeared
- error.regressionCount -- how many times this error has regressed
deploy_completed
Runs when a new release is tracked.
Available context:
- deploy.version -- release version
- deploy.service -- service name
- deploy.environment -- environment
- deploy.commitCount -- number of commits in the release
slo_budget_warning
Runs when an SLO error budget drops below a threshold.
Available context:
- slo.name -- SLO name
- slo.service -- service name
- slo.budgetRemaining -- percentage remaining
- slo.burnRate -- current burn rate
- slo.target -- SLO target percentage
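The context fields listed above are what template variables can refer to. In the sketch below, the documented fields are kept in a lookup table so that typos in `{{...}}` variables can be caught before a workflow is saved. `CONTEXT_FIELDS` mirrors the lists above, and `unknownVariables` is a hypothetical helper. The product may expose fields beyond the ones documented here.

```javascript
// Documented context fields per trigger (copied from the lists above).
// Hypothetical helper: the product may expose more fields than listed here.
const CONTEXT_FIELDS = {
  alert_fired: ['alert.name', 'alert.severity', 'alert.service', 'alert.environment',
    'alert.metric', 'alert.value', 'alert.threshold'],
  incident_declared: ['incident.title', 'incident.severity', 'incident.services',
    'incident.environment', 'incident.commander'],
  error_new: ['error.message', 'error.type', 'error.service', 'error.environment',
    'error.file', 'error.count'],
  error_regression: ['error.message', 'error.service', 'error.resolvedInRelease',
    'error.regressedInRelease', 'error.regressionCount'],
  deploy_completed: ['deploy.version', 'deploy.service', 'deploy.environment',
    'deploy.commitCount'],
  slo_budget_warning: ['slo.name', 'slo.service', 'slo.budgetRemaining', 'slo.burnRate',
    'slo.target'],
};

// Return every {{variable}} in `template` that is not documented for `trigger`.
function unknownVariables(trigger, template) {
  // Custom events carry free-form context, so there is nothing to check against.
  if (!(trigger in CONTEXT_FIELDS)) return [];
  const known = new Set(CONTEXT_FIELDS[trigger]);
  const found = [...template.matchAll(/\{\{\s*([\w.]+)\s*\}\}/g)].map((m) => m[1]);
  return found.filter((name) => !known.has(name));
}

console.log(unknownVariables('alert_fired', 'Alert {{alert.name}} on {{alert.servce}}'));
// prints [ 'alert.servce' ]
```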
custom
Runs when your code sends a custom event via the API. Use this to trigger workflows from your own tooling.
// Fire a custom workflow trigger
await fetch('/api/ingest/workflow-event', {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
    'X-API-Key': 'YOUR_API_KEY',
  },
  body: JSON.stringify({
    event: 'custom',
    name: 'canary_deploy_failed',
    context: {
      service: 'payment-service',
      version: 'v2.3.1',
      failureReason: 'Error rate exceeded 5% threshold',
    },
  }),
});
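Custom event payloads are capped at 10 KB (see Limits and Quotas). The sketch below checks a payload's size before it is sent. It assumes the limit applies to the UTF-8 encoded JSON body and that 10 KB means 10,240 bytes; neither is documented. `buildWorkflowEvent` is a hypothetical helper that returns the same `fetch` options used above.

```javascript
// Hypothetical helper: build the request options for a custom workflow event
// and reject payloads over the documented 10 KB limit before sending.
// Assumption: the limit applies to the UTF-8 encoded JSON body (10 * 1024 bytes).
const MAX_PAYLOAD_BYTES = 10 * 1024;

function buildWorkflowEvent(apiKey, name, context) {
  const body = JSON.stringify({ event: 'custom', name, context });
  const size = new TextEncoder().encode(body).length; // byte length, not char count
  if (size > MAX_PAYLOAD_BYTES) {
    throw new Error(`Payload is ${size} bytes; limit is ${MAX_PAYLOAD_BYTES}`);
  }
  return {
    method: 'POST',
    headers: { 'Content-Type': 'application/json', 'X-API-Key': apiKey },
    body,
  };
}

// Usage (inside an async function): await fetch('/api/ingest/workflow-event', options);
const options = buildWorkflowEvent('YOUR_API_KEY', 'canary_deploy_failed', {
  service: 'payment-service',
  version: 'v2.3.1',
});
```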
Conditions
Conditions filter when a workflow should run after it's triggered. If all conditions match, the actions execute. If any condition fails, the workflow skips silently.
Severity Filter
Match on severity level:
condition: severity IN [SEV1, SEV2]
Works with: alert_fired, incident_declared
Service Filter
Match on one or more service names:
condition: service IN [payment-service, checkout-api]
Works with: all triggers
Environment Filter
Match on environment:
condition: environment = production
Works with: all triggers
Tag Filter
Match on custom tags:
condition: tags.team = platform
condition: tags.region = us-east-1
Works with: alert_fired, error_new, error_regression
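Dotted names such as `tags.team` refer to nested fields in the trigger context. The sketch below shows one way to resolve them. `resolvePath` and the shape of the context object (tags stored under a top-level `tags` key) are assumptions made for illustration.

```javascript
// Hypothetical helper: look up a dotted path like "tags.team" in a context object.
// Returns undefined when any segment is missing, so a condition on a missing
// tag simply fails to match instead of throwing.
function resolvePath(context, path) {
  return path.split('.').reduce(
    (value, key) => (value != null && typeof value === 'object' ? value[key] : undefined),
    context,
  );
}

// Assumed context shape: tags live under a top-level "tags" object.
const context = {
  alert: { name: 'HighLatency', service: 'checkout-api' },
  tags: { team: 'platform', region: 'us-east-1' },
};
console.log(resolvePath(context, 'tags.team')); // prints platform
console.log(resolvePath(context, 'tags.owner')); // prints undefined
```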
Metric Value Filter
Match on the metric value that triggered the alert:
condition: alert.value > 100
condition: alert.value BETWEEN 50 AND 200
Works with: alert_fired
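The filter expressions above use a small set of operators: `=`, `IN [...]`, `<`, `>`, and `BETWEEN ... AND ...`. The sketch below parses one expression and evaluates it against a flat map of field values. The grammar is inferred from the examples on this page. Treating `BETWEEN` as inclusive, as SQL does, is an assumption, and `evaluateCondition` is a hypothetical name.

```javascript
// Hypothetical evaluator for one condition expression, inferred from the examples
// on this page. `fields` maps names like "alert.value" to their values.
// Assumption: BETWEEN is inclusive on both ends, as in SQL.
function evaluateCondition(expr, fields) {
  let m;
  if ((m = expr.match(/^([\w.]+)\s+BETWEEN\s+(\S+)\s+AND\s+(\S+)$/))) {
    const v = Number(fields[m[1]]);
    return v >= Number(m[2]) && v <= Number(m[3]);
  }
  if ((m = expr.match(/^([\w.]+)\s+IN\s+\[(.*)\]$/))) {
    const options = m[2].split(',').map((s) => s.trim());
    return options.includes(String(fields[m[1]]));
  }
  if ((m = expr.match(/^([\w.]+)\s*(=|<|>)\s*(\S+)$/))) {
    const [, name, op, raw] = m;
    const value = fields[name];
    if (op === '=') return String(value) === raw;
    // Numeric comparison; a missing field becomes NaN and never matches.
    return op === '<' ? Number(value) < Number(raw) : Number(value) > Number(raw);
  }
  throw new Error(`Unsupported condition: ${expr}`);
}

const fields = { 'alert.value': 120, severity: 'SEV1', environment: 'production' };
console.log(evaluateCondition('alert.value BETWEEN 50 AND 200', fields)); // prints true
console.log(evaluateCondition('severity IN [SEV1, SEV2]', fields)); // prints true
console.log(evaluateCondition('environment = production', fields)); // prints true
```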
Budget Threshold
Match on remaining error budget:
condition: slo.budgetRemaining < 25
Works with: slo_budget_warning
Combining Conditions
Multiple conditions are combined with AND logic. All conditions must match for the workflow to execute.
Trigger: alert_fired
Conditions:
  severity IN [critical]
  AND service IN [payment-service, checkout-api]
  AND environment = production
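Because conditions are combined with AND, evaluation can stop at the first condition that fails. The sketch below models each condition as a hypothetical `{ field, op, value }` object instead of the text syntax above. In this sketch, an empty condition list matches every event.

```javascript
// Hypothetical model: each condition is { field, op, value }; op is '=' or 'IN'.
// Conditions are combined with AND, so every() stops at the first mismatch.
function conditionsMatch(conditions, fields) {
  return conditions.every(({ field, op, value }) => {
    const actual = fields[field];
    if (op === 'IN') return value.includes(actual);
    if (op === '=') return actual === value;
    throw new Error(`Unsupported operator: ${op}`);
  });
}

const conditions = [
  { field: 'severity', op: 'IN', value: ['critical'] },
  { field: 'service', op: 'IN', value: ['payment-service', 'checkout-api'] },
  { field: 'environment', op: '=', value: 'production' },
];

console.log(conditionsMatch(conditions, {
  severity: 'critical', service: 'checkout-api', environment: 'production',
})); // prints true
console.log(conditionsMatch(conditions, {
  severity: 'critical', service: 'checkout-api', environment: 'staging',
})); // prints false
console.log(conditionsMatch([], { severity: 'info' })); // prints true (no conditions)
```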
Actions
Actions define what happens when a workflow's trigger fires and all conditions are met. A workflow can have multiple actions that execute in parallel.
webhook
Send an HTTP POST request to an external URL. Use this to integrate with Slack, PagerDuty, custom tooling, or any system that accepts webhooks.
{
  "action": "webhook",
  "config": {
    "url": "https://hooks.slack.com/services/T00/B00/xxxx",
    "method": "POST",
    "headers": {
      "Content-Type": "application/json"
    },
    "body": {
      "text": "Alert fired: {{alert.name}} on {{alert.service}} ({{alert.severity}})",
      "channel": "#incidents"
    },
    "timeoutMs": 10000,
    "retries": 2
  }
}
Template variables (wrapped in {{ }}) are replaced with values from the trigger context.
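The `timeoutMs` and `retries` fields in the webhook config above, together with the limits table further down (10-second timeout, 2 retries with exponential backoff), suggest delivery that works roughly like the sketch below. This is a minimal sketch, not the product's implementation. The 500 ms base delay, retrying on every non-2xx response, and the `sendWebhook` name are all assumptions. The `fetchImpl` parameter lets the function run against a stub instead of a real endpoint.

```javascript
// Hypothetical webhook sender mirroring the timeoutMs / retries settings above.
// Assumptions: retry on network errors, timeouts, and non-2xx responses;
// backoff doubles from baseDelayMs (500 ms, then 1000 ms, ...).
async function sendWebhook(config, payload, fetchImpl = fetch, baseDelayMs = 500) {
  const { url, method = 'POST', headers = {}, timeoutMs = 10000, retries = 2 } = config;
  let lastError;
  for (let attempt = 0; attempt <= retries; attempt++) {
    if (attempt > 0) {
      // Exponential backoff before each retry.
      await new Promise((resolve) => setTimeout(resolve, baseDelayMs * 2 ** (attempt - 1)));
    }
    const controller = new AbortController();
    const timer = setTimeout(() => controller.abort(), timeoutMs); // per-attempt timeout
    try {
      const res = await fetchImpl(url, {
        method,
        headers,
        body: JSON.stringify(payload),
        signal: controller.signal,
      });
      if (res.ok) return { attempts: attempt + 1, status: res.status };
      lastError = new Error(`HTTP ${res.status}`);
    } catch (err) {
      lastError = err; // network error, or abort after timeoutMs
    } finally {
      clearTimeout(timer);
    }
  }
  throw lastError;
}

// Usage with a stub that fails once, then succeeds (no network needed).
let calls = 0;
const flakyFetch = async () => ({ ok: ++calls > 1, status: calls > 1 ? 200 : 503 });
sendWebhook({ url: 'https://example.invalid/hook', retries: 2 }, { text: 'hi' }, flakyFetch, 10)
  .then((result) => console.log(result)); // prints { attempts: 2, status: 200 }
```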
email
Send an email notification:
{
  "action": "email",
  "config": {
    "to": ["oncall@company.com", "platform-team@company.com"],
    "subject": "SEV1 Incident: {{incident.title}}",
    "body": "A SEV1 incident has been declared.\n\nServices: {{incident.services}}\nCommander: {{incident.commander}}\n\nView: {{incident.url}}"
  }
}
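Template variables such as those in the email body above are substituted the same way in every action. The sketch below makes three assumptions, none of them documented: variables are dotted paths into the trigger context, arrays such as `incident.services` are joined with ", ", and unknown variables are left in place. `renderTemplate` is a hypothetical name.

```javascript
// Hypothetical template renderer: replaces {{path.to.value}} with values from
// the trigger context. Assumptions: arrays are joined with ", " and unknown
// variables are left as-is so they are easy to spot.
function renderTemplate(template, context) {
  return template.replace(/\{\{\s*([\w.]+)\s*\}\}/g, (match, path) => {
    const value = path
      .split('.')
      .reduce((obj, key) => (obj != null ? obj[key] : undefined), context);
    if (value === undefined) return match;
    return Array.isArray(value) ? value.join(', ') : String(value);
  });
}

const context = {
  alert: { name: 'HighErrorRate', service: 'checkout-api', severity: 'critical' },
  incident: { services: ['payment-service', 'checkout-api'] },
};
console.log(renderTemplate('Alert fired: {{alert.name}} on {{alert.service}} ({{alert.severity}})', context));
// prints Alert fired: HighErrorRate on checkout-api (critical)
```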
create_incident
Automatically create an incident:
{
  "action": "create_incident",
  "config": {
    "title": "Auto: {{alert.name}} on {{alert.service}}",
    "severity": "SEV1",
    "services": ["{{alert.service}}"],
    "description": "Automatically created from alert {{alert.name}}.\nMetric: {{alert.metric}} = {{alert.value}} (threshold: {{alert.threshold}})"
  }
}
assign_issue
Assign an error group or incident to a team member using one of the strategies listed below. This example uses ownership rules, with a fallback user:
{
  "action": "assign_issue",
  "config": {
    "strategy": "ownership_rules",
    "fallback": "user_abc123"
  }
}
Strategy options:
- ownership_rules -- use CODEOWNERS / ownership rules to determine the assignee
- round_robin -- rotate assignment among team members
- specific_user -- always assign to a specific user
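Of the three strategies, `round_robin` is the only one that carries state: the next assignee depends on who was picked last. The sketch below assumes a fixed, ordered member list and keeps its cursor in memory, whereas a real service would persist it. `createRoundRobin` and the member IDs are hypothetical.

```javascript
// Hypothetical round-robin assigner: cycles through team members in order.
// Assumption: the cursor lives in memory; a real service would persist it.
function createRoundRobin(members) {
  if (members.length === 0) throw new Error('round_robin needs at least one member');
  let next = 0;
  return function assign() {
    const assignee = members[next];
    next = (next + 1) % members.length; // wrap around after the last member
    return assignee;
  };
}

const assign = createRoundRobin(['user_alice', 'user_bob', 'user_carol']);
console.log([assign(), assign(), assign(), assign()]);
// prints [ 'user_alice', 'user_bob', 'user_carol', 'user_alice' ]
```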
run_query
Execute a saved query and include the results in a notification or incident:
{
  "action": "run_query",
  "config": {
    "queryId": "saved_query_xyz",
    "attachTo": "incident",
    "format": "table"
  }
}
update_status
Update the status of a linked resource:
{
  "action": "update_status",
  "config": {
    "target": "error_group",
    "status": "acknowledged",
    "comment": "Auto-acknowledged by workflow"
  }
}
Visual Workflow Builder
Create and edit workflows visually in Dashboard > Settings > Workflows.
Building a Workflow
- Click Create Workflow
- Give it a name and optional description
- Select a trigger from the dropdown
- Add conditions by clicking Add Condition (each condition adds a filter row)
- Add one or more actions by clicking Add Action
- Toggle the workflow Enabled / Disabled
- Click Save
Workflow Canvas
The visual builder shows your workflow as a flowchart:
┌─────────────┐     ┌──────────────────┐     ┌─────────────────┐
│ Trigger     │────▶│ Conditions       │────▶│ Actions         │
│ alert_fired │     │ severity=critical│     │ create_incident │
│             │     │ env=production   │     │ webhook (Slack) │
└─────────────┘     └──────────────────┘     │ email (oncall)  │
                                             └─────────────────┘
Each block is clickable and opens a configuration panel.
Testing Workflows
Click Test on any workflow to simulate it with sample data. The test shows:
- Whether conditions matched
- What actions would execute
- The rendered templates (with variables replaced)
- Any errors in configuration
Testing does not execute real actions -- webhooks are not sent, incidents are not created.
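Conceptually, a test run evaluates everything but never calls the action handlers. The sketch below returns the four items a test report shows. The workflow shape, the simple `{ field, equals }` conditions, and `dryRun` itself are assumptions. To stay short, it supports only equality conditions and renders only top-level string config values.

```javascript
// Hypothetical dry run: evaluate conditions and render templates for each action,
// but never execute an action. Returns the four items the test report lists.
function dryRun(workflow, sampleContext) {
  const get = (path) =>
    path.split('.').reduce((obj, key) => (obj != null ? obj[key] : undefined), sampleContext);
  const errors = [];

  // Assumption: conditions are simple { field, equals } checks combined with AND.
  const conditionsMatched = workflow.conditions.every((c) => get(c.field) === c.equals);

  const rendered = workflow.actions.map((action) => {
    const config = {};
    for (const [key, value] of Object.entries(action.config)) {
      if (typeof value !== 'string') {
        config[key] = value; // non-string values are passed through unchanged
        continue;
      }
      config[key] = value.replace(/\{\{\s*([\w.]+)\s*\}\}/g, (match, path) => {
        const v = get(path);
        if (v === undefined) {
          errors.push(`${action.action}.${key}: unknown variable ${path}`);
          return match;
        }
        return String(v);
      });
    }
    return { action: action.action, config };
  });

  return {
    conditionsMatched,
    actionsToExecute: conditionsMatched ? rendered.map((a) => a.action) : [],
    rendered,
    errors,
  };
}

const report = dryRun(
  {
    conditions: [{ field: 'alert.environment', equals: 'production' }],
    actions: [{ action: 'create_incident', config: { title: 'Auto: {{alert.name}}', severity: 'SEV1' } }],
  },
  { alert: { name: 'HighLatency', environment: 'production' } },
);
console.log(report.rendered[0].config.title); // prints Auto: HighLatency
```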
Examples
Auto-Create Incident on Critical Alert
Name: Auto-Incident on Critical Alert
Trigger: alert_fired
Conditions:
  - severity = critical
  - environment = production
Actions:
  1. create_incident
     - title: "Auto: {{alert.name}} on {{alert.service}}"
     - severity: SEV1
  2. webhook (Slack)
     - channel: #incidents
     - message: "SEV1 incident created for {{alert.service}}"
Auto-Assign Errors by Service
Name: Assign Payment Errors
Trigger: error_new
Conditions:
  - service IN [payment-service, billing-service]
  - environment = production
Actions:
  1. assign_issue
     - strategy: ownership_rules
     - fallback: @payments-team-lead
  2. webhook (Slack)
     - channel: #payments-errors
     - message: "New error in {{error.service}}: {{error.message}}"
Notify on SLO Budget Depletion
Name: SLO Budget Warning
Trigger: slo_budget_warning
Conditions:
  - slo.budgetRemaining < 25
Actions:
  1. email
     - to: platform-team@company.com
     - subject: "SLO Budget Warning: {{slo.name}}"
  2. webhook (PagerDuty)
     - routing_key: "R00..."
     - severity: warning
Auto-Reopen on Regression
Name: Regression Alert
Trigger: error_regression
Conditions:
  - environment = production
Actions:
  1. update_status
     - target: error_group
     - status: reopened
  2. assign_issue
     - strategy: ownership_rules
  3. webhook (Slack)
     - channel: #regressions
     - message: "Regression: {{error.message}} reappeared in {{error.regressedInRelease}}"
Custom: Canary Deploy Failure
Name: Canary Rollback Notification
Trigger: custom (name = canary_deploy_failed)
Conditions:
  - context.service IN [payment-service, checkout-api]
Actions:
  1. create_incident
     - title: "Canary failed: {{context.service}} {{context.version}}"
     - severity: SEV2
  2. webhook (Slack)
     - message: "Canary deploy of {{context.service}} {{context.version}} failed: {{context.failureReason}}"
Limits and Quotas
| Resource | Limit |
|----------|-------|
| Workflows per site | 20 |
| Conditions per workflow | 10 |
| Actions per workflow | 5 |
| Webhook timeout | 10 seconds |
| Webhook retries | 2 (with exponential backoff) |
| Email recipients per action | 10 |
| Custom event payload size | 10 KB |
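The per-workflow limits in this table can be checked against a workflow definition before it is saved. The sketch below makes three assumptions: a workflow object has `conditions` and `actions` arrays, email actions list recipients in `config.to` (as in the email example above), and the webhook timeout and retry values are maximums rather than fixed settings. `validateLimits` is a hypothetical helper. The per-site limit of 20 workflows is not checked because it depends on the other workflows in the site.

```javascript
// Hypothetical pre-save check against the documented per-workflow limits.
// Assumption: webhook timeout and retries are treated as maximums.
const LIMITS = {
  conditionsPerWorkflow: 10,
  actionsPerWorkflow: 5,
  emailRecipientsPerAction: 10,
  webhookTimeoutMs: 10000,
  webhookRetries: 2,
};

function validateLimits(workflow) {
  const problems = [];
  if (workflow.conditions.length > LIMITS.conditionsPerWorkflow) {
    problems.push(`too many conditions (${workflow.conditions.length})`);
  }
  if (workflow.actions.length > LIMITS.actionsPerWorkflow) {
    problems.push(`too many actions (${workflow.actions.length})`);
  }
  workflow.actions.forEach((a, i) => {
    const recipients = (a.config.to || []).length;
    if (a.action === 'email' && recipients > LIMITS.emailRecipientsPerAction) {
      problems.push(`action ${i}: too many email recipients (${recipients})`);
    }
    if (a.action === 'webhook') {
      if ((a.config.timeoutMs ?? 0) > LIMITS.webhookTimeoutMs) problems.push(`action ${i}: timeoutMs exceeds 10000`);
      if ((a.config.retries ?? 0) > LIMITS.webhookRetries) problems.push(`action ${i}: retries exceeds 2`);
    }
  });
  return problems;
}

const problems = validateLimits({
  conditions: [],
  actions: [{ action: 'email', config: { to: Array.from({ length: 12 }, (_, i) => `user${i}@company.com`) } }],
});
console.log(problems); // prints [ 'action 0: too many email recipients (12)' ]
```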
Rate Limiting
Workflows are rate-limited to prevent runaway execution:
- Each workflow can fire at most once per minute for the same trigger context
- If an alert is flapping (firing and resolving rapidly), the workflow runs on the first firing and suppresses subsequent firings for 60 seconds
- Custom event triggers are limited to 100 per hour per site
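Both rate limits can be modeled with a per-key timestamp for the once-per-minute rule and a sliding one-hour window for custom events. This sketch rests on two assumptions. First, how the product derives the "same trigger context" key is not documented, so the caller supplies the key. Second, the hour is treated as a sliding window rather than a fixed one. Time is passed in explicitly to make the behavior easy to test. `createRateLimiter` is a hypothetical name.

```javascript
// Hypothetical in-memory rate limiter modeling the two documented rules:
//  - a workflow fires at most once per 60 s for the same trigger context key
//  - custom events are capped at 100 per hour per site (sliding window assumed)
const MINUTE_MS = 60 * 1000;
const HOUR_MS = 60 * MINUTE_MS;

function createRateLimiter() {
  const lastFired = new Map(); // `${workflowId}:${contextKey}` -> timestamp of last run
  const customEvents = new Map(); // siteId -> timestamps within the last hour

  return {
    allowWorkflow(workflowId, contextKey, now) {
      const key = `${workflowId}:${contextKey}`;
      const last = lastFired.get(key);
      // Suppressed firings do not reset the 60 s window.
      if (last !== undefined && now - last < MINUTE_MS) return false;
      lastFired.set(key, now);
      return true;
    },
    allowCustomEvent(siteId, now) {
      const recent = (customEvents.get(siteId) || []).filter((t) => now - t < HOUR_MS);
      const allowed = recent.length < 100;
      if (allowed) recent.push(now);
      customEvents.set(siteId, recent);
      return allowed;
    },
  };
}

// Usage: a flapping alert fires at t = 0 s, 20 s, and 61 s.
const limiter = createRateLimiter();
console.log([0, 20, 61].map((s) => limiter.allowWorkflow('auto-incident', 'checkout-api', s * 1000)));
// prints [ true, false, true ]
```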
Workflow Execution History
View the execution history for any workflow under Dashboard > Settings > Workflows > [Workflow] > History.
Each execution shows:
- Timestamp -- when it ran
- Trigger event -- the event that triggered it
- Conditions result -- which conditions matched (or didn't)
- Actions result -- success/failure for each action, with response details
- Duration -- how long the workflow took to execute
Use the history to debug workflows that aren't firing as expected or to audit automated actions.
Disabling and Deleting
- Disable a workflow to stop it from executing without deleting it. Useful during maintenance windows.
- Delete a workflow to permanently remove it. Execution history is retained for 30 days after deletion.
- Pause all -- a global kill switch under Dashboard > Settings > Workflows > Pause All Workflows stops all workflow execution site-wide.