When Stripe or AWS Goes Down: Detecting Third-Party Outages Before Your Users Do
Your app looks broken when a dependency dies. Learn to detect third-party outages with synthetic checks and error-spike correlation, in minutes rather than hours.
It was 2:47pm on a Thursday when our checkout conversion rate dropped from 3.2% to zero. Not low. Zero. The on-call engineer (me, unfortunately) spent the first twelve minutes convinced we'd shipped a bad deploy. The error logs showed StripeCardError: Your card was declined on every single transaction — which made no sense, because these were test cards in staging that had worked an hour earlier.
Stripe's status page? Green. All systems operational. Everything's fine, folks. (Sound familiar? We've all been there — which is why uptime monitoring that doesn't rely on vendor status pages is non-negotiable.)
At 3:14pm, Stripe finally acknowledged "elevated error rates" on their status page. By then I'd already rolled back a perfectly good deploy, restarted three services that didn't need restarting, and burned half the afternoon on a problem that wasn't ours to fix.
This happens constantly. A dependency dies, your app looks broken, and you waste hours proving "it's not us" before you can even start communicating with customers. So this post is about building a detection system that tells you it's a third-party outage within minutes — not after your support queue explodes.
Why Status Pages Lie (Or At Least Lag)
Status pages are manually updated. Someone at Stripe or AWS has to notice the problem, confirm it's real, decide it's worth announcing, draft the message, get approval, and publish. That whole dance takes 5-20 minutes on a good day. On a bad day — when the incident is partial, affecting only some customers or regions — the status page might never update at all.
I've watched AWS S3 have a regional outage for forty minutes before the status page acknowledged anything. The us-east-1 problems in December 2021 famously broke the AWS status page itself, so customers couldn't even check if AWS was down because the page that would tell them was... down.
Relying on status pages for incident detection is like relying on your customers to tell you when your site is slow. They will, eventually. But by then the damage is done.
The Two-Layer Detection Stack
Here's what actually works: synthetic checks against your dependencies, plus error-spike correlation against your own telemetry. The synthetic checks catch when the dependency is down. The error correlation proves the spike in your app is caused by that dependency, not your own code.
Layer 1: Synthetic Checks Against Dependency Endpoints
Synthetic monitoring means hitting an endpoint on a schedule and checking the response. Basic stuff. Most uptime tools do this for your own site — the insight is doing it for your dependencies too.
For Stripe, that means checking:
- https://api.stripe.com/v1/charges (core API health)
- https://dashboard.stripe.com (dashboard availability, which affects your support team)
- https://r.stripe.com (webhook delivery endpoint)
For AWS dependencies, it depends on what you use:
- https://s3.us-east-1.amazonaws.com (S3 regional)
- https://dynamodb.us-west-2.amazonaws.com (DynamoDB regional)
- https://sqs.eu-west-1.amazonaws.com (SQS regional)
You don't need to authenticate these checks. A HEAD request that returns a 200, 401, or 403 (the last two mean the request was rejected for missing credentials) means the service is up. A 503, timeout, or connection refused means trouble.
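A minimal sketch of such an unauthenticated probe, using only the Python standard library. The names `classify_status` and `probe` and the 5-second timeout are illustrative assumptions, not part of any monitoring product. The set of "up" statuses follows the rule above and also accepts 401, as the config below does.

```python
import urllib.error
import urllib.request

# Statuses that mean "the service answered": success or an auth rejection.
UP_STATUSES = {200, 401, 403}


def classify_status(status):
    """Map an HTTP status (or None for timeout/refused) to 'up' or 'down'."""
    if status is None:
        return "down"
    return "up" if status in UP_STATUSES else "down"


def probe(url, timeout=5.0):
    """Send an unauthenticated HEAD request and classify the outcome."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return classify_status(response.status)
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx; the status code is still on the exception.
        return classify_status(err.code)
    except OSError:
        # Covers URLError, DNS failure, connection refused, and timeouts.
        return classify_status(None)
```

Calling `probe("https://api.stripe.com/v1/charges")` on a schedule gives a one-word answer you can feed into alerting. Keeping the classification in its own function makes the rule easy to test without touching the network.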
In JustAnalytics, you'd set these up as HTTP monitors under Uptime → Monitors. Here's a typical config for Stripe:
monitors:
  - name: "Stripe API Health"
    url: "https://api.stripe.com/v1/charges"
    method: HEAD
    expected_status: [200, 401, 403]  # 401/403 = auth required, but API is up
    interval_seconds: 60
    regions: ["us-east", "eu-west", "ap-southeast"]
    alert_after_failures: 2

  - name: "Stripe Webhooks Endpoint"
    url: "https://r.stripe.com/healthcheck"
    method: GET
    expected_status: [200]
    interval_seconds: 60
    regions: ["us-east"]
    alert_after_failures: 2
The multi-region piece matters. If your check from us-east fails but eu-west succeeds, it's probably a regional issue — and you can tell your EU customers everything's fine while routing US traffic to a maintenance page.
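The regional-versus-global distinction can be sketched as a small function. This is an illustration only: `classify_scope` and its return labels are made-up names, and the input is assumed to be a mapping from region name to whether the latest check passed there.

```python
def classify_scope(results):
    """Classify an outage from per-region check results.

    results: dict mapping region name -> True if the check passed there.
    Returns (scope, failed_regions), where scope is 'healthy',
    'regional', or 'global'.
    """
    failed = sorted(region for region, ok in results.items() if not ok)
    if not failed:
        return ("healthy", [])
    if len(failed) == len(results):
        return ("global", failed)
    return ("regional", failed)


latest = {"us-east": False, "eu-west": True, "ap-southeast": True}
print(classify_scope(latest))  # ('regional', ['us-east'])
```

A "regional" result is the cue to scope your customer messaging and routing to the affected region rather than declaring a full outage.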
Layer 2: Error Correlation to Prove "Not Us"
Synthetic checks tell you a dependency is unhealthy. But you still need to prove that your error spike is caused by that dependency, not a bug you shipped at 2:45pm.
This is where error tracking pays off. When Stripe goes down, your errors will have a signature:
- Exception type: StripeAPIError or StripeConnectionError
- Stack trace pointing at your payment_service.py or checkout.js
- Error message mentioning timeouts, 503s, or connection refused
- Timing that correlates with your synthetic check failures
The correlation is the proof. If your error tracking dashboard shows a spike in StripeAPIError starting at 14:32, and your Stripe synthetic check failed at 14:31, you've got your answer in under two minutes.
Here's a Python example of error tagging that makes this correlation trivial:
import stripe
from justanalytics import track_error


def charge_card(amount, token, customer_id):
    try:
        return stripe.Charge.create(
            amount=amount,
            currency="usd",
            source=token,
            metadata={"customer_id": customer_id}
        )
    except stripe.error.APIConnectionError as e:
        track_error(
            exception=e,
            tags={
                "dependency": "stripe",
                "endpoint": "charges",
                "error_type": "connection",
            }
        )
        raise
    except stripe.error.APIError as e:
        track_error(
            exception=e,
            tags={
                "dependency": "stripe",
                "endpoint": "charges",
                "error_type": "api_error",
                "stripe_http_status": e.http_status,
            }
        )
        raise
That dependency: stripe tag is what makes the correlation instant. Filter your error dashboard by that tag, compare the spike timing to your synthetic check, done. You're not searching through stack traces hoping to spot the pattern.
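If you want to automate the comparison instead of eyeballing it, here is a rough sketch. It assumes you can export timestamps of errors carrying the dependency tag. The function names, the 10x factor, and the 5-minute window are illustrative choices, and the dates are placeholders.

```python
from datetime import datetime, timedelta


def first_spike_minute(error_times, baseline_per_minute, factor=10):
    """Return the first minute whose error count exceeds factor x baseline."""
    counts = {}
    for ts in error_times:
        minute = ts.replace(second=0, microsecond=0)
        counts[minute] = counts.get(minute, 0) + 1
    threshold = baseline_per_minute * factor
    for minute in sorted(counts):
        if counts[minute] > threshold:
            return minute
    return None


def correlates(check_failed_at, spike_started_at, window=timedelta(minutes=5)):
    """True if the error spike began within `window` of the check failure."""
    if spike_started_at is None:
        return False
    return abs(spike_started_at - check_failed_at) <= window


# Placeholder data: 30 tagged errors during 14:32, one stray error at 14:20.
errors = [datetime(2024, 1, 4, 14, 32, s) for s in range(30)]
errors.append(datetime(2024, 1, 4, 14, 20, 0))
spike = first_spike_minute(errors, baseline_per_minute=1)
print(correlates(datetime(2024, 1, 4, 14, 31), spike))  # True
```

Timing alone does not prove causation, so treat a positive result as strong evidence to combine with the exception type and stack trace, not as a verdict.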
Building the Alert Chain
Detection without alerting is just archaeology. (Ask me how many times I've caught an outage in logs three days later. Too many.) Here's the alert chain that works:
Tier 1: Warning (Slack channel, email)
- Any synthetic check fails 2x consecutively
- Error rate for a specific dependency tag exceeds 10x baseline
- Triggers: internal awareness, start investigating
Tier 2: Incident (PagerDuty, phone call)
- Synthetic check fails 3x consecutively (3+ minutes of outage)
- Error rate exceeds 50x baseline AND affects >5% of requests
- Triggers: wake someone up, start customer communication
Tier 3: Customer-Facing (status page update)
- Confirm the dependency is the cause (check their status page, Twitter, Down Detector)
- Post to your own status page: "We're experiencing issues with our payment processor. Payments may fail. We're monitoring the situation."
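The first two tiers are mechanical enough to sketch in code. The function below is an illustration of the thresholds listed above, not any tool's actual rule engine; `alert_tier` and its parameter names are assumptions. Tier 3 is left out on purpose because it depends on a human confirming the cause.

```python
def alert_tier(consecutive_failures, error_rate_multiple, affected_share):
    """Return 'incident', 'warning', or None.

    consecutive_failures: failed synthetic checks in a row
    error_rate_multiple:  current dependency error rate / baseline rate
    affected_share:       fraction of requests failing (0.05 == 5%)
    """
    # Tier 2: page someone.
    if consecutive_failures >= 3:
        return "incident"
    if error_rate_multiple > 50 and affected_share > 0.05:
        return "incident"
    # Tier 1: post to Slack or email.
    if consecutive_failures >= 2 or error_rate_multiple > 10:
        return "warning"
    return None


print(alert_tier(2, 1.0, 0.0))    # warning
print(alert_tier(0, 60.0, 0.10))  # incident
```

Note that a 60x error spike touching only 1% of requests stays a warning under these rules, which is the intended behavior of the AND condition in Tier 2.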
The customer communication part matters more than engineers usually think. (I'm guilty of ignoring this for years.) Your users don't care whose fault it is — they care that checkout doesn't work. Silence is worse than "we're looking into it," even when the fix is literally "wait for Stripe to come back."
The Edge Cases That Will Burn You
Partial Outages
The hardest outages to detect are partial ones. Stripe might be returning 503s for 20% of requests, or only for certain card types, or only for customers with billing addresses in Ohio. Your synthetic check passes. Your error rate goes up, but not by 50x.
The fix? Percentile monitoring on your dependency calls. Track p50, p95, p99 latency for external API calls. A partial outage often shows up as latency spikes before error rates spike — requests are retrying internally, or hitting overloaded servers, before they start failing outright. (If you're running ads alongside your SaaS, partial outages can also trigger false fraud signals — ClickzProtect can help distinguish real bot traffic from dependency-induced anomalies.)
In JustAnalytics, this surfaces in the APM traces. Filter by span name containing "stripe" or "aws", check the latency percentiles over the last hour. If p99 latency jumped from 200ms to 4,000ms but p50 is still 150ms, you've got a partial outage brewing.
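If you'd rather compute this from raw span durations, a minimal sketch follows. It uses the nearest-rank percentile definition, and the 5x tail and 1.5x median factors are arbitrary starting points that you would tune against your own traffic. All names here are illustrative.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of latencies (ms)."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def looks_partial(baseline, current, tail_factor=5.0, median_factor=1.5):
    """True when p99 blew up versus baseline but p50 stayed roughly flat."""
    tail_jumped = percentile(current, 99) >= tail_factor * percentile(baseline, 99)
    median_steady = percentile(current, 50) <= median_factor * percentile(baseline, 50)
    return tail_jumped and median_steady


# Placeholder latencies: most calls take 150 ms, the slowest 5% differ.
baseline = [150] * 95 + [200] * 5
current = [150] * 95 + [4000] * 5
print(percentile(current, 50), percentile(current, 99))  # 150 4000
print(looks_partial(baseline, current))                  # True
```

The median check matters: if p50 rises along with p99, you are looking at a broad slowdown or a full outage rather than a partial one.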
Cascading Failures
When AWS S3 goes down, your image uploads fail. Then your image processing queue backs up. Then your worker processes max out waiting on the queue. Then your web servers slow down because worker connections are exhausted. Now your whole app is slow and your Stripe calls are timing out — not because Stripe is down, but because your infrastructure is drowning.
This is where having all your observability in one place actually matters. You need to see the dependency error, the queue depth, the worker saturation, and the web server latency on the same timeline to trace the cascade. If your error tracking is in Sentry, your APM is in Datadog, your queues are in CloudWatch, and your uptime is in Pingdom — well, good luck correlating that in real-time. I've tried. It's miserable. (This fragmentation is exactly why we built JustAnalytics as an all-in-one observability platform.)
DNS and CDN Issues
Your synthetic check might be taking a different network path, or hitting a different DNS resolver, than your production servers. Cloudflare's June 2022 outage affected only a subset of its data centers, so the same hostname was reachable from some networks and unreachable from others. In a failure like that, your monitoring from us-east-1 can look fine while your production servers in us-west-2 can't reach stripe.com at all.
The fix is monitoring from the same infrastructure you serve production from, not just from your monitoring vendor's network. Run the check from inside your VPC, hitting the same DNS your app uses. Otherwise you're testing a different network path than production actually takes.
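A small sketch of a resolution check you could run on a production host, using the system resolver through the standard library. The `resolve` helper and its injectable `resolver` argument (there so the function can be tested without a network) are illustrative assumptions.

```python
import socket


def resolve(hostname, resolver=socket.getaddrinfo):
    """Return sorted unique IPs for hostname, or [] if resolution fails."""
    try:
        infos = resolver(hostname, 443, proto=socket.IPPROTO_TCP)
    except OSError:
        # socket.gaierror is a subclass of OSError.
        return []
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})


if __name__ == "__main__":
    addresses = resolve("api.stripe.com")
    print("resolved" if addresses else "DNS FAILURE", addresses)
```

An empty list from inside your VPC while your external monitor still reports success is exactly the mismatch this section describes.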
What To Do When You've Detected It
So your alert fires. Stripe is down. Now what?
First 2 minutes:
- Check their status page (knowing it might lag)
- Check Twitter/X for stripe down or stripe outage
- Check Down Detector for spike patterns
- Confirm your synthetic checks across multiple regions
Minutes 2-5:
- Post to your internal incident channel: "Investigating Stripe connectivity issues, customer impact TBD"
- Pull up error dashboard filtered by dependency:stripe, note start time and error rate
- If impact is high, draft a status page update
Minutes 5-15:
- If Stripe hasn't acknowledged yet but you're confident, post your status update anyway: "We're seeing elevated payment failures due to issues with our payment processor"
- Consider graceful degradation: can you queue payments for retry? Show a friendlier error? Switch to a backup processor?
- Monitor for recovery — your synthetic checks will tell you when they're back
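The "queue payments for retry" option can be sketched as below. This is a deliberately simplified in-memory illustration: `RetryQueue` and `charge_fn` are hypothetical names, the built-in `ConnectionError` stands in for whatever your payment client raises when the processor is unreachable, and a real version would need durable storage.

```python
from collections import deque


class RetryQueue:
    """Hold payment attempts that failed because the processor was unreachable."""

    def __init__(self, charge_fn):
        self._charge_fn = charge_fn  # hypothetical callable that performs the charge
        self._pending = deque()

    def submit(self, payment):
        """Try to charge now; on a connection failure, park it for later."""
        try:
            return self._charge_fn(payment)
        except ConnectionError:
            self._pending.append(payment)
            return None

    def drain(self):
        """Retry parked payments in order; stop at the first that still fails."""
        completed = []
        while self._pending:
            try:
                completed.append(self._charge_fn(self._pending[0]))
            except ConnectionError:
                break
            self._pending.popleft()
        return completed

    def __len__(self):
        return len(self._pending)
```

You would call `drain()` once your synthetic checks go green again. Any retry scheme for payments should send an idempotency key with each attempt so that a request that actually succeeded before the connection dropped is not charged twice.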
After recovery:
- Timeline the incident: when did it start (your detection), when did they acknowledge, when did they resolve, when did you recover
- Measure customer impact: how many failed checkouts, how much revenue affected
- Decide if you need redundancy: is this dependency critical enough to warrant a failover provider?
That last question is uncomfortable. Stripe is down maybe 2-3 hours per year total. Is that worth building a Braintree fallback? For most companies, honestly, no. The engineering cost outweighs the risk. But if you're doing $100K/hour in checkout volume, that math changes fast.
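The back-of-the-envelope version of that math, as a sketch. The 2.5 hours is the midpoint of the rough estimate above. The $150,000 annual cost of building and maintaining a fallback is a purely hypothetical figure for illustration, and the result is an upper bound because some customers will come back and retry.

```python
def exposed_volume(hours_down_per_year, checkout_volume_per_hour):
    """Checkout volume exposed to dependency downtime over one year."""
    return hours_down_per_year * checkout_volume_per_hour


def fallback_pays_off(hours_down_per_year, checkout_volume_per_hour, annual_fallback_cost):
    """True if exposed volume exceeds what the fallback costs per year."""
    exposed = exposed_volume(hours_down_per_year, checkout_volume_per_hour)
    return exposed > annual_fallback_cost


# $100K/hour in checkout volume, 2.5 hours of downtime per year.
print(exposed_volume(2.5, 100_000))              # 250000.0
print(fallback_pays_off(2.5, 100_000, 150_000))  # True
# A smaller shop at $2K/hour with the same hypothetical fallback cost.
print(fallback_pays_off(2.5, 2_000, 150_000))    # False
```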
The Infrastructure You Actually Need
Look, let me be direct about what this requires:
- Synthetic monitoring for your critical dependencies (3-10 endpoints)
- Error tracking with dependency tagging on external calls
- APM / tracing to see latency percentiles on external spans
- Alerting that can correlate across these signals
You can build this with separate tools — Pingdom for synthetics, Sentry for errors, Datadog for APM, PagerDuty for alerting. That's $200-500/month at startup scale, and the correlation happens in your head (or in a Slack thread) rather than in a dashboard.
Or you can use something like JustAnalytics where uptime monitoring, error tracking, and APM live in the same product. Pro at $49/month ($39 annual) covers 5 sites and 1M events with 1 year retention — and uptime monitoring is included, not a separate line item. The AI Command Center add-on ($25/month) can surface correlations automatically: "Error spike in payment_service correlates with Stripe API latency increase at 14:32."
Either way, the pattern is the same: synthetic checks catch the outage, error correlation proves the cause, alerting makes sure humans know. The tooling matters less than having all three layers. Pick your tools. Just have them.
A Real Incident Timeline
Here's what the Stripe outage from the opening story would have looked like with this setup:
- 14:31:00 — Stripe API synthetic check fails (us-east region)
- 14:31:05 — JustAnalytics logs the failure, starts 60-second countdown to second check
- 14:32:00 — Second synthetic check fails, alert fires to Slack: "Stripe API Health failing in us-east"
- 14:32:15 — Error tracking shows spike in StripeAPIError, auto-tagged with dependency:stripe
- 14:32:30 — Engineer opens dashboard, sees synthetic failure + error spike correlation
- 14:33:00 — Incident posted to internal channel: "Stripe API down, confirming scope"
- 14:35:00 — Status page updated: "Payment processing degraded due to third-party issues"
- 14:47:00 — Stripe acknowledges on their status page (we already knew 16 minutes earlier)
- 15:22:00 — Synthetic checks pass, error rate returns to baseline, incident resolved
Total time from outage to internal awareness: under 2 minutes. Total time from outage to customer communication: 4 minutes. Total time I wasted chasing ghosts: zero.
Compare that to the 12 minutes I actually lost — rolling back a working deploy, restarting services that were fine. That's the difference.
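For post-incident write-ups, the intervals can be computed directly from the timeline entries. A small sketch, using the timestamps from the timeline above; `minutes_between` is an illustrative helper name and it assumes all events fall on the same day.

```python
from datetime import datetime


def minutes_between(start, end, fmt="%H:%M:%S"):
    """Minutes elapsed between two same-day clock times."""
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return delta.total_seconds() / 60


outage_start = "14:31:00"
print(minutes_between(outage_start, "14:32:00"))  # 1.0   alert fires
print(minutes_between(outage_start, "14:35:00"))  # 4.0   status page updated
print(minutes_between(outage_start, "14:47:00"))  # 16.0  vendor acknowledges
print(minutes_between(outage_start, "15:22:00"))  # 51.0  incident resolved
```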
Frequently Asked Questions
How quickly can synthetic checks detect a third-party outage?
Synthetic checks running every 60 seconds will catch most outages within 2 minutes of the dependency going unhealthy. If you're monitoring critical payment endpoints like Stripe's /v1/charges, you'll know before the status page updates, which typically lags 5-20 minutes behind reality. The tradeoff is cost: running 60-second checks against 10 endpoints burns through your monitoring quota five times faster than 5-minute intervals.
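To put numbers on that tradeoff, here is a quick sketch of check volume. It assumes a 30-day month and that each region counts as a separate check, which may not match how your vendor bills; `checks_per_month` is an illustrative name.

```python
def checks_per_month(endpoints, interval_seconds, regions=1, days=30):
    """Total check executions per month for a given schedule."""
    per_day = 24 * 60 * 60 // interval_seconds
    return endpoints * regions * per_day * days


print(checks_per_month(10, 60))   # 432000
print(checks_per_month(10, 300))  # 86400
```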
Should I monitor the third-party status page or hit their API directly?
Hit their API directly. Status pages are manually updated, typically lag reality by 5-20 minutes, and sometimes never reflect partial outages at all. A synthetic check against Stripe's /v1/charges endpoint will fail the moment their API returns 503s, with no human in the loop. Status page monitoring is fine as a secondary signal, but don't rely on it as your primary detection.
What's the difference between detecting an outage and proving it's not us?
Detection tells you something is wrong. Proving "not us" requires correlation — your infrastructure is healthy, your code didn't change, but errors spiked at 14:32 and so did Stripe API latency in your traces. The combination of green synthetic checks on your own endpoints plus red checks on the dependency, plus error fingerprints pointing at the external call, is what lets you tell your VP "Stripe is down, not us" with confidence.
How do I avoid alert fatigue from third-party monitoring?
Set sensible thresholds. A single failed check shouldn't page anyone — networks hiccup. Two consecutive failures at 60-second intervals (so 2+ minutes of downtime) is a reasonable baseline for a warning. Three consecutive failures for a page. And group related checks: if Stripe's API, dashboard, and webhook endpoints all fail simultaneously, that's one incident, not three.
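The grouping idea can be sketched as follows. The function collapses failures that share a dependency and arrive close together into one incident. The tuple layout, the 120-second window, and the name `group_failures` are illustrative assumptions.

```python
def group_failures(failures, window_seconds=120):
    """Group check failures into incidents, one per dependency per burst.

    failures: iterable of (timestamp_seconds, dependency, check_name).
    Returns a list of dicts with dependency, started_at, last_seen, checks.
    """
    incidents = []
    open_by_dependency = {}
    for ts, dependency, check in sorted(failures):
        incident = open_by_dependency.get(dependency)
        if incident is None or ts - incident["last_seen"] > window_seconds:
            # No recent incident for this dependency: open a new one.
            incident = {"dependency": dependency, "started_at": ts,
                        "last_seen": ts, "checks": set()}
            incidents.append(incident)
            open_by_dependency[dependency] = incident
        incident["last_seen"] = ts
        incident["checks"].add(check)
    return incidents


failures = [
    (0, "stripe", "api"),
    (5, "stripe", "dashboard"),
    (10, "stripe", "webhooks"),
    (20, "aws-s3", "s3-us-east-1"),
    (1000, "stripe", "api"),
]
print(len(group_failures(failures)))  # 3
```

The three Stripe failures in the first ten seconds become a single incident, the S3 failure is its own incident, and the Stripe failure much later opens a new one.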
Try JustAnalytics
All-in-one observability in one under-5KB script: cookieless analytics + error tracking + APM + session replay + uptime + structured logs. Replaces GA4 + Sentry + Datadog + Pingdom + LogRocket. Free tier (100K events/mo), Pro $49/month ($39 annual).