Incident Response Playbook for Wide‑Scale CDN/Cloud Outages
A hands‑on playbook for SREs to detect, communicate, mitigate, and postmortem CDN/cloud outages caused by upstream providers in 2026.
If your users are seeing errors, page loads are frozen, or monitoring alarms flood your on‑call channel because an upstream CDN or cloud provider failed, you need a concise, repeatable playbook that reduces MTTR, preserves customer trust, and prevents recurrence.
Executive summary — Why this playbook matters in 2026
Late 2025 and early 2026 reinforced a simple truth: even dominant edge and cloud platforms are not infallible. High‑profile outages affecting Cloudflare, AWS and large public platforms exposed how quickly dependency chains can cascade. For platform owners and operators, the objective is not to eliminate third‑party risk (impossible) but to detect fast, communicate clearly, mitigate decisively, and learn thoroughly.
This playbook is focused specifically on upstream service outages (CDN/edge providers and cloud networking failures) and covers four phases: Detection, Communication, Mitigation, and Postmortem. It’s written for developers, SREs and IT admins running production services with real customers and SLAs.
Goals & measurable outcomes
- Reduce mean time to detection (MTTD) to under 3 minutes for customer‑impacting upstream failures.
- Reduce mean time to mitigation (MTTM) to under 30 minutes for partial recovery steps.
- Provide clear, auditable communications to customers and stakeholders within 15 minutes of detection.
- Deliver a postmortem with remediations and timelines within 72 hours.
Phase 1 — Detection: get alerted to upstream failures quickly
Relying solely on provider status pages or public social media is reactive and slow. Build detection that validates your users' experience across networks.
Active signals to monitor
- Synthetic user probes from multiple geographies and ISPs (HTTP/S, DNS resolution, and TCP/TLS) with short intervals (30–60s) and anomaly thresholds.
- Real user monitoring (RUM) to capture client‑side failures (JS errors, fetch failures, timeouts) and aggregate by region and ASN.
- Infrastructure telemetry: origin error rates (5xx), backend latency spikes, and cache hit ratios. A sudden global drop in cache hit ratio may indicate a CDN issue.
- DNS monitoring: authoritative, recursive, and ISP‑level resolve times and NXDOMAIN/servfail spikes.
- Third‑party status feeds: provider status APIs, pager/incident webhooks (subscribe to the Cloudflare and AWS health feeds), and independent aggregators like Downdetector.
Detection checklist
- Run synthetic probes from at least three cloud regions and two colo providers.
- Alert if >5% of RUM sessions show timeouts in a 1‑minute window.
- Alert if global 5xx rate increases by >200% over baseline for 2 consecutive minutes.
- Alert if DNS resolution errors exceed 0.5% of queries across probes.
Rule of thumb: detect in under 3 minutes, escalate in under 5.
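The checklist thresholds translate directly into alert predicates. A minimal sketch in Python (function names and input shapes are illustrative, not a specific monitoring API):

```python
def rum_timeout_alert(sessions_with_timeout, total_sessions):
    """Alert if >5% of RUM sessions in the 1-minute window show timeouts."""
    return total_sessions > 0 and sessions_with_timeout / total_sessions > 0.05

def error_rate_alert(recent_5xx_rates, baseline_5xx_rate):
    """Alert if the global 5xx rate exceeds baseline by >200% (i.e. more
    than 3x baseline) for 2 consecutive one-minute samples."""
    if len(recent_5xx_rates) < 2 or baseline_5xx_rate <= 0:
        return False
    threshold = baseline_5xx_rate * 3  # +200% over baseline
    return all(r > threshold for r in recent_5xx_rates[-2:])

def dns_error_alert(failed_queries, total_queries):
    """Alert if DNS resolution errors exceed 0.5% of probe queries."""
    return total_queries > 0 and failed_queries / total_queries > 0.005
```

Wiring these predicates into your alerting pipeline (rather than eyeballing dashboards) is what makes the under‑3‑minute detection target realistic.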
Phase 2 — Communication plan: who says what, when, and how
Clear communications prevent confusion and reduce support load. Prepare templates and an escalation path in advance.
Communication roles
- Incident Commander (IC): owns the incident lifecycle and public messaging.
- Tech Lead / SRE: assesses technical impact and runs mitigation tasks.
- Comms Lead: crafts customer messages and coordinates status page/feed updates.
- Support Lead: triages incoming tickets and posts standard responses.
Notification timeline
- Within 5–15 minutes: initial internal notification with known impact and next steps.
- Within 15 minutes: public status page entry (even if preliminary) and a short customer notification for critical customers.
- Every 30 minutes: status updates until resolution, then a final summary and postmortem ETA.
Customer message template (short)
Use this as a 1‑line status update on your status page, support portal, and social channels:
We are investigating an issue affecting content delivery and site access. Initial data indicates an upstream CDN/cloud provider outage. Impact: intermittent failures for some regions. Next update in 30 minutes. (Incident ID: INC‑YYYYMMDD‑NN)
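A small helper can stamp this template with a conforming incident ID so every channel posts the same string. A sketch following the INC‑YYYYMMDD‑NN convention from the template (the function names are illustrative):

```python
from datetime import date

TEMPLATE = (
    "We are investigating an issue affecting content delivery and site access. "
    "Initial data indicates an upstream CDN/cloud provider outage. "
    "Impact: {impact}. Next update in {next_update_minutes} minutes. "
    "(Incident ID: {incident_id})"
)

def incident_id(seq, on=None):
    """Build an INC-YYYYMMDD-NN identifier; seq is the day's incident number."""
    d = on or date.today()
    return f"INC-{d:%Y%m%d}-{seq:02d}"

def status_message(impact, seq, next_update_minutes=30, on=None):
    """Render the one-line status update for status page, portal, and social."""
    return TEMPLATE.format(impact=impact,
                           next_update_minutes=next_update_minutes,
                           incident_id=incident_id(seq, on))
```

Generating the message programmatically keeps the incident ID consistent across the status page, support macros, and internal channels.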
Phase 3 — Mitigation playbook: technical steps to reduce impact
Mitigation depends on architecture. The fastest wins are traffic steering, bypassing affected layers, and preserving critical write paths.
Immediate actions (first 15 minutes)
- Confirm blast radius — use RUM, synthetic probes, and logs to find affected regions, ASNs, and endpoints.
- Open an upstream incident channel with your provider(s): escalate through support/SE contacts and post incident IDs to your internal channel.
- Trigger DNS failover — if using multi‑CDN, temporarily lower DNS TTLs and shift traffic via weighted DNS or traffic steering (use the API to avoid manual delays).
- Bypass CDN for critical traffic — enable origin direct access or origin pull bypass for API endpoints that require availability (ensure origin IPs are access‑restricted with origin ACLs or mTLS).
- Enable degradations — turn off nonessential features (analytics, batch jobs, heavy JS) to reduce load on origin and networks.
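Confirming blast radius is mostly aggregation over failure telemetry. A sketch that groups failing RUM samples by region and ASN (the record shape is an assumption; adapt it to your RUM schema):

```python
from collections import Counter

def blast_radius(samples, min_failures=5):
    """Group failing RUM samples by (region, ASN) and return the hot spots.

    Each sample is assumed to be a dict like
    {"region": "eu-west", "asn": 13335, "ok": False}.
    """
    failures = Counter(
        (s["region"], s["asn"]) for s in samples if not s["ok"]
    )
    return {key: n for key, n in failures.items() if n >= min_failures}
```

If the hot spots cluster on a handful of ASNs or one provider's regions rather than your own origins, that supports an upstream diagnosis and informs which traffic to steer first.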
Medium term actions (15–60 minutes)
- Fail over to backup provider if available (multi‑CDN or secondary cloud region). Use automated DNS steering or BGP/Anycast controls if you have them.
- Use signed URLs or origin auth to allow clients to reach origin while protecting against abuse.
- Temporarily raise cache TTLs to increase cache hit ratios for static assets, if the provider's edge is still partially functional.
- Implement rate limiting at perimeter or load balancer to protect origin during reconfiguration.
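Perimeter rate limiting during reconfiguration can be as simple as a token bucket per client. A minimal single‑process sketch (a real deployment would enforce this at the load balancer or edge, not in application code):

```python
import time

class TokenBucket:
    """Allow `rate` requests/second with bursts up to `capacity` tokens."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)
        self.clock = clock
        self.last = clock()

    def allow(self):
        # Refill tokens proportionally to elapsed time, capped at capacity.
        now = self.clock()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

The injectable `clock` makes the limiter testable without real waits, which matters when these scripts live in a runbook repo with CI.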
What NOT to do
- Don’t change BGP announcements unless your team is practiced in BGP failover — mistakes can increase outage scope.
- Don’t publish speculative root cause publicly. State facts about impact and the next update window.
- Avoid large config rollouts during an outage; prefer small, reversible changes and test on a canary first.
On‑call & escalation: roles, runbooks, and thresholds
On‑call must be structured with clear escalation. For upstream incidents, most responses are configuration and traffic steering — ensure on‑call has privileges and scripts ready.
Escalation matrix (example)
- Pager triggers -> On‑call SRE (5 min)
- No resolution -> Senior SRE and IC (10 min)
- Critical customer impact -> Eng Director & CSM briefed (15 min)
- Provider confirmed major outage -> Exec notification and customer AM briefing (30 min)
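The matrix above translates directly into data a pager bot can evaluate. A sketch (roles and thresholds mirror the example; the function name and the "minutes unresolved" interpretation are illustrative):

```python
# Minutes-unresolved threshold -> who gets pulled in at that point.
ESCALATION = [
    (5,  "On-call SRE"),
    (10, "Senior SRE + Incident Commander"),
    (15, "Eng Director + CSM"),
    (30, "Exec notification + customer AM briefing"),
]

def who_to_page(minutes_unresolved):
    """Return every escalation tier whose threshold has elapsed."""
    return [role for threshold, role in ESCALATION
            if minutes_unresolved >= threshold]
```

Encoding the matrix as data rather than tribal knowledge means the escalation policy can be reviewed, versioned, and exercised in tabletop drills.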
Runbook snippets (actionable commands)
Include tested scripts in your repo. Example quick checks:
- HTTP probe: curl -sS -w "%{http_code} %{time_total}\n" -o /dev/null https://yourdomain.example/health
- DNS check: dig +short @8.8.8.8 yourdomain.example A and dig +trace yourdomain.example
- Cache hit telemetry: query CDN API or check edge headers (e.g., CF-Cache-Status).
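Edge headers can be turned into a quick cache‑hit metric. A sketch that tallies Cloudflare's `CF-Cache-Status` values from sampled responses (HIT/MISS/EXPIRED/REVALIDATED/DYNAMIC are real Cloudflare values; the input shape is an assumption):

```python
def cache_hit_ratio(headers_list):
    """Compute the HIT ratio from a list of response-header dicts.

    Only cacheable outcomes count toward the ratio; DYNAMIC and missing
    headers are excluded. A sudden drop in this ratio across regions is
    one signal of an edge/CDN problem rather than an origin problem.
    """
    statuses = [h.get("CF-Cache-Status", "UNKNOWN") for h in headers_list]
    cacheable = [s for s in statuses
                 if s in ("HIT", "MISS", "EXPIRED", "REVALIDATED")]
    if not cacheable:
        return None
    return sum(1 for s in cacheable if s == "HIT") / len(cacheable)
```

Feed it the headers captured by your synthetic probes and compare against the same metric from the CDN's own API; divergence between the two is itself diagnostic.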
Postmortem & RCA: restoring trust and preventing recurrence
A timely, blameless postmortem is non‑negotiable. Use a standard template and publish internally and publicly when appropriate.
Postmortem template (deliver within 72 hours)
- Incident summary: timeline and impact (users, regions, SLAs).
- Root cause: what failed upstream and why it affected you.
- Detection and remediation timeline with timestamps (exact times for alerts, actions, and provider confirmations).
- Contributing factors: configuration, automation gaps, missing fallbacks.
- Corrective actions: tactical fixes and long‑term improvements with owners and due dates.
- Customer impact statement and communication log.
RCA methods & evidence
Correlate provider status messages with your telemetry. Capture:
- Edge logs and headers across regions
- DNS resolution traces
- Support tickets and provider incident IDs
- Network traces (pcap, if relevant) and route tables
Hardening roadmap — reduce upstream single points of failure
Outages are inevitable; dependence on one provider isn't. Prioritize mitigations that yield the most resilience per effort:
- Multi‑CDN & multi‑region deployments: use API‑driven DNS steering and health checks to fail over automatically.
- Edge independence: design graceful degradation so core APIs remain accessible without full CDN feature set.
- RPKI and BGP hygiene: accelerate adoption where you control BGP; monitor route anomalies.
- Automated runbooks: scripted playbooks in CI, accessible to on‑call with “one‑button” actions for common mitigations.
- Contractual SLAs and credits: ensure your agreements with providers include clear escalation paths and post‑incident deliverables.
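Automated multi‑CDN failover ultimately reduces to a steering decision driven by health checks. A provider‑agnostic sketch (provider names, thresholds, and the health‑check shape are illustrative; the resulting weights would be pushed to your DNS steering API):

```python
def dns_weights(health, primary="cdn-a", secondary="cdn-b"):
    """Return weighted-DNS weights from per-provider health-check pass rates.

    `health` maps provider name -> fraction of recent probes that passed.
    A healthy primary keeps all traffic; a degraded primary splits traffic;
    a failed primary sends everything to the secondary.
    """
    p = health.get(primary, 0.0)
    if p >= 0.95:
        return {primary: 100, secondary: 0}
    if p >= 0.50:
        return {primary: 50, secondary: 50}
    return {primary: 0, secondary: 100}
```

Keeping the decision logic this explicit makes it easy to review the thresholds, add hysteresis to avoid flapping, and dry‑run the policy against historical probe data.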
2026 trends you should plan for
In 2026, expect the following to shape incident strategy:
- Edge consolidation and regulatory overlays: greater adoption of national/regional edge clouds — plan for geo‑segmented outages and compliance drift.
- Increased BGP/RPKI tooling: better detection of route leaks and more automation to protect prefix announcements.
- Platform interdependence visibility: observability platforms are shipping built‑in dependency graphs and upstream incident correlation (use them).
- SLA and legal scrutiny: regulators are paying attention to systemic outages—publishable postmortems and documented mitigations are increasingly important.
Appendix — Practical checklists & templates
Incident starter checklist (first 10 minutes)
- Confirm incident and set severity.
- Notify IC and on‑call via designated channel with incident ID.
- Open provider support case and record provider incident ID.
- Post preliminary status page message.
- Run diagnostic probes and capture outputs (curl, dig, traceroute).
Mitigation checklist (first 30 minutes)
- Switch DNS TTL to low value (if safe) and prepare failover target.
- Enable origin bypass for critical endpoints and restrict origin access.
- Enable degraded mode for nonessential features.
- Keep stakeholders updated every 30 minutes.
Postmortem action items (example)
- Implement multi‑CDN with automated steering by June 2026. Owner: SRE Lead.
- Publish runbooks and one‑click scripts for origin bypass. Owner: Platform Eng.
- Subscribe to provider incident feeds and integrate with Slack/PagerDuty. Owner: Observability Team.
Final notes — Operational discipline wins
When upstream providers fail, the differentiator is not whether they fail, but how your organization responds. Prepared detection, a clear communication playbook, practiced mitigations, and a rigorous postmortem process reduce downtime, preserve customer trust, and prevent recurrence.
Actionable takeaway: Implement the detection checklist this week, create the incident templates in your status platform, and run a tabletop drill with your on‑call team within 30 days.
Call to action
If you want a customized incident playbook or a 60‑minute tabletop run for your team, contact our SRE consultants. We’ll map this playbook to your architecture, automate the critical runbooks, and help cut your MTTR in half.