October 21, 2025

Tech Deep Dive - When the Cloud Goes Dark: Observability After the AWS Outage

22 minutes

Episode 9: When the Cloud Goes Dark - Observability After the AWS Outage

Yesterday's AWS outage took down Snapchat, Coinbase, Ring, and even Amazon's own retail site, at an estimated cost in the hundreds of billions of dollars. More than 15 hours of chaos exposed a critical truth: most organizations are doing observability completely wrong.

THE RECEIPTS:
- October 20, 2025, 3:11 AM ET - DNS resolution failure in US-EAST-1
- 15 hours 12 minutes to full recovery
- 50,000+ simultaneous Downdetector reports at peak
- 70+ AWS services affected
- $2M/hour median cost of a high-impact outage for enterprises (New Relic 2025 Observability Forecast)
- Organizations with proper observability: roughly 50% lower outage costs
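
As a rough back-of-the-envelope illustration (not a figure quoted in the episode), the numbers above can be combined. This sketch assumes the $2M/hour median applies evenly across the full 15 hours 12 minutes, which is a simplification, and the function name is made up for the example.

```python
from fractions import Fraction

def outage_cost_usd(hours, minutes, cost_per_hour_usd, reduction=Fraction(0)):
    """Estimate outage cost for a given duration.

    `reduction` is the fraction of cost avoided, e.g. 1/2 for the
    50% figure listed above. Fractions keep the arithmetic exact.
    """
    duration_hours = Fraction(hours) + Fraction(minutes, 60)
    return float(duration_hours * cost_per_hour_usd * (1 - Fraction(reduction)))

# 15 h 12 min at the $2M/hour median
baseline = outage_cost_usd(15, 12, 2_000_000)
# Same outage with the 50% reduction applied
with_observability = outage_cost_usd(15, 12, 2_000_000, reduction=Fraction(1, 2))

print(f"baseline: ${baseline:,.0f}")
print(f"with observability: ${with_observability:,.0f}")
```

Under those assumptions, a single enterprise at the median would be looking at about $30.4M for the full outage, or about $15.2M with the 50% reduction.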

WHAT FAILED:
DNS couldn't resolve DynamoDB endpoints → EC2 launch failures → Network Load Balancer health checks failed → 70+ services cascaded down. Even AWS's own monitoring systems went offline.

REAL IMPACT:
- Coinbase locked out during trading hours
- Eight Sleep smart mattresses stuck in "relax mode"
- Disabled users lost Alexa-controlled lights
- Students couldn't submit assignments (Canvas down)
- Ring doorbells blind during security incidents
- Amazon warehouse workers sent to break rooms

THE THREE PILLARS OF OBSERVABILITY:
1. Metrics (Prometheus, CloudWatch, Azure Monitor)
2. Logs (ELK stack, Splunk, centralized logging)
3. Traces (OpenTelemetry, Jaeger for distributed systems)

CRITICAL LESSON: If your observability stack lives in the same cloud region you're monitoring, it goes down when you need it most. CloudWatch was down during the AWS outage.
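
One way to make this concrete is a small audit that flags any region whose only monitoring runs inside that same region. This is a minimal sketch: the inventory format, tool names, and region labels are hypothetical. A real audit would also need to cover transitive dependencies, for example a SaaS vendor that itself runs in US-EAST-1.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MonitoringTool:
    name: str
    hosted_in: str        # region where the tool itself runs
    watches: frozenset    # regions the tool is responsible for

def exposed_regions(tools):
    """Return regions that no tool hosted elsewhere is watching.

    A region is "exposed" when every tool that watches it also lives
    in it, so a regional outage takes out the monitoring as well.
    """
    regions = set()
    for tool in tools:
        regions |= tool.watches
    return sorted(
        region
        for region in regions
        if not any(
            region in tool.watches and tool.hosted_in != region
            for tool in tools
        )
    )

# Hypothetical inventory, for illustration only
inventory = [
    MonitoringTool("in-region-metrics-east", "aws:us-east-1", frozenset({"aws:us-east-1"})),
    MonitoringTool("in-region-metrics-west", "aws:us-west-2", frozenset({"aws:us-west-2"})),
    MonitoringTool("external-probes", "other-provider:eu", frozenset({"aws:us-west-2"})),
]

print(exposed_regions(inventory))
```

In this made-up inventory, only aws:us-east-1 is reported, because nothing outside that region is watching it.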

5 LESSONS FROM THE OUTAGE:
1. Multi-region is the new minimum (multi-AZ didn't save anyone)
2. Observability must be independent (Datadog, New Relic, Dynatrace)
3. DR plans are useless if untested (monthly drills, not yearly)
4. Dependency mapping is critical (know what fails when X fails)
5. Control plane resilience matters (AWS support system went offline)
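
Lesson 4 can be sketched as a graph walk: record which service depends on which, then ask what is transitively affected when one node fails. The graph below loosely mirrors the cascade described under WHAT FAILED. The edges and the "your-api" and "static-site" nodes are simplified assumptions for illustration, not an accurate map of AWS internals.

```python
from collections import deque

# service -> services it depends on (hypothetical, simplified)
DEPENDS_ON = {
    "dynamodb": ["dns"],
    "ec2-launch": ["dynamodb"],
    "nlb-health-checks": ["ec2-launch"],
    "your-api": ["nlb-health-checks"],
    "static-site": [],
}

def blast_radius(depends_on, failed):
    """Return every service that directly or transitively depends on `failed`."""
    # Invert the edges: dependency -> services that rely on it
    dependents = {}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(service)

    # Breadth-first walk outward from the failed component
    affected = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for service in dependents.get(current, []):
            if service not in affected:
                affected.add(service)
                queue.append(service)
    return sorted(affected)

print(blast_radius(DEPENDS_ON, "dns"))
```

With this toy graph, a DNS failure reaches everything except "static-site". Running the same query for each component gives you the "what fails when X fails" table before an outage forces the question.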

YOUR ACTION PLAN:
□ Audit observability stack independence
□ Map all cloud dependencies by region
□ Test DR plan THIS WEEK
□ Set up degradation alerts (not just "down" alerts)
□ Practice chaos engineering
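
A degradation alert, as opposed to a plain up/down check, compares current behavior with a baseline and fires before the service is fully unavailable. The sketch below shows the idea. The thresholds and field names are illustrative assumptions, not recommendations from the episode, so tune them against your own service-level objectives.

```python
from dataclasses import dataclass

@dataclass
class Window:
    """Request statistics for one evaluation window (hypothetical shape)."""
    requests: int
    errors: int
    p99_latency_ms: float

def classify(window, baseline_p99_ms, *,
             degraded_error_rate=0.01,
             down_error_rate=0.5,
             latency_factor=2.0):
    """Classify a window as healthy, degraded, down, or no-data."""
    if window.requests == 0:
        # No traffic at all is a signal in itself, so report it separately
        return "no-data"
    error_rate = window.errors / window.requests
    if error_rate >= down_error_rate:
        return "down"
    too_slow = window.p99_latency_ms >= latency_factor * baseline_p99_ms
    if error_rate >= degraded_error_rate or too_slow:
        return "degraded"
    return "healthy"

# Baseline p99 of 150 ms, assumed for the example
print(classify(Window(1000, 2, 180.0), 150.0))    # low errors, normal latency
print(classify(Window(1000, 5, 450.0), 150.0))    # latency has tripled
print(classify(Window(1000, 700, 900.0), 150.0))  # most requests failing
```

The middle case is the one a "down" alert misses: requests still succeed, but p99 latency has tripled, which is often the first visible symptom of an upstream dependency failing.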

"The prudent see danger and take refuge, but the simple keep going and pay the penalty." - Proverbs 27:12

NEXT EPISODE: CI/CD Pipeline Security - SBOM, artifact signing, secrets management

SERIES ARC: This builds on our DevSecOps → Kubernetes → Multi-Cloud → Platform Engineering foundation.

FIND US:
🌐 FaithFreedomTech.com
📝 DevSecOpsWithScott.com
📝 scottwhoughton.medium.com
🐦 @FaithFT_Podcast (X)
📱 @FaithFreedomTech (everywhere else)

Available on all podcast apps - Apple, Spotify, Google, Amazon Music, and more.

#DevSecOps #CloudArchitecture #SiteReliability #AWS #Observability #MultiCloud