Platform Engineering Playbook Podcast

Black Friday War Stories: Lessons from E-Commerce's Worst Days



Why do major retailers with unlimited budgets still crash on Black Friday? This episode dives into the graveyard of e-commerce outages—from J.Crew's $775,000 five-hour crash to the AWS typo that cost $150 million.

In this Black Friday special episode, we examine:

📊 THE HALL OF FAME CRASHES

• J.Crew 2018: 323,000 shoppers affected, $775,000 lost in 5 hours
• Walmart 2018: $9 million lost before Black Friday even started
• Best Buy 2014: Infrastructure optimized for desktop, but 78% of traffic was mobile
• Cloudflare 2024: 99.3% of Shopify stores frozen (6M+ domains)

💥 THE FAMOUS NON-BLACK-FRIDAY DISASTERS

• AWS S3 2017: One typo took down half the internet for 4+ hours
• GitLab 2017: 5 backup systems, none working, 300GB data deleted
• k8s.af: The community treasure trove of Kubernetes failures

🛡️ THE PLATFORM ENGINEER'S PLAYBOOK

• Load test at 5-10x (not 2x)
• Multi-CDN/multi-cloud strategies
• Monthly backup restore tests
• Practice chaos before it finds you
• Design mobile-first (78%+ of traffic)
• Safeguards on dangerous commands
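As a rough illustration of the last item, here is a minimal sketch of a guard that refuses to remove too much capacity in a single operation, the kind of check that limits the damage from a mistyped command. The names `RemovalGuard`, `max_fraction`, and `min_remaining`, and the 10% and 3-server thresholds, are assumptions made for this example. They are not taken from the episode or from any real tool.

```python
from dataclasses import dataclass


class UnsafeRemovalError(Exception):
    """Raised when a removal request exceeds the configured safety limits."""


@dataclass(frozen=True)
class RemovalGuard:
    # Hypothetical limits, chosen only for illustration.
    max_fraction: float = 0.10  # never remove more than 10% of servers at once
    min_remaining: int = 3      # always leave at least this many servers running

    def check(self, total: int, to_remove: int) -> int:
        """Return the remaining server count, or raise if the request is unsafe."""
        if to_remove <= 0:
            raise ValueError("to_remove must be a positive integer")
        if to_remove > total:
            raise UnsafeRemovalError(
                f"asked to remove {to_remove} servers but only {total} exist"
            )
        remaining = total - to_remove
        if remaining < self.min_remaining:
            raise UnsafeRemovalError(
                f"only {remaining} servers would remain; minimum is {self.min_remaining}"
            )
        if to_remove > total * self.max_fraction:
            raise UnsafeRemovalError(
                f"removing {to_remove} of {total} exceeds the "
                f"{self.max_fraction:.0%} per-operation limit"
            )
        return remaining


if __name__ == "__main__":
    guard = RemovalGuard()
    print(guard.check(total=100, to_remove=5))  # small, safe removal
    try:
        guard.check(total=100, to_remove=50)    # a typo-sized removal
    except UnsafeRemovalError as err:
        print(f"blocked: {err}")
```

The idea is that an operator tool calls `check` before acting, so an oversized request fails loudly instead of executing.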

The uncomfortable truth: These outages aren't caused by lack of budget or talent. They're caused by complexity, assumptions, and the gap between "should work" and "actually tested."

🔗 Full transcript & notes: https://platformengineeringplaybook.com/podcasts/00039-black-friday-war-stories

Episode Tags: Black Friday, e-commerce outages, AWS S3, GitLab, Kubernetes, platform engineering, SRE, incident response, chaos engineering, load testing


By vibesre