


In this episode of 10,000 Feet, Nate Sherman and Matt Glenn dive into the recent wave of major cloud outages impacting AWS, Azure, and Cloudflare, exploring what went wrong and why these failures are so disruptive. They discuss the growing risks of globally applied changes, the importance of error budgets, and strategies for building resilience in modern infrastructure. The conversation also covers best practices in site reliability engineering, monitoring, and alerting, as well as the role of AI and automation in change management. Packed with insights for architects, SREs, and IT leaders, this episode offers practical guidance on balancing speed, reliability, and risk in today’s cloud-driven world.
By Vervint
Rating: 5 (15 ratings)