AWS Morning Brief

Whiteboard Confessional: The Day IBM Cloud Dissipated


Listen Later

About Corey Quinn

Over the course of my career, I’ve worn many different hats in the tech world: systems administrator, systems engineer, director of technical operations, and director of DevOps, to name a few. Today, I’m a cloud economist at The Duckbill Group, the author of the weekly Last Week in AWS newsletter, and the host of two podcasts: Screaming in the Cloud and, you guessed it, AWS Morning Brief, which you’re about to listen to.

Links

  • CHAOSSEARCH
  • @QuinnyPig


Transcript


Corey: Welcome to AWS Morning Brief: Whiteboard Confessional. I’m Cloud Economist Corey Quinn. This weekly show exposes the semi-polite lie that is whiteboard architecture diagrams. You see, a child can draw a whiteboard architecture, but the real world is a mess. We discuss the hilariously bad decisions that make it into shipping products, the unfortunate hacks the real-world forces us to build, and that the best to call your staging environment is “theory”. Because invariably whatever you’ve built works in the theory, but not in production. Let’s get to it.



This episode is sponsored in part by ParkMyCloud, fellow worshipers at the altar of turned out [BLEEP] off. ParkMyCloud makes it easy for you to ensure you're using public cloud like the utility it's meant to be. just like water and electricity, You pay for most cloud resources when they're turned on, whether or not you're using them. Just like water and electricity, keep them away from the other computers. Use ParkMyCloud to automatically identify and eliminate wasted cloud spend from idle, oversized, and unnecessary resources. It's easy to use and start reducing your cloud bills. get started for free at parkmycloud.com/screaming.



Welcome to the AWS Morning Brief’s Whiteboard Confessional series. I am Cloud Economist Corey Quinn, and today's topic is going to be slightly challenging to talk about. One of the core tenants that we've always had around technology companies and working with SRE, or operations-type organizations is, full stop, you do not make fun of other people's downtime because today it's their downtime, and tomorrow it's yours. It's important. That's why we see the hashtag #HugOps on Twitter start to—well, not trend. It's not that well known but definitely happens fairly frequently when there's a well-publicized multi-hour outage that affects a company that people are familiar with. 



So, what we're going to talk about is an outage that happened several weeks ago for IBM Cloud. I want to point out some failings on IBM’s part but this is in the quote-unquote, “Sober light of day.” They are not currently experiencing an outage. They've had ample time to make public statements about the cause of the outage. And I've had time to reflect a little bit on what message I want to carry forward, given that there are definitely lessons for the rest of us to learn. HugOps is important, but it only goes so far, and at some point, it's important to talk about the failings of large companies and their associated response to crises so the rest of us can learn. 



Now, I'm about to dunk on them fairly hard, but I stand by the position that I'm taking, and I hope that it's interpreted in the constructive spirit that I intend it to. For background, IBM Cloud is IBM's purported hyperscale cloud offering. It was effectively stitched together from a variety of different acquisitions, most notable among them SoftLayer. I've had multiple consulting clients who are customers of IBM Cloud over the past few years, and their experience has been, to put it politely, a mixed bag. In practice, the invective that they would lobby against it would be something worse. 



Now, a month ago, something strange happened to IBM Cloud. Specifically, it went down. I don't mean that a service started having problems in a region. That tends to happen to every cloud provider, and it's important that we don't wind up beating them up unnecessarily for these things. No, IBM Cloud went down. And when I say that IBM Cloud went down, I mean, the entire thing effectively went off the internet. Their status page stopped working, for example. Every resource that people had inside of IBM Cloud was reportedly down. And this was relatively unheard of in the world of global cloud providers. 



Azure and GCP don't have the same isolated network boundary per region that AWS has, but even in those cases, we tend to see far more frequently rolling outages rather than global outages affecting everything simultaneously. It's a bit uncommon. What's strange is that their status page was down. Every point of access you had into looking at what was going on with IBM Cloud was down. Their Twitter accounts fell silent, other than pre-scheduled promotional tweets that were set to go out. It looked for all the world like IBM had just decided to pack up early, turn everything off on the way out of the office, and enjoy the night off. 



That obviously isn't what happened, but it was notable in that there was no communication for the first hour or so of the outage, and this was causing people to go more than a little bonkers. One of the pieces that was interesting to me, while this was happening, since it was impossible to get data out of this for anything substantive or authoritative, was I pulled up their marketing site. Now, the marketing site still worked—apparently, it does not live on top of IBM Cloud—but it listed a lot of their marquee customers and case studies. I went through a quick sampling, and American Airlines was the only site that had a big outage notification on the front of it. Everything else seemed to be working. 



So, either the outage was not as widespread as people thought, or a lot of their marquee customers are only using them for specific components. Either one of those is compelling and interesting, but we don't have a whole lot of data to feed back into the system to draw reasonable conclusions. Their status page itself, like it was mentioned, was down, and that's super bad. One of the early things you learn when running a large-scale system of any kind is the thing that tells you—and the world—that you're down cannot have a dependency on any of the things that you are personally running. The AWS status page had this, somewhat hilariously, during the S3 outage a few years ago, when they had trouble updating what was going on due to that outage. I would imagine that's no longer the case, but one does wonder. 



And most damning, and the reason I bring this up is the following day, they posted the following analysis on their site: “IBM is focused on external network provider issues as the cause of the disruption of IBM Cloud services on Tuesday, June 9th. All services have been restored. A detailed root cause analysis is underway. An investigation shows an external network provider flooded the IBM Cloud network with incorrect routing, resulting in severe congestion of traffic, and impacting IBM Cloud services, and our data centers. Migration steps have been taken to prevent a recurrence. Root ...

...more
View all episodesView all episodes
Download on the App Store

AWS Morning BriefBy Corey Quinn

  • 4.7
  • 4.7
  • 4.7
  • 4.7
  • 4.7

4.7

77 ratings


More shows like AWS Morning Brief

View all
Hanselminutes with Scott Hanselman by Scott Hanselman

Hanselminutes with Scott Hanselman

377 Listeners

Software Engineering Radio - the podcast for professional software developers by se-radio@computer.org

Software Engineering Radio - the podcast for professional software developers

266 Listeners

The Changelog: Software Development, Open Source by Changelog Media

The Changelog: Software Development, Open Source

285 Listeners

The Cloudcast by Massive Studios

The Cloudcast

154 Listeners

Thoughtworks Technology Podcast by Thoughtworks

Thoughtworks Technology Podcast

41 Listeners

Software Engineering Daily by Software Engineering Daily

Software Engineering Daily

628 Listeners

Soft Skills Engineering by Jamison Dance and Dave Smith

Soft Skills Engineering

275 Listeners

AWS Podcast by Amazon Web Services

AWS Podcast

200 Listeners

Data Engineering Podcast by Tobias Macey

Data Engineering Podcast

140 Listeners

Syntax - Tasty Web Development Treats by Wes Bos & Scott Tolinski - Full Stack JavaScript Web Developers

Syntax - Tasty Web Development Treats

987 Listeners

Screaming in the Cloud by Corey Quinn

Screaming in the Cloud

93 Listeners

Kubernetes Podcast from Google by Abdel Sghiouar, Kaslin Fields

Kubernetes Podcast from Google

181 Listeners

Practical AI by Practical AI LLC

Practical AI

190 Listeners

The Stack Overflow Podcast by The Stack Overflow Podcast

The Stack Overflow Podcast

63 Listeners

The Pragmatic Engineer by Gergely Orosz

The Pragmatic Engineer

52 Listeners