July 03, 2020

Whiteboard Confessional: The Day IBM Cloud Dissipated

13 minutes

About Corey Quinn

Over the course of my career, I’ve worn many different hats in the tech world: systems administrator, systems engineer, director of technical operations, and director of DevOps, to name a few. Today, I’m a cloud economist at The Duckbill Group, the author of the weekly Last Week in AWS newsletter, and the host of two podcasts: Screaming in the Cloud and, you guessed it, AWS Morning Brief, which you’re about to listen to.

Links

CHAOSSEARCH
@QuinnyPig

Transcript

Corey: Welcome to AWS Morning Brief: Whiteboard Confessional. I’m Cloud Economist Corey Quinn. This weekly show exposes the semi-polite lie that is whiteboard architecture diagrams. You see, a child can draw a whiteboard architecture, but the real world is a mess. We discuss the hilariously bad decisions that make it into shipping products, the unfortunate hacks the real world forces us to build, and that the best name for your staging environment is “theory”. Because invariably whatever you’ve built works in theory, but not in production. Let’s get to it.

This episode is sponsored in part by ParkMyCloud, fellow worshipers at the altar of turn that [BLEEP] off. ParkMyCloud makes it easy for you to ensure you're using public cloud like the utility it's meant to be. Just like water and electricity, you pay for most cloud resources when they're turned on, whether or not you're using them. Just like water and electricity, keep them away from the other computers. Use ParkMyCloud to automatically identify and eliminate wasted cloud spend from idle, oversized, and unnecessary resources. It's easy to use and start reducing your cloud bills. Get started for free at parkmycloud.com/screaming.

Welcome to the AWS Morning Brief’s Whiteboard Confessional series. I am Cloud Economist Corey Quinn, and today's topic is going to be slightly challenging to talk about. One of the core tenets that we've always had around technology companies and working with SRE, or operations-type organizations is, full stop, you do not make fun of other people's downtime because today it's their downtime, and tomorrow it's yours. It's important. That's why we see the hashtag #HugOps on Twitter start to—well, not trend. It's not that well known but definitely happens fairly frequently when there's a well-publicized multi-hour outage that affects a company that people are familiar with.

So, what we're going to talk about is an outage that happened several weeks ago for IBM Cloud. I want to point out some failings on IBM’s part but this is in the quote-unquote, “Sober light of day.” They are not currently experiencing an outage. They've had ample time to make public statements about the cause of the outage. And I've had time to reflect a little bit on what message I want to carry forward, given that there are definitely lessons for the rest of us to learn. HugOps is important, but it only goes so far, and at some point, it's important to talk about the failings of large companies and their associated response to crises so the rest of us can learn.

Now, I'm about to dunk on them fairly hard, but I stand by the position that I'm taking, and I hope that it's interpreted in the constructive spirit that I intend it to. For background, IBM Cloud is IBM's purported hyperscale cloud offering. It was effectively stitched together from a variety of different acquisitions, most notable among them SoftLayer. I've had multiple consulting clients who are customers of IBM Cloud over the past few years, and their experience has been, to put it politely, a mixed bag. In practice, the invective that they would lob against it would be something worse.

Now, a month ago, something strange happened to IBM Cloud. Specifically, it went down. I don't mean that a service started having problems in a region. That tends to happen to every cloud provider, and it's important that we don't wind up beating them up unnecessarily for these things. No, IBM Cloud went down. And when I say that IBM Cloud went down, I mean, the entire thing effectively went off the internet. Their status page stopped working, for example. Every resource that people had inside of IBM Cloud was reportedly down. And this was relatively unheard of in the world of global cloud providers.

Azure and GCP don't have the same isolated network boundary per region that AWS has, but even in those cases, we tend to see far more frequently rolling outages rather than global outages affecting everything simultaneously. It's a bit uncommon. What's strange is that their status page was down. Every point of access you had into looking at what was going on with IBM Cloud was down. Their Twitter accounts fell silent, other than pre-scheduled promotional tweets that were set to go out. It looked for all the world like IBM had just decided to pack up early, turn everything off on the way out of the office, and enjoy the night off.

That obviously isn't what happened, but it was notable in that there was no communication for the first hour or so of the outage, and this was causing people to go more than a little bonkers. One of the pieces that was interesting to me, while this was happening, since it was impossible to get data out of this for anything substantive or authoritative, was I pulled up their marketing site. Now, the marketing site still worked—apparently, it does not live on top of IBM Cloud—but it listed a lot of their marquee customers and case studies. I went through a quick sampling, and American Airlines was the only site that had a big outage notification on the front of it. Everything else seemed to be working.

So, either the outage was not as widespread as people thought, or a lot of their marquee customers are only using them for specific components. Either one of those is compelling and interesting, but we don't have a whole lot of data to feed back into the system to draw reasonable conclusions. Their status page itself, as mentioned, was down, and that's super bad. One of the early things you learn when running a large-scale system of any kind is the thing that tells you—and the world—that you're down cannot have a dependency on any of the things that you are personally running. The AWS status page had this, somewhat hilariously, during the S3 outage a few years ago, when they had trouble updating what was going on due to that outage. I would imagine that's no longer the case, but one does wonder.

And most damning, and the reason I bring this up, is the following day, they posted the following analysis on their site: “IBM is focused on external network provider issues as the cause of the disruption of IBM Cloud services on Tuesday, June 9th. All services have been restored. A detailed root cause analysis is underway. An investigation shows an external network provider flooded the IBM Cloud network with incorrect routing, resulting in severe congestion of traffic, and impacting IBM Cloud services, and our data centers. Mitigation steps have been taken to prevent a recurrence. Root ...
