The VOID

Episode 1: Honeycomb and the Kafka Migration



"We no longer felt confident about what the exact operational boundaries of our cluster were supposed to be."

In early 2021, observability company Honeycomb dealt with a series of outages related to their Kafka architectural migration, culminating in a 12-hour incident, an extremely long outage for the company. In this episode, we chat with two engineers involved in these incidents, Liz Fong-Jones and Fred Hebert, about the backstory summarized in the meta-analysis they published in May.

We cover a wide range of topics beyond the specific technical details of the incident (which we also discuss), including:

  • Complex socio-technical systems and the kinds of failures that can happen in them (they're always surprises)
  • Transparency and the benefits of companies sharing these outage reports
  • Safety margins, performance envelopes, and the role of expertise in developing a sense for them
  • Honeycomb's incident response philosophy and process
  • The cognitive costs of responding to incidents
  • What we can (and can't) learn from incident reports

Resources mentioned in the episode:

  • Kafka Migration and Lessons Learned by Honeycomb
  • Managing the Hidden Costs of Coordination by Laura Maguire
  • Above the Line, Below the Line by Richard Cook
  • "Those found responsible have been sacked": Some observations on the usefulness of error by Richard Cook and Christopher P. Nemeth


Published in partnership with Indeed.


The VOID, by Courtney Nash