How did it make sense?

Ep. 13 - Chad Todd. Can't you just turn it off and then on again: IT crash!



Learn how engineers and leaders turn breakdowns into breakthroughs to foster learning, resilience, and innovation.

Today I am joined by Chad Todd, a seasoned SRE Manager at CrowdStrike with over 20 years of experience in the tech industry. Chad shares a detailed account of a recent system incident, breaking it into the first story (what happened at the surface) and the second story, which uncovers the deeper, systemic factors and decision-making processes involved.

The conversation highlights the challenges of maintaining complex IT systems, the value of fostering a culture of learning from incidents, and the role of teamwork in troubleshooting under pressure. We discuss the importance of database maintenance, how latent conditions contribute to failures, and the art of adaptive problem-solving.

Recommended Resources:

  • A Tale of Two Stories: Contrasting Views of Patient Safety by Richard Cook and David Woods.
  • How Complex Systems Fail by Richard Cook.
  • Richard Cook’s presentation at the Velocity Conference.
  • Rasmussen’s framework on safety and resilience.
  • You can connect with Chad Todd on LinkedIn or on BlueSky and Twitter (X) @CTodKicker1

    Human in the System 

    Transforming teams. Unlocking human potential.

    Using principles from Human Factors (HF), High-Reliability Organisations (HRO), and Human and Organisational Performance (HOP), we develop and deliver highly immersive and impactful programmes using the High-Velocity Learning LAB (HVLL) concept. We give you the know-how, the tools and the support to make results stick and empower your people to achieve the extraordinary. We help you answer the question "How do we uncover those hidden stories in our organisation?"


     


    How did it make sense? By Gareth Lock