October 16, 2025

Root Cause Analysis vs. Resilience Engineering w/special guest Lorin Hochstein

59 minutes

A history of the 5 whys and root cause analysis from papers

Some critiques of the 5 whys:

From John Allspaw: https://www.oreilly.com/radar/the-infinite-hows/

From Alan J Card: https://qualitysafety.bmj.com/content/26/8/671

James Reason and the Swiss Cheese Model:

https://pmc.ncbi.nlm.nih.gov/articles/PMC8514562/

James Reason’s book Human Error: https://bookshop.org/p/books/human-error/9e06d8a100a07537?ean=9780521314190&next=t

And a classic from Sidney Dekker (et al.) on the implication of complexity within safety investigations:

https://www.sciencedirect.com/science/article/abs/pii/S0925753511000105?via%3Dihub

We always recommend the Howie Guide: https://howie-guide.pagerduty.com/

STAMP is starting to get popular: https://functionalsafetyengineer.com/introduction-to-stamp/

Google’s STAMP paper: https://www.usenix.org/publications/loginonline/evolution-sre-google

Google’s STAMP discussion on ProdCast: https://sre.google/prodcast/#season4-episode7

And the presentation at SREcon: https://www.usenix.org/conference/srecon25americas/presentation/klein

Nancy Leveson’s Google Scholar profile is always worth browsing: https://scholar.google.com/citations?user=78y4sEcAAAAJ&hl=en

Allspaw’s LinkedIn post that we quoted: https://www.linkedin.com/posts/jallspaw_important-reminders-about-learning-effectively-activity-7378775591447183360-c_eD

Lorin’s Law: https://surfingcomplexity.blog/2017/06/24/a-conjecture-on-why-reliable-systems-fail/

Want to talk more about this subject? We’re doing a live event co-sponsored by RISF and you can sign up for it here: https://resilienceinsoftware.org/networks/events/146485

This is Fine! A podcast about resilience engineering and software

By Colette Alexander and Clint Byrum
