Code[ish]

Chaos Engineering



Rick Newman interviews Mikolaj Pawlikowski, who recently wrote a book called "Chaos Engineering: Crash test your applications." The idea behind chaos engineering is to "break things on purpose" in your operational flow: you deliberately inject the kinds of failures that might occur in production before they happen on their own, so that you can anticipate them and implement workarounds and corrections. The practice is most often used for large, distributed systems, because they have many points of failure, but it can be useful in any architecture.

One of the obstacles to embracing chaos engineering is getting buy-in from teammates, or even managers. Even after a feature is complete and the unit tests are passing, it can be difficult to convince someone that resiliency work needs to continue, because there's no visible or tangible benefit to preparing for a disaster. Mikolaj suggests clearly laying out for colleagues the ways a system can fail, and the impact each failure can have on the application or business. Rather than fearmongering, it can be useful to point to other companies' availability issues as cautionary examples for the team. Mikolaj also says that chaos engineering doesn't need to focus solely on complicated problems like race conditions across distributed systems. Often, there's enough low-hanging fruit worth fixing, such as disk space running out or an API timing out.
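To make the "API timing out" case concrete, here is a minimal sketch of injecting that failure on purpose and checking that the calling code copes with it. It is not taken from the episode or the book: `fetch_price`, `with_injected_timeouts`, `get_price_with_fallback`, and the cached-value fallback are all hypothetical names and choices made up for illustration.

```python
import random


class InjectedTimeout(TimeoutError):
    """Raised by the chaos wrapper to simulate an API call timing out."""


def with_injected_timeouts(func, failure_rate, rng=None):
    """Wrap func so that a fraction of calls raise InjectedTimeout instead of running.

    failure_rate is a probability between 0.0 (never fail) and 1.0 (always fail).
    """
    rng = rng or random.Random()

    def wrapper(*args, **kwargs):
        # rng.random() is in [0.0, 1.0), so 1.0 always fails and 0.0 never does.
        if rng.random() < failure_rate:
            raise InjectedTimeout("injected fault: simulated API timeout")
        return func(*args, **kwargs)

    return wrapper


def fetch_price(item):
    """Hypothetical stand-in for a real remote API call."""
    return {"item": item, "price": 42}


def get_price_with_fallback(fetch, item, cached_price=None):
    """Code under test: fall back to a cached value when the API times out."""
    try:
        return fetch(item)["price"]
    except TimeoutError:
        return cached_price


if __name__ == "__main__":
    # Force every call to fail and observe what the caller does.
    always_failing = with_injected_timeouts(fetch_price, failure_rate=1.0)
    print(get_price_with_fallback(always_failing, "widget", cached_price=40))
```

In a real experiment the failure rate would usually sit well below 1.0 and the fault would be injected at the network or infrastructure level rather than in application code; the wrapper above only illustrates the mindset of triggering a simple failure deliberately instead of waiting for it.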

The chaos engineering mindset can also extend beyond pure software. If you think about people working across different timezones as a distributed system, you can also plan for communication failures before they occur. Everyone works at a different pace, and communication issues can be analogous to network loss. Rather than fixing miscommunications after they occur, establishing shared practices (like writing up notes from every meeting, or setting up playbooks) can go a long way toward ensuring that everyone is able to do their best under changing circumstances.

Links from this episode
  • Mikolaj's book is called Chaos Engineering: Crash test your applications -- get a 40% discount using the code podish19
  • powerfulseal is a testing tool for Kubernetes clusters
  • Mikolaj distributes the Chaos Engineering Newsletter
  • Conf42 is a conference focusing on high-level computer science
  • ChaosConf is the world's largest Chaos Engineering event
  • Awesome Chaos Engineering is a curated list of Chaos Engineering resources

Code[ish] by Heroku from Salesforce

4.7

18 ratings

