Reliability Enablers

#29 - Reacting to Google's SRE book 2016 (Chapter 1 Part 2)



Sebastian and I continue our breakdown of notable passages from Chapter 1 of Google's Site Reliability Engineering (2016) book by Betsy Beyer, Jennifer Petoff, Niall Murphy, et al.


We covered passages like:



  1. Monitoring is one of the primary means by which service owners keep track of a system's health and availability.

  2. Efficient use of resources is important anytime a service cares about money.

  3. Humans add latency. Even if a given system experiences more actual failures, a system that can avoid emergencies that require human intervention will have higher availability than a system that requires hands-on intervention.

  4. SRE has found that roughly 70 percent of outages are due to changes in a live system. Best practices in this domain use automation to implement progressive rollouts.

  5. Demand forecasting and capacity planning can be viewed as ensuring that there is sufficient capacity and redundancy to serve projected future demand with the required availability.
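
To make passage 3 concrete, here is a rough back-of-the-envelope sketch. It assumes the common steady-state approximation availability = MTBF / (MTBF + MTTR); the formula choice, the function name, and all of the numbers are illustrative assumptions, not figures from the book or the episode.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: expected uptime / (uptime + downtime)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical system A: fails rarely, but each failure waits on a human
# (assumed one hour to get paged, diagnose, and fix).
manual = availability(mtbf_hours=1000.0, mttr_hours=1.0)

# Hypothetical system B: fails twice as often, but recovers automatically
# (assumed three minutes, i.e. 0.05 hours, per failure).
automated = availability(mtbf_hours=500.0, mttr_hours=0.05)

print(f"manual recovery:    {manual:.5f}")
print(f"automated recovery: {automated:.5f}")
```

Under these made-up numbers, the system that fails twice as often still comes out ahead, because each failure costs minutes rather than an hour of human response time.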





This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit read.srepath.com

Reliability Enablers, by Ash Patel & Sebastian Vietz


