This week we read the foreword, preface, and the first two chapters of Site Reliability Engineering. We discuss the origins and basic tenets of SRE, look at how Google manages risk, and think about how we can incorporate SRE into our work. You can join our free discussions Thursdays at 7 pm Eastern by signing up at https://www.bookclub.dev/thursdays.
Resources
- The Wheel of Time Series (Amazon)
- Awareness: The Perils and Opportunities of Reality (Amazon)
- SRE Book companion site
- Principles of Network and System Administration (Amazon)
- Practical Reliability Engineering (Amazon)
- Facts and Fallacies of Software Engineering (Amazon)
- The Factors That Impact Availability, Visualized
- The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines, Second Edition
- A Study of Non-Blocking Switching Networks
- Jupiter Rising: A Decade of Clos Topologies and Centralized Control in Google’s Datacenter Network
- B4: Experience with a Globally-Deployed Software Defined WAN
- BwE: Flexible, Hierarchical Bandwidth Allocation for WAN Distributed Computing
- Large-scale cluster management at Google with Borg
- MapReduce: Simplified Data Processing on Large Clusters
- The Google File System
- Bigtable: A Distributed Storage System for Structured Data
- Spanner: Google’s Globally-Distributed Database
- The Chubby Lock Service for Loosely-Coupled Distributed Systems
- Searching for Build Debt: Experiences Managing Technical Debt at Google
- The Motivation for a Monolithic Codebase: Why Google stores billions of lines of code in a single repository
- Borg, Omega, and Kubernetes