Sign up for the Thursday night companion discussions at bookclub.dev/thursdays. Discussions run from 7 to 9 pm Eastern.
This week we look at Chapter 23, "Managing Critical State: Distributed Consensus for Reliability," and Chapter 24, "Distributed Periodic Scheduling with Cron," from Site Reliability Engineering: How Google Runs Production Systems.
Things to monitor with a distributed consensus system (an example of exporting these as metrics follows the list)
- The number of members running in each consensus group, and the status of each process (healthy or not healthy)
- Persistently lagging replicas
- Whether or not a leader exists
- Number of leader changes
- Consensus transaction number - in a healthy system, this number increases over time
- Number of proposals seen and number of proposals agreed upon
- Throughput and latency
- Latency distributions for proposal acceptance
- Distributions of network latencies observed between parts of the system in different locations
- The amount of time acceptors spend on durable logging
- Overall bytes accepted per second in the system
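
The list above reads directly as an instrumentation checklist. As a minimal, hypothetical sketch (not from the book), here is how a consensus member might export a few of these signals using the Prometheus Go client. The metric names and the port are made up for illustration; real systems such as etcd, ZooKeeper, and Consul expose their own equivalents under different names.

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// Hypothetical metric names covering a subset of the checklist above.
var (
	// Number of leader changes observed by this member.
	leaderChanges = promauto.NewCounter(prometheus.CounterOpts{
		Name: "consensus_leader_changes_total",
		Help: "Number of leader changes observed.",
	})
	// Latest consensus transaction number; should increase over time.
	transactionNumber = promauto.NewGauge(prometheus.GaugeOpts{
		Name: "consensus_transaction_number",
		Help: "Latest consensus transaction number seen by this member.",
	})
	// Proposals seen vs. proposals agreed upon.
	proposalsSeen = promauto.NewCounter(prometheus.CounterOpts{
		Name: "consensus_proposals_seen_total",
		Help: "Proposals received by this member.",
	})
	proposalsAgreed = promauto.NewCounter(prometheus.CounterOpts{
		Name: "consensus_proposals_agreed_total",
		Help: "Proposals that reached agreement.",
	})
	// Latency distribution for proposal acceptance.
	acceptLatency = promauto.NewHistogram(prometheus.HistogramOpts{
		Name:    "consensus_proposal_accept_seconds",
		Help:    "Latency distribution for proposal acceptance.",
		Buckets: prometheus.DefBuckets,
	})
)

func main() {
	// Expose the metrics for scraping. Alerting rules layered on top would
	// cover the "no leader exists" and "persistently lagging replica" cases.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":9100", nil))
}
```

In a real member process, the consensus code would call leaderChanges.Inc() on each leadership change, transactionNumber.Set() as transactions commit, and acceptLatency.Observe() around each proposal round; the exporter here only shows the plumbing.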
Projects mentioned
- Apache ZooKeeper™
- Consul
- etcd
Articles from this week's chapters
- Harvest, Yield, and Scalable Tolerant Systems
- Brewer’s Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services
- CAP Twelve Years Later: How the "Rules" Have Changed
- Existential Consistency: Measuring and Understanding Consistency at Facebook
- The trouble with timestamps
- F1: A Distributed SQL Database That Scales
- Impossibility of Distributed Consensus With One Faulty Process
- The Part-Time Parliament
- In Search of an Understandable Consensus Algorithm (Extended Version)
- Zab: High-performance broadcast for primary-backup systems
- Mencius: Building Efficient Replicated State Machines for WANs
- ZooKeeper Recipes and Solutions
- ZooKeeper: Wait-free coordination for Internet-scale systems
- The Chubby lock service for loosely-coupled distributed systems
- Stumbling over consensus research: Misunderstandings and issues
- Paxos for System Builders: An Overview
- Implementing Fault-Tolerant Services Using the State Machine Approach: A Tutorial
- Spanner: Google’s Globally-Distributed Database
- The Google File System
- Bigtable: A Distributed Storage System for Structured Data
- MapReduce: Simplified Data Processing on Large Clusters
- Unreliable Failure Detectors for Reliable Distributed Systems
- Paxos Replicated State Machines as the Basis of a High-Performance Data Store
- Photon: Fault-tolerant and Scalable Joining of Continuous Data Streams
- High-Availability at Massive Scale: Building Google’s Data Infrastructure for Ads
- Paxos Quorum Leases: Fast Reads Without Sacrificing Writes
- Fast Paxos
- Classic Paxos vs. Fast Paxos: Caveat Emptor
- Egalitarian Paxos
- Tuning Paxos for High-Throughput with Batching and Pipelining
- Paxos made live: an engineering perspective
- Practical Byzantine Fault Tolerance
- What Takes Us Down?
- Large-scale cluster management at Google with Borg