Reading Notes presented by BookClub.dev

Managing Critical State and Distributed Periodic Scheduling with Cron


Sign up for the Thursday night companion discussions at bookclub.dev/thursdays. Discussions run from 7 to 9 pm Eastern.

This week we look at Chapter 23, "Managing Critical State: Distributed Consensus for Reliability," and Chapter 24, "Distributed Periodic Scheduling with Cron," from Site Reliability Engineering: How Google Runs Production Systems.

Things to monitor with a distributed consensus system

  • The number of members running in each consensus group, and the status of each process (healthy or not healthy)
  • Persistently lagging replicas
  • Whether or not a leader exists
  • Number of leader changes
  • Consensus transaction number - in a healthy system, this number increases over time, which shows the system is making progress
  • Number of proposals seen and number of proposals agreed upon
  • Throughput and latency
  • Latency distributions for proposal acceptance
  • Distributions of network latencies observed between parts of the system in different locations
  • The amount of time acceptors spend on durable logging
  • Overall bytes accepted per second in the system
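
A few of the signals above can be turned into simple checks. The following is a minimal sketch, not anything taken from the book: the `Sample` fields, the `check_group` function, and the alert wording are all hypothetical, and it assumes a majority quorum and a time-ordered window of scraped samples.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Sample:
    """One scrape of a consensus group (hypothetical fields)."""
    healthy_members: int       # members currently reporting healthy
    leader: Optional[str]      # None when no leader is known
    transaction_number: int    # latest agreed consensus transaction number


def check_group(samples: List[Sample], group_size: int,
                max_leader_changes: int = 2) -> List[str]:
    """Return alert strings for a time-ordered window of samples."""
    if not samples:
        return ["no data"]

    alerts = []
    latest = samples[-1]

    # Assumption: the group needs a simple majority to make progress.
    quorum = group_size // 2 + 1
    if latest.healthy_members < quorum:
        alerts.append("quorum lost")

    if latest.leader is None:
        alerts.append("no leader")

    # Count transitions between distinct known leaders in the window.
    changes = 0
    last_leader = None
    for sample in samples:
        if sample.leader is None:
            continue
        if last_leader is not None and sample.leader != last_leader:
            changes += 1
        last_leader = sample.leader
    if changes > max_leader_changes:
        alerts.append("frequent leader changes")

    # A healthy system should see the transaction number grow over time.
    if len(samples) >= 2 and latest.transaction_number <= samples[0].transaction_number:
        alerts.append("no progress")

    return alerts
```

For example, a five-member group with two healthy members, no leader, and a flat transaction number would trigger three of the four alerts. The thresholds here are placeholders; sensible values depend on the deployment.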

Projects mentioned

  • Apache ZooKeeper™
  • Consul
  • etcd

 

Articles from this week's chapters

  • Harvest, Yield, and Scalable Tolerant Systems
  • Brewer’s Conjecture and the Feasibility of
    Consistent, Available, Partition-Tolerant Web
    Services
  • CAP Twelve Years Later: How the "Rules" Have Changed
  • Existential Consistency:
    Measuring and Understanding Consistency at Facebook
  • The trouble with timestamps
  • F1: A Distributed SQL Database That Scales
  • Impossibility of Distributed Consensus With One Faulty Process
  • The Part-Time Parliament
  • In Search of an Understandable Consensus Algorithm
    (Extended Version)
  • Zab: High-performance broadcast for primary-backup systems
  • Mencius: Building Efficient Replicated State Machines for WANs
  • ZooKeeper Recipes and Solutions
  • ZooKeeper: Wait-free coordination for Internet-scale systems
  • The Chubby lock service for loosely-coupled distributed systems
  • Stumbling over consensus research: Misunderstandings and issues
  • Paxos for System Builders: An Overview
  • Implementing Fault-Tolerant Services Using the State Machine
    Approach: A Tutorial
  • Spanner: Google’s Globally-Distributed Database
  • The Google File System
  • Bigtable: A Distributed Storage System for Structured Data
  • MapReduce: Simplified Data Processing on Large Clusters
  • Unreliable Failure Detectors for Reliable Distributed Systems
  • Paxos Replicated State Machines as the Basis of a High-Performance Data Store
  • Photon: Fault-tolerant and Scalable Joining of Continuous Data Streams
  • High-Availability at Massive Scale: Building Google’s Data Infrastructure for Ads
  • Paxos Quorum Leases: Fast Reads Without Sacrificing Writes
  • Fast Paxos
  • Classic Paxos vs. Fast Paxos: Caveat Emptor
  • Egalitarian Paxos
  • Tuning Paxos for High-Throughput with Batching and Pipelining
  • Paxos made live: an engineering perspective
  • Practical Byzantine Fault Tolerance
  • What Takes Us Down?
  • Large-scale cluster management at Google with Borg

By Dan Cook