Sign up for the Thursday night companion discussions at bookclub.dev/thursdays. Discussions run from 7 to 9 pm Eastern.
This week we look at Chapter 23, "Managing Critical State: Distributed Consensus for Reliability," and Chapter 24, "Distributed Periodic Scheduling with Cron," from Site Reliability Engineering: How Google Runs Production Systems.
Things to monitor with a distributed consensus system (an example of exporting these as metrics follows the list)
- The number of members running in each consensus group, and the status of each process (healthy or not healthy)
- Persistently lagging replicas
- Whether or not a leader exists
- Number of leader changes
- Consensus transaction number - in a healthy system, this number increases over time
- Number of proposals seen and number of proposals agreed upon
- Throughput and latency
- Latency distributions for proposal acceptance
- Distributions of network latencies observed between parts of the system in different locations
- The amount of time acceptors spend on durable logging
- Overall bytes accepted per second in the system
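
The list above reads directly as an instrumentation checklist. As a minimal, hypothetical sketch (not from the book), here is how a consensus member might export a few of these signals using the Prometheus Go client. The metric names and the port are made up for illustration; real systems such as etcd, ZooKeeper, and Consul expose their own equivalents under different names.

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// Hypothetical metric names covering a subset of the checklist above.
var (
	// Number of leader changes observed by this member.
	leaderChanges = promauto.NewCounter(prometheus.CounterOpts{
		Name: "consensus_leader_changes_total",
		Help: "Number of leader changes observed.",
	})
	// Latest consensus transaction number; should increase over time.
	transactionNumber = promauto.NewGauge(prometheus.GaugeOpts{
		Name: "consensus_transaction_number",
		Help: "Latest consensus transaction number seen by this member.",
	})
	// Proposals seen vs. proposals agreed upon.
	proposalsSeen = promauto.NewCounter(prometheus.CounterOpts{
		Name: "consensus_proposals_seen_total",
		Help: "Proposals received by this member.",
	})
	proposalsAgreed = promauto.NewCounter(prometheus.CounterOpts{
		Name: "consensus_proposals_agreed_total",
		Help: "Proposals that reached agreement.",
	})
	// Latency distribution for proposal acceptance.
	acceptLatency = promauto.NewHistogram(prometheus.HistogramOpts{
		Name:    "consensus_proposal_accept_seconds",
		Help:    "Latency distribution for proposal acceptance.",
		Buckets: prometheus.DefBuckets,
	})
)

func main() {
	// Expose the metrics for scraping. Alerting rules layered on top would
	// cover the "no leader exists" and "persistently lagging replica" cases.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":9100", nil))
}
```

In a real member process, the consensus code would call leaderChanges.Inc() on each leadership change, transactionNumber.Set() as transactions commit, and acceptLatency.Observe() around each proposal round; the exporter here only shows the plumbing.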
Projects mentioned
- Apache ZooKeeper™
- Consul
- etcd
Articles from this week's chapters
- Harvest, Yield, and Scalable Tolerant Systems
- Brewer’s Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services
- CAP Twelve Years Later: How the "Rules" Have Changed
- Existential Consistency: Measuring and Understanding Consistency at Facebook
- The trouble with timestamps
- F1: A Distributed SQL Database That Scales
- Impossibility of Distributed Consensus With One Faulty Process
- The Part-Time Parliament
- In Search of an Understandable Consensus Algorithm (Extended Version)
- Zab: High-performance broadcast for primary-backup systems
- Mencius: Building Efficient Replicated State Machines for WANs
- ZooKeeper Recipes and Solutions
- ZooKeeper: Wait-free coordination for Internet-scale systems
- The Chubby lock service for loosely-coupled distributed systems
- Stumbling over consensus research: Misunderstandings and issues
- Paxos for System Builders: An Overview
- Implementing Fault-Tolerant Services Using the State Machine Approach: A Tutorial
- Spanner: Google’s Globally-Distributed Database
- The Google File System
- Bigtable: A Distributed Storage System for Structured Data
- MapReduce: Simplified Data Processing on Large Clusters
- Unreliable Failure Detectors for Reliable Distributed Systems
- Paxos Replicated State Machines as the Basis of a High-Performance Data Store
- Photon: Fault-tolerant and Scalable Joining of Continuous Data Streams
- High-Availability at Massive Scale: Building Google’s Data Infrastructure for Ads
- Paxos Quorum Leases: Fast Reads Without Sacrificing Writes
- Fast Paxos
- Classic Paxos vs. Fast Paxos: Caveat Emptor
- Egalitarian Paxos
- Tuning Paxos for High-Throughput with Batching and Pipelining
- Paxos made live: an engineering perspective
- Practical Byzantine Fault Tolerance
- What Takes Us Down?
- Large-scale cluster management at Google with Borg