Join the companion discussion on Thursday at 7 pm Eastern bookclub.dev/thursdays
Eliminating Toil
"If a human operator needs to touch your system during normal operations, you have a bug. The definition of normal changes as your systems grow."
-Carla Geisser, Google SRE
4 types of work
- Software engineering
- Systems engineering
- Toil
- Overhead
What makes it toil?
- Manual
- Repetitive
- Automatable
- Tactical
- No enduring value
- Scales linearly with service growth
None of these attributes alone is enough to make something toil, but the more boxes a piece of work checks, the more likely it is toil
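A purely illustrative way to read that checklist (the attribute names and the idea of a numeric score are mine, not the book's): count how many toil boxes a piece of work checks.

```python
# Toil attributes from the chapter; the scoring itself is just an illustration.
TOIL_ATTRIBUTES = (
    "manual",
    "repetitive",
    "automatable",
    "tactical",
    "no_enduring_value",
    "scales_linearly_with_growth",
)

def toil_score(work_attributes):
    """Count how many toil boxes a piece of work checks (0-6).

    No single attribute makes something toil; the higher the count, the
    more likely the work is toil and worth engineering away.
    """
    return sum(attr in work_attributes for attr in TOIL_ATTRIBUTES)

# e.g. toil_score({"manual", "repetitive", "automatable"}) == 3
```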
Toil isn't always bad, and it isn't possible to eliminate it completely
Figuring out which type of work something is largely comes down to how much value it creates and on what time scale
Toil tends to grow and expand if left unchecked
Tracking types of work
Google runs quarterly surveys to ensure SREs are meeting or beating the target of spending <= 50% of their time on toil
"If we all commit to eliminate a bit of toil each week with some good engineering, we’ll steadily clean up our services, and we can shift our collective efforts to engineering for scale, architecting the next generation of services, and building cross-SRE toolchains. Let’s invent more, and toil less."
Monitoring Distributed Systems
Terms around monitoring
These terms aren't used consistently across the industry, but the book gives a basic working definition for each
- Monitoring
- White-box monitoring
- Black-box monitoring
- Dashboard
- Alert
- Root cause
- Node and machine
- Push
What can you get from monitoring?
Analyzing long-term trends, help with debugging, alerting, baselines to compare against, data for the business to analyze, and forensic data to review in the event of a security breach
This is a large-scale endeavor. Every 10-12 person team has at least one "monitoring person"
Even with a dedicated person, the monitoring needs to be simple enough for everyone on the team to understand, especially if it's something that triggers a page.
White-box and black-box monitoring
Black-box can tell you when there is an issue, but not when there is going to be an issue
White-box can see inside the system and see when there are imminent problems on the horizon
White-box monitoring is not only predictive; for some problems, such as an application that thinks the database is slow, it's the only way to distinguish a network issue from a database issue
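As a minimal sketch of what that looks like in practice (the metric name, the in-process registry, and the `conn` object are all hypothetical, not anything the book prescribes): white-box monitoring means the application exports the latency it sees for its own database calls, which can then be compared against the latency the database reports for the same queries.

```python
import time

# Hypothetical in-process metric store; a real service would export these
# values to its monitoring system.
METRICS = {}

def record(name, value):
    METRICS.setdefault(name, []).append(value)

def query_database(conn, sql):
    """Run a query and export the latency the application itself observes.

    If this number is high but the database's own view of query latency is
    low, the problem is likely the network; if both are high, the database
    is the bottleneck. A black-box probe can't make that distinction.
    """
    start = time.monotonic()
    rows = conn.execute(sql)
    record("db_query_latency_seconds", time.monotonic() - start)
    return rows
```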
Symptoms vs causes
Monitoring should address two questions: what's broken, and why?
Table 6-1 shows some symptoms and causes
A symptom is "I'm serving HTTP 500s and 404s"; the corresponding cause is "Database servers are refusing connections"
Paging should be based on symptoms while data around causes should be used for debugging
From the perspective of someone monitoring an application "Database servers are refusing connections" should not generate the page, "I'm serving HTTP 500s and 404s" should.
At the same time, if you are the one monitoring the database, "Database servers are refusing connections" is a symptom for you, not a cause
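A rough sketch of that split, with made-up thresholds: the paging condition is built on the user-visible symptom (error ratio), while cause-side data is kept for dashboards and debugging.

```python
def should_page(error_count, request_count, max_error_ratio=0.01):
    """Page on the symptom: users are actually seeing errors.

    The 1% threshold is illustrative, not a recommendation.
    """
    if request_count == 0:
        return False
    return error_count / request_count > max_error_ratio

# Cause-side signals (e.g. "database refused N connections") belong in
# dashboards and logs used for debugging, not in the paging condition --
# unless you own the database, in which case refused connections are the
# symptom from your perspective.
```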
4 golden signals
- Latency
- Traffic
- Errors
- Saturation
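A small sketch of how the four signals might be derived for one time window; the `Request` fields and the idea of passing saturation in as a utilization number are assumptions for illustration, not the book's definitions.

```python
from dataclasses import dataclass

@dataclass
class Request:
    latency_s: float  # time taken to serve the request, in seconds
    status: int       # HTTP status code

def golden_signals(requests, window_s, utilization):
    """Summarize the four golden signals for one window of traffic.

    latency:    time it takes to serve requests (99th percentile here)
    traffic:    demand on the system, as requests per second
    errors:     fraction of requests that failed (5xx here)
    saturation: how "full" the service is, e.g. utilization of its most
                constrained resource, passed in by the caller
    """
    latencies = sorted(r.latency_s for r in requests)
    p99 = latencies[int(0.99 * (len(latencies) - 1))] if latencies else 0.0
    return {
        "latency_p99_s": p99,
        "traffic_rps": len(requests) / window_s,
        "error_ratio": sum(r.status >= 500 for r in requests) / max(len(requests), 1),
        "saturation": utilization,
    }
```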
Scale and accuracy
Find the right resolution for your needs
You can't see more detail than what you collect. Monitoring at the finest detail is costly and creates a lot of noise
If your SLA is 99.9%, then checking something more than once a minute is probably unnecessary
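One hedged sketch of how to keep fine-grained collection without fine-grained cost, along the lines the chapter suggests: sample frequently in memory, but store and alert on coarser aggregates. The per-second/per-minute split here is illustrative.

```python
from collections import defaultdict

def aggregate(samples, bucket_s=60):
    """Collapse (timestamp, value) samples into per-bucket averages and maxima.

    Sampling something like CPU load every second is cheap if you only keep
    running sums per bucket; shipping and alerting on the one-minute
    aggregates keeps the monitoring system simple and quiet, while the max
    still surfaces short spikes that an average would hide.
    Assumes non-negative sample values.
    """
    buckets = defaultdict(lambda: [0.0, 0.0, 0])  # sum, max, count
    for ts, value in samples:
        b = buckets[int(ts) // bucket_s]
        b[0] += value
        b[1] = max(b[1], value)
        b[2] += 1
    return {
        bucket * bucket_s: {"avg": total / count, "max": peak}
        for bucket, (total, peak, count) in sorted(buckets.items())
    }
```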
What you monitor will change
You will find new things to monitor, but also some things will prove to not be worth monitoring anymore.
If you are collecting some data but not using it in any alerts or dashboards, is it worth keeping?
Pagers and Pages
- Every time the pager goes off, I should be able to react with a sense of urgency. I can only react with a sense of urgency a few times a day before I become fatigued.
- Every page should be actionable.
- Every page response should require intelligence. If a page merely merits a robotic response, it shouldn’t be a page.
- Pages should be about a novel problem or an event that hasn’t been seen before.
Here are some questions to ask about your alerts to make sure you aren't paging for the wrong things
- Does this rule detect an otherwise undetected condition that is urgent, actionable, and actively or imminently user-visible?
- Will I ever be able to ignore this alert, knowing it’s benign? When and why will I be able to ignore this alert, and how can I avoid this scenario?
- Does this alert definitely indicate that users are being negatively affected? Are there detectable cases in which users aren’t being negatively impacted, such as drained traffic or test deployments, that should be filtered out?
- Can I take action in response to this alert? Is that action urgent, or could it wait until morning? Could the action be safely automated? Will that action be a long-term fix, or just a short-term workaround?
- Are other people getting paged for this issue, therefore rendering at least one of the pages unnecessary?
Short and long term balance
Responding to a page is toil, and it takes away resources from more valuable work. Finding the root cause and resolving it is often the best thing to do for the long term. If it can't be resolved, fully automating the response is the next best option
The book shares two case studies
Bigtable was over-alerting; the solution was to temporarily lower the SLO to create space to solve the underlying issues
Gmail had an issue where the team was concerned that if they automated away a rote response task, the real underlying issue would never be addressed. That concern reflects a lack of faith in the team's ability to clean up its technical debt, which is a deeper issue that needs to be addressed, probably by escalating.