Join the companion discussion on Thursday at 7 pm Eastern bookclub.dev/thursdays
Eliminating Toil
"If a human operator needs to touch your system during normal operations, you have a bug. The definition of normal changes as your systems grow."
-Carla Geisser, Google SRE
4 types of work
- Software engineering
- Systems engineering
- Toil
- Overhead
What makes it toil?
- Manual
- Repetitive
- Automatable
- Tactical
- No enduring value
- Scales linearly with service growth
None of these attributes alone is enough to make something toil, but the more boxes a piece of work checks, the more likely it is toil
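A purely illustrative way to read that checklist (the attribute names and the idea of a numeric score are mine, not the book's): count how many toil boxes a piece of work checks.

```python
# Toil attributes from the chapter; the scoring itself is just an illustration.
TOIL_ATTRIBUTES = (
    "manual",
    "repetitive",
    "automatable",
    "tactical",
    "no_enduring_value",
    "scales_linearly_with_growth",
)

def toil_score(work_attributes):
    """Count how many toil boxes a piece of work checks (0-6).

    No single attribute makes something toil; the higher the count, the
    more likely the work is toil and worth engineering away.
    """
    return sum(attr in work_attributes for attr in TOIL_ATTRIBUTES)

# e.g. toil_score({"manual", "repetitive", "automatable"}) == 3
```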
Toil isn't always bad, and it isn't possible to eliminate it completely
Figuring out which type of work something is largely comes down to how much value it creates and on what time scale
Toil tends to grow and expand if left unchecked
Tracking types of work
Google runs quarterly surveys to ensure SREs are meeting or beating the target of spending <= 50% of their time on toil
"If we all commit to eliminate a bit of toil each week with some good engineering, we’ll steadily clean up our services, and we can shift our collective efforts to engineering for scale, architecting the next generation of services, and building cross-SRE toolchains. Let’s invent more, and toil less."
Monitoring Distributed Systems
Terms around monitoring
These terms aren't used consistently across the industry, but the book gives a basic working definition for each
- Monitoring
- White-box monitoring
- Black-box monitoring
- Dashboard
- Alert
- Root cause
- Node and machine
- Push
What can you get from monitoring?
Analyzing long-term trends, help with debugging, alerting, baselines to compare against, data for the business to analyze, and forensic data to review in the event of a security breach
This is a large-scale endeavor. Every 10-12 person team has at least one "monitoring person"
Even with a dedicated person, the monitoring needs to be simple enough for everyone on the team to understand, especially if it's something that triggers a page.
White-box and black-box monitoring
Black-box can tell you when there is an issue, but not when there is going to be an issue
White-box can see inside the system and see when there are imminent problems on the horizon
White-box monitoring is not only predictive; for some problems, such as an application that thinks the database is slow, it's the only way to distinguish a network issue from a database issue
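As a minimal sketch of what that looks like in practice (the metric name, the in-process registry, and the `conn` object are all hypothetical, not anything the book prescribes): white-box monitoring means the application exports the latency it sees for its own database calls, which can then be compared against the latency the database reports for the same queries.

```python
import time

# Hypothetical in-process metric store; a real service would export these
# values to its monitoring system.
METRICS = {}

def record(name, value):
    METRICS.setdefault(name, []).append(value)

def query_database(conn, sql):
    """Run a query and export the latency the application itself observes.

    If this number is high but the database's own view of query latency is
    low, the problem is likely the network; if both are high, the database
    is the bottleneck. A black-box probe can't make that distinction.
    """
    start = time.monotonic()
    rows = conn.execute(sql)
    record("db_query_latency_seconds", time.monotonic() - start)
    return rows
```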
Symptoms vs causes
Monitoring should address two questions: what's broken, and why?
Table 6-1 shows some symptoms and causes
A symptom is "I'm serving HTTP 500s and 404s"; the corresponding cause is "Database servers are refusing connections"
Paging should be based on symptoms while data around causes should be used for debugging
From the perspective of someone monitoring an application "Database servers are refusing connections" should not generate the page, "I'm serving HTTP 500s and 404s" should.
At the same time, if you are the one monitoring the database, "Database servers are refusing connections" is a symptom for you, not a cause
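A rough sketch of that split, with made-up thresholds: the paging condition is built on the user-visible symptom (error ratio), while cause-side data is kept for dashboards and debugging.

```python
def should_page(error_count, request_count, max_error_ratio=0.01):
    """Page on the symptom: users are actually seeing errors.

    The 1% threshold is illustrative, not a recommendation.
    """
    if request_count == 0:
        return False
    return error_count / request_count > max_error_ratio

# Cause-side signals (e.g. "database refused N connections") belong in
# dashboards and logs used for debugging, not in the paging condition --
# unless you own the database, in which case refused connections are the
# symptom from your perspective.
```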
4 golden signals
- Latency
- Traffic
- Errors
- Saturation
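A small sketch of how the four signals might be derived for one time window; the `Request` fields and the idea of passing saturation in as a utilization number are assumptions for illustration, not the book's definitions.

```python
from dataclasses import dataclass

@dataclass
class Request:
    latency_s: float  # time taken to serve the request, in seconds
    status: int       # HTTP status code

def golden_signals(requests, window_s, utilization):
    """Summarize the four golden signals for one window of traffic.

    latency:    time it takes to serve requests (99th percentile here)
    traffic:    demand on the system, as requests per second
    errors:     fraction of requests that failed (5xx here)
    saturation: how "full" the service is, e.g. utilization of its most
                constrained resource, passed in by the caller
    """
    latencies = sorted(r.latency_s for r in requests)
    p99 = latencies[int(0.99 * (len(latencies) - 1))] if latencies else 0.0
    return {
        "latency_p99_s": p99,
        "traffic_rps": len(requests) / window_s,
        "error_ratio": sum(r.status >= 500 for r in requests) / max(len(requests), 1),
        "saturation": utilization,
    }
```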
Scale and accuracy
Find the right resolution for your needs
You can't see more detail than what you collect. Monitoring at the finest detail is costly and creates a lot of noise
If your SLA is 99.9%, then checking something more than once a minute is probably unnecessary
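One hedged sketch of how to keep fine-grained collection without fine-grained cost, along the lines the chapter suggests: sample frequently in memory, but store and alert on coarser aggregates. The per-second/per-minute split here is illustrative.

```python
from collections import defaultdict

def aggregate(samples, bucket_s=60):
    """Collapse (timestamp, value) samples into per-bucket averages and maxima.

    Sampling something like CPU load every second is cheap if you only keep
    running sums per bucket; shipping and alerting on the one-minute
    aggregates keeps the monitoring system simple and quiet, while the max
    still surfaces short spikes that an average would hide.
    Assumes non-negative sample values.
    """
    buckets = defaultdict(lambda: [0.0, 0.0, 0])  # sum, max, count
    for ts, value in samples:
        b = buckets[int(ts) // bucket_s]
        b[0] += value
        b[1] = max(b[1], value)
        b[2] += 1
    return {
        bucket * bucket_s: {"avg": total / count, "max": peak}
        for bucket, (total, peak, count) in sorted(buckets.items())
    }
```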
What you monitor will change
You will find new things to monitor, but also some things will prove to not be worth monitoring anymore.
If you are collecting some data but not using it in any alerts or dashboards, is it worth keeping?
Pagers and Pages
- Every time the pager goes off, I should be able to react with a sense of urgency. I can only react with a sense of urgency a few times a day before I become fatigued.
- Every page should be actionable.
- Every page response should require intelligence. If a page merely merits a robotic response, it shouldn’t be a page.
- Pages should be about a novel problem or an event that hasn’t been seen before.
Here are some questions to ask about your alerts to make sure you aren't paging for the wrong things
- Does this rule detect an otherwise undetected condition that is urgent, actionable, and actively or imminently user-visible?
- Will I ever be able to ignore this alert, knowing it’s benign? When and why will I be able to ignore this alert, and how can I avoid this scenario?
- Does this alert definitely indicate that users are being negatively affected? Are there detectable cases in which users aren’t being negatively impacted, such as drained traffic or test deployments, that should be filtered out?
- Can I take action in response to this alert? Is that action urgent, or could it wait until morning? Could the action be safely automated? Will that action be a long-term fix, or just a short-term workaround?
- Are other people getting paged for this issue, therefore rendering at least one of the pages unnecessary?
Short and long term balance
Responding to a page is toil, and it takes away resources from more valuable work. Finding the root cause and resolving it is often the best thing to do for the long term. If it can't be resolved, fully automating the response is the next best option
The book shares two case studies
Bigtable was over-alerting; the solution was to temporarily lower the SLO to create space to solve the underlying issues
Gmail had an issue where the team was concerned that if they automated away a rote response task, the real underlying issue would never be addressed. That concern reflects a lack of faith in the team's ability to clean up its technical debt, which is a deeper issue that needs to be addressed, probably by escalating.