
Summary
In this episode of the Overcommitted Podcast, Brittany Ellich and her co-hosts engage with Ross Brodbeck, a software engineer at GitHub, to explore the critical topic of software availability. They discuss the definitions of availability, reliability, and uptime, and delve into frameworks for improving availability in software systems. The conversation covers proactive versus reactive approaches to availability, the business impact of availability, and the hidden costs associated with downtime. Ross shares insights on creating effective availability programs, the role of incident commanders, and emerging technologies that may shape the future of availability in software engineering. The episode concludes with book recommendations for software engineers looking to deepen their understanding of the field.
Takeaways
Availability is subjective and varies by organization.
Observability is crucial for understanding production behavior.
Proactive measures can help prevent availability issues.
On-call burnout is a significant cost to organizations.
Understanding business needs is key to defining availability.
SLOs help in measuring and reporting availability effectively.
Incident commanders play a vital role in managing incidents.
Game days and playbooks are essential for preparedness.
Hidden costs of downtime include loss of customer trust.
Emerging technologies like AI may change availability management.
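As a rough illustration of the SLO takeaway above, an availability target can be converted into the amount of downtime it permits over a reporting window. This is a minimal sketch, not something taken from the episode: the function name `allowed_downtime_minutes` and the 30-day default window are assumptions chosen for the example.

```python
def allowed_downtime_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Return the minutes of downtime an availability SLO permits over a window.

    slo_percent is the availability target, e.g. 99.9 for "three nines".
    window_days is the reporting window (hypothetical default of 30 days).
    """
    if not 0 <= slo_percent <= 100:
        raise ValueError("slo_percent must be between 0 and 100")
    if window_days <= 0:
        raise ValueError("window_days must be positive")
    total_minutes = window_days * 24 * 60
    # The unavailable fraction of the window is the downtime budget.
    return total_minutes * (1 - slo_percent / 100)


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% over 30 days allows about {minutes:.1f} minutes of downtime")
```

Under these assumptions, a 99.9% target over 30 days leaves roughly 43 minutes of downtime, which is the kind of figure a site such as uptime.is reports.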
Links
Ross’s Blog
Google SRE Book
https://sreweekly.com/
https://uptime.is/
Catchpoint SRE Report
The Software Engineer’s Guidebook
Designing Data-Intensive Applications
Thinking in Systems
The Best Software Writing I - Joel on Software
Algorithms to Live By
Staff Engineer
Clean Code
Pragmatic Engineer Podcast - Thomas Dohmke interview
Distributed Systems by Maarten van Steen
Practical Object-Oriented Design in Ruby
Looks Good To Me
Tech book club Repo
Overcommitted Discord
Hosts
Overcommitted.dev
Bethany Janos
Brittany Ellich
Eggyhead
Jonathan Tamsut