Code Impact

Reliability Engineering: History, Practice, and Future


This podcast explores the field of reliability engineering, tracing its origins to Google's development of Site Reliability Engineering (SRE). It distinguishes reliability engineering from SRE, highlighting its broader applicability across different organisational structures. The podcast outlines four key promises of a successful reliability team: defining service levels (SLAs, SLOs, and SLIs), managing the service infrastructure, participating in technical design, and providing tactical support during incidents. Finally, it discusses the evolving landscape of reliability engineering, emphasising pragmatic approaches to balancing cost against reliability needs and advocating a more nuanced understanding of when to build solutions and when to buy them.


Code Impact
By Sanket Makhija