


Have you ever wondered how Meta makes config rollouts safe at scale? In this episode, Pascal sits down with Ishwari and Joe to discuss Meta's approach for propagating changes across services in seconds, and why that speed increases the need for strong safeguards. Catch the episode to learn about canarying and progressive rollouts, the health checks and monitoring signals used to catch regressions early, and how incident reviews focus on improving systems rather than blaming people. We also hear how data and early AI/ML are slashing alert noise and speeding up bisecting when something goes wrong.
Got feedback? Send it to us on Threads (https://threads.net/@metatechpod), Instagram (https://instagram.com/metatechpod) and don't forget to follow our host Pascal (https://mastodon.social/@passy, https://threads.net/@passy_). Fancy working with us? Check out https://www.metacareers.com/.
Links
FFmpeg at Meta: Media Processing at Scale - https://engineering.fb.com/2026/03/02/video-engineering/ffmpeg-at-meta-media-processing-at-scale/
Reliably Changing Configuration @ Scale - https://atscaleconference.com/reliably-changing-configuration-scale/
Timestamps
Intro 0:06
Introduction and Overview of Configuration Changes 2:31
Understanding Configurations in Distributed Systems 4:02
Meta's Configuration Management Systems 6:43
Safeguards and Incident Prevention 9:22
Deployment Mechanisms: Canary and Progressive Rollouts 12:06
Challenges in Configuration Consumption 14:39
Health Checks and Incident Response 17:13
Mitigation Strategies for Configuration Issues 19:18
Balancing Developer Velocity and Configuration Safety 21:09
Data-Driven Improvements in Incident Management 22:12
Leveraging AI for Change Detection 26:05
Challenges in Deployment and Testing 28:21
Reinventing Change Safety Strategies 30:24
War Stories: Learning from Past Incidents 32:59
Outro 36:10
By Meta
