The Stack Overflow Podcast

Who owns this outage? Building intelligent, automated escalation chains


Listen Later

Maxwell, a solution architect at xMatters, took a winding road to get to where he is. After a computer engineering education, he held jobs as field support engineer, product manager, SRE, and finally his current role as a solutions architect, where he serves as something of an SRE for SREs, helping them solve incident management problems with the help of xMatters. 

When he moved to the SRE role, Maxwell wanted to get back to doing technical work. It was a lateral move within his company, which was migrating an on-prem solution into the cloud. It’s a journey that plenty of companies are making now: breaking an application into microservices, running processes in containers, and using Kubernetes to orchestrate the whole thing. Non-production environments would go down and waste SRE time, making it harder to address problems in the production pipeline. 

At the heart of their issues was the incident response process. They had several bottlenecks that prevented them from delivering value to their customers quickly. Incidents would send emails to the relevant engineers, sometimes 20 on a single email, which made it easy for any one engineer to ignore the problem—someone else has got this. They had a bad silo problem, where escalating to the right person across groups became an issue of its own. And of course, most of this was manual. Their MTTR—mean time to resolve—was lagging. 

Maxwell moved over to xMatters because they managed to solve these problems through clever automation. Their product automates the scheduling and notification process so that the right person knows about the incident as soon as possible. At the core of this process was a different MTTR—mean time to respond. Once an engineer started working to resolve a problem, it was all down to runbooks and skill. But the lag between the initial incident and that start was the real slowdown. 

It’s not just the response from the first SRE on call. It’s the other escalations down the line—to data engineers, for example—that can eat away time. They’ve worked hard to make  escalation configuration easy. It not only handles who's responsible for specific services and metrics, but who’s in the escalation chain from there. When the incident hits, the notifications go out through a series of configured channels; maybe it tries a chat program first, then email, then SMS. 

The on-call process is often a source of dread, but automating the escalation process can take some of the sting out of it. Check out the episode to learn more. 

See Privacy Policy at https://art19.com/privacy and California Privacy Notice at https://art19.com/privacy#do-not-sell-my-info.

...more
View all episodesView all episodes
Download on the App Store

The Stack Overflow PodcastBy The Stack Overflow Podcast

  • 4.3
  • 4.3
  • 4.3
  • 4.3
  • 4.3

4.3

62 ratings


More shows like The Stack Overflow Podcast

View all
Software Engineering Radio - the podcast for professional software developers by team@se-radio.net (SE-Radio Team)

Software Engineering Radio - the podcast for professional software developers

273 Listeners

The Changelog: Software Development, Open Source by Changelog Media

The Changelog: Software Development, Open Source

289 Listeners

The a16z Show by Andreessen Horowitz

The a16z Show

1,100 Listeners

Daily Tech News Show by Tom Merritt

Daily Tech News Show

1,392 Listeners

Software Engineering Daily by Software Engineering Daily

Software Engineering Daily

623 Listeners

Talk Python To Me by Michael Kennedy

Talk Python To Me

583 Listeners

Super Data Science: ML & AI Podcast with Jon Krohn by Jon Krohn

Super Data Science: ML & AI Podcast with Jon Krohn

299 Listeners

NVIDIA AI Podcast by NVIDIA

NVIDIA AI Podcast

348 Listeners

Y Combinator Startup Podcast by Y Combinator

Y Combinator Startup Podcast

232 Listeners

Syntax - Tasty Web Development Treats by Wes Bos & Scott Tolinski - Full Stack JavaScript Web Developers

Syntax - Tasty Web Development Treats

990 Listeners

Tech Brew Ride Home by Morning Brew

Tech Brew Ride Home

971 Listeners

Practical AI by Practical AI LLC

Practical AI

212 Listeners

Latent Space: The AI Engineer Podcast by Latent.Space

Latent Space: The AI Engineer Podcast

98 Listeners

This Day in AI Podcast by Michael Sharkey, Chris Sharkey

This Day in AI Podcast

229 Listeners

The AI Daily Brief: Artificial Intelligence News and Analysis by Nathaniel Whittemore

The AI Daily Brief: Artificial Intelligence News and Analysis

659 Listeners