


When your website stops working at 3 AM, you need to answer one question fast: Is it my code or is a big cloud provider having problems? Omri Sass from Datadog explains updog.ai, a tool that monitors whether major services like AWS, Cloudflare, and others are actually working. Instead of asking people to report problems like Downdetector does, updog uses real data from thousands of computers to detect when services go down. Omri shares why this took 6 years to build, how they process massive amounts of data with machine learning, and why cloud providers have been strangely upset about these tools existing.
About Omri:
Omri Sass is a Director of Product Management at Datadog, where he leads and supports a team of 25+ product managers driving initiatives across Bits AI SRE, Data Observability, Service Management, and most recently, the launch of updog.ai. Outside of work, Omri is an avid sci-fi reader, a dedicated yoga practitioner, and happily outmatched by his cat.
Show Highlights:
(02:12) What is Updog and How Does It Work
(03:38) Why Knowing If It's a Global Problem Matters
(04:01) The Problem With Testing Every Endpoint Yourself
(05:52) How Datadog Discovered EC2 Outages From Their Own Systems
(10:38) When AWS Regions Go Down and Cascade Failures
(13:13) What Happens When Services Rebuild Completely
(16:29) The Most Important Learning During a 3 AM Incident
(20:11) Why This Took So Long to Build
(23:40) When Datadog Going Down Isn't Critical Path
(25:22) How They Picked Which AWS Services to Monitor
(27:07) What Comes Next for Updog
(30:11) Where to Find Omri and Updog
Links:
Datadog: datadoghq.com
Omri’s LinkedIn: https://www.linkedin.com/in/omri-sass-65632a14/
Sponsored by:
duckbillhq.com
By Corey Quinn
