


When your website stops working at 3 AM, you need to answer one question fast: is it my code, or is a big cloud provider having problems? Omri Sass from Datadog explains updog.ai, a tool that monitors whether major services like AWS, Cloudflare, and others are actually working. Instead of relying on user-submitted problem reports, as Downdetector does, updog uses real data from thousands of computers to detect when services go down. Omri shares why this took 6 years to build, how they process massive amounts of data with machine learning, and why cloud providers have been strangely upset about these tools existing.
About Omri:
Omri Sass is a Director of Product Management at Datadog, where he leads and supports a team of 25+ product managers driving initiatives across Bits AI SRE, Data Observability, Service Management, and most recently, the launch of updog.ai. Outside of work, Omri is an avid sci-fi reader, a dedicated yoga practitioner, and happily outmatched by his cat.
Show Highlights:
(02:12) What is Updog and How Does It Work
(03:38) Why Knowing If It's a Global Problem Matters
(04:01) The Problem With Testing Every Endpoint Yourself
(05:52) How Datadog Discovered EC2 Outages From Their Own Systems
(10:38) When AWS Regions Go Down and Cascade Failures
(13:13) What Happens When Services Rebuild Completely
(16:29) The Most Important Learning During a 3 AM Incident
(20:11) Why This Took So Long to Build
(23:40) When Datadog Going Down Isn't Critical Path
(25:22) How They Picked Which AWS Services to Monitor
(27:07) What Comes Next for Updog
(30:11) Where to Find Omri and Updog
Links:
Datadog: datadoghq.com
Omri’s LinkedIn: https://www.linkedin.com/in/omri-sass-65632a14/
Sponsored by:
duckbillhq.com
By Corey Quinn
