
Sign up to save your podcasts
Or


How do you introduce Site Reliability Engineering to an AI research lab, bringing concepts of scale to engineers who are at the leading edge of AI systems?
In the latest episode of The Prodcast, hosts Steve McGhee and Florian Rathgeber chat with Damion Yates, who helped establish the reliability engineering culture at Google DeepMind. Damion shares his journey of bringing scalable infrastructure to DeepMind, supporting massive machine learning experiments.
Discover the unique challenges of supporting AI research, such as managing highly expensive "lockstep" training models where a single machine failure halts the entire process. Damion also explains why he believes "luck is our enemy" in systems engineering, and why protecting a research scientist's time is the ultimate metric for success.
By Salim Virji5
1818 ratings
How do you introduce Site Reliability Engineering to an AI research lab, bringing concepts of scale to engineers who are at the leading edge of AI systems?
In the latest episode of The Prodcast, hosts Steve McGhee and Florian Rathgeber chat with Damion Yates, who helped establish the reliability engineering culture at Google DeepMind. Damion shares his journey of bringing scalable infrastructure to DeepMind, supporting massive machine learning experiments.
Discover the unique challenges of supporting AI research, such as managing highly expensive "lockstep" training models where a single machine failure halts the entire process. Damion also explains why he believes "luck is our enemy" in systems engineering, and why protecting a research scientist's time is the ultimate metric for success.

32,110 Listeners

30,736 Listeners

43,594 Listeners

288 Listeners

151 Listeners

756 Listeners

682 Listeners

213 Listeners

9,534 Listeners

180 Listeners

1,076 Listeners

561 Listeners

5,544 Listeners

184 Listeners

16 Listeners