
Coffee Sessions #54 with Niall Murphy, Machine Learning SRE.
Join the Community: https://go.mlops.community/YTJoinIn
Get the newsletter: https://go.mlops.community/YTNewsletter
// Abstract
SRE is making its way into the machine learning world. Software engineering for machine learning requires reliability, performance, and maintainability. Site reliability engineering is the field that deals with reliability and ensuring constant, real-time performance. Niall Murphy, most recently Global Head of SRE at Microsoft Azure, helps us understand what SRE can do for modern ML products and teams.
Building machine learning teams requires a diverse set of technical experiences, and Niall shares his thoughts on how to do that most effectively. Machine learning organizations need to start to take advantage of SRE best practices like SLOs, which Niall walks through. Production machine learning depends on high-quality software engineering, and we get Niall's take on how to ensure that in a machine learning context.
// Bio
Niall Murphy has been interested in Internet infrastructure since the mid-1990s. He has worked with all of the major cloud providers from their Dublin, Ireland offices - most recently at Microsoft, where he was global head of Azure Site Reliability Engineering (SRE). His books have sold approximately a quarter of a million copies worldwide, most notably the award-winning Site Reliability Engineering, and he is probably one of the few people in the world to hold degrees in Computer Science, Mathematics, and Poetry Studies. He lives in Dublin, Ireland, with his wife and two children.
--------------- ✌️Connect With Us ✌️ -------------
Join our Slack community: https://go.mlops.community/slack
Follow us on Twitter: @mlopscommunity
Sign up for the next meetup: https://go.mlops.community/register
Connect with Demetrios on LinkedIn: https://www.linkedin.com/in/dpbrinkm/
Connect with David on LinkedIn: https://www.linkedin.com/in/aponteanalytics/
Connect with Vishnu on LinkedIn: https://www.linkedin.com/in/vrachakonda/
Connect with Niall on LinkedIn: https://www.linkedin.com/in/niallm/
Timestamps:
[00:00] Introduction to Niall Murphy
[00:36] SRE background to Machine Learning space transition
[07:10] SLOs being a challenge in the ML space
[09:42] SRE Hiring Investments
[15:10] Behavior of teams concept
[17:45] Challenges dealing with ML production
[18:27] Update on Reliable Machine Learning book
[22:46] Monitoring
[25:05] Difference between ML and SRE
[29:18] Incident response in Machine Learning
[34:46] Rollbacks
[35:50] Machine Learning burden over time
[42:42] Niall's journey to the SRE space and focus on developing himself
By Demetrios
