Cloud Nets

S3 E13 Failure Recovery



Failure recovery is a very big issue in AI clusters because failures are inevitable. When a failure comes, it’s a big deal: you need to stop the calculation and go back to the last checkpoint. You lose a lot of time and money, and resources sit idle and are wasted. And the networking part is crucial in order to create a fail...



Cloud Nets, by DriveNets