
Sign up to save your podcasts
Or
Failure recovery is a very big issue when it comes to AI clusters because there are always failures and when the failure come, it’s a big thing because you need to stop the calculation, go back to the last checkpoint. You lose a lot of time and money and resources that are spent idle and and wasted time. And the networking part is crucial in order to create a fail.
Failure recovery is a very big issue when it comes to AI clusters because there are always failures and when the failure come, it’s a big thing because you need to stop the calculation, go back to the last checkpoint. You lose a lot of time and money and resources that are spent idle and and wasted time. And the networking part is crucial in order to create a fail.