The June 5, 2025 research paper introduces HALoS: Hierarchical Asynchronous Local SGD, a novel optimization framework for training large language models (LLMs) across geographically distributed accelerators connected by slow, high-latency networks. The core challenge it addresses is the inefficiency of standard synchronous training methods caused by slow inter-region communication and heterogeneous hardware speeds. HALoS mitigates these issues with a two-tier architecture of local parameter servers (LPSs) and a global parameter server (GPS), which exploits fast intra-region links and asynchronous updates to reduce communication overhead and minimize straggler effects. The authors provide a rigorous convergence analysis for the non-convex objective and demonstrate empirically that HALoS converges significantly faster (up to 7.5x faster than synchronous baselines) while matching or exceeding model quality.

Sources: https://arxiv.org/pdf/2506.04531
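
To make the two-tier asynchronous update pattern concrete, the sketch below simulates it in a single Python process on a toy least-squares objective: workers run local SGD steps and push their deltas to a regional LPS without waiting for one another, and each LPS periodically merges its accumulated delta into the GPS. This is a minimal illustration under stated assumptions, not the paper's implementation; class names, the `push_every` cadence, and the simple additive merge rules are all assumptions made here for clarity.

```python
# Minimal sketch of a hierarchical asynchronous local SGD pattern (illustrative only;
# merge rules, staleness handling, and naming are assumptions, not HALoS itself).
import threading
import numpy as np

rng = np.random.default_rng(0)

# Toy objective shared by all workers: f(w) = 0.5 * ||A w - b||^2.
A = rng.normal(size=(64, 8))
b = rng.normal(size=64)

def grad(w, batch_idx):
    """Stochastic gradient of the toy objective on a mini-batch of rows."""
    Ab, bb = A[batch_idx], b[batch_idx]
    return Ab.T @ (Ab @ w - bb) / len(batch_idx)

class GlobalParameterServer:
    """Top tier: merges deltas from local parameter servers asynchronously."""
    def __init__(self, dim):
        self.w = np.zeros(dim)
        self.lock = threading.Lock()

    def apply_delta(self, delta):
        with self.lock:
            self.w += delta          # no barrier across regions
            return self.w.copy()

class LocalParameterServer:
    """Bottom tier: serves workers in one region over fast intra-region links."""
    def __init__(self, gps):
        self.gps = gps
        self.w = gps.w.copy()
        self.anchor = self.w.copy()  # snapshot from the last sync with the GPS
        self.lock = threading.Lock()
        self.pushes = 0

    def pull(self):
        with self.lock:
            return self.w.copy()

    def push(self, delta, push_every=2):
        """Apply a worker's delta; every `push_every` pushes, sync with the GPS."""
        with self.lock:
            self.w += delta
            self.pushes += 1
            if self.pushes % push_every == 0:
                global_w = self.gps.apply_delta(self.w - self.anchor)
                self.w = global_w
                self.anchor = global_w.copy()

def worker(lps, rounds=50, local_steps=4, lr=0.05):
    """One accelerator: run local SGD steps, then push asynchronously to its LPS."""
    local_rng = np.random.default_rng()
    for _ in range(rounds):
        w = lps.pull()
        w0 = w.copy()
        for _ in range(local_steps):
            batch = local_rng.integers(0, len(A), size=8)
            w -= lr * grad(w, batch)
        lps.push(w - w0)             # no synchronization with other workers

gps = GlobalParameterServer(dim=8)
regions = [LocalParameterServer(gps) for _ in range(2)]   # two simulated regions
threads = [threading.Thread(target=worker, args=(lps,))
           for lps in regions for _ in range(2)]          # two workers per region
for t in threads:
    t.start()
for t in threads:
    t.join()

print("final loss:", 0.5 * np.linalg.norm(A @ gps.w - b) ** 2)
```

The key property the sketch tries to convey is that slow inter-region traffic (LPS-to-GPS pushes) happens far less often than fast intra-region traffic (worker-to-LPS pushes), and neither tier waits on stragglers, which is the intuition behind the communication and straggler savings reported in the paper.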