AI: post transformers

LithOS: Operating System for Efficient GPU Machine Learning



This 2025 CMU paper introduces **LithOS**, a novel operating system designed to improve the efficiency and utilization of Graphics Processing Units (GPUs) for machine learning (ML) workloads in data centers. The authors argue that current GPU management solutions, such as NVIDIA's MPS and MIG, are too coarse-grained, leading to low utilization and high latency in multi-tenant environments. LithOS proposes a transparent, OS-level approach featuring a **TPC Scheduler** for fine-grained resource control, a **Kernel Atomizer** that breaks up monolithic kernels to reduce head-of-line blocking, and mechanisms for **hardware right-sizing** and **transparent power management** (DVFS). Evaluation results demonstrate that LithOS significantly reduces tail latencies (up to 13× compared to MPS) and improves aggregate throughput in both inference-only and hybrid inference/training scenarios while achieving substantial capacity and energy savings. Overall, the work establishes a foundation for developing true operating systems for GPUs to address the growing efficiency crisis in ML infrastructure.
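To make the kernel-atomization idea concrete, here is a minimal CUDA sketch. It is not LithOS code; the `vec_add` kernel and the `ATOM_BLOCKS` slice size are illustrative assumptions. It only shows the scheduling granularity: a single monolithic launch is issued as smaller slices, so a scheduler could interleave other tenants' kernels between slices instead of waiting behind one long-running launch.

```cuda
// Conceptual sketch of kernel atomization (not the LithOS implementation).
// One large grid is launched as a sequence of smaller slices ("atoms"),
// giving a scheduler natural points to interleave other work.
#include <algorithm>
#include <cstdio>
#include <cuda_runtime.h>

__global__ void vec_add(const float *a, const float *b, float *c,
                        int n, int offset) {
    // Each slice covers a contiguous range of elements starting at `offset`.
    int i = offset + blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int N = 1 << 20;
    const int THREADS = 256;
    const int TOTAL_BLOCKS = (N + THREADS - 1) / THREADS;
    const int ATOM_BLOCKS = 64;  // hypothetical slice size per atom

    float *a, *b, *c;
    cudaMallocManaged(&a, N * sizeof(float));
    cudaMallocManaged(&b, N * sizeof(float));
    cudaMallocManaged(&c, N * sizeof(float));
    for (int i = 0; i < N; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    // Instead of one launch covering TOTAL_BLOCKS, issue it slice by slice.
    // Between slices, a fine-grained scheduler could run other tenants' kernels.
    for (int first = 0; first < TOTAL_BLOCKS; first += ATOM_BLOCKS) {
        int blocks = std::min(ATOM_BLOCKS, TOTAL_BLOCKS - first);
        int offset = first * THREADS;
        vec_add<<<blocks, THREADS>>>(a, b, c, N, offset);
    }
    cudaDeviceSynchronize();

    printf("c[0] = %f\n", c[0]);  // expect 3.0
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```

In LithOS, per the summary above, this splitting happens transparently below the application rather than in user code; the sketch only illustrates why smaller launch units reduce head-of-line blocking.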


Source:

https://www.cs.cmu.edu/~dskarlat/publications/lithos_sosp25.pdf


AI: post transformers, by mcgrof