The paper introduces Dr.LLM (Dynamic Routing of Layers for LLMs), a retrofittable framework designed to improve both the efficiency and accuracy of Large Language Models (LLMs) without altering their base weights.
Typically, LLMs process every token through a fixed stack of transformer layers, wasting computation on simple queries while offering no additional depth for complex reasoning. Prior adaptive-depth methods have attempted to address this, but they often degrade accuracy, require expensive inference-time searches, or demand large-scale retraining and architectural changes.
Dr.LLM overcomes these limitations by equipping a frozen, pretrained LLM with lightweight, per-layer routers that dynamically decide whether to skip, execute, or repeat a specific transformer block.
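This routing mechanism can be sketched in a few lines. The sketch below is illustrative only: the class name, MLP shape, and pooling input are assumptions, not the paper's released implementation; the key idea it shows is a small trained head emitting a three-way decision (skip, execute, repeat) for its frozen transformer block.

```python
import torch
import torch.nn as nn

class LayerRouter(nn.Module):
    """Hypothetical per-layer router: a tiny MLP mapping pooled hidden
    states to one of three actions for its frozen transformer block."""
    ACTIONS = ("skip", "execute", "repeat")

    def __init__(self, hidden_size: int, bottleneck: int = 64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(hidden_size, bottleneck),
            nn.GELU(),
            nn.Linear(bottleneck, len(self.ACTIONS)),
        )

    def forward(self, pooled: torch.Tensor) -> torch.Tensor:
        # Logits over {skip, execute, repeat}; base weights stay frozen,
        # only router parameters like these are trained.
        return self.mlp(pooled)

def route_block(block, router, hidden, pooled):
    """Apply one frozen transformer block according to the router's decision."""
    action = LayerRouter.ACTIONS[router(pooled).argmax(-1).item()]
    if action == "skip":
        return hidden              # bypass the block entirely
    out = block(hidden)
    if action == "repeat":
        out = block(out)           # run the same block a second time
    return out
```

At inference, a decision like this would run once per layer, so skipped blocks directly translate into saved computation.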
Key highlights of the paper include:
- Methodology: The routers are trained using explicit supervision derived from an offline Monte Carlo Tree Search (MCTS). The MCTS discovers optimal execution paths that preserve or improve accuracy under a compute budget, creating a compact dataset of 4,000 examples to train the routers.
- Design: To keep routing decisions stable on long contexts, the routers operate on windowed mean-pooled hidden states; to handle the class imbalance among routing decisions, they are trained with a focal loss using class-rebalancing weights.
- In-Domain Results: On reasoning-heavy tasks like ARC (logic) and DART (math), Dr.LLM improves accuracy by up to +3.4 percentage points while saving an average of 5 layers of computation per example.
- Out-of-Domain Robustness: The trained routers generalize well to out-of-domain tasks (such as MMLU, GSM8k, and TruthfulQA) with only a minimal 0.85 percentage point drop in accuracy while retaining their computational efficiency.
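The two design tricks named above can be sketched concretely. This is a minimal illustration, not the paper's code: the window size, the `alpha` weights, and the function signatures are invented placeholders; what it shows is standard windowed mean-pooling and a class-weighted focal loss of the usual `(1 - p_t)^gamma` form.

```python
import torch
import torch.nn.functional as F

def windowed_mean_pool(hidden: torch.Tensor, window: int = 16) -> torch.Tensor:
    """Average the last `window` token states (hidden: batch x seq x dim),
    so the router sees a summary that stays stable as context grows."""
    return hidden[:, -window:, :].mean(dim=1)

def focal_loss(logits: torch.Tensor, targets: torch.Tensor,
               alpha: torch.Tensor, gamma: float = 2.0) -> torch.Tensor:
    """Focal loss with per-class rebalancing weights `alpha`:
    the (1 - p_t)^gamma factor down-weights easy, well-classified
    examples so rare routing actions still receive gradient."""
    log_p = F.log_softmax(logits, dim=-1)
    log_pt = log_p.gather(1, targets.unsqueeze(1)).squeeze(1)  # log p of true class
    pt = log_pt.exp()
    weight = alpha.gather(0, targets)                          # per-class weight
    return (-weight * (1 - pt) ** gamma * log_pt).mean()
```

For a confidently correct prediction, `pt` is near 1, so the focal term drives the loss toward zero; misclassified or rare-class examples dominate training instead.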
Overall, Dr.LLM successfully demonstrates that explicitly supervised routing can retrofit frozen LLMs to achieve budget-aware, accuracy-driven inference.