
The paper introduces Dr.LLM (Dynamic Routing of Layers for LLMs), a retrofittable framework designed to improve both the efficiency and accuracy of Large Language Models (LLMs) without altering their base weights.
Typically, LLMs process every token through a fixed stack of transformer layers, which wastes computation on simple queries while providing no extra depth for complex reasoning. Prior adaptive-depth methods have attempted to address this, but they often degrade accuracy, require expensive inference-time searches, or demand large-scale retraining and architectural changes.
Dr.LLM overcomes these limitations by equipping a frozen, pretrained LLM with lightweight, per-layer routers that dynamically decide whether to skip, execute, or repeat a specific transformer block.
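The per-layer routing loop can be illustrated with a minimal sketch. This is not the paper's implementation: the real routers are small trained modules reading transformer hidden states, whereas here the layers are toy numeric transforms and the router is an illustrative heuristic on the hidden state's mean magnitude. All names and thresholds below are assumptions for illustration only.

```python
# Toy sketch of per-layer skip/execute/repeat routing over a frozen stack.
# The layer functions and the routing heuristic are illustrative stand-ins,
# not the trained routers described in the paper.

SKIP, EXECUTE, REPEAT = "skip", "execute", "repeat"

def make_layer(weight):
    # Stand-in for a frozen transformer block: a fixed transformation
    # whose parameters are never updated.
    return lambda h: [x * weight + 1.0 for x in h]

def route(hidden, layer_idx):
    # Stand-in for a lightweight per-layer router. In Dr.LLM this is a
    # small supervised module; here, a toy threshold on mean magnitude.
    mean = sum(hidden) / len(hidden)
    if mean < 1.0:
        return SKIP      # "easy" state: skip the block, save compute
    if mean > 50.0:
        return REPEAT    # "hard" state: apply the block twice for depth
    return EXECUTE

def forward(hidden, layers):
    for i, layer in enumerate(layers):
        action = route(hidden, i)
        if action == SKIP:
            continue
        hidden = layer(hidden)
        if action == REPEAT:
            hidden = layer(hidden)  # reuses the same frozen weights
    return hidden

layers = [make_layer(w) for w in (0.5, 1.5, 2.0)]
easy = forward([0.2, 0.4], layers)   # every block skipped
hard = forward([2.0, 4.0], layers)   # every block executed once
```

The key point the sketch captures is that depth becomes input-dependent at inference time while the base weights stay untouched: only the routers decide how much of the frozen stack each input traverses.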
Key highlights of the paper include:
Overall, Dr.LLM successfully demonstrates that explicitly supervised routing can retrofit frozen LLMs to achieve budget-aware, accuracy-driven inference.
By Yun Wu