Best AI papers explained

RelayLLM: Efficient Reasoning via Collaborative Decoding


This paper introduces **RelayLLM**, a framework that improves the efficiency of complex reasoning by enabling **token-level collaboration** between small and large language models. Unlike traditional routers, which offload entire queries, the **Small Language Model (SLM)** acts as an active controller that emits a special command to "relay" specific, difficult reasoning steps to a **Large Language Model (LLM)**. The system is trained in two stages: a **supervised warm-up** followed by **reinforcement learning** with difficulty-aware rewards that balance independence against strategic help-seeking. Across multiple benchmarks, the method significantly boosts the accuracy of smaller models while invoking the larger expert for only about **1.07% of the total tokens**, yielding a **98.2% reduction in computational cost** compared to performance-matched routing methods. This targeted intervention also lets the smaller model internalize better reasoning patterns, occasionally improving its **independent performance** even without teacher assistance.
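The relay mechanism described above can be sketched as a simple decoding loop. This is a minimal illustrative sketch, not the paper's implementation: the stub model functions, the `<relay>` token name, and the hand-off granularity (one expert token per relay) are all assumptions made for clarity.

```python
# Minimal sketch of RelayLLM-style token-level collaborative decoding.
# The model stubs and the relay-token name are illustrative assumptions.

RELAY = "<relay>"  # hypothetical special command token the SLM is trained to emit

def small_model_step(context):
    # Stub SLM: emits the relay command on a "hard" step,
    # otherwise decodes the next token itself.
    return RELAY if context and context[-1] == "hard" else "slm_tok"

def large_model_step(context):
    # Stub LLM: the expert produces the token for the difficult step.
    return "llm_tok"

def relay_decode(prompt_tokens, max_new_tokens=8):
    """Generate with the SLM, handing off a step to the LLM whenever
    the SLM emits the relay command. Returns (tokens, llm_call_count),
    so the fraction of expert-generated tokens can be measured."""
    out = list(prompt_tokens)
    llm_calls = 0
    for _ in range(max_new_tokens):
        tok = small_model_step(out)
        if tok == RELAY:
            tok = large_model_step(out)  # expert completes the hard step
            llm_calls += 1
        out.append(tok)
    return out, llm_calls
```

In this toy run, only one of the generated tokens comes from the expert, mirroring the paper's point that the LLM is invoked for a very small fraction of total tokens:

```python
tokens, calls = relay_decode(["easy", "hard"], max_new_tokens=4)
# the SLM relays once (on the "hard" step), then continues on its own
```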


By Enoch H. Kang