Stop Thinking So Hard
Large reasoning models have an overthinking problem. They reach the correct answer early in their chain of thought — then keep generating thousands of additional tokens reconsidering, double-checking, and exploring alternatives they'll ultimately discard. A new paper from researchers at UT Austin, EPFL, ENS Paris-Saclay, and Telecom Paris introduces TERMINATOR, an inference-time early-exit strategy that detects when a model has already generated its final answer and stops reasoning immediately.
The key insight is that the first arrival of a model's final answer in its chain of thought is detectable from hidden states. Token confidence spikes distinctly at the answer position. Thinking-word usage shifts — words like "hmm" and "okay" cluster before the answer; words like "another" and "alternatively" cluster after. These signals are real, consistent across math, coding, and science domains, and learnable by a small classifier.
TERMINATOR is a single transformer layer — initialized from the base model's final layer — with a binary prediction head trained to predict answer arrival at every token position. At inference time, a sliding window of the ten most recent predictions triggers a stop when majority vote says the answer is already there, injecting a close-thinking token into the token stream. No data-calibrated thresholds. No test-time distribution samples. Train once, deploy anywhere.
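The sliding-window stop rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `first_exit_position` is hypothetical, the per-token predictions are assumed to arrive as booleans, and two details the notes do not specify are filled in as assumptions (the vote starts only once the window is full, and a strict majority is required).

```python
from collections import deque
from typing import Iterable, Optional


def first_exit_position(predictions: Iterable[bool], window: int = 10) -> Optional[int]:
    """Return the token index at which a majority-vote stop would fire, or None.

    `predictions` holds one binary "answer already present" prediction per
    generated token. Hypothetical helper for illustration only.
    """
    recent = deque(maxlen=window)  # keeps only the `window` most recent votes
    for index, predicted in enumerate(predictions):
        recent.append(bool(predicted))
        # Assumption: vote only once the window is full, and require a
        # strict majority of positive predictions before stopping.
        if len(recent) == window and sum(recent) > window // 2:
            return index
    return None


# Twenty "not yet" predictions followed by ten "answer present" predictions.
stream = [False] * 20 + [True] * 10
exit_at = first_exit_position(stream)
print(exit_at)
```

In a real decoding loop, the position returned here is where the close-thinking token would be injected. Voting over a window rather than acting on a single prediction means that one stray positive prediction does not end the reasoning early.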
Results
Tested on Qwen3-8B, Qwen3-14B, Ministral-3-8B-Reasoning, and Ministral-3-14B-Reasoning across MATH-500, AIME 2025, HumanEval, and GPQA:
Best or second-best on 28 out of 32 metrics (accuracy + compression rate)
MATH-500: ~45% token reduction, accuracy drop under 0.5 percentage points
AIME 2025: ~30% reduction; TERMINATOR exits too early on hard problems (a documented failure mode)
Consistently occupies the best accuracy-efficiency Pareto frontier position versus DEER, Dynasor, Thought Calibration, and NoThinking
Links
Paper: arXiv:2603.12529, "TERMINATOR: Learning Optimal Exit Points for Early Stopping in Chain-of-Thought Reasoning"
Authors: Alliot Nagle (UT Austin), Jakhongir Saydaliev (EPFL), Dhia Garbaya (EPFL / ENS Paris-Saclay), Michael Gastpar (EPFL), Ashok Vardhan Makkuva (Telecom Paris / IP Paris), Hyeji Kim (UT Austin)
Related Work Mentioned
DEER: chunk-based early exit via token probability thresholds
Dynasor: periodic intermediate answer consistency checks
Thought Calibration: linear probes on reasoning step hidden states
Self-Certainty / Kang et al.: KL divergence confidence metric for reasoning
DeepSeek-R1: large reasoning model showing overthinking phenomenon
Qwen3: base models used in experiments
vLLM: inference framework used for dataset curation
Datasets
MATH: Lightman et al., mathematical problem solving
AIME 2025: American Invitational Mathematics Examination
HumanEval: Chen et al., Python code generation
GPQA: Rein et al., graduate-level science questions
OpenScience: NVIDIA, scientific research dataset
OpenCoder-SFT: Huang et al., code instruction fine-tuning
DTF:FTL is produced by PDX Hackerspace Foundation. Find us on Apple Podcasts, Spotify, or wherever fine podcasts are distributed.