


This September 2025 paper explores the concept of long-horizon execution in Large Language Models (LLMs), arguing that marginal gains in single-step accuracy can lead to exponential improvements in the length of tasks LLMs can complete. The authors introduce a novel framework to isolate execution capabilities by providing models with necessary knowledge and plans, revealing that larger models can execute significantly more steps, even when smaller models achieve perfect single-turn accuracy. A key finding is the "self-conditioning effect," where LLMs become more prone to errors when their past mistakes are present in the context, a challenge not fully mitigated by increasing model size. However, the paper concludes that "thinking" models, which employ sequential test-time compute, effectively address this self-conditioning and can execute substantially longer tasks in a single turn.
Source:
https://arxiv.org/pdf/2509.09677
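
To make the compounding argument concrete, here is a minimal illustrative sketch (not code from the paper): it assumes a task of length H succeeds only if every step is correct and that per-step errors are independent with no recovery, so the longest task solvable at a 50% overall success rate is ln(0.5)/ln(p) for step accuracy p. Under that simplifying assumption, a step-accuracy gain from 0.99 to 0.999 stretches the achievable horizon roughly tenfold.

```python
import math

def horizon_length(step_accuracy: float, success_threshold: float = 0.5) -> float:
    """Longest task length completable at the given overall success rate,
    assuming independent per-step accuracy and no error recovery."""
    return math.log(success_threshold) / math.log(step_accuracy)

for p in (0.90, 0.99, 0.999):
    print(f"step accuracy {p:.3f} -> ~{horizon_length(p):.0f} steps at 50% success")
# step accuracy 0.900 -> ~7 steps
# step accuracy 0.990 -> ~69 steps
# step accuracy 0.999 -> ~693 steps
```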
By mcgrof