
This study investigates Large Reasoning Models (LRMs), which generate detailed "thinking processes" such as Chain-of-Thought, using **controllable puzzle environments** that avoid data contamination and allow fine-grained analysis of the models' "thinking traces".
Key findings reveal **three performance regimes**:
* **Low complexity**: Surprisingly, standard LLMs often outperform LRMs while using tokens more efficiently.
* **Medium complexity**: LRMs show an advantage due to their "thinking" mechanisms.
* **High complexity**: **Both LRMs and standard LLMs experience complete accuracy collapse**.
Counter-intuitively, LRMs **reduce their reasoning effort** (measured in thinking tokens) as problems approach the collapse point, despite having ample token budget remaining. Furthermore, LRMs show **limitations in exact computation and in consistently following explicit algorithms**: even when the solution algorithm was provided, performance did not improve. These findings suggest current LRMs face **fundamental barriers to generalizable and robust reasoning**.
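To make "controllable puzzle environment" concrete: these are puzzles whose difficulty is set by a single knob and whose solutions can be checked move-by-move. Below is a minimal Python sketch in that spirit, assuming a Tower-of-Hanoi-style puzzle (one of the puzzle types such studies use); the function names are illustrative, not the authors' code.

```python
def solved(pegs, n):
    """True when all n disks sit on the final peg, largest at the bottom."""
    return pegs[2] == list(range(n, 0, -1))

def verify_moves(n, moves):
    """Replay a model-proposed move sequence [(src, dst), ...] on an
    n-disk Tower of Hanoi, checking every move's legality.

    Complexity is controlled by a single knob, n: the optimal solution
    grows as 2**n - 1 moves, so difficulty scales smoothly."""
    pegs = [list(range(n, 0, -1)), [], []]  # peg 0 holds disks n..1
    for i, (src, dst) in enumerate(moves):
        if not pegs[src]:
            return False, f"move {i}: peg {src} is empty"
        disk = pegs[src][-1]
        if pegs[dst] and pegs[dst][-1] < disk:
            return False, f"move {i}: disk {disk} on smaller disk {pegs[dst][-1]}"
        pegs[dst].append(pegs[src].pop())
    return solved(pegs, n), "ok" if solved(pegs, n) else "not solved"

# Example: the optimal 3-disk solution (2**3 - 1 = 7 moves) verifies cleanly.
optimal_3 = [(0, 2), (0, 1), (2, 1), (0, 2), (1, 0), (1, 2), (0, 2)]
print(verify_moves(3, optimal_3))  # (True, 'ok')
```

Because every intermediate state is checkable, an environment like this reveals not just whether a model's final answer is right but exactly where its reasoning trace first goes wrong, which is what enables the regime analysis above.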
By CCStudios