I investigated more recent models (e.g. Opus 4.5) and found that many recent LLMs improve substantially on math problems when given filler tokens.
That is, I force the LLM to answer a math question immediately, without being able to reason using Chain-of-Thought (CoT), but I do give the LLM filler tokens (e.g., text like "Filler: 1 2 3 ...") before it has to answer.
For example, filler boosts no-CoT performance from 45% to 51% (p=4e-7) on a dataset of relatively easy (competition) math problems.
The first model that is measurably uplifted by repeats/filler is Opus 3, so this effect has been around for a [...]
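To make the setup concrete, here is a minimal sketch of one way to implement the no-CoT + filler condition using the Anthropic Python SDK. This is not the author's actual evaluation harness: the model id, prompt wording, and the choice to prefill the filler into the assistant turn are illustrative assumptions.

```python
import anthropic

# Placeholder model id for illustration; substitute whichever model you want to test.
MODEL_ID = "claude-opus-4-5"

def build_filler(n: int) -> str:
    """Content-free padding, e.g. 'Filler: 1 2 3 4 5'."""
    return "Filler: " + " ".join(str(i) for i in range(1, n + 1))

def answer_without_cot(question: str, n_filler: int) -> str:
    client = anthropic.Anthropic()
    system = (
        "Respond with the final answer only. "
        "Do not write any reasoning or intermediate steps."
    )
    # The filler tokens are prefilled into the assistant turn, so the model
    # sees them in its own context before it has to commit to an answer.
    messages = [
        {"role": "user", "content": f"Question: {question}"},
        {"role": "assistant", "content": build_filler(n_filler) + "\nAnswer:"},
    ]
    resp = client.messages.create(
        model=MODEL_ID,
        max_tokens=16,  # small budget so the model cannot sneak in chain-of-thought
        system=system,
        messages=messages,
    )
    return resp.content[0].text.strip()

print(answer_without_cot("What is 37 * 43?", n_filler=50))
```

Varying `n_filler` (and swapping the counting string for other filler or repeated text) gives the sweeps over filler amount and filler type described in the outline below.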
---
Outline:
(03:59) Datasets
(06:01) Prompt
(06:58) Results
(07:01) Performance vs number of repeats/filler
(08:29) Alternative types of filler tokens
(09:03) Minimal comparison on many models
(10:05) Relative performance improvement vs default absolute performance
(13:59) Comparing filler vs repeat
(14:44) Future work
(15:53) Appendix: How helpful were AIs for this project?
(16:44) Appendix: more information about Easy-Comp-Math
(25:05) Appendix: tables with exact values
(25:11) Gen-Arithmetic - Repetitions
(25:36) Gen-Arithmetic - Filler
(25:59) Easy-Comp-Math - Repetitions
(26:21) Easy-Comp-Math - Filler
(26:45) Partial Sweep - Gen-Arithmetic - Repetitions
(27:07) Partial Sweep - Gen-Arithmetic - Filler
(27:27) Partial Sweep - Easy-Comp-Math - Repetitions
(27:48) Partial Sweep - Easy-Comp-Math - Filler
(28:09) Appendix: more plots
(28:14) Relative accuracy improvement plots
(29:05) Absolute accuracy improvement plots
(29:57) Filler vs repeat comparison plots
(30:02) Absolute
(30:30) Relative
(30:58) Appendix: significance matrices
(31:03) Easy-Comp-Math - Repetitions
(32:27) Easy-Comp-Math - Filler
(33:50) Gen-Arithmetic - Repetitions
(35:12) Gen-Arithmetic - Filler
The original text contained 9 footnotes which were omitted from this narration.
---