


We were curious whether large language models behave consistently when user prompts contain typos. To explore this, we ran a small experiment that injected typos into BigCodeBench prompts and evaluated several Claude models under increasing noise levels. As the typo rate rose to 16%, Opus's accuracy dropped by 9%. Surprisingly, Haiku's accuracy increased by 22%.
This post examines this unexpected “typo uplift” phenomenon and explores why noise appears to help certain models.
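To make "typo rate" concrete, here is a minimal sketch of one way such noise could be injected. The excerpt does not say how the typos were actually generated, so the character-substitution scheme, the function name `inject_typos`, and the `seed` parameter are all assumptions made for illustration.

```python
import random
import string


def inject_typos(text: str, rate: float, seed: int = 0) -> str:
    """Replace roughly `rate` of the alphabetic characters with a random letter.

    Hypothetical helper: substitution is only one kind of typo, and the
    original experiment may have used a different scheme.
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    rng = random.Random(seed)  # seeded so the same prompt gets the same noise
    chars = list(text)
    for i, ch in enumerate(chars):
        if ch.isalpha() and rng.random() < rate:
            # Choose a different letter so every selected character changes.
            choices = [c for c in string.ascii_lowercase if c != ch.lower()]
            new = rng.choice(choices)
            chars[i] = new.upper() if ch.isupper() else new
    return "".join(chars)
```

Under this scheme, calling `inject_typos(prompt, 0.16)` would corrupt about 16% of the letters in `prompt` while leaving whitespace, digits, and punctuation untouched.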
Do Typos Make Haiku Try Harder?
We first hypothesized that Haiku's accuracy increased because harder-to-read text makes Haiku think harder. This would match findings in humans that hard-to-read fonts help students retain knowledge, because they force readers to expend more effort. As a proxy for effort, we plotted the number of output tokens generated by both models[1]. Contrary to our hypothesis, the number of output tokens decreased as the typo rate increased.
Typos don't make models think harder. As typo rates increase, the output lengths of Haiku and Opus go down.
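The comparison behind this plot can be sketched as a simple aggregation over per-problem results. This is an illustrative sketch only: the record fields `typo_rate`, `output_tokens`, and `passed` and the function name `summarize_by_typo_rate` are hypothetical, not taken from the original experiment.

```python
from collections import defaultdict
from statistics import mean


def summarize_by_typo_rate(records):
    """Group per-problem results by typo rate and average them.

    Each record is assumed to be a dict with the keys 'typo_rate',
    'output_tokens', and 'passed' (a hypothetical schema).
    """
    grouped = defaultdict(list)
    for record in records:
        grouped[record["typo_rate"]].append(record)

    summary = {}
    for rate in sorted(grouped):
        runs = grouped[rate]
        summary[rate] = {
            "n": len(runs),
            # Output length is the effort proxy discussed above.
            "mean_output_tokens": mean(r["output_tokens"] for r in runs),
            "accuracy": mean(1.0 if r["passed"] else 0.0 for r in runs),
        }
    return summary
```

Running this once per model and plotting `mean_output_tokens` against the typo rate would show whether output length rises or falls as the noise increases.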
The Anomaly is Haiku-Specific
We then tested whether other small models show the same typo uplift anomaly. We found that both Haiku 3.5 and 4.5 become more accurate as typos increase, while other smaller models from [...]
---
Outline:
(00:54) Do Typos Make Haiku Try Harder?
(01:34) The Anomaly is Haiku-Specific
(02:08) The Anomaly is Benchmark-Specific
(02:42) The Culprit
(04:02) Takeaways for the Eval Engineer
(04:06) Not all grading harnesses are created equal
(04:48) Scores are lower bounds
(05:15) Aligning the model to the eval
(05:43) Appendix
The original text contained 2 footnotes which were omitted from this narration.
---
Narrated by TYPE III AUDIO.
By LessWrong
