June 08, 2026

“How Far Apart Does a Model Think Its Tokens Are?” by Brendan Long

16 minutes

Instead of advancing the position by a fixed +1 per token, RoPE-based language models can learn position increments per token and per layer. This has no detectable effect on model performance, but it lets us see how far apart the model thinks adjacent tokens are and how that distance varies by layer.
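To make the idea concrete, here is a minimal NumPy sketch of learned position increments feeding into RoPE. It is an illustration under stated assumptions, not the author's implementation: the softplus-of-a-linear-projection parameterization, the names `learned_positions` and `rope`, and the parameters `w` and `b` are all hypothetical. In a per-layer variant, each layer would presumably hold its own `w` and `b`.

```python
import numpy as np

def softplus(x):
    # Numerically stable log(1 + exp(x)); always positive.
    return np.logaddexp(0.0, x)

def learned_positions(hidden, w, b):
    """Map token states (seq, d_model) to fractional positions (seq,).

    Assumption: each token's increment is softplus(hidden @ w + b),
    so positions strictly increase along the sequence.
    """
    increments = softplus(hidden @ w + b)
    # Cumulative sum of the steps, shifted so the first token sits at 0.
    return np.cumsum(increments) - increments[0]

def rope(x, positions, base=10000.0):
    """Rotate feature pairs of x (seq, d_head) by angle position * frequency.

    d_head must be even. Positions may be non-integer.
    """
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)       # (d/2,)
    angles = positions[:, None] * freqs[None, :]    # (seq, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x, dtype=float)
    out[:, 0::2] = x_even * cos - x_odd * sin
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out
```

With `w` set to zeros and `b = log(e - 1)`, every increment equals 1 and the positions reduce to the usual 0, 1, 2, ... of standard RoPE. Because attention scores under RoPE depend only on position differences, the learned increments can be read as the distance the model places between neighboring tokens.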

Example sentence with each character plotted based on per-layer learned position increments. Note the clear punctuation-based boundaries in L0 and what looks like concept-based grouping in L3.

I think this might be useful as another technique to inspect "where the model is looking" in addition to plotting attention patterns (and with similar limitations). The patterns can also hint at what the model is looking for at each layer (when position increments match different kinds of boundaries).

Note: This is still partially a solution in search of a problem. I'm hoping to help with the "searching under lamp posts" problem by finding more lamp posts, but there's additional work to be done here to see if this is actually useful or just a novelty.

AI disclaimer: The Architecture, Learned Position Increments, and Related Work sections were originally drafted by Claude before being (heavily) human-edited.

Introduction

Standard LLMs use Rotary Position Embeddings (RoPE) to [...]

---

Outline:

(01:20) Introduction

(01:52) Method

(01:55) Architecture

(02:43) Learned position increments

(04:22) Data and training

(05:20) Results

(05:23) Per-Token Increments

(06:44) First Layer of Per-Layer Model

(07:17) Chinese Word Boundaries

(08:30) Per-Layer Plots

(09:16) Grouping Multi-word Entities

(10:27) Loss Neutral

(11:11) Limitations

(11:49) Future Work

(14:14) Related Work

(15:34) Code

The original text contained 1 footnote which was omitted from this narration.

---

First published:

June 7th, 2026

Source:

https://www.lesswrong.com/posts/Bxju8Fmpo2eW4oj9t/how-far-apart-does-a-model-think-its-tokens-are

---

Narrated by TYPE III AUDIO.


By LessWrong
