This episode examines the MATT (Model-Aware Tokenizer Transfer) paper from AGH University of Krakow, which proposes a fundamentally different approach to extending language models to underserved languages. Using Georgian as the central case study, the episode explains tokenizer fertility (the average number of subword tokens produced per word) and how tokenizers optimized for high-resource languages fragment Georgian words into six to eight subword pieces, consuming the context budget and degrading both accuracy and inference speed.
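To make the fertility metric concrete, here is a minimal sketch. The `toy_tokenize` function is a hypothetical stand-in for a real subword tokenizer (e.g. a trained BPE model), using greedy longest-match segmentation with a character fallback; the vocabulary and the romanized Georgian example are illustrative, not from the paper.

```python
def toy_tokenize(word: str, vocab: set[str]) -> list[str]:
    """Greedy longest-match segmentation, falling back to single characters."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            # Accept the longest vocabulary match, or a lone character.
            if word[i:j] in vocab or j == i + 1:
                pieces.append(word[i:j])
                i = j
                break
    return pieces

def fertility(text: str, vocab: set[str]) -> float:
    """Average number of subword tokens per whitespace-delimited word."""
    words = text.split()
    tokens = [p for w in words for p in toy_tokenize(w, vocab)]
    return len(tokens) / len(words)

# An English-centric vocabulary covers English words whole but shatters the
# romanized Georgian word "gamarjoba" ("hello") into many pieces.
vocab = {"hello", "world", "ga", "mar"}
print(fertility("hello world", vocab))  # 1.0
print(fertility("gamarjoba", vocab))    # ga|mar|j|o|b|a -> 6.0
```

A fertility near 1.0 means most words map to a single token; the six-to-eight range reported for Georgian means every word costs several times its fair share of the context window.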
The episode traces the lineage of tokenizer transfer methods from WECHSEL through FOCUS and ZETT, each of which initializes new embeddings by finding semantically similar source tokens via bilingual dictionaries or FastText projections. MATT's contribution — Attention-Informed Mapping (AIM) — reframes the problem: rather than asking which source tokens are semantically closest, it asks which embeddings are most compatible with what the model's attention layers already know how to route. This is grounded in mechanistic interpretability research showing that factual knowledge resides in FFN layers, not embeddings, making tokenizer swap feasible in principle.
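The shared mechanism in the WECHSEL/FOCUS lineage can be sketched as follows: each new target-vocabulary embedding is initialized as a similarity-weighted combination of source embeddings. The similarity scores below are placeholders for what the real methods derive from bilingual dictionaries or FastText projections, and this is the lineage's general idea, not MATT's AIM algorithm, whose attention-compatibility objective is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                 # toy embedding dimension
source_emb = rng.normal(size=(5, d))  # 5 source-vocabulary embeddings

def init_target_embedding(similarities: np.ndarray,
                          source_emb: np.ndarray,
                          top_k: int = 3) -> np.ndarray:
    """Softmax-weighted average over the top-k most similar source tokens."""
    top = np.argsort(similarities)[-top_k:]
    weights = np.exp(similarities[top])
    weights /= weights.sum()          # convex combination of source rows
    return weights @ source_emb[top]

# Toy similarity scores for one new target token against the source vocab.
sims = np.array([0.1, 0.9, 0.2, 0.8, 0.05])
new_emb = init_target_embedding(sims, source_emb)
print(new_emb.shape)  # (8,)
```

The contrast the episode draws is in the objective: these methods pick weights by semantic closeness in an auxiliary space, whereas AIM asks which resulting vectors the frozen attention layers can already route correctly.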
The episode includes a detailed comparison with the Cartridges approach, which tackles a closely related problem from a different architectural angle. Four parallel threads are developed: the Structured Continual Initialization parallel, the key-as-router insight, the separation of FFN knowledge from attention routing, and the FFN token-ID binding risk that MATT's evaluation never directly probes. The discussion argues that this last point is the sharpest untested assumption in the paper: whether feed-forward layers develop token-specific associations that break silently when the vocabulary changes.
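The key-as-router thread rests on the key-value-memory view of FFN layers from mechanistic interpretability work (e.g. Geva et al.): each row of the input projection acts as a key matched against the hidden state, and the matching row of the output projection is the value written back. A toy sketch of that view, with made-up dimensions, illustrates both why a tokenizer swap is plausible (knowledge is keyed by hidden-state directions, not token IDs) and where the untested risk lives (keys tuned to the old embedding geometry may misfire on new embeddings):

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, d_ff = 4, 6
W_in = rng.normal(size=(d_ff, d_model))   # keys: one detector per FFN unit
W_out = rng.normal(size=(d_ff, d_model))  # values: what each unit writes back

def ffn(x: np.ndarray) -> np.ndarray:
    scores = np.maximum(W_in @ x, 0.0)    # how strongly each key fires on x
    return scores @ W_out                 # score-weighted sum of values

x = rng.normal(size=d_model)              # stand-in hidden state
print(ffn(x).shape)  # (4,)
```

If swapping the embedding table shifts the distribution of `x`, the same keys fire on different inputs, which is exactly the silent-breakage scenario the episode argues MATT's evaluation never directly probes.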
Interactive Visualization: Model-Aware Tokenizer Transfer for Multilingual LLMs