Hal Turing and Dr. Ada Shannon return to the CARTRIDGE compression system with a mechanistic lens, covering Maurizio Diaz's paper "Learned Structure in Cartridges: Keys as Shareable Routers in Self-Studied Representations" (arXiv 2508.17032), presented at the NeurIPS 2025 Workshop on Mechanistic Interpretability. Building on the original CARTRIDGE episode from November 10th, 2025 and the follow-up from February 6th, 2026, this episode asks the question those earlier discussions left open: what structure does the optimizer actually induce in a trained CARTRIDGE? The hosts ground the discussion in the memory-scaling problem driving the entire field: KV caches grow linearly with context length and now routinely dwarf model weights at the 128K-to-million-token scales of current frontier models. They trace how techniques like PagedAttention, Grouped Query Attention, and token eviction address the symptoms without shrinking the underlying representation.
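To make that scaling pressure concrete, here is a back-of-the-envelope sizing in Python. The configuration (32 layers, 8 grouped KV heads, head dimension 128, fp16, an 8B-parameter model) is a hypothetical Llama-style setup chosen for illustration; none of the numbers come from the episode.

```python
def kv_cache_bytes(seq_len: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    """Keys plus values: two tensors per layer, each seq_len x n_kv_heads x head_dim."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# Hypothetical 8B-parameter model served in fp16 with GQA (8 KV heads).
weights_gb = 8e9 * 2 / 1e9  # ~16 GB of weights
cache_gb = kv_cache_bytes(seq_len=1_000_000, n_layers=32,
                          n_kv_heads=8, head_dim=128) / 1e9

print(f"weights ~ {weights_gb:.0f} GB, 1M-token KV cache ~ {cache_gb:.0f} GB")
# ~16 GB of weights versus ~131 GB of cache at a million tokens: the cache,
# not the model, dominates memory, and it grows with every generated token.
```

Even with GQA already shrinking the cache by the head-grouping factor, the linear term dominates at long contexts, which is why paging and eviction manage the symptom rather than remove it.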
Diaz's central finding is a clean functional division between key and value vectors inside a trained CARTRIDGE. Keys converge to stable retrieval routers: low-rank, consistent structures that steer attention toward the right stored content across diverse queries. Values carry the compressed semantic payload. The hosts connect this directly to how CARTRIDGE's Self-Study training pipeline works: because the cache is optimized against synthetic question-answer traces that the model generates over its own content, the training signal explicitly selects for routing behavior, making the key-as-router outcome a predictable consequence of the objective rather than an accident. Diaz uses Singular Value Decomposition to quantify this structure layer by layer, separating the geometric properties of key matrices from those of value matrices across training checkpoints.
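The SVD diagnostic itself is simple to sketch. The snippet below contrasts a router-like key matrix, built deliberately from only eight directions, with a full-spectrum value matrix, using an effective-rank measure of the kind such an analysis relies on. The shapes, the 90% energy threshold, and the synthetic matrices are all illustrative assumptions rather than the paper's data:

```python
import numpy as np

def effective_rank(M: np.ndarray, energy: float = 0.90) -> int:
    """Smallest k whose top-k singular values hold `energy` of the squared spectral mass."""
    s = np.linalg.svd(M, compute_uv=False)
    cum = np.cumsum(s**2) / np.sum(s**2)
    return int(np.searchsorted(cum, energy)) + 1

rng = np.random.default_rng(0)
n_slots, d = 512, 128

# Router-like keys: routing needs only a few stable attention directions,
# so K is constructed from an 8-dimensional subspace.
K = rng.normal(size=(n_slots, 8)) @ rng.normal(size=(8, d))
# Payload-like values: dense content with a nearly flat spectrum.
V = rng.normal(size=(n_slots, d))

print("effective rank of K:", effective_rank(K))  # ~8: low-rank router
print("effective rank of V:", effective_rank(V))  # near d: full-spectrum payload
```

Run per layer on real checkpoint matrices, the same measurement would show how quickly the key spectra collapse relative to the value spectra as Self-Study training proceeds.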
Two downstream findings from the key-router property shape the second half of the discussion. Because keys are stable and low-rank, they transfer across tasks with minimal degradation, a result with direct implications for multi-task serving: a single shared key structure could route to task-specific value sets without independent CARTRIDGE training per deployment. The paper's Sampled Chunk Initialization method exploits this stability to warm-start CARTRIDGE training, accelerating convergence by initializing the learnable KV pairs from a small representative sample of the content rather than from random weights (sketched below). Hal and Ada close by discussing what the key-as-router framing implies for KV-cache compression research more broadly: if the routing function is separable and transferable, compression schemes that conflate keys and values may be discarding structure that has real serving-efficiency value.
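As a rough illustration of what Sampled Chunk Initialization amounts to, the single-layer sketch below seeds a cartridge's trainable KV slots from real prefill states over a few randomly sampled chunks instead of random noise. `prefill_kv` is an assumed helper standing in for a forward pass that returns a chunk's key/value states; none of the names, shapes, or hyperparameters come from the paper.

```python
import torch

def sampled_chunk_init(prefill_kv, tokens: torch.Tensor,
                       n_chunks: int = 8, chunk_len: int = 64):
    """Warm-start one layer of a cartridge from sampled corpus chunks."""
    g = torch.Generator().manual_seed(0)
    starts = torch.randint(0, tokens.numel() - chunk_len, (n_chunks,), generator=g)
    ks, vs = [], []
    for s in starts.tolist():
        # Assumed helper: run the frozen model over one chunk and return its
        # key and value states, each of shape [chunk_len, head_dim].
        k, v = prefill_kv(tokens[s : s + chunk_len])
        ks.append(k)
        vs.append(v)
    # The concatenated real states become the trainable parameters that the
    # Self-Study objective then refines, replacing a random initialization.
    return (torch.nn.Parameter(torch.cat(ks, dim=0)),
            torch.nn.Parameter(torch.cat(vs, dim=0)))
```

Because the sampled keys already sit near the low-rank routing subspace the optimizer would otherwise have to discover from scratch, a warm start of this kind is a plausible mechanism for the faster convergence described above.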
Sources:
1. Learned Structure in Cartridges: Keys as Shareable Routers in Self-Studied Representations — Maurizio Diaz, 2025
http://arxiv.org/abs/2508.17032
2. CARTRIDGES: Learning to Pack Long Contexts into KV Caches — Zhihao Zhang, Aditya Desai, Amir Gholami, Michael W. Mahoney, Kurt Keutzer, et al. (Berkeley / ICSI), 2025
https://scholar.google.com/scholar?q=CARTRIDGES:+Learning+to+Pack+Long+Contexts+into+KV+Caches
3. H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models — Zhenyu Zhang, Ying Sheng, Tianyi Zhou, et al., 2023
https://scholar.google.com/scholar?q=H2O:+Heavy-Hitter+Oracle+for+Efficient+Generative+Inference+of+Large+Language+Models
4. Scissorhands: Exploiting the Persistence of Importance Hypothesis for LLM KV Cache Compression at Test Time — Zichang Liu, Aditya Desai, Fangshuo Liao, Weitao Wang, Victor Xie, Zhaozhuo Xu, Anastasios Kyrillidis, Anshumali Shrivastava, 2023
https://scholar.google.com/scholar?q=Scissorhands:+Exploiting+the+Persistence+of+Importance+Hypothesis+for+LLM+KV+Cache+Compression+at+Test+Time
5. Efficient Streaming Language Models with Attention Sinks — Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, Mike Lewis, 2023
https://scholar.google.com/scholar?q=Efficient+Streaming+Language+Models+with+Attention+Sinks
6. Efficient Memory Management for Large Language Model Serving with PagedAttention — Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, Ion Stoica, 2023
https://scholar.google.com/scholar?q=Efficient+Memory+Management+for+Large+Language+Model+Serving+with+PagedAttention
7. Lost in the Middle: How Language Models Use Long Contexts — Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, Percy Liang, 2023
https://scholar.google.com/scholar?q=Lost+in+the+Middle:+How+Language+Models+Use+Long+Contexts
8. GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints — Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebron, Sumit Sanghai, 2023
https://scholar.google.com/scholar?q=GQA:+Training+Generalized+Multi-Query+Transformer+Models+from+Multi-Head+Checkpoints
9. Extending Context Window of Large Language Models via Positional Interpolation — Shouyuan Chen, Sherman Wong, Liangjian Chen, Yuandong Tian, 2023
https://scholar.google.com/scholar?q=Extending+Context+Window+of+Large+Language+Models+via+Positional+Interpolation
10. Constitutional AI: Harmlessness from AI Feedback — Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. (Anthropic), 2022
https://scholar.google.com/scholar?q=Constitutional+AI:+Harmlessness+from+AI+Feedback
11. Distilling the Knowledge in a Neural Network — Geoffrey Hinton, Oriol Vinyals, Jeff Dean, 2015
https://scholar.google.com/scholar?q=Distilling+the+Knowledge+in+a+Neural+Network
12. Compressing Context to Enhance Inference Efficiency of Large Language Models — Yucheng Li, Bo Dong, Chenghua Lin, Frank Guerin, 2023
https://scholar.google.com/scholar?q=Compressing+Context+to+Enhance+Inference+Efficiency+of+Large+Language+Models
13. AutoCompressors: Adapting Language Models to Summarize Arbitrary Contexts into Summary Vectors — Alexis Chevalier, Alexander Wettig, Anirudh Anand, Danqi Chen, 2023
https://scholar.google.com/scholar?q=AutoCompressors:+Adapting+Language+Models+to+Summarize+Arbitrary+Contexts+into+Summary+Vectors
14. A Mathematical Framework for Transformer Circuits — Nelson Elhage, Neel Nanda, Catherine Olsson, et al. (Anthropic), 2021
https://scholar.google.com/scholar?q=A+Mathematical+Framework+for+Transformer+Circuits
15. Toy Models of Superposition — Nelson Elhage, Tristan Hume, Catherine Olsson, et al. (Anthropic), 2022
https://scholar.google.com/scholar?q=Toy+Models+of+Superposition
16. Towards Monosemanticity: Decomposing Language Models with Dictionary Learning — Trenton Bricken, Adly Templeton, Joshua Batson, et al. (Anthropic), 2023
https://scholar.google.com/scholar?q=Towards+Monosemanticity:+Decomposing+Language+Models+with+Dictionary+Learning
17. In-context Learning and Induction Heads — Catherine Olsson, Nelson Elhage, Neel Nanda, et al. (Anthropic), 2022
https://scholar.google.com/scholar?q=In-context+Learning+and+Induction+Heads
18. Gist Tokens: Compressing Prompts into Tokens for Long-Context Language Models — Mu et al., 2023
https://scholar.google.com/scholar?q=Gist+Tokens:+Compressing+Prompts+into+Tokens+for+Long-Context+Language+Models
19. The Power of Scale for Parameter-Efficient Prompt Tuning — Lester et al., 2021
https://scholar.google.com/scholar?q=The+Power+of+Scale+for+Parameter-Efficient+Prompt+Tuning
20. Prefix-Tuning: Optimizing Continuous Prompts for Generation — Li and Liang, 2021
https://scholar.google.com/scholar?q=Prefix-Tuning:+Optimizing+Continuous+Prompts+for+Generation
21. PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling — Cai et al., 2024
https://scholar.google.com/scholar?q=PyramidKV:+Dynamic+KV+Cache+Compression+based+on+Pyramidal+Information+Funneling
22. Tokasaurus: A High-Throughput LLM Serving Engine with Grouped-Sparse Attention — Lenz et al., 2025
https://scholar.google.com/scholar?q=Tokasaurus:+A+High-Throughput+LLM+Serving+Engine+with+Grouped-Sparse+Attention
23. Unlocking the Address Book: Dissecting the Sparse Semantic Structure of LLM Key-Value Caches via Sparse Autoencoders — 2024-2025
https://scholar.google.com/scholar?q=Unlocking+the+Address+Book:+Dissecting+the+Sparse+Semantic+Structure+of+LLM+Key-Value+Caches+via+Sparse+Autoencoders
24. SCBench: A KV Cache-Centric Analysis of Long-Context Methods — 2024-2025
https://scholar.google.com/scholar?q=SCBench:+A+KV+Cache-Centric+Analysis+of+Long-Context+Methods
25. LLMLingua-2: Data Distillation for Efficient and Faithful Task-Agnostic Prompt Compression — Microsoft Research, 2024
https://scholar.google.com/scholar?q=LLMLingua-2:+Data+Distillation+for+Efficient+and+Faithful+Task-Agnostic+Prompt+Compression
26. AI Post Transformers: CARTRIDGE: Efficient In-Context Learning via Distillation — Hal Turing & Dr. Ada Shannon
https://podcasters.spotify.com/pod/show/12146088098/episodes/CARTRIDGE-Efficient-In-Context-Learning-via-Distillation-e3aous4
27. AI Post Transformers: Context Distillation for Language Models — Hal Turing & Dr. Ada Shannon
https://podcasters.spotify.com/pod/show/12146088098/episodes/Context-Distillation-for-Language-Models-e3aouen
28. AI Post Transformers: Advancements in Efficient KV Cache Quantization and Management — Hal Turing & Dr. Ada Shannon
https://podcasters.spotify.com/pod/show/12146088098/episodes/Advancements-in-Efficient-KV-Cache-Quantization-and-Management-e3fk9kr
29. AI Post Transformers: Architectural Migration to Multi-head Latent Attention — Hal Turing & Dr. Ada Shannon
https://podcasters.spotify.com/pod/show/12146088098/episodes/Architectural-Migration-to-Multi-head-Latent-Attention-e39jbmq