AI Post Transformers

TriAttention for Efficient Long-Context KV Compression


This episode explores TriAttention, a new method for reducing KV-cache memory during long-context inference by modeling how attention behaves under Rotary Positional Embeddings (RoPE) rather than relying on recent attention patterns alone. It explains why common compression methods can fail on long reasoning tasks: under RoPE, queries at different positions are rotated into different coordinate systems, so a small window of recent post-RoPE queries is a poor predictor of which earlier tokens will matter later. The discussion highlights the paper's dual contribution: a systems result that makes 32K-token reasoning more practical, and a mechanistic argument that transformer attention has analyzable structure rather than being purely empirical. Listeners interested in efficient LLM serving, long-context reasoning, or the inner geometry of attention will find it compelling because it connects a deployment bottleneck to a concrete theoretical explanation.
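To make the geometric point concrete, here is a minimal NumPy sketch of RoPE's position-dependent rotation (an illustration of the failure mode described above, not code from the TriAttention paper): the same pre-RoPE query content, placed at different positions, is rotated into a different coordinate system each time, so its score against a fixed early key drifts as generation proceeds.

```python
import numpy as np

def rope_rotate(x, pos, theta=10000.0):
    """Apply Rotary Positional Embedding (RoFormer-style rotate-half layout):
    coordinate pair i is rotated by the angle pos * theta**(-2*i/d)."""
    d = x.shape[-1]
    half = d // 2
    freqs = theta ** (-2.0 * np.arange(half) / d)  # per-pair rotation frequencies
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos])

rng = np.random.default_rng(0)
d = 64
q = rng.standard_normal(d)                      # one pre-RoPE query, reused at several positions
k = rope_rotate(rng.standard_normal(d), pos=5)  # a fixed early key, rotated once at its position

# Identical query content scores the same early key differently as the query's
# position grows, which is why a small window of recent post-RoPE queries is a
# poor predictor of which earlier tokens will matter later in generation.
for pos in (10, 100, 1000, 10000):
    print(f"query at position {pos:>5}: q·k = {rope_rotate(q, pos) @ k:+.3f}")
```

Because RoPE makes the score depend on the relative offset between query and key positions, a cached key's relevance to the same query content keeps changing as the sequence grows; this is the drift that recency-based eviction heuristics miss and that TriAttention's trigonometric modeling targets.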
Sources:
1. TriAttention: Efficient Long Reasoning with Trigonometric KV Compression — Weian Mao, Xi Lin, Wei Huang, Yuxin Xie, Tianfu Fu, Bohan Zhuang, Song Han, Yukang Chen, 2026
http://arxiv.org/abs/2604.04921
2. RoFormer: Enhanced Transformer with Rotary Position Embedding — Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, Yunfeng Liu, 2021
https://scholar.google.com/scholar?q=RoFormer:+Enhanced+Transformer+with+Rotary+Position+Embedding
3. A Mathematical Framework for Transformer Circuits — Nelson Elhage, Neel Nanda, Catherine Olsson, et al., 2021
https://scholar.google.com/scholar?q=A+Mathematical+Framework+for+Transformer+Circuits
4. StreamingLLM: Efficient Streaming Language Models with Attention Sinks — Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, Mike Lewis, 2023
https://scholar.google.com/scholar?q=StreamingLLM:+Efficient+Streaming+Language+Models+with+Attention+Sinks
5. What Makes Rotary Positional Encodings Useful? — Federico Barbero, et al., 2025
https://scholar.google.com/scholar?q=What+Makes+Rotary+Positional+Encodings+Useful?
6. Attention Sinks and Massive Activation Values in Transformers — Xiaozhi Xiao, et al., 2025
https://scholar.google.com/scholar?q=Attention+Sinks+and+Massive+Activation+Values+in+Transformers
7. H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models — Zhenyu Zhang, et al., 2023
https://scholar.google.com/scholar?q=H2O:+Heavy-Hitter+Oracle+for+Efficient+Generative+Inference+of+Large+Language+Models
8. PyramidKV — Zhang, et al., 2024
https://scholar.google.com/scholar?q=PyramidKV
9. SnapKV — Li, et al., 2024
https://scholar.google.com/scholar?q=SnapKV
10. R-KV — Zhang, et al., 2025
https://scholar.google.com/scholar?q=R-KV
11. Vision Transformer Interpretability via Attention Rollout — Samira Abnar, Willem Zuidema, 2020
https://scholar.google.com/scholar?q=Vision+Transformer+Interpretability+via+Attention+Rollout
12. An Analysis of Attention Weights as a Proxy for Explanation — Sarthak Jain, Byron C. Wallace, 2019
https://scholar.google.com/scholar?q=An+Analysis+of+Attention+Weights+as+a+Proxy+for+Explanation
13. RazorAttention: Efficient KV Cache Compression Through Retrieval Heads — Tang, et al., 2024
https://scholar.google.com/scholar?q=RazorAttention:+Efficient+KV+Cache+Compression+Through+Retrieval+Heads
14. Not All Heads Matter: A Head-Level KV Cache Compression Method with Integrated Retrieval and Reasoning — c. 2025
https://scholar.google.com/scholar?q=Not+All+Heads+Matter:+A+Head-Level+KV+Cache+Compression+Method+with+Integrated+Retrieval+and+Reasoning
15. FreeKV: Boosting KV Cache Retrieval for Efficient LLM Inference — c. 2025
https://scholar.google.com/scholar?q=FreeKV:+Boosting+KV+Cache+Retrieval+for+Efficient+LLM+Inference
16. RAP: KV-Cache Compression via RoPE-Aligned Pruning — c. 2025
https://scholar.google.com/scholar?q=RAP:+KV-Cache+Compression+via+RoPE-Aligned+Pruning
17. EliteKV: Scalable KV Cache Compression via RoPE Frequency Selection and Joint Low-Rank Projection — c. 2025
https://scholar.google.com/scholar?q=EliteKV:+Scalable+KV+Cache+Compression+via+RoPE+Frequency+Selection+and+Joint+Low-Rank+Projection
18. Asymmetric KV Cache Compression using State-Aware Sparsity and Quantization — c. 2025
https://scholar.google.com/scholar?q=Asymmetric+KV+Cache+Compression+using+State-Aware+Sparsity+and+Quantization
19. When Attention Sink Emerges in Language Models: An Empirical View — c. 2024/2025
https://scholar.google.com/scholar?q=When+Attention+Sink+Emerges+in+Language+Models:+An+Empirical+View
20. AI Post Transformers: Lookahead Q-Cache for Consistent KV Eviction — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-03-25-lookahead-q-cache-for-consistent-kv-evic-d97b09.mp3
21. AI Post Transformers: LookaheadKV: Fast and Accurate KV Cache Eviction by Glimpsing into the Future without Generation — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-03-25-lookaheadkv-fast-and-accurate-kv-9cfc9f.mp3
22. AI Post Transformers: TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-03-25-turboquant-online-vector-quantiz-1967b7.mp3
23. AI Post Transformers: Kimi Linear: Efficient Expressive Attention Architecture — Hal Turing & Dr. Ada Shannon, 2025
https://podcast.do-not-panic.com/episodes/kimi-linear-efficient-expressive-attention-architecture/
24. AI Post Transformers: Real Context Size and Context Rot — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-04-07-real-context-size-and-context-rot-56cbb4.mp3
Interactive Visualization: TriAttention for Efficient Long-Context KV Compression