This episode explores TurboQuant, a method for compressing high-dimensional vectors online without learning a dataset-specific codebook first, aimed at settings like LLM KV-cache compression and approximate nearest neighbor search. It explains why vector quantization is a different problem from ordinary weight quantization, and why preserving inner products can matter just as much as minimizing reconstruction error for retrieval quality and attention behavior. The discussion focuses on the paper’s central idea that a random rotation can regularize vectors enough for simple scalar quantization to approach information-theoretic distortion limits, at least under the paper’s theoretical assumptions. Listeners would find it interesting because it connects rate-distortion theory to concrete systems bottlenecks in modern AI, while also critically examining where the paper’s theoretical strength outpaces its empirical validation.
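The rotate-then-quantize idea at the heart of the discussion can be sketched in a few lines of NumPy. This is an illustrative toy, not TurboQuant's actual construction: the random orthogonal rotation (via QR of a Gaussian matrix), the 4-bit uniform scalar quantizer, and the heavy-tailed test data are all assumptions chosen to show why rotating first helps when a few coordinates dominate.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, bits = 64, 1000, 4

# Random orthogonal rotation: QR decomposition of a Gaussian matrix.
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))

# Heavy-tailed input: one coordinate dominates each vector, so a
# per-vector uniform quantizer must stretch its range to cover it.
X = rng.standard_normal((n, d))
X[:, 0] *= 10.0

def scalar_quantize(v, bits):
    """Uniform scalar quantizer over the vector's own range (per-vector scale)."""
    levels = 2 ** bits - 1
    lo, hi = v.min(), v.max()
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = np.round((v - lo) / scale)
    return codes * scale + lo

def recon_mse(X, rotate):
    """Mean squared reconstruction error, with or without a pre-rotation."""
    Y = X @ Q.T if rotate else X
    Yq = np.stack([scalar_quantize(y, bits) for y in Y])
    Xq = Yq @ Q if rotate else Yq  # rotation is orthogonal, so we can undo it
    return np.mean((X - Xq) ** 2)

print(f"plain   MSE: {recon_mse(X, rotate=False):.5f}")
print(f"rotated MSE: {recon_mse(X, rotate=True):.5f}")
```

After rotation, the energy of the dominant coordinate spreads across all dimensions, so each rotated vector's entries have similar magnitude, the quantizer's step size shrinks, and the reconstruction error drops, while the rotation itself (being orthogonal) changes neither norms nor inner products.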
Sources:
1. TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate — Amir Zandieh, Majid Daliri, Majid Hadian, Vahab Mirrokni, 2025
http://arxiv.org/abs/2504.19874
2. Product Quantization for Nearest Neighbor Search — Hervé Jégou, Matthijs Douze, Cordelia Schmid, 2011
https://scholar.google.com/scholar?q=Product+Quantization+for+Nearest+Neighbor+Search
3. Quantization based Fast Inner Product Search — Ruiqi Guo, Sanjiv Kumar, Krzysztof Choromanski, David Simcha, 2016
https://scholar.google.com/scholar?q=Quantization+based+Fast+Inner+Product+Search
4. Norm-Explicit Quantization: Improving Vector Quantization for Maximum Inner Product Search — Xinyan Dai, Xiao Yan, Kelvin K. W. Ng, Jie Liu, James Cheng, 2020
https://scholar.google.com/scholar?q=Norm-Explicit+Quantization:+Improving+Vector+Quantization+for+Maximum+Inner+Product+Search
5. Accelerating Large-Scale Inference with Anisotropic Vector Quantization — Ruiqi Guo, Philip Sun, Erik Lindgren, Quan Geng, David Simcha, Felix Chern, Sanjiv Kumar, 2020
https://scholar.google.com/scholar?q=Accelerating+Large-Scale+Inference+with+Anisotropic+Vector+Quantization
6. QJL: 1-bit Quantized JL Transform for KV Cache Quantization with Zero Overhead — Amir Zandieh, Majid Daliri, Insu Han, 2024
https://scholar.google.com/scholar?q=QJL:+1-bit+Quantized+JL+Transform+for+KV+Cache+Quantization+with+Zero+Overhead
7. PolarQuant: Quantizing KV Caches with Polar Transformation — Insu Han, Praneeth Kacham, Amin Karbasi, Vahab Mirrokni, Amir Zandieh, 2025
https://scholar.google.com/scholar?q=PolarQuant:+Quantizing+KV+Caches+with+Polar+Transformation
8. Practical and Asymptotically Optimal Quantization of High-Dimensional Vectors in Euclidean Space for Approximate Nearest Neighbor Search — Jianqiao Gao, Yuxuan Gou, Yiming Xu, Yuting Yang, Cheng Long, Raymond Chi-Wing Wong, 2024
https://scholar.google.com/scholar?q=Practical+and+Asymptotically+Optimal+Quantization+of+High-Dimensional+Vectors+in+Euclidean+Space+for+Approximate+Nearest+Neighbor+Search
9. KIVI: A Tuning-Free Asymmetric 2-bit Quantization for KV Cache — Zirui Liu, Jiayi Yuan, Hongye Jin, Shaochen Zhong, Zhaozhuo Xu, Vladimir Braverman, Beidi Chen, Xia Hu, 2024
https://scholar.google.com/scholar?q=KIVI:+A+Tuning-Free+Asymmetric+2-bit+Quantization+for+KV+Cache
10. KVSink: Understanding and Enhancing the Preservation of Attention Sinks in KV Cache Quantization for LLMs — Zunhai Su, Kehong Yuan, 2025
https://scholar.google.com/scholar?q=KVSink:+Understanding+and+Enhancing+the+Preservation+of+Attention+Sinks+in+KV+Cache+Quantization+for+LLMs
11. ZipCache: Accurate and Efficient KV Cache Quantization with Salient Token Identification — Yefei He, Luoming Zhang, Weijia Wu, Jing Liu, Hong Zhou, Bohan Zhuang, 2024
https://scholar.google.com/scholar?q=ZipCache:+Accurate+and+Efficient+KV+Cache+Quantization+with+Salient+Token+Identification
12. AKVQ-VL: Attention-Aware KV Cache Adaptive 2-Bit Quantization for Vision-Language Models — Zunhai Su, Wang Shen, Linge Li, Zhe Chen, Hanyu Wei, Huangqi Yu, Kehong Yuan, 2025
https://scholar.google.com/scholar?q=AKVQ-VL:+Attention-Aware+KV+Cache+Adaptive+2-Bit+Quantization+for+Vision-Language+Models
13. SpinQuant: LLM Quantization with Learned Rotations — Zechun Liu, Changsheng Zhao, Igor Fedorov, Bilge Soran, Dhruv Choudhary, Raghuraman Krishnamoorthi, Vikas Chandra, Yuandong Tian, Tijmen Blankevoort, 2024
https://scholar.google.com/scholar?q=SpinQuant:+LLM+Quantization+with+Learned+Rotations
14. Rotate, Clip, and Partition: Towards W2A4KV4 Quantization by Integrating Rotation and Learnable Non-uniform Quantizer — Euntae Choi, Sumin Song, Woosang Lim, Sungjoo Yoo, 2025
https://scholar.google.com/scholar?q=Rotate,+Clip,+and+Partition:+Towards+W2A4KV4+Quantization+by+Integrating+Rotation+and+Learnable+Non-uniform+Quantizer
15. Locally-Adaptive Quantization for Streaming Vector Search — Cecilia Aguerrebere, Mark Hildebrand, Ishwar Singh Bhati, Theodore Willke, Mariano Tepper, 2024
https://scholar.google.com/scholar?q=Locally-Adaptive+Quantization+for+Streaming+Vector+Search
16. Sampling Methods for Inner Product Sketching — Majid Daliri, Juliana Freire, Christopher Musco, Aecio Santos, Haoxiang Zhang, 2024
https://scholar.google.com/scholar?q=Sampling+Methods+for+Inner+Product+Sketching
17. AI Post Transformers: Memory Traffic Saturation in Transformer Decode — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-03-20-memory-traffic-saturation-in-transformer-cd4961.mp3
18. AI Post Transformers: LAQ for Smarter KV Cache Eviction — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-03-23-laq-for-smarter-kv-cache-eviction-3ea2b8.mp3
19. AI Post Transformers: Lookahead Q-Cache for Consistent KV Eviction — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-03-25-lookahead-q-cache-for-consistent-kv-evic-d97b09.mp3
20. AI Post Transformers: AWQ: On-Device LLM Compression and Acceleration — Hal Turing & Dr. Ada Shannon, 2025
https://podcast.do-not-panic.com/episodes/awq-on-device-llm-compression-and-acceleration/
21. AI Post Transformers: Sentence-BERT: Siamese Networks for Sentence Embeddings — Hal Turing & Dr. Ada Shannon, 2025
https://podcast.do-not-panic.com/episodes/sentence-bert-siamese-networks-for-sentence-embeddings/
22. AI Post Transformers: SolidAttention: Co-Designing Sparse Attention and SSD I/O — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-03-18-solidattention-co-designing-sparse-atten-5a8622.mp3