This episode explores a 2026 paper on “FlatAttention,” which argues that attention inference should be co-designed with on-chip communication primitives to fully exploit tile-based accelerators, rather than reusing GPU-style kernels. It explains how these accelerators differ from GPUs: computation is spread across many tiles, each with local SRAM, connected by an on-chip network, so data placement, multicast, and reduction become central to performance. The discussion covers why attention has become a growing inference bottleneck, especially for long-context models and MoE systems, and contrasts prefill versus decode behavior, KV-cache movement costs, and attention variants such as MHA, MQA, GQA, and MLA. Listeners will appreciate the careful framing of both the promise and the fairness concerns of hardware-software co-design, particularly in comparison with FlashAttention’s IO-aware optimization on GPUs.
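To make the KV-cache discussion concrete, here is a minimal back-of-the-envelope sketch in Python. It is not from the paper: the layer count, head counts, and dimensions are illustrative assumptions loosely modeled on a 7B-class model, and MLA is noted only in a comment since its compressed latent does not fit the per-head formula.

```python
# Rough KV-cache sizing for the attention variants discussed in the
# episode. All shapes below are hypothetical, not from FlatAttention.

def kv_bytes_per_token(n_kv_heads: int, head_dim: int, dtype_bytes: int = 2) -> int:
    """Bytes of K and V cached per token, per layer (fp16/bf16 by default)."""
    return 2 * n_kv_heads * head_dim * dtype_bytes  # factor 2: one K + one V vector

n_layers, n_q_heads, head_dim = 32, 32, 128  # assumed 7B-class shape

variants = {
    "MHA": n_q_heads,     # every query head keeps its own K/V head
    "GQA (8 groups)": 8,  # query heads share K/V heads within groups
    "MQA": 1,             # all query heads share a single K/V head
}
# MLA instead caches a low-rank latent per token (a few hundred dims
# per layer in DeepSeek-style models), shrinking the cache further.

for name, n_kv in variants.items():
    per_tok = n_layers * kv_bytes_per_token(n_kv, head_dim)
    print(f"{name:>15}: {per_tok / 1024:6.1f} KiB/token, "
          f"{per_tok * 32768 / 2**30:5.1f} GiB at a 32k context")
```

At decode time each generated token re-reads the entire cache, so these per-token bytes translate directly into memory traffic, which is why the variant choice and the placement of KV data across tiles matter so much in the episode’s discussion.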
Sources:
1. FlatAttention: Dataflow and Fabric Collectives Co-Optimization for Large Attention-Based Model Inference on Tile-Based Accelerators — Chi Zhang, Luca Colagrande, Renzo Andri, Luca Benini, 2026
https://arxiv.org/abs/2604.02110
2. In-Datacenter Performance Analysis of a Tensor Processing Unit — Norman P. Jouppi, Cliff Young, Nishant Patil, David Patterson, and others, 2017
https://scholar.google.com/scholar?q=In-Datacenter+Performance+Analysis+of+a+Tensor+Processing+Unit
3. A Domain-Specific Supercomputer for Training Deep Neural Networks — Norman P. Jouppi, George Kurian, Sheng Li, and others, 2020
https://scholar.google.com/scholar?q=A+Domain-Specific+Supercomputer+for+Training+Deep+Neural+Networks
4. A Wafer-Scale Engine for Deep Learning — Sean Lie and the Cerebras Systems team, 2021
https://scholar.google.com/scholar?q=A+Wafer-Scale+Engine+for+Deep+Learning
5. Scaling Graph Neural Networks with the Graphcore IPU — approx. Graphcore-affiliated authors in the IPU architecture/application literature, 2022
https://scholar.google.com/scholar?q=Scaling+Graph+Neural+Networks+with+the+Graphcore+IPU
6. Eyeriss v2: A Flexible Accelerator for Emerging Deep Neural Networks on Mobile Devices — Yu-Hsin Chen, Tushar Krishna, Joel S. Emer, Vivienne Sze, 2019
https://scholar.google.com/scholar?q=Eyeriss+v2:+A+Flexible+Accelerator+for+Emerging+Deep+Neural+Networks+on+Mobile+Devices
7. In-Network Computing for Machine Learning: Opportunities and Challenges — approx. survey authors in the networking/ML systems literature, 2021
https://scholar.google.com/scholar?q=In-Network+Computing+for+Machine+Learning:+Opportunities+and+Challenges
8. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism — Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, Bryan Catanzaro, 2019
https://scholar.google.com/scholar?q=Megatron-LM:+Training+Multi-Billion+Parameter+Language+Models+Using+Model+Parallelism
9. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness — Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, Christopher Ré, 2022
https://scholar.google.com/scholar?q=FlashAttention:+Fast+and+Memory-Efficient+Exact+Attention+with+IO-Awareness
10. FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning — Tri Dao, 2023
https://scholar.google.com/scholar?q=FlashAttention-2:+Faster+Attention+with+Better+Parallelism+and+Work+Partitioning
11. FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision — Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, Tri Dao, 2024
https://scholar.google.com/scholar?q=FlashAttention-3
12. FlashMLA — DeepSeek team, 2025
https://scholar.google.com/scholar?q=FlashMLA
13. GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints — Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebron, Sumit Sanghai, 2023
https://scholar.google.com/scholar?q=GQA:+Training+Generalized+Multi-Query+Transformer+Models+from+Multi-Head+Checkpoints
14. Fast Transformer Decoding: One Write-Head is All You Need — Noam Shazeer, 2019
https://scholar.google.com/scholar?q=Fast+Transformer+Decoding:+One+Write-Head+is+All+You+Need
15. DeepSeek-V3 Technical Report — DeepSeek-AI, 2024
https://scholar.google.com/scholar?q=DeepSeek-V3+Technical+Report
16. Attention Is All You Need — Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin, 2017
https://scholar.google.com/scholar?q=Attention+Is+All+You+Need
17. Wafer-Scale Deep Learning — Sean Lie, Gary Lauterbach, and collaborators at Cerebras, 2021
https://scholar.google.com/scholar?q=Wafer-Scale+Deep+Learning
18. Distributed Deep Learning on a Wafer-Scale Engine — Cerebras Systems authors, 2022
https://scholar.google.com/scholar?q=Distributed+Deep+Learning+on+a+Wafer-Scale+Engine
19. LMCache: An Efficient KV Cache Layer for Enterprise-Scale LLM Inference — approx. enterprise systems / LLM serving authors, 2024
https://scholar.google.com/scholar?q=LMCache:+An+Efficient+KV+Cache+Layer+for+Enterprise-Scale+LLM+Inference
20. HotPrefix: Hotness-Aware KV Cache Scheduling for Efficient Prefix Sharing in LLM Inference Systems — approx. LLM systems authors, 2024
https://scholar.google.com/scholar?q=HotPrefix:+Hotness-Aware+KV+Cache+Scheduling+for+Efficient+Prefix+Sharing+in+LLM+Inference+Systems
21. Selective KV-Cache Sharing to Mitigate Timing Side-Channels in LLM Inference — approx. security / systems authors, 2024
https://scholar.google.com/scholar?q=Selective+KV-Cache+Sharing+to+Mitigate+Timing+Side-Channels+in+LLM+Inference
22. MoE-Gen: High-Throughput MoE Inference on a Single GPU with Module-Based Batching — approx. MoE inference systems authors, 2024
https://scholar.google.com/scholar?q=MoE-Gen:+High-Throughput+MoE+Inference+on+a+Single+GPU+with+Module-Based+Batching
23. Accelerating Distributed MoE Training and Inference with Lina — approx. distributed systems / ML systems authors, 2024
https://scholar.google.com/scholar?q=Accelerating+Distributed+MoE+Training+and+Inference+with+Lina
24. Towards MoE Deployment: Mitigating Inefficiencies in Mixture-of-Expert (MoE) Inference — approx. MoE deployment authors, 2024
https://scholar.google.com/scholar?q=Towards+MoE+Deployment:+Mitigating+Inefficiencies+in+Mixture-of-Expert+(MoE)+Inference
25. MAS-Attention: Memory-Aware Stream Processing for Attention Acceleration on Resource-Constrained Edge Devices — approx. accelerator architecture authors, 2024
https://scholar.google.com/scholar?q=MAS-Attention:+Memory-Aware+Stream+Processing+for+Attention+Acceleration+on+Resource-Constrained+Edge+Devices
26. REATA: An Efficient Vision Transformer Accelerator Featuring a Resource-Optimized Attention Design on Versal ACAP — approx. FPGA / accelerator authors, 2024
https://scholar.google.com/scholar?q=REATA:+An+Efficient+Vision+Transformer+Accelerator+Featuring+a+Resource-Optimized+Attention+Design+on+Versal+ACAP
27. Concerto: Automatic Communication Optimization and Scheduling for Large-Scale Deep Learning — approx. systems / compiler authors, 2024
https://scholar.google.com/scholar?q=Concerto:+Automatic+Communication+Optimization+and+Scheduling+for+Large-Scale+Deep+Learning
28. AI Post Transformers: Splitwise: Phase-Split LLM Inference — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-03-26-splitwise-phase-split-llm-inference-e8945b.mp3
29. AI Post Transformers: SolidAttention: Co-Designing Sparse Attention and SSD I/O — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-03-18-solidattention-co-designing-sparse-atten-5a8622.mp3
30. AI Post Transformers: LookaheadKV: Fast and Accurate KV Cache Eviction by Glimpsing into the Future without Generation — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-03-25-lookaheadkv-fast-and-accurate-kv-9cfc9f.mp3
31. AI Post Transformers: From Prefix Cache to Fusion RAG Cache: Accelerating LLM Inference in Retrieval-Augmented Generation — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-03-22-from-prefix-cache-to-fusion-rag-9c5d39.mp3
32. AI Post Transformers: Continuous Batching for LLM Inference: Throughput and Latency Gains — Hal Turing & Dr. Ada Shannon, 2025
https://podcast.do-not-panic.com/episodes/continuous-batching-for-llm-inference-throughput-and-latency-gains/
33. AI Post Transformers: SGLang: Efficient Language Model Program Execution — Hal Turing & Dr. Ada Shannon, 2025
https://podcast.do-not-panic.com/episodes/sglang-efficient-language-model-program-execution/
34. AI Post Transformers: Speculative Speculative Decoding — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-03-25-speculative-speculative-decoding-1b7a10.mp3
35. AI Post Transformers: Jet-Nemotron and PostNAS for Faster Long Context — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-03-24-jet-nemotron-and-postnas-for-faster-long-436381.mp3
36. AI Post Transformers: FlexGen: High-Throughput LLM Inference on a Single GPU — Hal Turing & Dr. Ada Shannon, 2025
https://podcast.do-not-panic.com/episodes/flexgen-high-throughput-llm-inference-on-a-single-gpu/
Interactive Visualization: FlatAttention for Tile-Based Accelerator Inference