Why it matters. Autoregressive image generation has been limited by a fundamental tension: larger codebooks give better reconstruction but make next-token prediction harder. "BitDance: Scaling Autoregressive Generative Models with Binary Tokens" resolves this by replacing codebook indices with 256-dimensional binary vectors — giving each token 2²⁵⁶ possible states (more than atoms in the observable universe) while making sampling tractable through a novel binary diffusion head. On ImageNet 256×256, BitDance achieves an FID of 1.24, the best among all autoregressive models, and its next-patch decoding delivers 30× speedup at 1024×1024 resolution with 5.4× fewer parameters than prior parallel AR methods.
ByteDance, MMLab @ CUHK, and collaborators. This work is a collaboration between ByteDance, MMLab at The Chinese University of Hong Kong, Shanghai Jiao Tong University, the Institute of Automation at the Chinese Academy of Sciences, and the National University of Singapore. The paper is on arXiv (2602.14041). Code is available on GitHub, model weights on Hugging Face, and an interactive demo at the project page. All models are released under Apache 2.0.
The Researchers. Jiaming Han is a PhD student at MMLab@CUHK advised by Xiangyu Yue and serves as lead developer on the project (corresponding author). Hao Chen (Google Scholar) at ByteDance is a corresponding author specializing in video processing and representation learning. Xiangyu Yue is a professor at CUHK with affiliations at UC Berkeley and Stanford, known for multimodal learning and computer vision research. Huaibo Huang is at the Chinese Academy of Sciences working on generative models and face synthesis.
Key Technical Concepts. BitDance's three innovations: (1) A lookup-free binary tokenizer using group-wise binary quantization with entropy regularization — achieving 25.29 PSNR, surpassing even continuous VAEs (23.54 PSNR for Stable Diffusion's VAE). (2) A binary diffusion head that replaces an intractable softmax over 2²⁵⁶ classes with a small diffusion model that jointly generates all 256 bits via velocity matching. (3) Next-patch diffusion that predicts multiple tokens in parallel (up to 64 per step), enabling massive speedups without quality loss. The approach builds on ideas from VQ-VAE, LlamaGen, and Open-MAGVIT2, while fundamentally rethinking the discrete representation paradigm.
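To make the first idea concrete, here is a minimal NumPy sketch of lookup-free binary quantization with a per-bit entropy measure. This is an illustrative reconstruction, not the authors' code: the function names and the sign-based quantizer are assumptions, chosen to show why no codebook lookup is needed when each 256-dimensional latent is simply thresholded into a ±1 token.

```python
import numpy as np

# Hypothetical sketch (not BitDance's implementation): a continuous latent
# z in R^256 is mapped to a binary token b in {-1, +1}^256 by taking the
# sign of each dimension. No codebook lookup is required, and the token
# space has 2^256 possible states.

def binary_quantize(z: np.ndarray) -> np.ndarray:
    """Sign-quantize a latent vector to a {-1, +1} binary token."""
    return np.where(z >= 0, 1.0, -1.0)

def entropy_per_bit(codes: np.ndarray) -> np.ndarray:
    """Empirical entropy (in bits) of each bit position over a batch.

    codes: (batch, 256) array of {-1, +1} values. High entropy means a
    bit is used near-uniformly, which an entropy regularizer rewards.
    """
    p = (codes > 0).mean(axis=0)        # P(bit = +1) per dimension
    p = np.clip(p, 1e-6, 1 - 1e-6)      # guard against log(0)
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

rng = np.random.default_rng(0)
z = rng.standard_normal((8, 256))       # batch of 8 continuous latents
b = binary_quantize(z)                  # binary tokens, values in {-1, +1}
print(entropy_per_bit(b).mean())        # near 1 bit when usage is balanced
```

In training, the non-differentiable sign would typically be handled with a straight-through estimator, and the entropy term pushes each bit toward balanced usage so the full 2²⁵⁶-state token space is actually exercised rather than collapsing onto a few patterns.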
Daily Tech Feed: From the Labs is available on Apple Podcasts, Spotify, and wherever fine podcasts are distributed. Visit us at pod.c457.org for all our shows. New episodes daily.
By Daily Tech Feed