


Arxiv: https://arxiv.org/abs/2507.22058
This episode of "The AI Research Deep Dive" unpacks "X-Omni," a paper from Tencent that makes a bold claim: reinforcement learning can make autoregressive image models "great again." The host explains how this method tackles the historical weaknesses of autoregressive models, like blurry images and notoriously bad spelling. Listeners will learn about X-Omni's clever three-part architecture, which uses a large language model as a high-level planner, a semantic tokenizer for visual concepts, and a powerful diffusion model as a renderer. The episode's core focus is the sophisticated reinforcement learning loop that fine-tunes the model using a panel of "expert" reward models—including an "art critic" and a "spelling bee judge"—to achieve state-of-the-art results in generating coherent images with long, perfectly-spelled text.
By The AI Research Deep Dive