
This document introduces T2I-R1, a text-to-image generation model that uses Reinforcement Learning (RL) and a bi-level Chain-of-Thought (CoT) process to improve image generation. Rather than mapping the prompt directly to an image, T2I-R1 uses a semantic-level CoT for high-level planning based on the text prompt, and a token-level CoT for detailed, patch-by-patch image generation. A key component is BiCoT-GRPO, an RL method that optimizes both levels of CoT simultaneously, using an ensemble of vision experts to provide diverse and robust generation rewards. Applied to a Unified Large Multimodal Model (ULM), this approach lets T2I-R1 outperform its baseline and state-of-the-art models on established benchmarks.
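To make the reward side of BiCoT-GRPO more concrete, here is a minimal sketch of how a group-relative advantage might be computed from an ensemble of reward models. It is an illustration under stated assumptions, not the authors' implementation: the names `ensemble_reward` and `group_relative_advantages` are hypothetical, the expert scores are combined by a plain average, and the group's population standard deviation is used for normalization.

```python
from statistics import mean, pstdev
from typing import Callable, List, Sequence

# Hypothetical reward-model signature: (prompt, image identifier) -> score.
RewardFn = Callable[[str, str], float]


def ensemble_reward(prompt: str, image: str, experts: Sequence[RewardFn]) -> float:
    """Combine several expert scores into one reward (assumption: simple average)."""
    return mean([expert(prompt, image) for expert in experts])


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one group of samples drawn for the same prompt.

    Each sample's advantage is its reward minus the group mean, divided by the
    group standard deviation (population std here, which is an assumption).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy usage: three candidate images for one prompt, scored by two stand-in experts.
experts = [
    lambda prompt, image: {"img_a": 0.2, "img_b": 0.5, "img_c": 0.9}[image],
    lambda prompt, image: {"img_a": 0.4, "img_b": 0.5, "img_c": 0.7}[image],
]
rewards = [ensemble_reward("a red cube on a blue sphere", img, experts)
           for img in ("img_a", "img_b", "img_c")]
advantages = group_relative_advantages(rewards)
```

In this sketch, one advantage value is produced per generated sample. Under the bi-level setup described above, that value would presumably weight both the semantic-level planning tokens and the token-level image tokens of the same sample, which is one way to optimize both levels in the same update.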