This paper presents Emu3, a new multimodal model trained solely with next-token prediction. Relying on this single objective, rather than on diffusion or compositional architectures, Emu3 handles diverse tasks, including image generation, video generation, and vision-language understanding. The approach simplifies complex multimodal model designs and positions next-token prediction as a promising path toward artificial general intelligence (AGI). The authors support these claims with detailed comparisons to existing models and qualitative examples of Emu3's capabilities.
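The idea of a single next-token objective across modalities can be sketched as follows. This is a minimal illustration, not the actual Emu3 implementation: the tokens, special markers (`<bot>`, `<boi>`, `<eoi>`), and vision-codebook ids below are all hypothetical stand-ins for whatever the real tokenizer produces.

```python
# Toy unified token stream: text tokens and image-codebook tokens
# interleaved in one sequence, delimited by (hypothetical) special tokens.
sequence = ["<bot>", "a", "cat", "<boi>", "v17", "v903", "v42", "<eoi>"]

def training_pairs(tokens):
    """Next-token prediction: every prefix predicts the token that follows.

    Text and vision positions contribute the same kind of
    (context -> next token) example to a single training objective.
    """
    return [(tuple(tokens[:i]), tokens[i]) for i in range(1, len(tokens))]

for context, target in training_pairs(sequence):
    print(len(context), "->", target)
```

Under this framing, generation in any modality is the same operation: autoregressively sample the next token, whether it decodes to a word or an image patch.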