Introduces World Action Models (WAMs), which jointly predict video and actions with a 14B-parameter autoregressive diffusion model. The approach achieves state-of-the-art zero-shot generalization on manipulation benchmarks such as MolmoSpaces and RoboArena, a result attributed to emergent internal world modeling.