Daily Tech Feed: From the Labs

DreamZero — World Action Models are Zero-shot Policies



Episode 011: DreamZero — World Action Models are Zero-shot Policies

A single-paper deep dive into DreamZero, a 14B World Action Model from NVIDIA that jointly predicts future video states and robot actions — enabling zero-shot generalization to unseen tasks, cross-embodiment transfer from 10–20 minutes of video, and real-time closed-loop control at 7Hz. This paper proposes a new paradigm: rather than grounding robot policies in language models, ground them in video generation. Physics lives in video. And video generation keeps getting better.

The Paper

Why it matters. "World Action Models are Zero-shot Policies" introduces DreamZero, a World Action Model (WAM) built on a pretrained video diffusion backbone. Unlike Vision-Language-Action (VLA) models, which inherit semantic knowledge from image-text datasets but lack physical dynamics, WAMs learn spatiotemporal physics from video and jointly generate future frames and motor actions. DreamZero achieves over 2× improvement on generalization benchmarks versus state-of-the-art VLAs, 39.5% average task progress on completely unseen tasks (vs. near-zero for VLAs), 42% relative improvement from video-only cross-embodiment data, and new-embodiment adaptation with only 30 minutes of play data.

Project page: dreamzero0.github.io — includes a living gallery of 100+ zero-shot tasks the robot discovered through interactive prompting: fanning burgers, pressing elevator buttons, playing xylophone, shaking tambourine, watering plants, and more.

Open-source release: github.com/dreamzero0/dreamzero — model weights, inference code, and code to run three public benchmarks: DROID, PolaRiS, and Genie Sim 3.0 (100 simulation tasks).

arXiv: arXiv:2602.15922 — submitted February 17, 2026. HTML version also available.

Key Results
  • AgiBot seen tasks: 62.2% average task progress (zero-shot, novel environments) vs. 27.4% for best pretrained VLA — over 2× improvement.
  • AgiBot unseen tasks (e.g., untying shoelaces, shaking hands): 39.5% task progress vs. near-zero for VLAs. VLAs default to pick-and-place regardless of instruction. DreamZero attempts the correct motion.
  • DROID unseen verbs (actions absent from public DROID dataset): 49% vs. 25–32% for state-of-the-art VLAs. Trained on DROID, one of the most heterogeneous open-source robotics datasets.
  • Cross-embodiment transfer: Video-only demonstrations from another robot (YAM) or humans yield a 42% relative improvement on unseen task performance with only 10–20 minutes of data. No action labels required — only video.
  • New embodiment adaptation: DreamZero pretrained on AgiBot G1 adapts to an entirely new robot (YAM) with only 30 minutes of play data (55 unstructured trajectories), retaining zero-shot generalization to novel objects including teddy bears, pumpkins, and paper bags.
  • Real-time inference: 38× speedup from naive baseline (5.7 seconds/chunk) to 150ms/chunk, enabling 7Hz closed-loop control with a 14B parameter model.
Architecture: How DreamZero Works

DreamZero is a 14B autoregressive diffusion transformer (DiT) initialized from the WAN image-to-video model (Team Wan, 2025), one of the strongest open video generation systems. The backbone was pretrained on web-scale video data, encoding spatiotemporal physical dynamics that VLMs trained on static image-text pairs lack.

Joint prediction: Given a language instruction, visual observation history, and proprioceptive state, DreamZero simultaneously generates future video frames and motor actions. This is mathematically equivalent to factoring the policy into autoregressive video prediction and an inverse dynamics model (IDM) — but training end-to-end achieves tighter video-action alignment than training the two separately.
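To make the factorization concrete, here is a toy sketch in plain Python. The scalar "frames", the noise model, and all function names are illustrative stand-ins, not the paper's implementation:

```python
import random

random.seed(0)

def predict_next_frame(history):
    # Stand-in for the autoregressive video model: the "next frame"
    # is a noisy copy of the last observed frame.
    return history[-1] + 0.01 * random.gauss(0, 1)

def inverse_dynamics(frame_t, frame_t1):
    # Stand-in for the inverse dynamics model (IDM): recover the
    # "action" that explains the transition between two frames.
    return frame_t1 - frame_t

def factored_policy(history):
    # Policy = autoregressive video prediction followed by an IDM.
    # DreamZero trains both halves jointly, end to end, rather than
    # chaining two separately trained modules as this sketch does.
    next_frame = predict_next_frame(history)
    action = inverse_dynamics(history[-1], next_frame)
    return next_frame, action

frame, action = factored_policy([0.0])
```

The point of the equivalence is that the action is whatever explains the predicted transition; joint training keeps the two halves consistent by construction.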

Autoregressive + KV caching: Video is predicted in chunks. Each chunk conditions on clean previous chunks via KV caching, enabling fast inference. In the closed-loop robot setting, ground-truth camera observations replace predicted frames in the KV cache after each action chunk executes — eliminating the compounding error problem that normally plagues autoregressive video generation.
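A minimal sketch of that closed-loop correction, with scalar stand-ins for frames. The bias constant, the dynamics, and all names here are invented for illustration:

```python
def true_observation(state, action):
    # Ground-truth dynamics, as seen by the robot's cameras.
    return state + action

def predict_chunk(cache):
    # Stand-in for chunked autoregressive prediction conditioned on
    # the KV cache: the predicted frame carries a small systematic
    # error (0.02) that would compound if fed back into the model.
    last_frame = cache[-1]
    predicted_frame = last_frame + 0.1 + 0.02
    action = 0.1
    return predicted_frame, action

state = 0.0
cache = [state]
for _ in range(20):
    predicted_frame, action = predict_chunk(cache)
    state = true_observation(state, action)
    # Key step: after the chunk executes, the *observed* frame
    # replaces the predicted one in the cache, so the per-chunk
    # model bias never accumulates across chunks.
    cache.append(state)
```

Because the observed frame, not the biased prediction, lands in the cache, the toy's 0.02-per-chunk bias never accumulates; caching the predictions instead would drift by 0.4 over the same 20 chunks.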

The 38× speedup stack:

  • DreamZero-Flash: Decoupled video and action denoising schedules; DiT caching reuses velocity predictions when cosine similarity between steps exceeds a threshold, reducing effective steps from 16 to 4.
  • CFG Parallelism: Classifier-free guidance's two forward passes distributed across two GPUs, cutting per-step latency by 47%.
  • Torch Compile + CUDA Graphs: Fused operators, eliminated CPU overhead.
  • NVFP4 quantization: Weights and activations quantized to NVFP4 on Blackwell architecture, with QKV and Softmax kept at FP8.

Training: Flow-matching objective with teacher forcing. Shared denoising timestep between video and action modalities for faster convergence. Trained on approximately 500 hours of real-world robot data (AgiBot G1 pretraining) — heterogeneous behavioral data rather than repeated task demonstrations.
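The DiT-caching idea from the speedup stack can be sketched with a toy Euler sampler. The constant velocity field, the threshold value, and the reuse rule are illustrative assumptions, not the paper's actual schedule:

```python
import math

def cos_sim(a, b):
    # Cosine similarity between two velocity vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb + 1e-12)

def velocity(x):
    # Stand-in for one forward pass of the 14B DiT: here, a simple
    # velocity field pointing toward the origin.
    return [-xi for xi in x]

def cached_denoise(x, steps=16, threshold=0.99):
    # Euler sampler with velocity caching: once two consecutive
    # *computed* velocities are nearly parallel, reuse the cached
    # velocity and skip further (expensive) network calls.
    dt = 1.0 / steps
    computed = []   # velocities actually produced by the network
    calls = 0
    for _ in range(steps):
        if len(computed) >= 2 and cos_sim(computed[-1], computed[-2]) > threshold:
            v = computed[-1]          # cache hit: no DiT forward pass
        else:
            v = velocity(x)
            calls += 1
            computed.append(v)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x, calls

x_final, calls = cached_denoise([1.0, 2.0, 3.0])
```

With this toy field the network is called only twice in 16 steps; the paper reports an analogous reduction, from 16 effective steps to 4, on real velocity predictions.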

The VLA vs. WAM Distinction

VLAs (Vision-Language-Action models): Built on top of VLMs (e.g., RT-2, pi0, GR00T-N1, Gemini Robotics). Inherit semantic knowledge but are pretrained on static image-text data. Excel at object and semantic generalization. Fail at novel physical motions because they lack spatiotemporal physical dynamics priors.

WAMs (World Action Models): Built on video diffusion backbones. Inherit spatiotemporal physical dynamics from web-scale video pretraining. Learn from every consecutive frame pair in the training data, not just episode-level demonstrations. The key insight: video is a dense representation of how the physical world evolves — it encodes geometry, dynamics, and motor control in a way static images cannot.

Why "World Action Models" and not "Video Action Models": The paper coins WAM deliberately — video is one possible world modeling objective, but future WAMs may align actions with tactile sensing, force feedback, or learned latent representations. The name reflects the broader paradigm, not just the current implementation.

Related Work and Context

Prior WAMs this paper builds on and distinguishes itself from:

  • UniSim / IRASim — video generation for robot data synthesis
  • Dreamitate (Pai et al., 2025) — joint video and action prediction, focused on repeated demonstrations
  • CogAct (Kim et al., 2026) — WAM built from a pretrained video backbone
  • NaVILA (Liang et al., 2025) — video-grounded navigation policies

DreamZero distinguishes itself by: (1) systematic exploration of data diversity over repeated demonstrations, (2) an autoregressive architecture for long-horizon modeling and KV cache efficiency, (3) state-of-the-art generalization across both novel tasks and environments, and (4) state-of-the-art cross-embodiment transfer, including few-shot embodiment adaptation.

Baselines compared: GR00T-N1 (NVIDIA's prior robotics model), pi0 (Physical Intelligence), OpenVLA, and, as an ablation, DreamZero trained from scratch without the video pretraining backbone.

Training data: the DROID dataset (a Stanford/Berkeley open-source robotics dataset, one of the most heterogeneous available) for the public reproduction experiments; proprietary AgiBot G1 data for the main pretraining results.

Key Researchers
  • Seonghyeon Ye (lead author) — NVIDIA Research, previously at KAIST. Works on robot learning and embodied AI.
  • Yuke Zhu (project lead) — UT Austin / NVIDIA Research. Robotics and embodied AI; co-creator of RoboSuite.
  • Linxi "Jim" Fan (project lead) — NVIDIA Research. Co-creator of MineDojo, Voyager, and GR00T. Work spans simulation-to-real transfer and open-world AI agents.
  • Joel Jang (project lead) — NVIDIA Research. Language-conditioned robot learning.
  • Scott Reed — NVIDIA Research, formerly Google DeepMind. Co-creator of DALL-E, Gato, and Flamingo.
  • Yilun Du — Harvard / MIT. Diffusion models for planning and robot learning; co-author of the foundational Diffusion Policy work.
  • Danfei Xu — Georgia Tech. Created and maintains the DROID dataset.
  • Jan Kautz — VP of Learning and Perception Research at NVIDIA.
  • Yevgen Chebotar — NVIDIA Research, formerly Google. Reinforcement learning and dexterous manipulation.
  • Johan Bjorck — NVIDIA Research. Previously published GR00T-N1, NVIDIA's prior robot foundation model.
Benchmarks and Evaluation Infrastructure
  • RoboArena — NVIDIA's real-world robot evaluation framework for AgiBot G1. Used for the main pretraining and cross-embodiment evaluations: 80 rollouts per checkpoint across 4 robots in different environments.
  • DROID — Public open-source robotics dataset for reproducible evaluation on a Franka robot arm. 20 seen tasks + 20 unseen-verb tasks.
  • PolaRiS — Simulation benchmark (being open-sourced with this paper).
  • Genie Sim 3.0 — 100-task simulation benchmark. DreamZero achieves non-trivial performance without any simulation training data (trained only on ~500 hours of real-world data).
The Broader Picture: Why This Paradigm Matters

The thesis of DreamZero can be stated simply: policy performance is fundamentally tied to video generation quality. If this holds, then the path to better robot policies is the same as the path to better video generation — and hundreds of billions of dollars are flowing into video generation for entertainment, synthetic data, and film production. Every advance in video AI potentially advances robot AI.

This is a fundamentally different answer from "collect more robot demonstrations" or "scale up the VLM backbone." It connects the robotics scaling curve to the video generation scaling curve, and the video scaling curve is extremely steep right now.

The open-source release of DreamZero weights and benchmarks makes this accessible to the broader research community — not just labs with massive teleoperation infrastructure.

Further Reading
  • pi0: A Vision-Language-Action Flow Model for General Robot Control (Black et al., 2024) — Physical Intelligence's VLA, one of the main baselines.
  • GR00T-N1: An Open Foundation Model for Generalist Humanoid Robots (Bjorck et al., 2025) — NVIDIA's previous robot foundation model, also a baseline.
  • Gemini Robotics: Bringing AI into the Physical World (Gemini Robotics Team, 2025) — Google's VLA, referenced throughout for comparison.
  • DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset — The public dataset used for reproducible evaluation.
  • WAN: Open and Advanced Large-Scale Video Generative Models (Team Wan, 2025) — The video diffusion backbone DreamZero is initialized from.
  • OpenVLA: An Open-Source Vision-Language-Action Model — Open-source VLA baseline.
  • Dream to Control: Learning Behaviors by Latent Imagination (Hafner et al., 2019) — Foundational work on latent world models for RL; DreamZero extends this tradition but uses video rather than learned latent spaces.
Daily Tech Feed: From the Labs is available on Apple Podcasts, Spotify, and wherever fine podcasts are distributed. Visit us at pod.c457.org for all our shows. New episodes daily.

This episode was researched and written with AI assistance. Technical claims have been verified against the primary paper.
