
A single-paper deep dive into DreamZero, a 14B World Action Model from NVIDIA that jointly predicts future video states and robot actions — enabling zero-shot generalization to unseen tasks, cross-embodiment transfer from 10–20 minutes of video, and real-time closed-loop control at 7Hz. This paper proposes a new paradigm: rather than grounding robot policies in language models, ground them in video generation. Physics lives in video. And video generation keeps getting better.
Why it matters. "World Action Models are Zero-shot Policies" introduces DreamZero, a World Action Model (WAM) built on a pretrained video diffusion backbone. Unlike Vision-Language-Action (VLA) models, which inherit semantic knowledge from image-text datasets but lack physical dynamics, WAMs learn spatiotemporal physics from video and jointly generate future frames and motor actions. DreamZero achieves over 2× improvement on generalization benchmarks versus state-of-the-art VLAs, 39.5% average task progress on completely unseen tasks (vs. near-zero for VLAs), 42% relative improvement from video-only cross-embodiment data, and new-embodiment adaptation with only 30 minutes of play data.
Project page: dreamzero0.github.io — includes a living gallery of 100+ zero-shot tasks the robot discovered through interactive prompting: fanning burgers, pressing elevator buttons, playing xylophone, shaking tambourine, watering plants, and more.
Open-source release: github.com/dreamzero0/dreamzero — model weights, inference code, and code to run three public benchmarks: DROID, PolaRiS, and Genie Sim 3.0 (100 simulation tasks).
arXiv: arXiv:2602.15922 — submitted February 17, 2026. HTML version also available.
DreamZero is a 14B autoregressive diffusion transformer (DiT) initialized from the WAN image-to-video model (Team Wan, 2025), one of the strongest open video generation systems. The backbone was pretrained on web-scale video data, encoding spatiotemporal physical dynamics that VLMs trained on static image-text pairs lack.
Joint prediction: Given language instruction, visual observation history, and proprioceptive state, DreamZero simultaneously generates future video frames and motor actions. This is mathematically equivalent to factoring the policy into autoregressive video prediction and an inverse dynamics model (IDM) — but training end-to-end achieves tighter video-action alignment than training them separately.
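The equivalence claimed above can be written out as a marginalization over generated futures (a sketch in my own notation, not the paper's):

```latex
% Notation is illustrative: o = observation history, \ell = language
% instruction, v = future video frames, a = action chunk.
\pi(a \mid o, \ell)
  \;=\; \int p(a, v \mid o, \ell)\, \mathrm{d}v
  \;=\; \int \underbrace{p(v \mid o, \ell)}_{\text{video prediction}}\;
             \underbrace{p(a \mid v, o)}_{\text{inverse dynamics}}\; \mathrm{d}v .
```

Training the joint distribution end-to-end shares one backbone between the two factors, which is where the tighter video-action alignment comes from.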
Autoregressive + KV caching: Video is predicted in chunks. Each chunk conditions on clean previous chunks via KV caching, enabling fast inference. In the closed-loop robot setting, ground-truth camera observations replace predicted frames in the KV cache after each action chunk executes — eliminating the compounding error problem that normally plagues autoregressive video generation.
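As a toy illustration of the cache-refresh trick described above (`ToyWAM` and `ToyRobot` are stand-ins of my own invention, not the released DreamZero API):

```python
# Toy, self-contained sketch of closed-loop inference with KV-cache refresh.
# The model's frame predictions drift (+1.0 per chunk) while the robot's true
# dynamics advance by +0.5 per chunk; caching observed frames instead of
# predicted ones keeps the conditioning context tied to reality.

class ToyWAM:
    """Stand-in world action model."""
    def init_cache(self, instruction, obs):
        return [obs]                       # cache of clean conditioning frames

    def generate_chunk(self, cache):
        pred_frame = cache[-1] + 1.0       # imperfect video prediction
        action_chunk = [0.0] * 4           # dummy 4-step action chunk
        return pred_frame, action_chunk

    def write_kv(self, cache, obs):
        cache.append(obs)                  # store the *observed* frame, not pred

class ToyRobot:
    """Stand-in hardware interface with simple true dynamics."""
    def __init__(self):
        self.state = 0.0

    def observe(self):
        return self.state

    def execute(self, action_chunk):
        self.state += 0.5                  # ground-truth dynamics per chunk

def closed_loop_control(model, robot, instruction, horizon_chunks=4):
    cache = model.init_cache(instruction, robot.observe())
    for _ in range(horizon_chunks):
        _pred_frame, actions = model.generate_chunk(cache)
        robot.execute(actions)
        # Key step: ground-truth observation replaces the predicted frame in
        # the cache, so autoregressive errors never compound.
        model.write_kv(cache, robot.observe())
    return cache
```

Running this, the cache tracks the robot's true 0.5-per-chunk trajectory rather than the model's drifting +1.0 predictions.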
The 38× speedup stack:
Training: Flow-matching objective with teacher forcing. Shared denoising timestep between video and action modalities for faster convergence. Trained on approximately 500 hours of real-world robot data (AgiBot G1 pretraining) — heterogeneous behavioral data rather than repeated task demonstrations.
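A minimal sketch of a flow-matching step with the shared-timestep detail, using NumPy and invented shapes (this illustrates the objective in general, not DreamZero's actual training code):

```python
import numpy as np

def flow_matching_loss(model, video, actions, rng):
    """Toy flow-matching loss with one timestep shared across modalities.

    video:   (B, T, D_v) clean video latents (shapes are illustrative)
    actions: (B, T, D_a) clean action chunks
    model:   callable (x_video, x_actions, t) -> (pred_v, pred_a)
    """
    B = video.shape[0]
    t = rng.random((B, 1, 1))                  # ONE timestep per sample, shared
    noise_v = rng.standard_normal(video.shape)
    noise_a = rng.standard_normal(actions.shape)
    # Linear path from pure noise (t=0) to clean data (t=1):
    x_v = (1 - t) * noise_v + t * video
    x_a = (1 - t) * noise_a + t * actions
    # The regression target is the path's constant velocity, data - noise:
    pred_v, pred_a = model(x_v, x_a, t)        # joint video+action denoiser
    return np.mean((pred_v - (video - noise_v)) ** 2) \
         + np.mean((pred_a - (actions - noise_a)) ** 2)
```

Sharing `t` between the video and action branches couples their noise levels, so the network always denoises both modalities at a consistent point along the path.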
VLAs (Vision-Language-Action models): Built on top of VLMs (e.g., RT-2, pi0, GR00T-N1, Gemini Robotics). Inherit semantic knowledge but are pretrained on static image-text data. Excel at object and semantic generalization. Fail at novel physical motions because they lack spatiotemporal physical dynamics priors.
WAMs (World Action Models): Built on video diffusion backbones. Inherit spatiotemporal physical dynamics from web-scale video pretraining. Learn from every consecutive frame pair in training data, not just episode-level demonstrations. The key insight: video is a dense representation of how the physical world evolves — it encodes geometry, dynamics, and motor control in a way static images cannot.
Why "World Action Models" not "Video Action Models": The paper coins WAM deliberately — video is one possible world modeling objective, but future WAMs may align actions with tactile sensing, force feedback, or learned latent representations. The name reflects the broader paradigm, not just the current implementation.
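The "every consecutive frame pair" point above is easy to quantify: a T-frame episode supplies T - 1 dynamics training pairs, versus a single episode-level demonstration label. A toy count (my own illustration, not a calculation from the paper):

```python
def supervision_counts(episode_frame_counts):
    """Toy comparison: dense frame-pair supervision vs. episode-level labels."""
    frame_pairs = sum(t - 1 for t in episode_frame_counts)  # (o_t, o_{t+1}) pairs
    episode_labels = len(episode_frame_counts)              # one label per demo
    return frame_pairs, episode_labels

# 100 one-minute episodes at 30 fps:
# supervision_counts([60 * 30] * 100) -> (179900, 100)
```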
Prior WAMs this paper builds on and distinguishes from:
DreamZero distinguishes itself by: (1) systematic exploration of data diversity over repeated demonstrations, (2) autoregressive architecture for long-horizon modeling and KV cache efficiency, (3) state-of-the-art generalization across both novel tasks and environments, and (4) state-of-the-art cross-embodiment transfer including few-shot embodiment adaptation.
Baselines compared: GR00T-N1 (NVIDIA's prior robotics model), pi0 (Physical Intelligence), OpenVLA, and DreamZero trained from scratch (without the video pretraining backbone) as an ablation.
Training data: DROID dataset (Stanford/Berkeley open-source robotics dataset, one of the most heterogeneous available) for the public reproduction experiments. Proprietary AgiBot G1 data for the main pretraining results.
The thesis of DreamZero can be stated simply: policy performance is fundamentally tied to video generation quality. If this holds, the path to better robot policies is the same as the path to better video generation — and hundreds of billions of dollars are flowing into video generation for entertainment, synthetic data, and film production. Every advance in video AI potentially advances robot AI.
This is a fundamentally different answer from "collect more robot demonstrations" or "scale up the VLM backbone." It connects the robotics scaling curve to the video generation scaling curve, and the video scaling curve is extremely steep right now.
The open-source release of DreamZero weights and benchmarks makes this accessible to the broader research community — not just labs with massive teleoperation infrastructure.
Daily Tech Feed: From the Labs is available on Apple Podcasts, Spotify, and wherever fine podcasts are distributed. Visit us at pod.c457.org for all our shows. New episodes daily.
This episode was researched and written with AI assistance. Technical claims have been verified against the primary paper.
By Daily Tech Feed