This episode explores a 2026 paper on learning latent-action world models directly from large-scale, unlabeled "in-the-wild" video, asking whether a model can infer action-like variables without ever seeing true action labels. It explains how world models differ from standard predictive or supervised models by focusing on dynamics and control, and how latent action modeling pairs an inverse dynamics model with a forward model to separate "what changed" between frames from "what happens next." The discussion highlights the core challenge: passive internet video is riddled with confounds (camera motion, edits, other agents, noise), so a latent action can easily collapse into a generic future-information shortcut rather than capturing something genuinely controllable. The episode tackles a major bottleneck in AI, namely that video is abundant while action-labeled data is scarce, and digs into why information bottlenecks such as constrained continuous latents or vector-quantized actions are crucial for learning usable, action-like representations instead of cheating predictors.
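The inverse-dynamics-plus-forward-model split described above can be sketched in a few lines. This is an illustrative toy, not the paper's architecture: every dimension, weight matrix, and function name here is invented for the example, and the vector-quantized bottleneck is reduced to a nearest-neighbor lookup in a small random codebook.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy latent action model (illustrative only). An inverse dynamics model
# (IDM) compresses the *change* between two consecutive observations into a
# small latent action; a forward model predicts the next observation from the
# current one plus that latent. A vector-quantized bottleneck (only K codes
# available) is what keeps the latent from smuggling in the entire future
# frame -- the "cheating predictor" failure mode discussed in the episode.

OBS_DIM, ACT_DIM, K = 8, 2, 4                                # arbitrary toy sizes
W_idm = rng.normal(size=(2 * OBS_DIM, ACT_DIM)) * 0.1        # IDM projection
codebook = rng.normal(size=(K, ACT_DIM))                     # K discrete latent actions
W_fwd = rng.normal(size=(OBS_DIM + ACT_DIM, OBS_DIM)) * 0.1  # forward model

def inverse_dynamics(o_t, o_next):
    """Map an observation pair to the nearest codebook entry (VQ bottleneck)."""
    z = np.concatenate([o_t, o_next]) @ W_idm            # continuous pre-latent
    idx = np.argmin(((codebook - z) ** 2).sum(axis=1))   # nearest-code lookup
    return codebook[idx], idx

def forward_model(o_t, action):
    """Predict the next observation from the current one and a latent action."""
    return np.concatenate([o_t, action]) @ W_fwd

o_t = rng.normal(size=OBS_DIM)
o_next = rng.normal(size=OBS_DIM)
action, code = inverse_dynamics(o_t, o_next)
pred = forward_model(o_t, action)
# Training would minimize ||pred - o_next||^2 jointly over W_idm, W_fwd, and
# the codebook; the K-way bottleneck forces "what changed" to be summarized
# as an action rather than copied wholesale from the future frame.
print("code:", code, "prediction error:", np.square(pred - o_next).mean())
```

In a real system the two linear maps would be deep networks and the codebook would be trained with a VQ-VAE-style commitment loss, but the structural point survives in the toy: the only path from the future frame to the prediction runs through K discrete symbols.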
Sources:
1. Learning Latent Action World Models In The Wild — Quentin Garrido, Tushar Nagarajan, Basile Terver, Nicolas Ballas, Yann LeCun, Michael Rabbat, 2026
http://arxiv.org/abs/2601.05230
2. Unsupervised Learning of Object Landmarks Through Conditional Image Generation — Tomas Jakab, Ankush Gupta, Hakan Bilen, Andrea Vedaldi, 2018
https://scholar.google.com/scholar?q=Unsupervised+Learning+of+Object+Landmarks+Through+Conditional+Image+Generation
3. Unsupervised State Representation Learning with Robotic Priors: A Robustness Benchmark — Timothée Lesort, Mathieu Seurin, Xinrui Li, Natalia Díaz-Rodríguez, David Filliat, 2017
https://scholar.google.com/scholar?q=Unsupervised+State+Representation+Learning+with+Robotic+Priors:+A+Robustness+Benchmark
4. Latent Actions for Learning World Models from Videos — 2022-era latent-action world-model work (e.g., by Menapace and collaborators); exact authors not identified in the source snippet, 2022
https://scholar.google.com/scholar?q=Latent+Actions+for+Learning+World+Models+from+Videos
5. Learning Latent Action World Models In The Wild — Quentin Garrido, Tushar Nagarajan, Basile Terver, Nicolas Ballas, Yann LeCun, Michael Rabbat, 2026
https://scholar.google.com/scholar?q=Learning+Latent+Action+World+Models+In+The+Wild
6. Unsupervised Learning of Video Representations using LSTMs — Nitish Srivastava, Elman Mansimov, Ruslan Salakhutdinov, 2015
https://scholar.google.com/scholar?q=Unsupervised+Learning+of+Video+Representations+using+LSTMs
7. PredNet: Deep Predictive Coding Networks for Video Prediction and Unsupervised Learning — William Lotter, Gabriel Kreiman, David Cox, 2016
https://scholar.google.com/scholar?q=PredNet:+Deep+Predictive+Coding+Networks+for+Video+Prediction+and+Unsupervised+Learning
8. VideoGPT: Video Generation using VQ-VAE and Transformers — Wilson Yan, Yunzhi Zhang, Pieter Abbeel, Aravind Srinivas, 2021
https://scholar.google.com/scholar?q=VideoGPT:+Video+Generation+using+VQ-VAE+and+Transformers
9. Learning Latent Dynamics for Planning from Pixels — Danijar Hafner, Timothy Lillicrap, Ian Fischer, Ruben Villegas, David Ha, Honglak Lee, James Davidson, 2019
https://scholar.google.com/scholar?q=Learning+Latent+Dynamics+for+Planning+from+Pixels
10. World Models — David Ha, Jürgen Schmidhuber, 2018
https://scholar.google.com/scholar?q=World+Models
11. Dream to Control: Learning Behaviors by Latent Imagination — Danijar Hafner, Timothy Lillicrap, Jimmy Ba, Mohammad Norouzi, 2019
https://scholar.google.com/scholar?q=Dream+to+Control:+Learning+Behaviors+by+Latent+Imagination
12. PlaNet: Learning Latent Dynamics for Planning from Pixels — Danijar Hafner, Timothy Lillicrap, Ian Fischer, Ruben Villegas, David Ha, Honglak Lee, James Davidson, 2019
https://scholar.google.com/scholar?q=PlaNet:+Learning+Latent+Dynamics+for+Planning+from+Pixels
13. Learning Latent Plans from Play — Corey Lynch, Mohi Khansari, Ted Xiao, Vikash Kumar, Jonathan Tompson, Sergey Levine, Pierre Sermanet, 2019
https://scholar.google.com/scholar?q=Learning+Latent+Plans+from+Play
14. Visual Behavior Modeling for Robotic Learning from Demonstration — Oleh Rybkin, Kostas Daniilidis, Sergey Levine, Chelsea Finn, 2019
https://scholar.google.com/scholar?q=Visual+Behavior+Modeling+for+Robotic+Learning+from+Demonstration
15. Playable Environments: Video Manipulation in Space and Time — Willi Menapace, Stéphane Lathuilière, Aliaksandr Siarohin, Christian Theobalt, Sergey Tulyakov, Vladislav Golyanik, Elisa Ricci, 2022
https://scholar.google.com/scholar?q=Playable+Environments:+Video+Manipulation+in+Space+and+Time
16. Ego4D: Around the World in 3,000 Hours of Egocentric Video — Kristen Grauman et al., 2022
https://scholar.google.com/scholar?q=Ego4D:+Around+the+World+in+3,000+Hours+of+Egocentric+Video
17. HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips — Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, Josef Sivic, 2019
https://scholar.google.com/scholar?q=HowTo100M:+Learning+a+Text-Video+Embedding+by+Watching+Hundred+Million+Narrated+Video+Clips
18. YT-Temporal-1B: A Benchmark for Long-Range Understanding of Video and Language — Rowan Zellers et al., 2022
https://scholar.google.com/scholar?q=YT-Temporal-1B:+A+Benchmark+for+Long-Range+Understanding+of+Video+and+Language
19. Mastering Diverse Domains through World Models — Danijar Hafner et al., 2023
https://scholar.google.com/scholar?q=Mastering+Diverse+Domains+through+World+Models
20. Learning to Model the World with Language — Jessy Lin, Yuqing Du, Olivia Watkins, Danijar Hafner, Pieter Abbeel, Dan Klein, Anca Dragan, 2024
https://scholar.google.com/scholar?q=Learning+to+Model+the+World+with+Language
21. Latent-action and vision-language-action (VLA) papers cited by the authors (e.g., Bu et al., 2025; Gao et al., 2025; Ye et al., 2025) — various authors, 2025
https://scholar.google.com/scholar?q=Video+Action+Models+/+VLA-related+latent+action+papers+cited+by+the+authors+(e.g.,+Bu+et+al.,+2025;+Gao+et+al.,+2025;+Ye+et+al.,+2025)
22. What Do Latent Action Models Actually Learn? — authors not identified in the source snippet, c. 2024–2025
https://scholar.google.com/scholar?q=What+Do+Latent+Action+Models+Actually+Learn?
23. CLAM: Continuous Latent Action Models for Robot Learning from Unlabeled Demonstrations — authors not identified in the source snippet, c. 2024–2025
https://scholar.google.com/scholar?q=Clam:+Continuous+Latent+Action+Models+for+Robot+Learning+from+Unlabeled+Demonstrations
24. PlaySlot: Learning Inverse Latent Dynamics for Controllable Object-Centric Video Prediction and Planning — authors not identified in the source snippet, c. 2024–2025
https://scholar.google.com/scholar?q=PlaySlot:+Learning+Inverse+Latent+Dynamics+for+Controllable+Object-Centric+Video+Prediction+and+Planning
25. Latent Action Diffusion for Cross-Embodiment Manipulation — authors not identified in the source snippet, c. 2024–2025
https://scholar.google.com/scholar?q=Latent+Action+Diffusion+for+Cross-Embodiment+Manipulation
26. Grounding Actions in Camera Space: Observation-Centric Vision-Language-Action Policy — authors not identified in the source snippet, c. 2024–2025
https://scholar.google.com/scholar?q=Grounding+Actions+in+Camera+Space:+Observation-Centric+Vision-Language-Action+Policy
27. AI Post Transformers: LeWorldModel: Stable Joint-Embedding World Models from Pixels — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-03-25-leworldmodel-stable-joint-embedding-worl-650f9f.mp3
28. AI Post Transformers: LeCun's AMI Energy-Based Models and the Path to Autonomous Intelligence — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/lecuns-ami-energy-based-models-and-the-path-to-autonomous-intelligence/
29. AI Post Transformers: TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-03-25-turboquant-online-vector-quantiz-1967b7.mp3
Interactive Visualization: Learning Latent Action World Models from Video