This episode explores a 2026 paper on learning latent-action world models directly from large-scale, unlabeled "in-the-wild" video, asking whether a model can infer action-like variables without ever seeing true action labels. It explains how world models differ from standard predictive or supervised models by focusing on dynamics and control, and how latent action modeling pairs an inverse dynamics model with a forward model to separate "what changed" between frames from "what happens next." The discussion highlights the core challenge: passive internet video is riddled with confounds (camera motion, edits, other agents, noise), so a latent action can easily collapse into a generic future-information shortcut rather than capturing something genuinely controllable. The episode tackles a major bottleneck in AI, namely that video is abundant while action-labeled data is scarce, and digs into why information bottlenecks such as constrained continuous latents or vector-quantized actions are crucial for learning usable, action-like representations instead of cheating predictors.
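The inverse-dynamics-plus-forward-model split described above can be sketched in a few lines. This is an illustrative toy, not the paper's architecture: every dimension, weight matrix, and function name here is invented for the example, and the vector-quantized bottleneck is reduced to a nearest-neighbor lookup in a small random codebook.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy latent action model (illustrative only). An inverse dynamics model
# (IDM) compresses the *change* between two consecutive observations into a
# small latent action; a forward model predicts the next observation from the
# current one plus that latent. A vector-quantized bottleneck (only K codes
# available) is what keeps the latent from smuggling in the entire future
# frame -- the "cheating predictor" failure mode discussed in the episode.

OBS_DIM, ACT_DIM, K = 8, 2, 4                                # arbitrary toy sizes
W_idm = rng.normal(size=(2 * OBS_DIM, ACT_DIM)) * 0.1        # IDM projection
codebook = rng.normal(size=(K, ACT_DIM))                     # K discrete latent actions
W_fwd = rng.normal(size=(OBS_DIM + ACT_DIM, OBS_DIM)) * 0.1  # forward model

def inverse_dynamics(o_t, o_next):
    """Map an observation pair to the nearest codebook entry (VQ bottleneck)."""
    z = np.concatenate([o_t, o_next]) @ W_idm            # continuous pre-latent
    idx = np.argmin(((codebook - z) ** 2).sum(axis=1))   # nearest-code lookup
    return codebook[idx], idx

def forward_model(o_t, action):
    """Predict the next observation from the current one and a latent action."""
    return np.concatenate([o_t, action]) @ W_fwd

o_t = rng.normal(size=OBS_DIM)
o_next = rng.normal(size=OBS_DIM)
action, code = inverse_dynamics(o_t, o_next)
pred = forward_model(o_t, action)
# Training would minimize ||pred - o_next||^2 jointly over W_idm, W_fwd, and
# the codebook; the K-way bottleneck forces "what changed" to be summarized
# as an action rather than copied wholesale from the future frame.
print("code:", code, "prediction error:", np.square(pred - o_next).mean())
```

In a real system the two linear maps would be deep networks and the codebook would be trained with a VQ-VAE-style commitment loss, but the structural point survives in the toy: the only path from the future frame to the prediction runs through K discrete symbols.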
Sources:
1. Learning Latent Action World Models In The Wild — Quentin Garrido, Tushar Nagarajan, Basile Terver, Nicolas Ballas, Yann LeCun, Michael Rabbat, 2026
http://arxiv.org/abs/2601.05230
2. Unsupervised Learning of Object Landmarks Through Conditional Image Generation — Tomas Jakab, Ankush Gupta, Hakan Bilen, Andrea Vedaldi, 2018
https://scholar.google.com/scholar?q=Unsupervised+Learning+of+Object+Landmarks+Through+Conditional+Image+Generation
3. Unsupervised State Representation Learning with Robotic Priors: A Robustness Benchmark — Timothée Lesort, Mathieu Seurin, Xinrui Li, Natalia Díaz-Rodríguez, David Filliat, 2017
https://scholar.google.com/scholar?q=Unsupervised+State+Representation+Learning+with+Robotic+Priors:+A+Robustness+Benchmark
4. Latent Actions for Learning World Models from Videos — 2022-era latent-action world-model work (e.g., by Menapace and collaborators); exact authors not identified in the source snippet, 2022
https://scholar.google.com/scholar?q=Latent+Actions+for+Learning+World+Models+from+Videos
5. Learning Latent Action World Models In The Wild — Quentin Garrido, Tushar Nagarajan, Basile Terver, Nicolas Ballas, Yann LeCun, Michael Rabbat, 2026
https://scholar.google.com/scholar?q=Learning+Latent+Action+World+Models+In+The+Wild
6. Unsupervised Learning of Video Representations using LSTMs — Nitish Srivastava, Elman Mansimov, Ruslan Salakhutdinov, 2015
https://scholar.google.com/scholar?q=Unsupervised+Learning+of+Video+Representations+using+LSTMs
7. PredNet: Deep Predictive Coding Networks for Video Prediction and Unsupervised Learning — William Lotter, Gabriel Kreiman, David Cox, 2016
https://scholar.google.com/scholar?q=PredNet:+Deep+Predictive+Coding+Networks+for+Video+Prediction+and+Unsupervised+Learning
8. VideoGPT: Video Generation using VQ-VAE and Transformers — Wilson Yan, Yunzhi Zhang, Pieter Abbeel, Aravind Srinivas, 2021
https://scholar.google.com/scholar?q=VideoGPT:+Video+Generation+using+VQ-VAE+and+Transformers
9. Learning Latent Dynamics for Planning from Pixels — Danijar Hafner, Timothy Lillicrap, Ian Fischer, Ruben Villegas, David Ha, Honglak Lee, James Davidson, 2019
https://scholar.google.com/scholar?q=Learning+Latent+Dynamics+for+Planning+from+Pixels
10. World Models — David Ha, Jürgen Schmidhuber, 2018
https://scholar.google.com/scholar?q=World+Models
11. Dream to Control: Learning Behaviors by Latent Imagination — Danijar Hafner, Timothy Lillicrap, Jimmy Ba, Mohammad Norouzi, 2019
https://scholar.google.com/scholar?q=Dream+to+Control:+Learning+Behaviors+by+Latent+Imagination
12. PlaNet: Learning Latent Dynamics for Planning from Pixels — Danijar Hafner, Timothy Lillicrap, Ian Fischer, Ruben Villegas, David Ha, Honglak Lee, James Davidson, 2019
https://scholar.google.com/scholar?q=PlaNet:+Learning+Latent+Dynamics+for+Planning+from+Pixels
13. Learning Latent Plans from Play — Corey Lynch, Mohi Khansari, Ted Xiao, Vikash Kumar, Jonathan Tompson, Sergey Levine, Pierre Sermanet, 2019
https://scholar.google.com/scholar?q=Learning+Latent+Plans+from+Play
14. Visual Behavior Modeling for Robotic Learning from Demonstration — Oleh Rybkin, Kostas Daniilidis, Sergey Levine, Chelsea Finn, 2019
https://scholar.google.com/scholar?q=Visual+Behavior+Modeling+for+Robotic+Learning+from+Demonstration
15. Playable Environments: Video Manipulation in Space and Time — Willi Menapace, Stéphane Lathuilière, Aliaksandr Siarohin, Christian Theobalt, Sergey Tulyakov, Vladislav Golyanik, Elisa Ricci, 2022
https://scholar.google.com/scholar?q=Playable+Environments:+Video+Manipulation+in+Space+and+Time
16. Ego4D: Around the World in 3,000 Hours of Egocentric Video — Kristen Grauman et al., 2022
https://scholar.google.com/scholar?q=Ego4D:+Around+the+World+in+3,000+Hours+of+Egocentric+Video
17. HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips — Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, Josef Sivic, 2019
https://scholar.google.com/scholar?q=HowTo100M:+Learning+a+Text-Video+Embedding+by+Watching+Hundred+Million+Narrated+Video+Clips
18. YT-Temporal-1B: A Benchmark for Long-Range Understanding of Video and Language — Rowan Zellers et al., 2022
https://scholar.google.com/scholar?q=YT-Temporal-1B:+A+Benchmark+for+Long-Range+Understanding+of+Video+and+Language
19. Mastering Diverse Domains through World Models — Danijar Hafner et al., 2023
https://scholar.google.com/scholar?q=Mastering+Diverse+Domains+through+World+Models
20. Learning to Model the World with Language — Jessy Lin, Yuqing Du, Olivia Watkins, Danijar Hafner, Pieter Abbeel, Dan Klein, Anca Dragan, 2024
https://scholar.google.com/scholar?q=Learning+to+Model+the+World+with+Language
21. Latent-action and vision-language-action (VLA) papers cited by the authors (e.g., Bu et al., 2025; Gao et al., 2025; Ye et al., 2025) — various authors, 2025
https://scholar.google.com/scholar?q=Video+Action+Models+/+VLA-related+latent+action+papers+cited+by+the+authors+(e.g.,+Bu+et+al.,+2025;+Gao+et+al.,+2025;+Ye+et+al.,+2025)
22. What Do Latent Action Models Actually Learn? — authors not identified in the source snippet, c. 2024–2025
https://scholar.google.com/scholar?q=What+Do+Latent+Action+Models+Actually+Learn?
23. CLAM: Continuous Latent Action Models for Robot Learning from Unlabeled Demonstrations — authors not identified in the source snippet, c. 2024–2025
https://scholar.google.com/scholar?q=Clam:+Continuous+Latent+Action+Models+for+Robot+Learning+from+Unlabeled+Demonstrations
24. PlaySlot: Learning Inverse Latent Dynamics for Controllable Object-Centric Video Prediction and Planning — authors not identified in the source snippet, c. 2024–2025
https://scholar.google.com/scholar?q=PlaySlot:+Learning+Inverse+Latent+Dynamics+for+Controllable+Object-Centric+Video+Prediction+and+Planning
25. Latent Action Diffusion for Cross-Embodiment Manipulation — authors not identified in the source snippet, c. 2024–2025
https://scholar.google.com/scholar?q=Latent+Action+Diffusion+for+Cross-Embodiment+Manipulation
26. Grounding Actions in Camera Space: Observation-Centric Vision-Language-Action Policy — authors not identified in the source snippet, c. 2024–2025
https://scholar.google.com/scholar?q=Grounding+Actions+in+Camera+Space:+Observation-Centric+Vision-Language-Action+Policy
27. AI Post Transformers: LeWorldModel: Stable Joint-Embedding World Models from Pixels — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-03-25-leworldmodel-stable-joint-embedding-worl-650f9f.mp3
28. AI Post Transformers: LeCun's AMI Energy-Based Models and the Path to Autonomous Intelligence — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/lecuns-ami-energy-based-models-and-the-path-to-autonomous-intelligence/
29. AI Post Transformers: TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-03-25-turboquant-online-vector-quantiz-1967b7.mp3
Interactive Visualization: Learning Latent Action World Models from Video