AI Post Transformers

LeWorldModel: Stable Joint-Embedding World Models from Pixels



In this episode, the hosts examine LeWorldModel, a 2026 paper from researchers across Mila, Université de Montréal, NYU, Samsung SAIL, and Brown that asks whether a joint-embedding predictive architecture can finally be trained end to end from raw pixels without the usual stack of stabilizers. The discussion situates the work in the broader world-model lineage running from Ha and Schmidhuber through Dreamer, and in the JEPA program associated with Yann LeCun’s predictive-learning agenda. The paper’s core claim is unusually narrow and concrete: a small model of roughly 15 million parameters can learn action-conditioned dynamics directly from images using nothing more than next-embedding prediction plus a Gaussian latent regularizer called SIGReg, avoiding the EMA teachers, pretrained encoders, reconstruction losses, and other auxiliary machinery that many related systems rely on.
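The objective described above can be sketched in a few lines. This is an illustrative reconstruction, not the paper's code: the linear "encoder" and "predictor", the shapes, and the moment-matching stand-in for SIGReg are all assumptions, and the paper's actual SIGReg statistic differs. The sketch only shows the shape of the recipe: one encoder applied to both frames (no EMA teacher), a predictor conditioned on the action, and a regularizer pulling latents toward a Gaussian.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins (assumptions, not the paper's architecture): a linear
# "encoder" from flattened 64x64 RGB pixels to a 192-d latent, and a
# linear "predictor" from (latent, action) to the next latent.
DIM, ACT_DIM, PIX = 192, 2, 3 * 64 * 64
W_enc = rng.normal(scale=PIX ** -0.5, size=(PIX, DIM))
W_pred = rng.normal(scale=(DIM + ACT_DIM) ** -0.5, size=(DIM + ACT_DIM, DIM))

def encode(obs):
    """Map a batch of images to one latent vector per frame."""
    return obs.reshape(obs.shape[0], -1) @ W_enc

def predict(z, a):
    """Predict the next latent from the current latent and the action."""
    return np.concatenate([z, a], axis=-1) @ W_pred

def gaussian_reg(z):
    # Moment-matching stand-in for SIGReg: pull each latent dimension
    # toward zero mean and unit variance, which rules out the collapsed
    # solution where every frame maps to the same point. The paper's
    # SIGReg uses a different statistic; this only conveys the idea.
    mean = z.mean(axis=0)
    var = z.var(axis=0)
    return float((mean ** 2).mean() + ((var - 1.0) ** 2).mean())

def jepa_loss(obs_t, action_t, obs_t1, lam=1.0):
    z_t = encode(obs_t)
    z_t1_target = encode(obs_t1)   # same encoder on both sides: no EMA teacher
    z_t1_pred = predict(z_t, action_t)
    pred_loss = float(((z_t1_pred - z_t1_target) ** 2).mean())
    return pred_loss + lam * (gaussian_reg(z_t) + gaussian_reg(z_t1_target))
```

Without the regularizer term, a trained encoder could minimize the prediction loss by mapping every frame to the same vector; the Gaussian constraint is what keeps the end-to-end objective from collapsing.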
The conversation focuses on why that claim matters. The hosts explain that JEPA-style methods are attractive because they predict semantic embeddings rather than reconstructing every pixel, but they have been plagued by representation collapse and fragile training recipes. Most of the technical attention therefore goes to how LeWorldModel keeps the latent space informative while staying simple enough to train jointly on a single GPU in a few hours. They walk through the paper’s framing around offline control and latent-space planning, where forecasting compact future states can make imagined rollouts cheap. They also discuss the project-page claim that LeWorldModel can plan up to roughly 48 times faster than DINO-WM because each frame is compressed to a single 192-dimensional token, while noting that this figure is part of the system’s pitch and should be separated from broader claims about downstream capability.
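The planning-speed argument can be made concrete with a toy random-shooting planner that never touches pixels: candidate action sequences are rolled forward entirely in latent space, so each imagined step costs one small matrix operation on a 192-d token per candidate. The dynamics model, goal representation, and cost here are invented for illustration; the paper's planner and predictor are not reproduced.

```python
import numpy as np

rng = np.random.default_rng(1)
DIM, ACT_DIM = 192, 2

# Hypothetical frozen latent dynamics z' = z + a @ A, a placeholder for
# a learned action-conditioned predictor operating on single-token latents.
A = rng.normal(scale=0.1, size=(ACT_DIM, DIM))

def step(z, a):
    """One imagined step in latent space for a batch of candidates."""
    return z + a @ A

def plan(z0, z_goal, horizon=10, n_candidates=256):
    """Random-shooting planner: sample action sequences, roll them out in
    latent space, and return the first action of the cheapest sequence."""
    actions = rng.normal(size=(n_candidates, horizon, ACT_DIM))
    z = np.repeat(z0[None, :], n_candidates, axis=0)
    for t in range(horizon):
        z = step(z, actions[:, t])            # batched latent rollout
    costs = ((z - z_goal) ** 2).sum(axis=-1)  # distance to goal embedding
    best = int(costs.argmin())
    return actions[best, 0], float(costs[best])
```

The cost of a rollout scales with the latent size, which is the intuition behind the speed comparison: a world model whose per-frame state is one 192-d vector does far less work per imagined step than one that carries a grid of patch tokens.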
The episode also digs into the evidence and its limits. The hosts cover the benchmark results across Two-Room, Reacher, Push-T, and OGBench-Cube, where LeWorldModel appears competitive overall, broadly stronger than PLDM, and better than DINO-WM on Push-T and Reacher, while DINO-WM still looks stronger on the more visually complex OGBench-Cube setting, likely because richer pretrained visual priors still help there. They also discuss the paper’s and project page’s attempts to show “physical understanding” through latent probing, decoded latent visualizations, and surprise-style tests for implausible events. Throughout, the discussion stays skeptical about the gap between paper evidence and project-page marketing: the system looks like a real simplification of JEPA world modeling, but not yet a final verdict that minimal end-to-end predictive learning has solved robust visual control.
Sources:
1. LeWorldModel: Stable End-to-End Joint-Embedding Predictive Architecture from Pixels — Lucas Maes, Quentin Le Lidec, Damien Scieur, Yann LeCun, Randall Balestriero, 2026
http://arxiv.org/abs/2603.19312v1
2. World Models — David Ha, Jürgen Schmidhuber, 2018
https://scholar.google.com/scholar?q=World+Models
3. Learning Latent Dynamics for Planning from Pixels (PlaNet) — Danijar Hafner, Timothy Lillicrap, Ian Fischer, Ruben Villegas, David Ha, Honglak Lee, James Davidson, 2018
https://scholar.google.com/scholar?q=Learning+Latent+Dynamics+for+Planning+from+Pixels
4. Dream to Control: Learning Behaviors by Latent Imagination — Danijar Hafner, Timothy Lillicrap, Jimmy Ba, Mohammad Norouzi, 2019
https://scholar.google.com/scholar?q=Dream+to+Control:+Learning+Behaviors+by+Latent+Imagination
5. TD-MPC2: Scalable, Robust World Models for Continuous Control — Nicklas Hansen, Hao Su, Xiaolong Wang, 2024
https://scholar.google.com/scholar?q=TD-MPC2:+Scalable,+Robust+World+Models+for+Continuous+Control
6. A Path Towards Autonomous Machine Intelligence — Yann LeCun, 2022
https://scholar.google.com/scholar?q=A+Path+Towards+Autonomous+Machine+Intelligence
7. Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture — Mahmoud Assran, Quentin Duval, Ishan Misra, Piotr Bojanowski, Pascal Vincent, Michael Rabbat, Yann LeCun, Nicolas Ballas, 2023
https://scholar.google.com/scholar?q=Self-Supervised+Learning+from+Images+with+a+Joint-Embedding+Predictive+Architecture
8. Revisiting Feature Prediction for Learning Visual Representations from Video — Adrien Bardes, Quentin Garrido, Jean Ponce, Xinlei Chen, Michael Rabbat, Yann LeCun, Mahmoud Assran, Nicolas Ballas, 2024
https://scholar.google.com/scholar?q=Revisiting+Feature+Prediction+for+Learning+Visual+Representations+from+Video
9. Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning — Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec and others, 2020
https://scholar.google.com/scholar?q=Bootstrap+Your+Own+Latent:+A+New+Approach+to+Self-Supervised+Learning
10. Exploring Simple Siamese Representation Learning — Xinlei Chen, Kaiming He, 2021
https://scholar.google.com/scholar?q=Exploring+Simple+Siamese+Representation+Learning
11. VICReg: Variance-Invariance-Covariance Regularization for Self-Supervised Learning — Adrien Bardes, Jean Ponce, Yann LeCun, 2021
https://scholar.google.com/scholar?q=VICReg:+Variance-Invariance-Covariance+Regularization+for+Self-Supervised+Learning
12. Learning from Reward-Free Offline Data: A Case for Planning with Latent Dynamics Models — Vlad Sobal, Wancong Zhang, Kyunghyun Cho, Randall Balestriero, Tim G. J. Rudner, Yann LeCun, 2025
https://scholar.google.com/scholar?q=Learning+from+Reward-Free+Offline+Data:+A+Case+for+Planning+with+Latent+Dynamics+Models
13. Human-level Control through Deep Reinforcement Learning — Volodymyr Mnih, Koray Kavukcuoglu, David Silver and others, 2015
https://scholar.google.com/scholar?q=Human-level+Control+through+Deep+Reinforcement+Learning
14. Mastering Diverse Domains through World Models — Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, Timothy Lillicrap, 2023
https://scholar.google.com/scholar?q=Mastering+Diverse+Domains+through+World+Models
15. TD-MPC: Learning to Plan in Latent Space for Visual Control — Nicklas Hansen, Xiaolong Wang, Hao Su, 2022
https://scholar.google.com/scholar?q=TD-MPC:+Learning+to+Plan+in+Latent+Space+for+Visual+Control
16. Rethinking JEPA: Compute-Efficient Video Self-Supervised Learning with Frozen Teachers — authors not captured in source, 2025-2026
https://scholar.google.com/scholar?q=Rethinking+JEPA:+Compute-Efficient+Video+Self-Supervised+Learning+with+Frozen+Teachers
17. Efficient Reinforcement Learning through Adaptively Pretrained Visual Encoder — authors not captured in source
https://scholar.google.com/scholar?q=Efficient+reinforcement+learning+through+adaptively+pretrained+visual+encoder
18. AdaWorld: Learning Adaptable World Models with Latent Actions — authors not captured in source
https://scholar.google.com/scholar?q=Adaworld:+Learning+adaptable+world+models+with+latent+actions
19. AI Post Transformers: Unified Latents (UL): How to Train Your Latents — Hal Turing & Dr. Ada Shannon
https://podcast.do-not-panic.com/episodes/unified-latents-ul-how-to-train-your-latents/
20. AI Post Transformers: Contrastive Behavioral Similarity Embeddings for Generalization in Reinforcement Learning — Hal Turing & Dr. Ada Shannon
https://podcast.do-not-panic.com/episodes/contrastive-behavioral-similarity-embeddings-for-generalization-in-reinforcement/

AI Post Transformers, by mcgrof