AI Post Transformers

LeWorldModel: Stable Joint-Embedding World Models from Pixels



In this episode, the hosts examine LeWorldModel, a 2026 paper from researchers across Mila, Université de Montréal, NYU, Samsung SAIL, and Brown that asks whether a joint-embedding predictive architecture can finally be trained end to end from raw pixels without the usual stack of stabilizers. The discussion situates the work in the broader world-model lineage running from Ha and Schmidhuber through Dreamer, and in the JEPA program associated with Yann LeCun’s predictive-learning agenda. The paper’s core claim is unusually narrow and concrete: a small model of roughly 15 million parameters can learn action-conditioned dynamics directly from images using nothing more than next-embedding prediction plus a Gaussian latent regularizer called SIGReg, avoiding the EMA teachers, pretrained encoders, reconstruction losses, and other auxiliary machinery that many related systems rely on.
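The objective described above can be sketched in a few lines. This is an illustrative reconstruction, not the paper's code: the linear "encoder" and "predictor", the shapes, and the moment-matching stand-in for SIGReg are all assumptions, and the paper's actual SIGReg statistic differs. The sketch only shows the shape of the recipe: one encoder applied to both frames (no EMA teacher), a predictor conditioned on the action, and a regularizer pulling latents toward a Gaussian.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins (assumptions, not the paper's architecture): a linear
# "encoder" from flattened 64x64 RGB pixels to a 192-d latent, and a
# linear "predictor" from (latent, action) to the next latent.
DIM, ACT_DIM, PIX = 192, 2, 3 * 64 * 64
W_enc = rng.normal(scale=PIX ** -0.5, size=(PIX, DIM))
W_pred = rng.normal(scale=(DIM + ACT_DIM) ** -0.5, size=(DIM + ACT_DIM, DIM))

def encode(obs):
    """Map a batch of images to one latent vector per frame."""
    return obs.reshape(obs.shape[0], -1) @ W_enc

def predict(z, a):
    """Predict the next latent from the current latent and the action."""
    return np.concatenate([z, a], axis=-1) @ W_pred

def gaussian_reg(z):
    # Moment-matching stand-in for SIGReg: pull each latent dimension
    # toward zero mean and unit variance, which rules out the collapsed
    # solution where every frame maps to the same point. The paper's
    # SIGReg uses a different statistic; this only conveys the idea.
    mean = z.mean(axis=0)
    var = z.var(axis=0)
    return float((mean ** 2).mean() + ((var - 1.0) ** 2).mean())

def jepa_loss(obs_t, action_t, obs_t1, lam=1.0):
    z_t = encode(obs_t)
    z_t1_target = encode(obs_t1)   # same encoder on both sides: no EMA teacher
    z_t1_pred = predict(z_t, action_t)
    pred_loss = float(((z_t1_pred - z_t1_target) ** 2).mean())
    return pred_loss + lam * (gaussian_reg(z_t) + gaussian_reg(z_t1_target))
```

Without the regularizer term, a trained encoder could minimize the prediction loss by mapping every frame to the same vector; the Gaussian constraint is what keeps the end-to-end objective from collapsing.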
The conversation focuses on why that claim matters. The hosts explain that JEPA-style methods are attractive because they predict semantic embeddings rather than reconstructing every pixel, but they have been plagued by representation collapse and fragile training recipes. Most of the technical attention therefore goes to how LeWorldModel keeps the latent space informative while staying simple enough to train jointly on a single GPU in a few hours. They walk through the paper’s framing around offline control and latent-space planning, where forecasting compact future states can make imagined rollouts cheap. They also discuss the project-page claim that LeWorldModel can plan up to roughly 48 times faster than DINO-WM because each frame is compressed to a single 192-dimensional token, while noting that this figure is part of the system’s pitch and should be separated from broader claims about downstream capability.
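The planning-speed argument can be made concrete with a toy random-shooting planner that never touches pixels: candidate action sequences are rolled forward entirely in latent space, so each imagined step costs one small matrix operation on a 192-d token per candidate. The dynamics model, goal representation, and cost here are invented for illustration; the paper's planner and predictor are not reproduced.

```python
import numpy as np

rng = np.random.default_rng(1)
DIM, ACT_DIM = 192, 2

# Hypothetical frozen latent dynamics z' = z + a @ A, a placeholder for
# a learned action-conditioned predictor operating on single-token latents.
A = rng.normal(scale=0.1, size=(ACT_DIM, DIM))

def step(z, a):
    """One imagined step in latent space for a batch of candidates."""
    return z + a @ A

def plan(z0, z_goal, horizon=10, n_candidates=256):
    """Random-shooting planner: sample action sequences, roll them out in
    latent space, and return the first action of the cheapest sequence."""
    actions = rng.normal(size=(n_candidates, horizon, ACT_DIM))
    z = np.repeat(z0[None, :], n_candidates, axis=0)
    for t in range(horizon):
        z = step(z, actions[:, t])            # batched latent rollout
    costs = ((z - z_goal) ** 2).sum(axis=-1)  # distance to goal embedding
    best = int(costs.argmin())
    return actions[best, 0], float(costs[best])
```

The cost of a rollout scales with the latent size, which is the intuition behind the speed comparison: a world model whose per-frame state is one 192-d vector does far less work per imagined step than one that carries a grid of patch tokens.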
The episode also digs into the evidence and its limits. The hosts cover the benchmark results across Two-Room, Reacher, Push-T, and OGBench-Cube, where LeWorldModel appears competitive overall, broadly stronger than PLDM, and better than DINO-WM on Push-T and Reacher, while DINO-WM still looks stronger on the more visually complex OGBench-Cube setting, likely because richer pretrained visual priors still help there. They also discuss the paper’s and project page’s attempts to show “physical understanding” through latent probing, decoded latent visualizations, and surprise-style tests for implausible events. Throughout, the discussion stays skeptical about the gap between paper evidence and project-page marketing: the system looks like a real simplification of JEPA world modeling, but not yet a final verdict that minimal end-to-end predictive learning has solved robust visual control.
Sources:
1. LeWorldModel: Stable End-to-End Joint-Embedding Predictive Architecture from Pixels — Lucas Maes, Quentin Le Lidec, Damien Scieur, Yann LeCun, Randall Balestriero, 2026
http://arxiv.org/abs/2603.19312v1
2. World Models — David Ha, Jürgen Schmidhuber, 2018
https://scholar.google.com/scholar?q=World+Models
3. Learning Latent Dynamics for Planning from Pixels (PlaNet) — Danijar Hafner, Timothy Lillicrap, Ian Fischer, Ruben Villegas, David Ha, Honglak Lee, James Davidson, 2018
https://scholar.google.com/scholar?q=Learning+Latent+Dynamics+for+Planning+from+Pixels
4. Dream to Control: Learning Behaviors by Latent Imagination — Danijar Hafner, Timothy Lillicrap, Jimmy Ba, Mohammad Norouzi, 2019
https://scholar.google.com/scholar?q=Dream+to+Control:+Learning+Behaviors+by+Latent+Imagination
5. TD-MPC2: Scalable, Robust World Models for Continuous Control — Nicklas Hansen, Hao Su, Xiaolong Wang, 2024
https://scholar.google.com/scholar?q=TD-MPC2:+Scalable,+Robust+World+Models+for+Continuous+Control
6. A Path Towards Autonomous Machine Intelligence — Yann LeCun, 2022
https://scholar.google.com/scholar?q=A+Path+Towards+Autonomous+Machine+Intelligence
7. Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture — Mahmoud Assran, Quentin Duval, Ishan Misra, Piotr Bojanowski, Pascal Vincent, Michael Rabbat, Yann LeCun, Nicolas Ballas, 2023
https://scholar.google.com/scholar?q=Self-Supervised+Learning+from+Images+with+a+Joint-Embedding+Predictive+Architecture
8. Revisiting Feature Prediction for Learning Visual Representations from Video — Adrien Bardes, Quentin Garrido, Jean Ponce, Xinlei Chen, Michael Rabbat, Yann LeCun, Mahmoud Assran, Nicolas Ballas, 2024
https://scholar.google.com/scholar?q=Revisiting+Feature+Prediction+for+Learning+Visual+Representations+from+Video
9. Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning — Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec and others, 2020
https://scholar.google.com/scholar?q=Bootstrap+Your+Own+Latent:+A+New+Approach+to+Self-Supervised+Learning
10. Exploring Simple Siamese Representation Learning — Xinlei Chen, Kaiming He, 2021
https://scholar.google.com/scholar?q=Exploring+Simple+Siamese+Representation+Learning
11. VICReg: Variance-Invariance-Covariance Regularization for Self-Supervised Learning — Adrien Bardes, Jean Ponce, Yann LeCun, 2021
https://scholar.google.com/scholar?q=VICReg:+Variance-Invariance-Covariance+Regularization+for+Self-Supervised+Learning
12. Learning from Reward-Free Offline Data: A Case for Planning with Latent Dynamics Models — Vlad Sobal, Wancong Zhang, Kyunghyun Cho, Randall Balestriero, Tim G. J. Rudner, Yann LeCun, 2025
https://scholar.google.com/scholar?q=Learning+from+Reward-Free+Offline+Data:+A+Case+for+Planning+with+Latent+Dynamics+Models
13. Human-level Control through Deep Reinforcement Learning — Volodymyr Mnih, Koray Kavukcuoglu, David Silver and others, 2015
https://scholar.google.com/scholar?q=Human-level+Control+through+Deep+Reinforcement+Learning
14. Mastering Diverse Domains through World Models — Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, Timothy Lillicrap, 2023
https://scholar.google.com/scholar?q=Mastering+Diverse+Domains+through+World+Models
15. TD-MPC: Learning to Plan in Latent Space for Visual Control — Nicklas Hansen, Xiaolong Wang, Hao Su, 2022
https://scholar.google.com/scholar?q=TD-MPC:+Learning+to+Plan+in+Latent+Space+for+Visual+Control
16. Rethinking JEPA: Compute-Efficient Video Self-Supervised Learning with Frozen Teachers — authors not captured in source, 2025-2026
https://scholar.google.com/scholar?q=Rethinking+JEPA:+Compute-Efficient+Video+Self-Supervised+Learning+with+Frozen+Teachers
17. Efficient Reinforcement Learning through Adaptively Pretrained Visual Encoder — authors not captured in source
https://scholar.google.com/scholar?q=Efficient+reinforcement+learning+through+adaptively+pretrained+visual+encoder
18. AdaWorld: Learning Adaptable World Models with Latent Actions — authors not captured in source
https://scholar.google.com/scholar?q=Adaworld:+Learning+adaptable+world+models+with+latent+actions
19. AI Post Transformers: Unified Latents (UL): How to Train Your Latents — Hal Turing & Dr. Ada Shannon
https://podcast.do-not-panic.com/episodes/unified-latents-ul-how-to-train-your-latents/
20. AI Post Transformers: Contrastive Behavioral Similarity Embeddings for Generalization in Reinforcement Learning — Hal Turing & Dr. Ada Shannon
https://podcast.do-not-panic.com/episodes/contrastive-behavioral-similarity-embeddings-for-generalization-in-reinforcement/

AI Post Transformers, by mcgrof