
The paper introduces LeWorldModel (LeWM), the first Joint-Embedding Predictive Architecture (JEPA) capable of stable, end-to-end training directly from raw pixels. Existing world models often rely on complex multi-term losses or pre-trained encoders to avoid representation collapse, but LeWM simplifies this process using a streamlined two-term objective.
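The summary does not name the two terms, so the following is only a minimal sketch under a stated assumption: one term scores how well the predictor matches the next-step embedding, and the other discourages collapse by nudging each latent dimension toward zero mean and unit variance. The function names, the moment-matching penalty, and `reg_weight` are all hypothetical stand-ins for illustration, not the paper's actual formulation.

```python
def prediction_loss(predicted, target):
    """Mean squared error between predicted and actual next-step embeddings."""
    total = 0.0
    count = 0
    for p_vec, t_vec in zip(predicted, target):
        for p, t in zip(p_vec, t_vec):
            total += (p - t) ** 2
            count += 1
    return total / count


def anti_collapse_penalty(embeddings):
    """Hypothetical regularizer (an assumption, not the paper's term):
    penalize each latent dimension for deviating from mean 0, variance 1."""
    n = len(embeddings)
    dim = len(embeddings[0])
    penalty = 0.0
    for d in range(dim):
        column = [e[d] for e in embeddings]
        mean = sum(column) / n
        var = sum((x - mean) ** 2 for x in column) / n
        penalty += mean ** 2 + (var - 1.0) ** 2
    return penalty / dim


def two_term_objective(predicted, target, embeddings, reg_weight=1.0):
    """Sum of the prediction term and the weighted anti-collapse term."""
    return (prediction_loss(predicted, target)
            + reg_weight * anti_collapse_penalty(embeddings))


# A collapsed batch (every embedding identical) predicts perfectly
# but is still penalized by the second term.
collapsed = [[0.0, 0.0], [0.0, 0.0]]
spread = [[1.0, -1.0], [-1.0, 1.0]]
collapsed_loss = two_term_objective(collapsed, collapsed, collapsed)
spread_loss = two_term_objective(spread, spread, spread)
```

The point of the toy batches is that a prediction term alone would score both equally, since a constant encoder is trivially predictable. The second term is what separates them, which is the role any anti-collapse regularizer plays in a joint-embedding setup.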
Performance and Evaluation
LeWM was evaluated across diverse 2D and 3D tasks, including navigation and robotic manipulation. It outperformed or remained competitive with state-of-the-art baselines like PLDM and DINO-WM while offering superior training stability and faster planning speeds. Additionally, the researchers observed that, without any explicit regularization, latent trajectories in LeWM naturally become "straighter" over time, a phenomenon linked to improved temporal dynamics.
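The summary does not say how "straightness" is measured. One simple proxy, assumed here purely for illustration, is the mean cosine similarity between consecutive displacement vectors along a latent trajectory: a perfectly straight path scores 1, and sharp turns pull the score down. The function name `straightness` and the sample trajectories are hypothetical.

```python
import math


def straightness(trajectory):
    """Mean cosine similarity between consecutive latent displacements.

    `trajectory` is a list of latent vectors (lists of floats), one per
    time step. This is an illustrative proxy, not the paper's metric.
    """
    # Displacement between each pair of neighboring latent states.
    steps = [
        [b - a for a, b in zip(z0, z1)]
        for z0, z1 in zip(trajectory, trajectory[1:])
    ]
    sims = []
    for u, v in zip(steps, steps[1:]):
        norm_u = math.sqrt(sum(x * x for x in u))
        norm_v = math.sqrt(sum(x * x for x in v))
        if norm_u == 0.0 or norm_v == 0.0:
            continue  # skip stationary steps, which have no direction
        dot = sum(x * y for x, y in zip(u, v))
        sims.append(dot / (norm_u * norm_v))
    if not sims:
        raise ValueError("need at least two non-zero consecutive steps")
    return sum(sims) / len(sims)


# A straight line versus a staircase with right-angle turns.
line = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
staircase = [[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [2.0, 1.0]]
line_score = straightness(line)
staircase_score = straightness(staircase)
```

Tracking a score like this across training checkpoints is one way the reported trend could be checked, since a rising value would indicate trajectories bending less from step to step.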
By Yun Wu