AI Post Transformers

VL-JEPA for Vision-Language Semantic Prediction



This episode explores VL-JEPA, a vision-language model that replaces token-by-token text generation during training with prediction in a semantic embedding space. It explains how the approach differs from both standard autoregressive VLMs and CLIP-style contrastive models: rather than merely aligning images and text, it conditionally predicts the meaning of an answer from visual input plus a query. The discussion highlights the paper's core argument that semantic prediction could reduce wasted computation, especially for streaming video and other latency-sensitive applications, by enabling selective decoding and dynamic inference. It also examines a key point of skepticism: whether the gains come from a fundamentally better objective or from reliance on a particularly strong text-side target embedding space.
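The two ideas above can be sketched in a few lines of toy code. This is a minimal illustration, not the paper's implementation: the encoders are random linear stand-ins for pretrained vision and text towers, and the dimensions, predictor shape, and drift threshold are all hypothetical.

```python
import numpy as np

# Toy sketch of the VL-JEPA idea: instead of generating answer tokens,
# a predictor maps (visual input, query) into a semantic embedding space
# and is trained to match the answer's text embedding. All encoders here
# are random linear stand-ins, not the paper's actual models.
rng = np.random.default_rng(0)
D_VIS, D_TXT, D_SEM = 64, 32, 16               # hypothetical toy dimensions

W_vis = rng.normal(size=(D_VIS, D_SEM))        # frozen "vision tower" stand-in
W_txt = rng.normal(size=(D_TXT, D_SEM))        # frozen text-target tower stand-in
W_pred = rng.normal(size=(2 * D_SEM, D_SEM))   # trainable predictor (one matrix here)

def normalize(x):
    return x / np.linalg.norm(x)

def predict_answer_embedding(image, query):
    """Conditionally predict the answer's semantic embedding from image + query."""
    h = np.concatenate([image @ W_vis, query @ W_txt])
    return h @ W_pred

def semantic_loss(pred, answer_text):
    """1 - cosine similarity against the answer's text embedding; lies in [0, 2]."""
    return 1.0 - float(normalize(pred) @ normalize(answer_text @ W_txt))

def should_decode(new_emb, last_decoded, threshold=0.1):
    """Selective-decoding gate for streaming: run the expensive text decoder
    only when the predicted semantics drift past a (hypothetical) threshold."""
    if last_decoded is None:
        return True
    return 1.0 - float(normalize(new_emb) @ normalize(last_decoded)) > threshold

image = rng.normal(size=D_VIS)
query = rng.normal(size=D_TXT)
answer = rng.normal(size=D_TXT)
pred = predict_answer_embedding(image, query)
loss = semantic_loss(pred, answer)
```

In a streaming-video setting, `should_decode` is where the claimed savings come from: each frame updates the predicted embedding cheaply, and the decoder runs only when the meaning actually changes.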
Sources:
1. VL-JEPA: Joint Embedding Predictive Architecture for Vision-Language — Delong Chen, Mustafa Shukor, Theo Moutakanni, Willy Chung, Jade Yu, Tejaswi Kasarla, Yejin Bang, Allen Bolourchi, Yann LeCun, Pascale Fung, 2025
http://arxiv.org/abs/2512.10942
2. Learning Transferable Visual Models From Natural Language Supervision — Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever, 2021
https://scholar.google.com/scholar?q=Learning+Transferable+Visual+Models+From+Natural+Language+Supervision
3. Sigmoid Loss for Language Image Pre-Training — Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, Lucas Beyer, 2023
https://scholar.google.com/scholar?q=Sigmoid+Loss+for+Language+Image+Pre-Training
4. Flamingo: a Visual Language Model for Few-Shot Learning — Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Kelsey FitzGerald, et al., 2022
https://scholar.google.com/scholar?q=Flamingo:+a+Visual+Language+Model+for+Few-Shot+Learning
5. Joint Embedding Predictive Architectures from Self-Supervised Learning to World Models — Yann LeCun, 2024
https://scholar.google.com/scholar?q=Joint+Embedding+Predictive+Architectures+from+Self-Supervised+Learning+to+World+Models
6. SigLIP 2 — Michael Tschannen et al., 2025
https://scholar.google.com/scholar?q=SigLIP+2
7. Perception Encoder — Daniel Bolya et al., 2025
https://scholar.google.com/scholar?q=Perception+Encoder
8. BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models — Junnan Li, Dongxu Li, Silvio Savarese, Steven Hoi, 2023
https://scholar.google.com/scholar?q=BLIP-2:+Bootstrapping+Language-Image+Pre-training+with+Frozen+Image+Encoders+and+Large+Language+Models
9. InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning — Wenliang Dai et al., 2023
https://scholar.google.com/scholar?q=InstructBLIP:+Towards+General-purpose+Vision-Language+Models+with+Instruction+Tuning
10. LLaVA: Visual Instruction Tuning — Haotian Liu et al., 2023
https://scholar.google.com/scholar?q=LLaVA:+Visual+Instruction+Tuning
11. A Path Towards Autonomous Machine Intelligence — Yann LeCun, 2022
https://scholar.google.com/scholar?q=A+Path+Towards+Autonomous+Machine+Intelligence
12. VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks — approx. Fang et al. (attribution uncertain), 2024
https://scholar.google.com/scholar?q=VLM2Vec:+Training+Vision-Language+Models+for+Massive+Multimodal+Embedding+Tasks
13. TSEmbed: Unlocking Task Scaling in Universal Multimodal Embeddings — attribution uncertain, 2024
https://scholar.google.com/scholar?q=TSEmbed:+Unlocking+Task+Scaling+in+Universal+Multimodal+Embeddings
14. Enhancing Compositional Reasoning in CLIP via Reconstruction and Alignment of Text Descriptions — attribution uncertain, 2024
https://scholar.google.com/scholar?q=Enhancing+Compositional+Reasoning+in+CLIP+via+Reconstruction+and+Alignment+of+Text+Descriptions
15. AI Post Transformers: LeWorldModel: Stable Joint-Embedding World Models from Pixels — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-03-25-leworldmodel-stable-joint-embedding-worl-650f9f.mp3
16. AI Post Transformers: Latent Space as a New Computational Paradigm — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-04-05-latent-space-as-a-new-computational-para-810f39.mp3
17. AI Post Transformers: UniVideo: Unified Video Understanding, Generation, and Editing — Hal Turing & Dr. Ada Shannon, 2025
https://podcast.do-not-panic.com/episodes/univideo-unified-video-understanding-generation-and-editing/
18. AI Post Transformers: Simple Self-Distillation for Better Code Generation — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-04-02-simple-self-distillation-for-better-code-cc88e0.mp3
19. AI Post Transformers: Batch-Aware Expert Routing for Faster MoE Decoding — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-04-04-batch-aware-expert-routing-for-faster-mo-683ab6.mp3
Interactive Visualization: VL-JEPA for Vision-Language Semantic Prediction

AI Post Transformers, by mcgrof