AI Post Transformers

VL-JEPA for Vision-Language Semantic Prediction



This episode explores VL-JEPA, a vision-language model that replaces token-by-token text generation during training with prediction in a semantic embedding space. It explains how the approach differs from both standard autoregressive VLMs and CLIP-style contrastive models: rather than merely aligning images and text, it conditionally predicts the meaning of an answer from visual input plus a query. The discussion highlights the paper's core argument that semantic prediction could reduce wasted computation, especially for streaming video and other latency-sensitive applications, by enabling selective decoding and dynamic inference. It also examines a key point of skepticism: whether the gains come from a fundamentally better objective or from reliance on a particularly strong text-side target embedding space.
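The two ideas above can be sketched in a few lines of toy code. This is a minimal illustration, not the paper's implementation: the encoders are random linear stand-ins for pretrained vision and text towers, and the dimensions, predictor shape, and drift threshold are all hypothetical.

```python
import numpy as np

# Toy sketch of the VL-JEPA idea: instead of generating answer tokens,
# a predictor maps (visual input, query) into a semantic embedding space
# and is trained to match the answer's text embedding. All encoders here
# are random linear stand-ins, not the paper's actual models.
rng = np.random.default_rng(0)
D_VIS, D_TXT, D_SEM = 64, 32, 16               # hypothetical toy dimensions

W_vis = rng.normal(size=(D_VIS, D_SEM))        # frozen "vision tower" stand-in
W_txt = rng.normal(size=(D_TXT, D_SEM))        # frozen text-target tower stand-in
W_pred = rng.normal(size=(2 * D_SEM, D_SEM))   # trainable predictor (one matrix here)

def normalize(x):
    return x / np.linalg.norm(x)

def predict_answer_embedding(image, query):
    """Conditionally predict the answer's semantic embedding from image + query."""
    h = np.concatenate([image @ W_vis, query @ W_txt])
    return h @ W_pred

def semantic_loss(pred, answer_text):
    """1 - cosine similarity against the answer's text embedding; lies in [0, 2]."""
    return 1.0 - float(normalize(pred) @ normalize(answer_text @ W_txt))

def should_decode(new_emb, last_decoded, threshold=0.1):
    """Selective-decoding gate for streaming: run the expensive text decoder
    only when the predicted semantics drift past a (hypothetical) threshold."""
    if last_decoded is None:
        return True
    return 1.0 - float(normalize(new_emb) @ normalize(last_decoded)) > threshold

image = rng.normal(size=D_VIS)
query = rng.normal(size=D_TXT)
answer = rng.normal(size=D_TXT)
pred = predict_answer_embedding(image, query)
loss = semantic_loss(pred, answer)
```

In a streaming-video setting, `should_decode` is where the claimed savings come from: each frame updates the predicted embedding cheaply, and the decoder runs only when the meaning actually changes.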
Sources:
1. VL-JEPA: Joint Embedding Predictive Architecture for Vision-Language — Delong Chen, Mustafa Shukor, Theo Moutakanni, Willy Chung, Jade Yu, Tejaswi Kasarla, Yejin Bang, Allen Bolourchi, Yann LeCun, Pascale Fung, 2025
http://arxiv.org/abs/2512.10942
2. Learning Transferable Visual Models From Natural Language Supervision — Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever, 2021
https://scholar.google.com/scholar?q=Learning+Transferable+Visual+Models+From+Natural+Language+Supervision
3. Sigmoid Loss for Language Image Pre-Training — Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, Lucas Beyer, 2023
https://scholar.google.com/scholar?q=Sigmoid+Loss+for+Language+Image+Pre-Training
4. Flamingo: a Visual Language Model for Few-Shot Learning — Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Kelsey FitzGerald, et al., 2022
https://scholar.google.com/scholar?q=Flamingo:+a+Visual+Language+Model+for+Few-Shot+Learning
5. Joint Embedding Predictive Architectures from Self-Supervised Learning to World Models — Yann LeCun, 2024
https://scholar.google.com/scholar?q=Joint+Embedding+Predictive+Architectures+from+Self-Supervised+Learning+to+World+Models
6. SigLIP 2 — Michael Tschannen et al., 2025
https://scholar.google.com/scholar?q=SigLIP+2
7. Perception Encoder — Daniel Bolya et al., 2025
https://scholar.google.com/scholar?q=Perception+Encoder
8. BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models — Junnan Li, Dongxu Li, Silvio Savarese, Steven Hoi, 2023
https://scholar.google.com/scholar?q=BLIP-2:+Bootstrapping+Language-Image+Pre-training+with+Frozen+Image+Encoders+and+Large+Language+Models
9. InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning — Wenliang Dai et al., 2023
https://scholar.google.com/scholar?q=InstructBLIP:+Towards+General-purpose+Vision-Language+Models+with+Instruction+Tuning
10. LLaVA: Visual Instruction Tuning — Haotian Liu et al., 2023
https://scholar.google.com/scholar?q=LLaVA:+Visual+Instruction+Tuning
11. A Path Towards Autonomous Machine Intelligence — Yann LeCun, 2022
https://scholar.google.com/scholar?q=A+Path+Towards+Autonomous+Machine+Intelligence
12. VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks — approx. Fang et al. (attribution uncertain), 2024
https://scholar.google.com/scholar?q=VLM2Vec:+Training+Vision-Language+Models+for+Massive+Multimodal+Embedding+Tasks
13. TSEmbed: Unlocking Task Scaling in Universal Multimodal Embeddings — attribution uncertain, 2024
https://scholar.google.com/scholar?q=TSEmbed:+Unlocking+Task+Scaling+in+Universal+Multimodal+Embeddings
14. Enhancing Compositional Reasoning in CLIP via Reconstruction and Alignment of Text Descriptions — attribution uncertain, 2024
https://scholar.google.com/scholar?q=Enhancing+Compositional+Reasoning+in+CLIP+via+Reconstruction+and+Alignment+of+Text+Descriptions
15. AI Post Transformers: LeWorldModel: Stable Joint-Embedding World Models from Pixels — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-03-25-leworldmodel-stable-joint-embedding-worl-650f9f.mp3
16. AI Post Transformers: Latent Space as a New Computational Paradigm — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-04-05-latent-space-as-a-new-computational-para-810f39.mp3
17. AI Post Transformers: UniVideo: Unified Video Understanding, Generation, and Editing — Hal Turing & Dr. Ada Shannon, 2025
https://podcast.do-not-panic.com/episodes/univideo-unified-video-understanding-generation-and-editing/
18. AI Post Transformers: Simple Self-Distillation for Better Code Generation — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-04-02-simple-self-distillation-for-better-code-cc88e0.mp3
19. AI Post Transformers: Batch-Aware Expert Routing for Faster MoE Decoding — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-04-04-batch-aware-expert-routing-for-faster-mo-683ab6.mp3
Interactive Visualization: VL-JEPA for Vision-Language Semantic Prediction

AI Post Transformers, by mcgrof