Daily Tech Feed: From the Labs

Episode 0024: From Shadows to Worlds

Why it matters. Language models can quote the manual on a bicycle and still miss a broken chain. Beyond Language Modeling: An Exploration of Multimodal Pretraining argues that this is structural, not incidental: text is a lossy compression of reality, and models trained only on it master the description of shadows without seeing the objects casting them. The paper runs controlled, from-scratch pretraining experiments using the Transfusion framework — combining next-token prediction for language with diffusion for vision — across text, image-text pairs, video, and action-conditioned video. The result is four concrete design insights for multimodal architecture, delivered without the confound of inherited language pretraining.
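To make the joint objective concrete, here is a minimal, hypothetical PyTorch sketch of a Transfusion-style training step: one shared trunk, a cross-entropy loss on text tokens, and a diffusion-style noise-prediction loss on continuous image latents, summed into a single objective. All class names, sizes, and the simple linear noise schedule are illustrative assumptions, not the paper's implementation, and for brevity the two modalities pass through the trunk separately rather than interleaved in one sequence as the framework does.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyTransfusion(nn.Module):
    """Toy model: one shared trunk, a language head, and a noise-prediction head."""
    def __init__(self, vocab=1000, dim=64, latent_dim=16):
        super().__init__()
        self.text_embed = nn.Embedding(vocab, dim)
        self.latent_in = nn.Linear(latent_dim, dim)   # project image latents to the shared width
        self.time_embed = nn.Linear(1, dim)           # condition on the diffusion timestep
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.trunk = nn.TransformerEncoder(layer, num_layers=2)  # shared weights for both modalities
        self.lm_head = nn.Linear(dim, vocab)          # next-token prediction
        self.noise_head = nn.Linear(dim, latent_dim)  # diffusion-style noise prediction

    def joint_loss(self, tokens, latents, lambda_img=1.0):
        # Language branch: causal next-token cross-entropy.
        T = tokens.size(1)
        causal = torch.full((T, T), float("-inf")).triu(1)
        h_text = self.trunk(self.text_embed(tokens), mask=causal)
        logits = self.lm_head(h_text)
        lm_loss = F.cross_entropy(
            logits[:, :-1].reshape(-1, logits.size(-1)),
            tokens[:, 1:].reshape(-1),
        )
        # Vision branch: corrupt the latents with noise, predict the noise with the same trunk.
        t = torch.rand(latents.size(0), 1, 1)         # timestep in [0, 1), simple linear schedule
        noise = torch.randn_like(latents)
        noisy = (1 - t) * latents + t * noise
        h_img = self.trunk(self.latent_in(noisy) + self.time_embed(t))
        diff_loss = F.mse_loss(self.noise_head(h_img), noise)
        # One objective, one optimizer step, one set of shared parameters.
        return lm_loss + lambda_img * diff_loss

model = TinyTransfusion()
loss = model.joint_loss(torch.randint(0, 1000, (2, 12)), torch.randn(2, 8, 16))
loss.backward()
```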

Meta FAIR and NYU Courant. The paper is a joint effort between the two institutions, with contributions spanning vision, language, and representation learning. The full paper is at arXiv 2603.03276. A project page with additional material is at beyond-llms.github.io, and the paper is also indexed on Hugging Face Papers.

The Researchers. Lead author Shengbang Tong heads a broad collaboration. Senior contributors include Yann LeCun (Turing Award, Chief AI Scientist at Meta, NYU Silver Professor), Saining Xie (NYU Courant / Google DeepMind), and Luke Zettlemoyer (UW and Meta FAIR). Also notable: Rob Fergus (NYU / Meta), Mike Lewis, Marjan Ghazvininejad, and Nicolas Ballas.

Key Technical Concepts. The paper's four findings hang on specific architectural choices. First, the Transfusion framework unifies next-token prediction and diffusion-based generation in a single training objective — no post-hoc adapter bolted onto a frozen language backbone. Second, the Representation Autoencoder (RAE), benchmarked against alternatives including CLIP-style encoders and VAE-based tokenizers, emerges as the optimal unified representation: compact enough for semantic understanding, detailed enough for generation. Third, Mixture of Experts (MoE) enables emergent modality specialization — experts naturally route toward language or visual patterns — providing efficient capacity scaling without forcing a single monolithic path. Fourth, an IsoFLOP analysis reveals vision is far more data-hungry than language: equal compute budget assumptions, standard in language-only scaling, misallocate resources in multimodal settings and produce systems with eloquent speech and weak depth perception.
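The modality-specialization finding is easiest to see in code. Below is a small, hypothetical top-1 mixture-of-experts layer of the kind that finding describes: a learned router sends each token to one expert feed-forward network, so nothing forces text and image tokens through the same path and specialization can emerge from training alone. Names and sizes are illustrative, not taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Top1MoE(nn.Module):
    """Toy top-1 mixture-of-experts feed-forward layer with a learned router."""
    def __init__(self, dim=64, num_experts=4, hidden=128):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)   # scores each token against each expert
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
            for _ in range(num_experts)
        ])

    def forward(self, x):                           # x: (batch, seq, dim)
        flat = x.reshape(-1, x.size(-1))            # routing happens per token
        gates = F.softmax(self.router(flat), dim=-1)
        weight, choice = gates.max(dim=-1)          # top-1 expert index per token
        out = torch.zeros_like(flat)
        for i, expert in enumerate(self.experts):
            picked = choice == i
            if picked.any():
                # Scale by the gate value so the router still receives gradient.
                out[picked] = weight[picked].unsqueeze(1) * expert(flat[picked])
        return out.reshape_as(x), choice.reshape(x.shape[:-1])

moe = Top1MoE()
y, routing = moe(torch.randn(2, 10, 64))   # `routing` records which expert each token used
```

After joint training, tallying the routing choices separately over text positions and image positions is enough to check whether experts have drifted toward one modality, which is the emergent specialization the finding describes.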

Daily Tech Feed: From the Labs is available on Apple Podcasts, Spotify, and wherever fine podcasts are distributed. Visit us at pod.c457.org for all our shows. New episodes daily.
