
Paper: One Model to Learn Them All
Authors: Lukasz Kaiser, Aidan N. Gomez, Noam Shazeer, Ashish Vaswani, et al. (Google Brain)
Published: 2017
Link: arXiv:1706.05137
🧠 What’s This Paper About?
In One Model to Learn Them All, researchers at Google Brain took aim at a tantalizing idea: Could a single model learn how to handle completely different tasks—translation, speech, image recognition—without needing to build a separate architecture for each one?
This wasn’t about building the best model for any single task. It was about creating a general-purpose learner that could competently do many things. Think of it as an early prototype of the “foundation model” mindset, well before that term became popular.
🔍 Key Ideas
* Modality-Specific Encoders: Each data type (text, image, speech) has its own small front-end network, which the paper calls a modality net. Examples are a convolutional stack for images or for audio spectrograms. Each one maps its input into a common representation that feeds a shared computation core.
* Unified Model Core: Once inputs are transformed into embeddings, a single shared body processes them, combining convolutional blocks, attention layers, and mixture-of-experts (MoE) layers.
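* Task Specialization via Command Tokens: The shared core is the same for every task. What differs is which modality-specific network handles the input and output and, per the paper, a learned command token (such as To-English or To-Parse-Tree) that starts the decoder. This is an early relative of the task prefixes that later models like T5 made standard.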
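To make the "separate front ends, one shared core" idea concrete, here is a minimal sketch in plain Python. It is a toy under stated assumptions, not the paper's architecture: the names `text_net`, `image_net`, `audio_net`, `shared_body`, and `D_MODEL` are hypothetical, the "attention" is just mixing with the sequence mean, and the expert gate is a hard-coded rule rather than a learned one.

```python
import random

D_MODEL = 4  # width of the shared representation (toy value, an assumption)


def _embed(seed):
    """Deterministic pseudo-random vector standing in for a learned embedding."""
    rng = random.Random(seed)
    return [rng.uniform(-1.0, 1.0) for _ in range(D_MODEL)]


def text_net(tokens):
    """Hypothetical text front end: one D_MODEL-wide vector per token."""
    return [_embed(sum(ord(c) for c in tok)) for tok in tokens]


def image_net(pixels):
    """Hypothetical image front end: one vector per pixel row (stand-in for conv features)."""
    return [[sum(row) / len(row)] * D_MODEL for row in pixels]


def audio_net(samples, frame=4):
    """Hypothetical audio front end: one vector per frame, holding the frame's mean energy."""
    frames = [samples[i:i + frame] for i in range(0, len(samples), frame)]
    return [[sum(x * x for x in f) / len(f)] * D_MODEL for f in frames]


# Two toy "experts"; a real MoE layer would use learned feed-forward networks.
EXPERTS = [
    lambda v: [2.0 * x for x in v],
    lambda v: [x + 1.0 for x in v],
]


def shared_body(seq):
    """Shared core: the same code runs no matter which front end produced `seq`."""
    # Crude context mixing: blend each position with the sequence mean.
    mean = [sum(col) / len(seq) for col in zip(*seq)]
    mixed = [[0.5 * x + 0.5 * m for x, m in zip(vec, mean)] for vec in seq]
    # Top-1 routing with a fixed rule (a learned gate is assumed away here).
    out = []
    for vec in mixed:
        expert = EXPERTS[0] if vec[0] >= 0 else EXPERTS[1]
        out.append(expert(vec))
    return out


MODALITY_NETS = {"text": text_net, "image": image_net, "audio": audio_net}


def run(modality, raw):
    """Route raw input through its own front end, then through the shared body."""
    return shared_body(MODALITY_NETS[modality](raw))
```

Calling `run("text", ["one", "model"])` and `run("image", [[0.0, 1.0], [1.0, 1.0]])` both return lists of `D_MODEL`-wide vectors. The point of the design is that only the front end knows what kind of data it was given.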
📊 What Tasks Were Tested?
The model was trained and evaluated on a mix of tasks, including:
* Machine translation
* Speech recognition
* Image classification (ImageNet)
* Image captioning (COCO dataset)
* Parsing
The goal was to test whether shared training helped performance across these diverse domains.
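One way to picture shared training is a loop that picks a task at every step and applies the update to the same set of parameters. The sketch below assumes uniform random task sampling, which is an illustration and not a detail taken from the paper. `task_schedule`, `train_jointly`, and the `train_step` callback are hypothetical names.

```python
import random
from collections import Counter

# Task names follow the list above; the paper's exact dataset splits are not modeled here.
TASKS = [
    "translation",
    "speech_recognition",
    "image_classification",
    "image_captioning",
    "parsing",
]


def task_schedule(num_steps, tasks=TASKS, seed=0):
    """Yield (step, task) pairs, sampling tasks uniformly (an assumption)."""
    rng = random.Random(seed)
    for step in range(num_steps):
        yield step, rng.choice(tasks)


def train_jointly(num_steps, train_step):
    """Run `train_step(task)` once per step and count how often each task was used."""
    counts = Counter()
    for _, task in task_schedule(num_steps):
        train_step(task)  # one update of the shared parameters on a batch from `task`
        counts[task] += 1
    return counts
```

For example, `train_jointly(1000, lambda task: None)` runs the schedule with a no-op update and returns the per-task step counts. Swapping in a weighted sampler would be the natural way to give smaller datasets more or less attention.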
⚙️ Did It Work?
Yes—at least well enough to justify the experiment.
* The model achieved competitive results on many tasks, particularly in data-scarce settings.
* Joint training on multiple tasks didn’t drag performance down—it often helped, particularly for lower-resource tasks like parsing and speech.
* While it wasn’t state-of-the-art, it proved the concept that a single model could effectively handle multiple, very different tasks.
🧠 Why This Paper Still Matters
This paper was a philosophical leap forward. It:
* Set the stage for multimodal and multi-task models
* Influenced later models like T5 and PaLM, which pursue task generalization at scale
* Hinted at the unification strategies we now see in ImageBind, Flamingo, and GPT-4o
And although it didn’t use the canonical Transformer, the architecture shares some DNA through its use of attention mechanisms and sequence modeling—bridging the early days of domain-specific deep learning with today’s general-purpose AI.
🎧 Podcast Note
Today’s podcast episode was produced with the Audio Overview tool in Google NotebookLM. The sources used to create the “notebook” included all of the sources listed below, plus this article. The hosts you hear are AI-generated.
📚 Appendix A: Sources
* Original paper on arXiv (1706.05137)
* ar5iv readable version
* Good Papers summary
* Grainger CS546 Lecture Slides
* Meta’s ImageBind announcement for comparison
* HAI Foundation Model reflection
#onemodeltolearnthemall #googlebrain #30daysofAIpapers #deeplearning #deeplearningwiththewolf #AIfundamentals #AIbasics #machinetranslation #speechrecognition #imagenet #imageclassification #imagecaptioning #parsing
By Diana Wolf Torres