May 02, 2025

The Wolf Reads AI – Day 9: "One Model to Learn Them All"

12 minutes

Paper: One Model to Learn Them All

Authors: Lukasz Kaiser, Aidan N. Gomez, Noam Shazeer, Ashish Vaswani, et al. (Google Brain)

Published: 2017

Link: arXiv:1706.05137

🧠 What’s This Paper About?

In One Model to Learn Them All, researchers at Google Brain took aim at a tantalizing idea: Could a single model learn how to handle completely different tasks—translation, speech, image recognition—without needing to build a separate architecture for each one?

This wasn’t about building the best model for any single task. It was about creating a general-purpose learner that could competently do many things. Think of it as an early prototype of the “foundation model” mindset, well before that term became popular.

🔍 Key Ideas

* Modality-Specific Encoders: Each data type (text, image, speech) has its own small sub-network, which the paper calls a modality net. It converts raw input, such as image pixels or an audio spectrogram, into a common representation that feeds a shared computation core.

* Unified Model Core: Once inputs are mapped into a shared representation, they pass through a single model body that combines three kinds of building blocks: convolutional layers, attention layers, and sparsely gated mixture-of-experts (MoE) layers.

* Task Signals via Command Tokens: The decoder starts each output with a learned command token, such as To-English or To-Parse-Tree, that tells it which task to perform. This anticipates the text prefixes that later models like T5 use to select a task.
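To make the "separate front ends, one shared core" idea concrete, here is a minimal sketch in plain Python. It is not the paper's architecture: the class names (`TextEncoder`, `ImageEncoder`, `SharedCore`), the width `D_MODEL`, and the random weights are all hypothetical stand-ins, and a single linear layer with ReLU takes the place of the real attention, convolution, and MoE blocks. The only point it illustrates is that every modality ends up as a list of same-width vectors, so one set of core weights can process all of them.

```python
import random

D_MODEL = 4  # hypothetical width of the shared representation


def make_matrix(rows, cols, rng):
    """Small random weight matrix stored as a list of rows."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def linear(matrix, vector):
    """Multiply a (rows x cols) matrix by a length-cols vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class TextEncoder:
    """Stand-in text front end: one embedding vector per token id."""

    def __init__(self, vocab_size, rng):
        self.table = make_matrix(vocab_size, D_MODEL, rng)

    def __call__(self, token_ids):
        return [self.table[t] for t in token_ids]


class ImageEncoder:
    """Stand-in image front end: project each fixed-size patch to D_MODEL."""

    def __init__(self, patch_len, rng):
        self.patch_len = patch_len
        self.proj = make_matrix(D_MODEL, patch_len, rng)

    def __call__(self, pixels):
        last_start = len(pixels) - self.patch_len
        patches = [pixels[i:i + self.patch_len]
                   for i in range(0, last_start + 1, self.patch_len)]
        return [linear(self.proj, p) for p in patches]


class SharedCore:
    """One set of weights reused for every modality (linear layer + ReLU)."""

    def __init__(self, rng):
        self.weights = make_matrix(D_MODEL, D_MODEL, rng)

    def __call__(self, vectors):
        return [[max(0.0, h) for h in linear(self.weights, v)] for v in vectors]


rng = random.Random(0)
text_net = TextEncoder(vocab_size=10, rng=rng)
image_net = ImageEncoder(patch_len=3, rng=rng)
core = SharedCore(rng)  # the same object handles both input types

text_out = core(text_net([1, 5, 2]))                          # 3 tokens
image_out = core(image_net([0.1, 0.9, 0.4, 0.7, 0.2, 0.5]))   # 2 patches
print(len(text_out), len(image_out), len(text_out[0]), len(image_out[0]))
# prints: 3 2 4 4
```

Adding a new modality in this toy setup means writing one more front end that emits `D_MODEL`-wide vectors; the core itself does not change.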

📊 What Tasks Were Tested?

The model was trained and evaluated on a mix of tasks, including:

* Machine translation

* Speech recognition

* Image classification (ImageNet)

* Image captioning (COCO dataset)

* Parsing

The goal was to test whether shared training helped performance across these diverse domains.
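One simple way to picture "shared training" is a loop that picks a task at each step and applies the update to the same weights. The sketch below only simulates the scheduling; the uniform random sampling, the `task_schedule` and `train` helpers, and the task identifiers are assumptions for illustration, not the schedule reported in the paper.

```python
import random
from collections import Counter

# Task names mirror the list above; no real data is loaded here.
TASKS = [
    "translation",
    "speech_recognition",
    "image_classification",
    "image_captioning",
    "parsing",
]


def task_schedule(num_steps, seed=0):
    """Yield one task name per training step, sampled uniformly at random."""
    rng = random.Random(seed)
    for _ in range(num_steps):
        yield rng.choice(TASKS)


def train(num_steps):
    """Count how many updates each task would contribute to the shared weights."""
    updates = Counter()
    for task in task_schedule(num_steps):
        # A real loop would fetch a batch for `task`, run the shared model,
        # and apply a gradient step to the same parameters every time.
        updates[task] += 1
    return updates


counts = train(1000)
print(sum(counts.values()))  # prints: 1000
```

Because every step touches the same parameters, a task with little data of its own still benefits from updates driven by the larger tasks, which is the effect the experiment set out to measure.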

⚙️ Did It Work?

Yes—at least well enough to justify the experiment.

* The model achieved competitive results on many tasks, particularly in data-scarce settings.

* Joint training on multiple tasks didn’t drag performance down—it often helped, particularly for lower-resource tasks like parsing and speech.

* While it wasn’t state-of-the-art, it proved the concept that a single model could effectively handle multiple, very different tasks.

🧠 Why This Paper Still Matters

This paper was a philosophical leap forward. It:

* Set the stage for multimodal and multi-task models

* Influenced later models like T5 and PaLM, which pursue task generalization at scale

* Hinted at the unification strategies we now see in ImageBind, Flamingo, and GPT-4o

And although it didn’t use the canonical Transformer, the architecture shares some DNA through its use of attention mechanisms and sequence modeling—bridging the early days of domain-specific deep learning with today’s general-purpose AI.

🎧 Podcast Note

Today’s podcast episode was produced with the Audio Overview tool in Google NotebookLM. The sources used to create the “notebook” included all of the sources listed below, plus this article. The hosts you hear are AI-generated.

📚 Appendix A: Sources

* Original paper on arXiv (1706.05137)

* ar5iv readable version

* Good Papers summary

* Grainger CS546 Lecture Slides

* Meta’s ImageBind announcement for comparison

* HAI Foundation Model reflection

#onemodeltolearnthemall #googlebrain #30daysofAIpapers #deeplearning #deeplearningwiththewolf #AIfundamentals #AIbasics #machinetranslation #speechrecognition #imagenet #imageclassification #imagecaptioning #parsing

This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit dianawolftorres.substack.com


By Diana Wolf Torres
