The paper "Flamingo: a Visual Language Model for Few-Shot Learning" introduces a family of Visual Language Models (VLMs) developed by DeepMind that can rapidly adapt to novel image and video understanding tasks using only a handful of examples.
Here is a short summary of its key contributions and findings:
- Innovative Architecture: Flamingo leverages powerful, pretrained, and frozen vision and language models, which preserves their accumulated knowledge while saving compute. It bridges these models using two novel components: a Perceiver Resampler, which maps a variable number of visual features from the vision encoder to a fixed, small number of visual tokens, and gated cross-attention dense (GATED XATTN-DENSE) layers interleaved between the frozen language-model layers, whose tanh gates are initialized to zero so that training starts from the unmodified language model.
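The tanh-gating idea behind those bridging layers can be sketched in a few lines. This is a minimal NumPy stand-in, not the paper's implementation: `alpha` stands for the learned per-layer gating scalar, and the cross-attention output is represented by a plain vector.

```python
import numpy as np

def gated_xattn_output(lm_hidden, xattn_out, alpha):
    """Tanh-gated residual connection, a sketch of Flamingo's
    GATED XATTN-DENSE idea: the new cross-attention contribution is
    scaled by tanh(alpha), where alpha is a learned scalar initialized
    to 0, so the frozen LM's behaviour is unchanged at initialization."""
    return lm_hidden + np.tanh(alpha) * xattn_out

h = np.ones(4)       # hypothetical frozen-LM hidden state
x = np.full(4, 0.5)  # hypothetical cross-attention output
out_init = gated_xattn_output(h, x, alpha=0.0)   # gate closed: equals h
out_open = gated_xattn_output(h, x, alpha=5.0)   # gate nearly open: ~ h + x
```

Initializing the gate at zero is what lets the new visual pathway be grafted onto the frozen language model without disturbing it, with the gates gradually opening during training.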
- Training on Interleaved Web Data: To endow the models with strong in-context few-shot learning capabilities, Flamingo is trained on large-scale web corpora. Crucially, this includes the MultiModal MassiveWeb (M3W) dataset, which contains arbitrarily interleaved text and images scraped from webpages, alongside traditional image-text and video-text pairs.
- State-of-the-Art Performance: Flamingo sets a new state of the art in few-shot learning across 16 diverse multimodal benchmarks, including open-ended visual question-answering, captioning, and visual dialogue. Remarkably, by simply prompting the model with 32 task-specific examples, a single set of Flamingo weights outperforms fully fine-tuned state-of-the-art models on 6 of these tasks, despite using roughly 1000 times less task-specific training data.
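The few-shot interface described above amounts to assembling an interleaved sequence of support examples followed by a query. The helper below is a hypothetical illustration of that prompt construction; the `<image>` placeholder and the Q/A template are assumptions for the sketch, not the paper's literal vocabulary.

```python
def build_fewshot_prompt(examples, query, image_token="<image>"):
    """Assemble an interleaved image-text few-shot prompt in the spirit
    of Flamingo's interface: each support example contributes an image
    placeholder plus its question and answer, then the query image is
    appended with an open-ended answer slot for the model to complete."""
    parts = [f"{image_token} Q: {q} A: {a}" for q, a in examples]
    parts.append(f"{image_token} Q: {query} A:")
    return " ".join(parts)

prompt = build_fewshot_prompt(
    [("What animal is this?", "a flamingo"),
     ("What colour is it?", "pink")],
    "Where is it standing?",
)
```

Because the task is specified entirely in the prompt, the same frozen weights can be pointed at captioning, VQA, or dialogue just by changing the support examples.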