


The paper "Robust Speech Recognition via Large-Scale Weak Supervision" introduces Whisper, a highly robust and versatile speech processing system developed by researchers at OpenAI.
Instead of relying on small, highly curated datasets or purely unsupervised pre-training, Whisper is trained on 680,000 hours of weakly supervised, multilingual, and multitask audio data collected from the internet. Using a standard encoder-decoder Transformer architecture, a single Whisper model can handle a comprehensive pipeline of speech tasks, including English and multilingual speech recognition, any-to-English speech translation, spoken language identification, and voice activity detection.
The key takeaway from the paper is that scaling up weakly supervised pre-training enables highly effective zero-shot transfer to standard benchmarks without any dataset-specific fine-tuning. As a result, Whisper approaches human-level accuracy and demonstrates exceptional robustness to real-world noise and out-of-distribution data, significantly outperforming prior models, which tend to be brittle when tested outside their specific training distributions, such as LibriSpeech.
By Yun Wu