
This research paper introduces Whisper, a speech recognition system trained on a massive, weakly supervised dataset of 680,000 hours of audio. The paper argues that scaling weakly supervised training has been underappreciated in speech recognition, and that Whisper's robust zero-shot performance demonstrates its ability to generalize well across different domains, languages, and tasks, even surpassing human accuracy in some areas. The authors explore the system's scaling properties with respect to both model size and dataset size, and analyze the impact of multitask and multilingual training. They also discuss Whisper's performance on language identification and its robustness to noise. The paper concludes with a discussion of potential limitations and areas for future work.
By Kenpachi
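
For listeners who want to try the system the paper describes, the authors released Whisper as an open-source Python package (github.com/openai/whisper). Below is a minimal sketch of the zero-shot transcription, language identification, and speech translation behaviors discussed in the episode, assuming the openai-whisper package and ffmpeg are installed; "audio.mp3" is a placeholder filename.

```python
import whisper

# Load one of the pretrained checkpoints; "base" is a small multilingual model.
model = whisper.load_model("base")

# Zero-shot transcription: no fine-tuning needed. The model identifies the
# spoken language and transcribes it in a single call.
result = model.transcribe("audio.mp3")
print("Detected language:", result["language"])
print("Transcript:", result["text"])

# The same checkpoint can translate non-English speech directly into English
# text, one of the multiple tasks covered by its multitask training.
translated = model.transcribe("audio.mp3", task="translate")
print("English translation:", translated["text"])
```

The available checkpoints range from tiny to large, which mirrors the model-size scaling analysis the paper reports: larger checkpoints are slower but generally more accurate and more robust to noise.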