Audio-based automatic speech recognition (ASR) degrades significantly in noisy environments and is particularly vulnerable to interfering speech, because the model cannot determine which speaker to transcribe. Audio-visual speech recognition (AVSR) systems improve robustness by complementing the audio stream with visual information, which is invariant to acoustic noise and helps the model focus on the desired speaker.
2022: Bowen Shi, Wei-Ning Hsu, Abdelrahman Mohamed
https://arxiv.org/pdf/2201.01763v1.pdf