Multi-talker audio-visual speech recognition towards diverse scenarios
Yuxiao LIN, Tao JIN, Xize CHENG, Zhou ZHAO, Fei WU
Eng Inform Technol Electron Eng ›› 2025, Vol. 26 ›› Issue (11): 2310-2323.
Audio-visual speech recognition (AVSR) has recently attracted increasing attention. However, most existing works simplify the challenges of real-world applications and focus only on scenarios with two speakers and perfectly aligned audio-video clips. In this work, we study the effects of speaker number and modal misalignment on the AVSR task, and propose an end-to-end AVSR framework for more realistic conditions. Specifically, we propose a speaker-number-aware mixture-of-experts (SA-MoE) mechanism to explicitly model the characteristic differences among scenarios with different numbers of speakers, and a cross-modal realignment (CMR) module to handle asynchronous inputs robustly. We also exploit the underlying differences in difficulty and introduce a new training strategy named challenge-based curriculum learning (CBCL), which forces the model to focus on difficult data rather than simple data to improve efficiency.
Speech recognition and synthesis / Multi-modal recognition / Curriculum learning / Multi-talker speech recognition
Zhejiang University Press