Multi-talker audio-visual speech recognition towards diverse scenarios

Yuxiao LIN, Tao JIN, Xize CHENG, Zhou ZHAO, Fei WU

Front Inform Technol Electron Eng ›› 2025, Vol. 26 ›› Issue (11): 2310-2323. DOI: 10.1631/FITEE.2500411
Research Article

Abstract

Recently, audio-visual speech recognition (AVSR) has attracted increasing attention. However, most existing works simplify the complex challenges of real-world applications and consider only scenarios with two speakers and perfectly aligned audio-video clips. In this work, we study the effects of speaker number and modal misalignment on the AVSR task, and propose an end-to-end AVSR framework for more realistic conditions. Specifically, we propose a speaker-number-aware mixture-of-experts (SA-MoE) mechanism to explicitly model the characteristic differences among scenarios with different speaker numbers, and a cross-modal realignment (CMR) module for robust handling of asynchronous inputs. We also exploit the underlying differences in sample difficulty and introduce a new training strategy, challenge-based curriculum learning (CBCL), which forces the model to focus on challenging data rather than simple data to improve training efficiency.
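The abstract describes CBCL only at a high level: shift training emphasis toward difficult samples instead of simple ones. A minimal sketch of that general idea follows, assuming per-sample loss as a difficulty proxy and a linear annealing schedule; the function names, the schedule, and the loss proxy are illustrative assumptions, not the authors' actual implementation.

```python
import random

def cbcl_weights(losses, step, total_steps):
    """Interpolate between uniform sampling (early training) and
    loss-proportional sampling (late training).

    NOTE: illustrative only -- difficulty is proxied here by each
    sample's most recent loss, which may differ from the paper's
    definition of "challenging" data.
    """
    alpha = step / total_steps  # 0 -> uniform, 1 -> fully loss-weighted
    return [(1 - alpha) + alpha * loss for loss in losses]

def cbcl_sample(losses, step, total_steps, k, rng=random):
    """Draw k sample indices, increasingly favoring high-loss samples."""
    weights = cbcl_weights(losses, step, total_steps)
    return rng.choices(range(len(losses)), weights=weights, k=k)
```

Early in training every sample is equally likely; by the final step the sampling probability is proportional to each sample's loss, so the model spends most of its updates on the data it still gets wrong.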

Keywords

Speech recognition and synthesis / Multi-modal recognition / Curriculum learning / Multi-talker speech recognition

Cite this article

Yuxiao LIN, Tao JIN, Xize CHENG, Zhou ZHAO, Fei WU. Multi-talker audio-visual speech recognition towards diverse scenarios. Front Inform Technol Electron Eng, 2025, 26(11): 2310-2323. DOI: 10.1631/FITEE.2500411


RIGHTS & PERMISSIONS

Zhejiang University Press
