Audio-visual voice activity detection

Front. Electr. Electron. Eng., 2006, 1(4): 425-430. DOI: 10.1007/s11460-006-0081-5

Abstract

In speech signal processing systems, frame-energy based voice activity detection (VAD) can be disturbed by background noise and by the non-stationary behavior of the frame energy within voice segments. The purpose of this paper is to improve the performance and robustness of VAD by introducing visual information. A data-driven linear transformation is adopted for visual feature extraction, and a general statistical VAD model is designed. Using this general model and the two-stage fusion strategy presented in this paper, a concrete multimodal VAD system is built. Experiments show a 55.0% relative reduction in frame error rate and a 98.5% relative reduction in sentence-breaking error rate for the multimodal VAD compared with frame-energy based audio VAD. The results show that with the multimodal method, sentence-breaking errors are almost eliminated and frame-detection performance is clearly improved, which demonstrates the effectiveness of the visual modality in VAD.
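As a point of reference for the audio-only baseline discussed above, a frame-energy VAD reduces to a per-frame energy threshold against an estimated noise floor. The sketch below is only a minimal illustration of that idea; the frame length, percentile-based noise-floor estimate, threshold, and hangover smoothing are assumptions of this sketch and are not taken from the paper.

```python
import numpy as np

def frame_energy_vad(signal, sample_rate, frame_ms=25, hop_ms=10,
                     threshold_db=12.0, hangover_frames=5):
    """Toy frame-energy VAD: a frame is labeled speech if its log energy
    exceeds an estimated noise floor by `threshold_db`. All parameter
    values here are illustrative assumptions, not values from the paper."""
    signal = np.asarray(signal, dtype=float)
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    n_frames = max(0, 1 + (len(signal) - frame_len) // hop_len)
    if n_frames == 0:
        return np.zeros(0, dtype=bool)

    # Per-frame log energy in dB.
    energies = np.empty(n_frames)
    for i in range(n_frames):
        frame = signal[i * hop_len: i * hop_len + frame_len]
        energies[i] = 10.0 * np.log10(np.sum(frame ** 2) + 1e-12)

    # Crude noise-floor estimate: a low percentile of the frame energies.
    noise_floor = np.percentile(energies, 10)
    decisions = energies > noise_floor + threshold_db

    # Hangover: keep the decision "on" for a few frames after speech ends,
    # which reduces sentence-breaking at weak word endings.
    hang = 0
    for i in range(n_frames):
        if decisions[i]:
            hang = hangover_frames
        elif hang > 0:
            decisions[i] = True
            hang -= 1
    return decisions
```

A fixed energy threshold of this kind degrades as the noise level approaches the speech level and as the frame energy fluctuates within a voice segment, which is exactly the weakness the paper addresses by fusing visual information with the audio decision in a second stage.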

Keywords

speech recognition, voice activity detection, multimodal

Cite this article

Audio-visual voice activity detection. Front. Electr. Electron. Eng., 2006, 1(4): 425-430. DOI: 10.1007/s11460-006-0081-5

