Department of Electronic Engineering, Tsinghua University, Beijing 100084, China
Published: 05 Dec 2006
Issue Date: 05 Dec 2006
Abstract
In speech signal processing systems, frame-energy based voice activity detection (VAD) is susceptible to background noise and to the non-stationary behavior of the frame energy within voice segments. This paper aims to improve the performance and robustness of VAD by introducing visual information. A data-driven linear transformation is adopted for visual feature extraction, and a general statistical VAD model is designed. Using the general model and the two-stage fusion strategy presented in this paper, a concrete multimodal VAD system is built. Experiments show that, compared with frame-energy based audio VAD, multimodal VAD achieves a 55.0% relative reduction in frame error rate and a 98.5% relative reduction in sentence-breaking error rate. With the multimodal method, sentence-breaking errors are almost entirely avoided and frame-detection performance clearly improves, which demonstrates the effectiveness of the visual modality in VAD.
LIU Peng, WANG Zuo-ying. Audio-visual voice activity detection. Front. Electr. Electron. Eng., 2006, 1(4): 425‒430. https://doi.org/10.1007/s11460-006-0081-5
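For readers unfamiliar with the audio baseline named in the abstract, the following is a minimal sketch of a frame-energy VAD and of how a "relative reduction" in error rate is computed. It is not the authors' implementation: the function names, the frame length of 160 samples, the -30 dB threshold, and the example error rates are all illustrative assumptions, and the paper's visual features, statistical model, and fusion strategy are not reproduced here.

```python
import math


def frame_log_energies(samples, frame_len):
    """Return the log energy (dB) of each non-overlapping frame.

    Trailing samples that do not fill a whole frame are ignored.
    """
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        mean_square = sum(x * x for x in frame) / frame_len
        # The small constant avoids log10(0) on digital silence.
        energies.append(10.0 * math.log10(mean_square + 1e-12))
    return energies


def energy_vad(samples, frame_len=160, threshold_db=-30.0):
    """Label a frame as speech (True) when its log energy exceeds a fixed threshold.

    Both defaults are assumed values for illustration, not taken from the paper.
    """
    return [e > threshold_db for e in frame_log_energies(samples, frame_len)]


def relative_reduction(baseline_error, new_error):
    """Relative error reduction: (baseline - new) / baseline."""
    return (baseline_error - new_error) / baseline_error


# Synthetic signal: silence, a 200 Hz tone sampled at 8 kHz, then silence.
silence = [0.0] * 320
tone = [0.5 * math.sin(2 * math.pi * 200 * n / 8000) for n in range(320)]
decisions = energy_vad(silence + tone + silence)

# Hypothetical error rates, chosen only to show the arithmetic.
example = relative_reduction(0.20, 0.09)
```

With a fixed threshold like this, any loud background noise is labeled as speech and quiet stretches inside an utterance are labeled as silence, which is the weakness the abstract attributes to frame-energy VAD.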