Audio-guided self-supervised learning for disentangled visual speech representations

Dalu FENG; Shuang YANG; Shiguang SHAN; Xilin CHEN

doi:10.1007/s11704-024-3787-8

Front. Comput. Sci., 2024, Vol. 18, Issue 6: 186353. DOI: 10.1007/s11704-024-3787-8

Artificial Intelligence

LETTER

Dalu FENG ¹,², Shuang YANG ¹,²,†, Shiguang SHAN ¹,², Xilin CHEN ¹,²

Dalu FENG, Shuang YANG, Shiguang SHAN, Xilin CHEN. Audio-guided self-supervised learning for disentangled visual speech representations. Front. Comput. Sci., 2024, 18(6): 186353. DOI: 10.1007/s11704-024-3787-8

References

[1]	Shi B, Hsu W N, Lakhotia K, Mohamed A. Learning audio-visual speech representation by masked multimodal cluster prediction. In: Proceedings of the 10th International Conference on Learning Representations. 2022

[2]	Hsu W N, Shi B. u-HuBERT: unified mixed-modal speech pretraining and zero-shot transfer to unlabeled modality. In: Proceedings of the 36th International Conference on Neural Information Processing Systems. 2022, 1538

[3]	Stafylakis T, Tzimiropoulos G. Combining residual networks with LSTMs for lipreading. In: Proceedings of the 18th Annual Conference of the International Speech Communication Association. 2017, 3652–3656

[4]	Ma P, Martinez B, Petridis S, Pantic M. Towards practical lipreading with distilled and efficient models. In: Proceedings of 2021 IEEE International Conference on Acoustics, Speech and Signal Processing. 2021, 7608–7612

[5]	Ma P, Wang Y, Shen J, Petridis S, Pantic M. Lip-reading with densely connected temporal convolutional networks. In: Proceedings of 2021 IEEE Winter Conference on Applications of Computer Vision. 2021, 2856–2865

[6]	Koumparoulis A, Potamianos G. Accurate and resource-efficient lipreading with EfficientNetV2 and transformers. In: Proceedings of 2022 IEEE International Conference on Acoustics, Speech and Signal Processing. 2022, 8467–8471

[7]	Ma P, Petridis S, Pantic M. End-to-end audio-visual speech recognition with conformers. In: Proceedings of 2021 IEEE International Conference on Acoustics, Speech and Signal Processing. 2021, 7613–7617

[8]	Ma P, Petridis S, Pantic M. Visual speech recognition for multiple languages in the wild. Nature Machine Intelligence, 2022, 4(11): 930–939

[9]	Ma P, Haliassos A, Fernandez-Lopez A, Chen H, Petridis S, Pantic M. Auto-AVSR: audio-visual speech recognition with automatic labels. In: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing. 2023, 1–5

[10]	Yang Y, Zhuang Y, Pan Y. Multiple knowledge representation for big data artificial intelligence: framework, applications, and case studies. Frontiers of Information Technology & Electronic Engineering, 2021, 22(12): 1551–1558