Audio-guided self-supervised learning for disentangled visual speech representations

Dalu FENG, Shuang YANG, Shiguang SHAN, Xilin CHEN

Front. Comput. Sci., 2024, Vol. 18, Issue (6): 186353

Artificial Intelligence
LETTER

Cite this article

Dalu FENG, Shuang YANG, Shiguang SHAN, Xilin CHEN. Audio-guided self-supervised learning for disentangled visual speech representations. Front. Comput. Sci., 2024, 18(6): 186353 DOI:10.1007/s11704-024-3787-8

RIGHTS & PERMISSIONS

Higher Education Press

Supplementary files

FCS-23787-OF-DF_suppl_1

FCS-23787-OF-DF_suppl_2
