Audio-guided self-supervised learning for disentangled visual speech representations

Dalu FENG, Shuang YANG, Shiguang SHAN, Xilin CHEN

Front. Comput. Sci., 2024, Vol. 18, Issue 6: 186353. DOI: 10.1007/s11704-024-3787-8
Artificial Intelligence
LETTER

Cite this article

Dalu FENG, Shuang YANG, Shiguang SHAN, Xilin CHEN. Audio-guided self-supervised learning for disentangled visual speech representations. Front. Comput. Sci., 2024, 18(6): 186353. https://doi.org/10.1007/s11704-024-3787-8

Acknowledgements

This work was partially supported by the National Natural Science Foundation of China (Grant Nos. 62276247, 62076250). We thank Bingquan Xia for help with the experiments and Yuanhang Zhang for proofreading.

Competing interests

The authors declare that they have no competing interests or financial conflicts to disclose.

Electronic supplementary material

Supplementary material is available in the online version of this article at journal.hep.com.cn and link.springer.com.

Rights and permissions

© 2024 Higher Education Press