Speech-driven facial animation with spectral gathering and temporal attention

Yujin CHAI, Yanlin WENG, Lvdi WANG, Kun ZHOU

Front. Comput. Sci., 2022, 16(3): 163703. DOI: 10.1007/s11704-020-0133-7
Image and Graphics
RESEARCH ARTICLE


Abstract

In this paper, we present an efficient algorithm that generates lip-synchronized facial animation from a given vocal audio clip. By combining spectral-dimensional bidirectional long short-term memory with a temporal attention mechanism, we design a lightweight speech encoder that learns useful and robust vocal features from the input audio without resorting to pre-trained speech recognition modules or large amounts of training data. To learn subject-independent facial motion, we use deformation gradients as the internal representation, which allows nuanced local motions to be synthesized better than with vertex offsets. Compared with state-of-the-art automatic-speech-recognition-based methods, our model is much smaller yet achieves similar robustness and quality most of the time, and noticeably better results in certain challenging cases.
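To make the two encoder ideas in the abstract concrete, the sketch below is a minimal, illustrative PyTorch example rather than the authors' released implementation: a bidirectional LSTM that gathers information along the spectral (frequency) axis of each audio frame, followed by soft attention over a temporal window of frames. The class names, layer sizes, window length, and the assumption of log-mel spectrogram input are all placeholders chosen for this example.

```python
# Illustrative sketch only; sizes and input format are assumptions, not the paper's.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpectralGatherEncoder(nn.Module):
    """Runs a BiLSTM along the frequency bins of each frame to gather spectral context."""
    def __init__(self, hidden=64, feat_dim=128):
        super().__init__()
        # The spectral axis of one frame is treated as the LSTM's "sequence" dimension.
        self.spectral_lstm = nn.LSTM(input_size=1, hidden_size=hidden,
                                     bidirectional=True, batch_first=True)
        self.proj = nn.Linear(2 * hidden, feat_dim)

    def forward(self, spec):                      # spec: (B, T, F) log-mel frames
        B, T, Fq = spec.shape
        x = spec.reshape(B * T, Fq, 1)            # each frame becomes a length-F sequence
        _, (h, _) = self.spectral_lstm(x)         # final hidden states: (2, B*T, hidden)
        h = torch.cat([h[0], h[1]], dim=-1)       # concatenate both directions
        return self.proj(h).reshape(B, T, -1)     # per-frame features: (B, T, feat_dim)

class TemporalAttention(nn.Module):
    """Softly weights the frames of a temporal window and sums them into one vector."""
    def __init__(self, feat_dim=128):
        super().__init__()
        self.score = nn.Linear(feat_dim, 1)       # additive attention score per frame

    def forward(self, feats):                     # feats: (B, T, feat_dim)
        w = F.softmax(self.score(feats), dim=1)   # attention weights over the T frames
        return (w * feats).sum(dim=1)             # (B, feat_dim) window summary

# Usage: a batch of 2 windows, each with 16 frames of 80 spectral bins.
enc, attn = SpectralGatherEncoder(), TemporalAttention()
window = torch.randn(2, 16, 80)
print(attn(enc(window)).shape)                    # torch.Size([2, 128])
```

In this reading, the spectral BiLSTM replaces a pre-trained speech recognition front end as the per-frame feature extractor, and the temporal attention decides how strongly each frame in the window should influence the animation output for the current time step.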


Keywords

speech-driven facial animation / spectral-dimensional bidirectional long short-term memory / temporal attention / deformation gradients

Cite this article

Yujin CHAI, Yanlin WENG, Lvdi WANG, Kun ZHOU. Speech-driven facial animation with spectral gathering and temporal attention. Front. Comput. Sci., 2022, 16(3): 163703. https://doi.org/10.1007/s11704-020-0133-7


Acknowledgements

We would like to thank the VOCA group for publishing their database. This work was partially supported by the National Key Research & Development Program of China (2016YFB1001403) and the National Natural Science Foundation of China (Grant No. 61572429).

RIGHTS & PERMISSIONS

2022 Higher Education Press
