Speech-driven facial animation with spectral gathering and temporal attention
Yujin CHAI, Yanlin WENG, Lvdi WANG, Kun ZHOU
Front. Comput. Sci., 2022, Vol. 16, Issue 3: 163703
In this paper, we present an efficient algorithm that generates lip-synchronized facial animation from a given vocal audio clip. By combining spectral-dimensional bidirectional long short-term memory (BiLSTM) with a temporal attention mechanism, we design a lightweight speech encoder that learns useful and robust vocal features from the input audio without resorting to pre-trained speech recognition modules or large training data. To learn subject-independent facial motion, we use deformation gradients as the internal representation, which allows nuanced local motions to be synthesized better than with vertex offsets. Compared with state-of-the-art automatic-speech-recognition-based methods, our model is much smaller yet achieves comparable robustness and quality in most cases, and noticeably better results in certain challenging ones.
speech-driven facial animation / spectral-dimensional bidirectional long short-term memory / temporal attention / deformation gradients
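The speech encoder described in the abstract can be pictured with a minimal sketch: a BiLSTM that scans the spectral (frequency-bin) axis of each audio frame to gather per-frame features, followed by an attention layer over the temporal axis that weights and pools the frames. The PyTorch code below is an illustrative sketch under assumed layer sizes, input shapes, and names (e.g. SpectralTemporalEncoder); it is not the authors' exact architecture.

# A minimal sketch (not the paper's exact model) of a speech encoder that runs a
# bidirectional LSTM along the spectral axis of each frame ("spectral gathering"),
# then pools the frame features with temporal attention. Shapes and sizes are
# illustrative assumptions.
import torch
import torch.nn as nn


class SpectralTemporalEncoder(nn.Module):
    def __init__(self, n_mels=80, spec_hidden=64, attn_dim=128):
        super().__init__()
        # BiLSTM that scans the frequency bins of one frame (spectral gathering).
        self.spectral_lstm = nn.LSTM(
            input_size=1, hidden_size=spec_hidden,
            batch_first=True, bidirectional=True)
        # Temporal attention: score each frame feature, then take a weighted sum.
        self.attn_score = nn.Sequential(
            nn.Linear(2 * spec_hidden, attn_dim),
            nn.Tanh(),
            nn.Linear(attn_dim, 1))

    def forward(self, spectrogram):
        # spectrogram: (batch, frames, n_mels)
        b, t, f = spectrogram.shape
        # Treat every frame independently; each frequency bin is one LSTM step.
        x = spectrogram.reshape(b * t, f, 1)
        _, (h, _) = self.spectral_lstm(x)                  # h: (2, b*t, spec_hidden)
        frame_feat = h.transpose(0, 1).reshape(b, t, -1)   # (b, t, 2*spec_hidden)
        # Attention weights over the temporal axis.
        scores = self.attn_score(frame_feat)               # (b, t, 1)
        weights = torch.softmax(scores, dim=1)
        context = (weights * frame_feat).sum(dim=1)        # (b, 2*spec_hidden)
        return context, weights


if __name__ == "__main__":
    enc = SpectralTemporalEncoder()
    mel = torch.randn(2, 30, 80)   # 2 clips, 30 frames, 80 mel bins each
    feat, attn = enc(mel)
    print(feat.shape, attn.shape)  # torch.Size([2, 128]) torch.Size([2, 30, 1])

The decoder that maps the pooled vocal feature to per-triangle deformation gradients (and then to vertex positions) is omitted here; the sketch only illustrates how spectral gathering and temporal attention can be combined in a lightweight encoder.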