Speech-driven facial animation with spectral gathering and temporal attention
Yujin CHAI, Yanlin WENG, Lvdi WANG, Kun ZHOU
In this paper, we present an efficient algorithm that generates lip-synchronized facial animation from a given vocal audio clip. By combining a spectral-dimensional bidirectional long short-term memory network with a temporal attention mechanism, we design a lightweight speech encoder that learns useful and robust vocal features from the input audio without resorting to pre-trained speech recognition modules or large training datasets. To learn subject-independent facial motion, we use deformation gradients as the internal representation, which allows nuanced local motions to be synthesized better than with vertex offsets. Compared with state-of-the-art automatic-speech-recognition-based methods, our model is much smaller yet achieves comparable robustness and quality in most cases, and noticeably better results in certain challenging ones.
Keywords: speech-driven facial animation / spectral-dimensional bidirectional long short-term memory / temporal attention / deformation gradients
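To make the two key ideas in the abstract concrete, here is a minimal PyTorch sketch of a speech encoder that runs a bidirectional LSTM along the spectral (frequency) axis of each audio frame and then pools a short window of frames with soft temporal attention. This is an illustrative reading of the abstract, not the paper's implementation: the class name, feature dimensions, mean-pooling over frequency bins, and the single-layer attention scorer are all assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpectralGatherEncoder(nn.Module):
    """Illustrative sketch: a BiLSTM sweeps the frequency bins of each
    spectrogram frame ("spectral gathering"), then soft temporal attention
    pools the per-frame features over a context window."""

    def __init__(self, freq_bins=128, hidden=64, out_dim=256):
        super().__init__()
        # BiLSTM over the frequency axis: each "time step" is one frequency bin.
        self.spectral_lstm = nn.LSTM(
            input_size=1, hidden_size=hidden,
            batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, out_dim)
        # Scoring network producing one attention weight per frame.
        self.attn_score = nn.Linear(out_dim, 1)

    def forward(self, spec):
        # spec: (batch, window, freq_bins) -- a short window of spectrogram frames.
        b, w, f = spec.shape
        x = spec.reshape(b * w, f, 1)            # treat frequency bins as a sequence
        h, _ = self.spectral_lstm(x)             # (b*w, f, 2*hidden)
        frame_feat = self.proj(h.mean(dim=1))    # gather over bins -> (b*w, out_dim)
        frame_feat = frame_feat.reshape(b, w, -1)
        # Temporal attention: weight the frames in the window and sum.
        weights = F.softmax(self.attn_score(frame_feat), dim=1)   # (b, w, 1)
        return (weights * frame_feat).sum(dim=1)                  # (b, out_dim)
```

For example, `SpectralGatherEncoder()(torch.randn(2, 16, 128))` maps a 16-frame, 128-bin spectrogram window to a 256-dimensional per-window feature.

The abstract's internal motion representation, deformation gradients, can likewise be sketched. The standard per-triangle formulation (as in Sumner and Popović's deformation transfer) builds a local frame from two edges and a scale-normalized normal, and takes the 3x3 matrix mapping the neutral frame to the deformed one; the helper below is illustrative, assuming each triangle is given as a (3, 3) array of vertex positions.

```python
import numpy as np

def deformation_gradient(neutral_tri, deformed_tri):
    """Per-triangle 3x3 deformation gradient between a neutral and a
    deformed triangle, each given as rows of three vertex positions."""
    def frame(tri):
        e1 = tri[1] - tri[0]
        e2 = tri[2] - tri[0]
        n = np.cross(e1, e2)
        n = n / np.sqrt(np.linalg.norm(n))   # normal scaled to sqrt of its length
        return np.column_stack([e1, e2, n])  # local frame as column vectors
    V, V_def = frame(neutral_tri), frame(deformed_tri)
    return V_def @ np.linalg.inv(V)          # Q maps neutral frame to deformed frame
```

Because each gradient encodes local stretch and rotation relative to the neutral pose rather than absolute vertex positions, predicting gradients (and solving back for vertices) preserves nuanced local motion better than predicting vertex offsets directly, which is the advantage the abstract claims.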