Efficient decoding self-attention for end-to-end speech synthesis

Wei ZHAO; Li XU

doi:10.1631/FITEE.2100501

Front. Inform. Technol. Electron. Eng, 2022, Vol. 23, Issue 7: 1127-1138. DOI: 10.1631/FITEE.2100501

Original Article

Wei ZHAO¹,², Li XU¹,²

Abstract

Self-attention has been applied to text-to-speech (TTS) because of its parallel structure and its strength in modeling sequential data. However, when it is used in end-to-end speech synthesis with an autoregressive decoding scheme, inference becomes slow because of the quadratic complexity in sequence length. The problem is particularly severe on devices without graphics processing units (GPUs). To alleviate this problem, we propose an efficient decoding self-attention (EDSA) module as an alternative. Combined with a dynamic programming decoding procedure, it accelerates TTS model inference to linear computational complexity. Experiments on Mandarin and English datasets show that our proposed model with EDSA achieves 720% and 50% higher inference speed on the central processing unit (CPU) and GPU, respectively, with almost the same performance. This method may therefore make such models easier to deploy when GPU resources are limited. In addition, our model may perform better than the baseline Transformer TTS on out-of-domain utterances.
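
The complexity and speed claims above can be illustrated with a small counting sketch. This is not the authors' EDSA module, whose internals the abstract does not describe. It only assumes that an autoregressive self-attention decoder compares the current frame with every frame generated so far, and contrasts that with a hypothetical decoder whose per-step work is constant. The function names are illustrative.

```python
def attention_decode_cost(num_frames: int) -> int:
    """Query-key comparisons when step t attends to all t frames so far.

    Summing 1 + 2 + ... + T gives T * (T + 1) / 2, i.e. quadratic growth.
    """
    return sum(range(1, num_frames + 1))


def constant_step_cost(num_frames: int, per_step: int = 1) -> int:
    """Cost of a hypothetical decoder whose per-step work does not grow.

    The total is T * per_step, i.e. linear growth in sequence length.
    """
    return num_frames * per_step


def speed_ratio(percent_higher: float) -> float:
    """Convert 'X% higher speed' into a multiple of the baseline speed."""
    return 1.0 + percent_higher / 100.0


if __name__ == "__main__":
    # Doubling the sequence length roughly quadruples the quadratic cost
    # but only doubles the linear cost.
    for n in (100, 200, 400):
        print(n, attention_decode_cost(n), constant_step_cost(n))
    # Reported gains expressed as multiples of the baseline speed.
    print(speed_ratio(720), speed_ratio(50))
```

Read this way, "720% higher" corresponds to about 8.2 times the baseline speed on CPU, and "50% higher" to 1.5 times on GPU.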

Keywords

Efficient decoding / End-to-end / Self-attention / Speech synthesis

Cite this article

Wei ZHAO, Li XU. Efficient decoding self-attention for end-to-end speech synthesis. Front. Inform. Technol. Electron. Eng, 2022, 23(7): 1127-1138. https://doi.org/10.1631/FITEE.2100501