
2021.11.10 Prof. LEE Tan

Professor, Department of Electronic Engineering

Associate Dean for Education, Faculty of Engineering

The Chinese University of Hong Kong

Profile

Synthesizing natural speech with expressiveness and personal style

Speech is the most natural and preferred means of human communication. People speak and express themselves in different styles, which are personal and situational. Proper choice and use of expression style are key to effective speech communication. Text-to-speech (TTS) is the technology of automatically generating speech in accordance with given text input. The latest TTS systems based on end-to-end deep neural network (DNN) models can generate fluent speech that is perceptually comparable to human speech. However, for naturalistic human-machine interaction, e.g., voice chatbots and virtual reality, computer-synthesized speech is expected not only to be clear and accurate but also to carry an appropriate or desired speaking style. Expressiveness, an innate characteristic of speech, is largely missing in existing speech synthesis technologies. Being conceptually similar to the conventional analysis-synthesis model of speech, the encoder-decoder type of DNN model has been widely adopted for the analysis of latent factors of variation in speech. Specifically, an input speech utterance can be decomposed into components that are related to linguistic content, speaker identity and expression style. Control of expression style and speaker characteristics in TTS can be achieved by manipulating the respective embeddings and recombining them with the desired content information. In this talk, we will present a few recent studies on DNN-based speech generation with personalized voice characteristics and controllable expression style. Successful applications in child storytelling and voice banking will be demonstrated.

Playback: Zoom Video of Prof. LEE Tan and Dr. PENG Gang's lectures
