University of Twente Student Theses


ExpressTTS: Augmentation for Speech Recognition with Expressive Speech Synthesis

Kempen, Lindsay (2022) ExpressTTS: Augmentation for Speech Recognition with Expressive Speech Synthesis.

Full Text Status: Access to this publication is restricted
Abstract: Current automatic speech recognition (ASR) systems perform considerably worse on expressive speech, which raises the word error rate (WER). Producing a large-scale training corpus of human expressive speech is very laborious. In the spirit of data augmentation, we explore expressiveness within a text-to-speech (TTS) system to create a larger amount of speech data. Our speech synthesizer, ExpressTTS, aims to separately explore prosodic factors (pitch, energy, duration) and spectral tilt in a regularized latent space while conditioning on the text and speaker. This way, we find expressive patterns that are natural in these contexts. Our non-autoregressive model parallelizes inference, allowing us to generate a large-scale corpus. We focus on the TTS part of the augmentation pipeline. We train the system on a small-scale expressive corpus, supplemented with neutral speech data. Quantitative analysis shows that the resulting domain mismatch reduces the stability of both our model and the baseline. Nonetheless, the model generates speech with prosodic variation, and we find that ExpressTTS consistently generalizes better to unseen in-domain data than the baseline Glow-TTS. The user study suggests our model produces more diverse expressiveness and significantly more emotion and style than the baseline. We conclude with directions on how to use our model for exploring expressiveness.
Item Type: Essay (Master)
Sony, Stuttgart, Germany
Faculty: EEMCS: Electrical Engineering, Mathematics and Computer Science
Subject: 54 computer science
Programme: Computer Science MSc (60300)