University of Twente Student Theses
ExpressTTS: Augmentation for Speech Recognition with Expressive Speech Synthesis
Kempen, Lindsay (2022) ExpressTTS: Augmentation for Speech Recognition with Expressive Speech Synthesis.
Abstract: | Expressive speech degrades current automatic speech recognition (ASR) systems, raising their Word Error Rate (WER). Producing a large-scale training corpus of human expressive speech is very laborious. In the spirit of data augmentation, we explore expressiveness within a text-to-speech (TTS) system in order to create a larger amount of speech data. Our speech synthesizer, ExpressTTS, aims to separately explore prosodic factors (pitch, energy, duration) and spectral tilt in a regularized latent space while conditioning on the text and speaker. This way, we find expressive patterns that are natural in these contexts. Because our model is non-autoregressive, inference is parallelized, which allows us to generate a large-scale corpus. We focus on the TTS part of the augmentation pipeline. We train the system on a small-scale expressive corpus, padded with neutral speech data. Quantitative analysis shows that the resulting domain mismatch inhibits the stability of both our model and the baseline. Nonetheless, the model generates speech with prosodic variation, and ExpressTTS consistently generalizes better to unseen in-domain data than the baseline, Glow-TTS. The user study suggests that our model produces more diverse expressiveness and significantly more emotion and style than the baseline. We conclude with directions on how to use our model for exploring expressiveness. |
Item Type: | Essay (Master) |
Clients: | Sony, Stuttgart, Germany |
Faculty: | EEMCS: Electrical Engineering, Mathematics and Computer Science |
Subject: | 54 computer science |
Programme: | Computer Science MSc (60300) |
Link to this item: | https://purl.utwente.nl/essays/89721 |