Speaker interpolation based data augmentation for automatic speech recognition

Lisa Kristina Kerle*, Michael Pucher, Barbara Schuppler

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference paper › peer-review

Abstract

In recent years, the development of automatic speech recognition systems has ensured their widespread use in a broad range of areas. Most of these systems, however, require large amounts of training data, making them less suitable for low-resourced languages and for smaller varieties of (well-resourced) languages. This paper focuses on improving automatic speech recognition for Austrian German by means of training data augmentation through neural network-based text-to-speech synthesis. For this purpose, speaker embedding vectors are extracted from an existing corpus, and subsequent interpolation between these vectors is used to generate new voices. The synthesised speech is then used to train an automatic speech recognition system, comparing training sets containing different proportions of synthesised speech. Overall, we find that performance improves when the amounts of real and synthesised speech are of the same order of magnitude.
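The interpolation between speaker embedding vectors described in the abstract can be sketched as a simple convex combination of two embeddings. This is a minimal illustration of the general technique, not the authors' implementation; the function name, dimensionality, and example vectors are hypothetical:

```python
import numpy as np

def interpolate_speakers(emb_a, emb_b, alpha=0.5):
    """Linearly interpolate between two speaker embeddings.

    alpha = 0 returns emb_a, alpha = 1 returns emb_b; intermediate
    values yield a new, synthetic speaker identity that can condition
    a multi-speaker text-to-speech model.
    """
    emb_a = np.asarray(emb_a, dtype=np.float64)
    emb_b = np.asarray(emb_b, dtype=np.float64)
    return (1.0 - alpha) * emb_a + alpha * emb_b

# Example with two hypothetical 4-dimensional speaker embeddings
# (real embeddings are typically much higher-dimensional).
a = np.array([1.0, 0.0, 0.5, -1.0])
b = np.array([0.0, 1.0, 0.5, 1.0])
mid = interpolate_speakers(a, b, alpha=0.5)  # midpoint "new" voice
```

Feeding embeddings such as `mid` to the synthesiser in place of a real speaker's embedding is what allows the training data to be augmented with voices not present in the original corpus.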
Original language: English
Title of host publication: Proceedings of the 20th International Congress of Phonetic Sciences
Place of publication: Prague
Publisher: GUARANT International spol. s r.o.
Pages: 3126-3130
ISBN (electronic): 978-80-908114-2-3
Publication status: Published - 2023
Event: 20th International Congress of Phonetic Sciences (ICPhS 2023) - Prague, Czech Republic
Duration: 7 Aug 2023 - 11 Aug 2023
https://www.icphs2023.org/call-for-papers/

Conference

Conference: 20th International Congress of Phonetic Sciences
Abbreviated title: ICPhS 2023
Country/Territory: Czech Republic
City: Prague
Period: 7/08/23 - 11/08/23
Internet address: https://www.icphs2023.org/call-for-papers/
