Speaker interpolation based data augmentation for automatic speech recognition

Lisa Kristina Kerle*, Michael Pucher, Barbara Schuppler

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference paper › peer-review

Abstract

In recent years, the development of automatic speech recognition systems has ensured their widespread use in a broad range of areas. Most of these systems, however, require large amounts of training data, making them less suitable for low-resourced languages and for smaller varieties of (well-resourced) languages. This paper focuses on improving automatic speech recognition for Austrian German by means of training data augmentation through neural network-based text-to-speech synthesis. For this purpose, speaker embedding vectors are extracted from an existing corpus, and subsequent interpolation between these vectors is used to generate new voices. The synthesised speech is then used to train an automatic speech recognition system, comparing training sets containing different proportions of synthesised speech. Overall, we find that performance improves when the amounts of real and synthesised speech are of the same order of magnitude.
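The interpolation between speaker embedding vectors described in the abstract can be sketched as a simple convex combination of two embeddings. This is a minimal illustration of the general technique, not the authors' implementation; the function name, dimensionality, and example vectors are hypothetical:

```python
import numpy as np

def interpolate_speakers(emb_a, emb_b, alpha=0.5):
    """Linearly interpolate between two speaker embeddings.

    alpha = 0 returns emb_a, alpha = 1 returns emb_b; intermediate
    values yield a new, synthetic speaker identity that can condition
    a multi-speaker text-to-speech model.
    """
    emb_a = np.asarray(emb_a, dtype=np.float64)
    emb_b = np.asarray(emb_b, dtype=np.float64)
    return (1.0 - alpha) * emb_a + alpha * emb_b

# Example with two hypothetical 4-dimensional speaker embeddings
# (real embeddings are typically much higher-dimensional).
a = np.array([1.0, 0.0, 0.5, -1.0])
b = np.array([0.0, 1.0, 0.5, 1.0])
mid = interpolate_speakers(a, b, alpha=0.5)  # midpoint "new" voice
```

Feeding embeddings such as `mid` to the synthesiser in place of a real speaker's embedding is what allows the training data to be augmented with voices not present in the original corpus.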
Original language: English
Title of host publication: Proceedings of the 20th International Congress of Phonetic Sciences
Place of publication: Prague
Publisher: GUARANT International spol. s r.o.
Pages: 3126-3130
ISBN (electronic): 978-80-908114-2-3
Publication status: Published - 2023
Event: 20th International Congress of Phonetic Sciences (ICPhS 2023) - Prague, Czech Republic
Duration: 7 Aug 2023 - 11 Aug 2023
https://www.icphs2023.org/call-for-papers/

Conference

Conference: 20th International Congress of Phonetic Sciences
Abbreviated title: ICPhS 2023
Country/Territory: Czech Republic
City: Prague
Period: 7/08/23 - 11/08/23
Internet address: https://www.icphs2023.org/call-for-papers/
