Conversational Speech Recognition Needs Data? Experiments with Austrian German

Julian Linke; Philip N. Garner; Gernot Kubin; Barbara Schuppler

Institute of Signal Processing and Speech Communication (4420)

Research output: Contribution to conference › Paper › peer-review

Abstract

Conversational speech is one of the most complex automatic speech recognition (ASR) tasks, owing to the high inter-speaker variation in both pronunciation and conversational dynamics. Such complexity is particularly problematic in low-resourced (LR) scenarios. Recent developments in self-supervision have allowed such scenarios to take advantage of large amounts of otherwise unrelated data. In this study, we characterise an LR Austrian German conversational task. We begin with a non-pre-trained baseline and show that fine-tuning a model pre-trained using self-supervision leads to improvements consistent with those in the literature; this extends to cases where a lexicon and language model are included. We also show that the advantage of pre-training indeed arises from the larger database rather than from the self-supervision. Further, using a leave-one-conversation-out technique, we demonstrate that robustness problems remain with respect to inter-speaker and inter-conversation variation. This serves to guide where future research might best be focused in light of the current state of the art.

Original language	English
Pages	4684–4691
Number of pages	8
Publication status	Published - 2022

FWF - CLCS_2 - Cross-layer prosodic models for conversational speech
Schuppler, B.
1/10/18 → 30/11/21
Project: Research project

Cite this

@conference{c6fe4431f2574571a3f5d532118560a4,
title = "Conversational Speech Recognition Needs Data? Experiments with Austrian German",
abstract = "Conversational speech represents one of the most complex of automatic speech recognition (ASR) tasks owing to the high inter-speaker variation in both pronunciation and conversational dynamics. Such complexity is particularly sensitive to low-resourced (LR) scenarios. Recent developments in self-supervision have allowed such scenarios to take advantage of large amounts of otherwise unrelated data. In this study, we characterise an (LR) Austrian German conversational task. We begin with a non-pre-trained baseline and show that fine-tuning of a model pre-trained using self-supervision leads to improvements consistent with those in the literature; this extends to cases where a lexicon and language model are included. We also show that the advantage of pre-training indeed arises from the larger database rather than the self-supervision. Further, by use of a leave-one-conversation out technique, we demonstrate that robustness problems remain with respect to inter-speaker and inter-conversation variation. This serves to guide where future research might best be focused in light of the current state-of-the-art.",
keywords = "Speech Recognition, Conversational Speech, Austrian German, Low-Resource, Wav2vec2.0, Kaldi",
author = "Julian Linke and Garner, {Philip N.} and Gernot Kubin and Barbara Schuppler",
year = "2022",
language = "English",
pages = "4684--4691",
}
