Robust Speech Recognition for German and Dialectal Broadcast Programmes

Stadtschnitzer, Michael

Author

Stadtschnitzer, Michael

Type of thesis

Dissertation

Date of examination

17.10.2018

Date of publication

25.10.2018

First reviewer

Bauckhage, Christian

Second reviewer

Wrobel, Stefan

Participating institutions

Rheinische Friedrich-Wilhelms-Universität Bonn

Citable links

Handle: https://hdl.handle.net/20.500.11811/7658
URN: https://nbn-resolving.org/urn:nbn:de:hbz:5n-52369

Abstract

Audio mining systems automatically analyse large amounts of heterogeneous media files, such as television and radio programmes, so that the analysed audio content can be searched efficiently for spoken words. Typically, audio mining systems such as the Fraunhofer IAIS audio mining system consist of several modules that structure and analyse the data.
The most important module is the large vocabulary continuous speech recognition (LVCSR) module, which is responsible for transforming the audio signal into written text. Because of the tremendous developments in the field of speech recognition, and in order to provide customers with a high-performance audio mining system, the LVCSR module has to be trained and updated regularly using the latest state-of-the-art algorithms provided by the research community and large amounts of training data. Today, speech recognition systems usually perform very well in clean conditions; however, when noise, reverberation or dialectal speakers are present, the performance of these systems degrades considerably. Broadcast media typically contain a large number of different speakers with high variability, such as anchormen, interviewers and interviewees, speaking colloquial or planned speech, with or without dialect, or even with voice-overs. Especially in regional programmes of public broadcasters, a considerable fraction of the speakers speak with an accent or a dialect. In addition, many different kinds of background noise appear in the data, such as background speech or background music. Post-processing algorithms such as compression, expansion and stereo effect processing, which are used generously in broadcast media, further manipulate the audio data. All these issues make speech recognition in the broadcast domain a challenging task.
This thesis focuses on the development and optimisation, over the course of several years, of the German broadcast LVCSR system that is part of the Fraunhofer IAIS audio mining system. It deals with robustness-related problems that arise for German broadcast media and with the requirements for deploying the ASR system in a production audio mining system for industrial use, including stability, decoding time and memory consumption.
We address the following three problems: the continuous development and optimisation of the German broadcast LVCSR system over a long period, the rapid and automatic search for optimal ASR decoder parameters, and the handling of German dialects in the German broadcast LVCSR system.
To ensure high performance over long periods of time, we regularly re-train the system using the latest algorithms and system architectures made available by the research community, and evaluate the performance of these algorithms on German broadcast speech. We also drastically increase the amount of training data by annotating a large and novel German broadcast speech corpus, which is unique in Germany.
After an automatic speech recognition (ASR) system has been trained, a speech recognition decoder is responsible for decoding the most likely text hypothesis for a given audio signal using the ASR model. Typically, the ASR decoder comes with a large number of hyperparameters, which are usually left at their default values or optimised manually. These settings are often far from the optimum in terms of accuracy and decoding speed, and state-of-the-art decoder parameter optimisation algorithms take a long time to converge. Hence, in this thesis we address automatic decoder parameter optimisation in the context of German broadcast speech recognition, for both unconstrained decoding and decoding constrained in terms of decoding speed, by introducing and extending, for ASR decoder parameter optimisation, an optimisation algorithm that has not been used in speech recognition before.
Germany has a large variety of dialects, which are often present in broadcast media, especially in regional programmes. Dialectal speakers severely degrade the performance of the speech recognition system because of the mismatch in phonetics and grammar. In this thesis, we address the large variety of German dialects by introducing a dialect identification system that infers the dialect of the speaker, so that adapted dialectal speech recognition models can be used to retrieve the spoken text. To train the dialect identification system, a novel database was collected and annotated.
By addressing these three issues, we arrive at an audio mining system that includes a high-performance speech recognition system which is able to cope with dialectal speakers and whose optimal decoder parameters can be inferred quickly.

Keywords

Robust Speech Recognition, Automatic Decoder Parameter Optimisation, Dialectal Robustness

Classification (DDC)

004 Computer science

Suggested citation

Stadtschnitzer, Michael: Robust Speech Recognition for German and Dialectal Broadcast Programmes. - Bonn, 2018. - Dissertation, Rheinische Friedrich-Wilhelms-Universität Bonn.
Online edition in bonndoc: https://nbn-resolving.org/urn:nbn:de:hbz:5n-52369

BibTeX

@phdthesis{handle:20.500.11811/7658,
urn = {https://nbn-resolving.org/urn:nbn:de:hbz:5n-52369},
author = {Stadtschnitzer, Michael},
title = {Robust Speech Recognition for German and Dialectal Broadcast Programmes},
school = {Rheinische Friedrich-Wilhelms-Universität Bonn},
year = 2018,
month = oct,
url = {https://hdl.handle.net/20.500.11811/7658}
}
