A Hybrid ASR System for Southern Dutch
Abstract
Classical hybrid models for automatic speech recognition were recently outperformed by end-to-end models on popular benchmarks such as LibriSpeech. However, in many real-life situations, hybrid systems can still prevail because their acoustic and language models can be trained, optimized and tuned independently. In this work, we implement a state-of-the-art hybrid system for Southern Dutch. For the acoustic model, we train an HMM-DNN on 155 hours of the Corpus Gesproken Nederlands (CGN) with a rather standard Kaldi recipe. As a reference, we reuse the language models developed during our N-Best 2008 evaluation. We further investigate the effect of language model order and size on word error rate (WER) for a variety of test sets (held-out data from CGN and the N-Best dev and test sets). The best result, 10.12% WER on the N-Best test set, is obtained with a 400k-word lexicon and a 4-gram language model with 231M parameters. This new hybrid system outperforms our older HMM-GMM based N-Best system by over 40%. Pruning away 90% of the language model parameters yields a compact model suitable for small-scale real-time applications, at the cost of only a 10% relative degradation in performance.
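For readers unfamiliar with the metric, the following is a minimal sketch of how a word error rate and a relative change between two systems are commonly computed. It is not the scoring tool used in this work: the function names `word_error_rate` and `relative_change` and the example sentences are hypothetical, and the sketch assumes whitespace-tokenized transcripts and the usual definition of WER as word-level edit distance divided by the number of reference words.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words.

    Hypothetical helper: tokenizes on whitespace and uses plain
    Levenshtein distance over words, with no text normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def relative_change(baseline_wer: float, new_wer: float) -> float:
    """Relative WER change of a new system with respect to a baseline.

    Positive values mean the new system makes more errors.
    """
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (new_wer - baseline_wer) / baseline_wer


if __name__ == "__main__":
    # Illustrative sentences only, not data from the paper.
    wer = word_error_rate("de kat zit op de mat", "de kat zat op mat")
    print(f"WER: {wer:.2%}")

    # Illustrative numbers only: going from 10.0% to 11.0% WER
    # is a 10% relative degradation.
    print(f"relative change: {relative_change(10.0, 11.0):.0%}")
```

Under this definition, a "relative" figure compares two error rates to each other rather than subtracting them, so a 10% relative degradation corresponds to multiplying the baseline WER by 1.10.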