Expanding n-gram training data for language models based on morpho-syntactic transformations

Lyan Verwimp; Joris Pelemans; Hugo van Hamme; Patrick Wambacq

Authors

Lyan Verwimp, KU Leuven, Leuven, Belgium
Joris Pelemans, KU Leuven, Leuven, Belgium
Hugo van Hamme, KU Leuven, Leuven, Belgium
Patrick Wambacq, KU Leuven, Leuven, Belgium

Abstract

The subject of this paper is the expansion of n-gram training data with the aid of morpho-syntactic transformations, in order to create a larger number of reliable n-grams for Dutch language models. The main aim of this technique is to alleviate a classical problem for language models: data sparsity. Moreover, since language models for automatic speech recognition are usually trained on written language resources but tested on spoken language, patterns that are typical of spontaneous spoken language will be under-represented and patterns characteristic of written language will be over-represented. By adding transformed n-grams, we hope to adapt the language model so that it better matches spoken language. We investigate whether a language model trained on the expanded data performs better, in terms of perplexity and word error rate, than a baseline n-gram model with modified Kneser-Ney smoothing. Several alternatives for the probability estimation of the transformed n-grams are explored, and an approach to deal with separable verbs in Dutch is also discussed.
