Expanding n-gram training data for language models based on morpho-syntactic transformations
Abstract
This paper investigates the expansion of n-gram training data by means of morphosyntactic transformations, in order to obtain a larger number of reliable n-grams for Dutch language models. The main aim of this technique is to alleviate a classical problem for language models: data sparsity. Moreover, since language models for automatic speech recognition are usually trained on written language resources but tested on spoken language, patterns that are typical of spontaneous spoken language are under-represented and patterns characteristic of written language are over-represented. By adding transformed n-grams, we hope to adapt the language model so that it better matches spoken language. We investigate whether a language model trained on the expanded data performs better than a baseline n-gram model with modified Kneser-Ney smoothing in terms of perplexity and word error rate. Several alternatives for the probability estimation of the transformed n-grams are explored, and an approach to deal with separable verbs in Dutch is also discussed.