Automatically correcting Dutch pronouns "die" and "dat"
Abstract
The correct use of Dutch pronouns die and dat is a stumbling block for both native and nonnative speakers of Dutch due to the multiplicity of syntactic functions and the dependency on the antecedent’s grammatical gender and number. Drawing on previous research on neural contextdependent dt-mistake correction (Heyman et al. 2018), this study constructs the first neural network model for Dutch demonstrative and relative pronoun correction that specifically focuses on die and dat. Several datasets are built with sentences obtained from the Dutch Europarl corpus (Koehn 2005) - which contains the proceedings of the European Parliament from 1996 to the present - and the SoNaR corpus (Oostdijk et al. 2013) - which contains Dutch texts from a variety of domains such as newspapers, blogs and legal texts. Firstly, a binary classification model solely predicts the correct die or dat. The classifier with a bidirectional long short-term memory architecture achieves 84.56% accuracy. Secondly, a multitask classification model simultaneously predicts the correct die or dat and its part-of-speech tag. The model consisting of a sentence and context encoder with both a bidirectional long short-term memory architecture yields 88.63% accuracy for die/dat prediction and 87.73% accuracy for part-of-speech prediction. More evenly-balanced data, larger word embeddings, an extra bidirectional long short-term memory layer and integrated partof-speech knowledge positively affects die/dat prediction performance, while a context encoder architecture raises part-of-speech prediction performance. This study shows promising results and can serve as a starting point for future research on automated, fine-grained grammar correction.