Historical Dutch Spelling Normalization with Pretrained Language Models
Abstract
The Dutch language has undergone several spelling reforms since the 19th century. Normalizing 19th-century Dutch spelling to its modern equivalent can increase performance on various NLP tasks, such as machine translation or entity tagging. Van Cranenburgh and van Noord (2022) presented a rule-based system to normalize historical Dutch texts to their modern equivalent, but building and extending such a system requires careful engineering to ensure good coverage without introducing incorrect normalizations. Recently, pretrained language models have become state-of-the-art for most NLP tasks. In this paper, we combine these approaches by building sequence-to-sequence language models trained on texts automatically corrected by the rule-based system (i.e., silver data). We experiment with several types of language models and approaches. First, we finetune two T5 models: Flan-T5 (Chung et al., 2022), an instruction-finetuned multilingual version of the original T5, and ByT5 (Xue et al., 2022), a token-free model which operates directly on raw text at the byte level. Second, we pretrain ByT5 on the pretraining data used for BERTje (de Vries et al., 2019) and finetune this model afterward. For evaluation, we use three manually corrected novels from the same source and compare all trained models with the original rule-based system used to generate the training data. This allows for a direct comparison between the rule-based system and the pretrained language models to analyze which yields the best performance. Our pretrained ByT5 model finetuned with our largest finetuning dataset achieved the best results on all three novels. This model not only outperformed the rule-based system, but also generalized beyond the training data. In addition to an intrinsic evaluation of the spelling normalization itself, we also perform an extrinsic evaluation on downstream tasks, namely parsing and coreference resolution. Results show that the neural system tends to outperform the rule-based method, although the differences are small. All code, data, and models used in this paper are available at https://github.com/andreasvc/neuralspellnorm.