Linguistic enrichment of historical Dutch using deep learning

Silke Creten; Peter Dekker; Vincent Vandeghinste

Authors

Silke Creten KU Leuven
Peter Dekker Vrije Universiteit Brussel
Vincent Vandeghinste Instituut voor de Nederlandse taal

Abstract

This article discusses the automatic linguistic enrichment of historical Dutch corpora through part-of-speech tagging and lemmatization. This kind of enrichment facilitates linguistic research where manual annotation is infeasible. We built a neural network-based model using the PIE framework and compared it against two statistical baselines: MBT, a memory-based tagger frequently used for modern Dutch, and HunPos, an open-source trigram tagger. We experimented with two data sets: the Corpus Gysseling (13th-century texts) and the Corpus van Reenen/Mulder (14th-century texts). We performed an in-depth error analysis to identify the strengths and weaknesses of each approach with respect to labeling historical data. In general, the neural model scores better than the two baselines, even with limited training data. Based on the error analysis, we propose several strategies for future research to improve the labeling of historical Dutch.
