Delemmatization strategies for Dutch

Authors

  • Louis Onrust, Radboud University Nijmegen, CLS, Linguistics / Computer Science
  • Hans van Halteren, Radboud University Nijmegen, CLS, Linguistics / Computer Science

Abstract

In this paper we investigate whether, for Dutch open-class words, the surface form can be generated from the lemma and the POS tag, using a lexicon and a machine learning system. When testing on an annotated corpus, we generate more than 97% of the gold standard types and over 99% of the gold standard tokens correctly. The most efficient strategy appears to be the pure machine learning one, even for words that occur in the lexicon. Apart from overall statistics, we examine specific machine learner settings in more detail and investigate the errors made by the best-scoring strategy.

Published

2013-12-01

How to Cite

Onrust, L., & van Halteren, H. (2013). Delemmatization strategies for Dutch. Computational Linguistics in the Netherlands Journal, 3, 19–33. Retrieved from https://clinjournal.org/index.php/clinj/article/view/23

Section

Articles