DutchSemCor: Aiming at the ideal sense-tagged corpus

  • Piek Vossen Computational Lexicology & Terminology Lab, Vrije Universiteit Amsterdam, The Netherlands
  • Rubén Izquierdo Computational Lexicology & Terminology Lab, Vrije Universiteit Amsterdam, The Netherlands
  • Attila Görög Computational Lexicology & Terminology Lab, Vrije Universiteit Amsterdam, The Netherlands

Abstract

The most-frequent-sense and the predominant domain sense play an important role in the debate on Word Sense Disambiguation (WSD). This discussion is, however, biased by the way sense-tagged corpora are built. In this paper, we argue that current sense-tagged corpora neglect rare senses and contexts and, as a result, do not represent a good corpus for training and testing word-sense disambiguation. We defined three quality criteria for sense-tagged corpora and a methodology to satisfy these criteria with minimal effort. Following this method, we built a Dutch sense-tagged corpus that tried to meet these criteria. The corpus was evaluated by deriving word-sense-disambiguation systems and testing these on different subsets of the corpus in various ways. The performance of our systems and the quality of the derived data are equal to state-of-the-art English systems and corpora. Finally, we used the systems to annotate a chunk of the Dutch SoNaR-corpus and create a subcorpus of over 47 million sense-tagged tokens spread over a large variety of genres, domains and usages of Dutch. The results of the project can be downloaded freely from the project website.

Published
2013-12-01
How to Cite
Vossen, P., Izquierdo, R., & Görög, A. (2013). DutchSemCor: Aiming at the ideal sense-tagged corpus. Computational Linguistics in the Netherlands Journal, 3, 49-62. Retrieved from https://clinjournal.org/clinj/article/view/25
Section
Articles