Developing a part-of-speech tagger for Dutch tweets

  • Tetske Avontuur Tilburg University, Tilburg, The Netherlands
  • Iris Balemans Tilburg University, Tilburg, The Netherlands
  • Laura Elshof Tilburg University, Tilburg, The Netherlands
  • Nanne van Noord Tilburg University, Tilburg, The Netherlands
  • Menno van Zaanen Tilburg University, Tilburg, The Netherlands

Abstract

In this article we describe the design and creation of a part-of-speech tagger specifically for Dutch data from the popular microblogging service Twitter. Starting from the D-Coi part-of-speech tag set, which is also used in the SoNaR project, we added several Twitter-specific tags to allow the tagging of hashtags, @ mentions, emoticons and URLs. The tagger consists of the Frog tagger followed by a post-processing module that incorporates the new, Twitter-specific tags into the Frog part-of-speech output. Approximately 1 million tweets collected in the context of the SoNaR project were tagged by Frog and the post-processor combined. A subset of the annotated tweets was manually checked. Lastly, we evaluated the adapted part-of-speech tagger.

This project was carried out by eight Master’s students from Tilburg University who had just completed a course in natural language processing. The project took approximately a week and offered them hands-on experience in addition to the theoretical knowledge they acquired during the course.
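
The post-processing step described in the abstract can be illustrated with a minimal sketch. It assumes the base tagger's output is available as a list of (token, tag) pairs and overrides the tag of any token that looks like a URL, hashtag, @ mention or emoticon. The tag labels (HASH, AT, URL, EMO), the regular expressions and the function names are hypothetical and chosen for illustration only; they are not taken from the article or from Frog.

```python
import re

# Hypothetical labels for the Twitter-specific tags; the article's actual
# tag names are not given in this abstract.
HASHTAG, MENTION, URL, EMOTICON = "HASH", "AT", "URL", "EMO"

# Illustrative patterns only: a token starting with http(s):// or www.,
# and a small family of Western-style emoticons such as :-) ;) =D :P
URL_RE = re.compile(r"^(https?://|www\.)\S+$", re.IGNORECASE)
EMOTICON_RE = re.compile(r"^[:;=][-'^o]?[()\[\]DPpOo/\\|*3]+$")


def twitter_tag(token):
    """Return a Twitter-specific tag for the token, or None if none applies."""
    if URL_RE.match(token):
        return URL
    if len(token) > 1 and token.startswith("#"):
        return HASHTAG
    if len(token) > 1 and token.startswith("@"):
        return MENTION
    if EMOTICON_RE.match(token):
        return EMOTICON
    return None


def post_process(tagged_tokens):
    """Replace the base tagger's tag wherever a Twitter-specific tag applies."""
    return [(token, twitter_tag(token) or tag) for token, tag in tagged_tokens]


# Simplified stand-in for base tagger output (tags shortened for readability).
base_output = [
    ("@anna", "SPEC"),
    ("kijk", "WW"),
    ("#cool", "N"),
    ("http://t.co/x", "SPEC"),
    (":-)", "LET"),
]
result = post_process(base_output)
```

Running the base tagger first and applying the override afterwards keeps the tagger itself unchanged; only tokens that match one of the Twitter-specific patterns receive a new tag, and all other tags pass through as they were.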

Published
2012-12-01
How to Cite
Avontuur, T., Balemans, I., Elshof, L., van Noord, N., & van Zaanen, M. (2012). Developing a part-of-speech tagger for Dutch tweets. Computational Linguistics in the Netherlands Journal, 2, 34-51. Retrieved from https://clinjournal.org/clinj/article/view/14
Section
Articles