Developing a part-of-speech tagger for Dutch tweets

Tetske Avontuur; Iris Balemans; Laura Elshof; Nanne van Noord; Menno van Zaanen

Authors

Tetske Avontuur Tilburg University, Tilburg, The Netherlands
Iris Balemans Tilburg University, Tilburg, The Netherlands
Laura Elshof Tilburg University, Tilburg, The Netherlands
Nanne van Noord Tilburg University, Tilburg, The Netherlands
Menno van Zaanen Tilburg University, Tilburg, The Netherlands

Abstract

In this article we describe the design and creation of a part-of-speech tagger specifically for Dutch data from the popular microblogging service Twitter. Starting from the D-Coi part-of-speech tag set, which is also used in the SoNaR project, we added several Twitter-specific tags to allow the tagging of hashtags, @-mentions, emoticons and URLs. The tagger consists of the Frog tagger combined with a post-processing module that incorporates the new, Twitter-specific tags into the Frog part-of-speech output. Running the Frog tagger and the post-processing module sequentially yields a part-of-speech tagger for Dutch tweets. Approximately 1 million tweets collected in the context of the SoNaR project were tagged by Frog and the post-processor combined. A subset of the annotated tweets has been manually checked. Lastly, we evaluated the adapted part-of-speech tagger.

This project was accomplished by eight Master’s students from Tilburg University, who had just completed a course in natural language processing. In addition to the theoretical knowledge they acquired during the course, this project, which took approximately a week, offered them hands-on experience.
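The post-processing step described in the abstract can be illustrated with a minimal sketch. It is not the authors' module. It assumes that the tagger output is available as (token, tag) pairs, that Twitter-specific tokens arrive unsplit, and that simple regular expressions are enough to recognise them. The tag names `URL`, `HASH`, `AT` and `EMO`, the function `retag` and all patterns are hypothetical; the actual labels added to the D-Coi tag set are not given here.

```python
import re

# Hypothetical patterns for the four Twitter-specific token types.
URL_RE = re.compile(r"^(https?://|www\.)\S+$", re.IGNORECASE)
HASHTAG_RE = re.compile(r"^#\w+$")
MENTION_RE = re.compile(r"^@\w+$")
# A deliberately small emoticon pattern: eyes, optional nose, one or more mouths.
EMOTICON_RE = re.compile(r"^[:;=8][-o^']?[)(\]\[DPpOo/\\|*]+$")


def retag(tagged_tokens):
    """Overwrite the tag of Twitter-specific tokens; keep all other tags as-is."""
    result = []
    for token, tag in tagged_tokens:
        if URL_RE.match(token):
            tag = "URL"    # hypothetical tag name
        elif HASHTAG_RE.match(token):
            tag = "HASH"   # hypothetical tag name
        elif MENTION_RE.match(token):
            tag = "AT"     # hypothetical tag name
        elif EMOTICON_RE.match(token):
            tag = "EMO"    # hypothetical tag name
        result.append((token, tag))
    return result


# Illustrative input only: D-Coi-style tags as a general tagger might assign them.
tagged = [
    ("@piet", "SPEC(deeleigen)"),
    ("kijk", "WW(pv,tgw,ev)"),
    ("#voetbal", "N(soort,ev,basis,zijd,stan)"),
    (":-)", "LET()"),
    ("http://t.co/abc", "SPEC(symb)"),
]
for token, tag in retag(tagged):
    print(token, tag)
```

Running the general tagger first and correcting only the Twitter-specific tokens afterwards leaves the tags of ordinary words unchanged, which matches the sequential set-up the abstract describes.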
