Developing a part-of-speech tagger for Dutch tweets
Abstract
In this article we describe the design and creation of a part-of-speech tagger built specifically for Dutch data from the popular microblogging service Twitter. Starting from the D-Coi part-of-speech tag set, which is also used in the SoNaR project, we added several Twitter-specific tags to allow the tagging of hashtags, @-mentions, emoticons and URLs. The tagger runs the Frog tagger first and then a post-processing module that incorporates the new, Twitter-specific tags into the Frog part-of-speech output. Approximately 1 million tweets collected in the context of the SoNaR project were tagged by Frog and the post-processor combined, and a subset of the annotated tweets was manually checked. Lastly, we evaluated the adapted part-of-speech tagger.
This project was carried out by eight Master’s students from Tilburg University who had just completed a course in natural language processing. The project, which took approximately a week, gave them hands-on experience to complement the theoretical knowledge they had acquired during the course.
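To make the two-stage design concrete, the sketch below shows one way a post-processing step could overwrite a base tagger's output for Twitter-specific tokens. It is an illustration under stated assumptions, not the module described in this article: the tag names (URL, HASH, AT, EMO), the regular expressions and the function name postprocess are hypothetical, and the input is assumed to be a list of (token, tag) pairs already produced by a tagger such as Frog.

```python
import re

# Hypothetical tag names and patterns; the labels actually added to the
# D-Coi tag set, and the rules used to assign them, may differ.
TWITTER_PATTERNS = [
    ("URL", re.compile(r"^(https?://|www\.)\S+$", re.IGNORECASE)),
    ("HASH", re.compile(r"^#\w+$")),
    ("AT", re.compile(r"^@\w+$")),
    # A small set of Western-style emoticons such as :-) ;) :D =P
    ("EMO", re.compile(r"^[:;=8][-o*']?[)(\]\[dDpP/\\|]$")),
]


def postprocess(tagged_tokens):
    """Replace the base tag of every token that matches a Twitter pattern.

    tagged_tokens: list of (token, tag) pairs from the base tagger.
    Returns a new list; tokens that match no pattern keep their tag.
    """
    result = []
    for token, tag in tagged_tokens:
        for new_tag, pattern in TWITTER_PATTERNS:
            if pattern.match(token):
                tag = new_tag
                break
        result.append((token, tag))
    return result


if __name__ == "__main__":
    # The base tags below are illustrative, not real Frog output.
    base_output = [
        ("@jan", "SPEC(deeleigen)"),
        ("kijkt", "WW(pv,tgw,met-t)"),
        ("#dwdd", "N(soort,ev,basis,zijd,stan)"),
        (":-)", "LET()"),
        ("http://t.co/x1", "SPEC(symb)"),
    ]
    for token, tag in postprocess(base_output):
        print(f"{token}\t{tag}")
```

Keeping the Twitter-specific rules in a separate pass leaves the base tagger untouched, so the same rules could in principle be applied to the output of any tagger that emits token and tag pairs.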