Metadata Induction on a Dutch Twitter Corpus: Initial phases

Hans van Halteren, Radboud University Nijmegen, CLS, Linguistics

Abstract

In this paper, I argue that metadata induction for the TwiNL collection of Dutch tweets should start at a fundamental level, and to that end I address three types of classification. Identification of users tweeting predominantly in Dutch is shown to be possible with an F-value over 98%. For the identification of individual humans (as opposed to, e.g., groups or bots), no classifier is tested, but a number of useful text-based measurements are presented. Finally, the identification of children of school-going age (by far the largest user group in the collection) is shown to be possible on the basis of unigram counts alone with high accuracy (93.5%), increasing to very high accuracy (97%) when at least 1,000 tweets are available.

Published
2015-11-01
How to Cite
van Halteren, H. (2015). Metadata Induction on a Dutch Twitter Corpus: Initial phases. Computational Linguistics in the Netherlands Journal, 5, 37-48. Retrieved from https://clinjournal.org/clinj/article/view/56
Section
Articles
