N-gram Frequencies for Dutch Twitter Data

Gosse Bouma, University of Groningen

Abstract

This paper presents n-gram frequency data obtained from a large sample of Dutch tweets covering a period of four years. After filtering out retweets, (near-)duplicates, and non-Dutch tweets, more than 2.6 billion tweets remained. These were tokenized, and frequencies were collected for n-grams of up to 5 words. A web interface allows users to obtain frequency information for spelling variants, grammatical phenomena (as reflected in n-gram patterns), monthly trends, and word clusters. All the underlying n-gram frequency data, as well as the word clusters, are available for download.
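The counting step described in the abstract can be illustrated with a minimal sketch. It is not the paper's pipeline: the function name `ngram_counts`, the whitespace tokenizer, the "starts with RT" retweet test, and the exact-match duplicate filter are all simplifying assumptions made for illustration, and language identification and near-duplicate detection are left out entirely.

```python
from collections import Counter


def ngram_counts(tweets, max_n=5):
    """Count n-grams of 1..max_n tokens over a list of tweet strings.

    Hypothetical helper: filtering and tokenization are deliberately naive.
    """
    counts = Counter()
    seen = set()
    for tweet in tweets:
        tokens = tweet.split()  # assumption: plain whitespace tokenization
        if not tokens or tokens[0] == "RT":
            continue  # assumption: a leading "RT" marks a retweet
        key = " ".join(tokens)
        if key in seen:
            continue  # drops exact duplicates only, not near-duplicates
        seen.add(key)
        # Slide a window of each size n over the token sequence.
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return counts


sample = [
    "ik ga naar huis",
    "ik ga naar huis",         # exact duplicate, skipped
    "RT @x: ik ga naar huis",  # treated as a retweet, skipped
    "ik ga weg",
]
counts = ngram_counts(sample)
print(counts[("ik", "ga")])
```

Keys are token tuples, so unigrams through 5-grams share a single `Counter`. At the scale reported in the abstract, counts would have to be aggregated on disk or across machines rather than in one in-memory dictionary.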

Published: 2015-11-01

How to Cite: Bouma, G. (2015). N-gram Frequencies for Dutch Twitter Data. Computational Linguistics in the Netherlands Journal, 5, 25-36. Retrieved from https://clinjournal.org/clinj/article/view/55

Section: Articles