How to optimize your Twitter collection: Dutch keywords for better coverage

Tim Kreutz; Walter Daelemans

Abstract

Twitter allows API calls to retrieve one percent of all tweets at any time using a search word list. Since some languages, including Dutch, make up less than one percent of all tweets on average, a large part can be retrieved using the right keywords. This paper systematically assesses keyword lists for finding language-specific tweets. It contributes comparisons to previously suggested collection methods for the Dutch language and establishes the limitations of each. Generating keywords from Dutch tweets and picking 400 based on their precision-weighted recall achieves the best coverage at 91.3%. The list of Dutch keywords is made openly available alongside the code that can be used to generate lists for the collection of other languages or for other tasks that benefit from early filtering, such as event or hate speech detection.
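The keyword selection described in the abstract can be illustrated with a minimal sketch. The abstract does not define precision-weighted recall, so the product of precision and recall used here is an assumption, as are the function names `rank_keywords` and `coverage`, the whitespace tokenization, and the input format of `(text, is_dutch)` pairs. This is not the authors' released code.

```python
from collections import Counter


def rank_keywords(tweets, k=400):
    """Return the k tokens with the highest precision * recall score.

    tweets: list of (text, is_dutch) pairs (hypothetical input format).
    """
    in_dutch = Counter()  # number of Dutch tweets containing each token
    in_all = Counter()    # number of tweets in any language containing it
    n_dutch = 0
    for text, is_dutch in tweets:
        tokens = set(text.lower().split())  # naive whitespace tokenization
        in_all.update(tokens)
        if is_dutch:
            n_dutch += 1
            in_dutch.update(tokens)
    if n_dutch == 0:
        return []
    scores = {}
    for token, hits in in_dutch.items():
        precision = hits / in_all[token]  # share of matching tweets that are Dutch
        recall = hits / n_dutch           # share of Dutch tweets that match
        scores[token] = precision * recall  # assumed form of the weighting
    # Sort by descending score, breaking ties alphabetically for determinism.
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


def coverage(keywords, tweets):
    """Fraction of Dutch tweets containing at least one keyword."""
    keywords = set(keywords)
    dutch = [set(text.lower().split()) for text, is_dutch in tweets if is_dutch]
    if not dutch:
        return 0.0
    return sum(1 for tokens in dutch if tokens & keywords) / len(dutch)
```

Under these assumptions, `coverage(rank_keywords(tweets), tweets)` gives the share of labeled Dutch tweets that a 400-word list would match. Measuring it on held-out tweets rather than the tweets used for ranking gives a less optimistic estimate.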
