How to optimize your Twitter collection
Dutch keywords for better coverage
Abstract
Twitter allows API calls to retrieve one percent of all tweets at any time using a search word list. Since some languages, including Dutch, make up less than one percent of all tweets on average, a large part can be retrieved using the right keywords. This paper systematically assesses keyword lists for nding language-specic tweets. It contributes comparisons to previously suggested collection methods for the Dutch language and establishes the limitations of each. Generating keywords from Dutch tweets and picking 400 based on their precision-weighted recall achieves the
best coverage at 91.3%. The list of Dutch keywords is made openly available alongside the code that can be used to generate lists for the collection of other languages or for other tasks that benet from early ltering such as event or hate speech detection.