Dealing with big data: The case of Twitter
Abstract
As data sets keep growing, computational linguists increasingly face big data problems: very large data sets that place heavy demands on storage and processing. Social media data is a case in point: including metadata, the messages posted to the social media site Twitter in 2012 comprise more than 250 terabytes of structured text. Handling data volumes of this size requires parallel computing architectures with appropriate software tools. In this paper we present our experiences in working with such a big data set: a collection of two billion Dutch tweets. We describe how we collected and stored the data, how we search it using the Hadoop framework, and how we visualize the search results. To assess the usefulness of this tweet analysis resource, we performed three case studies on the data, which are presented in this paper: relating word frequency to real-life events, finding words related to a topic, and gathering information about conversations. Access to this current and continually expanding tweet data set is offered via the website twiqs.nl.