Dealing with big data: The case of Twitter

Erik Tjong Kim Sang; Antal van den Bosch

Authors

Erik Tjong Kim Sang, Netherlands eScience Center, Amsterdam, The Netherlands
Antal van den Bosch, Radboud University Nijmegen, Nijmegen, The Netherlands

Abstract

As data sets keep growing, computational linguists increasingly face big data problems: challenging demands on storage and processing caused by very large data sets. Social media data are one example: including metadata, the messages posted on the social media site Twitter in 2012 comprise more than 250 terabytes of structured text. Handling data volumes like this requires parallel computing architectures with appropriate software tools. In this paper we present our experiences in working with such a big data set, a collection of two billion Dutch tweets. We show how we collected and stored the data. Next, we describe how we search the data using the Hadoop framework and how we visualize the search results. To determine the usefulness of this tweet analysis resource, we performed three case studies based on the data, all presented in this paper: relating word frequency to real-life events, finding words related to a topic, and gathering information about conversations. Access to this current and expanding tweet data set is offered via the website twiqs.nl.
