Towards Identifying Normal Forms for Various Word Form Spellings on Twitter

  • Hans van Halteren, CLST, Radboud University Nijmegen
  • Nelleke Oostdijk, CLST, Radboud University Nijmegen

Abstract

We take a first step towards the annotation of word forms in tweets with normal forms. Such annotation can assist research into spelling variation and the use of standard NLP tools to process tweets. This first step consists of the design of a technique to estimate whether two word forms can be considered variants of one and the same normal form. At this point we are examining word form types in isolation, i.e. without taking the context of individual occurrences into account. We describe a word form similarity measure which combines edit distance with context similarity computed over our whole tweet collection. Furthermore, we present the results of a pilot study, which we executed on 7 gigawords (7Gw) worth of Dutch tweets. We find that, while results are encouraging, various improvements to the similarity estimations are still possible.
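The abstract names the two ingredients of the similarity measure (edit distance and context similarity over the collection) but not how they are computed or combined. The following is a minimal illustrative sketch only, not the authors' method. It assumes plain Levenshtein distance normalised by the longer form, cosine similarity over neighbouring-token counts, and a simple weighted average. The function names, the `window` and `weight` parameters, and the toy tweets are all hypothetical.

```python
from collections import Counter
from math import sqrt


def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance: insert, delete, substitute, each at cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # substitute or match
        prev = curr
    return prev[-1]


def form_similarity(a: str, b: str) -> float:
    """Assumed normalisation: 1 minus distance divided by the longer length."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def context_vectors(tweets, window: int = 1):
    """Count, per word form type, the tokens seen within `window` positions."""
    vectors = {}
    for tweet in tweets:
        tokens = tweet.lower().split()
        for i, tok in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            neighbours = tokens[lo:i] + tokens[i + 1:hi]
            vectors.setdefault(tok, Counter()).update(neighbours)
    return vectors


def cosine(u: Counter, v: Counter) -> float:
    """Cosine similarity of two sparse count vectors; 0.0 if either is empty."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def combined_similarity(a: str, b: str, vectors, weight: float = 0.5) -> float:
    """Hypothetical combination: weighted average of form and context similarity."""
    ctx = cosine(vectors.get(a, Counter()), vectors.get(b, Counter()))
    return weight * form_similarity(a, b) + (1 - weight) * ctx


# Toy data, invented for illustration; "nie" stands in for a variant of "niet".
tweets = ["ik weet het niet", "ik weet het nie", "dat is mooi"]
vectors = context_vectors(tweets)
print(combined_similarity("niet", "nie", vectors))
print(combined_similarity("niet", "mooi", vectors))
```

In this sketch the context vectors are built once per word form type over the whole collection, which matches the abstract's point that types are compared in isolation rather than per occurrence. A pair such as "niet" and "nie" scores higher than an unrelated pair because both the spelling and the neighbouring tokens overlap.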

Published
2012-12-01
How to Cite
van Halteren, H., & Oostdijk, N. (2012). Towards Identifying Normal Forms for Various Word Form Spellings on Twitter. Computational Linguistics in the Netherlands Journal, 2, 2-22. Retrieved from https://clinjournal.org/clinj/article/view/12
Section
Articles