Towards Identifying Normal Forms for Various Word Form Spellings on Twitter
Abstract
We take a first step towards the annotation of word forms in tweets with normal forms. Such annotation can support research into spelling variation and make it easier to process tweets with standard NLP tools. This first step consists of designing a technique to estimate whether two word forms can be considered variants of one and the same normal form. At this point we examine word form types in isolation, i.e. without taking the context of individual occurrences into account. We describe a word form similarity measure which combines edit distance with context similarity computed over our whole tweet collection. Furthermore, we present the results of a pilot study, which we carried out on 7Gw of Dutch tweets. We find that, while the results are encouraging, various improvements to the similarity estimates are still possible.
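To illustrate the kind of measure meant here, the following is a minimal sketch and not the implementation used in the pilot study. It makes several assumptions that the abstract does not state: plain Levenshtein distance normalised by the length of the longer form, context vectors built from the immediate left and right neighbours of a word form, cosine similarity between those vectors, and a simple weighted mean (the hypothetical `weight` parameter) to combine the two scores. All function names and the two example tweets are invented for illustration.

```python
from collections import Counter
from math import sqrt


def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance: insert, delete and substitute all cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution or match
        prev = curr
    return prev[-1]


def form_similarity(a: str, b: str) -> float:
    """Assumed normalisation: 1 minus distance over the longer form's length."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def context_vector(word: str, tweets: list[str]) -> Counter:
    """Count left and right neighbours of `word` over the whole collection."""
    counts: Counter = Counter()
    for tweet in tweets:
        tokens = tweet.split()
        for i, token in enumerate(tokens):
            if token != word:
                continue
            if i > 0:
                counts[("L", tokens[i - 1])] += 1
            if i + 1 < len(tokens):
                counts[("R", tokens[i + 1])] += 1
    return counts


def cosine(u: Counter, v: Counter) -> float:
    """Cosine similarity of two sparse count vectors (0.0 if either is empty)."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def combined_similarity(a: str, b: str, tweets: list[str], weight: float = 0.5) -> float:
    """Assumed combination: weighted mean of form and context similarity."""
    context = cosine(context_vector(a, tweets), context_vector(b, tweets))
    return weight * form_similarity(a, b) + (1.0 - weight) * context


# Invented toy collection: a standard spelling and a variant in the same context.
toy_tweets = ["ik kom misschien morgen", "ik kom mischien morgen"]
score = combined_similarity("misschien", "mischien", toy_tweets)
```

In this toy collection the two forms differ by one character and occur in identical contexts, so `score` is close to 1. A weighted mean is only one way to combine the two signals; how they are actually combined and thresholded is a design decision this sketch leaves open.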