Evaluating the Impact of Word Classes on Cross-Domain Age Detection Models' Performance
Abstract
In this paper, we examine the importance of word category information for age detection, the task of identifying a person's age from their writing, under both in-domain and cross-domain conditions. We remove entire word classes and study the effect of each removal using both Support Vector Machines (SVM) and pre-trained contextual word embeddings (BERT). Through these experiments, we aim to gain insight into how the two approaches handle cross-domain conditions. Our experiments show that SVM relies mainly on content words in the in-domain setting, whereas function words are the most indicative features in the cross-domain setting. BERT, by contrast, relies mainly on highly frequent word classes, such as nouns and punctuation, to make predictions under both in-domain and cross-domain conditions.
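The word-class removal described above can be illustrated with a minimal sketch. It assumes the text has already been part-of-speech tagged into (token, tag) pairs. The function name `remove_word_class`, the Universal-POS-style tag labels, and the example sentence are all hypothetical choices made for illustration, not the tagger, tag inventory, or data used in the paper.

```python
from typing import Iterable, List, Set, Tuple

# A tagged token is a (surface form, part-of-speech tag) pair.
TaggedToken = Tuple[str, str]


def remove_word_class(tagged: Iterable[TaggedToken], removed_tags: Set[str]) -> List[str]:
    """Return the tokens whose tag is not in removed_tags, preserving order."""
    return [token for token, tag in tagged if tag not in removed_tags]


# Hand-tagged example sentence (tags are an assumption, not tagger output).
example = [
    ("She", "PRON"),
    ("quickly", "ADV"),
    ("bought", "VERB"),
    ("a", "DET"),
    ("new", "ADJ"),
    ("phone", "NOUN"),
    ("!", "PUNCT"),
]

# Ablate one word class at a time, as in a removal experiment.
without_nouns = remove_word_class(example, {"NOUN"})
without_punct = remove_word_class(example, {"PUNCT"})

print(" ".join(without_nouns))
print(" ".join(without_punct))
```

In an experiment of this kind, the filtered token sequence would then be passed to the classifier in place of the original text, so that any change in performance can be attributed to the missing word class.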