Unsupervised Text Classification with Neural Word Embeddings
Abstract
This paper presents experiments and results on unsupervised multiclass text classification of news articles, performed by measuring semantic similarity between class labels and texts through neural word embeddings. The experiments were conducted on English and Dutch news texts of varying length, using a wide range of pre-trained (word2vec, GloVe, fastText) and in-domain-trained (word2vec and Doc2Vec) neural embeddings. The paper demonstrates that distance-based multiclass text classification with neural word embeddings can be improved through in-domain training (word2vec and Doc2Vec). Furthermore, we propose two techniques that enrich the class label representation with adjacent words in the embedding space: substituting the class label with a class concept, and augmenting the class label with additional class label instances. We also argue that improved distance-based text classification with neural word embeddings can serve as a fast classification method when labeled data is scarce or class labels change frequently, since it is more computationally efficient than recent NLI-based approaches. Finally, we suggest that this method is especially effective when applied to low-resource languages.
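To make the distance-based idea concrete, the following is a minimal sketch (not the paper's exact pipeline): a document is assigned to the class whose label embedding is most similar, by cosine similarity, to the averaged embedding of the document's words. The pre-trained GloVe model name and the enriched label lists are illustrative assumptions, in the spirit of the label-augmentation technique described above.

```python
import numpy as np
import gensim.downloader as api

# Pre-trained word embeddings via gensim's downloader (assumed model choice).
vectors = api.load("glove-wiki-gigaword-50")

def embed(words):
    """Average the embeddings of in-vocabulary words."""
    vecs = [vectors[w] for w in words if w in vectors]
    return np.mean(vecs, axis=0) if vecs else None

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Class labels augmented with adjacent words in the embedding space
# (hypothetical instances, illustrating the label-enrichment idea).
classes = {
    "sports": ["sports", "football", "tennis", "league"],
    "politics": ["politics", "government", "election", "minister"],
    "economy": ["economy", "market", "trade", "inflation"],
}

def classify(text):
    """Assign the class whose enriched label embedding is closest to the text."""
    doc_vec = embed(text.lower().split())
    scores = {label: cosine(doc_vec, embed(words))
              for label, words in classes.items()}
    return max(scores, key=scores.get)

print(classify("The striker scored twice in the final match"))  # -> "sports"
```

Because classification reduces to a handful of vector averages and dot products per document, no labeled training data or model fine-tuning is required, which is the source of the computational advantage over NLI-based zero-shot classifiers noted above.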