Unsupervised Text Classification with Neural Word Embeddings

Authors

  • Andriy Kosar Textgain / Universiteit Antwerpen
  • Guy De Pauw Textgain
  • Walter Daelemans Universiteit Antwerpen

Abstract

This paper presents experiments and results for unsupervised multiclass text classification of news articles, based on measuring the semantic similarity between class labels and texts through neural word embeddings. The experiments were conducted on English and Dutch news texts of varying lengths, using a wide range of pre-trained (word2vec, GloVe, fastText) and in-domain trained (word2vec and Doc2Vec) neural word embeddings. The paper demonstrates that distance-based multiclass text classification with neural word embeddings can be improved through in-domain training (word2vec and Doc2Vec). Furthermore, we propose two techniques that enrich the class label representation with adjacent words in the embedding space: substituting the class label with a class concept, and augmenting the class label with additional class label instances. We also argue that improved distance-based text classification with neural word embeddings can be employed for fast text classification when labeled data is scarce or class labels change frequently, since it is more computationally efficient than recent NLI-based approaches. Finally, we suggest that this method is especially effective when applied to low-resource languages.
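The core idea of the abstract, classifying a text by the distance between its embedding and the embeddings of the class labels, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the toy embedding table and the "augmented" label word lists are invented for the example, whereas the real experiments use full pre-trained or in-domain word2vec/GloVe/fastText/Doc2Vec models.

```python
import math

# Toy embedding table standing in for a pre-trained model (word2vec, GloVe,
# fastText). All vectors here are illustrative, not real embeddings.
EMBEDDINGS = {
    "sports":  [0.9, 0.1, 0.0],
    "match":   [0.8, 0.2, 0.1],
    "goal":    [0.7, 0.1, 0.2],
    "economy": [0.1, 0.9, 0.1],
    "market":  [0.2, 0.8, 0.2],
    "stocks":  [0.1, 0.7, 0.3],
}

def mean_vector(words):
    """Average the embeddings of the in-vocabulary words in a token list."""
    vecs = [EMBEDDINGS[w] for w in words if w in EMBEDDINGS]
    return [sum(dim) / len(vecs) for dim in zip(*vecs)]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def classify(text_tokens, labels):
    """Assign the label whose word-list representation lies closest to the
    averaged text vector in embedding space."""
    text_vec = mean_vector(text_tokens)
    scores = {lab: cosine(text_vec, mean_vector(words))
              for lab, words in labels.items()}
    return max(scores, key=scores.get)

# Augmented class labels: each label enriched with adjacent words in the
# embedding space (hypothetical nearest neighbours, chosen for illustration).
labels = {
    "sports":  ["sports", "match", "goal"],
    "economy": ["economy", "market", "stocks"],
}
print(classify(["market", "stocks"], labels))  # -> economy
```

The `labels` dictionary illustrates the paper's label-augmentation idea: instead of a single label word, each class is represented by the label plus nearby words in the embedding space, which makes the averaged class vector a more robust target.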

Published

2022-12-22

How to Cite

Kosar, A., De Pauw, G., & Daelemans, W. (2022). Unsupervised Text Classification with Neural Word Embeddings. Computational Linguistics in the Netherlands Journal, 12, 165–181. Retrieved from https://clinjournal.org/clinj/article/view/153

Section

Articles
