All That Glitters is Not Gold: Transfer-learning for Offensive Language Detection in Dutch

Authors

  • Dion Theodoridis, Rijksuniversiteit Groningen
  • Tommaso Caselli, Rijksuniversiteit Groningen

Abstract

Creating datasets for language phenomena to fill gaps in the language resource panorama of specific natural languages is not a trivial task. In this work, we explore the application of transfer learning as a strategy to boost both the creation of language-specific datasets and systems. We use offensive language in Dutch tweets directed at Dutch politicians as a case study. In particular, we trained a multilingual model on the Political Speech Project (Bröckling et al., 2018) dataset to automatically annotate tweets in Dutch. The automatically annotated tweets were then used to further fine-tune a monolingual language model for Dutch (BERTje), adopting different strategies and combinations with manually curated data. Our results show that: (i) transfer learning is an effective strategy to boost the creation of new datasets for specific language phenomena by reducing the annotation effort; (ii) a monolingual language model fine-tuned with automatically annotated data (i.e., silver data) is a competitive baseline against the zero-shot transfer of a multilingual model; and, finally, (iii) less surprisingly, adding automatically annotated data to manually curated data introduces errors and degrades system performance.
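As an illustration of the pipeline the abstract describes, the sketch below shows the two steps with the Hugging Face transformers API: a multilingual classifier silver-labels unannotated Dutch tweets, and BERTje (available on the Hugging Face hub as GroNLP/bert-base-dutch-cased) is then fine-tuned on the silver data. This is a minimal sketch, not the authors' released code: the multilingual backbone (xlm-roberta-base), the binary label set, the toy tweets, and all hyperparameters are illustrative assumptions.

```python
# Minimal sketch of the silver-data pipeline described in the abstract.
# Model names, the binary label set, and the toy data are assumptions for
# illustration; they are not the paper's released artifacts.
import torch
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

# Step 1: zero-shot transfer. A multilingual classifier labels unannotated
# Dutch tweets. (In practice this classifier would first be fine-tuned on
# the Political Speech Project data; here the head is freshly initialized.)
ml_name = "xlm-roberta-base"  # assumed multilingual backbone
ml_tok = AutoTokenizer.from_pretrained(ml_name)
ml_clf = AutoModelForSequenceClassification.from_pretrained(ml_name, num_labels=2)

def silver_label(tweets):
    """Assign silver labels (0 = not offensive, 1 = offensive) to Dutch tweets."""
    enc = ml_tok(tweets, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = ml_clf(**enc).logits
    return logits.argmax(dim=-1).tolist()

class SilverDataset(torch.utils.data.Dataset):
    """Wraps tweets and their silver labels for the Trainer API."""
    def __init__(self, texts, labels, tokenizer):
        self.enc = tokenizer(texts, padding=True, truncation=True)
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.labels[i])
        return item

# Step 2: fine-tune the Dutch monolingual model (BERTje) on the silver data.
nl_name = "GroNLP/bert-base-dutch-cased"  # BERTje on the Hugging Face hub
nl_tok = AutoTokenizer.from_pretrained(nl_name)
nl_clf = AutoModelForSequenceClassification.from_pretrained(nl_name, num_labels=2)

tweets = ["voorbeeldtweet een", "voorbeeldtweet twee"]  # placeholder Dutch tweets
train_ds = SilverDataset(tweets, silver_label(tweets), nl_tok)
args = TrainingArguments(
    output_dir="bertje-offensive",  # assumed output path
    num_train_epochs=3,
    per_device_train_batch_size=16,
)
Trainer(model=nl_clf, args=args, train_dataset=train_ds).train()
```

The paper's finding (iii) would correspond here to mixing the silver-labeled set with a gold, manually curated training set, a combination the authors report degrades performance rather than improving it.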

Published

2022-12-22

How to Cite

Theodoridis, D., & Caselli, T. (2022). All That Glitters is Not Gold: Transfer-learning for Offensive Language Detection in Dutch. Computational Linguistics in the Netherlands Journal, 12, 141–164. Retrieved from https://clinjournal.org/clinj/article/view/152

