RobBERT-2023: Keeping Dutch Language Models Up-To-Date at a Lower Cost Thanks to Model Conversion

Pieter Delobelle; François Remy

Abstract

Pre-training large transformer-based language models on gigantic corpora and later repurposing them as base models for fine-tuning on downstream tasks has proven instrumental to the recent advances in computational linguistics. However, the prohibitively high cost of pre-training often hampers regular updates of base models to incorporate the latest linguistic developments. To address this issue, we present an approach for efficiently producing more powerful and up-to-date versions of RobBERT, our series of Dutch language models, by leveraging existing language models designed for high-resource languages. Unlike the prior versions of RobBERT, which relied on the training methodology of RoBERTa but required a fresh weight initialization, our two RobBERT-2023 models (base and large) are entirely initialized using the RoBERTa family of models. To initialize an embedding table tailored to the newly devised Dutch tokenizer, we rely on a token translation strategy introduced by Remy et al. (2023). Along with our RobBERT-2023 release, we deliver a new Dutch tokenizer trained on the latest version of the Dutch OSCAR corpus. This corpus incorporates new high-frequency terms, such as those related to the COVID-19 pandemic, cryptocurrencies, and the ongoing energy crisis, while mitigating the inclusion of previously over-represented terms from adult-oriented content. To assess the value of RobBERT-2023, we evaluate its performance using the same benchmarks employed for the state-of-the-art RobBERT-2022 model, as well as the newly released Dutch Model Benchmark. Our experimental results demonstrate that RobBERT-2023 not only surpasses its predecessor in various aspects but also achieves these enhancements at a significantly reduced training cost. This work represents a significant step forward in keeping Dutch language models up-to-date and demonstrates the potential of model conversion techniques for reducing the environmental footprint of NLP research.
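
As an illustration of what initializing an embedding table for a new tokenizer from an existing model can look like, here is a minimal Python sketch. It is not the procedure of Remy et al. (2023), whose details are not given in this abstract. The function name `init_target_embeddings`, the `token_map` structure (each new-vocabulary token mapped to weighted source-vocabulary tokens), the weighted-average rule, and the random fallback for unmapped tokens are all assumptions made for the example.

```python
import numpy as np


def init_target_embeddings(source_emb, source_vocab, target_vocab, token_map, seed=0):
    """Build an embedding table for a new vocabulary from an existing one.

    source_emb   : (|V_src|, dim) array of embeddings from the existing model
    source_vocab : dict mapping source token -> row index in source_emb
    target_vocab : dict mapping new (e.g. Dutch) token -> row index in the result
    token_map    : hypothetical dict mapping new token -> list of
                   (source token, weight) pairs, e.g. from a translation dictionary
    """
    rng = np.random.default_rng(seed)
    dim = source_emb.shape[1]
    mean = source_emb.mean(axis=0)
    std = source_emb.std(axis=0)

    target_emb = np.empty((len(target_vocab), dim), dtype=np.float64)
    for token, row in target_vocab.items():
        # Keep only the mapped source tokens that exist in the source vocabulary.
        pairs = [(source_vocab[s], w) for s, w in token_map.get(token, []) if s in source_vocab]
        if pairs:
            ids = [i for i, _ in pairs]
            weights = np.array([w for _, w in pairs], dtype=np.float64)
            # Assumed rule: weighted average of the mapped source embeddings.
            target_emb[row] = (weights[:, None] * source_emb[ids]).sum(axis=0) / weights.sum()
        else:
            # Assumed fallback: sample around the source embedding statistics.
            target_emb[row] = rng.normal(mean, std)
    return target_emb
```

In this sketch, only the embedding table is rebuilt; the remaining transformer weights would be reused from the source model unchanged, which is where the savings over a fresh weight initialization would come from.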
