Bag of Lies: Robustness in Continuous Pre-training BERT

Authors

  • Ine Gevers, University of Antwerp
  • Walter Daelemans, University of Antwerp

Abstract

This study aims to gain more insight into the continuous pre-training phase of BERT with regard to entity knowledge, using the COVID-19 pandemic as a case study. Specifically, we focus on the extent to which entity knowledge can be acquired through continuous pre-training, and how robust this process is. Since the pandemic emerged after the last update of BERT’s pre-training data, the model has little to no prior entity knowledge about COVID-19. Using continuous pre-training, we control what entity knowledge is available to the model. We use a fact-checking benchmark about the entity, namely Check-COVID, as an evaluation framework, comparing a baseline BERT model with continuously pre-trained variants on this task. To test the robustness of continuous pre-training, we experiment with several adversarial methods of manipulating the input data, such as injecting misinformation and shuffling the word order until the input becomes nonsensical. Our findings reveal that these methods do not degrade, and sometimes even improve, the model’s downstream performance. This suggests that continuous pre-training of BERT is robust against these attacks, but that BERT’s acquisition of entity-specific knowledge is susceptible to changes in the writing style of the data. Furthermore, we release a new dataset consisting of original texts from academic publications in the LitCovid repository and their AI-generated (false) counterparts.
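The snippet below is a minimal sketch of the kind of setup the abstract describes: continuing BERT’s masked-language-model pre-training on domain texts, with an optional word-order-shuffling perturbation applied first. It is not the authors’ released code; it assumes the Hugging Face transformers and datasets libraries, and the example texts, checkpoint name, and hyperparameters are illustrative only.

```python
# Sketch of continuous pre-training of BERT with masked language modelling,
# optionally on adversarially shuffled input (assumptions, not the paper's code).
import random

from datasets import Dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)


def shuffle_words(text: str, seed: int = 0) -> str:
    """Perturbation from the abstract: shuffle word order until the input is nonsensical."""
    words = text.split()
    random.Random(seed).shuffle(words)
    return " ".join(words)


# Illustrative corpus; the paper uses texts from the LitCovid repository.
texts = [
    "COVID-19 is caused by the SARS-CoV-2 virus.",
    "Vaccines reduce the risk of severe illness from COVID-19.",
]
perturbed = [shuffle_words(t) for t in texts]  # or keep the original texts

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

dataset = Dataset.from_dict({"text": perturbed}).map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True,
    remove_columns=["text"],
)

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="bert-covid-cpt",  # hypothetical output directory
        num_train_epochs=1,
        per_device_train_batch_size=8,
    ),
    train_dataset=dataset,
    data_collator=collator,
)
trainer.train()

# The continuously pre-trained encoder would then be fine-tuned and evaluated
# on the Check-COVID fact-checking benchmark, as in the paper.
```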

Published

2025-07-15

How to Cite

Gevers, I., & Daelemans, W. (2025). Bag of Lies: Robustness in Continuous Pre-training BERT. Computational Linguistics in the Netherlands Journal, 14, 67–84. Retrieved from https://clinjournal.org/clinj/article/view/187

Section

Articles
