Bag of Lies: Robustness in Continuous Pre-training BERT

Authors

  • Ine Gevers, University of Antwerp
  • Walter Daelemans, University of Antwerp

Abstract

This study aims to gain more insight into the continuous pre-training phase of BERT with regard to entity knowledge, using the COVID-19 pandemic as a case study. Specifically, we focus on the extent to which entity knowledge can be acquired through continuous pre-training, and how robust this process is. Since the pandemic emerged after the last update of BERT’s pre-training data, the model has little to no prior entity knowledge about COVID-19. Using continuous pre-training, we control what entity knowledge is available to the model. We use a fact-checking benchmark about the entity, namely Check-COVID, as an evaluation framework, comparing a baseline BERT model with continuously pre-trained variants on this task. To test the robustness of continuous pre-training, we experiment with several adversarial methods of manipulating the input data, such as injecting misinformation and shuffling the word order until the input becomes nonsensical. Our findings reveal that these methods do not degrade, and sometimes even improve, the model’s downstream performance. This suggests that continuous pre-training of BERT is robust against these attacks, but that BERT’s acquisition of entity-specific knowledge is susceptible to changes in the writing style of the data. Furthermore, we release a new dataset consisting of original texts from academic publications in the LitCovid repository and their AI-generated (false) counterparts.
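The snippet below is a minimal sketch of the kind of setup the abstract describes: continuing BERT’s masked-language-model pre-training on domain texts, with an optional word-order-shuffling perturbation applied first. It is not the authors’ released code; it assumes the Hugging Face transformers and datasets libraries, and the example texts, checkpoint name, and hyperparameters are illustrative only.

```python
# Sketch of continuous pre-training of BERT with masked language modelling,
# optionally on adversarially shuffled input (assumptions, not the paper's code).
import random

from datasets import Dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)


def shuffle_words(text: str, seed: int = 0) -> str:
    """Perturbation from the abstract: shuffle word order until the input is nonsensical."""
    words = text.split()
    random.Random(seed).shuffle(words)
    return " ".join(words)


# Illustrative corpus; the paper uses texts from the LitCovid repository.
texts = [
    "COVID-19 is caused by the SARS-CoV-2 virus.",
    "Vaccines reduce the risk of severe illness from COVID-19.",
]
perturbed = [shuffle_words(t) for t in texts]  # or keep the original texts

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

dataset = Dataset.from_dict({"text": perturbed}).map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True,
    remove_columns=["text"],
)

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="bert-covid-cpt",  # hypothetical output directory
        num_train_epochs=1,
        per_device_train_batch_size=8,
    ),
    train_dataset=dataset,
    data_collator=collator,
)
trainer.train()

# The continuously pre-trained encoder would then be fine-tuned and evaluated
# on the Check-COVID fact-checking benchmark, as in the paper.
```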

Published

2025-07-15

How to Cite

Gevers, I., & Daelemans, W. (2025). Bag of Lies: Robustness in Continuous Pre-training BERT. Computational Linguistics in the Netherlands Journal, 14, 67–84. Retrieved from https://clinjournal.org/clinj/article/view/187

Section

Articles
