MedRoBERTa.nl: A Language Model for Dutch Electronic Health Records

Authors

  • Stella Verkijk, Vrije Universiteit Amsterdam
  • Piek Vossen, Vrije Universiteit Amsterdam

Abstract

This paper presents MedRoBERTa.nl, the first Transformer-based language model for Dutch medical language. We show that, with 13GB of text from Dutch hospital notes, pre-training from scratch results in a better domain-specific language model than further pre-training RobBERT. When extending pre-training on RobBERT, we use a domain-specific vocabulary and re-train the embedding look-up layer. MedRoBERTa.nl, the model trained from scratch, outperforms general language models for Dutch on a medical odd-one-out similarity task, and it already reaches higher performance than those models after only 10k pre-training steps. When fine-tuned, MedRoBERTa.nl also outperforms general language models for Dutch on a task classifying sentences from Dutch hospital notes that contain information about patients' mobility levels.
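The further-pre-training variant described in the abstract (a domain-specific vocabulary plus a re-trained embedding look-up layer) could be sketched with the Hugging Face transformers and tokenizers libraries roughly as below. This is a minimal illustration under assumptions, not the authors' actual pipeline: the corpus path, vocabulary size, and the pdelobelle/robbert-v2-dutch-base checkpoint identifier are placeholders chosen for the example.

    import os
    from transformers import RobertaForMaskedLM, RobertaTokenizerFast
    from tokenizers import ByteLevelBPETokenizer

    # Train a domain-specific byte-level BPE vocabulary on the clinical corpus
    # (file path and vocabulary size are illustrative placeholders).
    os.makedirs("med_vocab", exist_ok=True)
    bpe = ByteLevelBPETokenizer()
    bpe.train(files=["hospital_notes.txt"], vocab_size=52000,
              special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"])
    bpe.save_model("med_vocab")
    tokenizer = RobertaTokenizerFast.from_pretrained("med_vocab")

    # Load the general-domain Dutch model and swap its embedding look-up layer:
    # resize it to the new vocabulary and re-initialize the rows so they can be
    # re-trained on clinical text.
    model = RobertaForMaskedLM.from_pretrained("pdelobelle/robbert-v2-dutch-base")
    model.resize_token_embeddings(len(tokenizer))
    model.get_input_embeddings().weight.data.normal_(mean=0.0, std=0.02)

    # From here, masked-language-model pre-training would continue on the
    # hospital notes, e.g. with transformers.Trainer and
    # DataCollatorForLanguageModeling.

Re-initializing the embedding matrix (rather than only appending rows for new tokens) reflects the abstract's statement that the whole embedding look-up layer was re-trained for the domain-specific vocabulary.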

Published

2021-12-31

How to Cite

Verkijk, S., & Vossen, P. (2021). MedRoBERTa.nl: A Language Model for Dutch Electronic Health Records. Computational Linguistics in the Netherlands Journal, 11, 141–159. Retrieved from https://clinjournal.org/clinj/article/view/132
