BasiLex: an 11.5 million words corpus of Dutch texts written for children


  • Agnes Tellings Radboud University, Nijmegen, The Netherlands
  • Micha Hulsbosch Radboud University, Nijmegen, The Netherlands
  • Anne Vermeer Tilburg University, Tilburg, The Netherlands
  • Antal van den Bosch Radboud University, Nijmegen, The Netherlands


This article discusses BasiLex, a recently finalized 13.5-million-token corpus (11.5 million Dutch words) of written language offered to children of elementary school age. The corpus has been automatically part-of-speech tagged and lemmatized, and a limited number of polysemous words have been disambiguated, in part automatically. A lemma-based lexicon has also been derived from the corpus. The aim of the present article is threefold. First, to describe BasiLex and how it was built, and to discuss its validity. Second, to compare the BasiLex lexicon with two other lexicons with respect to their most frequent words: the Schrooten and Vermeer (1994) lexicon, a small and now outdated Dutch corpus of language addressed to children, and a lexicon derived from SoNaR, a corpus of written language for adults (Oostdijk et al. 2013). Third, to discuss some potential educational applications of BasiLex.




How to Cite

Tellings, A., Hulsbosch, M., Vermeer, A., & van den Bosch, A. (2014). BasiLex: an 11.5 million words corpus of Dutch texts written for children. Computational Linguistics in the Netherlands Journal, 4, 191–208. Retrieved from


