BasiLex: an 11.5 million words corpus of Dutch texts written for children

  • Agnes Tellings, Radboud University, Nijmegen, The Netherlands
  • Micha Hulsbosch, Radboud University, Nijmegen, The Netherlands
  • Anne Vermeer, Tilburg University, Tilburg, The Netherlands
  • Antal van den Bosch, Radboud University, Nijmegen, The Netherlands

Abstract

This article discusses BasiLex, a recently finalized corpus of written language offered to children of elementary school age, comprising 13.5 million tokens and 11.5 million Dutch words. The corpus has been automatically analyzed at the levels of part-of-speech tagging and lemmatization, and a limited number of polysemous words have been disambiguated, partly automatically. In addition, a lemma-based lexicon has been derived from the corpus. The aim of the present article is threefold. First, we describe BasiLex and how it was built, and discuss its validity. Second, we compare the BasiLex lexicon with two other lexicons with respect to their most frequent words: the Schrooten and Vermeer (1994) lexicon, based on a small and now outdated Dutch corpus of language addressed to children, and a lexicon derived from SoNaR, a corpus of adult written language (Oostdijk et al. 2013). Third, we discuss some potential educational applications of BasiLex.

Published
2014-12-01
How to Cite
Tellings, A., Hulsbosch, M., Vermeer, A., & van den Bosch, A. (2014). BasiLex: an 11.5 million words corpus of Dutch texts written for children. Computational Linguistics in the Netherlands Journal, 4, 191-208. Retrieved from https://clinjournal.org/clinj/article/view/50
Section
Articles