BasiLex: an 11.5 million words corpus of Dutch texts written for children

Agnes Tellings; Micha Hulsbosch; Anne Vermeer; Antal van den Bosch

Authors

Agnes Tellings, Radboud University, Nijmegen, The Netherlands
Micha Hulsbosch, Radboud University, Nijmegen, The Netherlands
Anne Vermeer, Tilburg University, Tilburg, The Netherlands
Antal van den Bosch, Radboud University, Nijmegen, The Netherlands

Abstract

This article discusses BasiLex, a recently finalized corpus of 13.5 million tokens (11.5 million Dutch words) of written language offered to children of elementary school age. The corpus has been automatically part-of-speech tagged and lemmatized, and a limited number of polysemous words have been disambiguated, partly automatically. A lemma-based lexicon has also been derived from it. The aim of the present article is threefold. First, we describe BasiLex and how it was built, and discuss its validity. Second, we compare the BasiLex lexicon with two other lexicons with respect to their most frequent words: the Schrooten and Vermeer (1994) lexicon, which is based on a small and now outdated Dutch corpus of language addressed to children, and a lexicon derived from SoNaR, a corpus of adult written language (Oostdijk et al. 2013). Third, we discuss some potential educational applications of BasiLex.
