Automatic Detection and Annotation of Spelling Errors and Orthographic Properties in the Dutch BasiScript Corpus


  • Wieke Noa Harmsen, Centre for Language and Speech Technology
  • Catia Cucchiarini, Centre for Language and Speech Technology
  • Helmer Strik, Centre for Language and Speech Technology


Learning to spell in Dutch is a difficult task that children cannot master autonomously. Developing good spelling skills requires direct instruction in the spelling principles, sufficient practice, and feedback. For direct instruction, a qualitative overview of which spelling errors are made by which types of children at which point in time can be very useful for designing effective spelling lessons. In addition, such an overview would enable large-scale quantitative research on children’s spelling development. In an earlier study, we presented an algorithm that automatically detects and annotates spelling errors in BasiScript, a corpus of texts written by primary school children. In the present study, we extended this algorithm so that it can also annotate written words with their orthographic properties. In practice, this means that correctly spelled letters are detected and annotated with the spelling principle that was applied correctly. These additional annotations allow us to compute relative scores, which show how often a spelling principle is applied incorrectly relative to the total occurrence frequency of that principle. Using this relative frequency measure, we found that spelling principles from the syntax and semantics category are more difficult for Dutch primary school children between second and sixth grade to learn than phoneme-to-grapheme conversion, context, and morphology spelling principles. These children have particular difficulty with spelling principles concerning capital letter use, the writing of present participles, and past participles that end in “d” although a /t/ sound is heard. In this paper, we first describe the implementation of the algorithm, its evaluation, and the computation of the relative frequency in more detail. We then discuss the results and possible limitations of our study and address avenues for future research.
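
The relative score described in the abstract can be illustrated with a minimal sketch. It assumes that each annotated occurrence of a spelling principle is available as a pair of a principle label and a correctness flag. The function name, the principle labels, and the toy data below are hypothetical and are not taken from the paper or from BasiScript.

```python
from collections import Counter


def relative_error_scores(annotations):
    """Return, per spelling principle, the share of its occurrences that are errors.

    `annotations` is an iterable of (principle, is_correct) pairs. This input
    format is an assumption made for illustration only.
    """
    totals = Counter()  # all occurrences of each principle
    errors = Counter()  # occurrences where the principle was applied incorrectly
    for principle, is_correct in annotations:
        totals[principle] += 1
        if not is_correct:
            errors[principle] += 1
    # Relative score = incorrect applications / total occurrences.
    return {p: errors[p] / totals[p] for p in totals}


# Hypothetical toy annotations, not data from the corpus.
toy_annotations = [
    ("capital_letter", False),
    ("capital_letter", True),
    ("past_participle_d", False),
    ("past_participle_d", True),
    ("past_participle_d", True),
    ("past_participle_d", True),
]

scores = relative_error_scores(toy_annotations)
for principle, score in sorted(scores.items()):
    print(f"{principle}: {score:.2f}")
```

Normalising by the total occurrence frequency is what makes principles comparable: a rarely used principle with few absolute errors can still turn out to be harder than a frequent one with many.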




How to Cite

Harmsen, W. N., Cucchiarini, C., & Strik, H. (2021). Automatic Detection and Annotation of Spelling Errors and Orthographic Properties in the Dutch BasiScript Corpus. Computational Linguistics in the Netherlands Journal, 11, 281–306. Retrieved from