Automating lexical simplification in Dutch

  • Bram Bulté, Centre for Computational Linguistics, KU Leuven, Belgium
  • Leen Sevens, Centre for Computational Linguistics, KU Leuven, Belgium
  • Vincent Vandeghinste, Instituut voor de Nederlandse Taal

Abstract

We discuss the design, development and evaluation of an automated lexical simplification tool for Dutch. A basic pipeline approach is used to perform both text adaptation and annotation. First, sentences are preprocessed and word sense disambiguation is performed. Then, the difficulty of each token is estimated from its average age of acquisition and its frequency in a corpus of simplified Dutch. We use Cornetto to find synonyms of words that have been identified as difficult and the SONAR500 corpus to perform reverse lemmatisation. Finally, we rely on a large-scale language model to verify whether the selected replacement word fits the local context. In addition, the text is augmented with information from Wikipedia (word definitions and links). We tune and evaluate the system with sentences taken from the Flemish newspaper De Standaard. The results show that the system’s adaptation component has low coverage, correctly simplifying only around one in five ‘difficult’ words, but reasonable accuracy, with no grammatical errors being introduced in the text. The Wikipedia annotations have broader coverage, but their potential for simplification needs to be further developed and more thoroughly evaluated.
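
The adaptation steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the dictionaries, the two thresholds, the rule that either criterion alone marks a word as difficult, and the names `is_difficult`, `simplify` and `fits_context` are all assumptions made for illustration. Small hand-written tables stand in for the age-of-acquisition data, the simplified-Dutch corpus and Cornetto, a caller-supplied function stands in for the language model check, and word sense disambiguation and reverse lemmatisation are left out.

```python
from typing import Callable, Dict, List

# Hypothetical toy resources (illustrative values only, not real norms or counts).
AOA: Dict[str, float] = {"auto": 4.0, "voertuig": 8.5, "kopen": 4.5, "aanschaffen": 9.0}
SIMPLE_FREQ: Dict[str, int] = {"auto": 900, "voertuig": 12, "kopen": 700, "aanschaffen": 5}
SYNONYMS: Dict[str, List[str]] = {"voertuig": ["auto"], "aanschaffen": ["kopen"]}

# Assumed thresholds; the abstract does not state how the two measures are combined.
AOA_THRESHOLD = 7.0
FREQ_THRESHOLD = 50

# Signature of the context check: (left context, candidate, right context) -> bool.
ContextCheck = Callable[[List[str], str, List[str]], bool]


def is_difficult(word: str) -> bool:
    """Flag a word as difficult if it is acquired late or rare in simple text."""
    if word not in AOA or word not in SIMPLE_FREQ:
        return False  # toy choice: without evidence, leave the word alone
    return AOA[word] > AOA_THRESHOLD or SIMPLE_FREQ[word] < FREQ_THRESHOLD


def simplify(tokens: List[str], fits_context: ContextCheck) -> List[str]:
    """Replace each difficult token by its easiest synonym that fits the context."""
    output: List[str] = []
    for i, token in enumerate(tokens):
        replacement = token
        if is_difficult(token):
            # Keep only synonyms we have data for and that are themselves easy.
            candidates = [s for s in SYNONYMS.get(token, [])
                          if s in AOA and not is_difficult(s)]
            candidates.sort(key=lambda s: AOA[s])  # earliest-acquired first
            for candidate in candidates:
                if fits_context(tokens[:i], candidate, tokens[i + 1:]):
                    replacement = candidate
                    break
        output.append(replacement)
    return output


if __name__ == "__main__":
    sentence = ["ik", "wil", "een", "voertuig", "aanschaffen"]
    # A permissive stand-in for the language model check.
    print(simplify(sentence, lambda left, cand, right: True))
```

Passing the context check as a function keeps the sketch independent of any particular language model; a check that rejects every candidate leaves the sentence unchanged, which mirrors the conservative behaviour reported in the abstract (low coverage, but no errors introduced).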

Published
2018-12-01
How to Cite
Bulté, B., Sevens, L., & Vandeghinste, V. (2018). Automating lexical simplification in Dutch. Computational Linguistics in the Netherlands Journal, 8, 24-48. Retrieved from https://clinjournal.org/clinj/article/view/78
Section
Articles