Automating lexical simplification in Dutch

Bram Bulté; Leen Sevens; Vincent Vandeghinste

Authors

Bram Bulté Centre for Computational Linguistics, KU Leuven, Belgium
Leen Sevens Centre for Computational Linguistics, KU Leuven, Belgium
Vincent Vandeghinste Instituut voor de Nederlandse Taal

Abstract

We discuss the design, development and evaluation of an automated lexical simplification tool for Dutch. A basic pipeline approach is used to perform both text adaptation and annotation. First, sentences are preprocessed and word sense disambiguation is performed. Then, the difficulty of each token is estimated by looking at their average age of acquisition and frequency in a corpus of simplified Dutch. We use Cornetto to find synonyms of words that have been identified as difficult and the SONAR500 corpus to perform reverse lemmatisation. Finally, we rely on a largescale language model to verify whether the selected replacement word fits the local context. In addition, the text is augmented with information from Wikipedia (word definitions and links). We tune and evaluate the system with sentences taken from the Flemish newspaper De Standaard. The results show that the system’s adaptation component has low coverage, since it only correctly simplifies around one in five ‘difficult’ words, but reasonable accuracy, with no grammatical errors being introduced in the text. The Wikipedia annotations have a broader coverage, but their potential for simplification needs to be further developed and more thoroughly evaluated.

Automating lexical simplification in Dutch

Authors

Abstract

Downloads

Published

How to Cite

Issue

Section

Most read articles by the same author(s)