Optimising Controllable Sentence Simplification for Dutch
Abstract
The concept of Easy Language (Vandeghinste et al. 2021) involves the use of simple text that avoids complex grammatical constructions and difficult vocabulary. Recent approaches (Seidl and Vandeghinste 2024) have shown promising results for text simplification using the pre-trained encoder-decoder T5 model (Raffel et al. 2020). This paper investigates new control tokens with a Dutch T5 large language model and uses BERTje (de Vries et al. 2019) to predict sentence-dependent control token values based on each input instance and the desired output complexity. Control tokens steer the splitting and reformulation of the simplified sentence and thereby control the degree of simplification (Sheang et al. 2022). Instead of assigning fixed values to the control tokens, we take the characteristics and complexity of each complex sentence into account. Agrawal and Carpuat (2023) show that this approach improves the quality and controllability of the simplified outputs compared to using standardised control values. Our dataset consists of selected parallel (complex–simple) sentence pairs from the LEESPLANK dataset. The introduction of new control tokens did not enhance the model’s ability to simplify sentences, but using BERTje to predict control token values for a given complex sentence resulted in better performance and more accurate sentence simplification.
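As a rough illustration of the control-token mechanism referred to above (Sheang et al. 2022), the sketch below prepends control tokens to a complex input sentence before generation with a T5-style model. The model checkpoint, token names, and token values are illustrative assumptions, not the exact configuration used in this work.

```python
# Minimal sketch of control-token-prefixed sentence simplification.
# Checkpoint name, token names, and values are illustrative assumptions.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

MODEL_NAME = "yhavinga/t5-base-dutch"  # placeholder Dutch T5 checkpoint (assumption)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)


def simplify(sentence: str, nb_chars: float = 0.8, lev_sim: float = 0.7) -> str:
    """Prefix the complex sentence with control tokens and generate a simplification.

    nb_chars: target ratio of output length to input length.
    lev_sim:  target character-level similarity between input and output.
    (ACCESS-style token names; hypothetical, not the paper's exact tokens.)
    """
    prefix = f"<NbChars_{nb_chars}> <LevSim_{lev_sim}> "
    inputs = tokenizer(prefix + sentence, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=128, num_beams=4)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


print(simplify("De ingewikkelde zin die vereenvoudigd moet worden."))
```

In the sentence-dependent setting described in the abstract, the fixed values passed to `simplify` would instead be predicted per input sentence by a BERTje-based regressor, conditioned on the complex sentence and the desired output complexity.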