Linguistic proxies of readability: Comparing easy-to-read and regular newspaper Dutch
Abstract
The aim of this study is to identify linguistic proxies of readability in Dutch, i.e. those linguistic features that characterize a text as easy to read. To this end, we compare the Wablieft corpus (Vandeghinste et al. 2019), which consists of Flemish easy-to-read newspaper archives, to articles that appeared in the regular Flemish newspaper De Standaard, using a wide range of lexical, syntactic and readability metrics. We test which of these metrics shows the largest effect size and which combinations of metrics work best in a classification task predicting whether an article belongs to Wablieft or De Standaard. The results indicate that the best linguistic proxy for readability is, unsurprisingly, the average number of words per sentence. Traditional readability metrics score well, although a combination of the parameters that constitute these metrics scores better in logistic regression than the original metrics do.
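As an illustration of the kind of comparison described above, the following minimal Python sketch computes the average number of words per sentence for a text and an effect size between two groups of such values. It rests on assumptions that are not taken from the study: the sentence splitter is a naive regular expression, the effect size is Cohen's d with a pooled standard deviation (the abstract does not say which effect size measure is used), and the function names and sample sentence are invented for illustration.

```python
import re
from statistics import mean, stdev


def avg_words_per_sentence(text: str) -> float:
    """Average number of whitespace-separated words per sentence.

    Assumption: sentences end in '.', '!' or '?' followed by whitespace.
    A proper tokenizer would handle abbreviations and quotes better.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return mean(len(s.split()) for s in sentences)


def cohens_d(a, b):
    """Cohen's d for two samples, using a pooled standard deviation.

    Assumption: this is one common effect size measure, chosen here
    only for illustration. Each sample needs at least two values.
    """
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2) / (na + nb - 2)
    return (mean(a) - mean(b)) / pooled_var ** 0.5


# Invented toy sentence, not taken from either corpus.
sample = "Dit is kort. Dit is ook kort."
print(avg_words_per_sentence(sample))
```

In practice, one would compute the metric once per article in each corpus and pass the two resulting lists of values to `cohens_d`; a larger absolute value indicates that the metric separates the two corpora more strongly.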
