Stylometric Text Analysis for Dutch-speaking Adolescents with Autism Spectrum Disorder
Abstract
One of the main characteristics of individuals with autism spectrum disorder (ASD) is a deficit in social communication. The effects of ASD on both verbal and non-verbal communication are widely researched in this respect. In this exploratory study, we investigate whether texts written by Dutch-speaking adolescents with ASD (aged 12-18 years) are (automatically) distinguishable from texts written by typically developing peers. First, we want to reveal whether specific characteristics can be found in the writing style of adolescents with ASD, and secondly, we examine whether these features can be used in an automated classification task. We look for surface features (word and character n-grams, and simple linguistic metrics), but also for deep linguistic features (namely syntactic, semantic and discourse features). The differences between the ASD group and the control group are tested for statistical significance, and we show that mainly syntactic features differ between the groups, possibly indicating a less dynamic writing style for adolescents with ASD. For the classification task, a Logistic Regression classifier is used. With a surface feature approach, we reached an F-score of 72.15%, well above the random baseline of 50%. However, a pure n-gram-based approach relies heavily on content and runs the risk of detecting topics instead of style, which argues for the use of deeper linguistic features. The best combination in the deep feature approach originally reached an F-score of just 62.14%, which could not be improved by automatic feature selection. However, by taking into account the information from the statistical analysis and using only the features that were significant or trending, we could equal the surface-feature performance and again reached an F-score of 72.15%. This suggests that a carefully composed set of deep features is as informative as surface-feature word and character n-grams. Moreover, combining surface and deep features resulted in a slight increase in F-score to 72.33%.
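To make the surface features and the evaluation metric concrete, the following is a minimal sketch in Python, not the pipeline used in the study. It assumes naive whitespace tokenization and lowercasing, and it assumes the reported F-score is the balanced F1 variant; the function names `char_ngrams`, `word_ngrams` and `f_score` are hypothetical and chosen for illustration only.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count overlapping character n-grams in a text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def word_ngrams(text, n):
    """Count overlapping word n-grams.

    Assumption: tokens are obtained by lowercasing and splitting on
    whitespace, which is a simplification of real tokenization.
    """
    tokens = text.lower().split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def f_score(tp, fp, fn):
    """Balanced F-score (F1): harmonic mean of precision and recall.

    Assumption: the F-score in the text is F1; the abstract does not
    state which variant is used.
    """
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

Counts of this kind would typically be turned into a feature vector per text and passed to a classifier such as logistic regression. Because such counts are driven by the words that occur in a text, they also capture topic, which is the risk the abstract raises for a purely n-gram-based approach.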