Computational Linguistics in the Netherlands Journal
https://clinjournal.org/clinj

The Computational Linguistics in the Netherlands Journal (CLIN Journal) provides an international forum for the electronic publication of high-quality scholarly articles in all areas of computational linguistics, language and speech technology. All published papers are open access and freely available online. The focus of the Journal is on computational linguistics research on Dutch and its variants.

CLIN Journal is linked to the yearly CLIN conference and accepts submissions of full papers based on research presented at the conference. These papers are rigorously reviewed by members of the editorial board and additional expert reviewers; when accepted, they are published in a volume of the journal, with the organizer(s) of the corresponding CLIN conference serving as guest editor(s).

ISSN: 2211-4009.

Preface
https://clinjournal.org/clinj/article/view/168
Jens Lemmens, Lisa Hilte, Jens Van Nooten, Maxime De Bruyn, Pieter Fivez, Ine Gevers, Jeska Buhmann, Ehsan Lotfi, Nicolae Banari, Nerses Yuzbashyan, Walter Daelemans
Copyright (c) 2024 Computational Linguistics in the Netherlands Journal | Published 2024-03-21 | Vol. 13, pp. 3–5

Personality Style Recognition via Machine Learning: Identifying Anaclitic and Introjective Personality Styles from Patients’ Speech
https://clinjournal.org/clinj/article/view/169
In disentangling the heterogeneity observed in psychopathology, the personality of the patient is considered crucial. While it has been demonstrated that personality traits are reflected in the language used by a patient, we hypothesize that this enables automatic inference of the personality type directly from speech utterances, potentially more accurately than through a traditional questionnaire-based approach explicitly designed for personality classification. To validate this hypothesis, we adopt natural language processing (NLP) and standard machine learning tools for classification. We test this on a dataset of recorded clinical diagnostic interviews (CDI) on a sample of 79 patients diagnosed with major depressive disorder (MDD) – a condition for which differentiated treatment based on personality styles has been advocated – and classified into anaclitic and introjective personality styles. We start by analyzing the interviews to see which linguistic features are associated with each style, in order to gain a better understanding of the styles. Then, we develop automatic classifiers based on (a) standardized questionnaire responses; (b) basic text features, i.e., TF-IDF scores of words and word sequences; (c) more advanced text features, using LIWC (linguistic inquiry and word count) and context-aware features using BERT (bidirectional encoder representations from transformers); and (d) audio features. We find that automated classification with language-derived features (i.e., based on LIWC) significantly outperforms questionnaire-based classification models. Furthermore, the best performance is achieved by combining LIWC with the questionnaire features.
This suggests that more work should be put into developing linguistically based automated techniques for characterizing personality, although questionnaires still complement such methods to some extent.
Semere Kiros Bitew, Vincent Schelstraete, Klim Zaporojets, Kimberly Van Nieuwenhove, Reitske Meganck, Chris Develder
Copyright (c) 2024 Computational Linguistics in the Netherlands Journal | Published 2024-03-21 | Vol. 13, pp. 7–29

Controllable Sentence Simplification in Dutch
https://clinjournal.org/clinj/article/view/171
Text simplification aims to reduce complexity in vocabulary and syntax, enhancing the readability and comprehension of text. This paper presents a supervised sentence simplification approach for Dutch using a pre-trained large language model (T5). Given the absence of a parallel corpus in Dutch, a synthetic dataset is generated from established parallel corpora. The implementation incorporates a sentence-level discrete parametrization mechanism, enabling control over the simplification features. The model’s output can be tailored to different simplification scenarios and target audiences by incorporating control tokens into the training data. The controlled attributes include sentence length, word length, paraphrasing, and lexical and syntactic complexity. This work contributes a dedicated set of control tokens tailored to the Dutch language. It shows that significant simplification can be achieved using a synthetic dataset with as few as 2,000 parallel rows, although optimal performance requires a minimum of 10,000 rows. The fine-tuned model achieves a 36.85 SARI score on the test set, supporting its effectiveness in the simplification process. This research contributes to the field of sentence simplification by discussing the implementation of a supervised simplification approach for Dutch. The findings highlight the potential of synthetic datasets and control tokens in achieving effective simplification, despite the lack of a parallel corpus in the target language.
Theresa Seidl, Vincent Vandeghinste
Copyright (c) 2024 Computational Linguistics in the Netherlands Journal | Published 2024-03-21 | Vol. 13, pp. 31–61

Benchmarking Zero-Shot Text Classification for Dutch
https://clinjournal.org/clinj/article/view/172
The advent and popularisation of Large Language Models (LLMs) have given rise to prompt-based Natural Language Processing (NLP) techniques which eliminate the need for large manually annotated corpora and computationally expensive supervised training or fine-tuning processes. Zero-shot learning in particular presents itself as an attractive alternative to the classical train-development-test paradigm for many downstream tasks, as it provides a quick and inexpensive way of directly leveraging the implicitly encoded knowledge in LLMs. Despite the large interest in zero-shot applications within the domain of NLP as a whole, there is often no consensus on the methodology, analysis and evaluation of zero-shot pipelines. As a tentative step towards finding such a consensus, this work provides a detailed overview of available methods, resources, and caveats for zero-shot prompting within the Dutch language domain. At the same time, we present centralised zero-shot benchmark results on a large variety of Dutch NLP tasks using a series of standardised datasets.
These tasks vary in subjectivity and domain, ranging from more social information extraction tasks (sentiment, emotion and irony detection for social media) to factual tasks (news topic classification and event coreference resolution). To ensure that the benchmark results are representative, we investigated a selection of zero-shot methodologies for a variety of state-of-the-art Dutch Natural Language Inference (NLI) models, Masked Language Models (MLM), and autoregressive language models. The output on each test set was compared to the best performance achieved using supervised methods. Our findings indicate that task-specific fine-tuning delivers superior performance in all but one task (emotion detection). In the zero-shot setting, we observed that large generative models used through prompting seem to outperform NLI models, which in turn perform better than the MLM approach. Finally, we note several caveats and challenges tied to using zero-shot learning in application settings. These include, but are not limited to, properly streamlining the evaluation of zero-shot output, parameter efficiency compared to standard fine-tuned models, and prompt optimization.
Loic De Langhe, Aaron Maladry, Bram Vanroy, Luna De Bruyne, Pranaydeep Singh, Els Lefever
Copyright (c) 2024 Computational Linguistics in the Netherlands Journal | Published 2024-03-21 | Vol. 13, pp. 63–90

Comparative Evaluation of Topic Detection: Humans vs. LLMs
https://clinjournal.org/clinj/article/view/173
This research explores topic detection and naming in news texts, conducting a comparative study involving human participants from Ukraine, Belgium, and the USA, alongside Large Language Models (LLMs). In the first experiment, 109 participants from diverse backgrounds assigned topics to three news texts each. The findings revealed significant variations in topic assignment and naming, emphasizing the need for nuanced evaluative metrics beyond simple binary matches. The second experiment engaged eight native speakers and six LLMs to determine and name topics for seven news texts. A jury of four experts anonymously assessed these topic names, evaluating them based on criteria such as relevance, completeness, clarity, and correctness. Detailed results shed light on the potential of LLMs in topic detection, stressing the importance of acknowledging and accommodating the inherent diversity and subjectivity in topic identification, while also proposing criteria for evaluating their application in both detecting and naming topics.
Andriy Kosar, Guy De Pauw, Walter Daelemans
Copyright (c) 2024 Computational Linguistics in the Netherlands Journal | Published 2024-03-21 | Vol. 13, pp. 91–120

Detecting Dialect Features Using Normalised Pointwise Information
https://clinjournal.org/clinj/article/view/177
Feature extraction refers to the identification of important features which differentiate one dialect group from another. It is an important step in understanding dialectal variation, a step which has traditionally been done manually. However, manual extraction of important features is susceptible to the following problems: it is a time-consuming task; there is a risk of overlooking certain features; and every analyst can come up with a different set of features. In this paper we compare two earlier automatic approaches to dialect feature extraction, namely Factor Analysis (Pickl 2016) and Prokić et al.’s (2012) method based on Fisher’s Linear Discriminant.
We also introduce a new method based on Normalised Pointwise Mutual Information (nPMI), which outperforms the other methods on the tested data set.
H. W. Matthew Sung, Jelena Prokić
Copyright (c) 2024 Computational Linguistics in the Netherlands Journal | Published 2024-03-21 | Vol. 13, pp. 121–145

Historical Dutch Spelling Normalization with Pretrained Language Models
https://clinjournal.org/clinj/article/view/178
The Dutch language has undergone several spelling reforms since the 19th century. Normalizing 19th-century Dutch spelling to its modern equivalent can increase performance on various NLP tasks, such as machine translation or entity tagging. Van Cranenburgh and van Noord (2022) presented a rule-based system to normalize historical Dutch texts to their modern equivalent, but building and extending such a system requires careful engineering to ensure good coverage while not introducing incorrect normalizations. Recently, pretrained language models have become state-of-the-art for most NLP tasks. In this paper, we combine these approaches by building sequence-to-sequence language models trained on automatically corrected texts from the rule-based system (i.e., silver data). We experiment with several types of language models and approaches. First, we fine-tune two T5 models: Flan-T5 (Chung et al., 2022), an instruction-fine-tuned multilingual version of the original T5, and ByT5 (Xue et al., 2022), a token-free model which operates directly on the raw text and characters. Second, we pretrain ByT5 with the pretraining data used for BERTje (de Vries et al., 2019) and fine-tune this model afterward. For evaluation, we use three manually corrected novels from the same source and compare all trained models with the original rule-based system used to generate the training data. This allows for a direct comparison between the rule-based and pretrained language models to analyze which yields the best performance. Our pretrained ByT5 model fine-tuned with our largest fine-tuning dataset achieved the best results on all three novels. This model not only outperformed the rule-based system, but also made generalizations beyond the training data. In addition to an intrinsic evaluation of the spelling normalization itself, we also perform an extrinsic evaluation on downstream tasks, namely parsing and coreference. Results show that the neural system tends to outperform the rule-based method, although the differences are small. All code, data, and models used in this paper are available at https://github.com/andreasvc/neuralspellnorm.
Andre Wolters, Andreas Van Cranenburgh
Copyright (c) 2024 Computational Linguistics in the Netherlands Journal | Published 2024-03-21 | Vol. 13, pp. 147–171

Exploring LLMs’ Capabilities for Error Detection in Dutch L1 and L2 Writing Products
https://clinjournal.org/clinj/article/view/179
This research examines the capabilities of Large Language Models for writing error detection, which can be seen as a first step towards automated writing support. Our work focuses on Dutch writing error detection, targeting two envisaged end-users: L1 and L2 adult speakers of Dutch. We relied on proprietary L1 and L2 datasets comprising writing products annotated with a variety of writing errors. Following the recent paradigms in NLP research, we experimented with both a fine-tuning approach combining different mono- (BERTje, RobBERT) and multilingual (mBERT, XLM-RoBERTa) models, as well as a zero-shot approach through prompting a generative autoregressive language model (GPT-3.5).
The results reveal that the fine-tuning approach outperforms the zero-shot approach by a large margin, for both L1 and L2, even though much room for improvement remains.
Joni Kruijsbergen, Serafina Van Geertruyen, Véronique Hoste, Orphée De Clercq
Copyright (c) 2024 Computational Linguistics in the Netherlands Journal | Published 2024-03-21 | Vol. 13, pp. 173–191

RobBERT-2023: Keeping Dutch Language Models Up-To-Date at a Lower Cost Thanks to Model Conversion
https://clinjournal.org/clinj/article/view/180
Pre-training large transformer-based language models on gigantic corpora and later repurposing them as base models for fine-tuning on downstream tasks has proven instrumental to the recent advances in computational linguistics. However, the prohibitively high cost associated with pre-training often hampers the regular updates of base models to incorporate the latest linguistic developments. To address this issue, we present an innovative approach for efficiently producing more powerful and up-to-date versions of RobBERT, our series of cutting-edge Dutch language models, by leveraging existing language models designed for high-resource languages. Unlike the prior versions of RobBERT, which relied on the training methodology of RoBERTa but required a fresh weight initialization, our two RobBERT-2023 models (base and large) are entirely initialized using the RoBERTa family of models. To initialize an embedding table tailored to the newly devised Dutch tokenizer, we rely on a token translation strategy introduced by Remy et al. (2023). Along with our RobBERT-2023 release, we deliver a freshly pre-trained Dutch tokenizer using the latest version of the Dutch OSCAR corpus. This corpus incorporates new high-frequency terms, such as those related to the COVID-19 pandemic, cryptocurrencies, and the ongoing energy crisis, while mitigating the inclusion of previously over-represented terms from adult-oriented content. To assess the value of RobBERT-2023, we evaluate its performance using the same benchmarks employed for the state-of-the-art RobBERT-2022 model, as well as the newly released Dutch Model Benchmark. Our experimental results demonstrate that RobBERT-2023 not only surpasses its predecessor in various aspects but also achieves these enhancements at a significantly reduced training cost. This work represents a significant step forward in keeping Dutch language models up-to-date and demonstrates the potential of model conversion techniques for reducing the environmental footprint of NLP research.
Pieter Delobelle, François Remy
Copyright (c) 2024 Computational Linguistics in the Netherlands Journal | Published 2024-03-21 | Vol. 13, pp. 193–203

Beyond Perplexity: Examining Temporal Generalization in Large Language Models via Definition Generation
https://clinjournal.org/clinj/article/view/181
The advent of large language models (LLMs) has significantly improved performance across various Natural Language Processing tasks. However, the performance of LLMs has been shown to deteriorate over time, indicating a lack of temporal generalization. To date, performance deterioration of LLMs is primarily attributed to the factual changes in the real world over time. However, not only the facts of the world, but also the language we use to describe it constantly changes. Recent studies have indicated a relationship between performance deterioration and semantic change. This is typically measured using perplexity scores and relative performance on downstream tasks. Yet, perplexity and accuracy do not explain the effects of temporally shifted data on LLMs in practice. In this work, we propose to assess lexico-semantic temporal generalization of a language model by exploiting the task of contextualized word definition generation. This in-depth semantic assessment enables interpretable insights into the possible mistakes a model may perpetrate due to meaning shift, and can be used to complement more coarse-grained measures like perplexity scores. To assess how semantic change impacts performance, we design the task by differentiating between semantically stable, changing, and emerging target words, and experiment with T5-base, fine-tuned for contextualized definition generation. Our results indicate that (i) the model’s performance deteriorates for the task of contextualized word definition generation, (ii) the performance deteriorates more for semantically changing words compared to semantically stable words, (iii) the model exhibits significantly lower performance and potential bias for emerging words, and (iv) the performance does not correlate with cross-entropy or (pseudo-)perplexity scores. Overall, our results show that definition generation can be a promising task to assess a model’s capacity for temporal generalization with respect to semantic change.
Iris Luden, Mario Giulianelli, Raquel Fernández
Copyright (c) 2024 Computational Linguistics in the Netherlands Journal | Published 2024-03-21 | Vol. 13, pp. 205–232

The CLIN33 Shared Task on the Detection of Text Generated by Large Language Models
https://clinjournal.org/clinj/article/view/182
The Shared Task for CLIN33 focuses on a relatively novel yet societally relevant task: the detection of text generated by Large Language Models (LLMs). We frame this detection task as a binary classification problem (LLM-generated or not), using test data from up to 6 different domains and text genres for both Dutch and English. Part of this test data was held out entirely from the contestants, including a “mystery genre” which belonged to an unknown domain (later revealed to be columns). Four teams submitted 11 runs with substantially different models and features. This paper gives an overview of our task setup and contains the evaluation and detailed descriptions of the participating systems.
Notably, the winning systems include both deep learning models and more traditional machine learning models leveraging task-specific feature engineering.
Pieter Fivez, Walter Daelemans, Tim Van de Cruys, Yury Kashnitsky, Savvas Chamezopoulos, Hadi Mohammadi, Anastasia Giachanou, Ayoub Bagheri, Wessel Poelman, Juraj Vladika, Esther Ploeger, Johannes Bjerva, Florian Matthes, Hans van Halteren
Copyright (c) 2024 Computational Linguistics in the Netherlands Journal | Published 2024-03-21 | Vol. 13, pp. 233–259
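
For readers who want a concrete picture of the binary classification framing described in the shared-task overview above, the sketch below shows a generic "traditional" baseline in Python (scikit-learn): character n-gram TF-IDF features feeding a logistic regression classifier. It is purely illustrative; the toy texts, labels, and feature choices are assumptions for demonstration and do not correspond to the shared-task data or to any participating system.

# Minimal illustrative sketch (not any participant's system): detection framed as
# binary classification (1 = LLM-generated, 0 = human-written), using only
# task-agnostic surface features. Toy placeholder data, not the shared-task corpus.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

train_texts = [
    "De resultaten tonen aan dat het model goed presteert op deze taak.",
    "gisteren nog even langs de bakker geweest, wat een gedoe zeg",
]
train_labels = [1, 0]  # placeholder labels: LLM-generated vs. human-written

# Character n-gram TF-IDF features plus a linear classifier; simple feature
# engineering of this kind can be competitive with deep learning detectors.
detector = Pipeline([
    ("tfidf", TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 5), min_df=1)),
    ("clf", LogisticRegression(max_iter=1000)),
])
detector.fit(train_texts, train_labels)

# Predict on held-out texts, e.g. documents from an unseen "mystery" domain.
test_texts = ["Dit artikel geeft een overzicht van de taakopzet en de evaluatie."]
print(detector.predict(test_texts))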