Computational Linguistics in the Netherlands Journal

Preface

Alina Karakanta — 2025-07-15

Preface

Happy or lonely? Investigating mental well-being using remote methods during the COVID-19 pandemic in The Netherlands

Marije Kanis — 2025-07-15

Understanding the unprecedented impact of COVID-19 on mental health and digital interactions is crucial, but also difficult to study in times of physical distancing. This paper contributes to the understanding of well-being in The Netherlands during the pandemic by employing mixed-remote methods. Sentiments of the Dutch public expressed on X (formally Twitter) are analyzed with AI techniques. Additionally, co-creative toolkits and probes, such as diaries, were used with older adults and students for detailed in-situ capturing. The AI approach provides general insights, while toolkit studies can address interpersonal variation and provide non-automated individual feedback. Findings indicate that (1) the pandemic has impacted the expressed emotional states of ‘loneliness’ and ‘happiness’, (2) this varied over time, for example related to pandemic announcements, (3) there are differences between groups (such as young and old), and (4) the toolkits provided contextual self-reflective insights and active inspiration in support of mental well-being.

Optimising Controllable Sentence Simplification for Dutch

Florelien Soete — 2025-07-15

The concept of Easy Language (Vandeghinste et al. 2021) involves the use of simple text, avoiding complex grammatical constructions and difficult vocabulary. Recent approaches (Seidl and Vandeghinste 2024) have shown promising results for text simplification using the pre-trained encoder-decoder T5 model (Raffel et al. 2020). This paper investigates new control tokens with a Dutch T5 large language model, and predicts sentence-dependent control token values with BERTje (de Vries et al. 2019), based on each input instance and the desired output complexity. Control tokens monitor the splitting and reformulation of the simplified sentence to control the degree of simplification (Sheang et al. 2022). Instead of fixed values for control tokens, the characteristics and complexity of the difficult sentences will be taken into account. Agrawal and Carpuat (2023) show that this approach improves the quality and controllability of the simplified outputs compared to using standardised control values. Our dataset consists of selected parallel (complex-simple) sentence pairs of the LEESPLANK dataset. The introduction of new control tokens has not proven to enhance the model’s ability to simplify sentences. But introducing BERTje to predict the actual control token values given a complex sentence has resulted in better performances and more accurate sentence simplification.

Evaluating LLM-Generated Topic Names via Text Reconstruction

Andriy Kosar — 2025-07-15

Automatically generating topic names for texts using large language models (LLMs) has become an innovative approach to topic detection. However, evaluating the quality of these LLM-generated topic names remains challenging, particularly in assessing their semantic relevance to the texts and the correctness of the information they convey. To address this gap, we propose a novel evaluation method that leverages LLMs to reconstruct original texts from generated topic names, then compares the reconstructed texts to the original by measuring their similarity. Topic names that produce reconstructed texts more similar to original ones better convey the original text’s information. This method favors topic names that maintain essential information, minimizing issues like incorrectness and irrelevance. Our experiments show that the reconstruction-based evaluation aligns with human topic name evaluation. This novel method demonstrates its versatility for evaluating other LLM-generated semantic compressions, such as summaries, headlines, and keywords.

Bag of Lies: Robustness in Continuous Pre-training BERT

Ine Gevers — 2025-07-15

This study aims to acquire more insights into the continuous pre-training phase of BERT regarding entity knowledge, using the COVID-19 pandemic as a case study. Specifically, we focus on to what extent entity knowledge can be acquired through continuous pre-training, and how robust this process is. Since the pandemic emerged after the last update of BERT’s pre-training data, the model has little to no prior entity knowledge about COVID-19. Using continuous pre-training, we control what entity knowledge is available to the model. We use a fact-checking benchmark about the entity, namely Check-COVID, as an evaluative framework, comparing a baseline BERT model with continuous pre-trained variants on this task. To test the robustness of continuous pre-training, we experiment with several adversarial methods to manipulate the input data, such as using misinformation and shuffling the word order until the input becomes nonsensical. Our findings reveal that these methods do not degrade, and sometimes even improve, the model’s downstream performance. This suggests that continuous pre-training of BERT is robust against these attacks, but that BERT obtaining entity-specific knowledge is susceptible to writing style changes in the data. Furthermore, we are releasing a new dataset, consisting of original texts from academic publications in the LitCovid repository and their AI-generated (false) counterparts.

Generating Simplified Dutch Texts for Pupils Through N-Shot Learning

Wout Sinnaeve — 2025-07-15

Text simplification (TS) aims to improve text readability while retaining its original meaning, aiding individuals with limited literacy skills or reading comprehension challenges. While substantial progress has been made in TS for English, there is a notable lack of research for Dutch, in part caused by the absence of Dutch parallel simplification corpora. This study investigates the effectiveness of N-shot learning using generative open-source large language models (LLM) for TS in Dutch, circumventing the need for extensive parallel corpora. Various N-shot learning techniques are assessed for their performance in generating simplified Dutch texts for pupils. The readability and appropriateness of these texts is evaluated using automatic readability assessment models and human evaluations. Results indicate that while one-shot learning using a Dutch monolingual generative LLM shows the highest performance among the tested methods, the overall effectiveness is poor, with metrics close to random guess probabilities. Human evaluation further highlights significant issues and that the generated outputs often do not match the intended readability levels and appropriateness for specific educational contexts. These findings suggest that current N-shot learning methodologies are not effective for Dutch TS, emphasising the need for more refined approaches and better training data to improve performance in this task.

PATCH! {P}sychometrics-{A}ssis{T}ed Ben{CH}marking of Large Language Models against Human Populations: A Case Study of Proficiency in 8th Grade Mathematics

Qixiang Fang — 2025-07-15

Many existing benchmarks of large (multimodal) language models (LLMs) focus on measuring LLMs’ academic proficiency, often with also an interest in comparing model performance with human test takers’. While such benchmarks have proven key to the development of LLMs, they suffer from several limitations, including questionable measurement quality (e.g., Do they measure what they are supposed to in a reliable way?), lack of quality assessment on the item level (e.g., Are some items more important or difficult than others?) and unclear human population reference (e.g., To whom can the model be compared?). In response to these challenges, we propose leveraging knowledge from psychometrics—a field dedicated to the measurement of latent variables like academic proficiency—into LLM benchmarking. We make four primary contributions. First, we reflect on current LLM benchmark developments and contrast them with psychometrics-based test development. Second, we introduce PATCH: a novel framework for Psychometrics-AssisTed benCHmarking of LLMs. PATCH addresses the aforementioned limitations. In particular, PATCH enables valid comparison between LLMs and human populations. Third, we demonstrate PATCH by measuring several LLMs’ proficiency in 8th grade mathematics against 56 human populations. We show that adopting a psychometrics-based approach yields evaluation outcomes that diverge from those based on current benchmarking practices. Fourth, we release 4 high-quality datasets to support measuring and comparing LLM proficiency in grade school mathematics and science with human populations.

On Dialect Typicality and Transitional Dialects

Ho Wang Matthew Sung — 2025-07-15

Since the late 19th century, dialectologists have already noticed that dialect areas do not have abrupt borders, but gradual transitions from one zone to another. One could speak of ‘focal’ or ‘core’ vs. ‘transitional’ areas for relatively homogenous areas vs. regions with linguistic variants from different neighbours. Dialectometry, the computational branch of dialectology, offers tools to capture and visualise dialect transitions. These techniques can show the global transition patterns, but not for which features, and in what way are they contributing to the transition. On the other hand, detailed studies on individual (often singleton) features in variationist sociolinguistics offer us insights to the details in the way dialects transition from one area to another. But the limit of close examination is the possible over-generalisation of the pattern of one feature over many other features in the dialects. In addition, it is not possible for humans to analyse every single variable in the data manually. In this paper, a new approach is proposed to explore dialect transitions on the feature level, namely the dialect typicality decay analysis. This novel approach builds on previous approaches in dialectometry (automatic dialect classification and feature extraction), and it explores how transitional (and prototypical) dialects possess characteristic features of a particular dialect group. Two main issues are explored using dialect typicality analysis: 1) what does dialect transition look like based on their top characteristic features? and 2) do transitional dialects favour adopting the most typical features from a certain dialect group?

A Knowledge-Based Language Model: Deducing Grammatical Knowledge in a Multi-Agent Language Acquisition Simulation

David Ph. Shakouri — 2025-07-15

This paper presents an initial study performed by the MODOMA system. The MODOMA is a computational multi-agent laboratory environment for unsupervised language acquisition experiments such that acquisition is based on the interaction between two language models, an adult and a child agent. Although this framework employs statistical as well as rule-based procedures, the result of language acquisition is a knowledge-based language model, which can be used to generate and parse new utterances of the target language. This system is fully parametrized and researchers can control all aspects of the experiments while the results of language acquisition, that is, the acquired grammatical knowledge, are explicitly represented and can be consulted. Thus, this system introduces novel possibilities for conducting computational language acquisition experiments. The experiments presented by this paper demonstrate that functional and content categories can be acquired and represented by the daughter agent based on training and test data containing different amounts of exemplars generated by the adult agent. Interestingly, similar patterns, which are well established for human-generated data, are also found for these machine-generated data. As the procedures resulted in the successful acquisition of discrete grammatical categories by the child agent, these experiments substantiate the validity of the MODOMA approach to modelling language acquisition.

From ‘nijntje’ to ‘konijn’: New methods for analyzing stress pattern acquisition

Nienke Wessel — 2025-07-15

In order to gain more insight into stress pattern acquisition by children, we are in need of large datasets and analysis tools that can analyze those data sets. This paper suggests a new way to analyze PhonBank corpora by automatically loading in the annotated data and plotting it in different manners, focusing on stress pattern acquisition. This way, we can more easily track changes in children’s development through time. This method is applied to three longitudinal and two cross-sectional corpora of Dutch, German, and English. The results show that there are some aspects of previously proposed theories that are in need of further investigation. Most notably, past theories have assumed that stages of stress pattern development were the same across different words, while we found some evidence for stages of development per word. This suggests that further research is needed into how exactly these stages differ for different words, and what determines whether a child learns the correct stress pattern for a word sooner or later. Furthermore, comparing our findings to those of previous investigations which used manual analysis methods provides us with the opportunity to discuss the drawbacks and benefits of our analysis tool.

Gender Bias and the Role of Context in Human Perception and Machine Translation

Janiça Hackenbuchner — 2025-07-15

This paper investigates human gender bias and its relation to bias in machine translation (MT), focussing on the role of context in gender interpretation. To this end, we measured human implicit gender bias and conducted an annotation study, followed by a linguistic and computational analysis to compare human gender perceptions among themselves and with a machine translation system. We created a dataset of 60 gender-ambiguous sentences and collected annotations to understand human gender perceptions and specifically which trigger words in context lead to this perception. The study shows that, unlike the MT system tested in this study, humans exhibit highly varied perceptions of gender in ambiguous contexts. A linguistic analysis on annotated trigger words reveals that proper nouns, nouns and adjectives frequently affect human gender perception.

Location-focused translation of flooding events in news articles

Suzan Lejeune — 2025-07-15

We are interested in the automatic extraction of information on flooding events in the Philippines from local news papers. Given that the majority of existing information extraction tools have been developed for English, this study aims to investigate the feasibility of using open-source machine translation (MT) tools to translate Tagalog news items to English. Extra care should be taken when translating location names, as precise location information is indispensable for ef- fective disaster management. We fine-tuned an open-source multi-lingual MT model for disaster news in Tagalog. We investigated several methods to enhance the model performance on location translation and evaluated the different versions to compare the translation quality of locations using a custom location-focused evaluation metric. To this end, two new Tagalog-English datasets specific to the domain were introduced for the purposes of fine-tuning and evaluation. We tested out fine-tuning on domain specific data and two masking techniques using either general masks or database-look-up of names. Contrary to our expectations, our findings show that the base open-source multi-lingual MT model was already proficient in location translation. Our analysis indicates that fine-tuning on domain-specific data improves overall machine translation quality. Our manual analysis provides insight into specific errors of location translation and the unique effects of the fine-tuning techniques.

LLMs as chainsaws: evaluating open-weights generative LLMs for extracting fauna and flora from multilingual travelogues

Tess Dejaeghere — 2025-07-15

Named Entity Recognition (NER) is crucial in literary-historical research for tasks such as semantic indexing and entity linking. However, historical texts pose challenges for implementing said tasks due to language variations, OCR errors, and poor performance of off-the-shelf annotation tools. Generative Large Language Models (LLMs) present both novel opportunities and challenges in humanities research. These models, while powerful, raise valid concerns regarding biases, hallucinations, and opacity - making their evaluation for the Digital Humanities (DH) community all the more urgent. In response, we propose our work on the evaluation of 3 quantized open-weights LLMs (mistral-7b-instruct-v0.1, nous-hermes-llama2-13b, Meta-Llama-3-8B-instruct) through GPT4ALL for NER on literary-historical travelogues from the 18th to 20th centuries in English, French, Dutch, and German. All models were assessed both quantitatively and qualitatively across 5 incrementally more complex prompts - revealing common error types such as bias, parsing issues, the addition of redundant information, entity adaptations and hallucinations. We analyse prevalent examples per language, century, prompt and model. Our contributions include a publicly accessible annotated dataset, pioneering insights into LLMs’ performance in literary-historical contexts, and the publication of reusable workflows for utilizing and evaluating LLMs in humanities research.

Unlocking Domain Knowledge: Model Adaptation for Non-Normative Dutch

Florian Debaene — 2025-07-15

This study examines the adaptation of transformer models to two non-normative Dutch language variants: early modern Dutch and contemporary social media Dutch. Both share linguistic features that set them apart from standard Dutch, including spelling inconsistencies, semantic shifts and out-of-domain vocabulary. To address this, we explore two domain adaptation techniques to adapt models to these language variants: (1) continued full-model pre-training and (2) training specialized adapters integrated into existing models. We evaluate these adaptation techniques on sentiment and emotion detection in early modern Dutch comedies and farces and on emotion and irony detection in Dutch tweets. Our results show that both adaptation methods significantly improve performance on historical and social media Dutch tasks, with the greatest gains occurring when domain-relevant datasets are used. The effectiveness of model adaptation is task-dependent and sensitive to the selection of pre-training data, emphasizing domain relevance over data quantity for optimizing downstream performance. We hypothesize that contemporary Dutch encoder models already capture informal language but lack historical Dutch exposure, making adaptation more impactful for the latter. Additionally, we compare adapted encoder models to generative decoder models, which are state-of-the-art in many NLP tasks. While generative models fail to match the performance of our adapted models for historical Dutch, fine-tuned generative models outperform adapted models on social media Dutch tasks. This suggests that task-specific fine-tuning remains crucial for effective generative modelling. Finally, we release two pre-training corpora for Dutch encoder adaptation and two novel task-specific datasets for early modern Dutch on Hugging Face.

Using GPT-4 for Conventional Metaphor Detection in English News Texts

Jiahui Liang — 2025-07-15

Metaphor detection presents a significant challenge in natural language processing (NLP) due to the intrinsic complexity of metaphors. In this work, we apply a prompting approach to evaluate GPT-4’s performance on the conventional metaphor identification task. We specifically investigate the effects of prompt variation, output stability, and the role of n-shot prompting. The results indicate that GPT-4’s performance on the metaphor identification task is consistently low across all tested settings, significantly lagging behind the top-performing BERT model. Based on our findings and error analysis, we propose possible approaches for utilizing LLMs and AI assistants
in metaphor detection and analysis.

Exploring the use of pre-trained ASR models for automatic assessment of children’s oral reading

Bram Groenhof — 2025-07-15

Dutch children’s reading skills have been declining consistently for many years. Oral reading fluency, a combination of decoding skills and word recognition skills, is a fundamental pre-requisite for one’s reading competence. Children’s oral reading fluency is often tested through oral word reading tasks, which are time-consuming to carry out as teachers have to administer the tests in a one-on-one setting, in which they have to indicate the word reading correctness on-the-fly. One possible way of alleviating this workload is to use automatic speech recognition (ASR) to aid in the assessment process. A key concern is that many ASR models struggle with children’s speech. We explored the performance of two pre-trained ASR models: Wav2Vec2.0-CGN and Faster-Whisper-v2. We had them carry out correctness judgement on an oral word reading task, using data from the Children’s Oral Reading Corpus (CHOREC). This corpus contains oral reading data of word lists from native Dutch-speaking primary school children aged 6-12 from Flanders. We compared the results of the ASR models to those of assessors in CHOREC by using the agreement metrics specificity, recall, accuracy, F1-score, and MCC as agreement metrics. We then used two different methods to improve the baseline results, by post-correcting ASR model correctness judgements using manually defined error categories. We found that allowing a deviation from the prompt by one error category obtained the best results for the overall metrics. Faster-Whisper-v2 (accuracy = .89; F1-score = .58; MCC = .54) outperformed Wav2Vec2.0 (accuracy = .70; F1-score = .39; MCC = .38). The MCC values show that both ASR models had mild agreement with assessors. We expected the accuracy levels for both models to be lower than the lowest assessor inter-rater accuracy level (.86), but Faster-Whisper-v2 performed better than expected (.89). However, one should be careful in interpreting this result, since the high accuracy scores are partially due to the imbalanced dataset. We conclude that the performance of standard pre-trained ASR models is promising, but given the current quality of the procedure caution should be exercised in its use. Future research could aim to improve the performance of the whole procedure by e.g. using methods like fine-tuning and validation, and through collaborative research with teachers.

Assessing Dutch Syllabification Algorithms and Improving Accuracy by Combining Phonetic and Orthographic Information through Deep Learning

Gus Lathouwers — 2025-07-15

Syllabification describes the task of dividing words into syllables. Due to many rules and exceptions, training an algorithm to perform syllabification with high accuracy remains a challenge. Throughout the last decades, different algorithms have been put forth for Dutch syllabification, yet a comprehensive comparative assessment has not been done. Additionally, deep learning has gained significant popularity within NLP in recent years, yet no modern deep-learning based framework has been developed for Dutch orthographic syllabification. Finally, phonetic and orthographic syllabification algorithms have been examined separately, but not in combination. The aim of the current research was twofold: (a) to examine the performance of existing Dutch syllabification algorithms, and (b) to investigate whether combining phonetic and orthographic information into a single model can increase syllabification performance. To compare the performance of algorithms, four algorithms (Brandt Corstius, Liang, Trogkanis-Elkan (CRF), and a newly conceived deep learning model) were applied to three different datasets (dictionary words, loanwords, pseudowords). The algorithms show varying performance across datasets, with the data-driven algorithms outperforming a knowledge-based algorithm in all but one condition. The new deep-learning methods developed led to increased performance compared to the best found in the literature (99.65% word accuracy, a 0.14% improvement). An analysis of the words for which adding phonetic information improved syllabification performance indicates that these were words in which the orthographic ambiguity could be resolved by information on pronunciation. Future research could examine other areas where phonetic information can benefit orthographic processing. In addition, the newly developed deep learning frameworks can be applied to other languages than Dutch.

Word Sense Discrimination with French Transformer Models

Stef Accou — 2025-07-15

This paper investigates unsupervised Word Sense Discrimination using French monolingual transformer models (viz. FlauBERT and CamemBERT), employing clustering and lexical substitution techniques. To investigate approaches that can benefit lower-resource languages, we explore three approaches: (1) clustering contextual embeddings derived through Principal Component Analysis (PCA); (2) a substitute-based method inspired by Amrami and Goldberg (2018), which leverages sparse vectors of model-predicted substitutes; and (3) an enhanced lexical substitution approach adapted from Zhou (2019), designed specifically for BERT-based models and employing embedding dropout to preserve semantic coherence. The evaluation uses two datasets: a manually annotated gold standard comprising 11 homonymous and polysemous target words, and a noisier, augmented corpus sourced from web crawls. Cluster estimation is performed with the Bayesian Information Criterion (BIC), and clustering is conducted using Gaussian Mixture Models (GMM). The gold standard enables comprehensive evaluation across hard-clustering metrics, addressing the lack of consensus on benchmarking Word Sense Discrimination algorithms. Our results show that FlauBERT consistently outperforms CamemBERT on clean datasets, while CamemBERT demonstrates greater robustness to noise. Incorporating Zhou’s (2019) lexical substitution technique yields state-of-the-art performance, particularly in substitute-based methods, but at the cost of significantly higher computational demands and variability due to embedding dropout. These findings highlight the trade-offs between precision and scalability in applying advanced lexical substitution methods.

Enriching Historical Records: An OCR and AI-Driven Approach for Database Integration

Zahra Abedi — 2025-07-15

This research digitizes and analyzes the Leidse hoogleraren en lectoren 1575-1815 books written between 1983 and 1985, which contain biographic data about professors and curators of Leiden University. It addresses the central question: ’How can we design an automated pipeline that integrates OCR, LLM-based interpretation, and database linking to harmonize data from historical document images with existing high-quality database records?’ We applied OCR techniques, generative AI decoding constraints that structure data extraction, and database linkage methods to process typewritten historical records into a digital format. OCR achieved a Character Error Rate (CER) of 1.08% and a Word Error Rate (WER) of 5.06%, while JSON extraction from OCR text achieved an average accuracy of 63% and, based on annotated OCR, 65%. This indicates that generative AI somewhat corrects low OCR performance. Our record linkage algorithm linked annotated JSON files with 94% accuracy and OCR-derived JSON files with 81%. This study contributes to digital humanities research by offering an automated pipeline for interpreting digitized historical documents, addressing challenges like layout variability and terminology differences, and exploring the applicability and strength of an advanced generative AI model.

Lexical semantic change detection for Ancient Greek: dataset creation and evaluation of a word-embedding-based technique

Silvia Stopponi — 2025-07-15

We create a benchmark for the evaluation of lexical semantic change detection in Ancient Greek and use it to assess the validity of two metrics of lexical semantic change on diachronic embeddings models. Stopponi et al. (2024b) assessed the viability of lexical semantic change detection for Ancient Greek with word2vec models, using two existing measures. However, only a manual evaluation was conducted since a benchmark for the evaluation of this task for Ancient Greek was still missing. We create such a benchmark by extracting cases of semantic change from close-reading studies in Ancient Greek lexical semantics. We also create a parallel benchmark of semantically stable items and assess the effectiveness of the most relevant of the two metrics in distinguishing semantically changed from semantically stable items. Finally, we qualitatively evaluate the candidates for semantic change detected by filtering words by low vector coherence value and high frequency. The results show that the method is effective at retrieving cases of semantic change, especially when coupled with frequency information, but also reinforce the idea that performing lexical semantic change detection on an ancient language and building a robust evaluation benchmark are particularly challenging tasks. In conclusion, we propose a constructive way to leverage this method as a research companion, by integrating it with the close-reading method.

The Riddle Experiment: two groups are trying to solve a Black Story behind a screen, only one group is alive

Nikki S. Rademaker — 2025-07-15

Investigating the cognitive abilities of large language models (LLMs) can inform theories about both artificial and human intelligence and highlight areas where AI may complement human cognition. This study explores GPT-4’s logical reasoning abilities by comparing its performance in solving Black Story riddles to that of humans. Black Stories are riddles where players reconstruct a hidden narrative by asking yes-or-no questions to a player who knows the full story. These riddles test logical reasoning, creativity, and inference skills of the solvers in an interactive setting. The study utilized a set of 12 existing Black Stories, with deviations in details included. Each Black Story was tested twice in the human and GPT-4 group to minimize individual differences. The experiment was conducted via text messaging to align the testing set-up for the two groups and eliminate potential non-verbal advantages for the human test group. The primary performance indicator was the number of questions needed to solve the riddle, considering the number of given hints to come to the solution. This measure indicated no significant difference between the groups, where both groups managed to arrive at the correct answer eventually. Though GPT-4 was significantly more verbose in questioning than humans, and qualitative results showed that GPT-4 excelled in precise questioning and creativity, but often fixated too much on details. This led to missing the bigger picture and summarizing solutions prematurely. On the other hand, humans covered broader topics and adapted their focus quickly, but had more difficulty figuring out uncommon details. This research suggests that the performance of GPT-4 and humans in solving Black Stories is not significantly different, despite using alternative approaches to achieve results.

Fietje: An open, efficient LLM for Dutch

Bram Vanroy — 2025-07-15

This paper introduces Fietje, a family of small language models (SLMs) specifically designed for the Dutch language. The model is based on Phi 2, an English-centric model of 2.7 billion parameters. Fietje demonstrated competitive results with larger language models upon its release. A core emphasis of this work is transparency and reproducibility: Fietje is fully open-source, with model weights, datasets, training, and evaluation code all publicly accessible. The paper discusses the performance of Fietje and many other models on an extensive evaluation suite of benchmarks on reasoning, sentiment analysis, world knowledge, linguistic acceptability and word sense disambiguation. Evaluation results illustrate the rapid progress in the field of large language models (LLMs), where recent small models outperform older, larger models that were fine-tuned for Dutch. This trend signals an exciting future for Dutch language processing, suggesting that even compact LLMs are becoming increasingly capable. Furthermore, ongoing and future efforts to adapt LLMs to Dutch are poised to enhance these models even further, broadening their applicability and accessibility. Fietje is only an intermediate step in improving accessibility to language technology for users of the Dutch language.

Evaluating Humor Generation in an Improvisational Comedy Setting

Thomas Winters — 2025-07-15

While computational humor generation has long been considered a challenging task, recent large language models have significantly improved the quality of generated jokes. Evaluating humor quality is usually difficult, as not only is the exact quality subjective, but delivery also plays a role. Another disparity in evaluation standards between human and computer-generated humor is the difference in writing time between the two. In this study, we evaluate and compare the quality of humor generated by GPT-4 with human-written jokes in an improvisational comedy setting in Dutch. In a live performance setting on national TV, nine different audience suggestions were used across three improvisational comedy games. Three professional comedians each performed their own improvised joke and an AI-generated joke per round, resulting in a total of 54 jokes. The AI-generated jokes were selected in real time from candidate outputs generated by GPT-4 using a few-shot chain-of-thought prompt specific to each game and audience suggestion. An audience
of 40 people then rated all jokes on a 4-point scale, resulting in 2,160 ratings. This allows us to compare the difference in quality between AI and human-created jokes delivered by the same comedian for the same audience suggestion. Our results show that audience members preferred human-created jokes 34.6% of the time, AI-generated jokes 29.7% of the time, and rated them equally in 35.7% of cases. Human-created jokes also received a slightly higher average rating (2.67 vs. 2.59), although GPT-4 occasionally produced standout jokes that received high “best joke” votes. These findings suggest that while human improvisation retains a narrow edge in consistency, current large language models can produce competitive humor under real-time constraints.

Intrinsic evaluation of Mono- and Multilingual Dutch Language Models

Daniel Vlantis — 2025-07-15

Through transfer learning, multilingual language models can produce good results on extrinsic, downstream NLP tasks in low-resource languages despite a lack of abundant training data. In most cases, however, monolingual models still perform better. Using the Dutch SimLex-999 dataset, we intrinsically evaluate several pre-trained monolingual stacked encoder LLMs for Dutch and compare them to several multilingual models that support Dutch, including two with parallel architectures (BERTje and mBERT). We also try to improve these models’ semantic representations by tuning the multilingual models on additional Dutch data. Furthermore, we explore the effect of tuning these models on written versus transcribed spoken data. While we can improve multilingual model performance through fine-tuning, we find that significant amounts of fine-tuning data and compute are required to outscore monolingual models on the intrinsic evaluation metric.

Evaluating Dutch Speakers and Large Language Models on Standard Dutch: a grammatical Challenge Set based on the Algemene Nederlandse Spraakkunst

Julia Pestel — 2025-07-15

This study evaluates the linguistic knowledge of Dutch Large Language Models (LLMs) by introducing a novel challenge set based on the Algemene Nederlandse Spraakkunst (ANS). The ANS is a comprehensive resource of Dutch prescriptive grammar created by linguists. We collect acceptability judgements of Dutch native speakers on our dataset, validating its usability while observing varying degrees of grammatical acceptability on specific syntactic phenomena. We evaluate both transformer-encoder and transformer-decoder Dutch LLMs on this dataset, and we compare their performance against the standard rules of Dutch in our dataset and the speaker ratings. We find that transformer-encoder models exhibit almost perfect accuracy on our dataset, yet sensitivities for specific sentences differ between models and humans, partially due to mismatches between the reference grammar and actual use of Dutch.