Word Sense Discrimination with French Transformer Models
Abstract
This paper investigates unsupervised Word Sense Discrimination using French monolingual transformer models (namely FlauBERT and CamemBERT), employing clustering and lexical substitution techniques. To identify approaches that can benefit lower-resource languages, we explore three methods: (1) clustering contextual embeddings reduced via Principal Component Analysis (PCA); (2) a substitute-based method inspired by Amrami and Goldberg (2018), which leverages sparse vectors of model-predicted substitutes; and (3) an enhanced lexical substitution approach adapted from Zhou (2019), designed specifically for BERT-based models and employing embedding dropout to preserve semantic coherence. The evaluation uses two datasets: a manually annotated gold standard comprising 11 homonymous and polysemous target words, and a noisier, augmented corpus sourced from web crawls. Clustering is conducted with Gaussian Mixture Models (GMMs), and the number of clusters is estimated with the Bayesian Information Criterion (BIC). The gold standard enables comprehensive evaluation across hard-clustering metrics, addressing the lack of consensus on benchmarking Word Sense Discrimination algorithms. Our results show that FlauBERT consistently outperforms CamemBERT on clean datasets, while CamemBERT demonstrates greater robustness to noise. Incorporating Zhou’s (2019) lexical substitution technique yields state-of-the-art performance, particularly in substitute-based methods, but at the cost of substantially higher computational demands and of variability introduced by embedding dropout. These findings highlight the trade-off between precision and scalability in applying advanced lexical substitution methods.
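The clustering pipeline described in the abstract (PCA-reduced contextual embeddings, GMM clustering, BIC-based cluster-count estimation) can be sketched as follows. This is a minimal illustration using scikit-learn on synthetic data standing in for transformer embeddings; the array shapes, the PCA dimensionality, and the candidate range of cluster counts are assumptions, not the paper's reported settings.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

# Synthetic stand-in for contextual embeddings of one target word:
# 200 occurrences, 768-dimensional (the hidden size of base FlauBERT/CamemBERT),
# drawn from two well-separated Gaussians to mimic two word senses.
rng = np.random.default_rng(0)
embeddings = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(100, 768)),
    rng.normal(loc=5.0, scale=1.0, size=(100, 768)),
])

# Reduce dimensionality with PCA before clustering (10 components is an
# illustrative choice, not the paper's configuration).
reduced = PCA(n_components=10, random_state=0).fit_transform(embeddings)

# Estimate the number of senses by minimizing BIC over candidate GMMs.
best_k, best_bic, best_model = None, np.inf, None
for k in range(1, 6):
    gmm = GaussianMixture(n_components=k, random_state=0).fit(reduced)
    bic = gmm.bic(reduced)
    if bic < best_bic:
        best_k, best_bic, best_model = k, bic, gmm

# Hard sense assignments for each occurrence of the target word.
labels = best_model.predict(reduced)
```

On cleanly separated data like the synthetic example above, BIC recovers the two underlying clusters; on real contextual embeddings the choice of PCA dimensionality and the BIC penalty interact, which is why cluster estimation is evaluated against the gold standard.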