Beyond Perplexity: Examining Temporal Generalization in Large Language Models via Definition Generation

Iris Luden; Mario Giulianelli; Raquel Fernández

Authors

Iris Luden UVA
Mario Giulianelli ETH Zürich
Raquel Fernández UVA

Abstract

The advent of large language models (LLMs) has significantly improved performance across various Natural Language Processing tasks. However, the performance of LLMs has been shown to deteriorate over time, indicating a lack of temporal generalization. To date, this deterioration has been attributed primarily to factual changes in the real world over time. Yet it is not only the facts of the world that change, but also the language we use to describe them. Recent studies have indicated a relationship between performance deterioration and semantic change, typically measured using perplexity scores and relative performance on downstream tasks. However, perplexity and accuracy do not explain the effects of temporally shifted data on LLMs in practice. In this work, we propose to assess the lexico-semantic temporal generalization of a language model by exploiting the task of contextualized word definition generation. This in-depth semantic assessment provides interpretable insights into the mistakes a model may make due to meaning shift, and can be used to complement more coarse-grained measures such as perplexity scores. To assess how semantic change impacts performance, we design the task by differentiating between semantically stable, changing, and emerging target words, and experiment with T5-base, fine-tuned for contextualized definition generation. Our results indicate that (i) the model’s performance deteriorates on the task of contextualized word definition generation, (ii) the performance deteriorates more for semantically changing words than for semantically stable words, (iii) the model exhibits significantly lower performance and potential bias for emerging words, and (iv) the performance does not correlate with cross-entropy or (pseudo)-perplexity scores. Overall, our results show that definition generation can be a promising task for assessing a model’s capacity for temporal generalization with respect to semantic change.
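
As a reading aid for the comparison between definition quality and perplexity mentioned in the abstract, the following minimal sketch shows how perplexity is derived from token-level cross-entropy and how a correlation between two sets of per-example scores could be computed. It is an illustration under stated assumptions, not the authors' evaluation code: the function names `perplexity` and `pearson` are hypothetical, natural-log token probabilities are assumed as input, and Pearson correlation is used only as one possible choice of correlation measure.

```python
import math


def perplexity(token_log_probs):
    """Perplexity as exp of the mean negative log-likelihood.

    Assumes `token_log_probs` holds natural-log probabilities that a model
    assigned to each token of one sequence (hypothetical input format).
    """
    cross_entropy = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(cross_entropy)


def pearson(xs, ys):
    """Pearson correlation between two equally long lists of scores.

    Illustrative choice only; the abstract does not say which correlation
    measure was used.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)
```

For instance, a model that assigns probability 0.25 to every token of a sequence has a perplexity of approximately 4 on that sequence. Given one definition-quality score and one perplexity value per example, `pearson(quality_scores, perplexities)` would return a value near 0 if the two measures are linearly unrelated.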
