Evaluating Humor Generation in an Improvisational Comedy Setting

Authors

  • Thomas Winters, Catholic University of Leuven
  • Stijn Van der Stockt, DPG Media

Abstract

While computational humor generation has long been considered a challenging task, recent large language models have significantly improved the quality of generated jokes. Evaluating humor is difficult: quality is subjective, and delivery also plays a role. Comparisons between human and computer-generated humor are further skewed by differences in writing time. In this study, we evaluate and compare the quality of humor generated by GPT-4 with human-written jokes in an improvisational comedy setting in Dutch. In a live performance on national TV, nine different audience suggestions were used across three improvisational comedy games. For each suggestion, three professional comedians each performed their own improvised joke and an AI-generated joke, resulting in a total of 54 jokes. The AI-generated jokes were selected in real time from candidate outputs generated by GPT-4 using a few-shot chain-of-thought prompt specific to each game and audience suggestion. An audience of 40 people then rated all jokes on a 4-point scale, resulting in 2,160 ratings. This design allows us to compare the quality of AI-generated and human-created jokes delivered by the same comedian for the same audience suggestion. Our results show that audience members preferred human-created jokes 34.6% of the time, AI-generated jokes 29.7% of the time, and rated them equally in 35.7% of cases. Human-created jokes also received a slightly higher average rating (2.67 vs. 2.59), although GPT-4 occasionally produced standout jokes that received many “best joke” votes. These findings suggest that while human improvisation retains a narrow edge in consistency, current large language models can produce competitive humor under real-time constraints.
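
The counts in the abstract follow directly from the study design, and the preference percentages can be read as a paired comparison. The sketch below checks the arithmetic and shows one plausible way such shares could be computed. It rests on assumptions not stated in the abstract: that each audience member's human and AI ratings for the same comedian and suggestion are compared as a pair, and that the 4-point scale runs from 1 to 4. The function name `preference_shares` and the toy ratings are hypothetical and are not data from the study.

```python
from collections import Counter

# Study design figures taken from the abstract.
N_SUGGESTIONS = 9      # audience suggestions across the three games
N_COMEDIANS = 3        # professional comedians
JOKES_PER_ROUND = 2    # one human-created and one AI-generated joke
N_RATERS = 40          # audience members

n_jokes = N_SUGGESTIONS * N_COMEDIANS * JOKES_PER_ROUND   # 54
n_ratings = n_jokes * N_RATERS                            # 2,160
# Assumption: one human-vs-AI pair per rater, comedian, and suggestion.
n_pairs = N_SUGGESTIONS * N_COMEDIANS * N_RATERS          # 1,080


def preference_shares(pairs):
    """Return the share of pairs won by the human joke, the AI joke, or tied.

    `pairs` is an iterable of (human_rating, ai_rating) tuples, each rating
    assumed to lie on a 1-4 scale.
    """
    counts = Counter()
    for human, ai in pairs:
        if human > ai:
            counts["human"] += 1
        elif ai > human:
            counts["ai"] += 1
        else:
            counts["tie"] += 1
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no rating pairs given")
    return {key: counts[key] / total for key in ("human", "ai", "tie")}


# Made-up toy ratings, for illustration only (not the study's data).
toy_pairs = [(3, 2), (2, 2), (1, 4), (4, 3)]
shares = preference_shares(toy_pairs)
```

Under this reading, the reported 34.6%, 29.7%, and 35.7% would be the three shares over all rating pairs, which is why they sum to 100%.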

Published

2025-07-15

How to Cite

Winters, T., & Van der Stockt, S. (2025). Evaluating Humor Generation in an Improvisational Comedy Setting. Computational Linguistics in the Netherlands Journal, 14, 505–523. Retrieved from https://clinjournal.org/clinj/article/view/214

Section

Articles