The Riddle Experiment: two groups are trying to solve a Black Story behind a screen, only one group is alive

Authors

  • Nikki S. Rademaker University of Leiden
  • Linthe van Rooij University of Leiden
  • Yanna E. Smid University of Leiden
  • Tessa Verhoef University of Leiden

Abstract

Investigating the cognitive abilities of large language models (LLMs) can inform theories about both artificial and human intelligence and highlight areas where AI may complement human cognition. This study explores GPT-4’s logical reasoning abilities by comparing its performance in solving Black Story riddles to that of humans. Black Stories are riddles in which players reconstruct a hidden narrative by asking yes-or-no questions of a player who knows the full story. These riddles test the solvers’ logical reasoning, creativity, and inference skills in an interactive setting. The study used a set of 12 existing Black Stories, with variations in their details. Each Black Story was tested twice in both the human group and the GPT-4 group to reduce the influence of individual differences. The experiment was conducted via text messaging to align the testing set-up for the two groups and eliminate potential non-verbal advantages for the human test group. The primary performance indicator was the number of questions needed to solve the riddle, taking into account the number of hints given along the way. This measure showed no significant difference between the groups, and both groups eventually arrived at the correct answer. However, GPT-4 was significantly more verbose in its questioning than humans. Qualitative results showed that GPT-4 excelled in precise questioning and creativity but often fixated too much on details, which led it to miss the bigger picture and summarize solutions prematurely. Humans, by contrast, covered broader topics and adapted their focus quickly but had more difficulty figuring out uncommon details. This research suggests that GPT-4 and humans do not differ significantly in their performance on Black Stories, despite taking different approaches to reach a solution.

Published

2025-07-15

How to Cite

Rademaker, N. S., van Rooij, L., Smid, Y. E., & Verhoef, T. (2025). The Riddle Experiment: two groups are trying to solve a Black Story behind a screen, only one group is alive. Computational Linguistics in the Netherlands Journal, 14, 451–472. Retrieved from https://clinjournal.org/clinj/article/view/212

Section

Articles