Analyzing Cancer Patients' Experiences with Embedding-based Topic Modeling and LLMs

Authors

Abstract

This study investigates the use of neural topic modeling and LLMs to uncover meaningful themes from patient storytelling data, with the goal of offering insights that could contribute to more patient-oriented healthcare practices. We analyze a collection of transcribed interviews with cancer patients (132,722 words in 13 interviews).
We first evaluate BERTopic and Top2Vec for summarizing individual interviews, using similar preprocessing, chunking, and clustering configurations to ensure a fair comparison on keyword extraction. LLMs (GPT-4) are then used for the subsequent topic-labeling step. The resulting outputs for a single interview (I0) are rated in a small-scale human evaluation focusing on coherence, clarity, and relevance. Based on these preliminary results, BERTopic shows stronger performance and is selected for further experimentation with three clinically oriented embedding models. We then analyze the full interview collection with the best model setting. Results show that domain-specific embeddings improve topic precision and interpretability, with BioClinicalBERT producing the most consistent results across transcripts. The global analysis of all 13 interviews, using the BioClinicalBERT embedding model, reveals the most dominant topics across the collection, namely "Coordination and Communication in Cancer Care Management" and "Patient Decision-Making in Cancer Treatment Journey". Although the interviews are machine translations from Dutch to English, and clinical professionals were not involved in the evaluation, the findings suggest that neural topic modeling, particularly BERTopic, can help provide useful feedback to clinicians from patient interviews. This pipeline could support more efficient document navigation and strengthen the role of patients' voices in healthcare workflows. Resources created in this work, including the preprocessing code and the prepared stopword list, will be shared publicly at https://github.com/4dpicture/TM4health.
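
The chunking step mentioned in the abstract can be illustrated with a minimal sketch. It assumes a simple overlapping word-window scheme; the function name `chunk_transcript`, the window size, and the overlap are hypothetical choices for illustration and are not the settings reported in the article.

```python
from typing import List


def chunk_transcript(text: str, chunk_size: int = 100, overlap: int = 20) -> List[str]:
    """Split a transcript into overlapping word windows.

    chunk_size and overlap are illustrative values (assumptions),
    not the configuration used in the article.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reaches the end of the transcript
    return chunks


# Toy input standing in for one interview transcript (not real data).
sample = " ".join(f"w{i}" for i in range(250))
chunks = chunk_transcript(sample, chunk_size=100, overlap=20)
```

Each chunk would then serve as one input document for the topic model, so that a single long interview yields enough documents to cluster.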

Author Biographies

  • Teodor-Călin Ionescu, Leiden University

    MSc student

  • Lifeng Han, Leiden University

    Lifeng Han received his PhD in Machine Translation in Dublin, Ireland, with a thesis titled “An investigation into multi-word expressions in machine translation”.

    He did his first postdoctoral research project at the University of Manchester on NLP for digital healthcare, “Integrating hospital outpatient letters into the healthcare data space”, where he helped build models and supervise students on tasks including medication extraction, relation extraction, text simplification, entity linking, de-identification, synthetic data generation, and machine translation.

    His current research is with the EU 4D Picture project, whose overall aim is “to improve the cancer patient journey and ensure personal preferences are respected.”

    He was Workshop Co-chair of the Workshop on Multiword Expressions (MWEs) in 2023/24, a long-standing workshop associated with ACL since 2003. He gave a tutorial at the main conference of LREC (Language Resources and Evaluation), one of the largest NLP conferences, in 2022 in Marseille, France, on “Meta-Evaluation of Translation Evaluation Methods: a systematic up-to-date overview”.

    He holds an honorary position with the University of Manchester.

  • Jan Heijdra Suasnabar, Leiden University Medical Center

    PhD candidate in Healthcare, Health Economics, Public Health, Global Health

  • Anne Stiggelbout, Leiden University Medical Center

    As full professor I study and teach shared decision making and other topics related to doctor-patient decision making (such as risk communication). Currently my research focuses mostly on supporting patient-centred care and shared decision making at the meso level, e.g., by redesigning care paths or modifying clinical practice guidelines. Value-Based Health Care and Appropriate Care (Passende Zorg) are an increasing focus, in the Netherlands as well as in my research.

    My PhD research (1995) and my research in the following years focused mostly on patient preferences. I also developed and validated several measurement instruments.

    My teaching to medical students focuses on Shared Decision Making. I also supervise PhD and other students, both at the LUMC and at other universities.
    I am also involved in initiatives for improving quality of care where patient-centredness is involved.

    I am also a member of the Board of Governors (non-executive Board) of the Jeroen Bosch Hospital (until January 2025).

  • Suzan Verberne, Leiden University

    Suzan Verberne is professor of Natural Language Processing (NLP) at the Leiden Institute of Advanced Computer Science at Leiden University and a member of the interdisciplinary research programme Society, Artificial Intelligence and Life Sciences (SAILS).

Published

2026-06-01

Section

Articles

How to Cite

Analyzing Cancer Patients’ Experiences with Embedding-based Topic Modeling and LLMs. (2026). Computational Linguistics in the Netherlands Journal, 15, 121-142. https://clinjournal.org/clinj/article/view/243