Analyzing Cancer Patients' Experiences with Embedding-based Topic Modeling and LLMs
Abstract
This study investigates the use of neural topic modeling and large language models (LLMs) to uncover meaningful themes in patient storytelling data, with the goal of offering insights that could contribute to more patient-oriented healthcare practices. We analyze a collection of transcribed interviews with cancer patients (13 interviews, 132,722 words in total).
We first evaluate BERTopic and Top2Vec for summarizing individual interviews, using similar preprocessing, chunking, and clustering configurations to ensure a fair comparison of their keyword extraction. An LLM (GPT-4) is then used to label the extracted topics. The outputs for a single interview (I0) are rated in a small-scale human evaluation focusing on coherence, clarity, and relevance. In these preliminary results, BERTopic performs better and is therefore selected for further experiments with three clinically oriented embedding models. We then analyze the full interview collection with the best-performing configuration. Results show that domain-specific embeddings improve topic precision and interpretability, with BioClinicalBERT producing the most consistent results across transcripts. The global analysis of all 13 interviews, using the BioClinicalBERT embedding model, reveals the most dominant topics across the collection, namely "Coordination and Communication in Cancer Care Management" and "Patient Decision-Making in Cancer Treatment Journey". Although the interviews are machine translations from Dutch to English, and clinical professionals were not involved in the evaluation, the findings suggest that neural topic modeling, particularly BERTopic, can help provide clinicians with useful feedback from patient interviews. This pipeline could support more efficient document navigation and strengthen the role of patients' voices in healthcare workflows. Resources created in this work, including the preprocessing code and the stopword list, will be shared publicly at https://github.com/4dpicture/TM4health.