Performance of large language models on neuroanatomy-based medical riddles: a comparative study

Date

2026

Publisher

Springer France

Access Rights

info:eu-repo/semantics/closedAccess

Abstract

Purpose: The integration of large language models (LLMs) into medical education has gained significant momentum in recent years. These models have demonstrated highly effective performance in medical board examination questions. However, their ability to comprehend, analyze, and reason through information has not yet been evaluated using medical riddles as an alternative assessment approach. Therefore, the aim of this study is to assess the performance of commercially available, general-purpose LLMs in solving medical riddles.

Methods: Responses generated by ChatGPT-5, ChatGPT-4, AnatomyGPT, Gemini 2.5, Claude, and DeepSeek for 20 neuroanatomy-related riddles were evaluated across two trials. Additionally, the riddles were presented in a different language to assess the impact of linguistic variation. Statistical analyses were conducted using Cochran's Q test and chi-square tests to compare the performance of the models. Response consistency was assessed using McNemar's test and Cohen's kappa coefficient.

Results: All models demonstrated strong performance on the riddles. Near-perfect accuracy was observed when the models were tested in English (ChatGPT-5 100%, ChatGPT-4 100%, AnatomyGPT 100%, Gemini 2.5 100%, DeepSeek 100%, Claude 95%). When tested in Turkish, Gemini 2.5 (80%) and DeepSeek (85%) showed relatively lower accuracy; however, overall correct response rates remained high across models. In terms of response consistency, five models demonstrated high agreement, while only Gemini 2.5 (kappa = 0.347) showed moderate agreement.

Conclusion: This study demonstrates that LLMs can successfully solve medical riddles with comparable levels of performance. These findings provide valuable insights into the current capabilities of LLMs in understanding, analyzing, and reasoning through domain-specific problem-solving tasks.
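The abstract reports response consistency between two trials as a Cohen's kappa coefficient. As a reading aid, the following is a minimal sketch of how such a coefficient can be computed for one model's correct/incorrect outcomes over 20 riddles. The function name `cohens_kappa` and the outcome vectors are hypothetical illustrations, not the study's code or data.

```python
def cohens_kappa(trial_a, trial_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(trial_a) != len(trial_b) or not trial_a:
        raise ValueError("trials must be non-empty and of equal length")
    n = len(trial_a)
    labels = set(trial_a) | set(trial_b)
    # Observed agreement: share of items with the same label in both trials.
    observed = sum(a == b for a, b in zip(trial_a, trial_b)) / n
    # Chance agreement: product of the marginal label frequencies, summed over labels.
    expected = sum((trial_a.count(l) / n) * (trial_b.count(l) / n) for l in labels)
    if expected == 1.0:
        # Both trials consist of one identical label, so kappa is undefined.
        return float("nan")
    return (observed - expected) / (1 - expected)


# Hypothetical outcomes for 20 riddles (1 = correct, 0 = incorrect); NOT study data.
# 15 correct in both trials, 1 correct only in trial 1,
# 2 correct only in trial 2, 2 incorrect in both.
trial_1 = [1] * 15 + [1] * 1 + [0] * 2 + [0] * 2
trial_2 = [1] * 15 + [0] * 1 + [1] * 2 + [0] * 2

kappa = cohens_kappa(trial_1, trial_2)
print(round(kappa, 3))
```

Note that when a model answers every riddle correctly in both trials, chance agreement equals 1 and kappa is undefined, which is why the sketch returns NaN in that case rather than a number.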

Keywords

Large language models, Medical riddle, Anatomy education, ChatGPT, Gemini

Source

Surgical and Radiologic Anatomy

WoS Quartile

Q3

Scopus Quartile

Q2

Volume

48

Issue

1
