Performance of large language models on neuroanatomy-based medical riddles: a comparative study

Kacar, Huma; Turamanlar, Ozan; Emir, Busra; Yakinci, Cengiz

dc.contributor.author	Kacar, Huma
dc.contributor.author	Turamanlar, Ozan
dc.contributor.author	Emir, Busra
dc.contributor.author	Yakinci, Cengiz
dc.date.accessioned	2026-04-04T13:37:35Z
dc.date.available	2026-04-04T13:37:35Z
dc.date.issued	2026
dc.department	İnönü Üniversitesi
dc.description.abstract	Purpose: The integration of large language models (LLMs) into medical education has gained significant momentum in recent years. These models have demonstrated highly effective performance in medical board examination questions. However, their ability to comprehend, analyze, and reason through information has not yet been evaluated using medical riddles as an alternative assessment approach. Therefore, the aim of this study is to assess the performance of commercially available, general-purpose LLMs in solving medical riddles. Methods: Responses generated by ChatGPT-5, ChatGPT-4, AnatomyGPT, Gemini 2.5, Claude, and DeepSeek for 20 neuroanatomy-related riddles were evaluated across two trials. Additionally, the riddles were presented in a different language to assess the impact of linguistic variation. Statistical analyses were conducted using Cochran's Q test and chi-square tests to compare the performance of the models. Response consistency was assessed using McNemar's test and Cohen's kappa coefficient. Results: All models demonstrated strong performance on the riddles. Near-perfect accuracy was observed when the models were tested in English (ChatGPT-5 100%, ChatGPT-4 100%, AnatomyGPT 100%, Gemini 2.5 100%, DeepSeek 100%, Claude 95%). When tested in Turkish, Gemini 2.5 (80%) and DeepSeek (85%) showed relatively lower accuracy; however, overall correct response rates remained high across models. In terms of response consistency, five models demonstrated high agreement, while only Gemini 2.5 (kappa = 0.347) showed moderate agreement. Conclusion: This study demonstrates that LLMs can successfully solve medical riddles with comparable levels of performance. These findings provide valuable insights into the current capabilities of LLMs in understanding, analyzing, and reasoning through domain-specific problem-solving tasks.
dc.identifier.doi	10.1007/s00276-026-03824-y
dc.identifier.issn	0930-1038
dc.identifier.issn	1279-8517
dc.identifier.issue	1
dc.identifier.orcid	0000-0003-4694-1319
dc.identifier.orcid	0000-0003-4804-3678
dc.identifier.pmid	41649564
dc.identifier.scopus	2-s2.0-105029611769
dc.identifier.scopusquality	Q2
dc.identifier.uri	https://doi.org/10.1007/s00276-026-03824-y
dc.identifier.uri	https://hdl.handle.net/11616/109908
dc.identifier.volume	48
dc.identifier.wos	WOS:001682976300002
dc.identifier.wosquality	Q3
dc.indekslendigikaynak	Web of Science
dc.indekslendigikaynak	Scopus
dc.indekslendigikaynak	PubMed
dc.language.iso	en
dc.publisher	Springer France
dc.relation.ispartof	Surgical and Radiologic Anatomy
dc.relation.publicationcategory	Article - International Refereed Journal - Institutional Faculty Member
dc.rights	info:eu-repo/semantics/closedAccess
dc.snmz	KA_WOS_20250329
dc.subject	Large language models
dc.subject	Medical riddle
dc.subject	Anatomy education
dc.subject	ChatGPT
dc.subject	Gemini
dc.title	Performance of large language models on neuroanatomy-based medical riddles: a comparative study
dc.type	Article

Collections

WoS Indexed Publications Collection
PubMed Indexed Publications Collection
Scopus Indexed Publications Collection
