Performance of large language models on neuroanatomy-based medical riddles: a comparative study

dc.contributor.author: Kacar, Huma
dc.contributor.author: Turamanlar, Ozan
dc.contributor.author: Emir, Busra
dc.contributor.author: Yakinci, Cengiz
dc.date.accessioned: 2026-04-04T13:37:35Z
dc.date.available: 2026-04-04T13:37:35Z
dc.date.issued: 2026
dc.department: İnönü Üniversitesi
dc.description.abstract: Purpose: The integration of large language models (LLMs) into medical education has gained significant momentum in recent years. These models have performed highly effectively on medical board examination questions. However, their ability to comprehend, analyze, and reason through information has not yet been evaluated using medical riddles as an alternative assessment approach. Therefore, the aim of this study is to assess the performance of commercially available, general-purpose LLMs in solving medical riddles. Methods: Responses generated by ChatGPT-5, ChatGPT-4, AnatomyGPT, Gemini 2.5, Claude, and DeepSeek for 20 neuroanatomy-related riddles were evaluated across two trials. Additionally, the riddles were presented in a different language to assess the impact of linguistic variation. Statistical analyses were conducted using Cochran's Q test and chi-square tests to compare the performance of the models. Response consistency was assessed using McNemar's test and Cohen's kappa coefficient. Results: All models demonstrated strong performance on the riddles. Near-perfect accuracy was observed when the models were tested in English (ChatGPT-5 100%, ChatGPT-4 100%, AnatomyGPT 100%, Gemini 2.5 100%, DeepSeek 100%, Claude 95%). When tested in Turkish, Gemini 2.5 (80%) and DeepSeek (85%) showed relatively lower accuracy; however, overall correct response rates remained high across models. In terms of response consistency, five models demonstrated high agreement, while only Gemini 2.5 (kappa = 0.347) showed moderate agreement. Conclusion: This study demonstrates that LLMs can successfully solve medical riddles with comparable levels of performance. These findings provide valuable insights into the current capabilities of LLMs in understanding, analyzing, and reasoning through domain-specific problem-solving tasks.
dc.identifier.doi: 10.1007/s00276-026-03824-y
dc.identifier.issn: 0930-1038
dc.identifier.issn: 1279-8517
dc.identifier.issue: 1
dc.identifier.orcid: 0000-0003-4694-1319
dc.identifier.orcid: 0000-0003-4804-3678
dc.identifier.pmid: 41649564
dc.identifier.scopus: 2-s2.0-105029611769
dc.identifier.scopusquality: Q2
dc.identifier.uri: https://doi.org/10.1007/s00276-026-03824-y
dc.identifier.uri: https://hdl.handle.net/11616/109908
dc.identifier.volume: 48
dc.identifier.wos: WOS:001682976300002
dc.identifier.wosquality: Q3
dc.indekslendigikaynak: Web of Science
dc.indekslendigikaynak: Scopus
dc.indekslendigikaynak: PubMed
dc.language.iso: en
dc.publisher: Springer France
dc.relation.ispartof: Surgical and Radiologic Anatomy
dc.relation.publicationcategory: Article - International Peer-Reviewed Journal - Institutional Faculty Member
dc.rights: info:eu-repo/semantics/closedAccess
dc.snmz: KA_WOS_20250329
dc.subject: Large language models
dc.subject: Medical riddle
dc.subject: Anatomy education
dc.subject: ChatGPT
dc.subject: Gemini
dc.title: Performance of large language models on neuroanatomy-based medical riddles: a comparative study
dc.type: Article
