Assessing Large Language Models in Neuroradiology: A Performance and Readability Comparison of ChatGPT and Gemini

Authors

  • Sefa Turkoglu, Department of Radiology, University of Colorado Anschutz Medical Campus, Aurora, CO 80045, United States.
  • Ayse Say, Department of Radiology, Suleyman Demirel University. https://orcid.org/0000-0002-4938-9059

DOI:

https://doi.org/10.5281/zenodo.17901571

Keywords:

ChatGPT, Gemini, radiology, neuroradiology, readability, large language models, artificial intelligence

Abstract

Objective:
To compare the performance, accuracy, readability, and clinical reliability of two large language models—ChatGPT (OpenAI) and Gemini (Google DeepMind)—in responding to neuroradiology-related questions.

Methods:
A cross-sectional model-comparison study was conducted using a standardized set of neuroradiology questions covering neuroanatomy, imaging physics, disease recognition, interpretation, and management. Both models received identical prompts, and responses were evaluated by fellowship-trained neuroradiologists using a structured scoring system assessing accuracy, completeness, clarity, and clinical safety. Error types were categorized, and statistical analyses were performed to compare performance. Additionally, readability metrics were applied to evaluate text complexity across multiple established indices.

Results:
Both models demonstrated strong baseline capability in neuroradiology knowledge. ChatGPT generally produced more detailed and stepwise reasoning, whereas Gemini provided concise responses with slightly more variability in accuracy. Readability analysis revealed that Gemini-generated texts were consistently easier to read and required a lower educational level for comprehension (p < .01 across most indices), while ChatGPT responses were more complex and required higher reading proficiency. No significant differences were found for the SMOG or Linsear Write indices. Error patterns differed: ChatGPT produced fewer hallucinations but more nonessential detail, while Gemini exhibited more omissions.

Conclusions:
ChatGPT and Gemini show promising utility as supportive tools in neuroradiology education and workflow tasks, but both exhibit limitations related to factual completeness and clinical safety. Gemini offers superior readability, while ChatGPT provides more structured reasoning. Neither model is sufficiently reliable for autonomous diagnostic or management decision-making. Continued evaluation using domain-specific benchmarks and multimodal datasets is recommended to guide safe clinical integration.


References

1. Benjamens S, Dhunnoo P, Meskó B. The state of artificial intelligence-based FDA-approved medical devices and algorithms. Lancet Digit Health. 2020;2(1):e13–e14.

2. Biswas S. Role of ChatGPT in healthcare: A review. J Med Syst. 2023;47(1):5.

3. Blattner T, Blattner M, Regan L. ChatGPT in clinical medicine: A cross-sectional evaluation. JMIR Med Educ. 2023;9:e46813.

4. Chen S, Yang L, Chen J, Zhang L. Deep learning in neuroradiology: Current status and future perspectives. AJNR Am J Neuroradiol. 2022;43(2):179–187.

5. Farhat W, Alnassar S, Aljuaid M, et al. Evaluating ChatGPT as a tool for radiology education: Performance on board-style questions. Clin Imaging. 2023;94:18–23.

6. Google DeepMind. Gemini: A family of highly capable multimodal models. 2024. Available from: https://deepmind.google

7. Gupta R, Tan MP. Opportunities and challenges of large language models in medical imaging. Radiol Artif Intell. 2023;5(5):e230054.

8. Hinton G. Deep learning—a technology with the potential to transform medicine. JAMA. 2018;320(11):1101–1102.

9. Hutchinson B, Mitchell M. 50 Years of test (un)fairness: Lessons for machine learning. In: FAT* ’19. New York, NY: ACM; 2019. p. 49–58.

10. Jha S, Topol EJ. Adapting radiology training to artificial intelligence. Radiology. 2023;307(2):e222228.

11. Kapoor R, Blake MA, Ghoshhajra BB. Large language models and the future of radiology reporting. Radiol Artif Intell. 2023;5(3):e230008.

12. Kwon JM, Kim KH, Kim HJ. Artificial intelligence in acute stroke imaging. J Stroke. 2020;22(1):31–43.

13. Lee H, Yune S, Mansouri M, et al. Deep-learning–based automatic detection of cerebral aneurysms in MR angiography. Radiology. 2020;294(3):549–556.

14. Liu X, Faes L, Kale A, et al. A comparison of deep learning performance against healthcare professionals. Lancet Digit Health. 2019;1(6):e271–e297.

15. OpenAI. GPT-4 technical report. 2023. Available from: https://openai.com/research

16. Rao A, et al. Assessing the accuracy and reliability of ChatGPT for radiology. Radiology. 2023;308(2):e230345.

17. Recht MP, Bryan RN. Artificial intelligence in radiology: Are we ready? Radiology. 2017;285(3):713–715.

18. Shen D, Wu G, Suk H. Deep learning in medical image analysis. Annu Rev Biomed Eng. 2017;19:221–248.

19. Thrall JH, Li X, Li Q, et al. Artificial intelligence and machine learning in radiology: Opportunities and challenges. J Am Coll Radiol. 2018;15(3):504–513.

20. American College of Radiology. ACR Data Science Institute: AI use cases. 2023. Available from: https://www.acrdsi.org

21. Dönger U, Doğan AC. Readability assessment of large language model responses to common postnatal questions among migrant parents: A comparative analysis of ChatGPT and Gemini. Acta Medica Young Doctors. 2025;1(3). https://doi.org/10.5281/zenodo.17670372

22. Sukur IH, Ok F. Comparative readability analysis of large language model responses to stress urinary incontinence questions: ChatGPT versus Gemini. Acta Medica Young Doctors. 2025;1(3). https://doi.org/10.5281/zenodo.17764454

23. Aydemir AM. Warfarin use: A readability comparison of Gemini and ChatGPT. Acta Medica Young Doctors. 2025;1(2):59–65. https://doi.org/10.5281/zenodo.17156371

Published

2025-10-11

How to Cite

Turkoglu, S., & Say, A. (2025). Assessing Large Language Models in Neuroradiology: A Performance and Readability Comparison of ChatGPT and Gemini. Acta Medica Young Doctors, 1(2). https://doi.org/10.5281/zenodo.17901571
