Can Artificial Intelligence Support Orthopedic Outpatient Decision-Making?

ChatGPT 4.0 as a second reader in orthopedic outpatient practice

DOI:

https://doi.org/10.66288/actamedi.2026.71

Keywords:

artificial intelligence, ChatGPT 4.0, orthopedic outpatient, red-flag recognition, clinical decision support, hallucination

Abstract

Aim

Artificial intelligence language models are increasingly evaluated as clinical decision-support tools. However, evidence on their performance in realistic orthopedic outpatient scenarios remains limited, particularly for investigation appropriateness, red-flag recognition, and safety-related errors.

Methodology

A prospective simulation study was conducted using 100 standardized orthopedic outpatient case vignettes that reflected common clinical presentations and were validated by two senior orthopedic surgeons. Cases were submitted to ChatGPT 4.0 using a standardized prompt, and its responses were compared with those of orthopedic residents and specialists. Primary outcomes were diagnostic accuracy and investigation appropriateness, judged against American College of Radiology (ACR) and National Institute for Health and Care Excellence (NICE) guidelines. Secondary outcomes were red-flag miss rates and error types (Type A: knowledge deficiency; Type B: unsafe reasoning; Type C: contextual hallucination). A combined model (resident + ChatGPT 4.0) was evaluated as a theoretical upper bound. Results are presented with 95% confidence intervals.
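
The abstract does not say how the combined model was scored. One plausible reading of "theoretical upper bound" is that a case counts as correct when either the resident or ChatGPT 4.0 answered it correctly. The sketch below illustrates that rule only; the function name and the toy per-case data are hypothetical and are not taken from the study.

```python
from typing import Sequence


def combined_upper_bound(resident_correct: Sequence[bool],
                         model_correct: Sequence[bool]) -> float:
    """Share of cases answered correctly by at least one of the two readers.

    ASSUMPTION: the combined model is an either-reader-correct (logical OR)
    rule. The source does not confirm this definition.
    """
    if len(resident_correct) != len(model_correct):
        raise ValueError("both readers must be scored on the same cases")
    if not resident_correct:
        raise ValueError("at least one case is required")
    hits = sum(r or m for r, m in zip(resident_correct, model_correct))
    return hits / len(resident_correct)


# Hypothetical toy data for five cases (not study data).
resident = [True, False, True, False, False]
model = [True, True, False, False, True]
print(combined_upper_bound(resident, model))  # 0.8
```

Under such a rule the combined accuracy can never fall below that of the better single reader, which is why it serves as a ceiling and not as an estimate of real-world assisted performance.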

Results

Diagnostic accuracy was 62% for residents, 78% for ChatGPT 4.0, and 92% for specialists; the combined model reached 86%. Red-flag miss rates were 17% for residents, 6% for ChatGPT 4.0, 3% for specialists, and 2% for the combined model. ChatGPT 4.0 showed lower investigation appropriateness than residents (64% vs. 71%; p = 0.041), reflecting over-investigation. Its errors were mainly Type C (68%), whereas resident errors were predominantly Type B (42%).
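
The methodology mentions 95% confidence intervals, but the abstract reports point estimates only. As a rough illustration of the uncertainty around such proportions, the sketch below computes Wilson score intervals for the diagnostic accuracy figures. It assumes each figure rests on a denominator of 100 vignettes, and the choice of the Wilson method is also an assumption, since the abstract does not name the interval method used.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.959964) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 gives ~95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# Reported diagnostic accuracy; ASSUMPTION: 100 scored vignettes per group.
reported = {"residents": 62, "ChatGPT 4.0": 78, "specialists": 92, "combined": 86}
for group, correct in reported.items():
    low, high = wilson_interval(correct, 100)
    print(f"{group}: {correct}% (95% CI {low:.1%} to {high:.1%})")
```

With only 100 cases per group the intervals are wide, so modest differences between groups should be read with caution.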

Conclusion

ChatGPT 4.0 shows potential as a supplementary second reader, reducing red-flag misses when combined with physician judgment. However, its tendency toward over-investigation and contextual hallucination necessitates specialist oversight, supporting an augmentative rather than autonomous role.


Published

2026-05-01

How to Cite

Ayhan, B. (2026). Can Artificial Intelligence Support Orthopedic Outpatient Decision-Making? ChatGPT 4.0 as a second reader in orthopedic outpatient practice. Acta Medica Young Doctors, 2(3), 1–10. https://doi.org/10.66288/actamedi.2026.71
