Artificial Intelligence Large Language Models for Pulmonary Nodule Surgical Decision-Making: A Comparative Accuracy Study

Authors

  • Nilay Çavuşoğlu Yalçın, Department of Thoracic Surgery, Antalya Training and Research Hospital

DOI:

https://doi.org/10.5281/zenodo.18840730

Keywords:

Artificial intelligence, large language models, pulmonary nodule, surgical indication, diagnostic accuracy

Abstract

Background

Artificial intelligence (AI) large language models show promise in medical decision-making, but their reliability in determining surgical indications for pulmonary nodules remains unexplored. We evaluated the diagnostic accuracy and consistency of three leading AI models compared with expert thoracic surgeon consensus.

Methods

This cross-sectional diagnostic accuracy study evaluated ChatGPT-4, Claude 3.5 Sonnet, and Google Gemini Pro using 45 standardized clinical vignettes representing diverse pulmonary nodule presentations. Six thoracic surgeons with ≥5 years of experience independently reviewed all vignettes to establish consensus. Each AI model was tested three times per vignette to assess test-retest reliability. Primary outcome was overall diagnostic accuracy; secondary outcomes included inter-model agreement and performance across nodule categories and complexity levels.

Results

The expert panel achieved 91.4% mean inter-rater agreement (range: 60-100%), with unanimous consensus in 46.7% of cases. Overall AI-expert agreement was 82.2% (95% CI: 71.1-93.4%). Claude and Gemini each achieved 82.2% accuracy with perfect test-retest reliability (100% consistency across three trials), while GPT-4 demonstrated 80.0% accuracy with 86.8% consistency. Inter-model agreement was highest between Claude and Gemini (100%), versus 62.2% for GPT-4's comparisons with either model. Performance varied significantly by nodule category: 100% agreement in complex scenarios (mixed pattern, multiple nodules, high-risk comorbidities, post-treatment) versus 20% in intermediate-sized solid nodules (21-30 mm).

Conclusions

Leading AI large language models demonstrate substantial agreement with expert consensus in pulmonary nodule management, with Claude and Gemini showing superior consistency. However, performance varies markedly by clinical context, particularly for intermediate-sized solid nodules where guideline ambiguity is greatest. Current AI capabilities may complement but cannot replace expert thoracic surgical judgment.




Published

2026-03-02

How to Cite

Çavuşoğlu Yalçın, N. (2026). Artificial Intelligence Large Language Models for Pulmonary Nodule Surgical Decision-Making: A Comparative Accuracy Study. Acta Medica Young Doctors, 2(1). https://doi.org/10.5281/zenodo.18840730
