Comparative Readability Analysis of Large Language Model Responses to Stress Urinary Incontinence Questions: ChatGPT versus Gemini
Readability of AI Responses on Stress Urinary Incontinence
DOI: https://doi.org/10.5281/zenodo.17764454
Keywords: Artificial Intelligence, Readability, Large Language Models
Abstract
Background: Large language models (LLMs) are increasingly used by patients seeking health information. The readability of AI-generated medical content is crucial for patient comprehension and informed decision-making.
Objective: To compare the readability levels of responses generated by ChatGPT and Gemini to stress urinary incontinence (SUI) patient questions using multiple validated readability assessment tools.
Methods: Thirteen commonly asked questions about SUI were posed to both ChatGPT and Gemini. Responses were analyzed using nine standardized readability metrics: Average Reading Level Consensus Calculator (ARLC), Automated Readability Index (ARI), Flesch Reading Ease (FRE), Gunning Fog Index (GFI), Flesch-Kincaid Grade Level (FKGL), Coleman-Liau Index (CLI), SMOG Index, Original Linsear Write Readability Formula, and LINSEAR Write Grade Level Formula. Statistical comparisons were performed using Student's t-test.
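Two of the metrics named above, Flesch Reading Ease (FRE) and Flesch-Kincaid Grade Level (FKGL), follow well-known published formulas based on words per sentence and syllables per word. The sketch below illustrates those two formulas only; the vowel-group syllable counter is a naive stand-in for illustration and is not the tool used in this study.

```python
import re

def count_syllables(word: str) -> int:
    """Crude syllable estimate: count contiguous vowel groups.
    (A dictionary-based counter would be more accurate.)"""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def readability(text: str) -> dict:
    """Compute FRE and FKGL from the standard published formulas."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    wps = len(words) / len(sentences)  # mean words per sentence
    spw = syllables / len(words)       # mean syllables per word
    return {
        "FRE": 206.835 - 1.015 * wps - 84.6 * spw,   # higher = easier
        "FKGL": 0.39 * wps + 11.8 * spw - 15.59,     # US grade level
    }

scores = readability(
    "Stress urinary incontinence is involuntary leakage on effort or exertion."
)
```

Lower FRE and higher FKGL both indicate harder text, which is why ChatGPT's lower FRE score in the Results corresponds to greater reading difficulty.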
Results: ChatGPT responses demonstrated higher complexity across most metrics. The Flesch Reading Ease score was significantly lower for ChatGPT (31.1±14.5) compared to Gemini (43.4±15.4, p=0.047), indicating more difficult text. The LINSEAR Write Grade Level Formula showed a substantial difference, with ChatGPT scoring 20.9±10.9 versus Gemini's 7.9±2.1 (p<0.001). Other indices including ARLC, ARI, GFI, FKGL, and CLI consistently showed higher grade levels for ChatGPT, though differences did not reach statistical significance.
Conclusion: Gemini produced more readable health information about SUI compared to ChatGPT. Both models generated content at grade levels exceeding recommended health literacy standards. These findings highlight the need for readability optimization in AI-generated medical content.
License
Copyright (c) 2025 Ibrahim Halil Sukur, Fesih Ok

This work is licensed under a Creative Commons Attribution 4.0 International License.