
Mixed methods assessment of the influence of demographics on medical advice of ChatGPT.

Author Affiliations

Department of Population Health, NYU Grossman School of Medicine, New York, NY 10016, United States.

Brown University, Providence, RI 02912, United States.

Publication Information

J Am Med Inform Assoc. 2024 Sep 1;31(9):2002-2009. doi: 10.1093/jamia/ocae086.

Abstract

OBJECTIVES

To evaluate demographic biases in diagnostic accuracy and health advice between a generative artificial intelligence (AI) model (ChatGPT, GPT-4) and a traditional symptom checker (WebMD).

MATERIALS AND METHODS

Combination symptom and demographic vignettes were developed for the 27 most common symptom complaints. Standardized prompts, written from a patient perspective with varying demographic permutations of age, sex, and race/ethnicity, were entered into ChatGPT (GPT-4) between July and August 2023. In total, 3 runs of 540 ChatGPT prompts were compared to the corresponding WebMD Symptom Checker output using a mixed-methods approach. In addition to diagnostic correctness, the associated text generated by ChatGPT was analyzed for readability (using the Flesch-Kincaid Grade Level) and for qualitative aspects such as disclaimers and demographic tailoring.
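The permutation design described above (540 prompts over 27 complaints, i.e. 20 demographic variants per complaint) can be sketched as follows. The specific category values and prompt wording are illustrative assumptions, not the authors' exact materials; the results section only confirms that ages 25 and 75 were compared.

```python
from itertools import product

# Assumed demographic categories: 2 ages x 2 sexes x 5 race/ethnicity
# groups = 20 variants per symptom, matching the reported 540 = 27 x 20.
ages = [25, 75]
sexes = ["male", "female"]
races = ["White", "Black", "Hispanic", "Asian", "American Indian"]

# Placeholder names for the 27 symptom complaints used in the study.
symptoms = [f"symptom_{i}" for i in range(27)]

# Hypothetical patient-perspective prompt template, one per permutation.
prompts = [
    f"I am a {age}-year-old {race} {sex} experiencing {symptom}. What could this be?"
    for symptom, age, sex, race in product(symptoms, ages, sexes, races)
]

print(len(prompts))  # 27 symptoms x 20 demographic variants = 540
```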

RESULTS

ChatGPT matched WebMD in 91% of diagnoses, with a 24% top-diagnosis match rate. Diagnostic accuracy did not differ significantly across demographic groups, including age, race/ethnicity, and sex. ChatGPT's urgent care recommendations and demographic tailoring were presented significantly more often to 75-year-olds than to 25-year-olds (P < .01) but did not differ statistically among race/ethnicity and sex groups. The GPT text was written at a reading level suitable for college students, with no significant demographic variability.
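The Flesch-Kincaid Grade Level behind the readability result above is a standard formula: 0.39 * (words per sentence) + 11.8 * (syllables per word) - 15.59. A minimal sketch follows, with syllables approximated by counting vowel groups; this is a rough heuristic, and the study likely used a dedicated readability tool.

```python
import re

def fk_grade(text: str) -> float:
    """Flesch-Kincaid Grade Level of an English text.

    Formula: 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59.
    Syllables are approximated as runs of vowels within each word.
    """
    # Count sentence terminators; assume at least one sentence.
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    if not words:
        return 0.0
    # Approximate syllables: each maximal vowel group counts as one,
    # with a floor of one syllable per word.
    syllables = sum(
        max(1, len(re.findall(r"[aeiouy]+", w.lower()))) for w in words
    )
    return 0.39 * (len(words) / sentences) + 11.8 * (syllables / len(words)) - 15.59
```

A grade of roughly 13 or above corresponds to college-level text, consistent with the finding reported here.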

DISCUSSION

The use of non-health-tailored generative AI, like ChatGPT, for simple symptom-checking functions provides comparable diagnostic accuracy to commercially available symptom checkers and does not demonstrate significant demographic bias in this setting. The text accompanying differential diagnoses, however, suggests demographic tailoring that could potentially introduce bias.

CONCLUSION

These results highlight the need for continued rigorous evaluation of AI-driven medical platforms, focusing on demographic biases to ensure equitable care.

Similar Articles

5
Generative artificial intelligence versus clinicians: Who diagnoses multiple sclerosis faster and with greater accuracy?
Mult Scler Relat Disord. 2024 Oct;90:105791. doi: 10.1016/j.msard.2024.105791. Epub 2024 Aug 6.

Cited By

1
Public Versus Academic Discourse on ChatGPT in Health Care: Mixed Methods Study.
JMIR Infodemiology. 2025 Jun 23;5:e64509. doi: 10.2196/64509.
2
Large Language Models in Medical Diagnostics: Scoping Review With Bibliometric Analysis.
J Med Internet Res. 2025 Jun 9;27:e72062. doi: 10.2196/72062.
4
Evaluating and addressing demographic disparities in medical large language models: a systematic review.
Int J Equity Health. 2025 Feb 26;24(1):57. doi: 10.1186/s12939-025-02419-0.
5
The Goldilocks Zone: Finding the right balance of user and institutional risk for suicide-related generative AI queries.
PLOS Digit Health. 2025 Jan 8;4(1):e0000711. doi: 10.1371/journal.pdig.0000711. eCollection 2025 Jan.
6
Not the Models You Are Looking For: Traditional ML Outperforms LLMs in Clinical Prediction Tasks.
medRxiv. 2024 Dec 5:2024.12.03.24318400. doi: 10.1101/2024.12.03.24318400.
7
Fairness in AI-Driven Oncology: Investigating Racial and Gender Biases in Large Language Models.
Cureus. 2024 Sep 16;16(9):e69541. doi: 10.7759/cureus.69541. eCollection 2024 Sep.
8
Large language models in biomedicine and health: current research landscape and future directions.
J Am Med Inform Assoc. 2024 Sep 1;31(9):1801-1811. doi: 10.1093/jamia/ocae202.

References

1
Evaluating the Quality and Usability of Artificial Intelligence-Generated Responses to Common Patient Questions in Foot and Ankle Surgery.
Foot Ankle Orthop. 2023 Nov 22;8(4):24730114231209919. doi: 10.1177/24730114231209919. eCollection 2023 Oct.
3
Information Quality and Readability: ChatGPT's Responses to the Most Common Questions About Spinal Cord Injury.
World Neurosurg. 2024 Jan;181:e1138-e1144. doi: 10.1016/j.wneu.2023.11.062. Epub 2023 Nov 22.
4
Accuracy of ChatGPT generated diagnosis from patient's medical history and imaging findings in neuroradiology cases.
Neuroradiology. 2024 Jan;66(1):73-79. doi: 10.1007/s00234-023-03252-4. Epub 2023 Nov 23.
5
Assessing ChatGPT's ability to answer questions pertaining to erectile dysfunction: can our patients trust it?
Int J Impot Res. 2024 Nov;36(7):734-740. doi: 10.1038/s41443-023-00797-z. Epub 2023 Nov 20.
6
Bias and Inaccuracy in AI Chatbot Ophthalmologist Recommendations.
Cureus. 2023 Sep 25;15(9):e45911. doi: 10.7759/cureus.45911. eCollection 2023 Sep.
7
Artificial intelligence and increasing misinformation.
Br J Psychiatry. 2024 Feb;224(2):33-35. doi: 10.1192/bjp.2023.136.
8
Centering health equity in large language model deployment.
PLOS Digit Health. 2023 Oct 24;2(10):e0000367. doi: 10.1371/journal.pdig.0000367. eCollection 2023 Oct.
9
Large language models propagate race-based medicine.
NPJ Digit Med. 2023 Oct 20;6(1):195. doi: 10.1038/s41746-023-00939-z.
