Milner John D, Quinn Matthew S, Schmitt Phillip, Knebel Ashley, Henstenburg Jeffrey, Nasreddine Adam, Boulos Alexandre R, Schiller Jonathan R, Eberson Craig P, Cruz Aristides I
Department of Orthopaedic Surgery, Brown University, Warren Alpert Medical School, Providence, RI, USA.
Division of Sports Medicine, Boston Children's Hospital, Boston, MA, USA.
J Pediatr Soc North Am. 2025 Mar 9;11:100164. doi: 10.1016/j.jposna.2025.100164. eCollection 2025 May.
The vast accessibility of artificial intelligence (AI) has enabled its utilization in medicine to improve patient education, augment patient-physician communication, support research efforts, and enhance medical student education. However, there is significant concern that these models may provide responses that are incorrect, biased, or lacking the nuance and complexity required for best-practice clinical decision-making. Currently, there is a paucity of literature comparing the quality and reliability of AI-generated responses. The purpose of this study was to assess the ability of ChatGPT and Gemini to generate responses to the 2022 American Academy of Orthopaedic Surgeons (AAOS) current practice guidelines on pediatric supracondylar humerus fractures. We hypothesized that both ChatGPT and Gemini would demonstrate high-quality, evidence-based responses with no significant difference between the models across evaluation criteria.
Responses from ChatGPT and Gemini to prompts based on the 14 AAOS guidelines were evaluated by seven fellowship-trained pediatric orthopaedic surgeons using a questionnaire assessing five key characteristics on a scale from 1 to 5. The prompts were categorized into nonoperative or preoperative management and diagnosis, surgical timing and technique, and rehabilitation and prevention. Statistical analysis included mean scores, standard deviations, and two-sided t-tests to compare performance between ChatGPT and Gemini. Scores were then evaluated for inter-rater reliability.
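As a rough illustration of the analysis described above (a minimal sketch, not the authors' actual code), the following Python example shows how mean scores, standard deviations, a two-sided t-test, and a simple inter-rater agreement measure could be computed from a reviewer-by-prompt score matrix. The data, variable names, and the Spearman-based agreement measure are illustrative assumptions; the paper does not specify which reliability statistic was used.

```python
# Hypothetical sketch of the statistical comparison described in the methods.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Illustrative example data: 7 reviewers x 14 guideline-based responses,
# each scored 1-5 for a single criterion (e.g., "evidence-based").
chatgpt_scores = rng.integers(1, 6, size=(7, 14)).astype(float)
gemini_scores = rng.integers(1, 6, size=(7, 14)).astype(float)

# Mean and standard deviation per model for this criterion.
for name, scores in [("ChatGPT", chatgpt_scores), ("Gemini", gemini_scores)]:
    print(f"{name}: mean={scores.mean():.3f}, sd={scores.std(ddof=1):.3f}")

# Two-sided independent-samples t-test comparing the two models.
t_stat, p_value = stats.ttest_ind(chatgpt_scores.ravel(), gemini_scores.ravel())
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")

# Simple inter-rater agreement check: mean pairwise Spearman correlation
# between reviewers (an assumed stand-in for the reliability analysis).
def mean_pairwise_spearman(ratings: np.ndarray) -> float:
    n_raters = ratings.shape[0]
    corrs = [
        stats.spearmanr(ratings[i], ratings[j])[0]
        for i in range(n_raters)
        for j in range(i + 1, n_raters)
    ]
    return float(np.mean(corrs))

print("Inter-rater agreement (ChatGPT):", round(mean_pairwise_spearman(chatgpt_scores), 3))
```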
ChatGPT and Gemini demonstrated consistent performance, with high mean scores across all criteria except for evidence-based responses. Mean scores were highest for clarity (ChatGPT: 3.745 ± 0.237; Gemini: 4.388 ± 0.154) and lowest for evidence-based responses (ChatGPT: 1.816 ± 0.181; Gemini: 3.765 ± 0.229). Gemini had higher mean scores in each criterion and achieved statistically higher ratings in the relevance (P = .03) and evidence-based (P < .001) criteria. Both large language models (LLMs) performed comparably in the accuracy, clarity, and completeness criteria (P > .05).
ChatGPT and Gemini produced responses aligned with the 2022 AAOS current practice guidelines for pediatric supracondylar humerus fractures. Gemini outperformed ChatGPT across all criteria, with the greatest difference in scores seen in the evidence-based category. This study emphasizes the potential for LLMs, particularly Gemini, to provide pertinent clinical information for managing pediatric supracondylar humerus fractures.
(1) The accessibility of artificial intelligence has enabled its utilization in medicine to improve patient education, support research efforts, enhance medical student education, and augment patient-physician communication.
(2) There is significant concern that artificial intelligence may provide responses that are incorrect, biased, or lacking the nuance and complexity required for best-practice clinical decision-making.
(3) There is a paucity of literature comparing the quality and reliability of AI-generated responses regarding management of pediatric supracondylar humerus fractures.
(4) In our study, both ChatGPT and Gemini produced responses that were well aligned with the AAOS current practice guidelines for pediatric supracondylar humerus fractures; however, Gemini outperformed ChatGPT across all criteria, with the greatest difference in scores seen in the evidence-based category.
Level II.