
Performance of Artificial Intelligence in Addressing Questions Regarding the Management of Pediatric Supracondylar Humerus Fractures.

Author Information

Milner John D, Quinn Matthew S, Schmitt Phillip, Knebel Ashley, Henstenburg Jeffrey, Nasreddine Adam, Boulos Alexandre R, Schiller Jonathan R, Eberson Craig P, Cruz Aristides I

Affiliations

Department of Orthopaedic Surgery, Brown University, Warren Alpert Medical School, Providence, RI, USA.

Division of Sports Medicine, Boston Children's Hospital, Boston, MA, USA.

Publication Information

J Pediatr Soc North Am. 2025 Mar 9;11:100164. doi: 10.1016/j.jposna.2025.100164. eCollection 2025 May.

Abstract

BACKGROUND

The vast accessibility of artificial intelligence (AI) has enabled its utilization in medicine to improve patient education, augment patient-physician communication, support research efforts, and enhance medical student education. However, there is significant concern that these models may provide responses that are incorrect, biased, or lacking the nuance and complexity required for best-practice clinical decision-making. Currently, there is a paucity of literature comparing the quality and reliability of AI-generated responses. The purpose of this study was to assess the ability of ChatGPT and Gemini to generate responses to prompts based on the 2022 American Academy of Orthopaedic Surgeons (AAOS) current practice guidelines on pediatric supracondylar humerus fractures. We hypothesized that both ChatGPT and Gemini would demonstrate high-quality, evidence-based responses with no significant difference between the models across evaluation criteria.

METHODS

The responses from ChatGPT and Gemini to prompts based on the 14 AAOS guidelines were evaluated by seven fellowship-trained pediatric orthopaedic surgeons using a questionnaire that rated five key characteristics on a scale from 1 to 5. The prompts were categorized into nonoperative or preoperative management and diagnosis; surgical timing and technique; and rehabilitation and prevention. Statistical analysis included mean scores, standard deviations, and two-sided t-tests comparing the performance of ChatGPT and Gemini. Scores were then evaluated for inter-rater reliability.
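To make the analysis concrete, below is a minimal Python sketch of the kind of pipeline the Methods describe: 1-to-5 ratings from seven raters on 14 prompts, summarized by mean and standard deviation, compared with a two-sided t-test, and checked for inter-rater agreement. The ratings here are randomly generated placeholders, not the study's data, and the abstract does not name the reliability statistic, so ICC(2,1) is used as an assumed stand-in.

# Minimal sketch (hypothetical data): summarizing 1-5 ratings, comparing
# models with a two-sided t-test, and estimating inter-rater reliability.
import numpy as np
from scipy import stats

N_PROMPTS, N_RATERS = 14, 7   # 14 guideline-based prompts, 7 surgeon raters
rng = np.random.default_rng(0)

# Placeholder ratings for one criterion, shape (prompts, raters); NOT study data.
chatgpt = rng.integers(2, 6, size=(N_PROMPTS, N_RATERS)).astype(float)
gemini = rng.integers(3, 6, size=(N_PROMPTS, N_RATERS)).astype(float)

# Mean and standard deviation per model, pooled over prompts and raters.
for name, scores in (("ChatGPT", chatgpt), ("Gemini", gemini)):
    print(f"{name}: mean = {scores.mean():.3f}, SD = {scores.std(ddof=1):.3f}")

# Two-sided t-test on per-prompt mean ratings (one mean per prompt).
t, p = stats.ttest_ind(chatgpt.mean(axis=1), gemini.mean(axis=1))
print(f"t = {t:.3f}, two-sided P = {p:.4f}")

def icc2_1(x):
    """ICC(2,1): two-way random effects, single rater, absolute agreement
    (Shrout & Fleiss). x has shape (n targets, k raters)."""
    n, k = x.shape
    grand = x.mean()
    ss_rows = k * ((x.mean(axis=1) - grand) ** 2).sum()   # between-prompt
    ss_cols = n * ((x.mean(axis=0) - grand) ** 2).sum()   # between-rater
    ss_err = ((x - grand) ** 2).sum() - ss_rows - ss_cols
    msr, msc = ss_rows / (n - 1), ss_cols / (k - 1)
    mse = ss_err / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

print(f"Inter-rater reliability, ICC(2,1): {icc2_1(chatgpt):.3f}")

A real replication would feed in the surgeons' actual per-criterion ratings and report whichever reliability statistic the authors used; Welch's correction (equal_var=False) is a common alternative when group variances differ.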

RESULTS

ChatGPT and Gemini demonstrated consistent performance, with high mean scores on all criteria except evidence-based responses. Mean scores were highest for clarity (ChatGPT: 3.745 ± 0.237; Gemini: 4.388 ± 0.154) and lowest for evidence-based responses (ChatGPT: 1.816 ± 0.181; Gemini: 3.765 ± 0.229). There was a statistically significant difference overall, with Gemini having higher mean scores (P < .001). By criterion, Gemini achieved statistically higher ratings for relevance (P = .03) and evidence-based responses (P < .001), while both large language models (LLMs) performed comparably on accuracy, clarity, and completeness (P > .05).

CONCLUSIONS

ChatGPT and Gemini produced responses aligned with the 2022 AAOS current practice guidelines for pediatric supracondylar humerus fractures. Gemini outperformed ChatGPT across all criteria, with the greatest difference in scores seen in the evidence-based category. This study highlights the potential for LLMs, particularly Gemini, to provide pertinent clinical information for managing pediatric supracondylar humerus fractures.

KEY CONCEPTS

(1) The accessibility of artificial intelligence has enabled its utilization in medicine to improve patient education, support research efforts, enhance medical student education, and augment patient-physician communication.
(2) There is significant concern that artificial intelligence may provide responses that are incorrect, biased, or lacking the nuance and complexity required for best-practice clinical decision-making.
(3) There is a paucity of literature comparing the quality and reliability of AI-generated responses regarding the management of pediatric supracondylar humerus fractures.
(4) In our study, both ChatGPT and Gemini produced responses that were well aligned with the AAOS current practice guidelines for pediatric supracondylar humerus fractures; however, Gemini outperformed ChatGPT across all criteria, with the greatest difference in scores seen in the evidence-based category.

LEVEL OF EVIDENCE

Level II.



