Pressman Sophia M, Borna Sahar, Gomez-Cabello Cesar A, Haider Syed Ali, Forte Antonio Jorge
Division of Plastic Surgery, Mayo Clinic, Jacksonville, FL 32224, USA.
Center for Digital Health, Mayo Clinic, Rochester, MN 55905, USA.
J Clin Med. 2024 May 11;13(10):2832. doi: 10.3390/jcm13102832.
Background: OpenAI's ChatGPT (San Francisco, CA, USA) and Google's Gemini (Mountain View, CA, USA) are two large language models that show promise in improving and expediting medical decision making in hand surgery. Evaluating the applications of these models within the field of hand surgery is warranted. This study aims to evaluate ChatGPT-4 and Gemini in classifying hand injuries and recommending treatment. Methods: Gemini and ChatGPT were each given the same 68 fictionalized clinical vignettes of hand injuries twice. The models were asked to use a specific classification system and recommend surgical or nonsurgical treatment. Classifications were scored based on correctness. Results were analyzed using descriptive statistics, a paired two-tailed t-test, and sensitivity testing. Results: Gemini, correctly classifying 70.6% of hand injuries, demonstrated superior classification ability over ChatGPT (mean score 1.46 vs. 0.87, p-value < 0.001). For management, ChatGPT demonstrated higher sensitivity in recommending surgical intervention compared to Gemini (98.0% vs. 88.8%), but lower specificity (68.4% vs. 94.7%). When compared to ChatGPT, Gemini demonstrated greater response replicability. Conclusions: Large language models like ChatGPT and Gemini show promise in assisting medical decision making, particularly in hand surgery, with Gemini generally outperforming ChatGPT. These findings emphasize the importance of considering the strengths and limitations of different models when integrating them into clinical practice.
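The analysis described above (a paired two-tailed t-test on classification scores and sensitivity/specificity for surgical-management recommendations) uses standard formulas; the short Python sketch below illustrates how such metrics are computed. All scores and confusion-table counts in the sketch are hypothetical placeholders for illustration only, not the study's data.

# Illustrative sketch of the statistics named in the abstract.
# All numbers below are hypothetical placeholders, NOT the study's data.
from scipy import stats

# Paired classification scores for the same vignettes from the two models
# (e.g., 0 = incorrect, 1 = partially correct, 2 = correct).
gemini_scores  = [2, 1, 2, 0, 2, 1, 2, 2]
chatgpt_scores = [1, 0, 2, 0, 1, 1, 0, 2]

# Paired two-tailed t-test comparing mean classification scores.
t_stat, p_value = stats.ttest_rel(gemini_scores, chatgpt_scores)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")

# Sensitivity and specificity for surgical-management recommendations,
# computed from a 2x2 confusion table (model recommendation vs. reference standard).
def sensitivity_specificity(tp, fn, tn, fp):
    sensitivity = tp / (tp + fn)   # true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    return sensitivity, specificity

sens, spec = sensitivity_specificity(tp=40, fn=5, tn=15, fp=8)
print(f"sensitivity = {sens:.1%}, specificity = {spec:.1%}")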