Marcaccini Gianluca, Seth Ishith, Xie Yi, Susini Pietro, Pozzi Mirco, Cuomo Roberto, Rozen Warren M
Plastic Surgery Unit, Department of Medicine, Surgery and Neuroscience, University of Siena, 53100 Siena, Italy.
Department of Plastic and Reconstructive Surgery, Peninsula Health, Frankston, VIC 3199, Australia.
J Clin Med. 2025 Mar 14;14(6):1983. doi: 10.3390/jcm14061983.
Background: Hand fracture management requires precise diagnostic accuracy and complex decision-making. Advances in artificial intelligence (AI) suggest that large language models (LLMs) may assist or even rival traditional clinical approaches. This study evaluates the effectiveness of ChatGPT-4o, DeepSeek-V3, and Gemini 1.5 in diagnosing and recommending treatment strategies for hand fractures compared with experienced surgeons.
Methods: A retrospective analysis of 58 anonymized hand fracture cases was conducted. Clinical details, including fracture site, displacement, and soft-tissue involvement, were provided to the AI models, which generated management plans. Their recommendations were compared to actual surgeon decisions, assessing accuracy, precision, recall, and F1 score.
Results: ChatGPT-4o demonstrated the highest accuracy (98.28%) and recall (91.74%), effectively identifying most correct interventions but occasionally proposing extraneous options (precision 58.48%). DeepSeek-V3 showed moderate accuracy (63.79%), with balanced precision (61.17%) and recall (57.89%), sometimes omitting correct treatments. Gemini 1.5 performed poorly (accuracy 18.97%), with low precision and recall, indicating substantial limitations in clinical decision support.
Conclusions: AI models can enhance clinical workflows, particularly in radiographic interpretation and triage, but their limitations highlight the irreplaceable role of human expertise in complex hand trauma management. ChatGPT-4o demonstrated promising accuracy but requires refinement. Ethical concerns regarding AI-driven medical decisions, including bias and transparency, must be addressed before widespread clinical implementation.
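The precision/recall pattern reported above (high recall with lower precision for ChatGPT-4o, i.e. extra recommendations alongside the correct ones) can be made concrete with a small sketch. The scoring below treats each case as a set comparison between model-recommended interventions and the surgeon's actual decisions; the case data and the `prf1` helper are hypothetical illustrations, not the study's actual evaluation code.

```python
def prf1(recommended: set, actual: set) -> tuple:
    """Per-case precision, recall, and F1, treating each
    intervention as a label and comparing the two sets."""
    tp = len(recommended & actual)  # interventions the model got right
    precision = tp / len(recommended) if recommended else 0.0
    recall = tp / len(actual) if actual else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical case: the model proposes three options, the surgeon
# performed two of them. Recall is perfect, but the extraneous
# option lowers precision, mirroring the ChatGPT-4o pattern.
p, r, f = prf1({"ORIF", "K-wire fixation", "splinting"},
               {"ORIF", "K-wire fixation"})
```

Averaging such per-case scores over all 58 cases would yield study-level figures like those reported in the Results.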