Othman Ahmad A, Sharqawi Abdulwadood J, MohammedAziz Ahmed A, Ali Wafaa A, Alatiyyah Amjad A, Mirah Mahir A
Department of Oral and Maxillofacial Diagnostic Sciences, College of Dentistry, Taibah University, Al-Madinah Al-Munawwarah 42353, Saudi Arabia.
Department of Preventive Dental Sciences, College of Dentistry, Taibah University, Al-Madinah Al-Munawwarah 42353, Saudi Arabia.
Healthcare (Basel). 2025 Aug 28;13(17):2144. doi: 10.3390/healthcare13172144.
The rapid advancement of artificial intelligence (AI) in healthcare has opened new opportunities, yet the clinical validation of AI tools in dentistry remains limited. This study aimed to assess the performance of ChatGPT in generating accurate and complete responses to academic dental questions across multiple specialties, comparing the capabilities of GPT-4 and GPT-3.5 models. A panel of academic specialists from eight dental specialties collaboratively developed 48 clinical questions, classified by consensus as easy, medium, or hard, and as requiring either binary (yes/no) or descriptive responses. Each question was sequentially entered into both GPT-4 and GPT-3.5 models, with instructions to provide guideline-based answers. The AI-generated responses were independently evaluated by the specialists for accuracy (6-point Likert scale) and completeness (3-point Likert scale). Descriptive and inferential statistics were applied, including Mann-Whitney U and Kruskal-Wallis tests, with significance set at p < 0.05. GPT-4 consistently outperformed GPT-3.5 in both evaluation domains. The median accuracy score was 6.0 for GPT-4 and 5.0 for GPT-3.5 (p = 0.02), while the median completeness score was 3.0 for GPT-4 and 2.0 for GPT-3.5 (p < 0.001). GPT-4 demonstrated significantly higher overall accuracy (5.29 ± 1.1) and completeness (2.44 ± 0.71) compared to GPT-3.5 (4.5 ± 1.7 and 1.69 ± 0.62, respectively; p = 0.024 and p < 0.001). When stratified by specialty, notable improvements with GPT-4 were observed in Periodontology, Endodontics, Implantology, and Oral Surgery, particularly in completeness scores. In academic dental settings, GPT-4 provided more accurate and complete responses than GPT-3.5. Despite both models showing potential, their clinical application should remain supervised by human experts.
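To make the abstract's analysis concrete, here is a minimal sketch of the nonparametric tests it names, written in Python with scipy. The rating arrays, group sizes, and variable names are hypothetical placeholders, not the study's actual data; the study used specialist ratings of 48 questions on 6-point (accuracy) and 3-point (completeness) Likert scales.

```python
# A minimal sketch of the abstract's statistical comparison, using hypothetical
# per-question Likert ratings. All numbers below are illustrative, not the
# study's dataset.
import numpy as np
from scipy.stats import mannwhitneyu, kruskal

# Hypothetical accuracy ratings (6-point Likert) for the same questions,
# one rating per question per model.
gpt4_accuracy = np.array([6, 6, 5, 6, 4, 6, 5, 6, 6, 5, 6, 6])
gpt35_accuracy = np.array([5, 4, 5, 3, 4, 6, 5, 2, 4, 5, 6, 4])

# Mann-Whitney U compares the two models' rating distributions; the abstract
# reports p = 0.02 for median accuracy (6.0 vs. 5.0).
u_stat, p_mw = mannwhitneyu(gpt4_accuracy, gpt35_accuracy, alternative="two-sided")
print(f"Mann-Whitney U = {u_stat:.1f}, p = {p_mw:.3f}")

# Kruskal-Wallis compares ratings across more than two groups, e.g. the
# easy / medium / hard difficulty strata used in the study design.
easy = np.array([6, 6, 5, 6])
medium = np.array([5, 5, 6, 4])
hard = np.array([4, 5, 3, 4])
h_stat, p_kw = kruskal(easy, medium, hard)
print(f"Kruskal-Wallis H = {h_stat:.2f}, p = {p_kw:.3f}")
```

Nonparametric tests are the natural choice here because ordinal Likert ratings do not satisfy the distributional assumptions of t-tests or ANOVA.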