Özcivelek Tuğgen, Özcan Berna
Department of Prosthodontics, Gülhane Faculty of Dentistry, University of Health Sciences, Gen Dr Tevfik Saglam St. No:1 Kecioren, Ankara, Turkey.
BMC Oral Health. 2025 May 31;25(1):871. doi: 10.1186/s12903-025-06267-w.
Artificial intelligence chatbots have the potential to inform and guide patients by providing human-like responses to questions about dental and maxillofacial prostheses. Information regarding the accuracy and quality of these responses is limited. This in-silico study aimed to evaluate the accuracy, quality, readability, understandability, and actionability of responses from the DeepSeek-R1, ChatGPT-o1, ChatGPT-4, and Dental GPT chatbots.
The four chatbots were queried with 35 of the questions most frequently asked by patients about their prostheses. The accuracy, quality, and the understandability and actionability of the responses were assessed by two prosthodontists using a five-point Likert scale, the Global Quality Score, and the Patient Education Materials Assessment Tool for Printed Materials, respectively. Readability was scored with the Flesch-Kincaid Grade Level and Flesch Reading Ease. Inter-rater agreement was assessed with Cohen's kappa. Differences between chatbots were analyzed using the Kruskal-Wallis test, one-way ANOVA, and post-hoc tests.
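As an illustration of the evaluation workflow described above, the sketch below shows how the readability metrics, the inter-rater agreement, and the between-chatbot comparison could be computed in Python. The libraries (textstat, scikit-learn, scipy), the example responses, and the rater scores are assumptions chosen for illustration; they are not the tools or data reported by the study.

```python
# A minimal sketch of the scoring pipeline, assuming chatbot responses are
# available as plain-text strings. All inputs below are hypothetical.
import textstat
from sklearn.metrics import cohen_kappa_score
from scipy.stats import kruskal

# Hypothetical example data: responses from each chatbot.
responses = {
    "DeepSeek-R1": ["A removable denture should be cleaned daily with a soft brush."],
    "ChatGPT-o1":  ["Rinse your prosthesis after every meal and soak it overnight."],
    "ChatGPT-4":   ["Dentures require regular maintenance and periodic check-ups."],
    "Dental GPT":  ["Soak the prosthesis overnight in a non-abrasive denture cleanser."],
}

# Readability: Flesch-Kincaid Grade Level and Flesch Reading Ease per response.
readability = {
    bot: [(textstat.flesch_kincaid_grade(r), textstat.flesch_reading_ease(r))
          for r in texts]
    for bot, texts in responses.items()
}

# Inter-rater agreement on accuracy (five-point Likert scores from two raters;
# the scores here are invented for demonstration).
rater1 = [5, 4, 4, 3]
rater2 = [5, 4, 3, 3]
kappa = cohen_kappa_score(rater1, rater2, weights="quadratic")

# Compare a readability metric (here FRE) across chatbots with Kruskal-Wallis.
fre_per_bot = [[fre for _, fre in scores] for scores in readability.values()]
h_stat, p_value = kruskal(*fre_per_bot)
print(f"kappa={kappa:.2f}, Kruskal-Wallis H={h_stat:.2f}, p={p_value:.3f}")
```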
The chatbots differed significantly in accuracy and readability (p < .05). Dental GPT recorded the highest accuracy score, whereas ChatGPT-4 had the lowest. In readability, DeepSeek-R1 performed best, while Dental GPT performed worst. Quality, understandability, actionability, and reader education scores showed no significant differences.
Although accuracy varies among chatbots, the domain-specifically trained AI tool and ChatGPT-o1 demonstrated superior accuracy. Even when overall accuracy is high, misinformation in health care can have serious consequences. Enhancing the readability of responses is essential, and chatbots should be chosen accordingly. The accuracy and readability of information provided by chatbots should be monitored to safeguard public health.