Owens Otis L, Leonard Michael
College of Social Work, University of South Carolina, Columbia, SC, USA.
Am J Health Promot. 2025 Jun;39(5):766-776. doi: 10.1177/08901171251316371. Epub 2025 Jan 24.
Purpose: Artificially Intelligent (AI) chatbots have the potential to produce information that supports shared prostate cancer (PrCA) decision-making. Our purpose was therefore to evaluate and compare the accuracy, completeness, readability, and credibility of responses from standard and advanced versions of popular chatbots: ChatGPT-3.5, ChatGPT-4.0, Microsoft Copilot, Microsoft Copilot Pro, Google Gemini, and Google Gemini Advanced. We also investigated whether prompting chatbots for low-literacy PrCA information would improve the readability of responses. Lastly, we determined whether the responses were appropriate for African-American men, who have the worst PrCA outcomes.
Approach: The study used a cross-sectional approach to examine the quality of responses solicited from chatbots.
Participants: The study did not include human subjects.
Method: Eleven frequently asked PrCA questions, based on resources produced by the Centers for Disease Control and Prevention (CDC) and the American Cancer Society (ACS), were posed to each chatbot twice (once with a prompt tailored to low-literacy populations). A coding/rating form containing the questions and key points/answers from the ACS or CDC was used to facilitate the rating process. Accuracy and completeness were rated dichotomously (i.e., yes/no). Credibility was determined by whether a trustworthy medical or health-related organization was cited. Readability was determined using a Flesch-Kincaid readability score calculator into which chatbot responses were entered individually. Average accuracy, completeness, credibility, and readability percentages or scores were calculated in Excel.
Results: All chatbots were accurate, but the completeness, readability, and credibility of responses varied. Soliciting low-literacy responses significantly improved readability, but sometimes to the detriment of completeness. All chatbots recognized the higher PrCA risk among African-American men and tailored screening recommendations accordingly. Microsoft Copilot Pro had the best overall performance on standard screening questions, while Microsoft Copilot outperformed the other chatbots on responses for low-literacy populations.
Conclusions: AI chatbots are useful tools for learning about PrCA screening but should be combined with healthcare provider advice.
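Note: The abstract reports readability scoring with a Flesch-Kincaid calculator but does not identify the specific tool. The short Python sketch below is an illustrative, assumed implementation of the published Flesch formulas, not the authors' instrument; the syllable counter is a rough vowel-group heuristic, so its scores will differ slightly from dictionary-based calculators.

import re

def count_syllables(word: str) -> int:
    # Approximate syllables as groups of consecutive vowels (minimum 1),
    # dropping a common trailing silent 'e'. Heuristic only.
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and count > 1:
        count -= 1
    return max(count, 1)

def flesch_scores(text: str) -> tuple[float, float]:
    # Return (reading ease, grade level) using the published Flesch formulas.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    wps = len(words) / max(len(sentences), 1)   # words per sentence
    spw = syllables / max(len(words), 1)        # syllables per word
    reading_ease = 206.835 - 1.015 * wps - 84.6 * spw
    grade_level = 0.39 * wps + 11.8 * spw - 15.59
    return reading_ease, grade_level

if __name__ == "__main__":
    # Hypothetical chatbot response used purely to demonstrate the calculation.
    response = ("Prostate cancer screening usually starts with a PSA blood test. "
                "Talk with your doctor about the benefits and risks before deciding.")
    ease, grade = flesch_scores(response)
    print(f"Reading ease: {ease:.1f}, Grade level: {grade:.1f}")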