Evaluating chain-of-thought prompting in a GPT chatbot for BCID2 interpretation and stewardship: how does AI compare to human experts?

Authors

Tassone Daniel M, Hitchcock Matthew M, Rossier Connor J, Fletcher Douglas, Ye Julia, Langford Ian, Boatman Julie, Markley J Daniel

Affiliations

Division of Infectious Diseases, Department of Medicine, Central Virginia VA Health Care System, Richmond, VA, USA.

Virginia Commonwealth University, School of Pharmacy, Richmond, VA, USA.

Publication information

Antimicrob Steward Healthc Epidemiol. 2025 Jul 11;5(1):e154. doi: 10.1017/ash.2025.10059. eCollection 2025.

DOI:10.1017/ash.2025.10059

PMID:40657035

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12247004/

Abstract

BACKGROUND

Rapid molecular diagnostics, such as the BIOFIRE® Blood Culture Identification 2 (BCID2) panel, have improved the time to pathogen identification in bloodstream infections. However, accurate interpretation and antimicrobial optimization require Infectious Disease (ID) expertise, which may not always be readily available. GPT-powered chatbots could support antimicrobial stewardship programs (ASPs) by assisting non-specialist providers in BCID2 result interpretation and treatment recommendations. This study evaluates the performance of a GPT-4 chatbot compared to ASP prospective audit and feedback interventions.

METHODS

This prospective observational study assessed 43 consecutive real-world cases of bacteremia at a 399-bed VA Medical Center from January to May 2024. The GPT-chatbot utilized "chain-of-thought" prompting and external knowledge integration to generate recommendations. Two independent ID physicians evaluated chatbot and ASP recommendations across four domains: BCID2 interpretation, source control, antibiotic therapy, and additional diagnostic workup. The primary endpoint was the combined rate of harmful or inadequate recommendations. Secondary endpoints assessed the rate of harmful or inadequate responses for each domain.
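The endpoint definitions above can be illustrated with a minimal sketch. It assumes each reviewer rating is recorded as a (domain, rating) pair and that a rating counts toward the endpoint when it is labeled "harmful" or "inadequate". The domain keys, rating labels, function names, and toy ratings are all hypothetical and are not study data; the abstract does not state how ratings were encoded or what denominator was used.

```python
# Hypothetical domain keys mirroring the four domains named in the Methods.
DOMAINS = (
    "bcid2_interpretation",
    "source_control",
    "antibiotic_therapy",
    "diagnostic_workup",
)
# Assumed rating labels that count toward the endpoints.
FLAGGED = {"harmful", "inadequate"}


def flagged_rate(ratings):
    """Return the fraction of ratings labeled harmful or inadequate."""
    ratings = list(ratings)
    if not ratings:
        raise ValueError("no ratings supplied")
    return sum(r in FLAGGED for r in ratings) / len(ratings)


def summarize(evaluations):
    """Return (combined rate, per-domain rates) for (domain, rating) pairs."""
    evaluations = list(evaluations)
    combined = flagged_rate(r for _, r in evaluations)
    per_domain = {
        d: flagged_rate(r for dom, r in evaluations if dom == d)
        for d in DOMAINS
    }
    return combined, per_domain


# Toy ratings for illustration only; these are NOT the study's data.
toy = [
    ("bcid2_interpretation", "adequate"),
    ("bcid2_interpretation", "adequate"),
    ("source_control", "inadequate"),
    ("source_control", "adequate"),
    ("antibiotic_therapy", "harmful"),
    ("antibiotic_therapy", "adequate"),
    ("diagnostic_workup", "adequate"),
    ("diagnostic_workup", "adequate"),
]
combined, per_domain = summarize(toy)
print(combined, per_domain)
```

In this sketch the primary endpoint corresponds to `combined` and the secondary endpoints to the entries of `per_domain`; computing the same summary separately for the chatbot and ASP ratings would give the two rates being compared.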

RESULTS

The chatbot had a significantly higher rate of harmful or inadequate recommendations (13%) compared to ASP (4%, P = 0.047). The most significant discrepancy was observed in the domain of antibiotic therapy, where harmful recommendations occurred in up to 10% (P < 0.05) of chatbot evaluations. The chatbot performed well in BCID2 interpretation (100% accuracy) but provided more inadequate responses in source control consideration (10% vs. 2% for ASP, P = 0.022).

CONCLUSIONS

GPT-powered chatbots show potential for supporting antimicrobial stewardship but should only complement, not replace, human expertise in infectious disease management.
