Xu Jie, Lu Lu, Peng Xinwei, Pang Jiali, Ding Jinru, Yang Lingrui, Song Huan, Li Kang, Sun Xin, Zhang Shaoting
Shanghai Artificial Intelligence Laboratory, OpenMedLab, Shanghai, China.
Clinical Research and Innovation Unit, Xinhua Hospital Affiliated to Shanghai Jiao Tong University School of Medicine, Shanghai, China.
JMIR Med Inform. 2024 Jun 28;12:e57674. doi: 10.2196/57674.
Large language models (LLMs) have made substantial progress on natural language processing tasks and show potential for clinical applications. Despite these capabilities, LLMs in the medical domain are prone to generating hallucinations (responses that are not fully reliable). Hallucinations in LLMs' responses create substantial risks, potentially threatening patients' physical safety. Thus, to detect and prevent this safety risk, it is essential to evaluate LLMs in the medical domain and to build a systematic evaluation framework.
We developed MedGPTEval, a comprehensive evaluation system composed of evaluation criteria, medical data sets in Chinese, and publicly available benchmarks.
First, a set of evaluation criteria was designed based on a comprehensive literature review. Second, the candidate criteria were optimized using a Delphi method with 5 experts in medicine and engineering. Third, 3 clinical experts designed medical data sets for interacting with the LLMs. Finally, benchmarking experiments were conducted on these data sets, and the responses generated by the LLM-based chatbots were recorded for blind evaluation by 5 licensed medical experts. The resulting evaluation criteria cover medical professional capabilities, social comprehensive capabilities, contextual capabilities, and computational robustness, with 16 detailed indicators. The medical data sets include 27 medical dialogues and 7 case reports in Chinese. Three chatbots were evaluated: ChatGPT by OpenAI; ERNIE Bot by Baidu, Inc; and Doctor PuJiang (Dr PJ) by Shanghai Artificial Intelligence Laboratory.
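The abstract does not specify how the 5 blinded experts' ratings were scored or aggregated across the 4 capability dimensions; the following Python sketch illustrates only one plausible bookkeeping scheme, and every name in it (the dimension labels, the rating structure, the use of a simple mean) is an assumption rather than the authors' protocol.

from collections import defaultdict
from statistics import mean

# Capability dimensions named in the abstract; the 16 indicator-level details are omitted here.
DIMENSIONS = [
    "medical_professional",
    "social_comprehensive",
    "contextual",
    "computational_robustness",
]

def aggregate_ratings(ratings):
    """ratings: iterable of dicts such as
    {"chatbot": "Dr PJ", "expert": "E1", "dimension": "contextual", "score": 4}.
    Returns {chatbot: {dimension: mean score across blinded experts}}."""
    pooled = defaultdict(list)
    for r in ratings:
        pooled[(r["chatbot"], r["dimension"])].append(r["score"])
    summary = defaultdict(dict)
    for (bot, dim), scores in pooled.items():
        summary[bot][dim] = mean(scores)
    return dict(summary)

# Toy example with made-up scores (not results from the study):
demo = [
    {"chatbot": "Dr PJ", "expert": "E1", "dimension": "contextual", "score": 4},
    {"chatbot": "Dr PJ", "expert": "E2", "dimension": "contextual", "score": 5},
    {"chatbot": "ChatGPT", "expert": "E1", "dimension": "contextual", "score": 3},
]
print(aggregate_ratings(demo))  # {'Dr PJ': {'contextual': 4.5}, 'ChatGPT': {'contextual': 3}}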
Dr PJ outperformed ChatGPT and ERNIE Bot in both the multiple-turn medical dialogue and the case report scenarios. Dr PJ also outperformed ChatGPT on the semantic consistency rate and the complete error rate, indicating better robustness. However, Dr PJ scored slightly lower than ChatGPT on medical professional capabilities in the multiple-turn dialogue scenario.
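The results name a semantic consistency rate and a complete error rate as robustness measures but do not give their formulas. Below is a minimal sketch of how such rates could be computed from expert judgments, assuming the former is the share of paraphrased-prompt groups judged semantically consistent and the latter the share of responses judged entirely incorrect; both definitions are assumptions, not the paper's stated metrics.

from typing import List

def semantic_consistency_rate(consistent_flags: List[bool]) -> float:
    """Assumed definition: fraction of paraphrased-prompt groups whose answers
    the reviewers judged semantically consistent."""
    return sum(consistent_flags) / len(consistent_flags)

def complete_error_rate(error_flags: List[bool]) -> float:
    """Assumed definition: fraction of responses the reviewers judged completely wrong."""
    return sum(error_flags) / len(error_flags)

# Toy judgments for one chatbot (illustrative values, not study data):
print(semantic_consistency_rate([True, True, False, True]))  # 0.75
print(complete_error_rate([False, False, True, False]))      # 0.25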
MedGPTEval provides comprehensive criteria for evaluating LLM-based chatbots in the medical domain, open-source data sets, and benchmarks assessing 3 LLMs. The experimental results demonstrate that Dr PJ outperforms ChatGPT and ERNIE Bot in both social and professional contexts. This assessment system can therefore be readily adopted by researchers in the community and used to augment the open-source data sets.