

Data Set and Benchmark (MedGPTEval) to Evaluate Responses From Large Language Models in Medicine: Evaluation Development and Validation.

Authors

Xu Jie, Lu Lu, Peng Xinwei, Pang Jiali, Ding Jinru, Yang Lingrui, Song Huan, Li Kang, Sun Xin, Zhang Shaoting

Affiliations

Shanghai Artificial Intelligence Laboratory, OpenMedLab, Shanghai, China.

Clinical Research and Innovation Unit, Xinhua Hospital Affiliated to Shanghai Jiao Tong University School of Medicine, Shanghai, China.

Publication

JMIR Med Inform. 2024 Jun 28;12:e57674. doi: 10.2196/57674.

Abstract

BACKGROUND

Large language models (LLMs) have achieved great progress in natural language processing tasks and demonstrated potential for use in clinical applications. Despite these capabilities, LLMs in the medical domain are prone to generating hallucinations (responses that are not fully reliable). Hallucinations in LLMs' responses create substantial risks, potentially threatening patients' physical safety. Thus, to detect and prevent this safety risk, it is essential to evaluate LLMs in the medical domain and to build a systematic evaluation framework.

OBJECTIVE

We developed a comprehensive evaluation system, MedGPTEval, composed of criteria, medical data sets in Chinese, and publicly available benchmarks.

METHODS

First, a set of evaluation criteria was designed based on a comprehensive literature review. Second, the candidate criteria were optimized using a Delphi method with 5 experts in medicine and engineering. Third, 3 clinical experts designed medical data sets for interacting with LLMs. Finally, benchmarking experiments were conducted on the data sets. The responses generated by chatbots based on LLMs were recorded for blind evaluation by 5 licensed medical experts. The resulting evaluation criteria covered medical professional capabilities, social comprehensive capabilities, contextual capabilities, and computational robustness, with 16 detailed indicators. The medical data sets included 27 medical dialogues and 7 case reports in Chinese. Three chatbots were evaluated: ChatGPT by OpenAI; ERNIE Bot by Baidu, Inc; and Doctor PuJiang (Dr PJ) by Shanghai Artificial Intelligence Laboratory.
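The blind-evaluation setup described above could be organized roughly as follows. This is a hypothetical sketch for illustration only: the specific category names are taken from the abstract, but the rating scale, the example scores, and the averaging scheme are assumptions, not the paper's published protocol.

```python
from statistics import mean

# Hypothetical: each of 5 blinded medical experts scores one chatbot
# response on the 4 capability categories named in the abstract
# (scores here are made-up examples on an assumed 1-5 scale).
ratings = [
    {"medical": 4, "social": 5, "contextual": 4, "robustness": 3},
    {"medical": 5, "social": 4, "contextual": 4, "robustness": 4},
    {"medical": 4, "social": 4, "contextual": 5, "robustness": 4},
    {"medical": 3, "social": 5, "contextual": 4, "robustness": 4},
    {"medical": 4, "social": 4, "contextual": 4, "robustness": 3},
]

def category_means(ratings):
    """Average each category's score across all blinded raters."""
    categories = ratings[0].keys()
    return {c: mean(r[c] for r in ratings) for c in categories}

print(category_means(ratings))
```

Averaging across raters in this way is one simple way to compare chatbots per capability category; the paper's actual aggregation may differ.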

RESULTS

Dr PJ outperformed ChatGPT and ERNIE Bot in the multiple-turn medical dialogue and case report scenarios. Dr PJ also outperformed ChatGPT in the semantic consistency rate and complete error rate categories, indicating better robustness. However, Dr PJ scored slightly lower than ChatGPT on medical professional capabilities in the multiple-turn dialogue scenario.
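The two robustness measures mentioned above might be computed along these lines. The definitions below are assumptions for illustration (rate of rater-judged consistent responses and rate of entirely incorrect responses), not the paper's exact formulas, and the label names are hypothetical.

```python
# Hypothetical robustness metrics over a set of responses to repeated or
# paraphrased queries; each response carries a human-assigned label.
def semantic_consistency_rate(labels):
    """Fraction of responses judged semantically consistent with the reference answer."""
    return sum(1 for label in labels if label == "consistent") / len(labels)

def complete_error_rate(labels):
    """Fraction of responses judged entirely incorrect."""
    return sum(1 for label in labels if label == "complete_error") / len(labels)

labels = ["consistent", "consistent", "partial", "complete_error", "consistent"]
print(semantic_consistency_rate(labels))  # 0.6
print(complete_error_rate(labels))        # 0.2
```

Under these definitions, a higher consistency rate and a lower complete error rate would both indicate better computational robustness.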

CONCLUSIONS

MedGPTEval provides comprehensive criteria for evaluating LLM-based chatbots in the medical domain, open-source data sets, and benchmarks assessing 3 LLMs. Experimental results demonstrate that Dr PJ outperforms ChatGPT and ERNIE Bot in social and professional contexts. Therefore, this assessment system can be easily adopted by researchers in the community to augment the open-source data set.


Similar Articles

Large Language Models and Empathy: Systematic Review.
J Med Internet Res. 2024 Dec 11;26:e52597. doi: 10.2196/52597.

Stench of Errors or the Shine of Potential: The Challenge of (Ir)Responsible Use of ChatGPT in Speech-Language Pathology.
Int J Lang Commun Disord. 2025 Jul-Aug;60(4):e70088. doi: 10.1111/1460-6984.70088.

Examining the Role of Large Language Models in Orthopedics: Systematic Review.
J Med Internet Res. 2024 Nov 15;26:e59607. doi: 10.2196/59607.

