

Data Set and Benchmark (MedGPTEval) to Evaluate Responses From Large Language Models in Medicine: Evaluation Development and Validation.

Authors

Xu Jie, Lu Lu, Peng Xinwei, Pang Jiali, Ding Jinru, Yang Lingrui, Song Huan, Li Kang, Sun Xin, Zhang Shaoting

Affiliations

Shanghai Artificial Intelligence Laboratory, OpenMedLab, Shanghai, China.

Clinical Research and Innovation Unit, Xinhua Hospital Affiliated to Shanghai Jiao Tong University School of Medicine, Shanghai, China.

Publication

JMIR Med Inform. 2024 Jun 28;12:e57674. doi: 10.2196/57674.

Abstract

BACKGROUND

Large language models (LLMs) have achieved great progress in natural language processing tasks and demonstrated potential for use in clinical applications. Despite these capabilities, LLMs in the medical domain are prone to generating hallucinations (responses that are not fully reliable). Hallucinations in LLMs' responses create substantial risks, potentially threatening patients' physical safety. Thus, to detect and prevent this safety risk, it is essential to evaluate LLMs in the medical domain and to build a systematic evaluation framework.

OBJECTIVE

We developed a comprehensive evaluation system, MedGPTEval, composed of criteria, medical data sets in Chinese, and publicly available benchmarks.

METHODS

First, a set of evaluation criteria was designed based on a comprehensive literature review. Second, the candidate criteria were optimized using a Delphi method with 5 experts in medicine and engineering. Third, 3 clinical experts designed medical data sets for interacting with LLMs. Finally, benchmarking experiments were conducted on the data sets. The responses generated by chatbots based on LLMs were recorded for blind evaluation by 5 licensed medical experts. The resulting evaluation criteria covered medical professional capabilities, social comprehensive capabilities, contextual capabilities, and computational robustness, with 16 detailed indicators. The medical data sets included 27 medical dialogues and 7 case reports in Chinese. Three chatbots were evaluated: ChatGPT by OpenAI; ERNIE Bot by Baidu, Inc; and Doctor PuJiang (Dr PJ) by Shanghai Artificial Intelligence Laboratory.
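The blind-evaluation setup described above could be organized roughly as follows. This is a hypothetical sketch for illustration only: the specific category names are taken from the abstract, but the rating scale, the example scores, and the averaging scheme are assumptions, not the paper's published protocol.

```python
from statistics import mean

# Hypothetical: each of 5 blinded medical experts scores one chatbot
# response on the 4 capability categories named in the abstract
# (scores here are made-up examples on an assumed 1-5 scale).
ratings = [
    {"medical": 4, "social": 5, "contextual": 4, "robustness": 3},
    {"medical": 5, "social": 4, "contextual": 4, "robustness": 4},
    {"medical": 4, "social": 4, "contextual": 5, "robustness": 4},
    {"medical": 3, "social": 5, "contextual": 4, "robustness": 4},
    {"medical": 4, "social": 4, "contextual": 4, "robustness": 3},
]

def category_means(ratings):
    """Average each category's score across all blinded raters."""
    categories = ratings[0].keys()
    return {c: mean(r[c] for r in ratings) for c in categories}

print(category_means(ratings))
```

Averaging across raters in this way is one simple way to compare chatbots per capability category; the paper's actual aggregation may differ.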

RESULTS

Dr PJ outperformed ChatGPT and ERNIE Bot in the multiple-turn medical dialogue and case report scenarios. Dr PJ also outperformed ChatGPT in the semantic consistency rate and complete error rate categories, indicating better robustness. However, Dr PJ scored slightly lower than ChatGPT on medical professional capabilities in the multiple-turn dialogue scenario.
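The two robustness measures mentioned above might be computed along these lines. The definitions below are assumptions for illustration (rate of rater-judged consistent responses and rate of entirely incorrect responses), not the paper's exact formulas, and the label names are hypothetical.

```python
# Hypothetical robustness metrics over a set of responses to repeated or
# paraphrased queries; each response carries a human-assigned label.
def semantic_consistency_rate(labels):
    """Fraction of responses judged semantically consistent with the reference answer."""
    return sum(1 for label in labels if label == "consistent") / len(labels)

def complete_error_rate(labels):
    """Fraction of responses judged entirely incorrect."""
    return sum(1 for label in labels if label == "complete_error") / len(labels)

labels = ["consistent", "consistent", "partial", "complete_error", "consistent"]
print(semantic_consistency_rate(labels))  # 0.6
print(complete_error_rate(labels))        # 0.2
```

Under these definitions, a higher consistency rate and a lower complete error rate would both indicate better computational robustness.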

CONCLUSIONS

MedGPTEval provides comprehensive criteria for evaluating LLM-based chatbots in the medical domain, open-source data sets, and benchmarks assessing 3 LLMs. Experimental results demonstrate that Dr PJ outperforms ChatGPT and ERNIE Bot in social and professional contexts. Therefore, this assessment system can be easily adopted by researchers in the community to augment the open-source data set.


Similar Articles

Large Language Models and Empathy: Systematic Review.
J Med Internet Res. 2024 Dec 11;26:e52597. doi: 10.2196/52597.

Stench of Errors or the Shine of Potential: The Challenge of (Ir)Responsible Use of ChatGPT in Speech-Language Pathology.
Int J Lang Commun Disord. 2025 Jul-Aug;60(4):e70088. doi: 10.1111/1460-6984.70088.

Examining the Role of Large Language Models in Orthopedics: Systematic Review.
J Med Internet Res. 2024 Nov 15;26:e59607. doi: 10.2196/59607.

