Evaluation and mitigation of the limitations of large language models in clinical decision-making.

Author information

Institute for AI and Informatics, Klinikum rechts der Isar, Technical University of Munich, Munich, Germany.

Institute for Diagnostic and Interventional Radiology, Klinikum rechts der Isar, Technical University of Munich, Munich, Germany.

Publication information

Nat Med. 2024 Sep;30(9):2613-2622. doi: 10.1038/s41591-024-03097-1. Epub 2024 Jul 4.

Abstract

Clinical decision-making is one of the most impactful parts of a physician's responsibilities and stands to benefit greatly from artificial intelligence solutions and large language models (LLMs) in particular. However, while LLMs have achieved excellent performance on medical licensing exams, these tests fail to assess many skills necessary for deployment in a realistic clinical decision-making environment, including gathering information, adhering to guidelines, and integrating into clinical workflows. Here we have created a curated dataset based on the Medical Information Mart for Intensive Care database spanning 2,400 real patient cases and four common abdominal pathologies as well as a framework to simulate a realistic clinical setting. We show that current state-of-the-art LLMs do not accurately diagnose patients across all pathologies (performing significantly worse than physicians), follow neither diagnostic nor treatment guidelines, and cannot interpret laboratory results, thus posing a serious risk to the health of patients. Furthermore, we move beyond diagnostic accuracy and demonstrate that they cannot be easily integrated into existing workflows because they often fail to follow instructions and are sensitive to both the quantity and order of information. Overall, our analysis reveals that LLMs are currently not ready for autonomous clinical decision-making while providing a dataset and framework to guide future studies.

Fig. 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/525a/11405275/528f19ba459f/41591_2024_3097_Fig1_HTML.jpg
