Evaluating Hospital Course Summarization by an Electronic Health Record-Based Large Language Model.

Authors

Small William R, Austrian Jonathan, O'Donnell Luke, Burk-Rafel Jesse, Hochman Katherine A, Goodman Adam, Zaretsky Jonah, Martin Jacob, Johnson Stephen, Major Vincent J, Jones Simon, Henke Christian, Verplanke Benjamin, Osso Jwan, Larson Ian, Saxena Archana, Mednick Aron, Simonis Choumika, Han Joseph, Kesari Ravi, Wu Xinyuan, Heery Lauren, Desel Tenzin, Baskharoun Samuel, Figman Noah, Farooq Umar, Shah Kunal, Jahan Nusrat, Kim Jeong Min, Testa Paul, Feldman Jonah

Affiliations

Department of Health Informatics, New York University Langone Medical Center Information Technology.

Department of Medicine, New York University Grossman School of Medicine.

Publication

JAMA Netw Open. 2025 Aug 1;8(8):e2526339. doi: 10.1001/jamanetworkopen.2025.26339.

Abstract

IMPORTANCE

Hospital course (HC) summarization represents an increasingly onerous discharge summary component for physicians. Literature supports large language models (LLMs) for HC summarization, but whether physicians can effectively partner with electronic health record-embedded LLMs to draft HCs is unknown.

OBJECTIVE

To compare the editing effort required by time-constrained resident physicians to improve LLM- vs physician-generated HCs toward a novel 4Cs (complete, concise, cohesive, and confabulation-free) HC.

DESIGN, SETTING, AND PARTICIPANTS

Quality improvement study using a convenience sample of 10 internal medicine resident editors, 8 hospitalist evaluators, and randomly selected general medicine admissions in December 2023 lasting 4 to 8 days at New York University Langone Health.

EXPOSURES

Residents and hospitalists reviewed randomly assigned patient medical records for 10 minutes. Residents, blinded to author type, edited each HC pair (physician and LLM) for quality in 3 minutes, followed by comparative ratings by attending hospitalists.

MAIN OUTCOMES AND MEASURES

Editing effort was quantified by analyzing the edits that occurred on the HC pairs after controlling for length (percentage edited) and the degree to which the original HCs' meaning was altered (semantic change). Hospitalists compared edited HC pairs with A/B testing on the 4Cs (5-point Likert scales converted to 10-point bidirectional scales).
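As a rough illustration of how a length-controlled "percentage edited" metric might be computed (the abstract does not specify the exact method; this sketch assumes a word-token alignment using Python's difflib):

```python
import difflib

def percent_edited(original: str, edited: str) -> float:
    """Share of the original's word tokens not preserved in the edited
    version, via difflib's longest-matching-blocks alignment.
    0.0 = untouched, 100.0 = fully rewritten."""
    a, b = original.split(), edited.split()
    matcher = difflib.SequenceMatcher(a=a, b=b)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return 100.0 * (1.0 - matched / max(len(a), 1))

# Lightly edited sentence: 2 of 10 original word tokens are touched
before = "Patient admitted with chest pain and ruled out for MI"
after = "Patient admitted with chest pain, ruled out for MI"
print(round(percent_edited(before, after), 1))  # 20.0
```

Normalizing by the original's length lets edit burden be compared across HCs of different sizes, which is the point of "controlling for length" above.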

RESULTS

Among 100 admissions, compared with physician HCs, residents edited a smaller percentage of LLM HCs (LLM mean [SD], 31.5% [16.6%] vs physicians, 44.8% [20.0%]; P < .001). Additionally, LLM HCs required less semantic change (LLM mean [SD], 2.4% [1.6%] vs physicians, 4.9% [3.5%]; P < .001). Attending physicians deemed LLM HCs to be more complete (mean [SD] difference LLM vs physicians on 10-point bidirectional scale, 3.00 [5.28]; P < .001), similarly concise (mean [SD], -1.02 [6.08]; P = .20), and cohesive (mean [SD], 0.70 [6.14]; P = .60), but with more confabulations (mean [SD], -0.98 [3.53]; P = .002). The composite scores were similar (mean [SD] difference LLM vs physician on 40-point bidirectional scale, 1.70 [14.24]; P = .46).
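A quick consistency check on the reported figures: assuming the 40-point composite difference is the sum of the four 10-point subscale differences (an inference from the scale arithmetic, not stated explicitly above), the numbers line up:

```python
# Reported mean differences (LLM minus physician) on the four 10-point subscales
complete, concise, cohesive, confabulation_free = 3.00, -1.02, 0.70, -0.98

# Summing the subscales reproduces the reported 40-point composite difference
composite = complete + concise + cohesive + confabulation_free
print(round(composite, 2))  # 1.7
```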

CONCLUSIONS AND RELEVANCE

Electronic health record-embedded LLM HCs required less editing than physician-generated HCs to approach a quality standard, resulting in HCs that were comparably or more complete, concise, and cohesive, but contained more confabulations. Despite the potential influence of artificial time constraints, this study supports the feasibility of a physician-LLM partnership for writing HCs and provides a basis for monitoring LLM HCs in clinical practice.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f4ba/12351420/5aefe712c750/jamanetwopen-e2526339-g001.jpg
