Small William R, Austrian Jonathan, O'Donnell Luke, Burk-Rafel Jesse, Hochman Katherine A, Goodman Adam, Zaretsky Jonah, Martin Jacob, Johnson Stephen, Major Vincent J, Jones Simon, Henke Christian, Verplanke Benjamin, Osso Jwan, Larson Ian, Saxena Archana, Mednick Aron, Simonis Choumika, Han Joseph, Kesari Ravi, Wu Xinyuan, Heery Lauren, Desel Tenzin, Baskharoun Samuel, Figman Noah, Farooq Umar, Shah Kunal, Jahan Nusrat, Kim Jeong Min, Testa Paul, Feldman Jonah
Department of Health Informatics, New York University Langone Medical Center Information Technology.
Department of Medicine, New York University Grossman School of Medicine.
JAMA Netw Open. 2025 Aug 1;8(8):e2526339. doi: 10.1001/jamanetworkopen.2025.26339.
IMPORTANCE: Hospital course (HC) summarization represents an increasingly onerous discharge summary component for physicians. Literature supports large language models (LLMs) for HC summarization, but whether physicians can effectively partner with electronic health record-embedded LLMs to draft HCs is unknown.
OBJECTIVE: To compare the editing effort required by time-constrained resident physicians to improve LLM- vs physician-generated HCs toward a novel 4Cs (complete, concise, cohesive, and confabulation-free) HC standard.
DESIGN, SETTING, AND PARTICIPANTS: Quality improvement study using a convenience sample of 10 internal medicine resident editors, 8 hospitalist evaluators, and randomly selected general medicine admissions in December 2023 lasting 4 to 8 days at New York University Langone Health.
EXPOSURES: Residents and hospitalists reviewed randomly assigned patient medical records for 10 minutes. Residents blinded to author type then edited each HC pair (physician and LLM) for quality in 3 minutes, followed by comparative ratings by attending hospitalists.
MAIN OUTCOMES AND MEASURES: Editing effort was quantified by analyzing the edits that occurred on the HC pairs after controlling for length (percentage edited) and the degree to which the original HCs' meaning was altered (semantic change). Hospitalists compared edited HC pairs with A/B testing on the 4Cs (5-point Likert scales converted to 10-point bidirectional scales).
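The abstract does not specify how the percentage edited was computed. As a minimal sketch only, assuming a word-level comparison between the original and edited HC, a length-normalized edit percentage could be approximated with Python's standard `difflib`. The function name `percent_edited`, the whitespace tokenization, and the example sentences are hypothetical and are not taken from the study.

```python
from difflib import SequenceMatcher


def percent_edited(original: str, edited: str) -> float:
    """Approximate share of word tokens changed between two texts (0 to 100).

    Assumption: whitespace tokenization and longest-matching-block alignment.
    The study's actual metric is not described in the abstract.
    """
    a = original.split()
    b = edited.split()
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    # autojunk=False disables difflib's heuristic that ignores very frequent tokens.
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    # Normalizing by the longer text controls for length and bounds the result.
    return 100.0 * (1.0 - matched / longest)


# Hypothetical example sentences, for illustration only.
before = "Patient admitted with pneumonia and treated with antibiotics"
after = "Patient admitted with pneumonia and treated with ceftriaxone"
pct = percent_edited(before, after)
```

Dividing by the longer of the two token sequences keeps the value between 0 and 100 regardless of whether the editor mostly added or mostly deleted text. A semantic change measure would need a separate meaning-based comparison, which is not sketched here.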
RESULTS: Among 100 admissions, compared with physician HCs, residents edited a smaller percentage of LLM HCs (LLM mean [SD], 31.5% [16.6%] vs physicians, 44.8% [20.0%]; P < .001). Additionally, LLM HCs required less semantic change (LLM mean [SD], 2.4% [1.6%] vs physicians, 4.9% [3.5%]; P < .001). Attending physicians deemed LLM HCs to be more complete (mean [SD] difference LLM vs physicians on 10-point bidirectional scale, 3.00 [5.28]; P < .001), similarly concise (mean [SD], -1.02 [6.08]; P = .20), and cohesive (mean [SD], 0.70 [6.14]; P = .60), but with more confabulations (mean [SD], -0.98 [3.53]; P = .002). The composite scores were similar (mean [SD] difference LLM vs physician on 40-point bidirectional scale, 1.70 [14.24]; P = .46).
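The reported composite difference is arithmetically consistent with summing the four per-dimension mean differences. The short check below assumes the 40-point composite is the plain sum of the four 10-point bidirectional scores, which the abstract implies but does not state. The dictionary keys are illustrative labels.

```python
# Mean differences (LLM vs physician) on the 10-point bidirectional scales,
# copied from the reported results. Positive values favor the LLM HC.
dimension_means = {
    "complete": 3.00,
    "concise": -1.02,
    "cohesive": 0.70,
    "confabulation_free": -0.98,
}

# Assumption: the composite is the unweighted sum of the four dimensions.
composite = round(sum(dimension_means.values()), 2)
```

Under that assumption the sum is 1.70, matching the reported composite mean: the gain in completeness is largely offset by the small deficits in concision and confabulation.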
CONCLUSIONS AND RELEVANCE: Electronic health record-embedded LLM HCs required less editing than physician-generated HCs to approach a quality standard, resulting in HCs that were comparably or more complete, concise, and cohesive, but contained more confabulations. Despite the potential influence of artificial time constraints, this study supports the feasibility of a physician-LLM partnership for writing HCs and provides a basis for monitoring LLM HCs in clinical practice.