Kresevic Simone, Giuffrè Mauro, Ajcevic Milos, Accardo Agostino, Crocè Lory S, Shung Dennis L
Department of Engineering and Architecture, University of Trieste, Trieste, Italy.
Department of Medicine (Digestive Diseases), Yale School of Medicine, Yale University, New Haven, CT, USA.
NPJ Digit Med. 2024 Apr 23;7(1):102. doi: 10.1038/s41746-024-01091-y.
Large language models (LLMs) can potentially transform healthcare, particularly by providing the right information to the right provider at the right time in the hospital workflow. This study investigates the integration of LLMs into healthcare, specifically focusing on improving clinical decision support systems (CDSSs) through accurate interpretation of medical guidelines for chronic Hepatitis C Virus infection management. Utilizing OpenAI's GPT-4 Turbo model, we developed a customized LLM framework that incorporates retrieval-augmented generation (RAG) and prompt engineering. Our framework involved converting the guidelines into a structured format that LLMs can process efficiently to produce the most accurate output. An ablation study was conducted to evaluate the impact of different formatting and learning strategies on the LLM's answer-generation accuracy. The baseline GPT-4 Turbo model's performance was compared against five experimental setups with increasing levels of complexity: inclusion of in-context guidelines, guideline reformatting, and implementation of few-shot learning. Our primary outcome was a qualitative assessment of accuracy based on expert review, while secondary outcomes included quantitative measurement of the similarity of LLM-generated responses to expert-provided answers using text-similarity scores. The results showed a significant improvement in accuracy from 43% to 99% (p < 0.001) when guidelines were provided as context in a coherent corpus of text and non-text sources were converted into text. In addition, few-shot learning did not appear to improve overall accuracy. The study highlights that structured guideline reformatting and advanced prompt engineering (data quality vs. data quantity) can enhance the efficacy of LLM integration into CDSSs for guideline delivery.
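To make the described pipeline concrete, the sketch below illustrates the general RAG-plus-prompt-engineering pattern the abstract refers to: a guideline passage (already converted from tables or figures into plain text) is retrieved and injected as in-context grounding before querying GPT-4 Turbo. This is a minimal, assumed illustration, not the authors' actual framework; the guideline passages, the toy keyword retriever, and the prompt wording are all hypothetical.

```python
# Minimal RAG-style sketch (illustrative only, not the authors' implementation):
# retrieve the most relevant guideline passage and inject it into the prompt as context.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

# Hypothetical guideline excerpts, already converted from non-text sources into plain text.
GUIDELINE_CHUNKS = [
    "Treatment-naive adults without cirrhosis: pangenotypic regimen for 8 weeks ...",
    "Patients with compensated cirrhosis: pangenotypic regimen for 12 weeks ...",
]

def retrieve(question: str, chunks: list[str]) -> str:
    """Toy retriever: pick the chunk sharing the most words with the question."""
    q_words = set(question.lower().split())
    return max(chunks, key=lambda c: len(q_words & set(c.lower().split())))

def answer(question: str) -> str:
    """Ground the model's answer in the retrieved guideline excerpt."""
    context = retrieve(question, GUIDELINE_CHUNKS)
    response = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[
            {"role": "system",
             "content": "Answer strictly from the guideline excerpt below.\n\n" + context},
            {"role": "user", "content": question},
        ],
        temperature=0,
    )
    return response.choices[0].message.content

print(answer("Which regimen is recommended for a treatment-naive patient without cirrhosis?"))
```

In practice, production retrievers typically use embedding-based similarity rather than keyword overlap, and the few-shot variants evaluated in the ablation study would add worked question-answer examples to the prompt.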