Quality of Answers of Generative Large Language Models vs Peer Patients for Interpreting Lab Test Results for Lay Patients: Evaluation Study.

Author Information

He Zhe, Bhasuran Balu, Jin Qiao, Tian Shubo, Hanna Karim, Shavor Cindy, Arguello Lisbeth Garcia, Murray Patrick, Lu Zhiyong

Affiliations

School of Information, Florida State University, Tallahassee, Florida, USA.

National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health, Bethesda, Maryland, USA.

Publication Information

ArXiv. 2024 Jan 23:arXiv:2402.01693v1.

Abstract

BACKGROUND

Even though patients have easy access to their electronic health records and lab test results through patient portals, lab results are often confusing and hard to understand. Many patients turn to online forums or question-and-answer (Q&A) sites to seek advice from their peers. However, the quality of answers to health-related questions on social Q&A sites varies significantly, and not all responses are accurate or reliable. Large language models (LLMs) such as ChatGPT have opened a promising avenue for patients to get their questions answered.

OBJECTIVE

We aim to assess the feasibility of using LLMs to generate relevant, accurate, helpful, and harmless responses to lab test-related questions asked by patients, and to identify potential issues that can be mitigated with augmentation approaches.

METHODS

We first collected lab test result-related question-and-answer data from Yahoo! Answers and selected 53 Q&A pairs for this study. Using the LangChain framework and the ChatGPT web portal, we generated responses to the 53 questions with four LLMs: GPT-4, Meta LLaMA 2, MedAlpaca, and ORCA_mini. We first assessed the similarity of their answers using standard QA similarity-based evaluation metrics, including ROUGE, BLEU, METEOR, and BERTScore. We also used an LLM-based evaluator to judge whether a target model's responses were of higher quality than a baseline model's in terms of relevance, correctness, helpfulness, and safety. Finally, we performed a manual evaluation with medical experts on all responses to seven selected questions, along the same four aspects.
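The abstract does not include the evaluation code. The following is a minimal Python sketch of how the named similarity metrics could be computed with common open-source packages (rouge-score, nltk, bert-score), using a GPT-4 answer as the reference; the example strings and variable names are illustrative assumptions, not material from the study.

```python
# Minimal sketch of reference-based similarity scoring (not the authors' code).
# Assumes: pip install rouge-score nltk bert-score
import nltk
from rouge_score import rouge_scorer
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from nltk.translate.meteor_score import meteor_score
from bert_score import score as bert_score

nltk.download("punkt", quiet=True)    # tokenizer models for word_tokenize
nltk.download("wordnet", quiet=True)  # required by METEOR

# Hypothetical answers; in the study, the GPT-4 output served as the reference.
reference = "An elevated TSH with normal T4 often indicates subclinical hypothyroidism."
candidate = "A high TSH level can be a sign of an underactive thyroid."

ref_tokens = nltk.word_tokenize(reference.lower())
cand_tokens = nltk.word_tokenize(candidate.lower())

# ROUGE (unigram and longest-common-subsequence F1)
rouge = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
rouge_scores = rouge.score(reference, candidate)

# BLEU with smoothing (short answers otherwise score 0 on higher n-grams)
bleu = sentence_bleu([ref_tokens], cand_tokens,
                     smoothing_function=SmoothingFunction().method1)

# METEOR (NLTK >= 3.6 expects pre-tokenized input)
meteor = meteor_score([ref_tokens], cand_tokens)

# BERTScore (semantic similarity from contextual embeddings)
_, _, f1 = bert_score([candidate], [reference], lang="en", verbose=False)

print(f"ROUGE-1 F1: {rouge_scores['rouge1'].fmeasure:.3f}")
print(f"ROUGE-L F1: {rouge_scores['rougeL'].fmeasure:.3f}")
print(f"BLEU:       {bleu:.3f}")
print(f"METEOR:     {meteor:.3f}")
print(f"BERTScore:  {f1.item():.3f}")
```

Lexical-overlap metrics (ROUGE, BLEU, METEOR) reward shared wording with the reference, while BERTScore captures semantic similarity even when the wording differs, which is why studies like this one typically report both kinds.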

RESULTS

Regarding the similarity of the responses from the four LLMs, with GPT-4's output used as the reference answer, the responses from LLaMA 2 were the most similar, followed by those from ORCA_mini and MedAlpaca. Human answers from the Yahoo! data scored the lowest and were thus the least similar to the GPT-4-generated answers. Both the win rate results and the medical expert evaluation showed that GPT-4's responses achieved better scores than all other LLM responses and the human responses on all four aspects (relevance, correctness, helpfulness, and safety). However, LLM responses occasionally suffered from a lack of interpretation within the patient's specific medical context, incorrect statements, and a lack of references.
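For context, a win rate like the one reported above is typically aggregated from pairwise judge verdicts: for each question, the LLM evaluator states whether the target model's answer beats the baseline's on a given aspect. A minimal sketch of that aggregation, with made-up verdicts and a common tie-splitting convention (the paper's exact scheme may differ):

```python
# Hypothetical pairwise verdicts from an LLM judge comparing a target model
# against a baseline on one aspect (e.g., correctness), one verdict per question.
from collections import Counter

verdicts = ["win", "win", "tie", "loss", "win", "tie", "win"]  # made-up data

counts = Counter(verdicts)
# One common convention: ties are split evenly between the two models.
win_rate = (counts["win"] + 0.5 * counts["tie"]) / len(verdicts)
print(f"Win rate: {win_rate:.2%}")  # Win rate: 71.43%
```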

CONCLUSIONS

By evaluating LLMs' responses to patients' questions about lab test results, we find that, compared with the other three LLMs and the human answers from the Q&A website, GPT-4's responses are more accurate, helpful, relevant, and safe. However, there are cases where GPT-4's responses are inaccurate or not individualized. We identified a number of ways to improve the quality of LLM responses, including prompt engineering, prompt augmentation, retrieval-augmented generation, and response evaluation.
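As one illustration of the retrieval-augmented generation idea mentioned above, a lab-test QA pipeline could retrieve authoritative reference snippets and prepend them to the prompt before calling the model, addressing the missing-references and missing-context issues noted in the results. The snippet store, retrieval rule, and prompt wording below are hypothetical placeholders, not the paper's implementation:

```python
# Hypothetical retrieval-augmented prompting sketch (not the paper's pipeline).
# A real system would use a vector store and an actual LLM call; this toy
# keyword-overlap retriever only shows how retrieved context augments a prompt.
KNOWLEDGE = {  # made-up reference snippets
    "tsh": "Typical adult TSH reference range: about 0.4-4.0 mIU/L (lab-dependent).",
    "a1c": "HbA1c below 5.7% is generally considered normal; 6.5%+ suggests diabetes.",
    "ldl": "LDL cholesterol under 100 mg/dL is generally considered optimal.",
}

def retrieve(question: str, k: int = 2) -> list[str]:
    """Return up to k snippets whose keyword appears in the question."""
    q = question.lower()
    return [text for key, text in KNOWLEDGE.items() if key in q][:k]

def build_prompt(question: str) -> str:
    """Prepend retrieved context so the model grounds its answer in it."""
    context = "\n".join(retrieve(question)) or "No reference material found."
    return (
        "Answer the patient's question using only the reference material.\n"
        f"Reference material:\n{context}\n\n"
        f"Patient question: {question}\n"
        "If the material is insufficient, say so and suggest seeing a clinician."
    )

print(build_prompt("My TSH came back at 5.2, should I be worried?"))
```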

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/de04/10962749/ab5b241a2cae/nihpp-2402.01693v1-f0001.jpg
