Cai Xiangming, Geng Yuanming, Du Yiming, Westerman Bart, Wang Duolao, Ma Chiyuan, Vallejo Juan J Garcia
Department of Molecular Cell Biology & Immunology, Amsterdam Infection & Immunity Institute and Cancer Center Amsterdam, Amsterdam UMC, Vrije Universiteit Amsterdam, Amsterdam, The Netherlands.
Department of Neurosurgery, Jinling Hospital, Nanjing, China.
BMC Med Res Methodol. 2025 Apr 28;25(1):116. doi: 10.1186/s12874-025-02569-3.
Large language models (LLMs) such as ChatGPT have shown great potential in aiding medical research. Evidence-based medicine, especially meta-analysis, requires a heavy workload for filtering records. However, few studies have tried to use LLMs to help screen records for meta-analysis.
In this study, we aimed to explore the possibility of incorporating multiple LLMs to facilitate title- and abstract-based screening of records during meta-analysis.
Various LLMs were evaluated, including GPT-3.5, GPT-4, Deepseek-R1-Distill, Qwen-2.5, Phi-4, Llama-3.1, Gemma-2, and Claude-2. To assess our strategy, we selected three meta-analyses from the literature, together with a glioma meta-analysis embedded in this study as additional validation. For the automatic selection of records from curated meta-analyses, we developed a four-step strategy called LARS-GPT, consisting of (1) criteria selection and single-prompt (prompt with one criterion) creation, (2) best-combination identification, (3) combined-prompt (prompt with one or more criteria) creation, and (4) request sending and answer summary. Recall, workload reduction, precision, and F1 score were calculated to assess the performance of LARS-GPT.
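The abstract does not give implementation details, but the named metrics are standard. Below is a minimal Python sketch of the single-prompt step and the evaluation step, assuming hypothetical helpers (`single_prompt`, `ask_llm`) and one common definition of workload reduction (the fraction of records the model screens out); the paper's exact prompts, parsing, and definitions may differ.

```python
# Sketch: apply a single-criterion prompt per record and score the result.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Record:
    title: str
    abstract: str
    human_include: bool  # ground-truth label from the curated meta-analysis

def single_prompt(criterion: str, rec: Record) -> str:
    # A "single-prompt" carries exactly one inclusion criterion.
    return (
        f"Criterion: {criterion}\n"
        f"Title: {rec.title}\nAbstract: {rec.abstract}\n"
        "Does this record satisfy the criterion? Answer YES or NO."
    )

def evaluate(records: list[Record], criterion: str,
             ask_llm: Callable[[str], bool]) -> dict:
    # `ask_llm` is a hypothetical wrapper around any chat-completion API
    # that returns True for a YES answer.
    preds = [ask_llm(single_prompt(criterion, r)) for r in records]
    tp = sum(p and r.human_include for p, r in zip(preds, records))
    fp = sum(p and not r.human_include for p, r in zip(preds, records))
    fn = sum((not p) and r.human_include for p, r in zip(preds, records))
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    # Assumed definition: records excluded by the model no longer need
    # manual screening.
    workload_reduction = sum(not p for p in preds) / len(records)
    return {"recall": recall, "precision": precision,
            "f1": f1, "workload_reduction": workload_reduction}
```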
Performance varied across single-prompts, with a mean recall of 0.800. Based on these single-prompts, we identified combinations that performed better than the pre-set threshold. With the best combination of criteria, LARS-GPT reduced the screening workload by 40.1% on average while keeping recall above 0.9.
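How single-prompt decisions are combined is not specified in the abstract; one plausible reading is to include a record only when it passes every criterion in a candidate subset, keep subsets whose recall stays above the pre-set floor (0.9 here), and pick the one with the largest workload reduction. A hedged sketch of that search, under those assumptions:

```python
# Sketch: exhaustive search over criterion subsets (assumed combining
# rule: include a record only if every criterion in the subset says YES).
from itertools import combinations

def best_combination(per_criterion_preds: dict[str, list[bool]],
                     labels: list[bool], recall_floor: float = 0.9):
    best, best_wr = None, -1.0
    names = list(per_criterion_preds)
    n = len(labels)
    for k in range(1, len(names) + 1):
        for subset in combinations(names, k):
            preds = [all(per_criterion_preds[c][i] for c in subset)
                     for i in range(n)]
            tp = sum(p and y for p, y in zip(preds, labels))
            fn = sum((not p) and y for p, y in zip(preds, labels))
            recall = tp / (tp + fn) if tp + fn else 0.0
            wr = sum(not p for p in preds) / n  # records screened out
            # Keep the subset with the best workload reduction among
            # those that still meet the recall floor.
            if recall >= recall_floor and wr > best_wr:
                best, best_wr = subset, wr
    return best, best_wr
```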
We show here that automatic selection of literature for meta-analysis is possible with LLMs. We provide this approach as a pipeline, LARS-GPT, which delivers a substantial workload reduction while maintaining a pre-set recall.