Ruan Minghao, Fan Junhao, Liu Mingkai, Meng Zhengfeng, Zhang Xiaohai, Zhang Chengjing
The Third Department of Hepatic Surgery, Eastern Hepatobiliary Surgery Hospital, the Naval Medical University, Shanghai, China.
Department of Hepatic Surgery, Eastern Hepatobiliary Surgery Hospital, the Naval Medical University, Shanghai, China.
BMC Med Res Methodol. 2025 Aug 25;25(1):199. doi: 10.1186/s12874-025-02644-9.
Literature screening constitutes a critical component in evidence synthesis; however, it typically requires substantial time and human resources. Artificial intelligence (AI) has shown promise in this field, yet the accuracy and effectiveness of AI tools for literature screening remain uncertain. This study aims to evaluate the performance of several existing AI-powered automated tools for literature screening.
This diagnostic accuracy study used a cohort design to evaluate the performance of five AI tools (ChatGPT 4.0, Claude 3.5, Gemini 1.5, DeepSeek-V3, and RobotSearch) in literature screening. We drew a random sample of 1,000 publications from a well-established literature cohort: 500 randomized controlled trials (RCTs) formed the RCT group, and 500 non-RCT publications formed the others group. Diagnostic accuracy was measured with several metrics, including the false negative fraction (FNF), the false positive fraction (FPF), the time used for screening, and the redundancy number needed to screen.
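As a reading aid, here is a minimal sketch of how the two error fractions could be computed from per-article screening decisions; the function names and the boolean decision format are assumptions for illustration, not the authors' actual pipeline:

```python
# Sketch of the two error fractions as described in the abstract (assumed
# interface: one boolean per article, True = tool labeled it an RCT).

def false_negative_fraction(flagged_as_rct: list[bool]) -> float:
    """FNF in the RCT group: fraction of true RCTs the tool failed to flag."""
    return sum(1 for f in flagged_as_rct if not f) / len(flagged_as_rct)

def false_positive_fraction(flagged_as_rct: list[bool]) -> float:
    """FPF in the others group: fraction of non-RCTs the tool wrongly flagged."""
    return sum(1 for f in flagged_as_rct if f) / len(flagged_as_rct)

# 32 misses among 500 true RCTs reproduces the 6.4% FNF reported below.
rct_group = [False] * 32 + [True] * 468
print(false_negative_fraction(rct_group))  # 0.064
```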
We reported the FNF for the RCT group and the FPF for the others group. In the RCT group, RobotSearch exhibited the lowest FNF at 6.4% (95% CI: 4.6% to 8.9%), whereas Gemini exhibited the highest at 13.0% (95% CI: 10.3% to 16.3%). In the others group, the FPF of the four large language models ranged from 2.8% (95% CI: 1.7% to 4.7%) to 3.8% (95% CI: 2.4% to 5.9%), all of which were significantly lower than RobotSearch's rate of 22.2% (95% CI: 18.8% to 26.1%). In terms of screening efficiency, the mean screening time per article was 1.3 s for ChatGPT, 6.0 s for Claude, 1.2 s for Gemini, and 2.6 s for DeepSeek.
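The 95% intervals above are consistent with Wilson score intervals on proportions out of n = 500; for example, 32/500 = 6.4% yields exactly 4.6% to 8.9%. Whether the authors used this method is an assumption; a minimal sketch:

```python
from math import sqrt

def wilson_ci(k: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score 95% interval for a proportion of k events in n trials.
    (Assumed to be the interval method behind the reported CIs.)"""
    p = k / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = (z / denom) * sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return center - half, center + half

lo, hi = wilson_ci(32, 500)        # 32/500 = 6.4%
print(f"{lo:.1%} to {hi:.1%}")     # -> 4.6% to 8.9%, matching the RobotSearch FNF
```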
The AI tools assessed in this study demonstrated commendable performance in literature screening; however, they are not yet suitable as standalone solutions. These tools can serve as effective auxiliary aids, and a hybrid approach that integrates human expertise with AI may enhance both the efficiency and accuracy of the literature screening process.
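The abstract does not describe a concrete hybrid workflow. Purely as an illustration of one possible triage rule (not taken from the paper), AI tools could auto-exclude only the articles that no model flags as an RCT, routing everything else to human reviewers:

```python
def triage(model_votes: dict[str, bool]) -> str:
    """Illustrative hybrid rule, not from the paper: auto-exclude an article
    only when every AI tool agrees it is not an RCT; otherwise route it to a
    human reviewer. This trades some reviewer time for a lower miss rate."""
    return "human review" if any(model_votes.values()) else "auto-exclude"

print(triage({"ChatGPT": False, "Claude": True, "Gemini": False, "DeepSeek": False}))
# -> human review
```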