Université Paris Cité and Université Sorbonne Paris Nord, Inserm, INRAe, Centre for Research in Epidemiology and Statistics (CRESS), Paris; and Centre d'Epidémiologie Clinique, Hôpital Hôtel-Dieu, AP-HP, Paris, France (V.-T.T.).
Department for Evidence-based Medicine and Evaluation, University for Continuing Education Krems, Krems, Austria; and Center for Public Health Methods, RTI International, Research Triangle Park, North Carolina (G.G.).
Ann Intern Med. 2024 Jun;177(6):791-799. doi: 10.7326/M23-3389. Epub 2024 May 21.
Background: Systematic reviews are still performed manually despite the exponential growth of the scientific literature.
Objective: To investigate the sensitivity and specificity of GPT-3.5 Turbo, from OpenAI, as a single reviewer for title and abstract screening in systematic reviews.
Design: Diagnostic test accuracy study.
Setting: Unannotated bibliographic databases from 5 systematic reviews, representing 22 665 citations.
Intervention: None.
Measurements: A generic prompt framework was designed to instruct GPT to perform title and abstract screening. The output of the model was compared with the decisions of the review authors under 2 rules. The first rule balanced sensitivity and specificity, for example, so that the model could act as a second reviewer. The second rule optimized sensitivity, for example, to reduce the number of citations requiring manual screening.
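The screening setup described above can be sketched as follows. This is a minimal illustration, not the authors' actual framework: the prompt wording, verdict labels, and rule mappings are assumptions, and the model call is stubbed so the sketch runs offline (the study used the GPT-3.5 Turbo API).

```python
# Illustrative sketch of title/abstract screening with an LLM under the
# two decision rules described in the abstract. Prompt text, verdict
# labels, and rule logic are assumptions, not the study's actual design.

def build_prompt(title: str, abstract: str, criteria: str) -> str:
    # A generic screening prompt asking the model for an explicit verdict.
    return (
        f"Eligibility criteria:\n{criteria}\n\n"
        f"Title: {title}\nAbstract: {abstract}\n\n"
        "Should this citation be included at the title/abstract stage? "
        "Answer with INCLUDE, EXCLUDE, or UNSURE."
    )

def decide(model_answer: str, rule: str) -> bool:
    # Map the model's free-text verdict to a screening decision (True = keep).
    verdict = model_answer.strip().upper()
    if rule == "balanced":
        # Balanced rule (e.g., acting as a second reviewer):
        # only a clear INCLUDE keeps the citation.
        return verdict.startswith("INCLUDE")
    if rule == "sensitive":
        # Sensitive rule (e.g., reducing the manual screening load):
        # anything not clearly excluded is kept for human review.
        return not verdict.startswith("EXCLUDE")
    raise ValueError(f"unknown rule: {rule}")

# Stubbed model response, standing in for a GPT-3.5 Turbo API call.
answer = "UNSURE"
print(decide(answer, "balanced"))   # False: the balanced rule drops UNSURE
print(decide(answer, "sensitive"))  # True: the sensitive rule keeps UNSURE
```

The two rules differ only in how an ambiguous verdict is handled: the balanced rule treats it as an exclusion, while the sensitive rule forwards it to human screening.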
Results: Under the balanced rule, sensitivities ranged from 81.1% to 96.5% and specificities ranged from 25.8% to 80.4%. Across all reviews, GPT identified 7 of 708 citations (1%) missed by humans that should have been included after full-text screening, at the cost of 10 279 of 22 665 false-positive recommendations (45.3%) that would require reconciliation during the screening process. Under the sensitive rule, sensitivities ranged from 94.6% to 99.8% and specificities ranged from 2.2% to 46.6%. Limiting manual screening to citations not ruled out by GPT could reduce the number of citations to screen from 127 of 6334 (2%) to 1851 of 4077 (45.4%), at the cost of missing from 0 to 1 of 26 citations (3.8%) at the full-text level.
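The reported sensitivities and specificities follow the standard diagnostic test accuracy definitions, taking the human reviewers' decisions as the reference standard. A minimal sketch, with hypothetical confusion-matrix counts (not the study's data):

```python
# Sensitivity and specificity of the model's screening decisions,
# with the human reviewers' decisions as the reference standard.
# The counts below are illustrative, not taken from the study.

def sensitivity(tp: int, fn: int) -> float:
    # Proportion of human-included citations the model also flagged.
    return tp / (tp + fn)

def specificity(tn: int, fp: int) -> float:
    # Proportion of human-excluded citations the model also ruled out.
    return tn / (tn + fp)

tp, fn, tn, fp = 90, 10, 600, 300  # hypothetical counts for one review
print(f"sensitivity = {sensitivity(tp, fn):.1%}")  # 90.0%
print(f"specificity = {specificity(tn, fp):.1%}")  # 66.7%
```

Every false positive under the balanced rule is extra reconciliation work for the human team, which is why specificity matters even when sensitivity is the primary concern.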
Limitations: Time was needed to fine-tune the prompt; the study was retrospective and used a convenience sample of 5 systematic reviews; and GPT performance was sensitive to prompt development and varied over time.
Conclusion: The GPT-3.5 Turbo model may be used as a second reviewer for title and abstract screening, at the cost of additional work to reconcile the added false positives. It also showed potential to reduce the number of citations screened by humans, at the cost of missing some citations at the full-text level.
Primary Funding Source: None.