Alberta Research Centre for Health Evidence and the University of Alberta Evidence-based Practice Center, Department of Pediatrics, University of Alberta, Edmonton, Alberta, Canada.
Syst Rev. 2020 Nov 27;9(1):272. doi: 10.1186/s13643-020-01528-x.
We evaluated the benefits and risks of using the Abstrackr machine learning (ML) tool to semi-automate title-abstract screening, and explored whether Abstrackr's predictions varied by review- or study-level characteristics.
For a convenience sample of 16 reviews (11 systematic reviews and 5 rapid reviews) with adequate data to address our objectives, we screened a 200-record training set in Abstrackr and downloaded the tool's predicted relevance (relevant or irrelevant) for the remaining records. We retrospectively simulated the liberal-accelerated screening approach and estimated the time savings and the proportion of records missed compared with dual independent screening. For reviews with pairwise meta-analyses, we evaluated changes to the pooled effects after removing the missed studies. We also explored whether the tool's predictions varied by review- and study-level characteristics.
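The simulated workflow can be sketched in a few lines. This is a hypothetical illustration, not the authors' code: it assumes a record advances past title-abstract screening when either the single human screener or Abstrackr marks it relevant, and that the time saved equals the second human pass the tool replaces. The function name, record schema, and 30-second per-record screening time are all assumptions.

```python
# Hypothetical sketch of an ML-assisted liberal-accelerated simulation:
# one human screener's verdicts are combined with Abstrackr's predictions,
# and a record is excluded at the title-abstract stage only when BOTH
# deem it irrelevant. All names and timings are illustrative.

def simulate_liberal_accelerated(records, seconds_per_record=30):
    """records: list of dicts with boolean keys
    'human' (single screener judged relevant),
    'ml'    (Abstrackr predicted relevant),
    'final' (study appears in the review's final report)."""
    advanced = sum(1 for r in records if r["human"] or r["ml"])
    # Wrongly excluded: in the final report, but screened out here.
    missed = sum(1 for r in records
                 if r["final"] and not (r["human"] or r["ml"]))
    # Dual independent screening reads every record twice; letting the
    # tool act as the second screener avoids one full human pass.
    hours_saved = len(records) * seconds_per_record / 3600
    n_final = sum(1 for r in records if r["final"])
    return {
        "advanced": advanced,
        "missed": missed,
        "prop_missed": missed / n_final if n_final else 0.0,
        "hours_saved": hours_saved,
    }
```

Under these assumptions, the "proportion missed" corresponds to the 0–14% of final-report records wrongly excluded in the results below.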
Using the ML-assisted liberal-accelerated approach, we wrongly excluded 0 to 3 (0 to 14%) records that were included in the final reports, but saved a median (IQR) of 26 (9, 42) h of screening time. One missed study was included in eight pairwise meta-analyses in one systematic review. The pooled effect for just one of those meta-analyses changed considerably (from MD (95% CI) - 1.53 (- 2.92, - 0.15) to - 1.17 (- 2.70, 0.36)). Of 802 records in the final reports, 87% were correctly predicted as relevant. The correctness of the predictions did not differ by review type (systematic or rapid, P = 0.37) or intervention type (simple or complex, P = 0.47). Predictions were more often correct in reviews with multiple (89%) vs. single (83%) research questions (P = 0.01), or in reviews that included only trials (95%) vs. multiple designs (86%) (P = 0.003). At the study level, trials (91%), mixed methods (100%), and qualitative (93%) studies were more often correctly predicted as relevant than observational studies (79%) or reviews (83%) (P = 0.0006). Studies at high or unclear (88%) vs. low (80%) risk of bias (P = 0.039), and more recently published studies (mean (SD) publication year 2008 (7) vs. 2006 (10), P = 0.02), were more often correctly predicted as relevant.
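The between-group comparisons of correctly predicted proportions (e.g., 89% vs. 83%, P = 0.01) are the kind of result a Pearson chi-square test on a 2x2 table yields. A minimal standard-library sketch, assuming 1 degree of freedom and no continuity correction (the abstract does not state which test the authors used):

```python
import math

def chi2_2x2(a, b, c, d):
    """Pearson chi-square test (df = 1, no continuity correction)
    for the 2x2 table [[a, b], [c, d]], e.g. rows = review groups,
    columns = correctly / incorrectly predicted record counts.
    Returns (statistic, p_value)."""
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / (
        (a + b) * (c + d) * (a + c) * (b + d)
    )
    # For df = 1, the chi-square survival function reduces to
    # erfc(sqrt(stat / 2)), so no external stats library is needed.
    p = math.erfc(math.sqrt(stat / 2))
    return stat, p
```

The counts fed to such a test would come from the 802 final-report records; the abstract reports only the resulting percentages and P values.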
Our screening approach saved time and may be suitable when a limited risk of missing relevant records is acceptable. Several of our findings are paradoxical and require further study to fully understand the tasks to which ML-assisted screening is best suited. The findings should be interpreted in light of the fact that the protocol was prepared for the funder but not published a priori. Because we used a convenience sample, the findings may be prone to selection bias, and the results may not generalize to other samples of reviews, ML tools, or screening approaches. The small number of missed studies across reviews with pairwise meta-analyses precluded strong conclusions about the effect of missed studies on the results and conclusions of systematic reviews.