Helmut Schmidt University, Hamburg, Germany.
Medical School Hamburg, Hamburg, Germany.
Res Synth Methods. 2024 Nov;15(6):1120-1146. doi: 10.1002/jrsm.1762. Epub 2024 Oct 16.
Several AI-aided screening tools have emerged to tackle the ever-expanding body of literature. These tools employ active learning, where algorithms sort abstracts based on human feedback. However, researchers using these tools face a crucial dilemma: When should they stop screening without knowing the proportion of relevant studies? Although numerous stopping rules have been proposed to guide users in this decision, they have yet to undergo comprehensive evaluation. In this study, we evaluated the performance of three stopping rules: the knee method, a data-driven heuristic, and a prevalence estimation technique. We measured performance via sensitivity, specificity, and screening cost and explored the influence of the prevalence of relevant studies and the choice of the learning algorithm. We curated a dataset of abstract collections from meta-analyses across five psychological research domains. Our findings revealed performance differences between stopping rules regarding all performance measures and variations in the performance of stopping rules across different prevalence ratios. Moreover, despite the relatively minor impact of the learning algorithm, we found that specific combinations of stopping rules and learning algorithms were most effective for certain prevalence ratios of relevant abstracts. Based on these results, we derived practical recommendations for users of AI-aided screening tools. Furthermore, we discuss possible implications and offer suggestions for future research.
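The three performance measures named in the abstract can be illustrated with a minimal sketch. It assumes abstracts are listed in the order the tool presented them, with 1 for relevant and 0 for irrelevant, and that a stopping rule halts after `stop_at` abstracts. The function name, the parameter names, and the exact definitions used here (in particular, screening cost as the fraction of abstracts screened) are illustrative assumptions, not the definitions used in the study.

```python
from typing import Dict, Sequence


def evaluate_stopping_point(labels: Sequence[int], stop_at: int) -> Dict[str, float]:
    """Score a hypothetical stopping point on a ranked screening order.

    labels  -- 1 (relevant) or 0 (irrelevant), in the order abstracts were screened
    stop_at -- number of abstracts screened before the rule said "stop"
    """
    n = len(labels)
    if n == 0 or not 0 <= stop_at <= n:
        raise ValueError("stop_at must lie between 0 and len(labels), and labels must be non-empty")

    screened, unscreened = labels[:stop_at], labels[stop_at:]
    found = sum(screened)                   # relevant abstracts identified
    missed = sum(unscreened)                # relevant abstracts left unscreened
    read_irrelevant = len(screened) - found      # irrelevant abstracts the reviewer had to read
    skipped_irrelevant = len(unscreened) - missed  # irrelevant abstracts never read

    total_relevant = found + missed
    total_irrelevant = read_irrelevant + skipped_irrelevant
    return {
        # Share of all relevant abstracts that were found before stopping.
        "sensitivity": found / total_relevant if total_relevant else float("nan"),
        # Share of all irrelevant abstracts that the reviewer was spared (assumed definition).
        "specificity": skipped_irrelevant / total_irrelevant if total_irrelevant else float("nan"),
        # Share of the whole collection that had to be screened (assumed definition).
        "screening_cost": stop_at / n,
    }


if __name__ == "__main__":
    ranked = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]  # made-up toy data, not from the study
    print(evaluate_stopping_point(ranked, stop_at=5))
```

On this toy ranking, stopping after five abstracts finds three of the four relevant ones while screening half of the collection, which shows the trade-off a stopping rule has to manage: stopping earlier lowers cost but risks missing relevant studies.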