Sayash Kapoor, Arvind Narayanan
Department of Computer Science and Center for Information Technology Policy, Princeton University, Princeton, NJ 08540, USA.
Patterns (N Y). 2023 Aug 4;4(9):100804. doi: 10.1016/j.patter.2023.100804. eCollection 2023 Sep 8.
Machine-learning (ML) methods have gained prominence in the quantitative sciences. However, ML-based science is susceptible to many known methodological pitfalls, including data leakage. We systematically investigate reproducibility issues in ML-based science. Through a survey of the literature in fields that have adopted ML methods, we identify 17 fields where leakage has been found, collectively affecting 294 papers and, in some cases, leading to wildly overoptimistic conclusions. Based on our survey, we introduce a detailed taxonomy of eight types of leakage, ranging from textbook errors to open research problems. We propose that researchers test for each type of leakage by filling out model info sheets, which we introduce. Finally, we conduct a reproducibility study of civil war prediction, where complex ML models are believed to vastly outperform traditional statistical models such as logistic regression (LR). Once the errors are corrected, complex ML models do not perform substantively better than decades-old LR models.
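To make the "textbook error" end of the taxonomy concrete, below is a minimal sketch (not taken from the paper) of one common form of leakage: fitting a preprocessing step, here feature selection, on the full dataset before evaluation, so the held-out folds contaminate the reported score. The dataset, feature counts, and model choice are illustrative assumptions, not the paper's civil war setup.

```python
# Minimal sketch of preprocessing leakage, assuming scikit-learn.
# Synthetic data and parameter choices below are illustrative only.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Few samples, many features: the regime where leakage inflates scores most.
X, y = make_classification(n_samples=200, n_features=2000,
                           n_informative=5, random_state=0)

# LEAKY: feature selection sees every label, including the held-out folds,
# before cross-validation runs, so the evaluation is overoptimistic.
select = SelectKBest(f_classif, k=20).fit(X, y)
leaky_scores = cross_val_score(LogisticRegression(max_iter=1000),
                               select.transform(X), y, cv=5)

# CORRECT: the selector is refit inside each training fold only.
pipe = make_pipeline(SelectKBest(f_classif, k=20),
                     LogisticRegression(max_iter=1000))
clean_scores = cross_val_score(pipe, X, y, cv=5)

print(f"leaky accuracy:   {leaky_scores.mean():.2f}")  # inflated estimate
print(f"correct accuracy: {clean_scores.mean():.2f}")  # lower, honest estimate
```

The gap between the two printed scores is the kind of overoptimism the survey documents; the model info sheets the authors propose prompt researchers to check that all preprocessing is confined to the training data.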