

Leakage and the reproducibility crisis in machine-learning-based science.

Author information

Sayash Kapoor, Arvind Narayanan

Affiliation

Department of Computer Science and Center for Information Technology Policy, Princeton University, Princeton, NJ 08540, USA.

Publication information

Patterns (N Y). 2023 Aug 4;4(9):100804. doi: 10.1016/j.patter.2023.100804. eCollection 2023 Sep 8.

Abstract

Machine-learning (ML) methods have gained prominence in the quantitative sciences. However, there are many known methodological pitfalls, including data leakage, in ML-based science. We systematically investigate reproducibility issues in ML-based science. Through a survey of literature in fields that have adopted ML methods, we find 17 fields where leakage has been found, collectively affecting 294 papers and, in some cases, leading to wildly overoptimistic conclusions. Based on our survey, we introduce a detailed taxonomy of eight types of leakage, ranging from textbook errors to open research problems. We propose that researchers test for each type of leakage by filling out model info sheets, which we introduce. Finally, we conduct a reproducibility study of civil war prediction, where complex ML models are believed to vastly outperform traditional statistical models such as logistic regression (LR). When the errors are corrected, complex ML models do not perform substantively better than decades-old LR models.
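The abstract's "textbook errors" category of leakage can be illustrated with a minimal sketch (not taken from the paper; data and variable names are invented for illustration): fitting a preprocessing step, such as feature standardization, on the full dataset before splitting lets test-set statistics influence the training data, inflating reported performance. The correct procedure fits the preprocessor on the training split only.

```python
# Illustrative sketch of "textbook" data leakage via preprocessing.
# All data here is synthetic; this is not code from the paper.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))          # 100 samples, 3 features
y = (X[:, 0] > 0).astype(int)          # toy binary labels

# LEAKY: standardize using statistics computed over ALL rows,
# then split -- test-set information has already shaped the features.
mu, sigma = X.mean(axis=0), X.std(axis=0)
X_leaky = (X - mu) / sigma
X_leaky_train, X_leaky_test = X_leaky[:80], X_leaky[80:]

# CORRECT: split first, fit the scaler on the training rows only,
# and apply those training statistics to the held-out rows.
X_train, X_test = X[:80], X[80:]
mu_tr, sigma_tr = X_train.mean(axis=0), X_train.std(axis=0)
X_train_s = (X_train - mu_tr) / sigma_tr
X_test_s = (X_test - mu_tr) / sigma_tr   # test rows never inform the scaler
```

On a small synthetic example the numeric difference is minor, but with high-dimensional data or feature selection the same mistake can produce the wildly overoptimistic results the survey documents.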


Graphical abstract: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/42ce/10499856/810158c0eb6a/gr1.jpg
