

Leakage and the reproducibility crisis in machine-learning-based science.

Author information

Sayash Kapoor, Arvind Narayanan

Affiliation

Department of Computer Science and Center for Information Technology Policy, Princeton University, Princeton, NJ 08540, USA.

Publication information

Patterns (N Y). 2023 Aug 4;4(9):100804. doi: 10.1016/j.patter.2023.100804. eCollection 2023 Sep 8.

Abstract

Machine-learning (ML) methods have gained prominence in the quantitative sciences. However, there are many known methodological pitfalls, including data leakage, in ML-based science. We systematically investigate reproducibility issues in ML-based science. Through a survey of literature in fields that have adopted ML methods, we find 17 fields where leakage has been found, collectively affecting 294 papers and, in some cases, leading to wildly overoptimistic conclusions. Based on our survey, we introduce a detailed taxonomy of eight types of leakage, ranging from textbook errors to open research problems. We propose that researchers test for each type of leakage by filling out model info sheets, which we introduce. Finally, we conduct a reproducibility study of civil war prediction, where complex ML models are believed to vastly outperform traditional statistical models such as logistic regression (LR). When the errors are corrected, complex ML models do not perform substantively better than decades-old LR models.
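The abstract's "textbook errors" category of leakage can be illustrated with a minimal sketch (not taken from the paper; data and variable names are invented for illustration): fitting a preprocessing step, such as feature standardization, on the full dataset before splitting lets test-set statistics influence the training data, inflating reported performance. The correct procedure fits the preprocessor on the training split only.

```python
# Illustrative sketch of "textbook" data leakage via preprocessing.
# All data here is synthetic; this is not code from the paper.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))          # 100 samples, 3 features
y = (X[:, 0] > 0).astype(int)          # toy binary labels

# LEAKY: standardize using statistics computed over ALL rows,
# then split -- test-set information has already shaped the features.
mu, sigma = X.mean(axis=0), X.std(axis=0)
X_leaky = (X - mu) / sigma
X_leaky_train, X_leaky_test = X_leaky[:80], X_leaky[80:]

# CORRECT: split first, fit the scaler on the training rows only,
# and apply those training statistics to the held-out rows.
X_train, X_test = X[:80], X[80:]
mu_tr, sigma_tr = X_train.mean(axis=0), X_train.std(axis=0)
X_train_s = (X_train - mu_tr) / sigma_tr
X_test_s = (X_test - mu_tr) / sigma_tr   # test rows never inform the scaler
```

On a small synthetic example the numeric difference is minor, but with high-dimensional data or feature selection the same mistake can produce the wildly overoptimistic results the survey documents.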


Graphical abstract: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/42ce/10499856/810158c0eb6a/gr1.jpg
