Eleven quick tips for data cleaning and feature engineering.

Author affiliations

Institute of Health Policy Management and Evaluation, University of Toronto, Toronto, Ontario, Canada.

Dipartimento di Informatica Bioingegneria Robotica e Ingegneria dei Sistemi, Università di Genova, Genoa, Italy.

Publication information

PLoS Comput Biol. 2022 Dec 15;18(12):e1010718. doi: 10.1371/journal.pcbi.1010718. eCollection 2022 Dec.

Abstract

Applying computational statistics or machine learning methods to data is a key component of many scientific studies in any field, but on its own it might not be sufficient to generate robust and reliable results. Before applying any discovery method, preprocessing steps are necessary to prepare the data for the computational analysis. In this framework, data cleaning and feature engineering are key pillars of any scientific study involving data analysis, and they should be adequately designed and performed from the first phases of the project. We call a "feature" a variable describing a particular trait of a person or an observation, usually recorded as a column in a dataset. Although pivotal, these data cleaning and feature engineering steps are sometimes done poorly or inefficiently, especially by beginners and inexperienced researchers. For this reason, we propose here our quick tips on how to carry out these important preprocessing steps correctly, avoiding common mistakes and pitfalls. Although we designed these guidelines with bioinformatics and health informatics scenarios in mind, we believe they can be applied more generally to any scientific area. We therefore address these guidelines to any researcher or practitioner wanting to perform data cleaning or feature engineering. We believe our simple recommendations can help researchers and scholars perform better computational analyses that can lead, in turn, to more solid outcomes and more reliable discoveries.
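To make the abstract's terms concrete, here is a minimal sketch in which each record is a row and each feature is a column. It applies one data cleaning step (filling a missing value with the median) and one feature engineering step (deriving body mass index from height and weight). The records, the column names, the helper functions `impute_median` and `add_bmi`, and the choice of median imputation are all illustrative assumptions and do not come from the article.

```python
from statistics import median

# Hypothetical patient records (illustrative only, not from the article).
# Each dict is one observation (row); each key is one feature (column).
# None marks a missing value.
records = [
    {"id": 1, "height_m": 1.70, "weight_kg": 65.0},
    {"id": 2, "height_m": 1.80, "weight_kg": None},
    {"id": 3, "height_m": 1.60, "weight_kg": 82.0},
]


def impute_median(rows, column):
    """Data cleaning: fill missing values in `column` with the median of the observed ones."""
    observed = [r[column] for r in rows if r[column] is not None]
    fill = median(observed)
    # Build new dicts so the input rows are left unchanged.
    return [{**r, column: fill if r[column] is None else r[column]} for r in rows]


def add_bmi(rows):
    """Feature engineering: derive body mass index (kg / m^2) as a new column."""
    return [{**r, "bmi": round(r["weight_kg"] / r["height_m"] ** 2, 1)} for r in rows]


cleaned = add_bmi(impute_median(records, "weight_kg"))
for row in cleaned:
    print(row)
```

Median imputation is only one of several possible strategies. Whether it suits a given dataset depends on why the values are missing, which this sketch does not address.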
