Institute of Health Policy Management and Evaluation, University of Toronto, Toronto, Ontario, Canada.
Dipartimento di Informatica Bioingegneria Robotica e Ingegneria dei Sistemi, Università di Genova, Genoa, Italy.
PLoS Comput Biol. 2022 Dec 15;18(12):e1010718. doi: 10.1371/journal.pcbi.1010718. eCollection 2022 Dec.
Applying computational statistics or machine learning methods to data is a key component of many scientific studies in any field, but on its own it might not be sufficient to generate robust and reliable results. Before applying any discovery method, preprocessing steps are necessary to prepare the data for computational analysis. In this framework, data cleaning and feature engineering are key pillars of any scientific study involving data analysis, and they should be adequately designed and performed from the first phases of the project. We call a "feature" a variable describing a particular trait of a person or an observation, usually recorded as a column in a dataset. Although pivotal, these data cleaning and feature engineering steps are sometimes done poorly or inefficiently, especially by beginners and inexperienced researchers. For this reason, we propose here our quick tips for data cleaning and feature engineering, explaining how to carry out these important preprocessing steps correctly while avoiding common mistakes and pitfalls. Although we designed these guidelines with bioinformatics and health informatics scenarios in mind, we believe they can be applied more generally to any scientific area. We therefore address these guidelines to any researcher or practitioner wanting to perform data cleaning or feature engineering. We believe our simple recommendations can help researchers and scholars perform better computational analyses that can lead, in turn, to more solid outcomes and more reliable discoveries.