Algorithms for reproducible bioinformatics, Genome Informatics, Institute of Human Genetics, University Hospital Essen, University of Duisburg-Essen, Essen, Germany.
Institute of Pathology, University Hospital Essen, University of Duisburg-Essen, Essen, Germany.
F1000Res. 2021 Jan 18;10:33. doi: 10.12688/f1000research.29032.2. eCollection 2021.
Data analysis often entails a multitude of heterogeneous steps, from the application of various command line tools to the use of scripting languages like R or Python for the generation of plots and tables. It is widely recognized that data analyses should ideally be conducted in a reproducible way. Reproducibility enables technical validation and regeneration of results on the original or even new data. However, reproducibility alone is by no means sufficient to deliver an analysis that is of lasting impact (i.e., sustainable) for the field, or even just for one research group. We postulate that it is equally important to ensure adaptability and transparency. The former describes the ability to modify the analysis to answer extended or slightly different research questions. The latter describes the ability to understand the analysis in order to judge whether it is not only technically but also methodologically valid. Here, we analyze the properties needed for a data analysis to become reproducible, adaptable, and transparent. We show how the popular workflow management system Snakemake can be used to guarantee these properties, and how it enables an ergonomic, combined, unified representation of all steps involved in data analysis, ranging from raw data processing to quality control and fine-grained, interactive exploration and plotting of final results.
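As a rough illustration of what a unified representation of analysis steps can look like, the sketch below describes each step by its input and output files and derives the execution order from those declarations. It is a toy model in plain Python, not Snakemake's API or syntax; the step names, file paths, and the `execution_order` helper are all hypothetical and chosen only for this example.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical analysis steps: name -> (input files, output files).
RULES = {
    "plot_quality": (["calls/sample.vcf"], ["plots/quality.svg"]),
    "call_variants": (["mapped/sample.bam"], ["calls/sample.vcf"]),
    "map_reads": (["raw/sample.fastq"], ["mapped/sample.bam"]),
}


def execution_order(rules):
    """Return step names so that every step runs after the steps producing its inputs."""
    # Map each output file to the step that creates it.
    producer = {out: name for name, (_, outs) in rules.items() for out in outs}
    # A step depends on the producers of its inputs; files nobody produces
    # (such as raw data) are treated as already present.
    graph = {
        name: {producer[path] for path in ins if path in producer}
        for name, (ins, _) in rules.items()
    }
    return list(TopologicalSorter(graph).static_order())


order = execution_order(RULES)
print(order)
```

Because the order is derived from declared inputs and outputs rather than from the order in which steps are written down, adding or replacing a step only requires changing its own declaration, which is one way to read the adaptability and transparency described above.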