Department of Biostatistics and Informatics, Colorado School of Public Health, University of Colorado Anschutz Medical Campus, Aurora, CO, USA.
Department of Epidemiology, Colorado School of Public Health, University of Colorado Anschutz Medical Campus, Aurora, CO, USA.
BMC Res Notes. 2021 Sep 8;14(1):352. doi: 10.1186/s13104-021-05741-2.
Illumina BeadChip arrays are commonly used to generate DNA methylation data for large epidemiological studies. Updates in technology over time create challenges for data harmonization within and between studies, many of which obtained data from both the older 450K and the newer EPIC platforms. The pre-processing pipeline for DNA methylation data is not trivial, and it influences downstream analyses. Incorporating different platforms adds a new level of technical variability that has not yet been taken into account by recommended pipelines. Our study evaluated how well various tools harmonize data from the two platform versions at each step of the pre-processing pipeline, including quality control (QC), normalization, batch effect adjustment, and correction for genomic inflation. We illustrate our novel approach using 450K and EPIC data from the Diabetes Autoimmunity Study in the Young (DAISY) prospective cohort.
We found that normalization and probe filtering had the biggest effect on data harmonization. Employing a meta-analysis was an effective and easily executable method for accounting for platform variability. Correcting for genomic inflation also helped with harmonization. We present guidelines for studies seeking to harmonize data from the 450K and EPIC platforms, which include the use of technical replicates for evaluating numerous pre-processing steps and employing a meta-analysis.
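As a rough illustration of the two statistical ideas named above, the sketch below combines per-platform effect estimates for a single CpG site and computes a genomic inflation factor. The abstract does not say which meta-analysis model or inflation estimator the authors used, so the fixed-effect inverse-variance weighting and the median-based lambda shown here are assumptions, and the function names `fixed_effect_meta` and `genomic_inflation` are hypothetical.

```python
import math
from statistics import median

# Median of a chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364


def fixed_effect_meta(beta_450k, se_450k, beta_epic, se_epic):
    """Inverse-variance fixed-effect meta-analysis of two platform estimates.

    Assumes the 450K and EPIC analyses were run separately on the same
    CpG site and each produced an effect estimate and a standard error.
    """
    w_450k = 1.0 / se_450k ** 2
    w_epic = 1.0 / se_epic ** 2
    beta = (w_450k * beta_450k + w_epic * beta_epic) / (w_450k + w_epic)
    se = math.sqrt(1.0 / (w_450k + w_epic))
    z = beta / se
    # Two-sided p-value under a standard normal null.
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return beta, se, z, p


def genomic_inflation(z_scores):
    """Median-based genomic inflation factor (lambda) from z-scores.

    Values well above 1 suggest inflated test statistics.
    """
    chi2 = [z * z for z in z_scores]
    return median(chi2) / CHI2_1DF_MEDIAN


if __name__ == "__main__":
    # Made-up illustrative numbers, not results from the study.
    print(fixed_effect_meta(0.10, 0.05, 0.20, 0.05))
    print(genomic_inflation([0.2, -0.9, 1.4, -0.5, 0.7]))
```

Running each platform separately and pooling afterwards keeps platform-specific technical variation out of a single joint model, which is one reason a meta-analysis can be easy to apply; whether a fixed-effect model is appropriate depends on how heterogeneous the two platforms' estimates are.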