Unit of Computational Medicine, Center for Molecular Medicine, Department of Medicine, Karolinska Institutet, Stockholm, Sweden.
Epigenetics. 2013 Mar;8(3):333-46. doi: 10.4161/epi.24008. Epub 2013 Feb 19.
The proper identification of differentially methylated CpGs is central to most epigenetic studies. The Illumina HumanMethylation450 BeadChip is widely used to quantify DNA methylation; nevertheless, the design of an appropriate analysis pipeline faces severe challenges due to the convolution of biological and technical variability and the presence of a signal bias between the Infinium I and II probe design types. Despite recent attempts to investigate how to analyze DNA methylation data from such an array design, it has not been possible to perform a comprehensive comparison between different bioinformatics pipelines because appropriate data sets with both a large sample size and a sufficient number of technical replicates were lacking. Here we perform such a comparative analysis, targeting the problems of reducing the technical variability, eliminating the probe design bias and reducing the batch effect. We exploit two unpublished data sets, which included technical replicates and were profiled for DNA methylation in peripheral blood, monocytes or muscle biopsies. We evaluated the performance of different analysis pipelines and demonstrated that: (1) it is critical to correct for the probe design type, since the amplitude of the measured methylation change depends on the underlying chemistry; (2) the effect of different normalization schemes is mixed, and the most effective methods in our hands were quantile normalization and Beta Mixture Quantile dilation (BMIQ); (3) it is beneficial to correct for batch effects. In conclusion, our comparative analysis using a comprehensive data set suggests an efficient pipeline for proper identification of differentially methylated CpGs using the Illumina 450K arrays.
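To make the quantile normalization step named in the abstract concrete, the following is a minimal sketch of the generic technique applied to a probes-by-samples matrix of methylation values. It is not the authors' pipeline: the function name `quantile_normalize`, the toy `betas` matrix and the stable tie-breaking rule are assumptions made for illustration, and neither the probe design type correction (for example BMIQ) nor the batch effect correction is shown.

```python
import numpy as np


def quantile_normalize(values):
    """Quantile-normalize the columns (samples) of a probes x samples matrix.

    Illustrative sketch only: ties are broken by order of appearance,
    which is a simplification of what dedicated packages do.
    """
    values = np.asarray(values, dtype=float)
    # Rank the probes within each sample (column).
    order = np.argsort(values, axis=0, kind="stable")
    sorted_vals = np.take_along_axis(values, order, axis=0)
    # The shared reference distribution is the mean across samples at each rank.
    reference = sorted_vals.mean(axis=1)
    # Write the reference value for each rank back to the probe holding that rank.
    normalized = np.empty_like(values)
    np.put_along_axis(normalized, order, reference[:, None], axis=0)
    return normalized


# Hypothetical toy data: 4 probes (rows) x 3 samples (columns).
betas = np.array([
    [0.10, 0.20, 0.15],
    [0.80, 0.90, 0.70],
    [0.50, 0.40, 0.60],
    [0.30, 0.60, 0.20],
])
normalized = quantile_normalize(betas)
print(normalized)
```

After this step every sample shares the same empirical distribution, while the ranking of probes within each sample is unchanged.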