Department of Anthropology, University of Toronto at Mississauga, Mississauga, Canada.
Department of Biochemistry, University of São Paulo, São Paulo, Brazil.
Clin Epigenetics. 2023 Mar 11;15(1):41. doi: 10.1186/s13148-023-01459-z.
The Infinium EPIC array measures the methylation status of more than 850,000 CpG sites. The EPIC BeadChip combines two probe designs: Infinium Type I and Type II probes. The two probe types have different technical characteristics, which may confound analyses. Numerous normalization and pre-processing methods have been developed to reduce probe-type bias as well as other issues such as background and dye bias.
This study evaluates the performance of various normalization methods using 16 replicated samples and three metrics: absolute beta-value difference, overlap of non-replicated CpGs between replicate pairs, and effect on beta-value distributions. Additionally, we carried out Pearson's correlation and intraclass correlation coefficient (ICC) analyses using both raw and SeSAMe 2 normalized data.
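As a rough illustration of two of the replicate-based measures named above, the sketch below computes the mean absolute beta-value difference and Pearson's correlation for a single replicate pair. The function names and beta values are hypothetical and are not taken from the study; the study's own implementation is not described here.

```python
from math import sqrt
from statistics import mean


def mean_abs_beta_diff(rep_a, rep_b):
    """Mean absolute beta-value difference across CpGs for one replicate pair."""
    if len(rep_a) != len(rep_b):
        raise ValueError("replicates must cover the same CpGs")
    return mean(abs(a - b) for a, b in zip(rep_a, rep_b))


def pearson_r(x, y):
    """Pearson's correlation between two equally long lists of beta values."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical beta values for five CpGs in two technical replicates
# of the same sample (illustrative numbers only).
rep_a = [0.02, 0.95, 0.50, 0.88, 0.10]
rep_b = [0.03, 0.93, 0.55, 0.90, 0.08]

print("mean |delta beta|:", round(mean_abs_beta_diff(rep_a, rep_b), 4))
print("Pearson r:", round(pearson_r(rep_a, rep_b), 4))
```

A smaller mean absolute difference after normalization would indicate closer agreement between replicates, which is the sense in which the metric is used to rank methods.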
The method we define as SeSAMe 2, which applies the regular SeSAMe pipeline with an additional round of QC (pOOBAH masking), was the best-performing normalization method, while quantile-based methods performed worst. Whole-array Pearson's correlations were high. However, in agreement with previous studies, a substantial proportion of the probes on the EPIC array showed poor reproducibility (ICC < 0.50). The majority of poorly performing probes have beta values close to either 0 or 1 and relatively low standard deviations. These results suggest that low probe reliability largely reflects limited biological variation rather than technical measurement variation. Importantly, normalizing the data with SeSAMe 2 dramatically improved ICC estimates, with the proportion of probes with ICC values > 0.50 increasing from 45.18% (raw data) to 61.35% (SeSAMe 2).
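The link between low between-sample variation and low ICC can be seen in a small sketch. It assumes a one-way random-effects ICC, often written ICC(1,1); the abstract does not state which ICC form the study used, so this choice, the function name, and the beta values are all illustrative assumptions.

```python
from statistics import mean


def icc_oneway(groups):
    """One-way random-effects ICC(1,1) for a single probe.

    groups: one tuple of replicate beta values per sample,
    with the same number of replicates in every tuple.
    """
    n = len(groups)        # number of samples
    k = len(groups[0])     # replicates per sample
    grand = mean(v for g in groups for v in g)
    sample_means = [mean(g) for g in groups]
    # Between-sample and within-sample mean squares.
    msb = k * sum((m - grand) ** 2 for m in sample_means) / (n - 1)
    msw = sum((v - m) ** 2 for g, m in zip(groups, sample_means) for v in g) / (n * (k - 1))
    return (msb - msw) / (msb + (k - 1) * msw)


# Hypothetical probes with the same replicate noise (each value is 0.01 from its sample mean).
variable_probe = [(0.19, 0.21), (0.41, 0.39), (0.59, 0.61), (0.81, 0.79)]
invariant_probe = [(0.01, 0.03), (0.03, 0.01), (0.02, 0.04), (0.03, 0.01)]

print("variable probe ICC:", round(icc_oneway(variable_probe), 3))
print("invariant probe ICC:", round(icc_oneway(invariant_probe), 3))
```

Both probes have replicate pairs that differ by only 0.02, yet the probe whose beta values sit near 0 in every sample yields an ICC below 0.50 because there is almost no between-sample variance for the replicate noise to be compared against.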