School of Informatics, University of Edinburgh, Edinburgh, EH8 9AB, UK.
Centre for Genomic and Experimental Medicine, Institute of Genetics and Cancer, University of Edinburgh, Edinburgh, EH4 2XU, UK.
Genome Biol. 2023 Dec 5;24(1):278. doi: 10.1186/s13059-023-03114-5.
Epigenetic scores (EpiScores) can provide biomarkers of lifestyle and disease risk. Projecting new datasets onto a reference panel is challenging because technical and biological variation are difficult to separate in array data. Normalisation can standardise data distributions but may also remove population-level biological variation.
We compare two birth cohorts (the Lothian Birth Cohorts of 1921 and 1936; n = 387 and n = 498, respectively) with blood-based DNA methylation assessed at the same chronological age (79 years) and processed in the same lab but in different years and experimental batches. We examine the effect of 16 normalisation methods on a novel BMI EpiScore (trained in an external cohort, n = 18,413) and on Horvath's pan-tissue DNA methylation age, when the cohorts are normalised separately and together. The BMI EpiScore explains at most R² = 24.5% of the variance in BMI in LBC1936 (SWAN normalisation). Although R² differs between cohorts, the normalisation method makes a minimal difference to within-cohort estimates. Conversely, a range of absolute differences is seen in individual-level EpiScore estimates for BMI and age when cohorts are normalised separately versus together. Within-array methods yield identical EpiScores whether a cohort is normalised on its own or together with the second dataset, whereas between-array methods yield a range of differences.
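The contrast between within-array and between-array methods can be illustrated with a minimal sketch on simulated data. Everything here is an assumption made for illustration: the two toy normalisation functions stand in for the 16 methods compared in the study (they are not SWAN or any other published method), the EpiScore is taken to be a weighted sum of CpG values, and the cohort sizes, probe count and weights are arbitrary.

```python
import numpy as np

def episcore(betas, weights):
    """Hypothetical EpiScore: weighted sum of CpG values per sample (rows = samples)."""
    return betas @ weights

def within_array_norm(betas):
    """Toy within-array method: scale each sample using only its own probes."""
    mean = betas.mean(axis=1, keepdims=True)
    sd = betas.std(axis=1, keepdims=True)
    return (betas - mean) / sd

def quantile_norm(betas):
    """Toy between-array method: quantile normalisation across samples.

    Each sample's values are replaced, rank by rank, with the mean sorted
    profile of every sample in the batch, so the result depends on which
    samples are normalised together.
    """
    ranks = np.argsort(np.argsort(betas, axis=1), axis=1)
    reference = np.sort(betas, axis=1).mean(axis=0)
    return reference[ranks]

# Simulated data (assumption): cohort_b has a shifted distribution that
# stands in for batch or cohort differences.
rng = np.random.default_rng(0)
n_probes = 200
cohort_a = rng.beta(2.0, 5.0, size=(30, n_probes))
cohort_b = rng.beta(2.5, 5.0, size=(40, n_probes))
weights = rng.normal(size=n_probes)

def max_abs_diff(norm):
    """Largest per-individual change in cohort_a's EpiScore, separate vs joint."""
    separate = episcore(norm(cohort_a), weights)
    joint_all = norm(np.vstack([cohort_a, cohort_b]))
    joint = episcore(joint_all[: len(cohort_a)], weights)
    return float(np.max(np.abs(separate - joint)))

diff_within = max_abs_diff(within_array_norm)
diff_between = max_abs_diff(quantile_norm)
print(f"within-array:  {diff_within:.3g}")
print(f"between-array: {diff_between:.3g}")
```

In this sketch the within-array scores are unchanged (up to floating-point rounding) by adding the second cohort, because each sample is normalised from its own probes alone, while the between-array scores shift because the reference distribution is re-estimated from the combined batch.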
Normalisation methods that return similar EpiScores, whether cohorts are analysed separately or together, will minimise technical variation when projecting new data onto a reference panel. These methods are important in cases where raw data are unavailable and joint normalisation of cohorts is computationally expensive.