Höffler Kira D, Katrinli Seyma, Halvorsen Matthew W, Stavrum Anne-Kristin, O'Connell Kevin S, Shadrin Alexey, Djurovic Srdjan, Andreassen Ole A, Crowley James J, Haavik Jan, Hagen Kristen, Kvale Gerd, Ressler Kerry, Hansen Bjarne, Soares Jair C, Fries Gabriel R, Smith Alicia K, Le Hellard Stéphanie
University of Bergen and Haukeland University Hospital.
Emory University.
Res Sq. 2025 May 18:rs.3.rs-6580295. doi: 10.21203/rs.3.rs-6580295/v1.
Genetic ancestry is an important factor to account for in DNA methylation studies because genetic variation influences DNA methylation patterns. One approach uses principal components (PCs) calculated from CpG sites that overlap with common SNPs to adjust for ancestry when genotyping data is not available. However, this method does not remove technical and biological variations, such as sex and age, prior to calculating the PCs. The first PC is therefore often associated with factors other than ancestry.
We developed an adapted approach that 1) residualizes the CpG data overlapping with common SNPs for control probe PCs, sex, age, and cell type proportions to remove the effects of technical and biological factors, and 2) integrates the residualized data with genotype calls from the SNP probes (commonly referred to as rs probes) present on the arrays before calculating PCs. We then evaluated the clustering ability of the resulting PCs and their relationship to genetic ancestry.
The PCs generated by the adapted approach led to improved clustering of repeated samples from the same individual and to stronger associations with genetic ancestry groups predicted from genotype information, compared to the original approach.
We show that the adapted approach improves the adjustment for genetic ancestry in DNA methylation studies. It can be integrated into existing R pipelines for commercial methylation arrays, such as 450K, EPICv1, and EPICv2. The code is available on GitHub (https://github.com/KiraHoeffler/EpiAnceR).
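The two steps of the adapted approach can be illustrated with a minimal sketch. This is not the code from the linked repository, which is written in R. It is a Python/NumPy illustration under stated assumptions: the function names `residualize` and `ancestry_pcs` are hypothetical, the column scaling before the PCA is an assumption, and the data are simulated with only age and sex as covariates (control probe PCs and cell type proportions would be added as further covariate columns).

```python
import numpy as np


def residualize(values, covariates):
    """Regress each column of `values` (samples x probes) on the covariates
    (samples x k) plus an intercept, and return the residuals."""
    design = np.column_stack([np.ones(values.shape[0]), covariates])
    coef, *_ = np.linalg.lstsq(design, values, rcond=None)
    return values - design @ coef


def ancestry_pcs(cpg, covariates, snp_calls, n_pcs=2):
    """Step 1: residualize the CpG data for the covariates.
    Step 2: combine the residuals with the SNP-probe genotype calls,
    then compute principal components on the combined matrix."""
    resid = residualize(cpg, covariates)
    combined = np.column_stack([resid, snp_calls])
    # Centre and scale each column (assumption: equal weight per feature).
    combined = combined - combined.mean(axis=0)
    sd = combined.std(axis=0)
    sd[sd == 0] = 1.0  # avoid division by zero for constant columns
    combined = combined / sd
    # PCA via singular value decomposition; sample scores are U * S.
    u, s, _ = np.linalg.svd(combined, full_matrices=False)
    return u[:, :n_pcs] * s[:n_pcs]


# Simulated data only: two ancestry groups, with age and sex as covariates.
rng = np.random.default_rng(0)
n = 60
group = np.repeat([0, 1], n // 2)
age = rng.normal(40, 10, n)
sex = rng.integers(0, 2, n)
covariates = np.column_stack([age, sex])

# CpG values carry an age effect (to be removed) and a group effect (to keep).
cpg = 0.05 * age[:, None] + 0.5 * group[:, None] + rng.normal(0, 0.1, (n, 50))
# Genotype calls coded 0/1/2, loosely tied to the simulated group.
snp_calls = np.clip(group[:, None] * 2 + rng.integers(-1, 2, (n, 10)), 0, 2).astype(float)

pcs = ancestry_pcs(cpg, covariates, snp_calls)
```

In this toy setting the covariate effects are removed before the PCA, so the leading component is free to capture the simulated group structure rather than age or sex, which is the motivation given for the adapted approach.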