Institute of Biological Psychiatry, Mental Health Center Sankt Hans, Roskilde, 4000, Denmark.
The Lundbeck Foundation Initiative for Integrative Psychiatric Research, iPSYCH, Aarhus, Denmark.
Commun Biol. 2023 Jan 26;6(1):101. doi: 10.1038/s42003-023-04477-y.
Sample recruitment for research consortia, biobanks, and personal genomics companies spans years, necessitating genotyping in batches, using different technologies. As marker content on genotyping arrays varies, integrating such datasets is non-trivial, and its impact on haplotype estimation (phasing) and whole-genome imputation, both necessary steps for complex trait analysis, remains under-evaluated. Using the iPSYCH dataset, comprising 130,438 individuals genotyped in two stages on different arrays, we evaluated phasing and imputation performance across multiple phasing methods and data integration protocols. While phasing accuracy varied by choice of method and data integration protocol, imputation accuracy varied mostly between data integration protocols. We demonstrate an attenuation in imputation accuracy within samples of non-European origin, highlighting challenges to studying complex traits in diverse populations. Finally, imputation errors can bias association tests and reduce the predictive utility of polygenic scores. Carefully optimized data integration strategies enhance accuracy and replicability of complex trait analyses in complex biobanks.
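The abstract does not say how imputation accuracy was scored. As an illustration only, one widely used measure is the squared Pearson correlation between known genotypes (coded 0, 1, or 2 copies of an allele) and imputed dosages at markers that were masked before imputation. The sketch below assumes that metric; the function name `dosage_r2` and the toy values are hypothetical and are not taken from the study.

```python
def dosage_r2(true_genotypes, imputed_dosages):
    """Squared Pearson correlation between true genotypes and imputed dosages.

    Assumption: genotypes are allele counts (0, 1, 2) and dosages are
    expected allele counts in [0, 2], for one marker across samples.
    Returns NaN when either vector has no variance (e.g. a monomorphic marker).
    """
    if len(true_genotypes) != len(imputed_dosages):
        raise ValueError("inputs must have the same length")
    n = len(true_genotypes)
    if n < 2:
        raise ValueError("need at least two samples")

    mean_t = sum(true_genotypes) / n
    mean_d = sum(imputed_dosages) / n

    # Sums of squares and cross-products around the means.
    cov = sum((t - mean_t) * (d - mean_d)
              for t, d in zip(true_genotypes, imputed_dosages))
    var_t = sum((t - mean_t) ** 2 for t in true_genotypes)
    var_d = sum((d - mean_d) ** 2 for d in imputed_dosages)

    if var_t == 0 or var_d == 0:
        return float("nan")
    return (cov * cov) / (var_t * var_d)


if __name__ == "__main__":
    # Toy data for one masked marker in five samples (made-up values).
    truth = [0, 1, 2, 1, 0]
    dosage = [0.1, 0.9, 1.8, 1.2, 0.0]
    print(f"r2 = {dosage_r2(truth, dosage):.3f}")
```

Computing this per marker and then summarizing by ancestry group or by data integration protocol would be one way to make comparisons of the kind the abstract describes; the authors' actual procedure may differ.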