Department of Computer Science, University of California, Los Angeles, California 90095-1596, USA.
Genetics. 2011 Jun;188(2):449-60. doi: 10.1534/genetics.111.128595. Epub 2011 Apr 5.
Genome-wide association studies (GWASs) have been effective at identifying genomic regions associated with disease traits. In a typical GWAS, an informative subset of single-nucleotide polymorphisms (SNPs), called tag SNPs, is genotyped in case/control individuals. Once the tag SNP statistics are computed, the genomic regions in linkage disequilibrium (LD) with the most significantly associated tag SNPs are believed to contain the causal polymorphisms. However, such LD regions are often large and contain many additional polymorphisms, so following up every SNP in these regions is too costly to be feasible for biological validation. In this article we address how to characterize these regions cost-effectively, with the goal of giving investigators a clear direction for biological validation. We introduce a follow-up study approach for identifying all untyped associated SNPs: additional SNPs, called follow-up SNPs, are selected from the associated regions and genotyped in the original case/control individuals. We introduce a novel SNP selection method that aims to maximize the number of associated SNPs among the chosen follow-up SNPs. We show how the observed statistics of the original tag SNPs, together with human genetic variation reference data such as the HapMap Project, can be used to identify the follow-up SNPs. Using simulated and real association studies based on the HapMap data and the Wellcome Trust Case Control Consortium, we demonstrate that our method outperforms the traditional correlation-based and distance-based follow-up SNP selection approaches. Our method is publicly available at http://genetics.cs.ucla.edu/followupSNPs.
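As an illustration only, the idea of combining observed tag SNP statistics with reference LD to rank follow-up candidates might be sketched as below. This is not the published method: it is a simplified single-tag version that assumes the tag and candidate z-scores are jointly normal with correlation r (taken from a reference panel), so that, given the tag statistic, the candidate statistic is Normal(r * z_tag, 1 - r^2). The function names, SNP labels, and numeric values are hypothetical.

```python
import math


def normal_sf(x):
    """Upper-tail probability P(Z > x) for a standard normal Z."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def prob_significant(z_tag, r, threshold):
    """P(|z_candidate| > threshold | z_tag) under the bivariate-normal assumption.

    Assumption (not taken from the source): conditional on the tag statistic,
    the candidate statistic is Normal(mean=r * z_tag, variance=1 - r**2).
    """
    mean = r * z_tag
    var = 1.0 - r * r
    if var <= 1e-12:
        # Perfect LD: the candidate statistic is fully determined by the tag.
        return 1.0 if abs(mean) > threshold else 0.0
    sd = math.sqrt(var)
    upper = normal_sf((threshold - mean) / sd)  # P(z_candidate > threshold)
    lower = normal_sf((threshold + mean) / sd)  # P(z_candidate < -threshold)
    return upper + lower


def select_follow_up(candidates, z_tag, threshold, k):
    """Return the k candidate names most likely to pass the threshold.

    candidates maps a (hypothetical) SNP name to its correlation r with the tag.
    """
    ranked = sorted(
        candidates.items(),
        key=lambda item: prob_significant(z_tag, item[1], threshold),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]


if __name__ == "__main__":
    # Hypothetical correlations between one tag SNP and three untyped SNPs.
    candidates = {"snpA": 0.9, "snpB": 0.2, "snpC": -0.8}
    print(select_follow_up(candidates, z_tag=5.0, threshold=4.0, k=2))
```

Ranking by the probability of passing the significance threshold, rather than by |r| alone, lets the observed strength of the tag signal influence the choice. A fuller treatment would condition on several tag SNPs at once instead of a single one.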