Paschou Peristera, Ziv Elad, Burchard Esteban G, Choudhry Shweta, Rodriguez-Cintron William, Mahoney Michael W, Drineas Petros
Department of Molecular Biology and Genetics, Democritus University of Thrace, Alexandroupoli, Greece.
PLoS Genet. 2007 Sep;3(9):1672-86. doi: 10.1371/journal.pgen.0030160.
Existing methods to ascertain small sets of markers for the identification of human population structure require prior knowledge of individual ancestry. Based on Principal Components Analysis (PCA) and recent results in theoretical computer science, we present a novel algorithm that, applied to genome-wide data, selects small subsets of SNPs (PCA-correlated SNPs) that reproduce the structure found by PCA on the complete dataset, without the use of ancestry information. Evaluating our method on a previously described dataset (10,805 SNPs, 11 populations), we demonstrate that a very small set of PCA-correlated SNPs can be effectively employed to assign individuals to particular continents or populations, using a simple clustering algorithm. We validate our methods on the HapMap populations and achieve perfect intercontinental differentiation with 14 PCA-correlated SNPs. The Chinese and Japanese populations can be easily differentiated using fewer than 100 PCA-correlated SNPs ascertained after evaluating 1.7 million SNPs from HapMap. We show that, in general, structure-informative SNPs are not portable across geographic regions. However, we manage to identify a general set of 50 PCA-correlated SNPs that effectively assigns individuals to one of nine different populations. Compared to analysis with the measure of informativeness, our methods, although unsupervised, achieved similar results. We proceed to demonstrate that our algorithm can be effectively used for the analysis of admixed populations without having to trace the origin of individuals. Analyzing a Puerto Rican dataset (192 individuals, 7,257 SNPs), we show that PCA-correlated SNPs can be used to successfully predict structure and ancestry proportions. We subsequently validate these SNPs for structure identification in an independent Puerto Rican dataset.
The algorithm that we introduce runs in seconds and can be easily applied to large genome-wide datasets, facilitating the identification of population substructure, stratification assessment in multi-stage whole-genome association studies, and the study of demographic history in human populations.
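The abstract does not spell out how PCA-correlated SNPs are chosen. As a minimal sketch only, the example below assumes that each SNP is scored by the sum of its squared loadings on the top k principal components and that the highest-scoring SNPs are kept. The function names (`pca_correlated_snps`, `toy_genotypes`), the parameters, and the synthetic two-population data are all hypothetical and serve only to illustrate unsupervised, PCA-based marker selection.

```python
import numpy as np


def pca_correlated_snps(genotypes, n_components, n_select):
    """Rank SNPs by squared loadings on the top principal components.

    genotypes: (individuals x SNPs) array of allele counts (0, 1, 2).
    Returns the indices of the n_select highest-scoring SNPs and all scores.
    Assumption: this scoring rule is an illustration, not a confirmed
    reproduction of the published algorithm.
    """
    X = np.asarray(genotypes, dtype=float)
    X = X - X.mean(axis=0)  # center each SNP column
    # Rows of vt are right singular vectors: one loading per SNP.
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    scores = np.sum(vt[:n_components] ** 2, axis=0)
    top = np.argsort(scores)[::-1][:n_select]
    return top, scores


def toy_genotypes(seed=0):
    """Synthetic data: two populations, 50 SNPs, only the first 5 informative."""
    rng = np.random.default_rng(seed)
    n_per_pop, n_snps, n_informative = 100, 50, 5
    freq_a = np.full(n_snps, 0.5)
    freq_b = np.full(n_snps, 0.5)
    freq_a[:n_informative] = 0.95  # allele common in population A
    freq_b[:n_informative] = 0.05  # and rare in population B
    pop_a = rng.binomial(2, freq_a, size=(n_per_pop, n_snps))
    pop_b = rng.binomial(2, freq_b, size=(n_per_pop, n_snps))
    return np.vstack([pop_a, pop_b])


G = toy_genotypes()
selected, scores = pca_correlated_snps(G, n_components=1, n_select=5)
print(sorted(selected.tolist()))
```

No population labels are passed to `pca_correlated_snps`, which mirrors the unsupervised setting the abstract describes. On real data the number of components and the number of SNPs to keep would have to be chosen by the analyst.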