A high-performance computing toolset for relatedness and principal component analysis of SNP data.

Author information

Department of Biostatistics, University of Washington, Seattle, WA 98195-7232, USA.

Publication information

Bioinformatics. 2012 Dec 15;28(24):3326-8. doi: 10.1093/bioinformatics/bts606. Epub 2012 Oct 11.

DOI:10.1093/bioinformatics/bts606

PMID:23060615

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC3519454/

Abstract

Genome-wide association studies are widely used to investigate the genetic basis of diseases and traits, but they pose many computational challenges. We developed gdsfmt and SNPRelate (R packages for multi-core symmetric multiprocessing computer architectures) to accelerate two key computations on SNP data: principal component analysis (PCA) and relatedness analysis using identity-by-descent measures. The kernels of our algorithms are written in C/C++ and highly optimized. Benchmarks show the uniprocessor implementations of PCA and identity-by-descent are ∼8-50 times faster than the implementations provided in the popular EIGENSTRAT (v3.0) and PLINK (v1.07) programs, respectively, and can be sped up to 30-300-fold by using eight cores. SNPRelate can analyse tens of thousands of samples with millions of SNPs. For example, our package was used to perform PCA on 55 324 subjects from the 'Gene-Environment Association Studies' consortium studies.
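The PCA computation named in the abstract can be illustrated with a minimal NumPy sketch. This is not the SNPRelate implementation (whose kernels are in C/C++) and it does not use the SNPRelate API. The function name `snp_pca`, the 0/1/2 genotype coding, the mean imputation of missing calls, and the scaling by the binomial variance 2p(1-p) are all assumptions made for illustration.

```python
import numpy as np


def snp_pca(genotypes, n_components=2):
    """Illustrative PCA on a samples-by-SNPs genotype matrix.

    genotypes: array of shape (n_samples, n_snps) holding 0/1/2 allele
    counts, with np.nan for missing calls (assumed coding).
    Returns the top eigenvalues and the matching sample eigenvectors.
    """
    g = np.asarray(genotypes, dtype=float)

    # Allele frequency per SNP, estimated from non-missing genotypes.
    p = np.nanmean(g, axis=0) / 2.0

    # Drop monomorphic SNPs, which carry no information and have zero variance.
    keep = (p > 0) & (p < 1)
    if not keep.any():
        raise ValueError("no polymorphic SNPs left after filtering")
    g, p = g[:, keep], p[keep]

    # Center each SNP on its mean (2p) and scale by the binomial
    # standard deviation; missing calls become 0 (mean imputation).
    z = np.nan_to_num((g - 2.0 * p) / np.sqrt(2.0 * p * (1.0 - p)))

    # Sample-by-sample genetic covariance matrix, averaged over SNPs.
    cov = z @ z.T / z.shape[1]

    # eigh returns eigenvalues in ascending order; take the largest ones.
    eigval, eigvec = np.linalg.eigh(cov)
    order = np.argsort(eigval)[::-1][:n_components]
    return eigval[order], eigvec[:, order]


# Toy usage: 4 samples, 5 SNPs, one missing genotype.
toy = np.array([
    [0, 1, 2, 0, 1],
    [0, 2, 2, 1, np.nan],
    [2, 0, 0, 1, 1],
    [2, 0, 1, 2, 0],
])
values, vectors = snp_pca(toy, n_components=2)
print(values)
print(vectors)
```

The covariance matrix here is n_samples by n_samples, so its cost grows with the square of the sample count times the SNP count. That scaling is why the multi-core, block-wise computation described in the abstract matters for tens of thousands of samples.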

Similar articles

FastPop: a rapid principal component derived method to infer intercontinental ancestry using genetic data.

BMC Bioinformatics. 2016 Mar 9;17:122. doi: 10.1186/s12859-016-0965-1.

Sparse principal component analysis for identifying ancestry-informative markers in genome-wide association studies.

Genet Epidemiol. 2012 May;36(4):293-302. doi: 10.1002/gepi.21621. Epub 2012 Apr 16.

VCF2PCACluster: a simple, fast and memory-efficient tool for principal component analysis of tens of millions of SNPs.

BMC Bioinformatics. 2024 May 1;25(1):173. doi: 10.1186/s12859-024-05770-1.

Second-generation PLINK: rising to the challenge of larger and richer datasets.

Gigascience. 2015 Feb 25;4:7. doi: 10.1186/s13742-015-0047-8. eCollection 2015.

ParallABEL: an R library for generalized parallelization of genome-wide association studies.

BMC Bioinformatics. 2010 Apr 29;11:217. doi: 10.1186/1471-2105-11-217.

Inference of relationships in population data using identity-by-descent and identity-by-state.

PLoS Genet. 2011 Sep;7(9):e1002287. doi: 10.1371/journal.pgen.1002287. Epub 2011 Sep 22.

Fast principal component analysis of large-scale genome-wide data.

PLoS One. 2014 Apr 9;9(4):e93766. doi: 10.1371/journal.pone.0093766. eCollection 2014.

Genome-wide Analysis of Large-scale Longitudinal Outcomes using Penalization - GALLOP algorithm.

Sci Rep. 2018 May 1;8(1):6815. doi: 10.1038/s41598-018-24578-7.

GWAS on your notebook: fast semi-parallel linear and logistic regression for genome-wide association studies.

BMC Bioinformatics. 2013 May 28;14:166. doi: 10.1186/1471-2105-14-166.

Cited by

Epigenome-wide association study of placental co-methylated regions in newborns for prenatal opioid exposure.

Environ Epigenet. 2025 Sep 4;11(1):dvaf021. doi: 10.1093/eep/dvaf021. eCollection 2025.

Local Adaptation Drives Leaf Thermoregulation in Tropical Rainforest Trees.

Glob Chang Biol. 2025 Sep;31(9):e70461. doi: 10.1111/gcb.70461.

A high-throughput phenotyping dataset for GWAS analysis of maize under combined drought and heat stress.

Data Brief. 2025 Aug 5;62:111947. doi: 10.1016/j.dib.2025.111947. eCollection 2025 Oct.

Multi-locus genome-wide association studies for root system architectural traits in Ethiopian sorghum (Sorghum bicolor L.) landraces.

BMC Plant Biol. 2025 Sep 2;25(1):1180. doi: 10.1186/s12870-025-07271-6.

Using landscape genomics to infer genomic regions involved in environmental adaptation of soybean genebank accessions.

BMC Plant Biol. 2025 Sep 1;25(1):1175. doi: 10.1186/s12870-025-07202-5.

Abundant genetic variation is retained in many laboratory schistosome populations.

PLoS Pathog. 2025 Aug 20;21(8):e1013439. doi: 10.1371/journal.ppat.1013439. eCollection 2025 Aug.

Regular Plasmodium falciparum importation onto Bioko Island, Equatorial Guinea, hampers malaria elimination from the island.

PLOS Glob Public Health. 2025 Aug 19;5(8):e0004999. doi: 10.1371/journal.pgph.0004999. eCollection 2025.

Exploring genetic diversity and population structure of Myanmar indigenous chickens using double digest restriction site-associated DNA sequencing.

Anim Genet. 2025 Aug;56(4):e70038. doi: 10.1111/age.70038.

Footprints of Worldwide Adaptation in Structured Populations of Drosophila melanogaster Through the Expanded DEST 2.0 Genomic Resource.

Mol Biol Evol. 2025 Jul 30;42(8). doi: 10.1093/molbev/msaf132.

NewtCap: An Efficient Target Capture Approach to Boost Genomic Studies in Salamandridae (True Salamanders and Newts).

Ecol Evol. 2025 Aug 12;15(8):e71835. doi: 10.1002/ece3.71835. eCollection 2025 Aug.

References

GWASTools: an R/Bioconductor package for quality control and analysis of genome-wide association studies.

Bioinformatics. 2012 Dec 15;28(24):3329-31. doi: 10.1093/bioinformatics/bts610. Epub 2012 Oct 10.

The variant call format and VCFtools.

Bioinformatics. 2011 Aug 1;27(15):2156-8. doi: 10.1093/bioinformatics/btr330. Epub 2011 Jun 7.

A map of human genome variation from population-scale sequencing.

Nature. 2010 Oct 28;467(7319):1061-73. doi: 10.1038/nature09534.

Quality control and quality assurance in genotypic data for genome-wide association studies.

Genet Epidemiol. 2010 Sep;34(6):591-602. doi: 10.1002/gepi.20516.

New approaches to population stratification in genome-wide association studies.

Nat Rev Genet. 2010 Jul;11(7):459-63. doi: 10.1038/nrg2813.

The Gene, Environment Association Studies consortium (GENEVA): maximizing the knowledge obtained from GWAS by collaboration across studies of multiple conditions.

Genet Epidemiol. 2010 May;34(4):364-72. doi: 10.1002/gepi.20492.

Case-control association testing in the presence of unknown relationships.

Genet Epidemiol. 2009 Dec;33(8):668-78. doi: 10.1002/gepi.20418.

A unified association analysis approach for family and unrelated samples correcting for stratification.

Am J Hum Genet. 2008 Feb;82(2):352-65. doi: 10.1016/j.ajhg.2007.10.009.

PLINK: a tool set for whole-genome association and population-based linkage analyses.

Am J Hum Genet. 2007 Sep;81(3):559-75. doi: 10.1086/519795. Epub 2007 Jul 25.

Principal components analysis corrects for stratification in genome-wide association studies.

Nat Genet. 2006 Aug;38(8):904-9. doi: 10.1038/ng1847. Epub 2006 Jul 23.
