Department of Biostatistics, University of Washington, Seattle, WA 98195-7232, USA.
Bioinformatics. 2012 Dec 15;28(24):3326-8. doi: 10.1093/bioinformatics/bts606. Epub 2012 Oct 11.
Genome-wide association studies are widely used to investigate the genetic basis of diseases and traits, but they pose many computational challenges. We developed gdsfmt and SNPRelate, R packages for multi-core symmetric multiprocessing computer architectures, to accelerate two key computations on SNP data: principal component analysis (PCA) and relatedness analysis using identity-by-descent (IBD) measures. The kernels of our algorithms are written in C/C++ and highly optimized. Benchmarks show that the uniprocessor implementations of PCA and IBD are approximately 8 to 50 times faster than the implementations provided in the popular EIGENSTRAT (v3.0) and PLINK (v1.07) programs, respectively, and that using eight cores raises the speedup to 30- to 300-fold. SNPRelate can analyse tens of thousands of samples with millions of SNPs. For example, our package was used to perform PCA on 55 324 subjects from the 'Gene-Environment Association Studies' consortium studies.
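To make the PCA computation concrete, the following is a minimal sketch in plain Python of one common formulation: standardize each SNP by its allele frequency, build the sample-by-sample genetic covariance matrix, and extract the leading eigenvector. It is not the SNPRelate implementation, which is written in C/C++ and whose algorithm the abstract does not describe. The function names, the toy genotype matrix, and the use of power iteration are all assumptions made for illustration.

```python
import math


def standardize(genotypes):
    """Center and scale each SNP column of a samples x SNPs matrix of 0/1/2 allele counts."""
    n, m = len(genotypes), len(genotypes[0])
    z = [[0.0] * m for _ in range(n)]
    for j in range(m):
        p = sum(row[j] for row in genotypes) / (2.0 * n)  # allele frequency of SNP j
        if p <= 0.0 or p >= 1.0:
            continue  # monomorphic SNP: no information, column stays zero
        scale = math.sqrt(2.0 * p * (1.0 - p))
        for i in range(n):
            z[i][j] = (genotypes[i][j] - 2.0 * p) / scale
    return z


def genetic_covariance(z):
    """Return the symmetric samples x samples matrix Z Z^T / (number of SNPs)."""
    n, m = len(z), len(z[0])
    cov = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for k in range(i, n):
            s = sum(a * b for a, b in zip(z[i], z[k])) / m
            cov[i][k] = s
            cov[k][i] = s
    return cov


def top_eigenvector(matrix, iterations=200):
    """Approximate the leading eigenvector by power iteration (illustrative choice)."""
    n = len(matrix)
    v = [float(i + 1) for i in range(n)]  # deterministic, non-constant start vector
    for _ in range(iterations):
        w = [sum(matrix[i][k] * v[k] for k in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return v


# Hypothetical toy data: 6 samples x 4 SNPs, two groups with opposite allele patterns.
genotypes = [
    [0, 0, 2, 2],
    [0, 1, 2, 2],
    [0, 0, 2, 1],
    [2, 2, 0, 0],
    [2, 1, 0, 0],
    [2, 2, 0, 1],
]

pc1 = top_eigenvector(genetic_covariance(standardize(genotypes)))
print([round(x, 3) for x in pc1])
```

On this toy matrix the first principal component assigns opposite signs to the two groups of samples. The covariance step requires one dot product over all SNPs for every pair of samples, so its cost grows quadratically with the number of samples. That is why optimized, multi-core kernels matter at the scale of tens of thousands of samples.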