BGI Research, Sanya, 572025, People's Republic of China.
Key Laboratory of Neuroregeneration of Jiangsu and Ministry of Education, Co-innovation Center of Neuroregeneration, NMPA Key Laboratory for Research and Evaluation of Tissue Engineering Technology Products, Nantong University, Nantong, 226001, People's Republic of China.
BMC Bioinformatics. 2024 May 1;25(1):173. doi: 10.1186/s12859-024-05770-1.
Principal component analysis (PCA) is an important and widely used unsupervised learning method for determining population structure from genetic variation. Genome sequencing of thousands of individuals usually generates tens of millions of SNPs, which makes PCA and its interpretation challenging. Here we present VCF2PCACluster, a simple, fast and memory-efficient tool for kinship estimation, PCA, clustering analysis and visualization based on VCF-formatted SNPs. We implemented five kinship estimation methods and three clustering methods for users to choose from. Moreover, unlike other PCA tools, VCF2PCACluster provides a clustering function based on the PCA result, which enables users to identify population structure automatically and clearly. In PCA on tens of millions of SNPs, we demonstrated that the tool achieves the same accuracy as the popular PLINK2 software with higher performance; in particular, the peak memory usage of VCF2PCACluster is independent of the number of SNPs.
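The workflow described in the abstract (kinship estimation, then PCA, then clustering on the principal components) can be illustrated with a minimal NumPy sketch. This is not the VCF2PCACluster implementation: the function names are hypothetical, the kinship estimator is a simple centered-genotype covariance that is not necessarily one of the tool's five methods, and the clustering step is plain k-means, which is not necessarily one of its three. The sketch assumes genotypes arrive as chunks of 0/1/2 dosages; because only the samples-by-samples matrix is accumulated, peak memory in this sketch depends on the number of samples and the chunk size rather than on the total number of SNPs.

```python
import numpy as np


def streaming_kinship(chunks, n_samples):
    """Accumulate a centered-genotype kinship matrix one SNP chunk at a time.

    Each chunk is an array of shape (snps_in_chunk, n_samples) holding
    0/1/2 dosages. Only the n_samples x n_samples accumulator is kept.
    """
    K = np.zeros((n_samples, n_samples))
    n_snps = 0
    for G in chunks:
        G = np.asarray(G, dtype=float)
        Z = G - G.mean(axis=1, keepdims=True)  # center each SNP across samples
        K += Z.T @ Z
        n_snps += G.shape[0]
    return K / n_snps


def pca_from_kinship(K, n_components=2):
    """Return sample coordinates on the top principal components of K."""
    vals, vecs = np.linalg.eigh(K)  # eigenvalues in ascending order
    order = np.argsort(vals)[::-1][:n_components]
    top_vals = np.clip(vals[order], 0.0, None)  # guard against tiny negatives
    return vecs[:, order] * np.sqrt(top_vals)


def kmeans(X, k, n_iter=50, seed=0):
    """Plain Lloyd's k-means on the rows of X; returns one label per row."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels
```

A caller would parse the VCF into dosage chunks (parsing is omitted here), pass them to `streaming_kinship`, project with `pca_from_kinship`, and cluster the resulting coordinates with `kmeans`.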