School of Computing, College of Engineering, Computing and Cybernetics, Australian National University, Canberra, ACT 2600, Australia.
Bioinformatics. 2023 Sep 2;39(9). doi: 10.1093/bioinformatics/btad540.
Sequence simulation plays a vital role in phylogenetics, with applications such as evaluating phylogenetic methods, testing hypotheses, and generating training data for machine-learning applications. We recently introduced a new simulator for multiple sequence alignments called AliSim, which outperformed existing tools. However, as demand for simulating large data sets grows, AliSim remains slow because of its sequential implementation; for example, simulating millions of sequence alignments took AliSim several days or weeks. Parallelization has been used for many phylogenetic inference methods but not yet for sequence simulation.
This paper introduces AliSim-HPC, which, for the first time, employs high-performance computing for phylogenetic simulations. AliSim-HPC parallelizes the simulation process at the multi-core level using OpenMP and at the multi-CPU level using the message passing interface (MPI). AliSim-HPC is highly efficient and scalable: it reduces the runtime to simulate 100 large gap-free alignments (30 000 sequences of one million sites each) from over one day to 11 min using 256 CPU cores on a cluster with six computing nodes, a 153-fold speedup. While the OpenMP version can only simulate gap-free alignments, the MPI version supports insertion-deletion models, as the sequential AliSim does.
AliSim-HPC is open-source and available as part of IQ-TREE v2.2.3 at https://github.com/iqtree/iqtree2/releases, with a user manual at http://www.iqtree.org/doc/AliSim.
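
The reported figures (a 153-fold speedup on 256 cores, 11 min of parallel runtime, 100 alignments, six nodes) can be sanity-checked with a few lines of arithmetic. The following is a minimal sketch, not part of AliSim-HPC: the function names are hypothetical, and the even split of alignments across six workers (one per node) is an assumption made for illustration only, since the abstract does not describe how work is actually distributed.

```python
from typing import List


def parallel_efficiency(speedup: float, cores: int) -> float:
    """Fraction of ideal linear speedup achieved (1.0 would be perfect scaling)."""
    return speedup / cores


def implied_sequential_minutes(parallel_minutes: float, speedup: float) -> float:
    """Sequential runtime implied by a parallel runtime and a speedup factor."""
    return parallel_minutes * speedup


def split_evenly(n_items: int, n_workers: int) -> List[range]:
    """Block-distribute item indices so that worker loads differ by at most one.

    Assumption for illustration: one worker per computing node.
    """
    base, extra = divmod(n_items, n_workers)
    chunks, start = [], 0
    for worker in range(n_workers):
        size = base + (1 if worker < extra else 0)
        chunks.append(range(start, start + size))
        start += size
    return chunks


if __name__ == "__main__":
    # Figures taken from the abstract: 153-fold speedup, 256 cores, 11 min.
    print(f"efficiency: {parallel_efficiency(153, 256):.3f}")
    hours = implied_sequential_minutes(11, 153) / 60
    print(f"implied sequential runtime: {hours:.2f} h")
    # Hypothetical even split of the 100 alignments over six workers.
    print([len(chunk) for chunk in split_evenly(100, 6)])
```

Under these assumptions the implied sequential runtime is about 28 hours, which agrees with the "over one day" stated in the abstract, and the parallel efficiency is roughly 60% of ideal linear scaling.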