Institute of Computational Life Science, Zurich University of Applied Science, Wädenswil, Switzerland.
Faculty of Mathematics and Science, University of Zurich, Zürich, Switzerland.
Mol Biol Evol. 2024 Jul 3;41(7). doi: 10.1093/molbev/msae109.
Despite having important biological implications, insertion and deletion (indel) events are often disregarded or mishandled during phylogenetic inference. In multiple sequence alignment, indels are represented as gaps and are estimated without considering the distinct evolutionary history of insertions and deletions. Consequently, indels are usually excluded from subsequent inference steps, such as ancestral sequence reconstruction and phylogenetic tree search. Here, we introduce indel-aware parsimony (indelMaP), a novel way to treat gaps under the parsimony criterion by considering insertions and deletions as separate evolutionary events and accounting for long indels. By identifying the precise location of an evolutionary event on the tree, we can separate overlapping indel events and use affine gap penalties for long indel modeling. Our indel-aware approach harnesses the phylogenetic signal from indels, incorporating them into all inference stages. Validation and comparison to state-of-the-art inference tools on simulated data show that indelMaP is most suitable for densely sampled datasets with closely to moderately related sequences, where it can reach alignment quality comparable to probabilistic methods and accurately infer ancestral sequences, including indel patterns. Owing to its speed, our method is well suited for epidemiological datasets, eliminating the need for downsampling and enabling the exploitation of the additional information provided by dense taxonomic sampling. Moreover, indelMaP offers new insights into the indel patterns of biologically significant sequences and advances our understanding of genetic variability by treating gaps as crucial evolutionary signals rather than mere artefacts.
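As an illustration of the affine gap penalties mentioned above, the following minimal Python sketch scores the gap runs in a single aligned sequence. It is not the indelMaP implementation: the function names and the penalty values (`gap_open=2.5`, `gap_extend=0.5`) are assumptions chosen for the example, and the sketch does not place events on a tree or distinguish insertions from deletions. It only shows why an affine scheme charges one long indel less than several independent single-site gaps.

```python
def affine_gap_cost(length, gap_open=2.5, gap_extend=0.5):
    """Cost of one contiguous gap of `length` columns under an affine scheme.

    The first gapped column pays `gap_open`; each further column pays
    `gap_extend`. Both default values are hypothetical.
    """
    if length <= 0:
        return 0.0
    return gap_open + (length - 1) * gap_extend


def gap_runs(aligned_row, gap_char="-"):
    """Return the lengths of all contiguous gap runs in one aligned row."""
    runs = []
    current = 0
    for ch in aligned_row:
        if ch == gap_char:
            current += 1
        elif current:
            runs.append(current)
            current = 0
    if current:  # a run that reaches the end of the row
        runs.append(current)
    return runs


def row_gap_cost(aligned_row, gap_open=2.5, gap_extend=0.5):
    """Total affine gap cost of one aligned row (each run scored as one event)."""
    return sum(affine_gap_cost(n, gap_open, gap_extend) for n in gap_runs(aligned_row))


row = "AC---GT-A"
print(gap_runs(row))      # lengths of the two gap runs
print(row_gap_cost(row))  # affine cost of the row
```

Under these assumed penalties, the three-column gap costs 2.5 + 2 × 0.5 = 3.5, whereas scoring every gapped column independently at the opening cost would charge 7.5 for the same run.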