Mirarab S, Nguyen N, Warnow T
Department of Computer Science, University of Texas at Austin, Austin, TX 78712, USA.
Pac Symp Biocomput. 2012:247-58. doi: 10.1142/9789814366496_0024.
We address the problem of Phylogenetic Placement, in which the objective is to insert short molecular sequences (called query sequences) into an existing phylogenetic tree and alignment on full-length sequences for the same gene. Phylogenetic placement has the potential to provide information beyond pure "species identification" (i.e., the association of metagenomic reads to existing species), because it can also give information about the evolutionary relationships between these query sequences and known species. Approaches for phylogenetic placement have been developed that operate in two steps: first, each query sequence is aligned to the existing alignment of the full-length sequences, and then that extended alignment is used to find the optimal location in the phylogenetic tree for the query sequence. Recent methods of this type include HMMALIGN+EPA, HMMALIGN+pplacer, and PaPaRa+EPA. We report on a study evaluating phylogenetic placement methods on biological and simulated data. This study shows that these methods have extremely good accuracy and computational tractability under conditions where the input contains a highly accurate alignment and tree for the full-length sequences, and the set of full-length sequences is sufficiently small and not too evolutionarily diverse; however, we also show that under other conditions accuracy declines and the computational requirements for memory and time exceed acceptable limits. We present SEPP, a general "boosting" technique to improve the accuracy and/or speed of phylogenetic placement techniques. The key algorithmic aspect of this booster is a dataset decomposition technique in SATé, a method that utilizes an iterative divide-and-conquer technique to co-estimate alignments and trees on large molecular sequence datasets.
We show that SATé-boosting improves HMMALIGN+pplacer, placing short sequences more accurately when the set of input sequences has a large evolutionary diameter and producing placements of comparable accuracy in a fraction of the time for easier cases. SEPP software and the datasets used in this study are all available for free at http://www.cs.utexas.edu/users/phylo/software/sepp/submission.
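The abstract does not spell out how the dataset decomposition works. As a rough illustration only, the sketch below assumes a centroid-edge style split: the tree is cut at the edge that divides its leaves most evenly, and each side is split again until no subset holds more than `max_size` full-length sequences. The function names, the adjacency-dict tree representation, and the `max_size` parameter are hypothetical and are not taken from the SEPP or SATé software.

```python
def component(adj, start, blocked, allowed):
    """Nodes reachable from `start` inside `allowed` without crossing edge `blocked`."""
    seen = {start}
    stack = [start]
    while stack:
        u = stack.pop()
        for v in adj[u]:
            if v in allowed and v not in seen and {u, v} != blocked:
                seen.add(v)
                stack.append(v)
    return seen


def decompose(adj, leaves, max_size, nodes=None):
    """Recursively split an unrooted tree into leaf subsets of at most `max_size`.

    adj      -- dict mapping each node label to a list of neighbouring labels
    leaves   -- set of labels that are leaves of the ORIGINAL tree
    max_size -- hypothetical cap on the number of leaves per subset (>= 1)
    """
    if max_size < 1:
        raise ValueError("max_size must be at least 1")
    if nodes is None:
        nodes = set(adj)
    here = leaves & nodes
    if len(here) <= max_size:
        return [here]

    # Try every edge of the current subtree and keep the most balanced cut.
    # This is quadratic in the subtree size, which is fine for a sketch.
    best = None
    for u in sorted(nodes):
        for v in adj[u]:
            if v in nodes and u < v:
                side = component(adj, u, {u, v}, nodes)
                imbalance = abs(len(here) - 2 * len(leaves & side))
                if best is None or imbalance < best[0]:
                    best = (imbalance, side)

    side = best[1]
    return (decompose(adj, leaves, max_size, side)
            + decompose(adj, leaves, max_size, nodes - side))


# Toy tree: leaves a, b hang off internal node x; leaves c, d hang off y.
toy_adj = {
    "a": ["x"], "b": ["x"], "c": ["y"], "d": ["y"],
    "x": ["a", "b", "y"], "y": ["x", "c", "d"],
}
toy_subsets = decompose(toy_adj, {"a", "b", "c", "d"}, max_size=2)
```

Under this assumption, a placement pipeline would then run the two-step method (alignment, then placement) against one small subset per query sequence instead of the full reference tree. How SEPP chooses the subset for each query sequence is not described in the abstract and is left out here.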