Department of Biology, Pennsylvania State University, University Park, PA 16802, USA.
BMC Evol Biol. 2014 Mar 29;14:67. doi: 10.1186/1471-2148-14-67.
As it becomes increasingly possible to obtain DNA sequences of orthologous genes from diverse sets of taxa, species trees are frequently being inferred from multilocus data. However, the behavior of many methods for performing this inference has remained largely unexplored. Some methods have been proven to be consistent given certain evolutionary models, whereas others rely on criteria that, although appropriate for many parameter values, have peculiar zones of the parameter space in which they fail to converge on the correct estimate as data sets increase in size.
Here, using North American pines, we empirically evaluate the behavior of 24 strategies for species tree inference using three alternative outgroups (72 strategies total). The data consist of 120 individuals sampled in eight ingroup species from subsection Strobus and three outgroup species from subsection Gerardianae, spanning ∼47 kilobases of sequence at 121 loci. Each "strategy" for inferring species trees consists of three features: a species tree construction method, a gene tree inference method, and a choice of outgroup. We use multivariate analysis techniques such as principal components analysis and hierarchical clustering to identify tree characteristics that are robustly observed across strategies, as well as to identify groups of strategies that produce trees with similar features. We find that strategies that construct species trees using only topological information cluster together and that strategies that use additional non-topological information (e.g., branch lengths) also cluster together. Strategies that utilize more than one individual within a species to infer gene trees tend to produce estimates of species trees that contain clades present in trees estimated by other strategies. Strategies that use the minimize-deep-coalescences criterion to construct species trees tend to produce species tree estimates that contain clades that are not present in trees estimated by the Concatenation, RTC, SMRT, STAR, and STEAC methods, and that in general are more balanced than those inferred by these other strategies.
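The idea of comparing strategies by the clades their species trees contain can be illustrated with a minimal sketch. It is not the study's pipeline: the taxon labels (A to E), the strategy names, and the trees are hypothetical toy data, and Jaccard distance on clade sets with average-linkage clustering is an assumed stand-in for the principal components analysis and hierarchical clustering described above.

```python
from itertools import combinations


def clades(*groups):
    """Build a tree's clade set from strings of single-letter taxon labels."""
    return frozenset(frozenset(g) for g in groups)


# Hypothetical toy input: strategy name -> clades of its estimated species tree.
# These names and trees are placeholders, not data from the study.
trees = {
    "topo_1": clades("AB", "ABC", "DE"),
    "topo_2": clades("AB", "ABC", "DE"),
    "blen_1": clades("AB", "CD", "ABE"),
    "blen_2": clades("AB", "CD", "CDE"),
}


def jaccard_distance(x, y):
    """1 - |shared clades| / |all clades| for two clade sets."""
    union = x | y
    return 1.0 - len(x & y) / len(union) if union else 0.0


def clade_support(trees):
    """Fraction of strategies whose tree contains each clade."""
    counts = {}
    for tree in trees.values():
        for clade in tree:
            counts[clade] = counts.get(clade, 0) + 1
    return {clade: n / len(trees) for clade, n in counts.items()}


def average_linkage(trees, n_clusters):
    """Agglomerative clustering of strategies, stopping at n_clusters groups."""
    clusters = [[name] for name in sorted(trees)]

    def dist(a, b):
        # Mean pairwise Jaccard distance between members of two clusters.
        pairs = [(i, j) for i in a for j in b]
        return sum(jaccard_distance(trees[i], trees[j]) for i, j in pairs) / len(pairs)

    while len(clusters) > n_clusters:
        a, b = min(combinations(clusters, 2), key=lambda p: dist(*p))
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(sorted(a + b))
    return sorted(clusters)


if __name__ == "__main__":
    print(average_linkage(trees, 2))
    robust = [sorted(c) for c, s in clade_support(trees).items() if s == 1.0]
    print(robust)  # clades found by every strategy in the toy data
```

In this toy input, the two "topo" strategies return identical trees and the two "blen" strategies share two of their three clades, so the clustering groups them accordingly, while a clade found by every strategy is the kind of feature the text calls robustly observed.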
For researchers constructing a species tree from a multilocus set of sequences, our observations provide a basis for interpreting differences among species tree estimates obtained via approaches that share a two-stage structure: one step for gene tree estimation and a second step for species tree estimation. The methods explored here employ a number of distinct features of the data, and our analysis suggests that recovery of the same results from multiple methods that tend to differ in their patterns of inference can be a valuable tool for obtaining reliable estimates.