Mixture models of nucleotide sequence evolution that account for heterogeneity in the substitution process across sites and across lineages.

Author information

School of Biomedical Sciences, Queensland University of Technology, Brisbane, QLD 4000, Australia; School of Mathematics and Statistics, University of Sydney, Sydney, NSW 2006, Australia; CSIRO Ecosystem Sciences, Canberra, ACT 2601, Australia; and Centre for Mathematical Biology, University of Sydney, Sydney, NSW 2006, Australia.

Publication information

Syst Biol. 2014 Sep;63(5):726-42. doi: 10.1093/sysbio/syu036. Epub 2014 Jun 12.

Abstract

Molecular phylogenetic studies of homologous sequences of nucleotides often assume that the underlying evolutionary process was globally stationary, reversible, and homogeneous (SRH), and that a model of evolution with one or more site-specific and time-reversible rate matrices (e.g., the GTR rate matrix) is enough to accurately model the evolution of data over the whole tree. However, an increasing body of data suggests that evolution under these conditions is an exception, rather than the norm. To address this issue, several non-SRH models of molecular evolution have been proposed, but they either ignore heterogeneity in the substitution process across sites (HAS) or assume it can be modeled accurately using the Γ distribution. As an alternative to these models of evolution, we introduce a family of mixture models that approximate HAS without the assumption of an underlying predefined statistical distribution. This family of mixture models is combined with non-SRH models of evolution that account for heterogeneity in the substitution process across lineages (HAL). We also present two algorithms for searching model space and identifying an optimal model of evolution that is less likely to over- or underparameterize the data. The performance of the two new algorithms was evaluated using alignments of nucleotides with 10 000 sites simulated under complex non-SRH conditions on a 25-tipped tree. The algorithms were found to be very successful, identifying the correct HAL model with a 75% success rate (the average success rate for assigning rate matrices to the tree's 48 edges was 99.25%) and, for the correct HAL model, identifying the correct HAS model with a 98% success rate. Finally, parameter estimates obtained under the correct HAL-HAS model were found to be accurate and precise.
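
The mixture idea described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation. It assumes that the likelihood of one alignment column, conditional on each site class (for example an invariable class plus several variable classes, each with its own rate matrices), has already been computed elsewhere. It then combines those values using free mixing weights instead of weights tied to a predefined distribution. The function names and every number below are hypothetical.

```python
import math


def site_log_likelihood(weights, class_likelihoods):
    """Log-likelihood of one alignment column under a finite mixture.

    weights: mixing proportions, one per site class; they must sum to 1.
    class_likelihoods: likelihood of the column conditional on each class
        (assumed to be computed elsewhere, e.g. by a tree-pruning routine).
    """
    if len(weights) != len(class_likelihoods):
        raise ValueError("need exactly one likelihood per site class")
    if not math.isclose(sum(weights), 1.0, abs_tol=1e-9):
        raise ValueError("mixing weights must sum to 1")
    total = sum(w * l for w, l in zip(weights, class_likelihoods))
    if total <= 0.0:
        raise ValueError("column has zero likelihood under every class")
    return math.log(total)


def class_posteriors(weights, class_likelihoods):
    """Posterior probability that the column belongs to each site class."""
    joint = [w * l for w, l in zip(weights, class_likelihoods)]
    total = sum(joint)
    if total <= 0.0:
        raise ValueError("column has zero likelihood under every class")
    return [j / total for j in joint]


# Hypothetical weights for an invariable class and two variable classes.
weights = [0.5, 0.4, 0.1]

# A constant column can be explained by any class.
print(site_log_likelihood(weights, [0.25, 0.02, 0.005]))

# A column with at least one difference has zero likelihood under the
# invariable class, so its posterior mass falls on the variable classes.
print(class_posteriors(weights, [0.0, 0.02, 0.005]))
```

Because the weights are ordinary parameters, adding or removing a site class changes the number of free parameters directly, which is what a model-selection search over such mixtures would have to trade off against fit.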
The merits of our new algorithms were illustrated with an analysis of 42 337 second codon sites extracted from a concatenation of 106 alignments of orthologous genes encoded by the nuclear genomes of Saccharomyces cerevisiae, S. paradoxus, S. mikatae, S. kudriavzevii, S. castellii, S. kluyveri, S. bayanus, and Candida albicans. Our results show that second codon sites in the ancestral genome of these species contained 49.1% invariable sites, 39.6% variable sites belonging to one rate category (V1), and 11.3% variable sites belonging to a second rate category (V2). The ancestral nucleotide content was found to differ markedly across these three sets of sites, and the evolutionary processes operating at the variable sites were found to be non-SRH and best modeled by a combination of eight edge-specific rate matrices (four for V1 and four for V2). The number of substitutions per site at the variable sites also differed markedly, with sites belonging to V1 evolving slower than those belonging to V2 along the lineages separating the seven species of Saccharomyces. Finally, sites belonging to V1 appeared to have ceased evolving along the lineages separating S. cerevisiae, S. paradoxus, S. mikatae, S. kudriavzevii, and S. bayanus, implying that they might have become so selectively constrained that they could be considered invariable sites in these species.
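
As a rough check on the proportions reported above, the three site classes can be converted into approximate site counts. The sketch below assumes that the estimated class proportions can be multiplied by the number of analysed sites to give expected counts. That is an interpretation for illustration only, and the function name is hypothetical.

```python
TOTAL_SITES = 42_337  # second codon sites analysed, as stated in the abstract

# Class proportions from the abstract, stored in tenths of a percent so
# that the sum can be checked with exact integer arithmetic.
CLASS_PER_MILLE = {"invariable": 491, "V1": 396, "V2": 113}


def expected_site_counts(total_sites, per_mille):
    """Expected number of sites per class, given proportions in per mille."""
    if sum(per_mille.values()) != 1000:
        raise ValueError("class proportions must sum to 100%")
    return {name: total_sites * share / 1000 for name, share in per_mille.items()}


counts = expected_site_counts(TOTAL_SITES, CLASS_PER_MILLE)
for name, count in counts.items():
    print(f"{name}: about {count:.0f} sites")
```

The three percentages sum to exactly 100%, and under this assumption they correspond to roughly 20,800 invariable sites, 16,800 sites in V1, and 4,800 sites in V2.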

