Data-specific substitution models improve protein-based phylogenetics.

Affiliations

Centro de Ciências do Mar, Universidade do Algarve, Faro, Algarve, Portugal.

Department of Life Sciences, Natural History Museum, London, United Kingdom.

Publication information

PeerJ. 2023 Aug 8;11:e15716. doi: 10.7717/peerj.15716. eCollection 2023.

Abstract

Calculating amino-acid substitution models that are specific to individual protein data sets is often difficult because of the computational burden of estimating large numbers of rate parameters. In this study, we tested the computational efficiency and accuracy of five methods used to estimate substitution models, namely Codeml, FastMG, IQ-TREE, P4 (maximum likelihood), and P4 (Bayesian inference). Data-specific substitution models were estimated from simulated alignments of different lengths that were generated from a known simulation model and simulation tree. Each resulting data-specific substitution model was used to calculate the maximum likelihood score of the simulation tree on the simulated data from which the model was estimated, and this score was compared with the maximum likelihood score of the known simulation model and simulation tree on the same simulated data. The commonly used empirical models cpREV and WAG were assessed in the same way. Data-specific models performed better than the empirical models, which under-fitted the simulated alignments, differed most from the simulation model in maximum likelihood score, clustered further from the simulation model in principal component analysis ordination, and inferred less accurate trees. Data-specific models and the simulation model had statistically indistinguishable maximum likelihood scores, indicating that the five methods were reasonably accurate at estimating substitution models by this measure. Nevertheless, tree statistics showed differences between optimal maximum likelihood trees. Unlike the other model-estimating methods, trees inferred using data-specific models generated with IQ-TREE and P4 (maximum likelihood) were not significantly different from the trees derived from the simulation model in each analysis, indicating that these two methods alone were the most accurate at estimating data-specific models.
To show the benefits of using data-specific protein models, several published data sets were reanalysed using IQ-TREE-estimated models. These newly estimated models were a better fit to the data than the empirical models used by the original authors, often inferred longer trees, and resulted in different tree topologies in more than half of the reanalysed data sets. The results of this study show that software availability and high computational burden are not limitations to generating better-fitting data-specific amino-acid substitution models for phylogenetic analyses.
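
To make concrete what "large numbers of rate parameters" means, the following is a minimal sketch, assuming the data-specific models are general time-reversible amino-acid models parameterised by symmetric exchangeabilities and equilibrium frequencies (the abstract does not spell out the parameterisation). It counts the free parameters and builds a normalised rate matrix. The function names and the random placeholder values are hypothetical and are not taken from the study or from any of the five programs.

```python
import random

N_STATES = 20  # the 20 amino acids


def free_parameter_count(n_states=N_STATES):
    """Free parameters of a general time-reversible model (assumed form)."""
    # One exchangeability is fixed by the overall rate normalisation.
    exchangeabilities = n_states * (n_states - 1) // 2 - 1
    # Frequencies sum to 1, so one of them is determined by the others.
    frequencies = n_states - 1
    return exchangeabilities, frequencies


def build_rate_matrix(exch, freqs):
    """Build Q with q_ij = s_ij * pi_j for i != j and rows summing to zero,
    scaled so that the expected rate is one substitution per unit time."""
    n = len(freqs)
    q = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                q[i][j] = exch[i][j] * freqs[j]
        q[i][i] = -sum(q[i])  # diagonal is still 0.0 when the row is summed
    rate = -sum(freqs[i] * q[i][i] for i in range(n))
    return [[x / rate for x in row] for row in q]


def random_placeholder_model(n_states=N_STATES, seed=0):
    """Placeholder values only: NOT an estimated or empirical model."""
    rng = random.Random(seed)
    exch = [[0.0] * n_states for _ in range(n_states)]
    for i in range(n_states):
        for j in range(i + 1, n_states):
            exch[i][j] = exch[j][i] = rng.uniform(0.1, 5.0)
    raw = [rng.uniform(0.5, 1.5) for _ in range(n_states)]
    total = sum(raw)
    freqs = [x / total for x in raw]
    return exch, freqs


if __name__ == "__main__":
    n_exch, n_freq = free_parameter_count()
    print(f"free exchangeabilities: {n_exch}, free frequencies: {n_freq}")
    exch, freqs = random_placeholder_model()
    q = build_rate_matrix(exch, freqs)
    print(f"largest absolute row sum: {max(abs(sum(row)) for row in q):.2e}")
```

Under this assumed parameterisation, a data-specific model has 189 free exchangeabilities plus 19 free frequencies, whereas an empirical model such as WAG or cpREV uses fixed exchangeabilities and estimates none of them from the alignment.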

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/b232/10416777/5a5eb4e38dd1/peerj-11-15716-g001.jpg
