Bioinformatics and Genomics Programme, Centre de Regulació Genòmica (CRG), Universitat Pompeu Fabra, Dr. Aiguader, 88. 08003, Barcelona, Spain.
Nucleic Acids Res. 2011 Mar;39(5):e32. doi: 10.1093/nar/gkq953. Epub 2010 Dec 11.
Reliable prediction of orthology is central to comparative genomics. Approaches based on phylogenetic analyses closely resemble the original definition of orthology and paralogy and are known to be highly accurate. However, the large computational cost associated with these analyses is a limiting factor that often prevents their use at genomic scales. Recently, several projects have addressed the reconstruction of large collections of high-quality phylogenetic trees from which orthology and paralogy relationships can be inferred. This provides us with the opportunity to infer the evolutionary relationships of genes from multiple, independent phylogenetic trees. Using such a strategy, we combine phylogenetic information derived from different databases to predict orthology and paralogy relationships for 4.1 million proteins in 829 fully sequenced genomes. We show that the number of independent sources from which a prediction is made, as well as the level of consistency across predictions, can be used as reliable confidence scores. A web server has been developed to provide easy access to these data (http://orthology.phylomedb.org), giving users a global repository of phylogeny-based orthology and paralogy predictions.
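The two confidence measures named in the abstract, the number of independent sources and the consistency across their predictions, can be illustrated with a minimal sketch. This is not the authors' implementation. The record layout, the source labels (`db1`, `db2`, `db3`), the function name `consensus_scores`, and the definition of consistency as the fraction of sources that call a pair orthologous are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical record: (source database, gene A, gene B, predicted relation),
# where relation is either "orthology" or "paralogy".
Prediction = Tuple[str, str, str, str]


def consensus_scores(predictions: Iterable[Prediction]) -> Dict[Tuple[str, str], dict]:
    """Summarise, per gene pair, how many sources made a prediction and
    what fraction of them called the pair orthologous (assumed definition)."""
    votes: Dict[Tuple[str, str], Dict[str, str]] = defaultdict(dict)
    for source, gene_a, gene_b, relation in predictions:
        # Treat (A, B) and (B, A) as the same pair.
        pair = tuple(sorted((gene_a, gene_b)))
        # Simplifying assumption: one prediction per source per pair;
        # a repeated record from the same source overwrites the earlier one.
        votes[pair][source] = relation

    scores = {}
    for pair, by_source in votes.items():
        n_sources = len(by_source)
        n_orthology = sum(1 for rel in by_source.values() if rel == "orthology")
        scores[pair] = {
            "sources": n_sources,
            "consistency": n_orthology / n_sources,
        }
    return scores


# Toy input with made-up gene and source names.
example = [
    ("db1", "geneA", "geneB", "orthology"),
    ("db2", "geneB", "geneA", "orthology"),
    ("db3", "geneA", "geneB", "paralogy"),
    ("db1", "geneA", "geneC", "paralogy"),
]
scores = consensus_scores(example)
```

In this toy input, the pair (`geneA`, `geneB`) is covered by three sources, two of which agree on orthology, while (`geneA`, `geneC`) rests on a single source. A downstream filter could then require both a minimum number of sources and a minimum consistency before accepting a prediction.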