National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA.
Brief Bioinform. 2011 Sep;12(5):379-91. doi: 10.1093/bib/bbr030. Epub 2011 Jun 19.
Accurate inference of orthologous genes is a prerequisite for most comparative genomics studies, and is also important for functional annotation of new genomes. Identification of orthologous gene sets typically involves phylogenetic tree analysis, heuristic algorithms based on sequence conservation, synteny analysis, or some combination of these approaches. The most direct tree-based methods typically rely on the comparison of an individual gene tree with a species tree. Once the two trees are accurately constructed, orthologs can be identified directly from the definition of orthology: they are those homologs that are related by speciation, rather than gene duplication, at their most recent point of origin. Although ideal in principle for orthology identification, phylogenetic trees are computationally expensive to construct for large numbers of genes and genomes, and they often contain errors, especially at large evolutionary distances. Moreover, in many organisms, in particular prokaryotes and viruses, evolution does not appear to have followed a simple 'tree-like' mode, which makes conventional tree reconciliation inapplicable. Other, heuristic methods identify probable orthologs as the closest homologous pairs or groups of genes in a set of organisms. These approaches are faster and easier to automate than tree-based methods, with efficient implementations provided by graph-theoretical algorithms enabling comparisons of thousands of genomes. Comparisons of these two approaches show that, despite conceptual differences, they produce similar sets of orthologs, especially at short evolutionary distances. Synteny can also aid in the identification of orthologs. Often, tree-based, sequence similarity-based, and synteny-based approaches can be combined into flexible hybrid methods.
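The heuristic idea of taking "the closest homologous pairs" as probable orthologs can be illustrated with a reciprocal best hit search between genomes. The Python sketch below is only a minimal illustration under stated assumptions, not a method taken from the article: it assumes pairwise similarity scores (for example, alignment bit scores) have already been computed, it represents each gene as a hypothetical (genome, gene_id) tuple, and the function names best_hits and reciprocal_best_hits are invented for this example. Ties are resolved in favor of the first hit encountered.

```python
def best_hits(scores):
    """Return the highest-scoring hit of each query gene in each other genome.

    scores maps (query_gene, target_gene) -> similarity score, where each
    gene is a (genome, gene_id) tuple. The result maps
    (query_gene, target_genome) -> best target_gene.
    """
    best = {}
    for (query, target), score in scores.items():
        if query[0] == target[0]:
            continue  # ignore hits within the same genome
        key = (query, target[0])
        # Keep the first hit seen on ties (strictly greater score replaces).
        if key not in best or score > best[key][1]:
            best[key] = (target, score)
    return {key: hit for key, (hit, _) in best.items()}


def reciprocal_best_hits(scores):
    """Return gene pairs that are each other's best hit across two genomes."""
    best = best_hits(scores)
    pairs = set()
    for (query, _target_genome), target in best.items():
        # The pair is kept only if the target's best hit in the query's
        # genome points back to the query.
        if best.get((target, query[0])) == query:
            pairs.add(frozenset((query, target)))
    return pairs


# Toy, made-up scores for two genomes "A" and "B".
toy_scores = {
    (("A", "a1"), ("B", "b1")): 200,
    (("A", "a1"), ("B", "b2")): 150,
    (("B", "b1"), ("A", "a1")): 198,
    (("B", "b2"), ("A", "a1")): 149,
    (("A", "a2"), ("B", "b2")): 90,
    (("B", "b1"), ("A", "a2")): 50,
}

for pair in reciprocal_best_hits(toy_scores):
    print(sorted(pair))
```

In the toy data, only a1 and b1 choose each other, so they are reported as a probable ortholog pair. Genes b2 and a2 each have a best hit that does not point back to them, which is the kind of one-sided relationship that gene duplication or gene loss can produce. Extending such pairs into groups across many genomes would require a graph-based clustering step that this sketch does not attempt.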