Trachana Kalliopi, Forslund Kristoffer, Larsson Tomas, Powell Sean, Doerks Tobias, von Mering Christian, Bork Peer
Institute for Systems Biology, Seattle, WA, United States of America.
Structural and Computational Biology Unit, European Molecular Biology Laboratory, Heidelberg, Germany.
PLoS One. 2014 Nov 4;9(11):e111122. doi: 10.1371/journal.pone.0111122. eCollection 2014.
Accurate orthology prediction is crucial for many applications in the post-genomic era. The lack of broadly accepted benchmark tests precludes a comprehensive analysis of orthology inference. So far, conservation of functional annotation between orthologs has served as a performance proxy. However, this violates the fundamental principle of orthology as an evolutionary definition, and it is often not applicable because experimental evidence is limited for most species. Therefore, we constructed high-quality "gold standard" orthologous groups that can serve as a benchmark set for orthology inference in bacterial species. Herein, we use this dataset to demonstrate 1) why a manually curated, phylogeny-based dataset is more appropriate for benchmarking orthology than other popular practices and 2) how it guides database design and parameterization through careful error quantification. More specifically, we illustrate how function-based tests often fail to identify false assignments and thereby misjudge the true performance of orthology inference methods. We also examine how our dataset can inform the selection of a "core" species repertoire to improve detection accuracy. We conclude that including more genomes at the proper evolutionary distances can influence the overall quality of orthology detection. The curated gene families, called Reference Orthologous Groups, are publicly available at http://eggnog.embl.de/orthobench2.
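The abstract does not spell out how errors are quantified against a Reference Orthologous Group. As a minimal sketch of one plausible approach, the Python below compares a predicted orthologous group with a curated reference group. It counts genes wrongly included (false assignments) and genes missed. The function name `group_errors` and the gene identifiers are hypothetical and chosen only for illustration; they are not taken from the paper or from the published dataset.

```python
def group_errors(reference, predicted):
    """Compare one predicted orthologous group against one reference group.

    Both arguments are iterables of gene identifiers. This is an
    illustrative set-based comparison, not the authors' exact procedure.
    """
    reference = set(reference)
    predicted = set(predicted)

    correct = reference & predicted          # genes assigned correctly
    false_assignments = predicted - reference  # genes wrongly included
    missing = reference - predicted          # genes the method failed to find

    precision = len(correct) / len(predicted) if predicted else 0.0
    recall = len(correct) / len(reference) if reference else 0.0

    return {
        "false_assignments": false_assignments,
        "missing": missing,
        "precision": precision,
        "recall": recall,
    }


# Hypothetical gene identifiers, for illustration only.
reference_group = {"ecoA", "bsuA", "mtuA", "hpyA"}
predicted_group = {"ecoA", "bsuA", "ecoB"}

result = group_errors(reference_group, predicted_group)
print(sorted(result["false_assignments"]))
print(sorted(result["missing"]))
```

A comparison of this kind depends only on group membership, so it can flag a false assignment even when the wrongly included gene shares a functional annotation with the true orthologs, which is the case a function-based test would tend to miss.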