Max Planck Institute of Molecular Cell Biology and Genetics, Dresden 01307, Germany.
Max Planck Institute for the Physics of Complex Systems, Dresden 01187, Germany.
Bioinformatics. 2017 Dec 15;33(24):3985-3987. doi: 10.1093/bioinformatics/btx527.
Homology-based gene prediction is a powerful approach for annotating newly sequenced genomes. We have previously demonstrated that whole-genome alignments can be used for accurate comparative annotation of coding genes.
Here we present CESAR 2.0, which uses genome alignments to transfer coding gene annotations from one reference genome to many other aligned genomes. We show that CESAR 2.0 is 77 times faster and requires 31 times less memory than its predecessor. CESAR 2.0 substantially improves the ability to align splice sites that have shifted over larger distances, allowing precise identification of exon boundaries in the aligned genome. Finally, CESAR 2.0 supports entire genes, which enables the annotation of joined exons that arose by complete intron deletions. CESAR 2.0 can readily be applied to new genome alignments to annotate coding genes in many other genomes with improved accuracy and without requiring large computational resources.
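To make the notion of a shifted splice site concrete, the following minimal Python sketch searches for the canonical donor dinucleotide (GT) nearest to an exon end projected from the reference. It is a toy illustration only and not CESAR 2.0's actual algorithm; the function name `nearest_donor_site`, the `max_shift` window, and the example sequence are all hypothetical.

```python
def nearest_donor_site(seq, projected_end, max_shift=30):
    """Return the exon end (exclusive index) closest to projected_end
    that is immediately followed by a canonical 'GT' donor dinucleotide.

    Hypothetical helper for illustration; returns None if no 'GT' is
    found within max_shift bases on either side of projected_end.
    """
    seq = seq.upper()
    for shift in range(max_shift + 1):
        # Check the upstream candidate first, then the downstream one.
        for end in (projected_end - shift, projected_end + shift):
            if 0 <= end <= len(seq) - 2 and seq[end:end + 2] == "GT":
                return end
    return None


if __name__ == "__main__":
    # Made-up query sequence with 'GT' at positions 9 and 13.
    query = "ATGGCCAAAGTAAGT"
    print(nearest_donor_site(query, projected_end=8))
```

A real realigner also has to consider acceptor sites, the reading frame, and the similarity of the encoded protein sequence, which a nearest-match search like this ignores.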
Source code is freely available at https://github.com/hillerlab/CESAR2.0.
Supplementary data are available at Bioinformatics online.