Joint Georgia Tech and Emory Wallace H. Coulter Department of Biomedical Engineering, Atlanta, GA 30332, USA, Department of Bioengineering, University of Illinois at Urbana-Champaign, IL 61801, USA, Institute for Genomic Biology, University of Illinois at Urbana-Champaign, IL 61801, USA, School of Computational Science & Engineering, Georgia Tech, Atlanta, GA 30332, USA and Department of Bioinformatics, Moscow Institute of Physics and Technology, Moscow, 141700, Russia.
Nucleic Acids Res. 2014 Feb;42(4):e25. doi: 10.1093/nar/gkt1141. Epub 2013 Nov 19.
Accurate mapping of spliced RNA-Seq reads to genomic DNA is known to be a challenging problem. Despite significant efforts invested in developing efficient algorithms, with the human genome as the primary focus, the best solution is still not known. A recently introduced tool, TrueSight, has demonstrated better performance than earlier algorithms such as TopHat and MapSplice. To improve detection of splice junctions, TrueSight uses information on statistical patterns of nucleotide ordering in intronic and exonic DNA. This line of research led to another new algorithm, UnSplicer, designed for eukaryotic species with compact genomes, where functional alternative splicing is likely to be dominated by splicing noise. Genome-specific parameters of the new algorithm are generated by GeneMark-ES, an ab initio gene prediction algorithm based on unsupervised training. UnSplicer shares several components with TrueSight; the differences lie in the training strategy and the classification algorithm. We tested UnSplicer on RNA-Seq data sets of Arabidopsis thaliana, Caenorhabditis elegans, Cryptococcus neoformans and Drosophila melanogaster. The splice junctions inferred by UnSplicer agree better with the knowledge accumulated on these well-studied genomes than do the predictions of earlier tools.