School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, GA, USA.
Bioinformatics. 2009 Nov 15;25(22):2897-905. doi: 10.1093/bioinformatics/btp537. Epub 2009 Sep 10.
The computational identification of non-coding RNA (ncRNA) genes represents one of the most important and challenging problems in computational biology. Existing methods for ncRNA gene prediction rely mostly on homology information, thus limiting their applications to ncRNA genes with known homologues.
We present a novel de novo prediction algorithm for ncRNA genes using features derived from the sequences and structures of known ncRNA genes in comparison to decoys. Using these features, we have trained a neural network-based classifier and have applied it to Escherichia coli and Sulfolobus solfataricus for genome-wide prediction of ncRNAs. Our method has an average prediction sensitivity and specificity of 68% and 70%, respectively, for identifying windows with potential for ncRNA genes in E. coli. By combining windows of different sizes and using positional filtering strategies, we predicted 601 candidate ncRNAs and recovered 41% of known ncRNAs in E. coli. We experimentally investigated six novel candidates using Northern blot analysis and found expression of three candidates: one represents a potential new ncRNA, one is associated with stable mRNA decay intermediates and one is either a potential riboswitch or a transcription attenuator involved in the regulation of cell division. In general, our approach enables the identification of both cis- and trans-acting ncRNAs in partially or completely sequenced microbial genomes without requiring homology or structural conservation.
The source code and results are available at http://csbl.bmb.uga.edu/publications/materials/tran/.
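The window-based evaluation summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' released code, and it leaves out the feature extraction and the neural network entirely. All function names, window sizes, and coordinates below are hypothetical. The abstract does not define sensitivity and specificity, so the sketch assumes the standard definitions TP/(TP+FN) and TN/(TN+FP); some gene-prediction work instead reports TP/(TP+FP) as specificity.

```python
from typing import Iterator, List, Tuple

Window = Tuple[int, int]  # half-open [start, end) coordinates (assumed convention)


def sliding_windows(genome_length: int, size: int, step: int) -> Iterator[Window]:
    """Yield fixed-size windows that scan a linear sequence with the given step."""
    for start in range(0, genome_length - size + 1, step):
        yield (start, start + size)


def merge_windows(windows: List[Window]) -> List[Window]:
    """Merge overlapping or touching windows into candidate regions.

    This is one simple way to combine positive windows of different sizes;
    the paper's positional filtering strategy is not reproduced here.
    """
    merged: List[Window] = []
    for start, end in sorted(windows):
        if merged and start <= merged[-1][1]:
            # Extend the last region when the new window overlaps or touches it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def sensitivity_specificity(tp: int, fn: int, tn: int, fp: int) -> Tuple[float, float]:
    """Return (TP / (TP + FN), TN / (TN + FP)) under the standard definitions."""
    return tp / (tp + fn), tn / (tn + fp)


if __name__ == "__main__":
    # Hypothetical positive windows from two window sizes (80 and 60 nt).
    positives = [(0, 80), (40, 120), (300, 380), (200, 260)]
    print(merge_windows(positives))
    # Hypothetical confusion-matrix counts chosen only to illustrate the formulas.
    print(sensitivity_specificity(tp=68, fn=32, tn=70, fp=30))
```

Merging after classification keeps each window's prediction independent of its neighbours, so windows of several sizes can be scored separately and combined in a single pass over the sorted intervals.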