Reese M G, Eeckman F H, Kulp D, Haussler D
Human Genome Informatics Group, Lawrence Berkeley National Laboratory, Berkeley, California 94720, USA.
J Comput Biol. 1997 Fall;4(3):311-23. doi: 10.1089/cmb.1997.4.311.
We present an improved splice site predictor for the genefinding program Genie. Genie is based on a generalized Hidden Markov Model (GHMM) that describes the grammar of a legal parse of a multi-exon gene in a DNA sequence. In Genie, probabilities are estimated for gene features by using dynamic programming to combine information from multiple content and signal sensors, including sensors that integrate matches to homologous sequences from a database. One of the hardest problems in genefinding is to determine the complete gene structure correctly. The splice site sensors are the key signal sensors that address this problem. We replaced the existing splice site sensors in Genie with two novel neural networks based on dinucleotide frequencies. Using these novel sensors, Genie shows significant improvements in the sensitivity and specificity of gene structure identification. Experimental results in tests using a standard set of annotated genes showed that Genie identified 86% of coding nucleotides correctly with a specificity of 85%, versus 80% and 84% in the older system. In further splice site experiments, we also looked at correlations between splice site scores and intron and exon lengths, as well as at the effect of distance to the nearest splice site on false positive rates.
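The accuracy figures above can be made concrete with a small sketch. The abstract does not define its metrics, so this assumes the convention common in gene-finding evaluations: sensitivity is the fraction of annotated coding nucleotides that are predicted as coding, and specificity is the fraction of predicted coding nucleotides that are annotated as coding. The function name `nucleotide_accuracy` and the toy label vectors are hypothetical and are not taken from Genie.

```python
def nucleotide_accuracy(annotated, predicted):
    """Return (sensitivity, specificity) for per-base coding labels.

    Assumed definitions (not stated in the abstract):
      sensitivity = TP / (TP + FN)
      specificity = TP / (TP + FP)
    where a "positive" is a nucleotide labeled as coding.
    """
    if len(annotated) != len(predicted):
        raise ValueError("label sequences must have equal length")

    pairs = list(zip(annotated, predicted))
    tp = sum(1 for a, p in pairs if a and p)        # coding, predicted coding
    fn = sum(1 for a, p in pairs if a and not p)    # coding, missed
    fp = sum(1 for a, p in pairs if not a and p)    # non-coding, predicted coding

    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tp / (tp + fp) if (tp + fp) else 0.0
    return sensitivity, specificity


# Toy example: 1 = coding nucleotide, 0 = non-coding nucleotide.
annotated = [0, 1, 1, 1, 1, 0, 0, 1, 1, 0]
predicted = [0, 1, 1, 1, 0, 0, 1, 1, 1, 0]
sn, sp = nucleotide_accuracy(annotated, predicted)
print(f"sensitivity={sn:.3f} specificity={sp:.3f}")
```

Under these assumed definitions, a predictor that misses real exons loses sensitivity, while one that calls spurious exons loses specificity, which is why the two numbers are reported together.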