Ahmad Shandar, Sarai Akinori
Department of Bioinformatics and Bioscience, Kyushu Institute of Technology, Iizuka 820-8502, Fukuoka, Japan.
BMC Bioinformatics. 2005 Feb 19;6:33. doi: 10.1186/1471-2105-6-33.
Detection of DNA-binding sites in proteins is of enormous interest for technologies targeting gene regulation and manipulation. We have previously shown that information about a residue and its sequence neighbors can be used to predict candidate DNA-binding residues in a protein sequence. This sequence-based prediction method is applicable even when no sequence homology with a previously known DNA-binding protein is observed. Here we implement a neural-network-based algorithm that uses evolutionary information about amino acid sequences, in the form of their position-specific scoring matrices (PSSMs), to improve the prediction of DNA-binding sites.
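As a rough illustration of how a residue and its sequence neighbors might be encoded from a PSSM before being passed to a neural network, the sketch below builds one feature vector per residue from a sliding window of PSSM rows. The function name `window_features`, the window size, and the zero padding at the termini are assumptions made for this example; the abstract does not specify the encoding actually used.

```python
def window_features(pssm, half_window=2):
    """Build one flattened feature vector per residue from a PSSM.

    pssm: list of rows, one per residue, each holding 20 substitution scores.
    half_window: neighbors taken on each side (hypothetical value, not from
    the paper).
    """
    n_cols = len(pssm[0])
    zero_row = [0.0] * n_cols
    rows = [[float(score) for score in row] for row in pssm]
    # Zero-pad both termini so that edge residues still get a full window.
    padded = [zero_row] * half_window + rows + [zero_row] * half_window
    width = 2 * half_window + 1
    features = []
    for i in range(len(rows)):
        window = padded[i:i + width]
        # Concatenate the rows of the window into a single input vector.
        features.append([score for row in window for score in row])
    return features


# Toy 7-residue "PSSM" with made-up scores, for illustration only.
toy_pssm = [[(i + j) % 5 - 2 for j in range(20)] for i in range(7)]
features = window_features(toy_pssm, half_window=2)
print(len(features), len(features[0]))  # 7 100
```

Each vector could then serve as the input for one residue, with the label indicating whether the central residue binds DNA.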
The average of sensitivity and specificity obtained with PSSMs is up to 8.7% higher than that obtained with sequence information alone. Much smaller data sets can be used to generate the PSSMs with minimal loss of prediction accuracy.
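The score reported here, the average of sensitivity and specificity, can be sketched as follows. The definitions used (sensitivity = TP / (TP + FN), specificity = TN / (TN + FP)) are the common ones and are assumed rather than taken from the abstract; the function names and the toy labels are hypothetical.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels (1 = binding).

    Assumes both classes occur in y_true; otherwise a division by zero occurs.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


def mean_sens_spec(y_true, y_pred):
    """Average of sensitivity and specificity."""
    sens, spec = sensitivity_specificity(y_true, y_pred)
    return (sens + spec) / 2


# Toy labels: 4 binding residues and 8 non-binding residues.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1]
all_negative = [0] * len(y_true)

print(mean_sens_spec(y_true, y_pred))        # 0.75
print(mean_sens_spec(y_true, all_negative))  # 0.5
```

Binding residues are typically a minority class, so a predictor that labels every residue as non-binding reaches perfect specificity but only 0.5 on this average, which is one reason to prefer it over plain accuracy.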
One problem with PSSM-derived prediction is that it requires lengthy, time-consuming alignments against large sequence databases. To speed up PSSM generation, we tried different reference data sets (sequence spaces) against which a target protein is scanned in the PSI-BLAST iterations. We find that a very small set of proteins can serve as such a reference data set without losing much of the prediction value. This makes PSSM generation very rapid and even amenable to use at the genome level. A web server has been developed to provide these predictions of DNA-binding sites for any new protein from its amino acid sequence.
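For readers who want to try generating a PSSM against a small reference set, a minimal sketch using the `psiblast` program from NCBI BLAST+ is shown below. The abstract does not state which PSI-BLAST executable, options, or iteration count were used, so the tool choice, the iteration count, and all file and database names here are assumptions for illustration only.

```python
import shlex


def psiblast_command(query_fasta, reference_db, pssm_out, iterations=3):
    """Build an argument list for BLAST+ psiblast that writes an ASCII PSSM.

    reference_db is the name of a protein BLAST database built from the
    (small) reference set; iterations=3 is an assumed value.
    """
    return [
        "psiblast",
        "-query", query_fasta,
        "-db", reference_db,
        "-num_iterations", str(iterations),
        "-out_ascii_pssm", pssm_out,
    ]


# Hypothetical file and database names.
cmd = psiblast_command("target.fasta", "small_reference_db", "target.pssm")
print(shlex.join(cmd))
```

With BLAST+ installed and the database built, the list could be passed to `subprocess.run(cmd, check=True)`. Pointing `-db` at a smaller reference set is what shortens the search.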
Online predictions based on this method are available at http://www.netasa.org/dbs-pssm/