Thompson Julie D, Prigent Véronique, Poch Olivier
Laboratoire de Biologie et Genomique Structurales, Institut de Génétique et de Biologie Moléculaire et Cellulaire, CNRS/INSERM/ULP, BP 163, 67404 Illkirch Cedex, France.
Nucleic Acids Res. 2004 Feb 24;32(4):1298-307. doi: 10.1093/nar/gkh294. Print 2004.
Sequence alignments are fundamental to a wide range of applications, including database searching, functional residue identification and structure prediction techniques. These applications predict or propagate structural, functional or evolutionary information based on a presumed homology between the aligned sequences. If the initial hypothesis of homology is wrong, no subsequent application, however sophisticated, can be expected to yield accurate results. Here we present a novel method, LEON, to predict homology between proteins based on a multiple alignment of complete sequences (MACS). In a MACS, weak signals from distantly related proteins can be considered in the overall context of the family. Intermediate sequences and the combination of individual weak matches are used to increase the significance of low-scoring regions. Residue composition is also taken into account by incorporating several existing methods for the detection of compositionally biased sequence segments. The accuracy and reliability of the predictions are demonstrated in large-scale comparisons with structural and sequence family databases, where the specificity was shown to be >99% and the sensitivity was estimated to be approximately 76%. LEON can thus be used to reliably identify the complex relationships between large multidomain proteins and should be useful for automatic high-throughput genome annotation, 2D/3D structure prediction and protein-protein interaction prediction, among other applications.
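As a reading aid for the specificity and sensitivity figures quoted above, the following is a minimal sketch that assumes the usual confusion-matrix definitions: sensitivity = TP / (TP + FN) and specificity = TN / (TN + FP). The abstract does not state which definitions were used, so this may differ from the paper's own calculation. The function names and the counts are hypothetical, chosen only for illustration; they are not data from the study.

```python
def sensitivity(tp: int, fn: int) -> float:
    """Fraction of truly homologous cases that were predicted homologous."""
    if tp + fn == 0:
        raise ValueError("no positive cases: sensitivity is undefined")
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """Fraction of truly non-homologous cases that were predicted non-homologous."""
    if tn + fp == 0:
        raise ValueError("no negative cases: specificity is undefined")
    return tn / (tn + fp)


if __name__ == "__main__":
    # Hypothetical counts for illustration only (NOT results from the paper).
    tp, fn = 76, 24   # homologous cases: detected / missed
    tn, fp = 995, 5   # non-homologous cases: correctly rejected / wrongly accepted
    print(f"sensitivity = {sensitivity(tp, fn):.3f}")
    print(f"specificity = {specificity(tn, fp):.3f}")
```

Under these definitions, a method can reach very high specificity while still missing a sizeable share of true relationships, which is the trade-off the reported figures describe.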