Vanhoutreve Renaud, Kress Arnaud, Legrand Baptiste, Gass Hélène, Poch Olivier, Thompson Julie D
Department of Computer Science, ICube, UMR 7357, University of Strasbourg, CNRS, Fédération de médecine translationnelle de Strasbourg, Strasbourg, France.
BMC Bioinformatics. 2016 Jul 7;17(1):271. doi: 10.1186/s12859-016-1146-y.
A standard procedure in many areas of bioinformatics is to use a multiple sequence alignment (MSA) as the basis for various types of homology-based inference. Applications include 3D structure modelling, protein functional annotation, and prediction of molecular interactions. These applications, however sophisticated, are generally highly sensitive to the alignment used, and failing to account for non-homologous or uncertain regions in the alignment can lead to significant bias in the subsequent inferences.
Here, we present a new method, LEON-BIS, which uses a robust Bayesian framework to estimate the homologous relations between sequences in a protein multiple alignment. Sequences are clustered into subfamilies, and relations are predicted at different levels, including 'core blocks', 'regions' and full-length proteins. The accuracy and reliability of the predictions are demonstrated in large-scale comparisons using well-annotated alignment databases, where the homologous sequence segments are detected with very high sensitivity and specificity.
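To make the evaluation terms concrete, the following is a minimal sketch of how sensitivity and specificity could be computed for per-residue homology predictions against a reference annotation. The function name `sensitivity_specificity` and the boolean per-residue encoding are assumptions made for illustration; the abstract does not describe the evaluation protocol in this detail.

```python
import math


def sensitivity_specificity(predicted, reference):
    """Return (sensitivity, specificity) for per-residue homology labels.

    Hypothetical helper: both arguments are equal-length sequences of
    booleans, where True means "residue labelled as homologous".
    """
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")

    pairs = list(zip(predicted, reference))
    tp = sum(1 for p, r in pairs if p and r)          # homologous, predicted homologous
    tn = sum(1 for p, r in pairs if not p and not r)  # non-homologous, predicted non-homologous
    fp = sum(1 for p, r in pairs if p and not r)      # non-homologous, predicted homologous
    fn = sum(1 for p, r in pairs if not p and r)      # homologous, predicted non-homologous

    # Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP).
    # Return NaN when a class is absent from the reference.
    sensitivity = tp / (tp + fn) if (tp + fn) else math.nan
    specificity = tn / (tn + fp) if (tn + fp) else math.nan
    return sensitivity, specificity


# Toy example (made-up labels, not data from the paper).
predicted = [True, True, False, False, True]
reference = [True, False, False, False, True]
sens, spec = sensitivity_specificity(predicted, reference)
```

In this toy example every truly homologous residue is recovered, while one of the three non-homologous residues is wrongly predicted as homologous.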
LEON-BIS uses robust Bayesian statistics to distinguish the portions of multiple sequence alignments that are conserved either across the whole family or within subfamilies. LEON-BIS should thus be useful for automatic, high-throughput genome annotation, 2D/3D structure prediction, and protein-protein interaction prediction.
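As a generic illustration of the kind of Bayesian reasoning mentioned here, the sketch below applies Bayes' rule to a two-hypothesis problem: is an aligned segment homologous or not? This is not the LEON-BIS model, whose likelihoods and priors are not given in the abstract; the function name `posterior_homologous`, its parameters, and the numbers used are all assumptions for illustration.

```python
def posterior_homologous(lik_homologous, lik_non_homologous, prior_homologous=0.5):
    """Posterior probability that a segment is homologous, by Bayes' rule.

    Hypothetical helper, not part of LEON-BIS:
      lik_homologous      P(observed segment | homologous)
      lik_non_homologous  P(observed segment | not homologous)
      prior_homologous    P(homologous) before seeing the segment
    """
    if not 0.0 <= prior_homologous <= 1.0:
        raise ValueError("prior_homologous must lie in [0, 1]")

    numerator = lik_homologous * prior_homologous
    evidence = numerator + lik_non_homologous * (1.0 - prior_homologous)
    if evidence == 0.0:
        raise ValueError("both hypotheses assign zero probability to the data")
    return numerator / evidence


# Illustrative numbers only: the observation is four times more likely
# under the homologous hypothesis, with an uninformative prior.
p = posterior_homologous(0.8, 0.2, prior_homologous=0.5)
```

With equal priors the posterior depends only on the likelihood ratio, so a segment whose observed conservation is four times more likely under homology receives a posterior of 0.8.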