Nguyen Nam-Phuong, Nute Michael, Mirarab Siavash, Warnow Tandy
Department of Computer Science and Engineering, University of California, San Diego, 9500 Gilman Drive, La Jolla, 92093, CA, USA.
Department of Statistics, University of Illinois at Urbana-Champaign, 725 South Wright Street, Urbana, 61820, IL, USA.
BMC Genomics. 2016 Nov 11;17(Suppl 10):765. doi: 10.1186/s12864-016-3097-0.
Given a new biological sequence, detecting membership in a known family is a basic step in many bioinformatics analyses, with applications to protein structure and function prediction and metagenomic taxon identification and abundance profiling, among others. Yet family identification of sequences that are distantly related to sequences in public databases or that are fragmentary remains one of the more difficult analytical problems in bioinformatics.
We present a new technique for family identification called HIPPI (Hierarchical Profile Hidden Markov Models for Protein family Identification). HIPPI represents the multiple sequence alignment of a given protein family or superfamily by an ensemble of profile hidden Markov models, each computed using HMMER, rather than by a single model. An evaluation of HIPPI on the Pfam database shows that HIPPI has better overall precision and recall than blastp, HMMER, and pipelines based on HHsearch. It also maintains good accuracy for fragmentary query sequences and for protein families with low average pairwise sequence identity, both conditions under which the other methods degrade in accuracy.
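To make the ensemble idea concrete, the following is a minimal toy sketch in Python, not the HIPPI implementation. It rests on several assumptions that are not stated in the abstract: each ensemble member is built from a subset of the family's aligned sequences, a simple ungapped log-odds profile stands in for a HMMER profile hidden Markov model, a family's score is the best score over its ensemble members, and the query is assigned to the highest-scoring family. The function names and the example sequences are hypothetical.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def build_profile(aligned_seqs, pseudocount=1.0):
    """Per-column log-odds scores against a uniform background.

    Simplified stand-in for a profile HMM: no insert or delete states.
    Assumes all sequences in aligned_seqs have the same length.
    """
    length = len(aligned_seqs[0])
    background = 1.0 / len(ALPHABET)
    total = len(aligned_seqs) + pseudocount * len(ALPHABET)
    profile = []
    for col in range(length):
        counts = Counter(seq[col] for seq in aligned_seqs)
        profile.append({
            aa: math.log(((counts[aa] + pseudocount) / total) / background)
            for aa in ALPHABET
        })
    return profile


def score(profile, query):
    """Best ungapped score of the query over all offsets in the profile.

    Sliding the query lets a short fragment match part of the profile.
    """
    if len(query) > len(profile):
        return float("-inf")
    best = float("-inf")
    for offset in range(len(profile) - len(query) + 1):
        s = sum(profile[offset + i][aa] for i, aa in enumerate(query))
        best = max(best, s)
    return best


def build_ensemble(subsets):
    """One profile per subset of the family alignment (assumed decomposition)."""
    return [build_profile(subset) for subset in subsets]


def classify(query, families):
    """Assign the query to the family whose best ensemble member scores highest."""
    best_family, best_score = None, float("-inf")
    for name, ensemble in families.items():
        family_score = max(score(profile, query) for profile in ensemble)
        if family_score > best_score:
            best_family, best_score = name, family_score
    return best_family, best_score


# Hypothetical toy families, each split into two alignment subsets.
families = {
    "famA": build_ensemble([["ACDEFG", "ACDEFG"], ["ACDKFG", "ACDKFG"]]),
    "famB": build_ensemble([["WWYYHH", "WWYYHH"], ["WWYYHK", "WWYYHK"]]),
}

fragment_family, fragment_score = classify("DEF", families)
```

In this sketch the fragment `DEF` matches only the first `famA` subset profile, which is the point of keeping several narrower models per family: a query that resembles one subgroup of a divergent family can still score well even if a single family-wide profile would dilute that signal.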
HIPPI provides accurate protein family identification and is robust to difficult model conditions. Our results, combined with observations from previous studies, show that an ensemble of profile hidden Markov models can represent a multiple sequence alignment better than a single profile hidden Markov model, and thus can improve downstream analyses for various bioinformatics tasks. Further research is needed to determine the best practices for building the ensemble of profile hidden Markov models. HIPPI is available on GitHub at https://github.com/smirarab/sepp.