Assignment of protein sequences to existing domain and family classification systems: Pfam and the PDB.

Affiliation

Institute for Cancer Research, Fox Chase Cancer Center, 333 Cottman Avenue, Philadelphia, PA 19111, USA.

Publication information

Bioinformatics. 2012 Nov 1;28(21):2763-72. doi: 10.1093/bioinformatics/bts533. Epub 2012 Aug 31.

Abstract

MOTIVATION

Automating the assignment of existing domain and protein family classifications to new sets of sequences is an important task. Current methods often miss assignments because remote relationships fail to achieve statistical significance. Some assignments are not as long as the actual domain definitions because local alignment methods often cut alignments short. Long insertions in query sequences often erroneously result in two copies of the domain assigned to the query. Divergent repeat sequences in proteins are often missed.

RESULTS

We have developed a multilevel procedure that produces nearly complete assignments of the protein families of an existing classification system to a large set of sequences. We apply it to the task of assigning Pfam domains to sequences and structures in the Protein Data Bank (PDB). We found that HHsearch alignments frequently scored more remotely related Pfams within a Pfam clan higher than closely related Pfams, leading to erroneous assignments at the Pfam family level. We therefore applied a greedy algorithm that allows partial overlaps, first to sequence/HMM alignments, then to HMM-HMM alignments and finally to structure alignments, taking care to join partial alignments split by large insertions into single-domain assignments. Additional assignments of repeat Pfams with weaker E-values were allowed once stronger assignments of the same repeat HMM had been made. The resulting assignments, presented in a database called PDBfam, include Pfams for 99.4% of chains >50 residues.
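The greedy, multilevel assignment described above can be illustrated with a minimal sketch. It is not the authors' implementation: the `Hit` fields, the helper names, the thresholds (`max_overlap`, `max_evalue`, `hmm_slack`) and the family names in the example are all assumptions chosen for illustration, and the abstract does not state the actual cutoffs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One alignment of a query chain to a family HMM (hypothetical record)."""
    family: str
    q_start: int    # 1-based, inclusive position on the query
    q_end: int
    hmm_start: int  # 1-based, inclusive position on the family HMM
    hmm_end: int
    evalue: float


def overlap(a, b):
    """Number of query residues shared by two hits."""
    return max(0, min(a.q_end, b.q_end) - max(a.q_start, b.q_start) + 1)


def join_split_hits(hits, hmm_slack=5):
    """Join consecutive hits to the same family when the second one continues
    the HMM roughly where the first one stopped (one domain split by an
    insertion) instead of restarting near the HMM start (a true second copy).
    Simplification: the joined hit spans the insertion as well."""
    merged = []
    for hit in sorted(hits, key=lambda h: (h.family, h.q_start)):
        prev = merged[-1] if merged else None
        if (prev is not None and prev.family == hit.family
                and hit.q_start > prev.q_end
                and hit.hmm_start >= prev.hmm_end - hmm_slack):
            merged[-1] = Hit(prev.family, prev.q_start, hit.q_end,
                             prev.hmm_start, max(prev.hmm_end, hit.hmm_end),
                             min(prev.evalue, hit.evalue))
        else:
            merged.append(hit)
    return merged


def greedy_assign(hits, accepted=(), max_overlap=10, max_evalue=1e-5):
    """Take hits from best to worst E-value and keep each one that overlaps
    every already accepted hit by at most max_overlap residues."""
    result = list(accepted)
    for hit in sorted(hits, key=lambda h: h.evalue):
        if hit.evalue > max_evalue:
            break  # remaining hits are all weaker than the cutoff
        if all(overlap(hit, a) <= max_overlap for a in result):
            result.append(hit)
    return sorted(result, key=lambda h: h.q_start)


def assign_multilevel(levels):
    """Apply the greedy step level by level; later levels can only fill
    regions that earlier levels left unassigned."""
    accepted = []
    for hits in levels:
        accepted = greedy_assign(join_split_hits(hits), accepted)
    return accepted


# Hypothetical hits for one chain: level 1 (sequence/HMM), level 2 (HMM-HMM).
seq_hmm_hits = [
    Hit("Kinase", 1, 100, 1, 100, 1e-30),
    Hit("Kinase", 160, 260, 98, 200, 1e-20),  # continues the HMM: joined
    Hit("SH2", 255, 350, 1, 90, 1e-15),       # 6-residue overlap: kept
    Hit("SH3", 300, 360, 1, 60, 1e-8),        # 51-residue overlap: dropped
]
hmm_hmm_hits = [
    Hit("Kinase_like", 5, 250, 1, 240, 1e-12),  # region already assigned
    Hit("PH", 400, 480, 1, 85, 1e-6),           # fills an unassigned region
]
assignments = assign_multilevel([seq_hmm_hits, hmm_hmm_hits])
for a in assignments:
    print(a.family, a.q_start, a.q_end, a.evalue)
```

Sorting by E-value before testing overlaps is what makes the procedure greedy: a stronger hit always claims its region first, and a small overlap allowance keeps adjacent domains whose alignment ends run slightly into each other. The relaxed E-value treatment of repeat families and the structure-alignment level mentioned in the abstract are not modeled here.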

AVAILABILITY

The Pfam assignment data in PDBfam are available at http://dunbrack2.fccc.edu/ProtCid/PDBfam, which can be searched by PDB codes and Pfam identifiers. They will be updated regularly.
