Assignment of protein sequences to existing domain and family classification systems: Pfam and the PDB.

Affiliation

Institute for Cancer Research, Fox Chase Cancer Center, 333 Cottman Avenue, Philadelphia, PA 19111, USA.

Publication information

Bioinformatics. 2012 Nov 1;28(21):2763-72. doi: 10.1093/bioinformatics/bts533. Epub 2012 Aug 31.

DOI:10.1093/bioinformatics/bts533

PMID:22942020

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC3476341/

Abstract

MOTIVATION

Automating the assignment of existing domain and protein family classifications to new sets of sequences is an important task. Current methods often miss assignments because remote relationships fail to achieve statistical significance. Some assignments are shorter than the actual domain definitions because local alignment methods often cut alignments short. Long insertions in query sequences often erroneously result in two copies of the domain being assigned to the query. Divergent repeat sequences in proteins are often missed.

RESULTS

We have developed a multilevel procedure to produce nearly complete assignments of protein families of an existing classification system to a large set of sequences. We apply this to the task of assigning Pfam domains to sequences and structures in the Protein Data Bank (PDB). We found that HHsearch alignments frequently scored more remotely related Pfams in Pfam clans higher than closely related Pfams, thus leading to erroneous assignment at the Pfam family level. A greedy algorithm allowing for partial overlaps was therefore applied first to sequence/HMM alignments, then to HMM-HMM alignments and then to structure alignments, taking care to join partial alignments split by large insertions into single-domain assignments. Additional assignment of repeat Pfams with weaker E-values was allowed after stronger assignments of the repeat HMM. Our assignments, presented in a database called PDBfam, contain Pfams for 99.4% of chains >50 residues.
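The procedure described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Hit` record, the function names, the family labels (`PF_KINASE`, `PF_REPEAT`, and so on) and every threshold (`max_evalue`, `repeat_evalue`, `max_overlap`, `max_hmm_overlap`) are hypothetical values chosen only for illustration. The sketch joins same-family hits that continue along the HMM after a query insertion, then accepts hits greedily by E-value while tolerating small overlaps, relaxes the E-value cutoff for repeat families that already have a strong assignment, and processes the alignment levels in order.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    family: str    # hypothetical family label
    q_start: int   # query start, 1-based inclusive
    q_end: int     # query end, inclusive
    h_start: int   # start position in the family HMM
    h_end: int     # end position in the family HMM
    evalue: float


def join_split_hits(hits, max_hmm_overlap=10):
    """Merge same-family hits where the later hit continues the HMM
    roughly where the earlier one stopped (one domain split by an insertion)."""
    joined = []
    for hit in sorted(hits, key=lambda h: (h.family, h.q_start)):
        prev = joined[-1] if joined else None
        if (prev is not None and prev.family == hit.family
                and hit.q_start > prev.q_end
                and hit.h_start >= prev.h_end - max_hmm_overlap):
            joined[-1] = Hit(hit.family, prev.q_start, hit.q_end,
                             prev.h_start, max(prev.h_end, hit.h_end),
                             min(prev.evalue, hit.evalue))
        else:
            joined.append(hit)
    return joined


def overlap(a, b):
    """Number of query residues shared by two hits."""
    return max(0, min(a.q_end, b.q_end) - max(a.q_start, b.q_start) + 1)


def greedy_assign(hits, accepted=None, max_evalue=1e-5, repeat_evalue=1e-2,
                  repeat_families=frozenset(), max_overlap=10):
    """Accept hits from best to worst E-value, allowing partial overlaps."""
    accepted = list(accepted or [])
    for hit in sorted(hits, key=lambda h: h.evalue):
        seen = any(a.family == hit.family for a in accepted)
        # Assumed rule: a repeat family with a strong copy gets a looser cutoff.
        relaxed = seen and hit.family in repeat_families
        if hit.evalue > (repeat_evalue if relaxed else max_evalue):
            continue
        if all(overlap(hit, a) <= max_overlap for a in accepted):
            accepted.append(hit)
    return sorted(accepted, key=lambda h: h.q_start)


def multilevel_assign(levels, **kwargs):
    """Apply the greedy step to each alignment level in order; later levels
    only fill regions left uncovered by earlier ones."""
    accepted = []
    for hits in levels:
        accepted = greedy_assign(join_split_hits(hits), accepted, **kwargs)
    return accepted


# Invented example data: level 1 = sequence/HMM hits, level 2 = HMM-HMM hits.
seq_hmm = [
    Hit("PF_KINASE", 10, 120, 1, 110, 1e-40),
    Hit("PF_KINASE", 200, 300, 105, 260, 1e-12),  # same domain after an insertion
    Hit("PF_OTHER", 20, 110, 1, 90, 1e-6),        # weaker competitor, same region
    Hit("PF_REPEAT", 310, 340, 1, 30, 1e-9),
    Hit("PF_REPEAT", 341, 370, 1, 30, 5e-3),      # weak second repeat copy
]
hmm_hmm = [Hit("PF_REMOTE", 400, 480, 1, 80, 1e-7)]

assignments = multilevel_assign([seq_hmm, hmm_hmm],
                                repeat_families=frozenset({"PF_REPEAT"}))
for h in assignments:
    print(h.family, h.q_start, h.q_end, h.evalue)
```

In this toy input, the two `PF_KINASE` fragments become one assignment, the competing `PF_OTHER` hit is rejected because it overlaps a stronger assignment, and the weak second `PF_REPEAT` copy is kept only because a strong copy of the same repeat was already assigned. One simplification to note: a joined hit is treated as covering the whole span including the insertion, so a separate domain nested inside the insertion would be rejected here.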

AVAILABILITY

The Pfam assignment data in PDBfam are available at http://dunbrack2.fccc.edu/ProtCid/PDBfam, which can be searched by PDB codes and Pfam identifiers. They will be updated regularly.
