Lee David, Grant Alastair, Marsden Russell L, Orengo Christine
Biomolecular Structure and Modelling Group, Department of Biochemistry, University College London, Gower Street, London.
Proteins. 2005 May 15;59(3):603-15. doi: 10.1002/prot.20409.
Using a new protocol, PFscape, we undertake a systematic identification of protein families and domain architectures in 120 complete genomes. PFscape clusters sequences into protein families using a Markov clustering algorithm (Enright et al., Nucleic Acids Res 2002;30:1575-1584) followed by complete linkage clustering according to sequence identity. Within each protein family, domains are recognized using a library of hidden Markov models comprising CATH structural and Pfam functional domains. Domain architectures are then determined using DomainFinder (Pearl et al., Protein Sci 2002;11:233-244), and the protein family and domain architecture data are amalgamated in the Gene3D database (Buchan et al., Genome Res 2002;12:503-514). Using Gene3D, we have investigated protein sequence space, the extent of structural annotation, and the distribution of different domain architectures in completed genomes from all kingdoms of life. Consistent with earlier studies by other researchers, the distribution of domain families shows power-law behavior: the largest 2,000 domain families can be mapped to approximately 70% of nonsingleton genome sequences, while the remaining sequences are assigned to much smaller families. Although approximately 50% of domain annotations within a genome are assigned to 219 universal domain families, a much smaller proportion (< 10%) of protein sequences are assigned to universal protein families. This supports the mosaic theory of evolution, whereby domain duplication followed by domain shuffling gives rise to novel domain architectures that can expand the protein functional repertoire of an organism. Functional data (e.g., COG, KEGG, GO) are integrated within Gene3D, yielding a comprehensive resource that is currently being used in structural genomics initiatives and can be accessed via http://www.biochem.ucl.ac.uk/bsm/cath/Gene3D/.
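The complete linkage step mentioned in the abstract can be illustrated with a minimal sketch. This is not the PFscape implementation: the function name `complete_linkage`, the 0.5 identity threshold, and the pairwise identity values are all hypothetical, chosen only to show how complete linkage differs from looser grouping. Under complete linkage, two clusters merge only when every cross-cluster pair of sequences meets the identity threshold.

```python
from itertools import combinations


def complete_linkage(identity, items, threshold):
    """Agglomerative complete-linkage clustering on pairwise identities.

    identity  -- dict mapping frozenset({a, b}) to an identity in [0, 1];
                 missing pairs are treated as 0.0 (an assumption of this sketch)
    items     -- sequence identifiers
    threshold -- hypothetical cutoff; the abstract does not state one
    """
    clusters = [{x} for x in items]

    def linkage(c1, c2):
        # Complete linkage: cluster similarity is the WORST pairwise identity.
        return min(identity.get(frozenset((a, b)), 0.0) for a in c1 for b in c2)

    while len(clusters) > 1:
        # Find the most similar pair of clusters under complete linkage.
        score, (i, j) = max(
            ((linkage(clusters[i], clusters[j]), (i, j))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda t: t[0],
        )
        if score < threshold:
            break  # no remaining pair satisfies the cutoff
        clusters[i] |= clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return clusters


# Invented toy identities for four sequences (illustration only).
identity = {
    frozenset(("A", "B")): 0.90,
    frozenset(("A", "C")): 0.80,
    frozenset(("B", "C")): 0.35,
    frozenset(("C", "D")): 0.70,
}
families = complete_linkage(identity, ["A", "B", "C", "D"], threshold=0.5)
print(sorted(sorted(f) for f in families))  # [['A', 'B'], ['C', 'D']]
```

In this toy input, C is 80% identical to A but only 35% identical to B, so C cannot join the A/B cluster; a single-linkage scheme would have merged them. That strictness is the reason complete linkage is a natural choice when subfamilies are meant to be uniformly similar at a given identity level.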