Clustering analysis of proteins from microbial genomes at multiple levels of resolution.

Authors

Zaslavsky Leonid, Ciufo Stacy, Fedorov Boris, Tatusova Tatiana

Affiliation

National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA.

Publication information

BMC Bioinformatics. 2016 Aug 31;17 Suppl 8(Suppl 8):276. doi: 10.1186/s12859-016-1112-8.

Abstract

BACKGROUND

Microbial genomes at the National Center for Biotechnology Information (NCBI) form a large collection of more than 35,000 assemblies. The data present several complexities: sampling density varies greatly, since human pathogens are densely sampled while other bacteria are less well represented; different protein families occur in annotations with different frequencies; and the quality of genome annotation varies greatly. To extract useful information from these complex data, the analysis needs to be performed at multiple levels of phylogenomic resolution and protein similarity, with an adequate sampling strategy.

RESULTS

Protein clustering is used to construct meaningful and stable groups of similar proteins for analysis and functional annotation. Our approach is to create protein clusters at three levels. First, tight clusters in groups of closely related genomes (species-level clades) are constructed using a combined approach that takes into account both sequence similarity and genome context. Second, clustroids of conserved in-clade clusters are organized into seed global clusters. Finally, global protein clusters are built around the seed clusters. We propose filtering strategies that limit the protein set included in global clustering. The in-clade clustering procedure, the subsequent selection of clustroids, and their organization into seed global clusters provide a robust representation and a high rate of compression. Seed protein clusters are further extended by adding related proteins. Extended seed clusters include a significant part of the data and represent all major known cell machinery. The remaining part, which comes from non-conserved (unique) or rapidly evolving proteins, from rare genomes, or from low-quality annotation, does not group together well. Processing these proteins requires significant computational resources and results in a large number of questionable clusters.
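The three levels described above can be illustrated with a small sketch. This is not the authors' implementation: the k-mer Jaccard similarity, the single-linkage grouping, the thresholds, the rule that a cluster counts as conserved when it has at least `min_size` members, and all function and variable names are assumptions chosen for illustration. Genome context, which the in-clade step of the paper also uses, is ignored here.

```python
from itertools import combinations


def kmers(seq, k=3):
    """Set of overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def similarity(a, b, k=3):
    """Toy similarity (assumption): Jaccard index of k-mer sets."""
    ka, kb = kmers(a, k), kmers(b, k)
    if not ka or not kb:
        return 0.0
    return len(ka & kb) / len(ka | kb)


def link(ids, seqs, threshold):
    """Single-linkage grouping: join pairs with similarity >= threshold."""
    parent = {x: x for x in ids}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(ids, 2):
        if similarity(seqs[a], seqs[b]) >= threshold:
            parent[find(a)] = find(b)
    groups = {}
    for x in ids:
        groups.setdefault(find(x), []).append(x)
    return list(groups.values())


def clustroid(cluster, seqs):
    """Member with the highest total similarity to its own cluster."""
    return max(cluster,
               key=lambda p: sum(similarity(seqs[p], seqs[q]) for q in cluster))


def three_level_clustering(clades, seqs, tight=0.8, seed=0.5,
                           extend=0.5, min_size=2):
    # Level 1: tight clusters inside each clade.
    in_clade = [c for members in clades.values()
                for c in link(members, seqs, tight)]

    # Level 2: clustroids of "conserved" clusters (assumed: size >= min_size)
    # are grouped into seed global clusters.
    conserved = {clustroid(c, seqs): c for c in in_clade if len(c) >= min_size}
    seeds = link(list(conserved), seqs, seed)

    # Level 3: build global clusters around the seeds, then extend them with
    # related leftover proteins; whatever still does not fit is peripheral.
    global_clusters = [{p for rep in s for p in conserved[rep]} for s in seeds]
    placed = set().union(*global_clusters)
    all_ids = [p for members in clades.values() for p in members]
    peripheral = []
    for p in sorted(set(all_ids) - placed):
        best_score, best_idx = 0.0, None
        for i, s in enumerate(seeds):
            score = max(similarity(seqs[p], seqs[rep]) for rep in s)
            if score > best_score:
                best_score, best_idx = score, i
        if best_idx is not None and best_score >= extend:
            global_clusters[best_idx].add(p)
        else:
            peripheral.append(p)
    return [sorted(c) for c in global_clusters], peripheral


# Made-up toy data: three clades and seven short "proteins".
SEQS = {
    "a1": "MKTAYIAKQR", "a2": "MKTAYIAKQR", "a3": "GGGGSSGGGG",
    "b1": "MKTAYIAKQL", "b2": "MKTAYIAKQL", "b3": "WWWWPPWWWW",
    "c1": "MKTAYIAKQRR",
}
CLADES = {"A": ["a1", "a2", "a3"], "B": ["b1", "b2", "b3"], "C": ["c1"]}

clusters, peripheral = three_level_clustering(CLADES, SEQS)
print(clusters, peripheral)
```

In this toy run only the clustroids of the conserved in-clade clusters are compared across clades, which is where the compression comes from: the expensive global step sees one representative per cluster rather than every protein. Proteins that match no seed above the `extend` threshold are set aside as peripheral instead of being forced into a cluster.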

CONCLUSION

The developed filtering strategies make it possible to identify and exclude such peripheral proteins, limiting the protein dataset used in global clustering. Overall, the proposed methodology allows relevant data to be obtained at different levels of detail and data redundancy to be eliminated while preserving biologically interesting variation.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/00f9/5009818/d0f5990f8644/12859_2016_1112_Fig1_HTML.jpg
