National Center for Biotechnology Information, National Library of Medicine, Bethesda, MD, USA.
Skolkovo Institute of Science and Technology, Skolkovo, Russia.
Nat Protoc. 2019 Oct;14(10):3013-3031. doi: 10.1038/s41596-019-0211-1. Epub 2019 Sep 13.
Functionally linked genes in bacterial and archaeal genomes are often organized into operons. However, the composition and architecture of operons are highly variable and frequently differ even among closely related genomes. Therefore, to efficiently extract reliable functional predictions for uncharacterized genes from comparative analyses of the rapidly growing genomic databases, dedicated computational approaches are required. We developed a protocol to systematically and automatically identify genes that are likely to be functionally associated with a 'bait' gene or locus by using relevance metrics. Given a set of bait loci and a genomic database defined by the user, this protocol compares the genomic neighborhoods of the baits to identify genes that are likely to be functionally linked to the baits by calculating the abundance of a given gene within and outside the bait neighborhoods and the distance to the bait. We exemplify the performance of the protocol with three test cases, namely, genes linked to CRISPR-Cas systems using the 'CRISPRicity' metric, genes associated with archaeal proviruses and genes linked to Argonaute genes in halobacteria. The protocol can be run by users with basic computational skills. The computational cost depends on the sizes of the genomic dataset and the list of reference loci and can vary from one CPU-hour to hundreds of hours on a supercomputer.
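The relevance metrics described above (abundance of a gene within versus outside the bait neighborhoods, plus distance to the bait) can be illustrated with a minimal Python sketch. This is not the protocol's implementation. It assumes that genes have already been clustered into families, that positions are gene-order indices along a contig, and that a neighborhood is a fixed window of genes around each bait. The function name `relevance_scores`, the `window` parameter, and the input layout are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def relevance_scores(genes, baits, window=10):
    """Score gene families by their association with bait genes.

    genes:  iterable of (contig, gene_index, family) tuples (assumed input layout)
    baits:  iterable of (contig, gene_index) tuples marking the bait genes
    window: neighborhood half-width in genes (an illustrative assumption)
    """
    bait_set = set(baits)
    bait_positions = defaultdict(list)
    for contig, idx in bait_set:
        bait_positions[contig].append(idx)

    total = defaultdict(int)     # all occurrences of each family
    inside = defaultdict(int)    # occurrences within a bait neighborhood
    dist_sum = defaultdict(int)  # summed distance to the nearest bait (inside only)

    for contig, idx, family in genes:
        if (contig, idx) in bait_set:
            continue  # the baits themselves are not scored
        total[family] += 1
        dists = [abs(idx - b) for b in bait_positions.get(contig, [])]
        if dists and min(dists) <= window:
            inside[family] += 1
            dist_sum[family] += min(dists)

    scores = {}
    for family, n in total.items():
        k = inside[family]
        scores[family] = {
            # share of all occurrences that lie near a bait
            "fraction_in_neighborhood": k / n,
            # average gene-count distance to the nearest bait, if ever near one
            "mean_distance": dist_sum[family] / k if k else None,
        }
    return scores
```

Under these assumptions, a family that occurs mostly inside bait neighborhoods and close to the bait would rank as a strong candidate for functional linkage, whereas a family that is common throughout the genomes would score low even if it sometimes appears next to a bait.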