HiFiBGC: an ensemble approach for improved biosynthetic gene cluster detection in PacBio HiFi-read metagenomes.

Affiliations

CSIR-Institute of Microbial Technology (IMTECH), Sector 39-A, Chandigarh, 160036, India.

Academy of Scientific and Innovative Research (AcSIR), Ghaziabad, 201002, India.

Publication information

BMC Genomics. 2024 Nov 16;25(1):1096. doi: 10.1186/s12864-024-10950-7.

Abstract

BACKGROUND

Microbes produce diverse bioactive natural products with applications in fields such as medicine and agriculture. In their genomes, these natural products are encoded by physically clustered genes known as biosynthetic gene clusters (BGCs). Advances in genome and metagenome sequencing have enabled high-throughput identification of BGCs, a promising avenue for natural product discovery. BGC mining from (meta)genomes using in silico tools has opened access to a vast diversity of potentially novel natural products. However, a fundamental limitation has been the difficulty of assembling complete BGCs, especially from complex metagenomes. Because their assemblies are fragmented, short-read technologies struggle to recover complete BGCs, particularly long and repetitive ones such as nonribosomal peptide synthetase (NRPS) and polyketide synthase (PKS) clusters. Recent advances in long-read sequencing, such as the High Fidelity (HiFi) technology from PacBio, have reduced this limitation and can help retrieve both accurate and complete BGCs from metagenomes. This warrants improving existing BGC identification approaches to make better use of HiFi data.

RESULTS

Here, we present HiFiBGC, a command-line workflow to identify BGCs in PacBio HiFi metagenomes. HiFiBGC leverages an ensemble of assemblies from three HiFi-tailored metagenome assemblers, together with the reads not represented in these assemblies. Based on our analyses of four HiFi metagenomic datasets from four different environments, we show that HiFiBGC identifies, on average, 78% more BGCs than the top-performing single-assembler-based method. This increase is due to HiFiBGC's ensemble assembly approach, which improves recovery by 25%, and to the inclusion of mostly fragmented BGCs identified in the unmapped reads.
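The ensemble idea and the reported percentage gains can be illustrated with a minimal sketch. It is not the HiFiBGC implementation: the assembler names, BGC identifiers, and both helper functions are hypothetical, the toy counts are invented, and it assumes that redundant BGCs found by different assemblers have already been given a shared identifier (how the actual workflow handles redundancy is not described in this abstract).

```python
from typing import Dict, Set


def ensemble_bgcs(per_assembler: Dict[str, Set[str]], unmapped: Set[str]) -> Set[str]:
    """Union of BGC identifiers across all assemblies plus the unmapped reads.

    Hypothetical helper: assumes identical BGCs already share one identifier.
    """
    combined: Set[str] = set(unmapped)
    for bgcs in per_assembler.values():
        combined |= bgcs
    return combined


def percent_gain(new_count: int, baseline_count: int) -> float:
    """Percentage increase of new_count over baseline_count."""
    if baseline_count <= 0:
        raise ValueError("baseline_count must be positive")
    return 100.0 * (new_count - baseline_count) / baseline_count


# Invented toy data: BGC identifiers detected per (hypothetical) assembler.
per_assembler = {
    "assembler_a": {"bgc1", "bgc2", "bgc3", "bgc4"},
    "assembler_b": {"bgc2", "bgc3", "bgc5"},
    "assembler_c": {"bgc1", "bgc5"},
}
# Invented toy data: BGCs detected only in reads that map to no assembly.
unmapped_read_bgcs = {"bgc6", "bgc7"}

# Baseline: the single assembler that yields the most BGCs.
best_single = max(len(bgcs) for bgcs in per_assembler.values())

# Contribution of the assemblies alone, then of assemblies plus unmapped reads.
assembly_only = ensemble_bgcs(per_assembler, set())
combined = ensemble_bgcs(per_assembler, unmapped_read_bgcs)

assembly_gain = percent_gain(len(assembly_only), best_single)
total_gain = percent_gain(len(combined), best_single)
print(f"assembly ensemble gain: {assembly_gain:.1f}%")
print(f"total gain with unmapped reads: {total_gain:.1f}%")
```

Separating the two gains mirrors how the abstract attributes the overall increase partly to the assembly ensemble and partly to BGCs found in unmapped reads.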

CONCLUSIONS

HiFiBGC is a computational workflow for identifying BGCs in long-read HiFi metagenomes, implemented mainly in Python with the Snakemake workflow manager. HiFiBGC is available on GitHub at https://github.com/ay-amityadav/HiFiBGC under the MIT license. The code for the figures and analyses presented in the manuscript is available at https://github.com/ay-amityadav/HiFiBGC_analyses.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/69a0/11569603/0f901f22a436/12864_2024_10950_Fig1_HTML.jpg