

Construction of customized sub-databases from NCBI-nr database for rapid annotation of huge metagenomic datasets using a combined BLAST and MEGAN approach.

Affiliation

Environmental Biotechnology Laboratory, Department of Civil Engineering, The University of Hong Kong, Hong Kong SAR, China.

Publication information

PLoS One. 2013;8(4):e59831. doi: 10.1371/journal.pone.0059831. Epub 2013 Apr 1.

Abstract

We developed a fast method to construct local sub-databases from the NCBI-nr database for quick similarity search and annotation of huge metagenomic datasets based on the BLAST-MEGAN approach. We further propose a three-step sub-database annotation pipeline (SAP) that performs the annotation much more time-efficiently and requires far less computational capacity than the direct NCBI-nr database BLAST-MEGAN approach. In the first BLAST of SAP, the original metagenomic dataset is searched against the constructed sub-database to quickly screen for candidate target sequences. The candidate target sequences identified in the first BLAST are then subjected to a second BLAST against the whole NCBI-nr database. Finally, the BLAST results are annotated using MEGAN to filter out sequences that were mistakenly selected in the first BLAST, which guarantees the accuracy of the results. In the tests conducted in this study, SAP achieved a speedup of ~150-385 times at a BLAST e-value of 1e-5 compared with the direct BLAST against the NCBI-nr database. The annotation results of SAP agree exactly with those of the direct NCBI-nr database BLAST-MEGAN approach, which is very time-consuming and computationally intensive. Selecting more rigorous thresholds (e.g. an e-value of 1e-10) would further accelerate the SAP process. SAP may also be coupled with novel similarity search tools other than BLAST (e.g. RAPsearch) to achieve even faster annotation of huge metagenomic datasets. Above all, this sub-database construction method and SAP provide a new time-efficient and convenient similarity search and annotation strategy for laboratories without access to high performance computing facilities. SAP also offers high performance computing facilities a way to process more similarity search tasks.
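The three steps of SAP can be illustrated with a minimal Python sketch. It is not the authors' implementation: the helper names (`blast_command`, `candidate_ids`, `filter_fasta`), the file names, and the choice of BLAST+ `blastx` with tabular output (`-outfmt 6`) are assumptions made for illustration, since the abstract does not state which BLAST program or output format was used. The sketch only builds the command lines and selects the candidate reads; running BLAST and importing the final results into MEGAN are left to the user.

```python
from typing import Iterable, List, Set


def blast_command(program: str, query: str, db: str, out: str,
                  evalue: float = 1e-5) -> List[str]:
    """Build a BLAST+ command line that writes tabular (outfmt 6) hits."""
    return [program, "-query", query, "-db", db,
            "-evalue", str(evalue), "-outfmt", "6", "-out", out]


def candidate_ids(tabular_lines: Iterable[str]) -> Set[str]:
    """Collect IDs of queries with at least one hit in tabular BLAST output."""
    ids = set()
    for line in tabular_lines:
        line = line.strip()
        if line and not line.startswith("#"):
            ids.add(line.split("\t")[0])  # column 1 is the query ID
    return ids


def filter_fasta(fasta_lines: Iterable[str], keep: Set[str]) -> List[str]:
    """Keep only FASTA records whose ID (first word of the header) is in keep."""
    kept, writing = [], False
    for line in fasta_lines:
        if line.startswith(">"):
            parts = line[1:].split()
            writing = bool(parts) and parts[0] in keep
        if writing:
            kept.append(line.rstrip("\n"))
    return kept


# Step 1: screen all reads against the small sub-database (hypothetical names).
step1 = blast_command("blastx", "reads.fa", "subdb", "step1.tsv")

# Toy stand-ins for the step-1 output and the original reads.
step1_hits = ["read2\tprotA\t91.0", "read2\tprotB\t88.5", "read5\tprotA\t75.2"]
reads = [">read1 sample", "ACGT", ">read2 sample", "GGCC", ">read5", "TTAA"]

# Step 2: re-search only the candidate reads against the whole NCBI-nr database.
candidates = filter_fasta(reads, candidate_ids(step1_hits))
step2 = blast_command("blastx", "candidates.fa", "nr", "step2.tsv")

# Step 3: load step2.tsv into MEGAN, which discards reads that were
# mistakenly selected in step 1 (done in MEGAN, not shown here).
```

Because only reads with a hit in the sub-database reach the second search, the expensive search against the whole NCBI-nr database runs on a much smaller query set, which is where the time saving described in the abstract comes from.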


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f2e3/3613424/bd6c3282dce7/pone.0059831.g001.jpg
