Evaluating techniques for metagenome annotation using simulated sequence data.

Author information

Randle-Boggis Richard J, Helgason Thorunn, Sapp Melanie, Ashton Peter D

Affiliations

Department of Biology, University of York, York YO10 5DD, UK

Publication information

FEMS Microbiol Ecol. 2016 Jul;92(7). doi: 10.1093/femsec/fiw095. Epub 2016 May 8.

Abstract

The advent of next-generation sequencing has allowed huge amounts of DNA sequence data to be produced, advancing the capabilities of microbial ecosystem studies. The current challenge is to identify from which microorganisms and genes the DNA originated. Several tools and databases are available for annotating DNA sequences. The tools, databases and parameters used can have a significant impact on the results: naïve choice of these factors can result in a false representation of community composition and function. We use a simulated metagenome to show how different parameters affect annotation accuracy by evaluating the sequence annotation performances of MEGAN, MG-RAST, One Codex and Megablast. This simulated metagenome allowed the recovery of known organism and function abundances to be quantitatively evaluated, which is not possible for environmental metagenomes. The performance of each program and database varied, e.g. One Codex correctly annotated many sequences at the genus level, whereas MG-RAST RefSeq produced many false positive annotations. This effect decreased as the taxonomic level investigated increased. Selecting more stringent parameters decreases the annotation sensitivity, but increases precision. Ultimately, there is a trade-off between taxonomic resolution and annotation accuracy. These results should be considered when annotating metagenomes and interpreting results from previous studies.
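
The trade-off between sensitivity and precision described above can be illustrated with a minimal Python sketch. It assumes a simulated metagenome in which the true taxon of every read is known, and it uses common definitions that may differ from those in the paper: sensitivity is the fraction of all reads annotated correctly, and precision is the fraction of annotated reads that are correct. The function name `evaluate_annotations`, the read identifiers and the toy annotations are hypothetical and are not taken from the study.

```python
def evaluate_annotations(truth, predicted):
    """Score read-level annotations against a known (simulated) truth.

    truth:     dict mapping read ID -> true taxon
    predicted: dict mapping read ID -> annotated taxon, or None if unassigned
    Definitions here are assumptions, not necessarily those of the paper.
    """
    correct = wrong = 0
    for read_id, true_taxon in truth.items():
        call = predicted.get(read_id)
        if call is None:
            continue  # unassigned reads lower sensitivity but not precision
        if call == true_taxon:
            correct += 1
        else:
            wrong += 1
    annotated = correct + wrong
    sensitivity = correct / len(truth) if truth else 0.0
    precision = correct / annotated if annotated else 0.0
    return {"sensitivity": sensitivity, "precision": precision}


# Toy data: four simulated reads with known genus of origin (hypothetical).
truth = {"r1": "Escherichia", "r2": "Bacillus", "r3": "Bacillus", "r4": "Vibrio"}

# Lenient parameters: every read is annotated, one of them incorrectly.
lenient = {"r1": "Escherichia", "r2": "Bacillus", "r3": "Bacillus", "r4": "Shigella"}

# Stringent parameters: ambiguous reads are left unassigned.
stringent = {"r1": "Escherichia", "r2": "Bacillus", "r3": None, "r4": None}

lenient_scores = evaluate_annotations(truth, lenient)
stringent_scores = evaluate_annotations(truth, stringent)
print("lenient:", lenient_scores)
print("stringent:", stringent_scores)
```

In this toy case the stringent setting annotates fewer reads, so sensitivity falls while precision rises, which mirrors the direction of the effect reported in the abstract.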

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/8875/4892715/38fa87d96614/fiw095fig1.jpg
