Carr Rogan, Borenstein Elhanan
Department of Genome Sciences, University of Washington, Seattle, WA, United States of America.
Department of Genome Sciences, University of Washington, Seattle, WA, United States of America; Department of Computer Science and Engineering, University of Washington, Seattle, WA, United States of America; Santa Fe Institute, Santa Fe, NM, United States of America.
PLoS One. 2014 Aug 22;9(8):e105776. doi: 10.1371/journal.pone.0105776. eCollection 2014.
To assess the functional capacities of microbial communities, including those inhabiting the human body, shotgun metagenomic reads are often aligned to a database of known genes. Such homology-based annotation practices critically rely on the assumption that short reads can map to orthologous genes of similar function. This assumption, however, and the various factors that impact short read annotation, have not been systematically evaluated. To address this challenge, we generated an extremely large database of simulated reads (totaling 15.9 Gb), spanning over 500,000 microbial genes and 170 curated genomes and including, for many genomes, every possible read of a given length. We annotated each read using common metagenomic protocols, fully characterizing the effect of read length, sequencing error, phylogeny, database coverage, and mapping parameters. We additionally rigorously quantified gene-, genome-, and protocol-specific annotation biases. Overall, our findings provide a first comprehensive evaluation of the capabilities and limitations of functional metagenomic annotation, providing crucial goal-specific best-practice guidelines to inform future metagenomic research.
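As a minimal illustration of what "every possible read of a given length" means for a single gene, the sketch below slides a fixed-width window along a sequence and collects each substring. The function name `all_reads` and the toy sequence are hypothetical, and the sketch ignores the reverse strand and sequencing error; it is not the authors' simulation pipeline, which this abstract does not describe in detail.

```python
def all_reads(sequence: str, read_length: int) -> list[str]:
    """Return every contiguous read of `read_length` bases from `sequence`.

    Hypothetical helper for illustration only: forward strand, no errors.
    """
    if read_length <= 0:
        raise ValueError("read_length must be positive")
    # A sequence of length n yields n - read_length + 1 reads;
    # range() is empty when the sequence is shorter than the read.
    last_start = len(sequence) - read_length + 1
    return [sequence[i:i + read_length] for i in range(last_start)]


if __name__ == "__main__":
    toy_gene = "ATGGCGTTA"  # made-up 9-base sequence, not real data
    for read in all_reads(toy_gene, 4):
        print(read)
```

A sequence of length n yields n - k + 1 reads of length k, so exhaustive enumeration grows roughly linearly with genome size for a fixed read length.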