Departments of Microbiology and Molecular Genetics and Computer Science and Engineering, Michigan State University, East Lansing, MI 48824.
Proc Natl Acad Sci U S A. 2014 Apr 1;111(13):4904-9. doi: 10.1073/pnas.1402564111. Epub 2014 Mar 14.
The large volumes of sequencing data required to deeply sample the microbial communities of complex environments pose new challenges to sequence analysis. De novo metagenomic assembly effectively reduces the total amount of data to be analyzed but requires substantial computational resources. We combine two preassembly filtering approaches, digital normalization and partitioning, to generate large metagenome assemblies that were previously intractable. Using a human-gut mock community dataset, we demonstrate that these methods result in assemblies nearly identical to assemblies from unprocessed data. We then assemble two large soil metagenomes totaling 398 billion bp (equivalent to 88,000 Escherichia coli genomes) from matched Iowa corn and native prairie soils. The resulting assembled contigs could be used to identify molecular interactions and reaction networks of known metabolic pathways using the Kyoto Encyclopedia of Genes and Genomes Orthology database. Nonetheless, more than 60% of predicted proteins in the assemblies could not be annotated against known databases. Many of these unknown proteins were abundant in both corn and prairie soils, highlighting the benefits of assembly for the discovery and characterization of novelty in soil biodiversity. Moreover, 80% of the sequencing data could not be assembled because of low coverage, suggesting that considerably more sequencing data are needed to characterize the functional content of soil.
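The abstract names digital normalization without describing it. As a rough illustration only, the idea can be sketched as follows: a read is kept only if the median abundance of its k-mers, counted over the reads kept so far, is still below a coverage cutoff. This sketch is not the authors' pipeline. It assumes exact in-memory k-mer counts (a production tool would typically use a memory-efficient probabilistic counter), it ignores reverse complements, and the function names, `k`, and `cutoff` values are hypothetical choices for the example.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """Return every overlapping k-mer of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def digital_normalize(reads, k=20, cutoff=20):
    """Keep a read only while its estimated coverage is below cutoff.

    Coverage is estimated as the median count of the read's k-mers
    among reads that have already been kept. Illustrative values only.
    """
    counts = Counter()  # exact counts; an assumption made for simplicity
    kept = []
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue  # read is shorter than k, so it has no k-mers
        if median(counts[km] for km in read_kmers) < cutoff:
            kept.append(read)
            counts.update(read_kmers)  # only kept reads add to the counts
    return kept


# Ten copies of a high-coverage read plus one rare read.
reads = ["ACGTTGCAAG"] * 10 + ["TTTTGGGGCC"]
kept = digital_normalize(reads, k=4, cutoff=3)
```

With these toy inputs the redundant copies of the abundant read are discarded once its estimated coverage reaches the cutoff, while the rare read is retained. That is the property that lets this kind of filtering shrink a dataset before assembly without dropping low-coverage sequence.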