MetaCache: context-aware classification of metagenomic reads using minhashing.

Affiliations

Department of Computer Science.

Molecular Genetics and Genome Analysis Group, Department of Biology, Johannes Gutenberg University, 55128 Mainz, Germany.

Publication information

Bioinformatics. 2017 Dec 1;33(23):3740-3748. doi: 10.1093/bioinformatics/btx520.

Abstract

MOTIVATION

Metagenomic shotgun sequencing studies are becoming increasingly popular, with prominent examples including the sequencing of human microbiomes and diverse environments. A fundamental computational problem in this context is read classification, i.e. the assignment of a taxonomic label to each read. Due to the large number of reads produced by modern high-throughput sequencing technologies and the rapidly increasing number of available reference genomes, the corresponding software tools suffer from long runtimes, large memory requirements or low accuracy.

RESULTS

We introduce MetaCache, a novel software tool for read classification using the big data technique minhashing. Our approach performs context-aware classification of reads by computing representative subsamples of k-mers within both probed reads and locally constrained regions of the reference genomes. As a result, MetaCache consumes significantly less memory than the state-of-the-art read classifiers Kraken and CLARK while achieving highly competitive sensitivity and precision at comparable speed. For example, using NCBI RefSeq draft and completed genomes with a total length of around 140 billion bases as reference, MetaCache's database consumes only 62 GB of memory, while both Kraken and CLARK fail to construct their respective databases on a workstation with 512 GB RAM. Our experimental results further show that classification accuracy continuously improves as the amount of utilized reference genome data increases.
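
The idea of subsampling k-mers within reads and within local windows of the reference can be illustrated with a small Python sketch. This is not MetaCache's implementation (which is written in C++); it is a minimal bottom-s minhash under stated assumptions. The function names, the parameter values (k = 16, sketch size s = 16, window length 128) and the choice of BLAKE2b as the hash function are all hypothetical choices made for illustration.

```python
import hashlib

def kmers(seq, k):
    """Yield every overlapping k-mer of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]

def hash_kmer(kmer):
    """Deterministic 64-bit hash of a k-mer (BLAKE2b is an illustrative choice)."""
    digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "big")

def sketch(seq, k=16, s=16):
    """Bottom-s minhash: the s smallest distinct k-mer hashes of seq."""
    return sorted({hash_kmer(km) for km in kmers(seq, k)})[:s]

def build_index(genomes, window=128, k=16, s=16):
    """Map each sketch hash to the (genome, window) locations whose sketch contains it."""
    index = {}
    stride = window - k + 1  # windows overlap by k-1 bases so no k-mer is skipped
    for name, seq in genomes.items():
        starts = range(0, max(len(seq) - k + 1, 1), stride)
        for w, start in enumerate(starts):
            for h in sketch(seq[start:start + window], k, s):
                index.setdefault(h, set()).add((name, w))
    return index

def classify(read, index, k=16, s=16):
    """Return the genome owning the window with the most sketch hits, or None."""
    hits = {}
    for h in sketch(read, k, s):
        for location in index.get(h, ()):
            hits[location] = hits.get(location, 0) + 1
    if not hits:
        return None
    best_genome, _best_window = max(hits, key=hits.get)
    return best_genome
```

Only s hashes are stored per window instead of every k-mer, which is where the memory saving in a scheme like this comes from. Counting hits per window rather than per genome keeps the match tied to a local region of the reference. A real classifier would additionally map genomes to taxa and combine hits from neighbouring windows, both of which this sketch omits.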

AVAILABILITY AND IMPLEMENTATION

MetaCache is open source software written in C++ and can be downloaded at http://github.com/muellan/metacache.

CONTACT

bertil.schmidt@uni-mainz.de.

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.
