Wei Ze-Gang, Zhang Shao-Wu
Key Laboratory of Information Fusion Technology of Ministry of Education, College of Automation, Northwestern Polytechnical University, Xi'an 710072, China.
J Theor Biol. 2017 Jul 21;425:80-87. doi: 10.1016/j.jtbi.2017.04.019. Epub 2017 Apr 26.
The recent sequencing revolution driven by high-throughput technologies has led to the rapid accumulation of 16S rRNA sequences for microbial communities. Clustering short sequences into operational taxonomic units (OTUs) is a crucial initial step in analyzing metagenomic data. Although many heuristic methods with low computational complexity have been proposed for OTU inference, they select only one sequence as the seed for each cluster, and the results are sensitive to the sequences selected to represent the clusters. To address this issue, we present a de Bruijn graph-based heuristic clustering method (DBH) for clustering massive 16S rRNA sequences into OTUs by introducing a novel seed selection strategy and a greedy clustering approach. In comparisons with widely used existing methods on several simulated and real-life metagenomic datasets, DBH shows higher clustering performance and low memory usage, and helps reduce overestimation of the number of OTUs. DBH is more effective at handling large-scale metagenomic datasets. The DBH software can be freely downloaded from https://github.com/nwpu134/DBH.git for academic users.
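To illustrate the seed sensitivity that the abstract attributes to conventional heuristic methods, the following is a minimal Python sketch of single-seed greedy clustering. It is not the DBH algorithm and does not use a de Bruijn graph; the k-mer Jaccard similarity (used here as a stand-in for sequence identity), the function names, the threshold, and the toy sequences are all assumptions made for illustration.

```python
from typing import List, Set


def kmers(seq: str, k: int) -> Set[str]:
    """Return the set of distinct k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two k-mer sets (assumed proxy for identity)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def greedy_cluster(seqs: List[str], threshold: float, k: int = 3) -> List[List[str]]:
    """Single-seed greedy clustering.

    Each sequence joins the first cluster whose seed it matches at or
    above the threshold; otherwise it becomes the seed of a new cluster.
    """
    seed_profiles: List[Set[str]] = []
    clusters: List[List[str]] = []
    for seq in seqs:
        profile = kmers(seq, k)
        for idx, seed_profile in enumerate(seed_profiles):
            if jaccard(profile, seed_profile) >= threshold:
                clusters[idx].append(seq)
                break
        else:
            # No seed was similar enough: this sequence seeds a new cluster.
            seed_profiles.append(profile)
            clusters.append([seq])
    return clusters


# Hypothetical toy sequences: a and c are each close to b, but not to each other.
a, b, c = "AAACCCGGG", "AACCCGGGT", "ACCCGGGTT"

print(greedy_cluster([a, b, c], threshold=0.7))  # a is the first seed
print(greedy_cluster([b, a, c], threshold=0.7))  # b is the first seed
```

With these toy inputs, the same three sequences form two clusters when a is the first seed and one cluster when b is, which is the kind of dependence on the chosen representative that motivates a more careful seed selection strategy.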