Department of Mechanical Engineering, The University of Melbourne, Parkville, Melbourne, 3010, Australia.
Department of Computer Engineering, University of Peradeniya, Prof. E. O. E. Pereira Mawatha, Peradeniya, 20400, Sri Lanka.
BMC Bioinformatics. 2017 Dec 28;18(Suppl 16):571. doi: 10.1186/s12859-017-1967-3.
In metagenomics, the separation of nucleotide sequences belonging to an individual population or to closely matched populations is termed binning. Binning aids the evaluation of the underlying microbial population structure as well as the recovery of individual genomes from a sample of uncultivable microbial organisms. Both supervised and unsupervised learning methods have been employed in binning; however, characterizing a metagenomic sample containing multiple strains remains a significant challenge. In this study, we designed and implemented a new workflow, Coverage and composition based binning of Metagenomes (CoMet), for binning contigs in a single metagenomic sample. CoMet uses the coverage values and compositional features of metagenomic contigs. Its binning strategy consists of an initial grouping of contigs in guanine-cytosine (GC) content-coverage space, followed by refinement of the bins in tetranucleotide frequency space, in a purely unsupervised manner. CoMet employs the clustering algorithm DBSCAN for binning contigs. The performance of CoMet was compared against four existing approaches for binning a single metagenomic sample, namely MaxBin, Metawatt, MyCC (default) and MyCC (coverage), on multiple datasets, including a sample composed of multiple strains.
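The features and clustering step described above can be illustrated with a minimal sketch. It is not the CoMet implementation: the function names, the toy contig values, the log10 scaling of coverage, and the `eps` and `min_pts` settings are all assumptions chosen for illustration, and the DBSCAN shown is a plain textbook version in pure Python.

```python
from itertools import product
from math import dist, log10

def gc_content(seq):
    """Fraction of G and C among the unambiguous A/C/G/T bases of a contig."""
    seq = seq.upper()
    acgt = sum(seq.count(b) for b in "ACGT")
    return (seq.count("G") + seq.count("C")) / acgt if acgt else 0.0

# All 256 possible tetranucleotides, in a fixed order.
TETRAMERS = ["".join(p) for p in product("ACGT", repeat=4)]

def tetranucleotide_freqs(seq):
    """Normalized counts of overlapping 4-mers (single strand, for simplicity)."""
    seq = seq.upper()
    counts = dict.fromkeys(TETRAMERS, 0)
    for i in range(len(seq) - 3):
        kmer = seq[i:i + 4]
        if kmer in counts:  # skip 4-mers containing N or other symbols
            counts[kmer] += 1
    total = sum(counts.values())
    return [counts[k] / total if total else 0.0 for k in TETRAMERS]

def dbscan(points, eps, min_pts):
    """Textbook DBSCAN. Returns one label per point; -1 marks noise."""
    unseen, noise = None, -1
    labels = [unseen] * len(points)

    def neighbors(i):
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not unseen:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = noise
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == noise:
                labels[j] = cluster  # former noise becomes a border point
            if labels[j] is not unseen:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_pts:  # j is a core point: keep expanding
                queue.extend(reachable)
    return labels

# Hypothetical contigs given as (GC content, mean coverage).
contigs = [(0.35, 10.0), (0.36, 11.0), (0.34, 10.5),
           (0.62, 40.0), (0.63, 41.0), (0.61, 39.0),
           (0.50, 200.0)]
# Assumed scaling: log10 of coverage, so both axes have comparable ranges.
points = [(gc, log10(cov)) for gc, cov in contigs]
labels = dbscan(points, eps=0.1, min_pts=3)
print(labels)
```

In this toy input the first three contigs and the next three form two groups in GC content-coverage space, and the last contig is left unassigned as noise. A refinement step could then apply the same clustering to the `tetranucleotide_freqs` vectors of the contigs within each group.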
Binning methods based on both the compositional features and the coverage of contigs performed better than the method based only on compositional features. CoMet yielded higher or comparable precision relative to the existing binning methods on benchmark datasets of varying complexity. MyCC (coverage) ranked highest in F1-score. However, CoMet outperformed MyCC (coverage) on the dataset containing multiple strains. Furthermore, in discriminating species in the sample of multiple strains, CoMet recovered contigs of more species and achieved 18-39% higher precision than the existing methods it was compared with. CoMet also yielded higher precision than MyCC (default) and MyCC (coverage) on a real metagenome.
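For readers unfamiliar with how precision and F1-score apply to binning, the sketch below uses one common convention. Precision counts, for each bin, the contigs of its dominant species; recall counts, for each species, the contigs in its best bin, with unbinned contigs counted against recall. This convention, the function name `binning_scores`, and the toy labels are assumptions for illustration; the exact definitions used in the study are not stated in this abstract.

```python
from collections import Counter, defaultdict

def binning_scores(true_species, bin_labels, unbinned=-1):
    """Return (precision, recall, F1) under the assumed convention above."""
    by_bin = defaultdict(Counter)      # bin -> species counts
    by_species = defaultdict(Counter)  # species -> bin counts
    for species, b in zip(true_species, bin_labels):
        if b == unbinned:
            continue
        by_bin[b][species] += 1
        by_species[species][b] += 1
    binned = sum(sum(c.values()) for c in by_bin.values())
    # Precision: share of binned contigs that match their bin's dominant species.
    precision = (sum(max(c.values()) for c in by_bin.values()) / binned
                 if binned else 0.0)
    # Recall: share of all contigs that sit in their species' best bin.
    recall = (sum(max(c.values()) for c in by_species.values()) / len(true_species)
              if true_species else 0.0)
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical ground truth and bin assignments for seven contigs.
truth = ["A", "A", "A", "B", "B", "B", "C"]
bins = [0, 0, 1, 1, 1, 1, -1]
precision, recall, f1 = binning_scores(truth, bins)
print(round(precision, 3), round(recall, 3), round(f1, 3))
```

Here one contig of species A is misplaced in the bin dominated by species B and the single contig of species C is unbinned, so precision is 5/6, recall is 5/7 and the F1-score is 10/13.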
The approach proposed with CoMet for binning contigs improves the precision of binning while characterizing more species, both in a single metagenomic sample and in a sample containing multiple strains. The F1-scores obtained from the different binning strategies vary across datasets; however, CoMet yields the highest F1-score on the sample composed of multiple strains.