Bazinet Adam L
National Biodefense Analysis and Countermeasures Center, Fort Detrick, 21702, MD, USA.
BMC Evol Biol. 2017 Aug 2;17(1):176. doi: 10.1186/s12862-017-1020-1.
Bacillus cereus sensu lato (s. l.) is an ecologically diverse bacterial group of medical and agricultural significance. In this study, I use publicly available genomes and novel bioinformatic workflows to characterize the B. cereus s. l. pan-genome and perform the largest phylogenetic and population genetic analyses of this group to date in terms of the number of genes and taxa included. With these fundamental data in hand, I identify genes associated with particular phenotypic traits (i.e., "pan-GWAS" analysis), and quantify the degree to which taxa sharing common attributes are phylogenetically clustered.
A rapid k-mer-based approach (Mash) was used to create reduced representations of selected Bacillus genomes, and a fast distance-based phylogenetic analysis of these data (FastME) was performed to determine which species should be included in B. cereus s. l. The complete genomes of eight B. cereus s. l. species were annotated de novo with Prokka, and these annotations were used by Roary to produce the B. cereus s. l. pan-genome. Scoary was used to associate gene presence and absence patterns with various phenotypes. The orthologous protein sequence clusters produced by Roary were filtered and used to build HaMStR databases of gene models, which were used in turn to construct phylogenetic data matrices. Phylogenetic analyses used RAxML, DendroPy, ClonalFrameML, PAUP*, and SplitsTree. Bayesian model-based population genetic analysis with hierBAPS assigned taxa to clusters. The genealogical sorting index was used to quantify the phylogenetic clustering of taxa sharing common attributes.
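To illustrate the kind of k-mer-based genome distance that Mash estimates, the following is a minimal sketch, not the Mash implementation. It makes two simplifying assumptions: it compares exact k-mer sets rather than MinHash sketches, and it ignores reverse complements. The function names and the default k are illustrative choices, not taken from the study. The Jaccard index j is converted to a distance with the Mash formula D = -(1/k) ln(2j / (1 + j)).

```python
import math


def kmers(seq, k):
    """Return the set of all substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Jaccard index of two sets; defined as 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def mash_distance(seq_a, seq_b, k=21):
    """Mash-style distance from the exact k-mer Jaccard index.

    Assumption: exact k-mer sets stand in for MinHash sketches,
    and reverse-complement k-mers are not merged.
    """
    j = jaccard(kmers(seq_a, k), kmers(seq_b, k))
    if j == 0.0:
        return 1.0  # no shared k-mers: maximal distance
    return max(0.0, -math.log(2 * j / (1 + j)) / k)


if __name__ == "__main__":
    # Toy sequences (hypothetical, for illustration only).
    print(mash_distance("ACGTACGTTGCA", "ACGTACGATGCA", k=4))
```

A matrix of such pairwise distances is the kind of input a distance-based tree-building method consumes; using sketches instead of full k-mer sets is what makes the real approach fast on whole genomes.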
The B. cereus s. l. pan-genome currently consists of ≈60,000 genes, ≈600 of which are "core" (common to at least 99% of taxa sampled). Pan-GWAS analysis revealed genes associated with phenotypes such as isolation source, oxygen requirement, and ability to cause diseases such as anthrax or food poisoning. Extensive phylogenetic analyses using an unprecedented amount of data produced phylogenies that were largely concordant with each other and with previous studies. Phylogenetic support, as measured by bootstrap probabilities, increased markedly when all suitable pan-genome data were included in phylogenetic analyses rather than core genes alone. Bayesian population genetic analysis recommended subdividing the three major clades of B. cereus s. l. into nine clusters. Taxa sharing common traits and species designations exhibited varying degrees of phylogenetic clustering.
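The "core" definition above (present in at least 99% of taxa sampled) can be sketched as a simple filter over a gene presence/absence table. This is a hypothetical illustration, not Roary's code or output format: the input structure, the function name, and the toy gene and genome identifiers are all assumptions.

```python
def split_pan_genome(presence, genomes, core_threshold=0.99):
    """Split genes into core and accessory lists.

    presence: dict mapping gene name -> set of genome ids carrying it
              (hypothetical structure, not a Roary file format).
    genomes:  set of all genome ids sampled.
    A gene is "core" if the fraction of genomes carrying it is at
    least core_threshold; otherwise it is "accessory".
    """
    n = len(genomes)
    if n == 0:
        raise ValueError("at least one genome is required")
    core, accessory = [], []
    for gene, carriers in presence.items():
        fraction = len(carriers & genomes) / n
        if fraction >= core_threshold:
            core.append(gene)
        else:
            accessory.append(gene)
    return sorted(core), sorted(accessory)


if __name__ == "__main__":
    # Toy data: 200 made-up genome ids and three made-up genes.
    genomes = {f"g{i}" for i in range(200)}
    presence = {
        "geneA": set(genomes),                    # in all 200 genomes
        "geneB": {f"g{i}" for i in range(199)},   # in 199 of 200
        "geneC": {f"g{i}" for i in range(50)},    # in 50 of 200
    }
    core, accessory = split_pan_genome(presence, genomes)
    print("core:", core)
    print("accessory:", accessory)
```

Using a threshold slightly below 100% tolerates a gene being missed in a few genomes, for example through assembly or annotation gaps, at the cost of a somewhat looser definition of "core".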
All phylogenetic analyses recapitulated two previously used classification systems, and taxa were consistently assigned to the same major clade and group. By including accessory genes from the pan-genome in the phylogenetic analyses, I produced an exceptionally well-supported phylogeny of 114 complete B. cereus s. l. genomes. The best-performing methods were used to produce a phylogeny of all 498 publicly available B. cereus s. l. genomes, which was in turn used to compare three different classification systems and to test the monophyly status of various B. cereus s. l. species. The majority of the methodology used in this study is generic and could be leveraged to produce pan-genome estimates and similarly robust phylogenetic hypotheses for other bacterial groups.