Grehl Claudius, Wagner Marc, Lemnian Ioana, Glaser Bruno, Grosse Ivo
Institute of Computer Science, Bioinformatics, Martin Luther University Halle-Wittenberg, Von Seckendorff-Platz 1, Halle (Saale), Germany.
Institute of Agronomy and Nutritional Sciences, Soil Biogeochemistry, Martin Luther University Halle-Wittenberg, Von Seckendorff-Platz 3, Halle (Saale), Germany.
Front Plant Sci. 2020 Feb 28;11:176. doi: 10.3389/fpls.2020.00176. eCollection 2020.
DNA methylation is involved in many biological processes in the development and well-being of crop plants, such as transposon activation, heterosis, environment-dependent transcriptome plasticity, aging, and many diseases. Whole-genome bisulfite sequencing (WGBS) is an excellent technology for detecting and quantifying DNA methylation patterns in a wide variety of species, but optimized data analysis pipelines exist only for a small number of species and are missing for many important crop plants. This is especially important as most existing benchmark studies have been performed on mammals with hardly any repetitive elements and without CHG and CHH methylation. Pipelines for the analysis of WGBS data usually consist of four steps: read trimming, read mapping, quantification of methylation levels, and prediction of differentially methylated regions (DMRs). Here we focus on read mapping, which is challenging because unmethylated cytosines are converted to uracil during bisulfite treatment and to thymine during the subsequent polymerase chain reaction, so read mappers must be capable of dealing with this cytosine/thymine polymorphism. Several read mappers have been developed in recent years, with different strengths and weaknesses, but their performance has not been critically evaluated. Here, we compare eight read mappers (Bismark, BismarkBwt2, BSMAP, BS-Seeker2, Bwameth, GEM3, Segemehl, and GSNAP) to assess the impact of the read-mapping results on the prediction of DMRs. We used simulated data generated from the genomes of five species and monitored the effects of the bisulfite conversion rate, the sequencing error rate, the maximum number of allowed mismatches, and the genome structure and size. As features for benchmarking the eight read mappers, we calculated precision, number of uniquely mapped reads, distribution of the mapped reads, run time, and memory consumption.
Furthermore, we validated our findings using real-world data and showed the influence of the mapping step on DMR calling in WGBS pipelines. We found that the conversion rate had only a minor impact on the mapping quality and the number of uniquely mapped reads, whereas the error rate and the maximum number of allowed mismatches had a strong impact and led to differences in the performance of the eight read mappers. In conclusion, we recommend BSMAP, which needs the shortest run time and yields the highest precision, and Bismark, which requires the smallest amount of memory and yields high precision and high numbers of uniquely mapped reads.
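The cytosine/thymine polymorphism described in the abstract can be illustrated with a minimal Python sketch. It is not taken from the paper or from any of the benchmarked mappers: the function names, the toy reference sequence, and the methylated position are hypothetical, and only the forward strand is considered. It shows that a bisulfite-converted read no longer matches its reference exactly, while comparing both sequences after collapsing C into T removes the mismatches caused by conversion.

```python
def bisulfite_convert(seq, methylated):
    """Turn every unmethylated C into T, mimicking bisulfite treatment plus PCR.

    `methylated` is a set of 0-based positions whose cytosines are protected.
    """
    return "".join(
        "T" if base == "C" and i not in methylated else base
        for i, base in enumerate(seq)
    )


def c_to_t(seq):
    """Collapse C and T into one letter so methylation state cannot cause a mismatch."""
    return seq.replace("C", "T")


def count_mismatches(a, b):
    """Count positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


# Hypothetical toy data: only the cytosine at index 1 is methylated.
reference = "ACGTTCGACCGTA"
read = bisulfite_convert(reference, methylated={1})

naive_mismatches = count_mismatches(reference, read)
converted_mismatches = count_mismatches(c_to_t(reference), c_to_t(read))

print(read)
print(naive_mismatches, converted_mismatches)
```

In this toy case the three unmethylated cytosines appear as mismatches under a direct comparison and disappear once both sequences are reduced to the same three-letter alphabet. The reduction also discards information, which is one reason repetitive genomes are harder to map uniquely.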