Delft Bioinformatics Lab, Intelligent Systems, Delft University of Technology, 2628 XE, Delft, The Netherlands.
Technical Biochemistry, TU Dortmund University, 44227, Dortmund, Germany.
Gigascience. 2022 Dec 28;12. doi: 10.1093/gigascience/giad100. Epub 2023 Nov 24.
Assembly algorithm choice should be a deliberate, well-justified decision when researchers create genome assemblies for eukaryotic organisms from third-generation sequencing technologies. While third-generation sequencing by Oxford Nanopore Technologies (ONT) and Pacific Biosciences (PacBio) has overcome the disadvantages of short read lengths specific to next-generation sequencing (NGS), third-generation sequencers are known to produce more error-prone reads, thereby generating a new set of challenges for assembly algorithms and pipelines. However, the introduction of HiFi reads, which offer substantially reduced error rates, has provided a promising solution for more accurate assembly outcomes. Since the introduction of third-generation sequencing technologies, many tools have been developed that aim to take advantage of the longer reads, and researchers need to choose the correct assembler for their projects.
We benchmarked state-of-the-art long-read de novo assemblers to help readers make a balanced choice for the assembly of eukaryotes. To this end, we used 12 real and 64 simulated datasets from different eukaryotic genomes, with different read length distributions, imitating PacBio continuous long-read (CLR), PacBio high-fidelity (HiFi), and ONT sequencing to evaluate the assemblers. For ONT and PacBio CLR reads, we include 5 commonly used long-read assemblers: Canu, Flye, Miniasm, Raven, and wtdbg2. For PacBio HiFi reads, we include 5 state-of-the-art HiFi assemblers: HiCanu, Flye, Hifiasm, LJA, and MBG. The evaluation covers the following categories: reference-based metrics, assembly statistics, misassembly count, BUSCO completeness, runtime, and RAM usage. Additionally, we investigated the effect of increased read length on the quality of the assemblies and report that read length can, but does not always, positively impact assembly quality.
Our benchmark concludes that no assembler performs best in all evaluation categories. However, our results show that, overall, Flye is the best-performing assembler for PacBio CLR and ONT reads, on both real and simulated data. Meanwhile, the best-performing PacBio HiFi assemblers are Hifiasm and LJA. Finally, the benchmarking with longer reads shows that increased read length can improve assembly quality, but the extent of the improvement depends on the size and complexity of the reference genome.
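The abstract lists "assembly statistics" among the evaluation categories without naming individual statistics. As one illustration only, N50 is a widely reported contiguity statistic: the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the total assembly length. The following is a minimal Python sketch under the assumption that contig lengths are already available as a list of integers; the function name `n50` is hypothetical and is not taken from the benchmark's own pipeline.

```python
def n50(contig_lengths):
    """Return the N50 of a list of contig lengths (0 for an empty list).

    Illustrative helper only; not part of the benchmark described above.
    """
    # Walk through contigs from longest to shortest.
    lengths = sorted(contig_lengths, reverse=True)
    total = sum(lengths)
    running = 0
    for length in lengths:
        running += length
        # Stop at the first contig where the cumulative length
        # reaches at least half of the total assembly length.
        if 2 * running >= total:
            return length
    return 0


if __name__ == "__main__":
    # Toy contig lengths, made up for illustration.
    print(n50([2, 3, 4, 5, 6]))
```

A higher N50 indicates a more contiguous assembly, but it says nothing about correctness, which is presumably why the benchmark also reports misassembly counts and reference-based metrics.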