Benchmarking of bioinformatics tools for the hybrid assembly of human and non-human whole-genome sequencing data.

Author information

Muñoz-Barrera Adrián, Rubio-Rodríguez Luis A, Jáspez David, Corrales Almudena, Marcelino-Rodriguez Itahisa, Ortiz Lourdes, Mendoza Pablo, Lorenzo-Salazar José M, González-Montelongo Rafaela, Flores Carlos

Affiliations

Genomics Division, Instituto Tecnológico y de Energías Renovables (ITER), Santa Cruz de Tenerife, Spain.

Research Unit, Hospital Universitario Nuestra Señora de Candelaria, Instituto de Investigación Sanitaria de Canarias, Santa Cruz de Tenerife, Spain.

Publication information

Comput Struct Biotechnol J. 2025 Jul 13;27:3099-3109. doi: 10.1016/j.csbj.2025.07.020. eCollection 2025.

Abstract

Accurate and complete genome assemblies enable variant identification and the discovery of novel genomic features and biological functions, but assembling large and complex genomes remains challenging. Long-read sequencing data, alone or combined with short-read data, facilitate genome assembly. However, comprehensive evaluations of software performance are scarce in the literature, especially for human genome assembly. We benchmarked 11 pipelines, including four long-read-only assemblers and three hybrid assemblers, combined with four polishing schemes, using the HG002 human reference material sequenced with Oxford Nanopore Technologies and Illumina. The best-performing pipeline was validated with non-reference human and non-human routine laboratory samples. Software performance was assessed using QUAST, BUSCO, and Merqury metrics, alongside computational cost analyses. Flye outperformed all other assemblers, particularly with Ratatosk error-corrected long reads. Polishing improved assembly accuracy and continuity, with two rounds of Racon and Pilon yielding the best results. Assemblies of the validation samples showed metrics comparable to those of the reference material. Based on these results, we provide a complete, optimized analysis pipeline for assembly, polishing, and contig curation, implemented in Nextflow for efficient parallelization and built-in dependency management, to further advance the generation of high-quality, chromosome-level assemblies.
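
The abstract refers to assembly continuity without naming specific metrics. As an illustration only, the sketch below computes N50, one of the contiguity statistics that QUAST reports. The function name `n50` and the toy contig lengths are hypothetical and are not taken from the study.

```python
from typing import Sequence


def n50(contig_lengths: Sequence[int]) -> int:
    """Return the N50 of an assembly.

    N50 is the length L such that contigs of length >= L together
    cover at least half of the total assembled bases.
    """
    if not contig_lengths:
        raise ValueError("need at least one contig length")

    half_total = sum(contig_lengths) / 2
    running = 0
    # Walk contigs from longest to shortest, accumulating bases
    # until half of the assembly is covered.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running >= half_total:
            return length
    raise AssertionError("unreachable: cumulative sum always reaches half")


# Hypothetical toy assembly (lengths in kb), not data from the paper.
toy_lengths = [80, 70, 50, 40, 30, 20]
print(n50(toy_lengths))
```

A higher N50 at a similar total length generally indicates a less fragmented assembly, which is why it is commonly read alongside completeness (BUSCO) and k-mer-based accuracy (Merqury) rather than on its own.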

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2f52/12284544/c007f1dc6cd0/ga1.jpg
