DBG2OLC: Efficient Assembly of Large Genomes Using Long Erroneous Reads of the Third Generation Sequencing Technologies.

Affiliations

Department of Computer Science, Institute for Advanced Computer Studies, University of Maryland, College Park, MD 20742, USA.

Computational Biology and Medical Ecology Lab, State Key Laboratory of Genetic Resources and Evolution, Kunming Institute of Zoology, Chinese Academy of Sciences, Kunming, 650223 China.

Publication information

Sci Rep. 2016 Aug 30;6:31900. doi: 10.1038/srep31900.

Abstract

The highly anticipated transition from next generation sequencing (NGS) to third generation sequencing (3GS) has been difficult primarily due to high error rates and excessive sequencing cost. The high error rates make the assembly of long erroneous reads of large genomes challenging because existing software solutions are often overwhelmed by error correction tasks. Here we report a hybrid assembly approach that simultaneously utilizes NGS and 3GS data to address both issues. We gain advantages from three general and basic design principles: (i) Compact representation of the long reads leads to efficient alignments. (ii) Base-level errors can be skipped; structural errors need to be detected and corrected. (iii) Structurally correct 3GS reads are assembled and polished. In our implementation, preassembled NGS contigs are used to derive the compact representation of the long reads, motivating an algorithmic conversion from a de Bruijn graph to an overlap graph, the two major assembly paradigms. Moreover, since NGS and 3GS data can compensate for each other, our hybrid assembly approach reduces both of their sequencing requirements. Experiments show that our software is able to assemble mammalian-sized genomes orders of magnitude more quickly than existing methods without consuming a lot of memory, while saving about half of the sequencing cost.
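Design principle (i) can be illustrated with a minimal sketch. It is not the DBG2OLC implementation: the function names, the k-mer matching scheme, the `min_hits` threshold, and the toy sequences are all assumptions made for illustration, and strand orientation and reverse complements are ignored. Each long read is reduced to the ordered list of NGS contig identifiers it touches, and overlaps between reads are then found by comparing these short lists rather than the erroneous bases.

```python
from collections import Counter


def build_kmer_index(contigs, k):
    """Map every k-mer that occurs in exactly one contig to that contig's id."""
    index = {}
    ambiguous = set()
    for cid, seq in contigs.items():
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if kmer in ambiguous:
                continue
            if kmer in index and index[kmer] != cid:
                # Shared between contigs: not usable as an anchor.
                del index[kmer]
                ambiguous.add(kmer)
            else:
                index[kmer] = cid
    return index


def compress_read(read, index, k, min_hits=2):
    """Represent a long read as the ordered list of contig ids it touches.

    Base-level errors destroy some k-mers, so a contig is kept as soon as
    at least `min_hits` of its unique k-mers survive in the read.
    """
    hits = Counter()
    first_pos = {}
    for i in range(len(read) - k + 1):
        cid = index.get(read[i:i + k])
        if cid is not None:
            hits[cid] += 1
            first_pos.setdefault(cid, i)
    kept = [cid for cid in hits if hits[cid] >= min_hits]
    return sorted(kept, key=first_pos.get)


def suffix_prefix_overlap(a, b):
    """Length of the longest suffix of a equal to a prefix of b, in contig ids."""
    for n in range(min(len(a), len(b)), 0, -1):
        if a[-n:] == b[:n]:
            return n
    return 0


# Toy data (hypothetical): three short "contigs" and two "long reads".
contigs = {"c1": "AAAACCCCGG", "c2": "TTGTGAGACT", "c3": "GCATCGATTA"}
index = build_kmer_index(contigs, k=4)
read1 = "AAAACTCCGGTTTGTGAGACT"  # c1 with one substitution, then c2
read2 = "TTGTGAGACTAGCATCGATTA"  # c2, then c3
r1 = compress_read(read1, index, k=4)
r2 = compress_read(read2, index, k=4)
print(r1, r2, suffix_prefix_overlap(r1, r2))
```

In this toy case the substitution in `read1` removes several k-mers of `c1`, yet the read still compresses to the same contig list it would have had without the error, which is the sense in which base-level errors can be skipped at this stage. The comparison between reads then operates on lists of a few identifiers instead of thousands of bases.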

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9c8e/5004134/eb57a4e19fa4/srep31900-f1.jpg
