Department of Data Science, Dana-Farber Cancer Institute, Boston, MA, USA.
Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA.
Nat Rev Genet. 2024 Sep;25(9):658-670. doi: 10.1038/s41576-024-00718-w. Epub 2024 Apr 22.
Genome sequences largely determine the biology and encode the history of an organism, and de novo assembly (the process of reconstructing the genome sequence of an organism from sequencing reads) has been a central problem in bioinformatics for four decades. Until recently, genomes were typically assembled into fragments of a few megabases at best, but technological advances in long-read sequencing now enable the near-complete assembly of each chromosome, also known as telomere-to-telomere assembly, for many organisms. Here, we review recent progress on assembly algorithms and protocols, with a focus on how to derive near-telomere-to-telomere assemblies. We also discuss the additional developments that will be required to resolve remaining assembly gaps and to assemble non-diploid genomes.