Majidian Sina, Hwang Stephen, Zakeri Mohsen, Langmead Ben
Department of Computer Science, Johns Hopkins University, 3400 North Charles St., Baltimore, MD 21218, United States.
XDBio Program, Johns Hopkins University, 3400 North Charles St., Baltimore, MD 21218, United States.
Brief Bioinform. 2025 May 1;26(3). doi: 10.1093/bib/bbaf267.
Advances in long-read sequencing technology have led to a rapid increase in high-quality genome assemblies. These assemblies make it possible to compare genome sequences across the Tree of Life, deepening our understanding of evolutionary relationships. Average nucleotide identity (ANI) is a metric for estimating the genetic similarity between two genomes, usually calculated as the mean identity of their shared genomic regions. These regions are typically found with genome aligners such as the Basic Local Alignment Search Tool (BLAST) or MUMmer. ANI has been applied to species delineation, building guide trees, and searching large sequence databases. Because computing ANI via genome alignment is computationally expensive, the field has increasingly turned to sketch-based approaches that rely on assumptions and heuristics to speed up the computation. We propose a suite of simulated and real benchmark datasets, together with a rank-correlation-based metric, to study how these assumptions and heuristics affect distance estimates. We call this evaluation framework EvANI. With EvANI, we show that ANIb is the ANI estimation algorithm that best captures tree distance, though it is also the least efficient. We show that k-mer-based approaches are extremely efficient and have consistently strong accuracy. We also show that some clades have inter-sequence distances that are best computed using multiple values of $k$, e.g. $k=10$ and $k=19$ for Chlamydiales. Finally, we highlight that approaches based on maximal exact matches may represent an advantageous compromise, achieving an intermediate level of computational efficiency while avoiding over-reliance on a single fixed k-mer length.
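The evaluation idea described above, scoring a distance estimator by how well its pairwise distances rank-correlate with tree distances, can be illustrated with a minimal sketch. This is not the EvANI implementation: the toy sequences, the hypothetical tree distances, the function names, and the use of exact k-mer Jaccard distance as a stand-in for a sketch-based ANI estimator are all assumptions made for illustration.

```python
import math
from itertools import combinations


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard_distance(a, b, k):
    """1 - Jaccard similarity of the k-mer sets of a and b.

    Assumption: used here as a simple stand-in for a k-mer-based
    distance; real tools typically sketch the k-mer sets instead.
    """
    ka, kb = kmers(a, k), kmers(b, k)
    union = ka | kb
    if not union:
        return 0.0
    return 1.0 - len(ka & kb) / len(union)


def ranks(values):
    """Assign 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for t in range(i, j + 1):
            result[order[t]] = average
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        return float("nan")  # undefined when one side is constant
    return cov / math.sqrt(vx * vy)


# Hypothetical toy genomes and tree (patristic) distances, for illustration only.
genomes = {
    "A": "ACGTACGTTGCAACGTAGCTAGCTAGGATCCA",
    "B": "ACGTACGTTGCAACGTAGCTAGCTAGGTTCCA",
    "C": "ACGAACGTTGGAACGTCGCTAGTTAGGTTCAA",
}
tree_distance = {("A", "B"): 0.02, ("A", "C"): 0.15, ("B", "C"): 0.14}

pairs = list(combinations(sorted(genomes), 2))
for k in (4, 8):
    estimated = [jaccard_distance(genomes[p], genomes[q], k) for p, q in pairs]
    reference = [tree_distance[(p, q)] for p, q in pairs]
    print(k, spearman(estimated, reference))
```

Looping over several values of k mirrors the observation that the best-performing k can differ between clades; a higher correlation means the estimator orders genome pairs more like the tree does.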