Department of Computer Science and Engineering, The Pennsylvania State University, State College, Pennsylvania 16801, USA.
Department of Population Health and Reproduction, University of California, Davis, California 95616, USA.
Genome Res. 2023 Jul;33(7):1061-1068. doi: 10.1101/gr.277651.123. Epub 2023 Jun 21.
Sketching methods offer computational biologists scalable techniques to analyze data sets that continue to grow in size. MinHash is one such technique to estimate set similarity that has enjoyed recent broad application. However, traditional MinHash has previously been shown to perform poorly when applied to sets of very dissimilar sizes. FracMinHash was recently introduced as a modification of MinHash to compensate for this lack of performance when set sizes differ. This approach has been successfully applied to metagenomic taxonomic profiling in the widely used tool sourmash gather. Although experimental evidence has been encouraging, FracMinHash has not yet been analyzed from a theoretical perspective. In this paper, we perform such an analysis to derive various statistics of FracMinHash, and prove that although FracMinHash is not unbiased (in the sense that its expected value is not equal to the quantity it attempts to estimate), this bias is easily corrected for both the containment and Jaccard index versions. Next, we show how FracMinHash can be used to compute point estimates as well as confidence intervals for evolutionary mutation distance between a pair of sequences by assuming a simple mutation model. We also investigate edge cases in which these analyses may fail to effectively warn the users of FracMinHash indicating the likelihood of such cases. Our analyses show that FracMinHash estimates the containment of a genome in a large metagenome more accurately and more precisely compared with traditional MinHash, and the point estimates and confidence intervals perform significantly better in estimating mutation distances.
草图方法为计算生物学家提供了可扩展的技术,可用于分析规模不断增长的数据集。MinHash 是一种用于估计集合相似度的技术,最近得到了广泛的应用。然而,传统的 MinHash 已被证明在应用于非常不同大小的集合时表现不佳。FracMinHash 最近被引入作为 MinHash 的一种改进,以弥补当集合大小不同时性能不足的问题。这种方法已成功应用于广泛使用的 sourmash gather 中的宏基因组分类分析。尽管实验证据令人鼓舞,但 FracMinHash 尚未从理论角度进行分析。在本文中,我们从理论角度对 FracMinHash 进行了各种分析,得出了 FracMinHash 的各种统计数据,并证明了尽管 FracMinHash 不是无偏的(从其期望值不等于它试图估计的数量的意义上来说),但这种偏差可以很容易地纠正包含和 Jaccard 指数版本的偏差。接下来,我们展示了如何通过假设一个简单的突变模型,使用 FracMinHash 来计算一对序列之间的进化突变距离的点估计和置信区间。我们还研究了这些分析可能无法有效地警告 FracMinHash 用户的边缘情况,指示了这种情况发生的可能性。我们的分析表明,与传统的 MinHash 相比,FracMinHash 更准确和精确地估计了基因组在大型宏基因组中的包含情况,点估计和置信区间在估计突变距离方面表现得更好。