Estimating similarity and distance using FracMinHash.

Author information

Rahman Hera Mahmudur, Koslicki David

Affiliations

School of Electrical Engineering and Computer Science, Pennsylvania State University, University Park, USA.

Huck Institutes of the Life Sciences, Pennsylvania State University, University Park, USA.

Publication information

Algorithms Mol Biol. 2025 May 15;20(1):8. doi: 10.1186/s13015-025-00276-8.

Abstract

MOTIVATION

The growing number and volume of genomic and metagenomic datasets call for scalable and robust computational models for precise analysis. Sketching techniques that use k-mers from a biological sample have proven useful for large-scale analyses. In recent years, FracMinHash has emerged as a popular sketching technique and has been used in several applications. Recent studies of FracMinHash established unbiased estimators for the containment and Jaccard indices. However, theoretical investigations for other metrics are still lacking.

THEORETICAL CONTRIBUTIONS

In this paper, we present a theoretical framework for estimating similarity/distance metrics by using FracMinHash sketches, when the metric is expressible in a certain form. We establish conditions under which such an estimation is sound and recommend a minimum scale factor s for accurate results. Experimental evidence supports our theoretical findings.
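As a rough illustration of the idea, the following is a minimal Python sketch of FracMinHash-style sketching followed by a cosine similarity estimate. It is not the frac-kmc implementation and does not reproduce the paper's estimators. The hash function (BLAKE2b truncated to 64 bits), the function names, and the choice to compute cosine similarity over k-mer count vectors restricted to the sketched k-mers are all assumptions made for this example. The parameter `scale` plays the role of the scale factor s: a k-mer is kept only if its hash falls below `scale` times the size of the hash space.

```python
import hashlib
from collections import Counter
from math import sqrt

# Size of the hash space; 64-bit hashes are an assumption of this sketch.
MAX_HASH = 2**64


def hash_kmer(kmer):
    """Map a k-mer to a 64-bit integer (illustrative hash choice)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def frac_minhash_sketch(seq, k, scale):
    """Keep every k-mer whose hash is below scale * MAX_HASH, with its count."""
    threshold = scale * MAX_HASH
    sketch = Counter()
    for i in range(len(seq) - k + 1):
        h = hash_kmer(seq[i:i + k])
        if h < threshold:
            sketch[h] += 1
    return sketch


def cosine_estimate(a, b):
    """Cosine similarity of two count vectors restricted to sketched k-mers."""
    dot = sum(a[h] * b[h] for h in a.keys() & b.keys())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty sketch carries no information
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    x = "ACGTACGTTGCAACGTAGCTAGCTAGGATCCA" * 20
    y = "ACGTACGTTGCAACGTAGCTAGCTAGGATGGA" * 20
    exact = cosine_estimate(frac_minhash_sketch(x, 8, 1.0),
                            frac_minhash_sketch(y, 8, 1.0))
    approx = cosine_estimate(frac_minhash_sketch(x, 8, 0.25),
                             frac_minhash_sketch(y, 8, 0.25))
    print(exact, approx)
```

With `scale=1.0` every k-mer is retained, so the estimate coincides with the cosine similarity of the full count vectors. Smaller values of `scale` shrink the sketch, and the estimate becomes noisier, which is the trade-off that a recommended minimum scale factor is meant to control.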

PRACTICAL CONTRIBUTIONS

We also present frac-kmc, a fast and efficient FracMinHash sketch generator. frac-kmc is the fastest known FracMinHash sketch generator and delivers accurate and precise cosine similarity estimates on real data. It is also the first parallel tool for this task: sketch generation can be sped up across multiple CPU cores, an option that existing serial tools lack. We show that FracMinHash sketches computed with frac-kmc let us estimate pairwise similarity quickly and accurately on real data. frac-kmc is freely available at https://github.com/KoslickiLab/frac-kmc/.
