Designing efficient randstrobes for sequence similarity analyses.

Authors

Karami Moein, Soltani Mohammadi Aryan, Martin Marcel, Ekim Barış, Shen Wei, Guo Lidong, Xu Mengyang, Pibiri Giulio Ermanno, Patro Rob, Sahlin Kristoffer

Affiliations

Department of Mathematics, Science for Life Laboratory, Stockholm University, Stockholm 106 91, Sweden.

Department of Biochemistry and Biophysics, National Bioinformatics Infrastructure Sweden, Science for Life Laboratory, Stockholm University, Solna SE-17121, Sweden.

Publication

Bioinformatics. 2024 Mar 29;40(4). doi: 10.1093/bioinformatics/btae187.

Abstract

MOTIVATION

Substrings of length k, commonly referred to as k-mers, play a vital role in sequence analysis. However, k-mers are limited to exact matches between sequences, which has motivated alternative constructs. We recently introduced a class of new constructs, strobemers, that can match across substitutions and smaller insertions and deletions. Randstrobes, the most sensitive strobemers proposed in Sahlin (Effective sequence similarity detection with strobemers. Genome Res 2021a;31:2080-94. https://doi.org/10.1101/gr.275648.121), have been used in several bioinformatics applications such as read classification, short-read mapping, and read overlap detection. Recently, we showed that the more pseudo-random the behavior of the construction (measured in entropy), the more efficient the seeds are for sequence similarity analysis. The level of pseudo-randomness depends on the construction operators, but no study has investigated their efficacy.
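To make the construction concrete, the following is a minimal sketch of an order-2 randstrobe sampler, not the authors' implementation. It assumes a BLAKE2b-based k-mer hash and an XOR link operator, both chosen only for illustration; the function names (`kmer_hash`, `randstrobes`) and the window parameters `w_min` and `w_max` are hypothetical. The second strobe is found with a naive linear scan over the window, which is the baseline cost that faster construction methods aim to reduce.

```python
import hashlib


def kmer_hash(kmer: str) -> int:
    """Deterministic 64-bit hash of a k-mer (illustrative choice only)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def randstrobes(seq: str, k: int, w_min: int, w_max: int) -> list[tuple[int, int]]:
    """Return (i, j) start positions of order-2 randstrobes in seq.

    The first strobe is the k-mer at i. The second strobe is the k-mer at
    the position j in [i + w_min, i + w_max] that minimizes the link value
    hash(strobe1) XOR hash(strobe2). XOR is an assumed link operator.
    Use w_min >= k so that the two strobes do not overlap.
    """
    hashes = [kmer_hash(seq[i:i + k]) for i in range(len(seq) - k + 1)]
    seeds = []
    for i, h1 in enumerate(hashes):
        lo = i + w_min
        hi = min(i + w_max, len(hashes) - 1)  # shrink the window at the sequence end
        if lo > hi:
            break  # no candidate position left for a second strobe
        # Naive O(window) scan for the minimizing second strobe.
        j = min(range(lo, hi + 1), key=lambda p: h1 ^ hashes[p])
        seeds.append((i, j))
    return seeds


if __name__ == "__main__":
    example = "ACGTTGCAAGCTTAGGCTAACGTAGGCTTA"
    for i, j in randstrobes(example, k=5, w_min=6, w_max=10):
        print(i, j, example[i:i + 5], example[j:j + 5])
```

Because the second strobe is picked by a pseudo-random link value rather than at a fixed offset, an edit between the two strobes can leave the seed unchanged, which is the property that lets strobemers match across small indels. Swapping the XOR for another operator changes how uniformly positions in the window are sampled.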

RESULTS

In this study, we introduce novel construction methods, including a Binary Search Tree-based approach that improves time complexity over previous methods. To our knowledge, we are also the first to address biases in construction and design three metrics for measuring bias. Our evaluation shows that our methods have favorable speed and sampling uniformity compared to existing approaches. Lastly, guided by our results, we change the seed construction in strobealign, a short-read mapper, and find that the results change substantially. We suggest combining the two results to improve strobealign's accuracy for the shortest reads in our evaluated datasets. Our evaluation highlights sampling biases that can occur and provides guidance on which operators to use when implementing randstrobes.

AVAILABILITY AND IMPLEMENTATION

All methods and evaluation benchmarks are available in a public GitHub repository at https://github.com/Moein-Karami/RandStrobes. The scripts for running the strobealign analysis are found at https://github.com/NBISweden/strobealign-evaluation.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/eac0/11034988/80f12bcc8f83/btae187f1.jpg
