

PanKmer: k-mer-based and reference-free pangenome analysis.

Affiliation

The Plant Molecular and Cellular Biology Laboratory, The Salk Institute for Biological Studies, La Jolla, CA 92037, United States.

Publication information

Bioinformatics. 2023 Oct 3;39(10). doi: 10.1093/bioinformatics/btad621.

Abstract

SUMMARY

Pangenomes are replacing single reference genomes as the definitive representation of DNA sequence within a species or clade. Pangenome analysis predominantly leverages graph-based methods that require computationally intensive multiple genome alignments, do not scale to highly complex eukaryotic genomes, limit their scope to identifying structural variants (SVs), or incur bias by relying on a reference genome. Here, we present PanKmer, a toolkit designed for reference-free analysis of pangenome datasets consisting of dozens to thousands of individual genomes. PanKmer decomposes a set of input genomes into a table of observed k-mers and their presence-absence values in each genome. These are stored in an efficient k-mer index data format that encodes SNPs, INDELs, and SVs. It also includes functions for downstream analysis of the k-mer index, such as calculating sequence similarity statistics between individuals at whole-genome or local scales. For example, k-mers can be "anchored" in any individual genome to quantify sequence variability or conservation at a specific locus. This facilitates workflows with various biological applications, e.g. identifying cases of hybridization between plant species. PanKmer provides researchers with a valuable and convenient means to explore the full scope of genetic variation in a population, without reference bias.
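The presence-absence table described above can be illustrated with a small, self-contained sketch. This is not PanKmer's API or its on-disk index format: the function names (`canonical`, `kmers`, `build_index`, `jaccard`, `anchor_conservation`), the bitmask encoding, the use of canonical k-mers, the choice of Jaccard similarity, and the toy sequences are all assumptions made for illustration.

```python
# Illustrative sketch only; names and encoding are hypothetical, not PanKmer's.
COMPLEMENT = str.maketrans("ACGT", "TGCA")
DNA = set("ACGT")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    rc = kmer.translate(COMPLEMENT)[::-1]
    return min(kmer, rc)


def kmers(seq, k):
    """Yield canonical k-mers along seq, skipping windows with non-ACGT characters."""
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if set(window) <= DNA:
            yield canonical(window)


def build_index(genomes, k):
    """Map each k-mer to an integer bitmask; bit j is set if genome j contains it."""
    index = {}
    for j, seq in enumerate(genomes.values()):
        for km in kmers(seq, k):
            index[km] = index.get(km, 0) | (1 << j)
    return index


def jaccard(index, a, b):
    """Whole-genome similarity: shared k-mers divided by k-mers present in either genome."""
    both = sum(1 for m in index.values() if (m >> a) & 1 and (m >> b) & 1)
    either = sum(1 for m in index.values() if (m >> a) & 1 or (m >> b) & 1)
    return both / either if either else 0.0


def anchor_conservation(index, anchor_seq, k, n_genomes):
    """For each valid k-mer window along an anchor sequence, return the
    fraction of indexed genomes that contain that k-mer."""
    return [bin(index.get(km, 0)).count("1") / n_genomes
            for km in kmers(anchor_seq, k)]


# Toy input: three short made-up sequences standing in for genomes.
genomes = {
    "g1": "ACGTACGTAC",
    "g2": "ACGTACGAAC",
    "g3": "TTTTTTTTTT",
}
index = build_index(genomes, k=5)
similarity_g1_g2 = jaccard(index, 0, 1)
conservation_g1 = anchor_conservation(index, genomes["g1"], 5, len(genomes))
```

In this sketch, a low value in `conservation_g1` would mark a window of `g1` whose k-mer is rare among the indexed genomes, which is the intuition behind anchoring k-mers at a locus. A real pangenome index needs a far more compact representation than a Python dictionary.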

AVAILABILITY AND IMPLEMENTATION

PanKmer is implemented as a Python package with components written in Rust, released under a BSD license. The source code is available from the Python Package Index (PyPI) at https://pypi.org/project/pankmer/ as well as Gitlab at https://gitlab.com/salk-tm/pankmer. Full documentation is available at https://salk-tm.gitlab.io/pankmer/.


