RefSeq: expanding the Prokaryotic Genome Annotation Pipeline reach with protein family model curation.

Affiliation

National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, 45 Center Drive, Bethesda, MD 20892-6511, USA.

Publication information

Nucleic Acids Res. 2021 Jan 8;49(D1):D1020-D1028. doi: 10.1093/nar/gkaa1105.

DOI:10.1093/nar/gkaa1105

PMID:33270901

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC7779008/

Abstract

The Reference Sequence (RefSeq) project at the National Center for Biotechnology Information (NCBI) contains nearly 200 000 bacterial and archaeal genomes and 150 million proteins with up-to-date annotation. Changes in the Prokaryotic Genome Annotation Pipeline (PGAP) since 2018 have resulted in a substantial reduction in spurious annotation. The hierarchical collection of protein family models (PFMs) used by PGAP as evidence for structural and functional annotation was expanded to over 35 000 protein profile hidden Markov models (HMMs), 12 300 BlastRules and 36 000 curated CDD architectures. As a result, >122 million or 79% of RefSeq proteins are now named based on a match to a curated PFM. Gene symbols, Enzyme Commission numbers or supporting publication attributes are available on over 40% of the PFMs and are inherited by the proteins and features they name, facilitating multi-genome analyses and connections to the literature. In adherence with the principles of FAIR (findable, accessible, interoperable, reusable), the PFMs are available in the Protein Family Models Entrez database to any user. Finally, the reference and representative genome set, a taxonomically diverse subset of RefSeq prokaryotic genomes, is now recalculated regularly and available for download and homology searches with BLAST. RefSeq is found at https://www.ncbi.nlm.nih.gov/refseq/.
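As a rough consistency check on the coverage figures quoted in the abstract, the sketch below compares the total protein count implied by ">122 million or 79%" with the separately quoted "150 million proteins". The variable names are illustrative, and the inputs are the rounded values from the abstract, so the results are approximate.

```python
# Figures quoted in the abstract (rounded as published).
named_proteins = 122_000_000   # ">122 million" proteins named from a curated PFM
named_fraction = 0.79          # "79% of RefSeq proteins"
quoted_total = 150_000_000     # "150 million proteins"

# Total protein count implied by the two coverage figures.
implied_total = named_proteins / named_fraction

# Coverage obtained if the rounded total is used as the denominator instead.
fraction_from_quoted_total = named_proteins / quoted_total

print(f"implied total: {implied_total / 1e6:.1f} million")
print(f"coverage using quoted total: {fraction_from_quoted_total:.1%}")
```

The implied total comes out slightly above 150 million, which suggests that "150 million" is a rounded-down figure rather than a conflict with the 79% coverage.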
