National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, 45 Center Drive, Bethesda, MD 20892-6511, USA.
Nucleic Acids Res. 2021 Jan 8;49(D1):D1020-D1028. doi: 10.1093/nar/gkaa1105.
The Reference Sequence (RefSeq) project at the National Center for Biotechnology Information (NCBI) contains nearly 200 000 bacterial and archaeal genomes and 150 million proteins with up-to-date annotation. Changes in the Prokaryotic Genome Annotation Pipeline (PGAP) since 2018 have resulted in a substantial reduction in spurious annotation. The hierarchical collection of protein family models (PFMs) used by PGAP as evidence for structural and functional annotation was expanded to over 35 000 protein profile hidden Markov models (HMMs), 12 300 BlastRules and 36 000 curated CDD architectures. As a result, more than 122 million RefSeq proteins (79%) are now named based on a match to a curated PFM. Gene symbols, Enzyme Commission numbers or supporting publication attributes are available on over 40% of the PFMs and are inherited by the proteins and features they name, facilitating multi-genome analyses and connections to the literature. In keeping with the FAIR principles (findable, accessible, interoperable, reusable), the PFMs are available to any user in the Protein Family Models Entrez database. Finally, the reference and representative genome set, a taxonomically diverse subset of RefSeq prokaryotic genomes, is now recalculated regularly and is available for download and for homology searches with BLAST. RefSeq is found at https://www.ncbi.nlm.nih.gov/refseq/.