Wellcome Trust Sanger Institute, Hinxton CB10 1SA, Cambridgeshire, UK.
Comparative Genomics Lab, Instituto de Biologia Evolutiva, Universitat Pompeu Fabra, Barcelona, Spain.
Nucleic Acids Res. 2018 Aug 21;46(14):7070-7084. doi: 10.1093/nar/gky587.
Seventeen years after the sequencing of the human genome, the human proteome is still under revision. One in eight of the 22 210 coding genes listed by the Ensembl/GENCODE, RefSeq and UniProtKB reference databases is annotated differently across the three sets. We have carried out an in-depth investigation of the 2764 genes classified as coding by one or more sets of manual curators and as non-coding by others. Data from large-scale genetic variation analyses suggest that most are not under protein-like purifying selection and so are unlikely to code for functional proteins. A further 1470 genes annotated as coding in all three reference sets have characteristics that are typical of non-coding genes or pseudogenes. These potential non-coding genes also appear to be undergoing neutral evolution and have considerably less supporting transcript and protein evidence than other coding genes. We believe that the three reference databases currently overestimate the number of human coding genes by at least 2000, which complicates large-scale biomedical experiments and adds noise to them. Determining which potential non-coding genes do not code for proteins is a difficult but vitally important task, since the human reference proteome is a fundamental pillar of most basic research and supports almost all large-scale biomedical projects.
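The counts quoted in the abstract can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only: the constant and function names are hypothetical, and the last quantity assumes that the estimated overestimate of at least 2000 genes is drawn from the two flagged groups (2764 + 1470), which the abstract implies but does not state explicitly.

```python
# Figures quoted in the abstract (names are illustrative, not from the paper).
TOTAL_CODING_GENES = 22_210       # coding genes across the three reference sets
DISPUTED_GENES = 2_764            # coding in at least one set, not coding in another
UNANIMOUS_BUT_DOUBTFUL = 1_470    # coding in all three sets, with non-coding-like features
CLAIMED_MIN_OVERESTIMATE = 2_000  # lower bound on the overestimate stated by the authors


def summarize_counts():
    """Derive simple proportions from the counts stated in the abstract."""
    disputed_fraction = DISPUTED_GENES / TOTAL_CODING_GENES
    candidates = DISPUTED_GENES + UNANIMOUS_BUT_DOUBTFUL
    return {
        # Share of coding genes annotated differently across the sets.
        "disputed_fraction": disputed_fraction,
        # "One in N" form of the same share.
        "one_in_n": 1 / disputed_fraction,
        # All genes flagged as potentially non-coding.
        "candidate_noncoding": candidates,
        "candidate_fraction": candidates / TOTAL_CODING_GENES,
        # Assumption: the >= 2000 overestimate comes from the flagged genes only.
        "min_share_needed": CLAIMED_MIN_OVERESTIMATE / candidates,
    }


if __name__ == "__main__":
    for name, value in summarize_counts().items():
        print(f"{name}: {value:.4g}")
```

Under these assumptions, the disputed genes make up roughly 12% of the total, consistent with "one in eight", and the two flagged groups together contain 4234 genes, so a little under half of them would need to be non-coding to reach the stated lower bound of 2000.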