DeepMind, London, UK.
European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, UK.
Nature. 2021 Aug;596(7873):590-596. doi: 10.1038/s41586-021-03828-1. Epub 2021 Jul 22.
Protein structures can provide invaluable information, both for reasoning about biological processes and for enabling interventions such as structure-based drug development or targeted mutagenesis. After decades of effort, 17% of the total residues in human protein sequences are covered by an experimentally determined structure. Here we markedly expand the structural coverage of the proteome by applying the state-of-the-art machine learning method, AlphaFold, at a scale that covers almost the entire human proteome (98.5% of human proteins). The resulting dataset covers 58% of residues with a confident prediction, of which a subset (36% of all residues) have very high confidence. We introduce several metrics developed by building on the AlphaFold model and use them to interpret the dataset, identifying strong multi-domain predictions as well as regions that are likely to be disordered. Finally, we provide some case studies to illustrate how high-quality predictions could be used to generate biological hypotheses. We are making our predictions freely available to the community and anticipate that routine large-scale and high-accuracy structure prediction will become an important tool that will allow new questions to be addressed from a structural perspective.
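The coverage figures above (58% of residues with a confident prediction, 36% with very high confidence) are fractions of residues whose per-residue confidence exceeds a cut-off. The following is a minimal sketch of that arithmetic, not the authors' pipeline. It assumes each residue carries a confidence score on a 0-100 scale, such as AlphaFold's pLDDT, and that the cut-offs are 70 (confident) and 90 (very high confidence); these thresholds are not stated in the abstract and are assumptions here. The function name `coverage_fractions` and the toy scores are hypothetical.

```python
from typing import Sequence, Tuple


def coverage_fractions(
    scores: Sequence[float],
    confident: float = 70.0,   # assumed cut-off for a "confident" residue
    very_high: float = 90.0,   # assumed cut-off for "very high confidence"
) -> Tuple[float, float]:
    """Return (fraction confident, fraction very high confidence).

    `scores` holds one confidence value per residue on a 0-100 scale.
    Very-high-confidence residues are a subset of the confident ones,
    so the second fraction can never exceed the first.
    """
    if not scores:
        raise ValueError("scores must contain at least one residue")
    n = len(scores)
    n_confident = sum(1 for s in scores if s > confident)
    n_very_high = sum(1 for s in scores if s > very_high)
    return n_confident / n, n_very_high / n


if __name__ == "__main__":
    # Toy per-residue scores, invented purely for illustration.
    toy_scores = [95, 92, 85, 75, 60, 40, 30, 20, 91, 72]
    frac_confident, frac_very_high = coverage_fractions(toy_scores)
    print(f"confident: {frac_confident:.0%}, very high: {frac_very_high:.0%}")
```

Applied across every residue in a set of predictions, the two returned fractions correspond to the kind of proteome-wide coverage numbers quoted in the abstract.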