Yang Kevin K, Zanichelli Niccolò, Yeh Hugh
Microsoft Research, 1 Memorial Drive, Cambridge, MA, USA.
OpenBioML.
Protein Eng Des Sel. 2023 Jan 21;36. doi: 10.1093/protein/gzad015.
Self-supervised pretraining on protein sequences has led to state-of-the-art performance on protein function and fitness prediction. However, sequence-only methods ignore the rich information contained in experimental and predicted protein structures. Meanwhile, inverse folding methods reconstruct a protein's amino-acid sequence given its structure, but do not take advantage of sequences that do not have known structures. In this study, we train a masked inverse folding protein masked language model parameterized as a structured graph neural network. During pretraining, this model learns to reconstruct corrupted sequences conditioned on the backbone structure. We then show that using the outputs from a pretrained sequence-only protein masked language model as input to the inverse folding model further improves pretraining perplexity. We evaluate both of these models on downstream protein engineering tasks and analyze the effect of using information from experimental or predicted structures on performance.
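The abstract describes two ideas that a short illustration can make concrete: (i) mask a fraction of residues and train a structure-conditioned model to reconstruct them from the backbone, and (ii) optionally feed per-residue outputs from a pretrained sequence-only masked language model into that model ("sequence transfer"). The sketch below is a minimal, hypothetical PyTorch rendering of these objectives; `StructuredDecoder`, the GRU mixer, and all tensor shapes are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch of masked inverse folding pretraining with optional sequence transfer.
# `StructuredDecoder` stands in for the structured graph neural network described in
# the abstract; a bidirectional GRU mixes residue features purely for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_AA = 20          # canonical amino acids
MASK_IDX = NUM_AA    # extra token index marking corrupted residues


class StructuredDecoder(nn.Module):
    def __init__(self, hidden_dim: int = 128):
        super().__init__()
        self.embed = nn.Embedding(NUM_AA + 1, hidden_dim)
        self.coord_proj = nn.Linear(4 * 3, hidden_dim)   # N, CA, C, O backbone atoms
        self.mlm_proj = nn.Linear(NUM_AA, hidden_dim)    # optional sequence-MLM features
        self.mix = nn.GRU(hidden_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden_dim, NUM_AA)

    def forward(self, coords, seq_tokens, seq_mlm_logits=None):
        # coords: (B, L, 4, 3); seq_tokens: (B, L); seq_mlm_logits: (B, L, NUM_AA) or None
        h = self.embed(seq_tokens) + self.coord_proj(coords.flatten(-2))
        if seq_mlm_logits is not None:
            # "Sequence transfer": add features derived from a pretrained
            # sequence-only masked language model's per-residue outputs.
            h = h + self.mlm_proj(seq_mlm_logits)
        h, _ = self.mix(h)
        return self.head(h)                              # (B, L, NUM_AA) logits


def masked_inverse_folding_loss(model, coords, seq, seq_mlm_logits=None, mask_frac=0.15):
    """Corrupt a fraction of residues, then reconstruct them given the backbone."""
    mask = torch.rand(seq.shape) < mask_frac
    corrupted = seq.masked_fill(mask, MASK_IDX)
    logits = model(coords, corrupted, seq_mlm_logits)
    # Cross-entropy is computed only at the corrupted positions.
    return F.cross_entropy(logits[mask], seq[mask])


# Toy usage: one protein of length 64 with random backbone coordinates.
model = StructuredDecoder()
coords = torch.randn(1, 64, 4, 3)
seq = torch.randint(0, NUM_AA, (1, 64))
loss = masked_inverse_folding_loss(model, coords, seq)
loss.backward()
```

With `seq_mlm_logits` supplied (for example, per-residue predictions from a pretrained sequence-only masked language model, projected to 20 classes), the same loss realizes the sequence-transfer variant; without it, the sketch reduces to plain masked inverse folding.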