

Masked inverse folding with sequence transfer for protein representation learning.

Author information

Yang Kevin K, Zanichelli Niccolò, Yeh Hugh

Affiliations

Microsoft Research, 1 Memorial Drive, Cambridge, MA, USA.

OpenBioML.

Publication information

Protein Eng Des Sel. 2023 Jan 21;36. doi: 10.1093/protein/gzad015.

Abstract

Self-supervised pretraining on protein sequences has led to state-of-the-art performance on protein function and fitness prediction. However, sequence-only methods ignore the rich information contained in experimental and predicted protein structures. Meanwhile, inverse folding methods reconstruct a protein's amino-acid sequence given its structure, but do not take advantage of sequences that do not have known structures. In this study, we train a masked inverse folding protein masked language model parameterized as a structured graph neural network. During pretraining, this model learns to reconstruct corrupted sequences conditioned on the backbone structure. We then show that using the outputs from a pretrained sequence-only protein masked language model as input to the inverse folding model further improves pretraining perplexity. We evaluate both of these models on downstream protein engineering tasks and analyze the effect of using information from experimental or predicted structures on performance.
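The abstract describes the pretraining objective only at a high level. Below is a minimal, hedged sketch of that objective: mask a fraction of residues and train the model to recover them conditioned on per-residue backbone features. This is not the authors' implementation; the paper parameterizes the model as a structured graph neural network, whereas this toy uses a bidirectional GRU as a stand-in mixer, and all module names, dimensions, and the flattened-coordinate "structure features" are assumptions made purely for illustration.

```python
# Toy sketch of masked inverse folding pretraining (NOT the paper's model).
# Assumptions: 20-letter amino-acid alphabet, an extra [MASK] token, and
# per-residue backbone features given as 9 numbers (e.g. flattened N/CA/C coords).

import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_AA = 20          # amino-acid alphabet size
MASK_IDX = NUM_AA    # extra index used as the mask token
D_MODEL = 128

class ToyMaskedInverseFolding(nn.Module):
    """Stand-in for a structured GNN: embeds masked sequence tokens,
    mixes in per-residue backbone features, and predicts the original
    amino acid at every position."""
    def __init__(self):
        super().__init__()
        self.tok_embed = nn.Embedding(NUM_AA + 1, D_MODEL)   # +1 for [MASK]
        self.struct_proj = nn.Linear(9, D_MODEL)              # backbone features per residue
        self.mixer = nn.GRU(D_MODEL, D_MODEL, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * D_MODEL, NUM_AA)

    def forward(self, tokens, backbone_feats):
        h = self.tok_embed(tokens) + self.struct_proj(backbone_feats)
        h, _ = self.mixer(h)
        return self.head(h)                                   # (B, L, NUM_AA) logits

def mask_tokens(tokens, p=0.15):
    """Corrupt a random fraction of positions with the mask token."""
    mask = torch.rand_like(tokens, dtype=torch.float) < p
    return tokens.masked_fill(mask, MASK_IDX), mask

# One toy pretraining step on random data.
B, L = 4, 64
seq = torch.randint(0, NUM_AA, (B, L))           # ground-truth sequence
coords = torch.randn(B, L, 9)                    # stand-in backbone features
model = ToyMaskedInverseFolding()
corrupted, mask = mask_tokens(seq)
logits = model(corrupted, coords)
loss = F.cross_entropy(logits[mask], seq[mask])  # loss only on masked positions
loss.backward()
print(float(loss))
```

Under the same caveats, the sequence-transfer variant described in the abstract could be sketched by additionally projecting per-residue outputs from a pretrained sequence-only masked language model and adding them to the hidden state alongside the token and structure terms.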

