

Masked inverse folding with sequence transfer for protein representation learning.

Author information

Yang Kevin K, Zanichelli Niccolò, Yeh Hugh

Affiliations

Microsoft Research, 1 Memorial Drive, Cambridge, MA, USA.

OpenBioML.

Publication information

Protein Eng Des Sel. 2023 Jan 21;36. doi: 10.1093/protein/gzad015.

Abstract

Self-supervised pretraining on protein sequences has led to state-of-the-art performance on protein function and fitness prediction. However, sequence-only methods ignore the rich information contained in experimental and predicted protein structures. Meanwhile, inverse folding methods reconstruct a protein's amino-acid sequence given its structure, but do not take advantage of sequences that do not have known structures. In this study, we train a masked inverse folding protein masked language model parameterized as a structured graph neural network. During pretraining, this model learns to reconstruct corrupted sequences conditioned on the backbone structure. We then show that using the outputs from a pretrained sequence-only protein masked language model as input to the inverse folding model further improves pretraining perplexity. We evaluate both of these models on downstream protein engineering tasks and analyze the effect of using information from experimental or predicted structures on performance.
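The abstract describes the pretraining objective only at a high level. Below is a minimal, hedged sketch of that objective: mask a fraction of residues and train the model to recover them conditioned on per-residue backbone features. This is not the authors' implementation; the paper parameterizes the model as a structured graph neural network, whereas this toy uses a bidirectional GRU as a stand-in mixer, and all module names, dimensions, and the flattened-coordinate "structure features" are assumptions made purely for illustration.

```python
# Toy sketch of masked inverse folding pretraining (NOT the paper's model).
# Assumptions: 20-letter amino-acid alphabet, an extra [MASK] token, and
# per-residue backbone features given as 9 numbers (e.g. flattened N/CA/C coords).

import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_AA = 20          # amino-acid alphabet size
MASK_IDX = NUM_AA    # extra index used as the mask token
D_MODEL = 128

class ToyMaskedInverseFolding(nn.Module):
    """Stand-in for a structured GNN: embeds masked sequence tokens,
    mixes in per-residue backbone features, and predicts the original
    amino acid at every position."""
    def __init__(self):
        super().__init__()
        self.tok_embed = nn.Embedding(NUM_AA + 1, D_MODEL)   # +1 for [MASK]
        self.struct_proj = nn.Linear(9, D_MODEL)              # backbone features per residue
        self.mixer = nn.GRU(D_MODEL, D_MODEL, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * D_MODEL, NUM_AA)

    def forward(self, tokens, backbone_feats):
        h = self.tok_embed(tokens) + self.struct_proj(backbone_feats)
        h, _ = self.mixer(h)
        return self.head(h)                                   # (B, L, NUM_AA) logits

def mask_tokens(tokens, p=0.15):
    """Corrupt a random fraction of positions with the mask token."""
    mask = torch.rand_like(tokens, dtype=torch.float) < p
    return tokens.masked_fill(mask, MASK_IDX), mask

# One toy pretraining step on random data.
B, L = 4, 64
seq = torch.randint(0, NUM_AA, (B, L))           # ground-truth sequence
coords = torch.randn(B, L, 9)                    # stand-in backbone features
model = ToyMaskedInverseFolding()
corrupted, mask = mask_tokens(seq)
logits = model(corrupted, coords)
loss = F.cross_entropy(logits[mask], seq[mask])  # loss only on masked positions
loss.backward()
print(float(loss))
```

Under the same caveats, the sequence-transfer variant described in the abstract could be sketched by additionally projecting per-residue outputs from a pretrained sequence-only masked language model and adding them to the hidden state alongside the token and structure terms.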

