Improving the prediction of protein stability changes upon mutations by geometric learning and a pre-training strategy.

Affiliations

MOE Key Laboratory of Bioinformatics, School of Life Sciences, Tsinghua University, Beijing, China.

Beijing Frontier Research Center for Biological Structure, Tsinghua University, Beijing, China.

Publication information

Nat Comput Sci. 2024 Nov;4(11):840-850. doi: 10.1038/s43588-024-00716-2. Epub 2024 Oct 25.

DOI:10.1038/s43588-024-00716-2

PMID:39455825

Abstract

Accurate prediction of protein mutation effects is of great importance in protein engineering and design. Here we propose GeoStab-suite, a suite of three geometric learning-based models (GeoFitness, GeoDDG and GeoDTm) for predicting the fitness score, ΔΔG and ΔTm of a protein upon mutation, respectively. GeoFitness uses a specialized loss function that allows supervised training of a single unified model on the large amount of multi-labeled fitness data in the deep mutational scanning database. To improve the downstream tasks of ΔΔG and ΔTm prediction, the encoder of GeoFitness is reused as a pre-trained module in GeoDDG and GeoDTm, which compensates for the scarcity of labeled data for these tasks. This pre-training strategy, in combination with data expansion, markedly improves model performance and generalizability. In the benchmark test, GeoDDG and GeoDTm outperform the other state-of-the-art methods by at least 30% and 70%, respectively, in terms of the Spearman correlation coefficient.
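
The benchmark figures above are reported as Spearman correlation coefficients, that is, the Pearson correlation between the ranks of the predicted and measured values. As a minimal sketch of how such a score could be computed (not the authors' evaluation code), the following pure-Python example ranks both lists with ties averaged and correlates the ranks. The function names and the ΔΔG numbers are hypothetical and chosen only for illustration.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the whole run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical values for five mutations (illustration only, not real data).
measured = [-1.2, 0.3, 2.1, 0.8, -0.5]
predicted = [-0.9, 0.1, 1.0, 1.5, -0.7]
rho = spearman(measured, predicted)
```

Because the score depends only on rank order, it rewards a model for ordering mutations correctly even when the predicted magnitudes are off. In this made-up example two of the five mutations swap rank, so `rho` falls just short of 1.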
