Gated-GPS: enhancing protein-protein interaction site prediction with scalable learning and imbalance-aware optimization.

Authors

Gao Xin, Cao Hanqun, Li Jinpeng, Qiu Jiezhong, Chen Guangyong, Heng Pheng-Ann

Affiliations

Department of Computer Science and Engineering, University of California San Diego, 9500 Gilman Drive, La Jolla, California 92093, USA.

Department of Computer Science and Engineering, The Chinese University of Hong Kong, Ma Liu Shui, Shatin, Hong Kong SAR, China.

Publication information

Brief Bioinform. 2025 May 1;26(3). doi: 10.1093/bib/bbaf248.

Abstract

In protein-protein interaction site (PPIS) prediction, existing machine learning models struggle with small datasets, limiting their predictive accuracy for unseen proteins. Additionally, class imbalance in protein complexes, where binding residues constitute a small fraction of all residues, hinders model performance. To address these challenges, we constructed a training dataset 9× larger than previous benchmarks by filtering the latest protein-protein complex data, improving diversity and generalization. We propose Gated-GPS, a Graph Transformer model with a novel gating mechanism designed to effectively leverage this expanded dataset. Additionally, we integrate cross-entropy loss with Tversky Loss to adjust sensitivity to positive and negative samples, mitigating class imbalance by emphasizing underrepresented binding residues. Experimental results show that Gated-GPS outperforms state-of-the-art (SOTA) models across four test sets. Notably, on the UBTest dataset, designed to evaluate generalization on unbounded proteins, our method improves MCC and AUPRC by 18.5% and 21.4%, respectively, over the previous SOTA. In a case study of snake venom toxin-protein interactions, our model accurately identified interaction sites, demonstrating its potential for therapeutic design and advancing the understanding of complex protein interactions.
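The abstract does not spell out how the cross-entropy and Tversky terms are combined, so the following is only a minimal sketch of one common formulation, not the authors' implementation. It assumes per-residue binding probabilities and 0/1 labels, a simple weighted sum of the two losses, and illustrative values for `alpha`, `beta`, and `weight`; all function and parameter names are hypothetical.

```python
import math


def binary_cross_entropy(probs, labels, eps=1e-7):
    """Mean binary cross-entropy over residues."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(probs)


def tversky_loss(probs, labels, alpha=0.3, beta=0.7, smooth=1e-6):
    """1 - Tversky index, using soft counts from predicted probabilities.

    alpha weights false positives and beta weights false negatives.
    The defaults are illustrative assumptions, not values from the paper.
    """
    tp = sum(p * y for p, y in zip(probs, labels))
    fp = sum(p * (1 - y) for p, y in zip(probs, labels))
    fn = sum((1 - p) * y for p, y in zip(probs, labels))
    index = (tp + smooth) / (tp + alpha * fp + beta * fn + smooth)
    return 1.0 - index


def combined_loss(probs, labels, weight=1.0, alpha=0.3, beta=0.7):
    """Cross-entropy plus a weighted Tversky term (assumed weighted sum)."""
    ce = binary_cross_entropy(probs, labels)
    return ce + weight * tversky_loss(probs, labels, alpha, beta)


if __name__ == "__main__":
    labels = [1, 1, 0, 0]           # two binding residues, two non-binding
    missed = [1.0, 0.0, 0.0, 0.0]   # one binding residue missed
    false_alarm = [1.0, 1.0, 1.0, 0.0]  # one non-binding residue flagged
    print(tversky_loss(missed, labels), tversky_loss(false_alarm, labels))
```

With `beta` larger than `alpha`, a missed binding residue costs more than a false alarm of the same size, which is one way a Tversky term can emphasize the rare positive class. Setting `alpha = beta = 0.5` reduces the Tversky index to the Dice coefficient.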
