Wang Ruheng, Jin Junru, Zou Quan, Nakai Kenta, Wei Leyi
School of Software, Shandong University, Jinan 250101, China.
Joint SDU-NTU Centre for Artificial Intelligence Research (C-FAIR), Shandong University, Jinan 250101, China.
Bioinformatics. 2022 Jun 27;38(13):3351-3360. doi: 10.1093/bioinformatics/btac352.
Identifying protein-peptide binding residues is fundamentally important for understanding the mechanisms of protein function and for drug discovery. Although several computational methods have been developed, most of them rely heavily on third-party tools or complex data preprocessing for feature design, which easily leads to low computational efficiency and poor predictive performance. To address these limitations, we propose PepBCL, a novel BERT (Bidirectional Encoder Representations from Transformers)-based contrastive learning framework that predicts protein-peptide binding residues from protein sequences alone. PepBCL is an end-to-end predictive model that is independent of feature engineering. Specifically, we introduce a well pre-trained protein language model that automatically extracts and learns high-level latent representations of protein sequences relevant to protein structure and function. Further, we design a novel contrastive learning module to optimize the feature representations of binding residues in the imbalanced dataset. We demonstrate that the proposed method significantly outperforms state-of-the-art methods in benchmark comparisons and achieves more robust performance. Moreover, we found that performance can be further improved by integrating traditional features with our learnt features. Interestingly, the interpretable analysis of our model highlights the flexibility and adaptability of the deep learning-based protein language model in capturing both conserved and non-conserved sequence characteristics of peptide-binding residues. Finally, to facilitate the use of our method, we established an online prediction platform implementing PepBCL, which is available at http://server.wei-group.net/PepBCL/.
Availability and implementation: https://github.com/Ruheng-W/PepBCL.
Supplementary data are available at Bioinformatics online.
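The abstract does not spell out the contrastive objective, so the following is only a minimal sketch of the general idea, assuming the classic margin-based pairwise contrastive loss rather than the authors' exact formulation. The function names, the toy two-dimensional embeddings, and the `margin` value are all hypothetical; in PepBCL the per-residue embeddings would come from the pre-trained protein language model. Residue pairs with the same label (binding or non-binding) are pulled together, and pairs with different labels are pushed at least `margin` apart, which is one common way to sharpen the representation of a minority class.

```python
import math
from itertools import combinations


def euclidean(u, v):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def pairwise_contrastive_loss(embeddings, labels, margin=1.0):
    """Mean margin-based contrastive loss over all residue pairs.

    Assumption: this is the textbook pairwise loss, not necessarily
    the loss used in PepBCL.
    """
    total, count = 0.0, 0
    for i, j in combinations(range(len(embeddings)), 2):
        d = euclidean(embeddings[i], embeddings[j])
        if labels[i] == labels[j]:
            # Same class: penalize any distance between the pair.
            total += d ** 2
        else:
            # Different class: penalize only pairs closer than the margin.
            total += max(0.0, margin - d) ** 2
        count += 1
    return total / count if count else 0.0


# Hypothetical toy data: 1 = binding residue, 0 = non-binding residue.
toy_embeddings = [[0.0, 0.0], [1.0, 0.0], [0.0, 0.0]]
toy_labels = [1, 1, 0]
loss = pairwise_contrastive_loss(toy_embeddings, toy_labels, margin=1.0)
```

In a real training loop this quantity would be computed on model outputs with an automatic-differentiation library and combined with a classification loss; the pure-Python version here is meant only to make the pull-together and push-apart behaviour concrete.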