Zhang Zimeng, Wang Xin, Shang Wenhui
School of Science, Dalian Maritime University, Dalian, 116026, China.
Funct Integr Genomics. 2025 Jul 4;25(1):147. doi: 10.1007/s10142-025-01641-x.
Anticancer peptides (ACPs) are promising candidates for cancer therapy because of their safety, low side effects, and high target specificity. However, ACP discovery is slowed by the high cost and labor-intensive nature of experimental validation, which has left only a limited number of confirmed ACPs. Although various computational methods have been proposed, existing models commonly suffer from three critical limitations: reliance on small-scale datasets, lack of interpretable feature learning mechanisms, and insufficient generalization capability. To address these challenges, this study constructs a larger and more diverse dataset by consolidating data from existing literature and databases, and proposes a novel deep learning predictive model named iACP-DPNet. The model uses the protein language model ProtBert with positional encoding to convert protein sequences into feature vectors, then applies a two-step feature selection process via LightGBM and MIC. The selected features are processed by a causal dilated convolution network. A dual-pooling mechanism, which integrates parallel GlobalAveragePooling and attention pooling layers, is designed to help the model jointly capture local critical residues and global sequence context. Compared with traditional single-pooling models (e.g., ACP-MHCNN), this architecture significantly improves feature extraction capability. To understand the model's decision-making process, we employ t-SNE to visualize key steps, ISM to interpret sequence regions, and SHAP analysis to evaluate feature importance, which together improve the model's interpretability. Under rigorous tenfold cross-validation on the new dataset, the model achieves a specificity (Sp) of 96.1%, sensitivity (Sn) of 92.91%, accuracy (Acc) of 94.5%, and Matthews correlation coefficient (MCC) of 89.05%, significantly outperforming existing state-of-the-art methods in comparative analyses.
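The four metrics reported above all follow from the counts of a binary confusion matrix. The following is a minimal Python sketch, assuming ACPs are the positive class and that both classes are present; the function name `binary_metrics` and the example counts are hypothetical and are not taken from the study.

```python
import math

def binary_metrics(tp, tn, fp, fn):
    """Return Sn, Sp, Acc and MCC from binary confusion-matrix counts.

    Assumes at least one positive and one negative sample, so the
    Sn and Sp denominators are non-zero.
    """
    sn = tp / (tp + fn)                      # sensitivity: recall on positives (ACPs)
    sp = tn / (tn + fp)                      # specificity: recall on negatives (non-ACPs)
    acc = (tp + tn) / (tp + tn + fp + fn)    # overall accuracy
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when a marginal count is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Sn": sn, "Sp": sp, "Acc": acc, "MCC": mcc}

# Hypothetical counts for illustration only (not the paper's results).
m = binary_metrics(tp=90, tn=95, fp=5, fn=10)
print({k: round(v, 4) for k, v in m.items()})
```

MCC ranges from -1 to 1, so an MCC quoted as a percentage, as in this abstract, presumably corresponds to the coefficient multiplied by 100.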
Furthermore, to assess its generalizability, we evaluated iACP-DPNet on an additional dataset, where it outperformed other current models. In conclusion, iACP-DPNet exhibits strong performance and generalizability, demonstrating the effectiveness of its design for ACP prediction. This research provides a robust and interpretable framework for advancing anticancer peptide discovery.
HIGHLIGHTS:
• We have established a larger and more diverse dataset for ACP prediction, addressing the limitations of existing datasets and providing a robust foundation for model training and evaluation.
• The dual-pooling layer mechanism (GlobalAveragePooling and attention pooling) strengthens the model's capacity to learn diverse features, ultimately enhancing its prediction efficiency.
• We employed t-SNE visualization and ISM-based interpretability analysis to provide insights into the model's decision-making process, highlighting key regions and amino acids critical for ACP functionality.
• The iACP-DPNet model demonstrates strong generalizability across diverse datasets, making it a reliable tool for ACP prediction and potentially other peptide-related tasks.
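The dual-pooling idea named in the highlights can be illustrated with a small NumPy sketch. This is not the authors' implementation: the abstract does not say how the two branches are combined or how attention scores are computed, so the concatenation of the two pooled vectors and the single scoring vector `w` are assumptions, and all function names and shapes here are hypothetical.

```python
import numpy as np

def global_average_pool(h):
    """Average over the sequence axis: (L, C) -> (C,)."""
    return h.mean(axis=0)

def attention_pool(h, w):
    """Softmax-weighted sum over positions: (L, C), (C,) -> (C,), (L,).

    Assumption: each position is scored by a dot product with a
    learned vector `w`; the paper may use a different scoring function.
    """
    scores = h @ w                   # one scalar score per position
    scores = scores - scores.max()   # shift for a numerically stable softmax
    alpha = np.exp(scores)
    alpha = alpha / alpha.sum()      # attention weights sum to 1
    return alpha @ h, alpha

def dual_pool(h, w):
    """Run both pooling branches in parallel and concatenate them (assumed)."""
    avg = global_average_pool(h)         # global sequence context
    att, alpha = attention_pool(h, w)    # emphasis on high-scoring positions
    return np.concatenate([avg, att]), alpha

# Hypothetical feature map: 30 sequence positions, 8 channels.
rng = np.random.default_rng(0)
h = rng.normal(size=(30, 8))
w = rng.normal(size=8)               # stands in for a learned parameter
pooled, alpha = dual_pool(h, w)
print(pooled.shape, alpha.shape)
```

In this sketch the average branch weights every position equally, while the attention branch can concentrate on a few positions, which matches the abstract's stated goal of modeling global context and local critical residues together.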