

KaMLs for Predicting Protein pKa Values and Ionization States: Are Trees All You Need?

Author information

Shen Mingzhe, Kortzak Daniel, Ambrozak Simon, Bhatnagar Shubham, Buchanan Ian, Liu Ruibin, Shen Jana

Affiliations

Department of Pharmaceutical Sciences, University of Maryland School of Pharmacy, Baltimore, MD 21201, U.S.A.

Department of Computer Science, University of Maryland College Park, College Park, MD 20742, U.S.A.

Publication information

bioRxiv. 2025 Jan 30:2024.11.09.622800. doi: 10.1101/2024.11.09.622800.

Abstract

Despite its importance in understanding biology and in computer-aided drug discovery, the accurate prediction of protein ionization states remains a formidable challenge. Physics-based approaches struggle to capture the small, competing contributions in the complex protein environment, while machine learning (ML) is hampered by the scarcity of experimental data. Here we report the development of pKa ML (KaML) models based on decision trees and graph attention networks (GAT), exploiting physicochemical understanding and a new experimental pKa database (PKAD-3) enriched with highly shifted pKa values. KaML-CBtree significantly outperforms the current state of the art in predicting pKa values and ionization states across all six titratable amino acids, notably achieving accurate predictions for deprotonated cysteines and lysines, a blind spot in previous models. The superior performance of KaMLs is achieved in part through several innovations, including separate treatment of acids and bases, data augmentation using AlphaFold structures, and model pretraining on a theoretical pKa database. We also introduce the classification of protonation states as a metric for evaluating pKa prediction models. A meta-feature analysis suggests a possible reason why the lightweight tree model outperforms the more complex deep learning GAT. We release an end-to-end pKa predictor based on KaML-CBtree and the new PKAD-3 database, which facilitates a variety of applications and provides the foundation for further advances in protein electrostatics research.
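The abstract's idea of scoring a pKa predictor by the protonation states it implies can be illustrated with a minimal Python sketch. This is not the authors' code: the function names, the reference pH of 7.0, and the simple "pKa above pH means protonated" rule are assumptions made for illustration, and the numbers in the usage note are made up. The only established ingredient is the Henderson-Hasselbalch relation between pKa, pH, and the protonated fraction.

```python
from typing import Sequence


def protonated_fraction(pka: float, ph: float = 7.0) -> float:
    """Henderson-Hasselbalch: fraction of sites that hold the titratable proton."""
    return 1.0 / (1.0 + 10.0 ** (ph - pka))


def protonation_state(pka: float, ph: float = 7.0) -> str:
    """Label a site by its dominant state at the given pH (assumed decision rule)."""
    return "protonated" if pka > ph else "deprotonated"


def state_agreement(predicted: Sequence[float],
                    experimental: Sequence[float],
                    ph: float = 7.0) -> float:
    """Fraction of sites whose predicted state matches the experimental state."""
    if len(predicted) != len(experimental) or not experimental:
        raise ValueError("inputs must be non-empty and of equal length")
    matches = sum(
        protonation_state(p, ph) == protonation_state(e, ph)
        for p, e in zip(predicted, experimental)
    )
    return matches / len(experimental)
```

For example, with hypothetical predicted values `[6.5, 10.0, 8.2]` and experimental values `[6.0, 10.5, 6.8]`, the first two sites agree in state and the third does not, so `state_agreement` returns two thirds. A metric of this kind penalizes an error that crosses the reference pH even when the numerical pKa error is small, which a root-mean-square error alone does not capture.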


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/dfda/11781493/2a61cb53c51c/nihpp-2024.11.09.622800v3-f0001.jpg
