Liu Ruibin, Clayton Joseph, Shen Mingzhe, Bhatnagar Shubham, Shen Jana
Department of Pharmaceutical Sciences, University of Maryland School of Pharmacy, Baltimore, Maryland 21201, United States.
Division of Applied Regulatory Science, Office of Clinical Pharmacology, Center for Drug Evaluation and Research, U.S. Food and Drug Administration, Silver Spring, Maryland 20993, United States.
JACS Au. 2024 Apr 5;4(4):1374-1384. doi: 10.1021/jacsau.3c00749. eCollection 2024 Apr 22.
Machine learning (ML) identification of covalently ligandable sites may accelerate targeted covalent inhibitor design and help expand the druggable proteome space. Here, we report the rigorous development and validation of tree-based models and convolutional neural networks (CNNs) trained on a newly curated database (LigCys3D) of over 1000 liganded cysteines in nearly 800 proteins, represented by over 10,000 three-dimensional structures in the Protein Data Bank (PDB). In unseen tests, the tree models and CNNs yielded areas under the receiver operating characteristic curve of 94% and 93%, respectively. Based on AlphaFold2-predicted structures, the ML models recapitulated the newly liganded cysteines in the PDB with recall values over 90%. To assist the covalent drug discovery community, we report the predicted ligandable cysteines in 392 human kinases and their locations in the sequence-aligned kinase structure, including the PH and SH2 domains. Furthermore, we disseminate a searchable online database, LigCys3D (https://ligcys.computchem.org/), and a web prediction server, DeepCys (https://deepcys.computchem.org/), both of which will be continuously updated and improved by including newly published experimental data. The present work represents an important step toward the ML-led integration of big genome data and structure models to annotate the human proteome space for next-generation covalent drug discovery.
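The two metrics quoted in the abstract, area under the receiver operating characteristic curve and recall, can be illustrated with a minimal sketch. This is not the authors' evaluation code: the labels and scores below are made-up toy values, the function names are hypothetical, and the 0.5 decision threshold used for recall is an assumption rather than a value taken from the paper.

```python
def roc_auc(labels, scores):
    """ROC AUC as the probability that a random positive outranks a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def recall(labels, scores, threshold=0.5):
    """Fraction of true positives whose score reaches the threshold (assumed 0.5)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    if not positives:
        raise ValueError("need at least one positive example")
    true_positives = sum(1 for s in positives if s >= threshold)
    return true_positives / len(positives)


# Toy data only: 1 = liganded cysteine, 0 = unliganded; scores are hypothetical
# model outputs, not results from the paper.
labels = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.1, 0.2, 0.7]

print(f"ROC AUC: {roc_auc(labels, scores):.4f}")
print(f"Recall at 0.5: {recall(labels, scores):.2f}")
```

The pairwise definition of AUC is used here because it needs no external library and makes the meaning of the metric explicit; for large data sets a rank-based implementation would be the more practical choice.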