Prediction of the water solubility by a graph convolutional-based neural network on a highly curated dataset.

Author information

Ulrich Nadin, Voigt Karsten, Kudria Anton, Böhme Alexander, Ebert Ralf-Uwe

Affiliations

Department of Exposure Science, Helmholtz Centre for Environmental Research-UFZ, Permoserstrasse 15, 04318, Leipzig, Germany.

PAULY, Theresienstrasse 50, 04129, Leipzig, Germany.

Publication information

J Cheminform. 2025 Apr 21;17(1):55. doi: 10.1186/s13321-025-01000-9.

Abstract

Water solubility is a relevant physico-chemical property in environmental chemistry, toxicology, and drug design. Alongside the octanol-water partition coefficient, melting point, and boiling point, water solubility is one of the properties with a large amount of available experimental data, yet for many more compounds in the chemical universe this information is still lacking. Thus, prediction tools with a broad application domain are needed to fill the corresponding data gaps. To this end, we developed a graph convolutional neural network model (GNN) to predict water solubility in the form of log S, based on a highly curated dataset of 9800 chemicals. We started our model development with a curation workflow for the AqSolDB data, which left 7605 data points. We then added 2195 chemicals with experimental data found in the literature. In the final dataset, log S values range from -13.17 to 0.50. Higher values were excluded by a cut-off introduced to eliminate fully miscible chemicals. We developed a consensus GNN from a fivefold split into a training set (70% of the data) and a validation set (20%), and used the remaining 10% as an independent test set to evaluate the performance of the individual splits and of the consensus model. By doing so, we achieved an r² of 0.901, a q² of 0.896, and an rmse of 0.657 on our independently selected test set, which is close to the experimental error of 0.5 to 0.6 log units. We further provide information on the application domain and compare our performance to that of other existing prediction tools.

Scientific contribution: Based on a highly curated dataset, we developed a neural network to predict the water solubility of chemicals for a broad application domain. We curated the data in a step-wise procedure, in which we identified various errors in the experimental data. Based on an independent test set, we compare our prediction results to those of the available prediction models.
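The split and evaluation scheme described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several points are assumptions because the abstract does not specify them: the five splits are drawn here as random re-partitions of the non-test data, the consensus is taken as the arithmetic mean of the five model predictions, r² is taken as the squared Pearson correlation, and q² is computed relative to the training-set mean. All function names are hypothetical, and the GNN itself is left out.

```python
import math
import random


def split_indices(n, n_folds=5, test_frac=0.10, val_frac=0.20, seed=0):
    """Hold out one fixed test set, then draw n_folds train/validation splits.

    Assumption: each fold is a fresh random partition of the non-test data.
    """
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    n_test = round(n * test_frac)
    n_val = round(n * val_frac)
    test_idx, rest = idx[:n_test], idx[n_test:]
    folds = []
    for _ in range(n_folds):
        shuffled = rest[:]
        rng.shuffle(shuffled)
        folds.append((shuffled[n_val:], shuffled[:n_val]))  # (train, validation)
    return test_idx, folds


def consensus(predictions):
    """Average per-compound predictions over models (one list per model).

    Assumption: the consensus is a plain arithmetic mean.
    """
    return [sum(col) / len(col) for col in zip(*predictions)]


def rmse(y_true, y_pred):
    """Root mean squared error in log S units."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r_squared(y_true, y_pred):
    """Squared Pearson correlation between observed and predicted values."""
    mt = sum(y_true) / len(y_true)
    mp = sum(y_pred) / len(y_pred)
    cov = sum((t - mt) * (p - mp) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mt) ** 2 for t in y_true)
    var_p = sum((p - mp) ** 2 for p in y_pred)
    return cov ** 2 / (var_t * var_p)


def q_squared(y_true, y_pred, train_mean):
    """Predictive squared correlation coefficient.

    Assumption: the reference sum of squares uses the training-set mean.
    """
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_ref = sum((t - train_mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_ref


# With the 9800 compounds of the final dataset:
test_idx, folds = split_indices(9800)
print(len(test_idx), len(folds[0][0]), len(folds[0][1]))  # 980 6860 1960
```

In such a setup, one model would be trained per fold, each model would predict log S for the 980 held-out compounds, and `consensus` would combine the five prediction lists before `rmse`, `r_squared`, and `q_squared` are evaluated on the test set.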

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/68ec/12012962/06b33baf9702/13321_2025_1000_Fig1_HTML.jpg