Chemical named entity recognition in the texts of scientific publications using the naïve Bayes classifier approach.

Author information

Tarasova O A, Rudik A V, Biziukova N Yu, Filimonov D A, Poroikov V V

Affiliation

Laboratory of Structure-Function Based Drug Design, Institute of Biomedical Chemistry, 10 bldg. 8, Pogodinskaya Str., Moscow, 119121, Russia.

Publication information

J Cheminform. 2022 Aug 13;14(1):55. doi: 10.1186/s13321-022-00633-4.

Abstract

MOTIVATION

Chemical named entity recognition (CNER) algorithms retrieve information about chemical compound identifiers from texts and allow these identifiers to be associated with physical-chemical properties and biological activities. Scientific texts are weakly formalized sources of information. Most CNER methods are based on machine learning approaches, including conditional random fields and deep neural networks. In general, most machine learning approaches require either vector or sparse word representations of texts. Chemical named entities (CNEs) constitute only a small fraction of the whole text, so the datasets used for training are highly imbalanced.

METHODS AND RESULTS

We propose a new method for extracting CNEs from texts based on the naïve Bayes classifier combined with specially developed filters. In contrast to earlier CNER methods, our approach represents the data as a set of fragments of text (FoTs) and then prepares a set of multi-n-grams (sequences of one to n symbols) for each FoT. Our approach may enable the recognition of novel CNEs. For the CHEMDNER corpus, sensitivity (recall) was 0.95, precision was 0.74, specificity was 0.88, and balanced accuracy was 0.92, based on five-fold cross-validation. We applied the developed algorithm to the extraction of CNEs of potential severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) main protease (Mpro) inhibitors. A set of CNEs corresponding to the chemical substances evaluated in the biochemical assays used for the discovery of Mpro inhibitors was retrieved. Manual analysis of the corresponding texts showed that CNEs of potential SARS-CoV-2 Mpro inhibitors were successfully identified by our method.
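For readers less familiar with the four metrics reported above, the following minimal sketch shows how they are computed from confusion-matrix counts, using the standard definitions. The function name `binary_metrics` and the counts are hypothetical and chosen only for illustration; they are not the confusion matrix from the study.

```python
def binary_metrics(tp, fp, tn, fn):
    """Compute standard binary classification metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)          # recall: share of true CNEs that were found
    precision = tp / (tp + fp)            # share of predicted CNEs that are true CNEs
    specificity = tn / (tn + fp)          # share of non-CNEs correctly rejected
    balanced_accuracy = (sensitivity + specificity) / 2
    return {
        "sensitivity": sensitivity,
        "precision": precision,
        "specificity": specificity,
        "balanced_accuracy": balanced_accuracy,
    }


# Hypothetical counts (NOT from the paper): 100 positives, 1000 negatives.
metrics = binary_metrics(tp=95, fp=120, tn=880, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Balanced accuracy is the mean of sensitivity and specificity, so the reported values are mutually consistent within rounding: (0.95 + 0.88) / 2 = 0.915. Precision, unlike the other three, depends on the ratio of positives to negatives, which is why it is worth reporting separately for imbalanced data.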

CONCLUSION

The obtained results show that the proposed method can be used for filtering out words that are not related to CNEs; therefore, it can be successfully applied to the extraction of CNEs for the purposes of cheminformatics and medicinal chemistry.
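The filtering idea can be illustrated with a small sketch that scores a fragment of text by the multi-n-grams it contains. This is a generic naïve Bayes classifier with Laplace smoothing, written under our own assumptions; it is not the authors' implementation and does not reproduce their filters. The names `multi_ngrams`, `NgramNaiveBayes`, and `log_odds`, the choice `n_max=3`, and the toy training words are all hypothetical.

```python
import math
from collections import Counter


def multi_ngrams(fragment, n_max=3):
    """Return all character sequences of length 1..n_max found in a fragment."""
    return [fragment[i:i + n]
            for n in range(1, n_max + 1)
            for i in range(len(fragment) - n + 1)]


class NgramNaiveBayes:
    """Toy two-class naive Bayes over character multi-n-grams (1 = CNE, 0 = other)."""

    def __init__(self, n_max=3, alpha=1.0):
        self.n_max = n_max
        self.alpha = alpha                      # Laplace smoothing constant
        self.counts = {0: Counter(), 1: Counter()}
        self.n_fragments = {0: 0, 1: 0}
        self.vocab = set()

    def fit(self, fragments, labels):
        for fragment, label in zip(fragments, labels):
            grams = multi_ngrams(fragment.lower(), self.n_max)
            self.counts[label].update(grams)
            self.n_fragments[label] += 1
            self.vocab.update(grams)
        return self

    def log_odds(self, fragment):
        """Log of P(CNE | fragment) / P(other | fragment) under the naive assumption."""
        totals = {y: sum(self.counts[y].values()) for y in (0, 1)}
        v = len(self.vocab)
        score = math.log((self.n_fragments[1] + self.alpha) /
                         (self.n_fragments[0] + self.alpha))
        for gram in multi_ngrams(fragment.lower(), self.n_max):
            p1 = (self.counts[1][gram] + self.alpha) / (totals[1] + self.alpha * v)
            p0 = (self.counts[0][gram] + self.alpha) / (totals[0] + self.alpha * v)
            score += math.log(p1 / p0)
        return score


# Toy training data, invented for illustration only.
chemical = ["aspirin", "ibuprofen", "acetaminophen", "benzene",
            "chloroquine", "2-methylpropanol"]
other = ["patients", "treated", "with", "study", "showed", "results"]

model = NgramNaiveBayes(n_max=3).fit(chemical + other,
                                     [1] * len(chemical) + [0] * len(other))

# Fragments with a positive score would be kept as candidate CNEs.
for word in ["benzene", "study", "methylbenzene"]:
    print(word, round(model.log_odds(word), 2))
```

Because the score is built from short character sequences rather than whole words, a fragment that never occurred in training can still receive a high score if it shares n-grams with known chemical names, which is one plausible reading of why such a representation may help with novel CNEs.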

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0e27/9375323/0ae0bcb12f7e/13321_2022_633_Fig1_HTML.jpg
