Department of Bioinformatics, Institute of Biomedical Chemistry, 10 Building 8, Pogodinskaya Street, Moscow 119121, Russia.
Computer-Aided Drug Design Group, Chemical Biology Laboratory, Center for Cancer Research, National Cancer Institute, Frederick, Maryland 21702, United States.
J Chem Inf Model. 2019 Sep 23;59(9):3635-3644. doi: 10.1021/acs.jcim.9b00164. Epub 2019 Sep 10.
Large amounts of high-quality data on the biological activity of chemical compounds are required throughout the drug discovery process, from the development of computational models of the structure-activity relationship to the experimental testing of lead compounds and their validation in the clinic. A large amount of such data is currently available from databases, scientific publications, and patents. Biological data are characterized by incompleteness, uncertainty, and low reproducibility. Although free and commercial databases of the biological activities of compounds exist, they usually lack unambiguous information about the particulars of the biological assays. Scientific papers, on the other hand, are the primary source of new data disclosed to the scientific community for the first time. In this study, we developed and validated a data-mining approach for extracting text fragments that contain descriptions of bioassays. We used this approach to evaluate compounds and their biological activity reported in scientific publications. We found that papers can be categorized as relevant or irrelevant on the basis of a machine-learning analysis of their abstracts. Text fragments extracted from the full texts of publications allow their further partitioning into several classes according to the particulars of the bioassays. We demonstrate the applicability of our approach to the comparison of the endpoint values of biological activity and cytotoxicity of reference compounds.
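The abstract-level categorization of papers into relevant and irrelevant can be illustrated with a minimal sketch. The abstract does not name the machine-learning algorithm or features used, so the multinomial naive Bayes classifier over bag-of-words counts below is an assumption chosen for brevity, not the authors' method. The class name `NaiveBayesAbstractClassifier`, the `tokenize` helper, and the four training abstracts are all hypothetical and invented for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class NaiveBayesAbstractClassifier:
    """Multinomial naive Bayes with add-one smoothing (illustrative only)."""

    def __init__(self):
        self.word_counts = {}        # label -> Counter of token frequencies
        self.doc_counts = Counter()  # label -> number of training abstracts
        self.vocabulary = set()

    def fit(self, abstracts, labels):
        for text, label in zip(abstracts, labels):
            tokens = tokenize(text)
            self.doc_counts[label] += 1
            self.word_counts.setdefault(label, Counter()).update(tokens)
            self.vocabulary.update(tokens)
        return self

    def predict(self, text):
        tokens = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label, n_docs in self.doc_counts.items():
            counts = self.word_counts[label]
            total_words = sum(counts.values())
            # Log prior plus smoothed log likelihood of each token.
            score = math.log(n_docs / total_docs)
            for tok in tokens:
                score += math.log((counts[tok] + 1) / (total_words + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical toy training set; real training data would be curated abstracts.
TRAIN = [
    ("The compound inhibited HIV-1 reverse transcriptase with an IC50 of 12 nM "
     "in an enzymatic assay", "relevant"),
    ("Cytotoxicity was measured in MT-4 cells and EC50 values were determined "
     "for each inhibitor", "relevant"),
    ("We review the history of funding policy for public health programs",
     "irrelevant"),
    ("A survey of patient attitudes toward clinic waiting times is presented",
     "irrelevant"),
]

model = NaiveBayesAbstractClassifier().fit(
    [text for text, _ in TRAIN], [label for _, label in TRAIN]
)

if __name__ == "__main__":
    query = "IC50 values of the inhibitor were determined in a cell assay"
    print(model.predict(query))
```

With realistic training data, the same interface could be applied to the extracted full-text fragments to partition papers by bioassay type, although that step would need assay-specific labels that this toy example does not provide.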