NOVA Information Management School (NOVA IMS), Universidade Nova de Lisboa, 1070-312, Lisbon, Portugal.
Faculty of Computer and Information Science, University of Ljubljana, Ljubljana, Slovenia.
Sci Rep. 2024 Oct 23;14(1):25016. doi: 10.1038/s41598-024-76440-8.
Life sciences research and experimentation are resource-intensive, requiring extensive trials and considerable time. Often, experiments do not achieve their intended objectives, but progress is made through trial and error, eventually leading to breakthroughs. Machine learning is transforming this traditional approach by providing methods that accelerate discovery. Deep Learning is becoming increasingly prominent in chemistry, with Convolutional Graph Networks (CGN) being a key focus, though other approaches also show significant potential. This research applies Natural Language Processing (NLP) to evaluate the effectiveness of chemical language representations, specifically SMILES and SELFIES, in BERT-based models, using two tokenization methods: Byte Pair Encoding (BPE) and Atom Pair Encoding (APE), a novel approach developed in this study. The primary objective is to assess how these tokenization techniques influence the performance of chemical language models in biophysics and physiology classification tasks. Performance was evaluated in downstream classification tasks using three distinct datasets for HIV, toxicology, and blood-brain barrier penetration, with ROC-AUC serving as the evaluation metric. The findings reveal that APE, particularly when used with SMILES representations, significantly outperforms BPE by preserving the integrity and contextual relationships among chemical elements, thereby enhancing classification accuracy. This study highlights the critical role of tokenization in processing chemical language and suggests that refining these techniques could lead to significant advancements in drug discovery and materials science.
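To make the tokenization contrast concrete, the following is a minimal sketch of atom-level pair merging on SMILES strings. It is not the authors' APE implementation, whose details are not given in this abstract. It assumes that merging starts from whole atoms and SMILES symbols rather than from raw characters, so that a multi-character atom such as Cl or Br can never be split. The regular expression, the function names (atom_tokenize, learn_merges, apply_merge, encode), and the toy corpus are all hypothetical choices made for illustration.

```python
import re
from collections import Counter

# Assumed atom-level pre-tokenizer (illustrative, not from the paper):
# bracket atoms, two-letter halogens, two-digit ring closures, organic-subset
# atoms, aromatic atoms, bond/branch symbols, and single ring-closure digits.
SMILES_TOKEN = re.compile(
    r"\[[^\]]+\]|Br|Cl|%\d{2}|[BCNOPSFI]|[bcnops]|[=#\-+\\/:~@?>*$.()]|\d"
)


def atom_tokenize(smiles):
    """Split a SMILES string into atom-level tokens."""
    tokens = SMILES_TOKEN.findall(smiles)
    # findall silently skips unmatched characters, so verify nothing was lost.
    if "".join(tokens) != smiles:
        raise ValueError(f"unsupported characters in SMILES: {smiles!r}")
    return tokens


def apply_merge(seq, pair):
    """Replace every non-overlapping adjacent occurrence of `pair` with one token."""
    merged, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            merged.append(seq[i] + seq[i + 1])
            i += 2
        else:
            merged.append(seq[i])
            i += 1
    return merged


def learn_merges(corpus, num_merges):
    """Learn up to `num_merges` merges, BPE-style, over atom-level tokens."""
    sequences = [atom_tokenize(s) for s in corpus]
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for seq in sequences:
            pair_counts.update(zip(seq, seq[1:]))
        if not pair_counts:
            break
        best, count = pair_counts.most_common(1)[0]
        if count < 2:  # stop once no pair repeats
            break
        merges.append(best)
        sequences = [apply_merge(seq, best) for seq in sequences]
    return merges


def encode(smiles, merges):
    """Tokenize a SMILES string and apply the learned merges in order."""
    seq = atom_tokenize(smiles)
    for pair in merges:
        seq = apply_merge(seq, pair)
    return seq


if __name__ == "__main__":
    toy_corpus = ["CCO", "CCN", "CCC"]  # hypothetical toy data
    merges = learn_merges(toy_corpus, num_merges=5)
    print(merges)
    print(encode("CCCl", merges))
```

In this sketch the chlorine atom in "CCCl" always survives as a single token, because merges only ever join whole atom-level units. A character-level BPE vocabulary has no such guarantee and could, depending on the training corpus, learn a token that ends in the "C" of "Cl".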