

The Goldilocks paradigm: comparing classical machine learning, large language models, and few-shot learning for drug discovery applications.

Author information

Snyder Scott H, Vignaux Patricia A, Ozalp Mustafa Kemal, Gerlach Jacob, Puhl Ana C, Lane Thomas R, Corbett John, Urbina Fabio, Ekins Sean

Affiliations

Collaborations Pharmaceuticals, Inc., 840 Main Campus Drive, Lab 3510, Raleigh, NC, 27606, USA.

Publication information

Commun Chem. 2024 Jun 12;7(1):134. doi: 10.1038/s42004-024-01220-4.

Abstract

Recent advances in machine learning (ML) have led to newer model architectures, including transformers (large language models, LLMs), which show state-of-the-art results in text generation and image analysis, as well as few-shot learning (FSLC) models, which offer predictive power with extremely small datasets. These new architectures may offer promise, yet the 'no free lunch' theorem suggests that no single model algorithm can outperform all others at every possible task. Here, we explore the capabilities of classical (support vector regression, SVR), FSLC, and transformer (MolBART) models over a range of dataset tasks and show a 'goldilocks zone' for each model type, in which dataset size and feature distribution (i.e., dataset "diversity") determine the optimal algorithm strategy. When datasets are small (<50 molecules), FSLC tends to outperform both classical ML and transformers. When datasets are small-to-medium sized (50-240 molecules) and diverse, transformers outperform both classical models and few-shot learning. Finally, when datasets are larger and of sufficient size, classical models perform best, suggesting that the optimal model to choose likely depends on the dataset available: its size and its diversity. These findings may help to answer the perennial question of which ML algorithm to use when faced with a new dataset.
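The abstract's size/diversity thresholds amount to a simple decision rule. A minimal illustrative sketch of that rule follows; the function name `choose_model` and the boolean `diverse` flag are assumptions for illustration, not part of the paper, and in practice "diversity" would be measured from the feature distribution rather than passed as a flag.

```python
def choose_model(n_molecules: int, diverse: bool) -> str:
    """Pick a model family from the dataset-size/diversity 'goldilocks
    zones' described in the abstract (illustrative heuristic only)."""
    if n_molecules < 50:
        # Very small datasets: few-shot learning tends to outperform.
        return "few-shot (FSLC)"
    if n_molecules <= 240 and diverse:
        # Small-to-medium, diverse datasets favour the transformer.
        return "transformer (MolBART)"
    # Larger datasets: classical ML (e.g. SVR) performs best.
    return "classical (SVR)"
```

For example, `choose_model(30, True)` returns `"few-shot (FSLC)"`, while `choose_model(1000, False)` returns `"classical (SVR)"`.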


Fig. 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4c86/11169557/7152d0289e40/42004_2024_1220_Fig1_HTML.jpg
