The ensemble bridge algorithm: a new modeling tool for drug discovery problems.

Affiliation

Department of Statistics, West Virginia University, Morgantown, West Virginia 26506, USA.

Publication information

J Chem Inf Model. 2010 Feb 22;50(2):309-16. doi: 10.1021/ci9003392.

Abstract

Ensemble algorithms have been historically categorized into two separate paradigms, boosting and random forests, which differ significantly in the way each ensemble is constructed. Boosting algorithms represent one extreme, where an iterative greedy optimization strategy, weak learners (e.g., small classification trees), and stage weights are employed to target difficult-to-classify regions in the training space. On the other extreme, random forests rely on randomly selected features and complex learners (learners that exhibit low bias, e.g., large regression trees) to classify well over the entire training data. Because the approach is not targeting the next learner for inclusion, it tends to provide a natural robustness to noisy labels. In this work, we introduce the ensemble bridge algorithm, which is capable of transitioning between boosting and random forests using a regularization parameter nu in [0,1]. Because the ensemble bridge algorithm is a compromise between the greedy nature of boosting and the randomness present in random forests, it yields robust performance in the presence of a noisy response and superior performance in the presence of a clean response. Often, drug discovery data (e.g., computational chemistry data) have varying levels of noise. Hence, this method enables a practitioner to employ a single method to evaluate ensemble performance. The method's robustness is verified across a variety of data sets where the algorithm repeatedly yields better performance than either boosting or random forests alone. Finally, we provide diagnostic tools for the new algorithm, including a measure of variable importance and an observational clustering tool.
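The abstract does not give the algorithm itself, so the following is only a toy illustration of the general idea of one parameter sliding between a random, bagging-like ensemble and a greedy, boosting-like one. It is not the authors' ensemble bridge algorithm. Every name here (`fit_stump`, `blended_weights`, `fit_toy_bridge`, `predict`) is hypothetical, the choice that nu = 0 means uniform resampling and nu = 1 means AdaBoost-style reweighting is an assumption, and one-split stumps on a single feature stand in for the trees mentioned in the abstract.

```python
import math
import random


def fit_stump(xs, ys):
    """Fit a one-split classifier on 1-D inputs with labels in {-1, +1}."""
    best = None
    for t in sorted(set(xs)):
        for polarity in (1, -1):
            errors = sum(1 for x, y in zip(xs, ys)
                         if (polarity if x >= t else -polarity) != y)
            if best is None or errors < best[0]:
                best = (errors, t, polarity)
    return best[1], best[2]  # (threshold, polarity)


def stump_predict(stump, x):
    t, polarity = stump
    return polarity if x >= t else -polarity


def blended_weights(boost_w, nu):
    """Assumed blend: uniform weights at nu = 0, boosting weights at nu = 1."""
    n = len(boost_w)
    return [(1.0 - nu) / n + nu * w for w in boost_w]


def fit_toy_bridge(xs, ys, nu, rounds=20, seed=0):
    """Toy ensemble whose resampling and stage weights depend on nu in [0, 1]."""
    rng = random.Random(seed)
    n = len(xs)
    boost_w = [1.0 / n] * n  # AdaBoost-style observation weights
    ensemble = []
    for _ in range(rounds):
        # Resample the training set using the blended observation weights.
        idx = rng.choices(range(n), weights=blended_weights(boost_w, nu), k=n)
        stump = fit_stump([xs[i] for i in idx], [ys[i] for i in idx])
        preds = [stump_predict(stump, x) for x in xs]
        # Weighted training error, clamped so the logarithm stays finite.
        err = sum(w for w, p, y in zip(boost_w, preds, ys) if p != y)
        err = min(max(err, 1e-10), 1.0 - 1e-10)
        alpha = 0.5 * math.log((1.0 - err) / err)
        # Stage weight: equal votes at nu = 0, AdaBoost stage weights at nu = 1.
        ensemble.append(((1.0 - nu) + nu * alpha, stump))
        # Standard AdaBoost reweighting, then renormalisation.
        boost_w = [w * math.exp(-alpha * y * p)
                   for w, p, y in zip(boost_w, preds, ys)]
        total = sum(boost_w)
        boost_w = [w / total for w in boost_w]
    return ensemble


def predict(ensemble, x):
    score = sum(stage * stump_predict(stump, x) for stage, stump in ensemble)
    return 1 if score >= 0 else -1
```

Calling `fit_toy_bridge(xs, ys, nu=0.0)` gives equally weighted stumps fit on uniform bootstrap samples, while `nu=1.0` resamples by the boosting weights and votes with AdaBoost stage weights. Intermediate values mix the two, which is the kind of compromise the abstract describes in general terms.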
