

An experimental study of the intrinsic stability of random forest variable importance measures.

Authors

Wang Huazhen, Yang Fan, Luo Zhiyuan

Affiliations

College of Computer Science and Technology, Huaqiao University, Jimei Avenue, Xiamen, 361021, China.

Computer Learning Research Centre, Royal Holloway, University of London, Egham, Surrey, TW20 0EX, UK.

Publication information

BMC Bioinformatics. 2016 Feb 3;17:60. doi: 10.1186/s12859-016-0900-5.

Abstract

BACKGROUND

The stability of random forest Variable Importance Measures (VIMs) has recently received increased attention. Most work addresses traditional stability, that is, stability under data perturbations or parameter variations; few studies consider the influence of the intrinsic randomness involved in generating VIMs, i.e., bagging, randomization and permutation. To address this influence, we introduce the concept of intrinsic stability of VIMs, defined as the self-consistency among feature rankings in repeated runs of VIMs without data perturbations or parameter variations. Two widely used VIMs, Mean Decrease Accuracy (MDA) and Mean Decrease Gini (MDG), are comprehensively investigated. The motivation of this study is two-fold. First, we empirically verify the prevalence of intrinsic stability of VIMs over many real-world datasets, to highlight that the instability of VIMs does not originate exclusively from data perturbations or parameter variations but also stems from the intrinsic randomness of VIMs. Second, through Spearman and Pearson tests, we comprehensively investigate how different factors influence intrinsic stability.
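
The definition of intrinsic stability can be made concrete with a small sketch. The abstract does not state which similarity measure the authors use to compare feature rankings, so the mean pairwise Spearman rank correlation below is an assumption, as are the function names `ranks`, `spearman` and `intrinsic_stability`. Each inner list stands for the importance scores produced by one run on the same data with the same parameters, differing only in the random seed.

```python
from itertools import combinations


def ranks(scores):
    """Return average ranks (1 = smallest score); tied scores share their mean rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    result = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j over a run of tied scores.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(ranks(x), ranks(y))


def intrinsic_stability(importance_runs):
    """Mean pairwise Spearman correlation over repeated runs (assumed measure)."""
    pairs = list(combinations(importance_runs, 2))
    return sum(spearman(a, b) for a, b in pairs) / len(pairs)
```

A value near 1 means the repeated runs rank the features almost identically; lower values indicate intrinsic instability. The sketch needs at least two runs, and it raises a division-by-zero error if a run assigns the same score to every feature.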

RESULTS

The experiments are carried out on 19 benchmark datasets with diverse characteristics, including 10 high-dimensional, small-sample gene expression datasets. Experimental results demonstrate the prevalence of intrinsic stability of VIMs. Spearman and Pearson tests on the correlations between intrinsic stability and different factors show that #feature (number of features) and #sample (sample size) have a coupling effect on intrinsic stability. The synthetic indicator, #feature/#sample, shows both a negative monotonic correlation and a negative linear correlation with intrinsic stability, while OOB accuracy has a monotonic correlation with intrinsic stability. This indicates that high-dimensional, small-sample and high-complexity datasets may suffer more from intrinsic instability of VIMs. Furthermore, with respect to the parameter settings of random forest, a large number of trees is preferred. No significant correlations are seen between intrinsic stability and other factors. Finally, the magnitude of intrinsic stability is always smaller than that of traditional stability.
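
One plausible intuition for the preference for many trees can be shown with a toy Monte Carlo model. This is an illustration under stated assumptions, not the authors' experiment and not a real random forest: each "tree" is assumed to report a feature's true importance plus independent Gaussian noise, and the "forest" score is the average over trees. The names `forest_importance` and `run_to_run_spread`, the true importance of 0.5 and the noise level are all hypothetical.

```python
import random
import statistics


def forest_importance(true_importance, n_trees, rng, noise_sd=1.0):
    """Average of n_trees noisy per-tree scores for one feature (toy model)."""
    return statistics.fmean(
        true_importance + rng.gauss(0.0, noise_sd) for _ in range(n_trees)
    )


def run_to_run_spread(n_trees, n_runs=30, seed=0):
    """Standard deviation of the averaged score across repeated runs."""
    rng = random.Random(seed)
    scores = [forest_importance(0.5, n_trees, rng) for _ in range(n_runs)]
    return statistics.stdev(scores)


if __name__ == "__main__":
    # Compare the run-to-run spread for small and large ensembles.
    for n_trees in (10, 100, 1000):
        print(n_trees, round(run_to_run_spread(n_trees), 4))
```

Under these assumptions the spread of the averaged score shrinks roughly as one over the square root of the number of trees, so repeated runs disagree less as the ensemble grows. Real per-tree importances are neither independent nor Gaussian, so this shows only the direction of the effect.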

CONCLUSION

First, the prevalence of intrinsic stability of VIMs demonstrates that the instability of VIMs comes not only from data perturbations or parameter variations but also from the intrinsic randomness of VIMs. This finding gives a better understanding of VIM stability and may help reduce the instability of VIMs. Second, by investigating the potential factors behind intrinsic stability, users will be more aware of the risks and hence more careful when using VIMs, especially on high-dimensional, small-sample and high-complexity datasets.


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/8524/4739337/e845919c7603/12859_2016_900_Fig1_HTML.jpg
