Whalen Sean, Pandey Om Prakash, Pandey Gaurav
Gladstone Institutes, University of California, San Francisco, CA, USA.
Icahn Institute for Genomics and Multiscale Biology and Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY, USA.
Methods. 2016 Jan 15;93:92-102. doi: 10.1016/j.ymeth.2015.08.016. Epub 2015 Sep 2.
Prediction problems in biomedical sciences, including protein function prediction (PFP), are generally quite difficult. This is due in part to incomplete knowledge of the cellular phenomenon of interest, the appropriateness and quality of the variables and measurements used for prediction, and a lack of consensus regarding the ideal predictor for specific problems. In such scenarios, a powerful approach to improving prediction performance is to construct heterogeneous ensemble predictors that combine the outputs of diverse individual predictors capturing complementary aspects of the problems and/or datasets. In this paper, we demonstrate the potential of such heterogeneous ensembles, derived from stacking and ensemble selection methods, for addressing PFP and other similar biomedical prediction problems. Deeper analysis of these results shows that the superior predictive ability of these methods, especially stacking, can be attributed to their attention to the following aspects of the ensemble learning process: (i) better balance of diversity and performance, (ii) more effective calibration of outputs and (iii) more robust incorporation of additional base predictors. Finally, to make the effective application of heterogeneous ensembles to large complex datasets (big data) feasible, we present DataSink, a distributed ensemble learning framework, and demonstrate its sound scalability using the examined datasets. DataSink is publicly available from https://github.com/shwhalen/datasink.
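As a rough illustration of the ensemble selection idea named in the abstract, the sketch below greedily adds base predictors (with replacement) to an averaged ensemble for as long as a validation metric improves. It is not taken from DataSink: the function names, the stopping rule, the use of AUC as the metric, and the toy scores for the hypothetical predictors "svm", "rf" and "nb" are all assumptions made for this example.

```python
from typing import Dict, List, Sequence, Tuple


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve by pairwise comparison (ties count as half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def ensemble_selection(
    base_scores: Dict[str, List[float]],
    labels: List[int],
    max_rounds: int = 10,
) -> Tuple[List[str], float]:
    """Greedy forward selection with replacement (assumed variant).

    Each round adds the base predictor whose inclusion gives the best
    validation AUC for the averaged ensemble; selection stops as soon as
    no candidate strictly improves on the current ensemble.
    """
    chosen: List[str] = []
    running_sum = [0.0] * len(labels)
    best_so_far = float("-inf")
    for _ in range(max_rounds):
        best_name, best_value = None, float("-inf")
        for name, scores in base_scores.items():
            size = len(chosen) + 1
            candidate = [(t + s) / size for t, s in zip(running_sum, scores)]
            value = auc(labels, candidate)
            if value > best_value:
                best_name, best_value = name, value
        if best_name is None or best_value <= best_so_far:
            break  # no candidate improves the ensemble
        chosen.append(best_name)
        running_sum = [t + s for t, s in zip(running_sum, base_scores[best_name])]
        best_so_far = best_value
    return chosen, best_so_far


# Toy validation-set outputs from three hypothetical base predictors.
labels = [1, 1, 1, 0, 0, 0]
base_scores = {
    "svm": [0.9, 0.4, 0.8, 0.5, 0.2, 0.1],
    "rf": [0.6, 0.9, 0.3, 0.2, 0.7, 0.1],
    "nb": [0.5, 0.5, 0.5, 0.5, 0.5, 0.5],
}
selected, score = ensemble_selection(base_scores, labels)
print(selected, score)
```

In this toy case neither "svm" nor "rf" separates the classes perfectly on its own, but their errors fall on different examples, so their average does. This is the kind of complementarity that heterogeneous ensembles aim to exploit. In practice the selection would be run on held-out predictions to limit overfitting.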