

Predicting protein function and other biomedical characteristics with heterogeneous ensembles.

Authors

Whalen Sean, Pandey Om Prakash, Pandey Gaurav

Affiliations

Gladstone Institutes, University of California, San Francisco, CA, USA.

Icahn Institute for Genomics and Multiscale Biology and Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY, USA.

Publication information

Methods. 2016 Jan 15;93:92-102. doi: 10.1016/j.ymeth.2015.08.016. Epub 2015 Sep 2.

Abstract

Prediction problems in biomedical sciences, including protein function prediction (PFP), are generally quite difficult. This is due in part to incomplete knowledge of the cellular phenomenon of interest, the appropriateness and data quality of the variables and measurements used for prediction, as well as a lack of consensus regarding the ideal predictor for specific problems. In such scenarios, a powerful approach to improving prediction performance is to construct heterogeneous ensemble predictors that combine the output of diverse individual predictors that capture complementary aspects of the problems and/or datasets. In this paper, we demonstrate the potential of such heterogeneous ensembles, derived from stacking and ensemble selection methods, for addressing PFP and other similar biomedical prediction problems. Deeper analysis of these results shows that the superior predictive ability of these methods, especially stacking, can be attributed to their attention to the following aspects of the ensemble learning process: (i) better balance of diversity and performance, (ii) more effective calibration of outputs and (iii) more robust incorporation of additional base predictors. Finally, to make the effective application of heterogeneous ensembles to large complex datasets (big data) feasible, we present DataSink, a distributed ensemble learning framework, and demonstrate its sound scalability using the examined datasets. DataSink is publicly available from https://github.com/shwhalen/datasink.
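The abstract names ensemble selection as one of the two heterogeneous ensemble methods studied. The following is a minimal sketch of the general idea (greedy forward selection with replacement, scored on held-out predictions), not the DataSink implementation. The base predictor names, the toy scores, the choice of AUC as the metric, and the `max_size` parameter are all assumptions made for illustration.

```python
from statistics import mean


def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average(predictions, members):
    """Average the score vectors of the selected members (repeats allowed)."""
    return [mean(col) for col in zip(*(predictions[m] for m in members))]


def ensemble_selection(predictions, labels, max_size=10):
    """Greedily add the base predictor that most improves validation AUC."""
    members, best = [], float("-inf")
    for _ in range(max_size):
        # Score every candidate extension of the current ensemble.
        score, name = max(
            (auc(labels, average(predictions, members + [m])), m)
            for m in sorted(predictions)
        )
        if score <= best:
            break  # no candidate improves the ensemble, so stop
        members.append(name)
        best = score
    return members, best


# Hypothetical validation-set scores from three base predictors.
labels = [1, 1, 1, 0, 0, 0]
predictions = {
    "knn": [0.9, 0.8, 0.2, 0.3, 0.1, 0.4],
    "svm": [0.3, 0.4, 0.9, 0.2, 0.5, 0.1],
    "constant": [0.5, 0.5, 0.5, 0.5, 0.5, 0.5],
}
members, best = ensemble_selection(predictions, labels)
print(sorted(members), best)
```

In this toy data the two informative predictors make errors on different examples, so their average ranks every positive above every negative while each one alone does not. That is a small instance of the complementarity the abstract describes; the uninformative predictor is never selected.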


Similar articles

1
Predicting protein function and other biomedical characteristics with heterogeneous ensembles.
Methods. 2016 Jan 15;93:92-102. doi: 10.1016/j.ymeth.2015.08.016. Epub 2015 Sep 2.
2
LEARNING PARSIMONIOUS ENSEMBLES FOR UNBALANCED COMPUTATIONAL GENOMICS PROBLEMS.
Pac Symp Biocomput. 2017;22:288-299. doi: 10.1142/9789813207813_0028.
3
Large-scale protein function prediction using heterogeneous ensembles.
F1000Res. 2018 Sep 28;7. doi: 10.12688/f1000research.16415.1. eCollection 2018.
4
Network inference with ensembles of bi-clustering trees.
BMC Bioinformatics. 2019 Oct 28;20(1):525. doi: 10.1186/s12859-019-3104-y.
5
Forecasting Corn Yield With Machine Learning Ensembles.
Front Plant Sci. 2020 Jul 31;11:1120. doi: 10.3389/fpls.2020.01120. eCollection 2020.
6
Constructing query-driven dynamic machine learning model with application to protein-ligand binding sites prediction.
IEEE Trans Nanobioscience. 2015 Jan;14(1):45-58. doi: 10.1109/TNB.2015.2394328.
7
Drug-target interaction prediction with tree-ensemble learning and output space reconstruction.
BMC Bioinformatics. 2020 Feb 7;21(1):49. doi: 10.1186/s12859-020-3379-z.
8
Ensemble blood glucose prediction in diabetes mellitus: A review.
Comput Biol Med. 2022 Aug;147:105674. doi: 10.1016/j.compbiomed.2022.105674. Epub 2022 Jun 10.
9
Compressive Big Data Analytics: An ensemble meta-algorithm for high-dimensional multisource datasets.
PLoS One. 2020 Aug 28;15(8):e0228520. doi: 10.1371/journal.pone.0228520. eCollection 2020.
10
Greedy and Linear Ensembles of Machine Learning Methods Outperform Single Approaches for QSPR Regression Problems.
Mol Inform. 2015 Sep;34(9):634-47. doi: 10.1002/minf.201400122. Epub 2015 Mar 25.

Cited by

1
Prediction of future dementia among patients with mild cognitive impairment (MCI) by integrating multimodal clinical data.
Heliyon. 2024 Aug 22;10(17):e36728. doi: 10.1016/j.heliyon.2024.e36728. eCollection 2024 Sep 15.
2
Improving transparency of computational tools for variant effect prediction.
Nat Genet. 2024 Jul;56(7):1324-1326. doi: 10.1038/s41588-024-01821-8.
4
Is ChatGPT a trusted source of information for total hip and knee arthroplasty patients?
Bone Jt Open. 2024 Feb 15;5(2):139-146. doi: 10.1302/2633-1462.52.BJO-2023-0113.R1.
5
Developing better digital health measures of Parkinson's disease using free living data and a crowdsourced data analysis challenge.
PLOS Digit Health. 2023 Mar 28;2(3):e0000208. doi: 10.1371/journal.pdig.0000208. eCollection 2023 Mar.
6
Integrating multimodal data through interpretable heterogeneous ensembles.
Bioinform Adv. 2022 Sep 12;2(1):vbac065. doi: 10.1093/bioadv/vbac065. eCollection 2022.
7
Integrating multimodal data through interpretable heterogeneous ensembles.
bioRxiv. 2022 Jul 25:2020.05.29.123497. doi: 10.1101/2020.05.29.123497.
9
Gene function finding through cross-organism ensemble learning.
BioData Min. 2021 Feb 12;14(1):14. doi: 10.1186/s13040-021-00239-w.

