Kernel-imbedded Gaussian processes for disease classification using microarray gene expression data.

Authors

Zhao Xin, Cheung Leo Wang-Kit

Affiliation

Department of Information and Computer Sciences, University of Hawaii, 1680 East-West Road, Honolulu, Hawaii 96822, USA.

Publication

BMC Bioinformatics. 2007 Feb 28;8:67. doi: 10.1186/1471-2105-8-67.

Abstract

BACKGROUND

Designing appropriate machine learning methods for identifying genes with significant discriminating power for disease outcomes has become increasingly important for understanding diseases at the genomic level. Although many machine learning methods have been developed and applied to microarray gene expression data analysis, the majority are based on linear models, which are not necessarily appropriate for the underlying connection between the target disease and its associated explanatory genes. Linear model based methods also tend to admit false positive significant features more easily. Furthermore, linear model based algorithms often involve inverting a matrix that may be singular when the number of potentially important genes is relatively large, which leads to numerical instability. To overcome these limitations, a few non-linear methods have recently been introduced to the area. Many of the existing non-linear methods, however, have two critical problems, the model selection problem and the model parameter tuning problem, that remain unsolved or even untouched. In general, a unified framework that allows the model parameters of both linear and non-linear models to be easily tuned is preferred in real-world applications. Kernel-induced learning methods form a class of approaches that show promise for achieving this goal.
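The singularity issue mentioned above can be illustrated with a small sketch. It assumes, as one common instance, that the matrix to be inverted is the normal-equations matrix X^T X of a least-squares fit, and it uses a made-up 3-sample by 5-gene expression matrix; neither the data nor the helper names come from the paper. With fewer samples than genes, X^T X cannot have full rank, whereas a sample-by-sample (linear kernel) Gram matrix with an identity term added stays invertible.

```python
from fractions import Fraction


def transpose(m):
    """Return the transpose of a list-of-lists matrix."""
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """Multiply two list-of-lists matrices."""
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


def rank(matrix):
    """Exact rank via Gaussian elimination over the rationals."""
    rows = [[Fraction(v) for v in row] for row in matrix]
    found = 0
    for col in range(len(rows[0])):
        pivot = next((r for r in range(found, len(rows)) if rows[r][col] != 0), None)
        if pivot is None:
            continue  # no pivot in this column
        rows[found], rows[pivot] = rows[pivot], rows[found]
        for r in range(found + 1, len(rows)):
            factor = rows[r][col] / rows[found][col]
            rows[r] = [v - factor * p for v, p in zip(rows[r], rows[found])]
        found += 1
    return found


# Hypothetical toy data: 3 samples (rows) by 5 genes (columns).
X = [
    [2, 0, 1, 3, 1],
    [1, 1, 0, 2, 4],
    [0, 3, 2, 1, 1],
]

gene_gram = matmul(transpose(X), X)      # 5 x 5, used by the normal equations
sample_gram = matmul(X, transpose(X))    # 3 x 3, linear kernel between samples
regularized = [
    [v + (1 if i == j else 0) for j, v in enumerate(row)]
    for i, row in enumerate(sample_gram)
]

print("rank of X^T X:", rank(gene_gram), "of", len(gene_gram))
print("rank of X X^T + I:", rank(regularized), "of", len(regularized))
```

The exact rational arithmetic is only there to make the rank computation unambiguous on toy data; a practical analysis would use a numerical linear algebra library instead.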

RESULTS

A hierarchical statistical model named kernel-imbedded Gaussian process (KIGP) is developed under a unified Bayesian framework for binary disease classification problems using microarray gene expression data. In particular, based on a probit regression setting, an adaptive algorithm with a cascading structure is designed to find the appropriate kernel, to discover the potentially significant genes, and to make the optimal class prediction accordingly. A Gibbs sampler is built as the core of the algorithm to make Bayesian inferences. Simulation studies showed that, even without any knowledge of the underlying generative model, the KIGP performed very close to the theoretical Bayesian bound, both when the Bayesian classifier was linear and when it was highly non-linear. This suggests that the method is broadly usable for microarray data analysis problems, especially those for which linear methods work poorly. The KIGP was also applied to four published microarray datasets, and in all of these cases it performed better than, or at least as well as, any of the referenced state-of-the-art methods.
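To make the combination of probit regression, a Gaussian process prior, and Gibbs sampling more concrete, here is a deliberately simplified sketch. It is not the authors' KIGP algorithm: it assumes a fixed Gaussian (RBF) kernel with a hand-picked width, omits kernel selection and gene selection entirely, and runs on made-up one-dimensional data. It uses the standard latent-variable form of probit regression, z = f + noise with y = 1 exactly when z > 0, so that marginally z ~ N(0, K + I) and each z_i can be redrawn from a truncated normal given the others. All function names are hypothetical.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()


def rbf_kernel(a, b, width):
    """Gaussian (RBF) kernel between two scalar inputs."""
    return math.exp(-((a - b) ** 2) / (2.0 * width ** 2))


def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination."""
    n = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def sample_negative(mean, sd, rng):
    """Draw from N(mean, sd^2) truncated to (-inf, 0) via the inverse CDF."""
    mass_below_zero = STD_NORMAL.cdf(-mean / sd)
    u = max(mass_below_zero * rng.random(), 1e-300)  # keep u strictly positive
    return mean + sd * STD_NORMAL.inv_cdf(u)


def gibbs_probit_gp(x, y, width=0.5, n_iter=400, burn_in=100, seed=0):
    """Return the estimated posterior mean of the latent function at x."""
    rng = random.Random(seed)
    n = len(x)
    K = [[rbf_kernel(xi, xj, width) for xj in x] for xi in x]
    # Precision matrix of the marginal distribution z ~ N(0, K + I).
    Q = invert([[K[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
                for i in range(n)])
    z = [0.5 if label == 1 else -0.5 for label in y]  # start inside the constraints
    z_sum = [0.0] * n
    for it in range(n_iter):
        for i in range(n):
            # Conditional of z_i given the rest is normal, truncated by the label.
            sd = math.sqrt(1.0 / Q[i][i])
            mean = -sum(Q[i][j] * z[j] for j in range(n) if j != i) / Q[i][i]
            if y[i] == 1:
                z[i] = -sample_negative(-mean, sd, rng)  # reflect to get z_i > 0
            else:
                z[i] = sample_negative(mean, sd, rng)
        if it >= burn_in:
            z_sum = [s + v for s, v in zip(z_sum, z)]
    z_bar = [s / (n_iter - burn_in) for s in z_sum]
    # E[f | z] = K (K + I)^{-1} z, which is linear in z, so plug in the mean of z.
    qz = [sum(Q[i][j] * z_bar[j] for j in range(n)) for i in range(n)]
    return [sum(K[i][j] * qz[j] for j in range(n)) for i in range(n)]


# Hypothetical toy data: class 1 sits between two groups of class 0,
# so no single threshold on x separates the classes.
x_train = [-3.0, -2.5, -0.25, 0.0, 0.25, 2.5, 3.0]
y_train = [0, 0, 1, 1, 1, 0, 0]

f_mean = gibbs_probit_gp(x_train, y_train)
for xi, yi, fi in zip(x_train, y_train, f_mean):
    # Phi(f) is a plug-in probability, not the full posterior predictive.
    print(f"x={xi:+.2f}  label={yi}  latent mean={fi:+.3f}  Phi={STD_NORMAL.cdf(fi):.3f}")
```

Integrating the latent function out and sampling only z keeps the sketch short. A fuller treatment would also sample the kernel parameters and gene-inclusion indicators, as the abstract describes for the KIGP.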

CONCLUSION

Mathematically built on the kernel-induced feature space concept under a Bayesian framework, the KIGP method presented in this paper provides a unified machine learning approach to explore both the linear and the possibly non-linear underlying relationship between the target features of a given binary disease classification problem and the related explanatory gene expression data. More importantly, it incorporates the model parameter tuning into the framework. The model selection problem is addressed in the form of selecting a proper kernel type. The KIGP method also gives Bayesian probabilistic predictions for disease classification. These properties and features are beneficial to most real-world applications. The algorithm is naturally robust in numerical computation. The simulation studies and the published data studies demonstrated that the proposed KIGP performs satisfactorily and consistently.
