Kim Jaehee, Kim Haseong
Department of Statistics, Duksung Women's University, Seoul National University, Seoul, S. Korea.
Bioinformatics. 2008 Jan 15;24(2):184-91. doi: 10.1093/bioinformatics/btm568. Epub 2007 Nov 19.
To understand the behavior of genes, it is important to explore how gene expression patterns change over time, because biologically related groups of genes can share the same change patterns. Many clustering algorithms have been proposed for grouping observed data. However, because the underlying functions are complex, few studies have grouped data on the basis of change patterns. In this study, the problem of finding similar change patterns is reduced to clustering the derivative Fourier coefficients. The sample Fourier coefficients not only carry information about the underlying functions but also reduce the dimension of the data. In addition, because their limiting distribution is multivariate normal, a model-based clustering method that incorporates these statistical properties is appropriate.
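As a rough illustration of how derivative Fourier coefficients can summarize a change pattern, the sketch below computes sample Fourier coefficients from equally spaced observations and converts them to the coefficients of the derivative by term-by-term differentiation. It is a minimal sketch under stated assumptions, not the authors' R program: the function names, the rescaling of time to [0, 1), the equal spacing, and the truncation at `max_freq` (standing in for a smoothing parameter) are all assumptions made for this example.

```python
import math

def sample_fourier_coefficients(y, max_freq):
    """Return [(a_k, b_k)] for k = 1..max_freq from equally spaced samples.

    Assumes y[j] is observed at t_j = j / n on [0, 1) (an assumption of this sketch).
    """
    n = len(y)
    coeffs = []
    for k in range(1, max_freq + 1):
        a_k = 2.0 / n * sum(y[j] * math.cos(2 * math.pi * k * j / n) for j in range(n))
        b_k = 2.0 / n * sum(y[j] * math.sin(2 * math.pi * k * j / n) for j in range(n))
        coeffs.append((a_k, b_k))
    return coeffs

def derivative_fourier_coefficients(y, max_freq):
    """Return the cosine and sine coefficients of the derivative, flattened.

    Differentiating a_k*cos(2*pi*k*t) + b_k*sin(2*pi*k*t) term by term gives
    cosine coefficient 2*pi*k*b_k and sine coefficient -2*pi*k*a_k.
    """
    features = []
    for k, (a_k, b_k) in enumerate(sample_fourier_coefficients(y, max_freq), start=1):
        w = 2 * math.pi * k
        features.extend([w * b_k, -w * a_k])
    return features

# Two hypothetical expression profiles with the same shape but different levels.
n = 16
low = [1.0 + math.sin(2 * math.pi * j / n) for j in range(n)]
high = [5.0 + math.sin(2 * math.pi * j / n) for j in range(n)]
print(derivative_fourier_coefficients(low, 2))
print(derivative_fourier_coefficients(high, 2))
```

The two profiles differ only by a constant shift, so their derivative features coincide. This is the sense in which clustering on such features groups curves by how they change rather than by their expression level, and the short feature vector also shows the dimension reduction.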
This work aims to discover groups of genes that have similar change patterns and share similar biological properties. We developed a statistical model that uses derivative Fourier coefficients to identify similar change patterns of gene expression, and we used a model-based method to cluster the Fourier series estimates of the derivatives. The model-based method has an advantage over other methods in our setting because the sample Fourier coefficients asymptotically follow a multivariate normal distribution. Change patterns are estimated automatically through the Fourier representation. The model was tested in simulations and on real gene expression data sets. In the simulations, model-based clustering with the sample Fourier coefficients had a lower clustering error rate than K-means clustering, and this held even when the number of repeated time points was small. We also applied the model to cluster change patterns in yeast cell cycle microarray expression data obtained with alpha-factor synchronization. Because the method groups data that are close in probability, the model-based clustering yielded biologically interpretable results. We expect that the proposed Fourier analysis, with suitably chosen smoothing parameters, could serve as a useful tool for classifying genes and interpreting possible biological change patterns.
The R program is available upon request.
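Because the sample Fourier coefficients are asymptotically multivariate normal, a Gaussian mixture is a natural form for model-based clustering of such feature vectors. The following is a minimal sketch of that idea, not the authors' R program: it fits a mixture with diagonal covariances by EM, and the function names, the diagonal-covariance restriction, the farthest-point initialization, the fixed iteration count, and the example feature vectors are all assumptions made for illustration.

```python
import math

def farthest_point_init(points, n_components):
    """Deterministic starting means: the first point, then the farthest points."""
    means = [list(points[0])]
    while len(means) < n_components:
        def dist_to_means(p):
            return min(sum((a - b) ** 2 for a, b in zip(p, m)) for m in means)
        means.append(list(max(points, key=dist_to_means)))
    return means

def fit_diagonal_gmm(points, n_components, n_iter=50, min_var=1e-6):
    """Fit a Gaussian mixture with diagonal covariances by EM.

    Returns (labels, responsibilities); labels are the most probable components.
    """
    n, dim = len(points), len(points[0])
    means = farthest_point_init(points, n_components)
    grand = [sum(p[d] for p in points) / n for d in range(dim)]
    pooled = [max(sum((p[d] - grand[d]) ** 2 for p in points) / n, min_var)
              for d in range(dim)]
    variances = [pooled[:] for _ in range(n_components)]
    weights = [1.0 / n_components] * n_components
    resp = []
    for _ in range(n_iter):
        # E-step: posterior probability of each component for each point,
        # computed on the log scale to avoid underflow.
        resp = []
        for p in points:
            logp = []
            for c in range(n_components):
                ll = math.log(weights[c])
                for d in range(dim):
                    ll -= 0.5 * (math.log(2 * math.pi * variances[c][d])
                                 + (p[d] - means[c][d]) ** 2 / variances[c][d])
                logp.append(ll)
            top = max(logp)
            unnorm = [math.exp(v - top) for v in logp]
            total = sum(unnorm)
            resp.append([u / total for u in unnorm])
        # M-step: responsibility-weighted updates of weights, means, variances.
        for c in range(n_components):
            mass = sum(r[c] for r in resp)
            if mass < 1e-12:
                continue  # effectively empty component: keep its old parameters
            weights[c] = mass / n
            for d in range(dim):
                mean = sum(r[c] * p[d] for r, p in zip(resp, points)) / mass
                var = sum(r[c] * (p[d] - mean) ** 2 for r, p in zip(resp, points)) / mass
                means[c][d] = mean
                variances[c][d] = max(var, min_var)
    labels = [max(range(n_components), key=lambda c: r[c]) for r in resp]
    return labels, resp

# Hypothetical two-dimensional feature vectors forming two change patterns.
features = [(6.3, 0.1), (6.2, -0.1), (6.4, 0.0), (6.25, 0.05),
            (0.1, -6.3), (-0.1, -6.2), (0.0, -6.4), (0.05, -6.3)]
labels, _ = fit_diagonal_gmm(features, 2)
print(labels)
```

Unlike K-means, which assigns each point to the nearest center, the mixture assigns points by posterior probability and lets each cluster have its own spread. A practical analysis would also need a way to choose the number of components and the covariance structure, which this sketch leaves fixed.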