Wu Peng, An Mo, Zou Hai-Ren, Zhong Cai-Ying, Wang Wei, Wu Chang-Peng
Department of Neurosurgery, The People's Hospital of Longhua District, Shenzhen, Guangdong Province, China.
PeerJ. 2020 Oct 16;8:e10091. doi: 10.7717/peerj.10091. eCollection 2020.
Single-cell RNA-sequencing (scRNA-seq) technology is a powerful tool for studying organisms from a single-cell perspective and exploring the heterogeneity between cells. Clustering is a fundamental step in scRNA-seq data analysis: it is key to understanding cell function and forms the basis of other advanced analyses. Nonnegative Matrix Factorization (NMF) has been widely used in clustering analysis of transcriptome data and has achieved good performance. However, the existing NMF model is unsupervised and ignores known gene functions during clustering. A large body of knowledge about cell marker genes (genes expressed only in specific cells) in human and model organisms has accumulated in resources such as the Molecular Signatures Database (MSigDB), and it can be used as prior information in the clustering analysis of scRNA-seq data. Because cells of the same kind are likely to have similar biological functions and specific gene expression patterns, the marker genes of cells can be utilized as prior knowledge in the clustering analysis.
We propose a robust and semi-supervised NMF (rssNMF) model, which introduces a new variable to absorb noise in the data and incorporates marker genes as prior information through a graph regularization term. We use rssNMF to solve the clustering problem of scRNA-seq data.
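The abstract does not give the rssNMF objective, so the following is only a minimal sketch of the general idea of graph-regularized NMF with a marker-derived cell graph, not the authors' formulation. It assumes the standard objective ||X - WH||_F^2 + lam * Tr(H L H^T) with multiplicative updates, and it omits the noise-absorbing variable entirely. The function names (`marker_affinity`, `graph_nmf`), the parameter `lam`, and the rule for building the affinity matrix (two cells are linked when the same marker set scores highest in both) are hypothetical choices made for illustration.

```python
import numpy as np

def marker_affinity(X, marker_sets):
    """Build a cell-by-cell affinity matrix from marker gene sets.

    X: nonnegative expression matrix, shape (genes, cells).
    marker_sets: list of integer index arrays, one per candidate cell type.
    Assumption: two cells are linked if the same marker set has the
    highest mean expression in both of them.
    """
    # Mean expression of each marker set in each cell: (sets, cells)
    scores = np.stack([X[idx].mean(axis=0) for idx in marker_sets])
    top = scores.argmax(axis=0)
    A = (top[:, None] == top[None, :]).astype(float)
    np.fill_diagonal(A, 0.0)  # no self-loops
    return A

def graph_nmf(X, A, k, lam=1.0, n_iter=200, seed=0, eps=1e-10):
    """Graph-regularized NMF via multiplicative updates.

    Approximately minimizes ||X - W H||_F^2 + lam * Tr(H L H^T),
    where L = D - A is the graph Laplacian of the affinity matrix A.
    Returns W (genes, k) and H (k, cells), both nonnegative.
    """
    rng = np.random.default_rng(seed)
    n_genes, n_cells = X.shape
    W = rng.random((n_genes, k)) + eps
    H = rng.random((k, n_cells)) + eps
    D = np.diag(A.sum(axis=1))  # degree matrix
    for _ in range(n_iter):
        W *= (X @ H.T) / (W @ (H @ H.T) + eps)
        H *= (W.T @ X + lam * (H @ A)) / ((W.T @ W) @ H + lam * (H @ D) + eps)
    return W, H
```

Under these assumptions, a cluster label for each cell can be read off as `H.argmax(axis=0)`. A robust variant in the spirit of the abstract would add a separate term to absorb noise, which this sketch leaves out.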
Twelve scRNA-seq datasets with true labels are used to test model performance, and the results show that our model outperforms the original NMF and other common methods such as KMeans and Hierarchical Clustering. Biological significance analysis shows that rssNMF can identify key subclasses and latent biological processes. To our knowledge, this is the first method that incorporates prior knowledge into the clustering analysis of scRNA-seq data.
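The abstract does not state which metric is used to compare predicted clusters with the true labels. As an illustration only, the sketch below computes the adjusted Rand index (ARI), one common choice for this kind of comparison; the function name `adjusted_rand_index` is hypothetical, and the choice of ARI is an assumption rather than something taken from the study.

```python
import numpy as np

def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two labelings of the same cells.

    1.0 means identical partitions (up to renaming of labels);
    values near 0 mean agreement no better than chance.
    """
    labels_true = np.asarray(labels_true)
    labels_pred = np.asarray(labels_pred)
    # Map arbitrary labels to 0..n-1 and build the contingency table
    _, t = np.unique(labels_true, return_inverse=True)
    _, p = np.unique(labels_pred, return_inverse=True)
    table = np.zeros((t.max() + 1, p.max() + 1), dtype=np.int64)
    np.add.at(table, (t, p), 1)

    def comb2(x):
        # Number of unordered pairs among x items
        return x * (x - 1) / 2.0

    total = comb2(labels_true.size)
    if total == 0:
        return 1.0  # fewer than two cells: partitions trivially agree
    sum_cells = comb2(table).sum()
    sum_rows = comb2(table.sum(axis=1)).sum()
    sum_cols = comb2(table.sum(axis=0)).sum()
    expected = sum_rows * sum_cols / total
    max_index = 0.5 * (sum_rows + sum_cols)
    if max_index == expected:
        return 1.0  # degenerate case, e.g. a single cluster in both
    return (sum_cells - expected) / (max_index - expected)
```

Because ARI ignores the label names themselves, it can compare cluster indices from any method against annotated cell types without first matching clusters to types.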