Zhao Jian-Ping, Hou Tong-Shuai, Su Yansen, Zheng Chun-Hou
College of Mathematics and System Sciences, Xinjiang University, Urumqi, China; Institute of Mathematics and Physics, Xinjiang University, Urumqi, China.
College of Mathematics and System Sciences, Xinjiang University, Urumqi, China.
Methods. 2022 Dec;208:66-74. doi: 10.1016/j.ymeth.2022.10.006. Epub 2022 Oct 28.
Single-cell sequencing enables high-throughput sequencing analysis of the genome, transcriptome and epigenome at the single-cell level. It overcomes shortcomings of traditional methods, reveals the gene structure and gene expression state of individual cells, and reflects the heterogeneity between cells. Clustering is a very important step in the analysis of single-cell RNA data, but it faces two difficulties: dropout events and the curse of dimensionality. At present, many methods are driven by data alone and do not make full use of existing biological information.
In this work, we propose scSSA, a clustering model based on a semi-supervised autoencoder, fast independent component analysis (FastICA) and Gaussian mixture clustering. First, the semi-supervised autoencoder imputes and denoises the scRNA-seq data and yields a low-dimensional latent representation. Second, the dimension of this representation is further reduced by FastICA, and the result is clustered with a Gaussian mixture model. Finally, scSSA is compared with Seurat, CIDR and other methods on 10 public scRNA-seq datasets.
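The last two stages of the pipeline described above (FastICA followed by Gaussian mixture clustering) can be sketched roughly as follows. This is a minimal illustration using scikit-learn, not the authors' implementation: the semi-supervised autoencoder is not reproduced, and its output is replaced by a synthetic matrix from the hypothetical helper `make_toy_latent`. The function name `cluster_latent` and all parameter values (32 latent features, 10 independent components, 3 clusters) are assumptions chosen for the example.

```python
import numpy as np
from sklearn.decomposition import FastICA
from sklearn.mixture import GaussianMixture


def cluster_latent(latent, n_components, n_clusters, seed=0):
    """Reduce a latent matrix with FastICA, then cluster it with a Gaussian mixture.

    `latent` is a (cells x features) array standing in for an autoencoder's
    latent representation; this function name is illustrative only.
    """
    # Step 1: FastICA projects the latent space onto fewer independent components.
    ica = FastICA(n_components=n_components, random_state=seed, max_iter=1000)
    reduced = ica.fit_transform(latent)

    # Step 2: a Gaussian mixture model assigns each cell to one cluster.
    gmm = GaussianMixture(n_components=n_clusters, random_state=seed, n_init=5)
    labels = gmm.fit_predict(reduced)
    return reduced, labels


def make_toy_latent(n_per_group=100, n_features=32, n_groups=3, seed=0):
    """Synthetic stand-in for the autoencoder output (an assumption, not real data)."""
    rng = np.random.default_rng(seed)
    centers = rng.normal(scale=5.0, size=(n_groups, n_features))
    blocks = [c + rng.normal(size=(n_per_group, n_features)) for c in centers]
    truth = np.repeat(np.arange(n_groups), n_per_group)
    return np.vstack(blocks), truth


latent, truth = make_toy_latent()
reduced, labels = cluster_latent(latent, n_components=10, n_clusters=3)
print(reduced.shape, np.unique(labels).size)
```

In practice the number of mixture components would be set to the expected number of cell types or chosen by a model-selection criterion; the abstract does not say which approach scSSA takes.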
The results show that scSSA has superior performance in cell clustering on the 10 public datasets. In conclusion, scSSA can accurately identify cell types and is generally applicable to single-cell datasets of various kinds, giving it great application potential in scRNA-seq data analysis. The code is available at https://github.com/houtongshuai123/scSSA/.