CREST (Ensai, Université Bretagne Loire), Bruz, France.
Brief Bioinform. 2020 Jul 15;21(4):1209-1223. doi: 10.1093/bib/bbz063.
Single-cell RNA sequencing (scRNA-seq) technologies have enabled large-scale whole-transcriptome profiling of individual cells in a cell population. A core analysis of scRNA-seq transcriptome profiles is to cluster the single cells to reveal cell subtypes and infer cell lineages based on the relations among the cells. This article reviews the machine learning and statistical methods for clustering scRNA-seq transcriptomes developed in the past few years. The review focuses on how conventional clustering techniques such as hierarchical clustering, graph-based clustering, mixture models, $k$-means, ensemble learning, neural networks and density-based clustering are modified or customized to tackle the unique challenges in scRNA-seq data analysis, such as the dropout of low-expression genes, low and uneven read coverage of transcripts, highly variable total mRNA content across single cells and ambiguous cell markers in the presence of technical biases and irrelevant confounding biological variation. We review how cell-specific normalization, the imputation of dropouts and dimension reduction methods can be applied with new statistical or optimization strategies to improve the clustering of single cells. We also introduce more advanced approaches to cluster scRNA-seq transcriptomes in time series data and multiple cell populations and to detect rare cell types. Several software packages developed to support the cluster analysis of scRNA-seq data are also reviewed and experimentally compared to evaluate their performance and efficiency. Finally, we conclude with useful observations and possible future directions in scRNA-seq data analytics.
All the source code and data are available at https://github.com/kuanglab/single-cell-review.