Department of Biostatistics, University of Pittsburgh Graduate School of Public Health, Pittsburgh, PA, USA.
Division of Pulmonary Medicine, Allergy and Immunology and Department of Pediatrics, Children's Hospital of Pittsburgh of UPMC, University of Pittsburgh, Pittsburgh, PA, USA.
Bioinformatics. 2018 Jan 1;34(1):139-146. doi: 10.1093/bioinformatics/btx490.
Single cell transcriptome sequencing (scRNA-Seq) has become a revolutionary tool to study cellular and molecular processes at single cell resolution. Among existing technologies, the recently developed droplet-based platform enables efficient parallel processing of thousands of single cells with direct counting of transcript copies using Unique Molecular Identifiers (UMIs). Despite these technological advances, statistical methods and computational tools for analyzing droplet-based scRNA-Seq data are still lacking. In particular, model-based approaches for clustering large-scale single cell transcriptomic data remain under-explored.
We developed DIMM-SC, a Dirichlet Mixture Model for clustering droplet-based Single Cell transcriptomic data. This approach explicitly models UMI count data from scRNA-Seq experiments and characterizes variations across different cell clusters via a Dirichlet mixture prior. We performed comprehensive simulations to evaluate DIMM-SC and compared it with existing clustering methods such as K-means, CellTree and Seurat. In addition, we analyzed public scRNA-Seq datasets with known cluster labels and in-house scRNA-Seq datasets from a study of systemic sclerosis with prior biological knowledge to benchmark and validate DIMM-SC. Both simulation studies and real data applications demonstrated that overall, DIMM-SC achieves substantially improved clustering accuracy and much lower clustering variability compared to other existing clustering methods. More importantly, as a model-based approach, DIMM-SC is able to quantify the clustering uncertainty for each single cell, facilitating rigorous statistical inference and biological interpretations, which are typically unavailable from existing clustering methods.
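To make the idea of per-cell clustering uncertainty concrete, the following is a minimal sketch of how posterior cluster membership probabilities can be computed under a Dirichlet-multinomial mixture. It is not the DIMM-SC package's code or API: the function names, the toy Dirichlet parameters `alphas`, the mixing `weights`, and the example count vectors are all illustrative assumptions, and the cluster parameters are taken as known here, whereas in practice they would have to be estimated from the data.

```python
import math


def dirichlet_multinomial_loglik(counts, alpha):
    """Log-likelihood of one cell's UMI counts given one cluster's Dirichlet
    parameters. The multinomial coefficient is omitted because it is the same
    for every cluster and cancels in the posterior."""
    total_alpha = sum(alpha)
    total_count = sum(counts)
    loglik = math.lgamma(total_alpha) - math.lgamma(total_alpha + total_count)
    for x, a in zip(counts, alpha):
        loglik += math.lgamma(a + x) - math.lgamma(a)
    return loglik


def cluster_posterior(counts, alphas, weights):
    """Posterior probability that a cell belongs to each cluster."""
    log_post = [
        math.log(w) + dirichlet_multinomial_loglik(counts, a)
        for a, w in zip(alphas, weights)
    ]
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(log_post)
    unnorm = [math.exp(lp - m) for lp in log_post]
    z = sum(unnorm)
    return [u / z for u in unnorm]


# Hypothetical toy setup: 2 clusters, 3 genes (assumed values, not from the paper).
alphas = [[8.0, 1.0, 1.0], [1.0, 1.0, 8.0]]
weights = [0.5, 0.5]

clear_cell = [20, 2, 1]      # dominated by gene 1, which cluster 0 favors
ambiguous_cell = [5, 2, 5]   # equally compatible with both clusters

clear_post = cluster_posterior(clear_cell, alphas, weights)
ambiguous_post = cluster_posterior(ambiguous_cell, alphas, weights)
print(clear_post)
print(ambiguous_post)
```

A cell whose counts fit one cluster far better than the others receives a posterior probability near 1 for that cluster, while a cell that is equally compatible with two clusters receives roughly equal probabilities. A hard-assignment method such as K-means does not provide this kind of per-cell uncertainty.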
DIMM-SC has been implemented in a user-friendly R package with a detailed tutorial available on www.pitt.edu/~wec47/singlecell.html.
wei.chen@chp.edu or hum@ccf.org.
Supplementary data are available at Bioinformatics online.