Department of Biostatistics and Bioinformatics, Emory University, Atlanta, GA 30322, USA.
Department of Biostatistics, The University of Texas MD Anderson Cancer Center, Houston, TX 77030, USA.
Bioinformatics. 2022 May 13;38(10):2692-2699. doi: 10.1093/bioinformatics/btac168.
Single-cell RNA sequencing (scRNA-seq) has revolutionized biological research by enabling the measurement of transcriptomic profiles at the single-cell level. With the increasing application of scRNA-seq in larger-scale studies, the problem of appropriately clustering cells emerges when the scRNA-seq data are from multiple subjects. One challenge is the subject-specific variation; systematic heterogeneity from multiple subjects may have a significant impact on clustering accuracy. Existing methods seeking to address such effects suffer from several limitations.
We develop a novel statistical method, EDClust, for multi-subject scRNA-seq cell clustering. EDClust models the sequence read counts by a mixture of Dirichlet-multinomial distributions and explicitly accounts for cell-type heterogeneity, subject heterogeneity and clustering uncertainty. An EM-MM hybrid algorithm is derived for maximizing the data likelihood and clustering the cells. We perform a series of simulation studies to evaluate the proposed method and demonstrate the outstanding performance of EDClust. Comprehensive benchmarking on four real scRNA-seq datasets with various tissue types and species demonstrates the substantial accuracy improvement of EDClust compared to existing methods.
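To make the modeling idea concrete, the following is a minimal sketch of a Dirichlet-multinomial mixture for count data, not the EDClust implementation. It computes the Dirichlet-multinomial log-probability of one cell's read counts and the posterior cluster probabilities for that cell, which corresponds to a generic E-step and is one way to express clustering uncertainty. The function names `dm_log_pmf` and `responsibilities`, the parameter layout, and the toy numbers are assumptions made for illustration. The subject-specific effects and the MM update of the concentration parameters described in the abstract are deliberately omitted.

```python
from math import lgamma, exp, log

def dm_log_pmf(counts, alpha):
    """Log-probability of a count vector under a Dirichlet-multinomial.

    counts: non-negative integer read counts, one per gene.
    alpha:  positive concentration parameters, one per gene (hypothetical values).
    """
    n = sum(counts)
    a0 = sum(alpha)
    # Log multinomial coefficient: n! / prod(x_g!)
    out = lgamma(n + 1) - sum(lgamma(x + 1) for x in counts)
    # Log ratio of Dirichlet normalizing constants
    out += lgamma(a0) - lgamma(n + a0)
    out += sum(lgamma(x + a) - lgamma(a) for x, a in zip(counts, alpha))
    return out

def responsibilities(counts, weights, alphas):
    """Posterior probability of each cluster for one cell (a generic E-step).

    weights: mixing proportions, one per cluster, summing to 1.
    alphas:  one concentration vector per cluster.
    """
    logs = [log(w) + dm_log_pmf(counts, a) for w, a in zip(weights, alphas)]
    m = max(logs)  # subtract the maximum for numerical stability
    unnorm = [exp(v - m) for v in logs]
    total = sum(unnorm)
    return [u / total for u in unnorm]

# Toy example: two genes, two clusters with opposite expression preferences.
cell = [9, 1]
post = responsibilities(cell, weights=[0.5, 0.5], alphas=[[9.0, 1.0], [1.0, 9.0]])
```

In this toy setting, `post` assigns the cell a higher probability for the first cluster, whose concentration vector favors the first gene. A hard cluster label can be taken as the index of the largest entry, while the full vector retains the uncertainty of the assignment.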
The R package is freely available at https://github.com/weix21/EDClust.
Supplementary data are available at Bioinformatics online.