Choudhary Saket, Satija Rahul
New York Genome Center, 101 Avenue of the Americas, New York, 10013, USA.
Center for Genomics and Systems Biology, New York University, 12 Waverly Pl, New York, 10003, USA.
Genome Biol. 2022 Jan 18;23(1):27. doi: 10.1186/s13059-021-02584-9.
Heterogeneity in single-cell RNA-seq (scRNA-seq) data is driven by multiple sources, including biological variation in cellular state as well as technical variation introduced during experimental processing. Deconvolving these effects is a key challenge for preprocessing workflows. Recent work has demonstrated the importance and utility of count models for scRNA-seq analysis, but there is a lack of consensus on which statistical distributions and parameter settings are appropriate.
Here, we analyze 59 scRNA-seq datasets that span a wide range of technologies, systems, and sequencing depths in order to evaluate the performance of different error models. While a Poisson error model appears appropriate for sparse datasets, we observe clear evidence of overdispersion for genes with sufficient sequencing depth in all biological systems, necessitating the use of a negative binomial model. Moreover, the degree of overdispersion varies widely across datasets, systems, and gene abundances, which argues for a data-driven approach to parameter estimation.
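The distinction between the two error models can be illustrated with a small simulation. The sketch below is not the authors' pipeline: it draws synthetic counts for a single hypothetical gene, assumes the common negative binomial parameterization with variance mu + mu^2/theta, and uses a simple method-of-moments estimate of theta. The function name `moment_theta` and all parameter values are illustrative assumptions.

```python
import numpy as np


def moment_theta(counts):
    """Method-of-moments estimate of the NB inverse overdispersion theta.

    Assumes Var = mu + mu^2 / theta, so theta = mu^2 / (var - mu).
    Returns infinity when the sample variance does not exceed the mean,
    i.e. when the counts look Poisson-like (no overdispersion).
    """
    counts = np.asarray(counts, dtype=float)
    mu = counts.mean()
    var = counts.var(ddof=1)
    if var <= mu:
        return float("inf")
    return mu ** 2 / (var - mu)


rng = np.random.default_rng(0)
n_cells = 20000      # illustrative number of cells
mu = 5.0             # illustrative mean count for one gene
theta_true = 10.0    # illustrative inverse overdispersion

# Poisson counts: variance equals the mean.
poisson_counts = rng.poisson(mu, size=n_cells)

# Negative binomial counts simulated as a gamma-Poisson mixture:
# each cell's rate is Gamma(shape=theta, scale=mu/theta), then Poisson.
rates = rng.gamma(shape=theta_true, scale=mu / theta_true, size=n_cells)
nb_counts = rng.poisson(rates)

# Variance-to-mean ratio: about 1 under Poisson, about 1 + mu/theta under NB.
fano_poisson = poisson_counts.var(ddof=1) / poisson_counts.mean()
fano_nb = nb_counts.var(ddof=1) / nb_counts.mean()
theta_hat = moment_theta(nb_counts)

print(f"Poisson variance/mean: {fano_poisson:.2f}")
print(f"NB variance/mean:      {fano_nb:.2f}")
print(f"Estimated theta (NB):  {theta_hat:.1f}")
```

Because the excess variance mu^2/theta scales with the square of the mean, it is hard to detect for genes with very low counts, which is consistent with a Poisson model appearing adequate for sparse data. In practice, a likelihood-based fit per gene would usually be preferred over this moment estimate.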
Based on these analyses, we provide a set of recommendations for modeling variation in scRNA-seq data, particularly when using generalized linear models or likelihood-based approaches for preprocessing and downstream analysis.