Biostatistics and Research Decision Sciences, Merck & Co., Inc., Rahway, New Jersey, USA.
Epidemiology Branch, Division of Intramural Population Health Research, Eunice Kennedy Shriver National Institute of Child Health and Human Development, National Institutes of Health, Bethesda, Maryland, USA.
Stat Med. 2022 Aug 15;41(18):3492-3510. doi: 10.1002/sim.9430. Epub 2022 Jun 2.
The performance of computational methods and software for identifying differentially expressed features in single-cell RNA-sequencing (scRNA-seq) data has been shown to be influenced by several factors, including the choice of normalization method and the choice of experimental platform (or library preparation protocol) used to profile gene expression in individual cells. Currently, it is up to the practitioner to choose the most appropriate differential expression (DE) method from the more than 100 DE tools available to date, each relying on its own assumptions to model scRNA-seq expression features. To model the technological variability in cross-platform scRNA-seq data, we propose to use Tweedie generalized linear models, which can flexibly capture the large dynamic range of observed scRNA-seq expression profiles across experimental platforms induced by platform- and gene-specific statistical properties such as heavy tails, sparsity, and gene expression distributions. We also propose a zero-inflated Tweedie model that allows the probability mass at zero to exceed that of a traditional Tweedie distribution, in order to model zero-inflated scRNA-seq data with excessive zero counts. Using both synthetic and published plate- and droplet-based scRNA-seq datasets, we perform a systematic benchmark evaluation of more than 10 representative DE methods and demonstrate that our method (Tweedieverse) outperforms state-of-the-art DE approaches across experimental platforms in terms of statistical power and false discovery rate control. Our open-source software (R/Bioconductor package) is available at https://github.com/himelmallick/Tweedieverse.
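To make the zero-mass statement above concrete, here is a minimal sketch in Python. It is not taken from the Tweedieverse package and does not use its API. It assumes the usual compound Poisson-gamma form of the Tweedie distribution (power parameter 1 < p < 2, mean mu, dispersion phi), for which P(Y = 0) = exp(-mu^(2-p) / (phi (2-p))), and it assumes the zero-inflated variant is the standard mixture of a point mass at zero (weight pi) with that Tweedie component. The abstract does not state the paper's exact parameterization, and the function names are hypothetical.

```python
import math


def tweedie_zero_prob(mu: float, phi: float, p: float) -> float:
    """P(Y = 0) for a compound Poisson-gamma Tweedie with 1 < p < 2."""
    if not 1.0 < p < 2.0:
        raise ValueError("a point mass at zero exists only for 1 < p < 2")
    if mu <= 0.0 or phi <= 0.0:
        raise ValueError("mu and phi must be positive")
    # Rate of the latent Poisson count; Y = 0 exactly when that count is 0.
    lam = mu ** (2.0 - p) / (phi * (2.0 - p))
    return math.exp(-lam)


def zi_tweedie_zero_prob(mu: float, phi: float, p: float, pi: float) -> float:
    """Zero mass of the assumed mixture: structural zero w.p. pi, else Tweedie."""
    if not 0.0 <= pi <= 1.0:
        raise ValueError("pi must lie in [0, 1]")
    return pi + (1.0 - pi) * tweedie_zero_prob(mu, phi, p)


if __name__ == "__main__":
    base = tweedie_zero_prob(mu=1.0, phi=1.0, p=1.5)
    inflated = zi_tweedie_zero_prob(mu=1.0, phi=1.0, p=1.5, pi=0.3)
    print(f"Tweedie zero mass:               {base:.4f}")
    print(f"Zero-inflated Tweedie zero mass: {inflated:.4f}")
```

Under these assumptions, any pi > 0 pushes the zero mass strictly above what the plain Tweedie component allows for the same mu, phi, and p, which is the extra flexibility the zero-inflated model is meant to provide.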