Department of Statistics, The George Washington University, Washington, DC 20052, USA.
Comput Biol Chem. 2022 Oct;100:107733. doi: 10.1016/j.compbiolchem.2022.107733. Epub 2022 Jul 18.
Single-cell RNA sequencing (scRNA-seq) data exhibit an unusual abundance of zero counts, a considerable fraction of which are due to dropout events, and this complicates differential expression analysis. To correct the bias in differential expression caused by informative dropouts, an inverse non-dropout-probability weighting method is proposed, motivated by the fact that the dropout rate in scRNA-seq data decreases as the underlying gene expression magnitude increases. The weights are estimated by maximum likelihood, with the dropout values integrated out using Gauss-Hermite quadrature. Linear, generalized linear and mixed regressions with the estimated weights are fitted to the original or transformed scRNA-seq data. Variances of the coefficient estimators from the weighted regressions are estimated using the jackknife method. Extensive simulation studies compare the proposed method with five cutting-edge methods (Limma, edgeR, MAST, ZIAQ and scImpute); the proposed method performs among the best under all scenarios in terms of AUC, sensitivity, specificity and FDR. The rate of detecting true positives is examined for the proposed method and the five comparison methods using mouse embryonic stem cells and fibroblasts, where differentially expressed (DE) genes detected in bulk RNA-seq data from an independent source, on the same set of genes under the same conditions, serve as true positives. Specificity is compared across these methods on true negative data obtained by randomly splitting a real dataset. Furthermore, the proposed method is illustrated on a lineage study in which cells from the same embryo are correlated, and genes differentially expressed between cell division lineages are identified.
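The weighting and jackknife steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and it makes several simplifying assumptions: the non-dropout probability follows a logistic model in the latent expression with known coefficients (the paper instead estimates the weights by maximum likelihood with Gauss-Hermite quadrature), only non-dropout cells enter the regression, and the regression is a simple weighted least-squares fit with one binary group covariate. The function names `non_dropout_prob`, `weighted_slope`, `jackknife_variance` and `simulate`, and the parameters `a`, `b` and `effect`, are hypothetical.

```python
import math
import random


def non_dropout_prob(z, a=-1.0, b=1.0):
    """Assumed logistic model for P(cell is observed | latent expression z).

    With b > 0 the dropout rate falls as expression rises.
    """
    return 1.0 / (1.0 + math.exp(-(a + b * z)))


def weighted_slope(x, y, w):
    """Slope of a weighted least-squares fit of y on x with an intercept."""
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    return sxy / sxx


def jackknife_variance(x, y, w):
    """Delete-one jackknife variance of the weighted slope."""
    n = len(x)
    loo = [
        weighted_slope(x[:i] + x[i + 1:], y[:i] + y[i + 1:], w[:i] + w[i + 1:])
        for i in range(n)
    ]
    mean = sum(loo) / n
    return (n - 1) / n * sum((b - mean) ** 2 for b in loo)


def simulate(n=400, effect=1.0, seed=0):
    """Simulate two groups with informative dropout; return slope estimates.

    Returns (unweighted slope, inverse-probability-weighted slope,
    jackknife variance of the weighted slope).
    """
    rng = random.Random(seed)
    x, y, w = [], [], []
    for i in range(n):
        group = i % 2                      # binary condition label
        z = rng.gauss(effect * group, 1.0)  # latent log-expression
        p = non_dropout_prob(z)
        if rng.random() < p:               # cell escapes dropout
            x.append(group)
            y.append(z)
            w.append(1.0 / p)              # inverse non-dropout probability
    naive = weighted_slope(x, y, [1.0] * len(x))
    ipw = weighted_slope(x, y, w)
    return naive, ipw, jackknife_variance(x, y, w)


if __name__ == "__main__":
    naive, ipw, var = simulate()
    print(f"unweighted slope: {naive:.3f}")
    print(f"weighted slope:   {ipw:.3f} (jackknife SE {math.sqrt(var):.3f})")
```

Because cells with low expression are less likely to be observed, the unweighted group difference in this toy setting is biased; weighting each observed cell by the inverse of its non-dropout probability is meant to counteract that selection, and the jackknife gives a variance estimate that does not rely on a model for the weights.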