Sen Puliparambil Bhavithry, Tomal Jabed H, Yan Yan
Master of Science in Data Science Program, Thompson Rivers University, 805 TRU Way, Kamloops, BC V2C 0C8, Canada.
Department of Mathematics and Statistics, Thompson Rivers University, 805 TRU Way, Kamloops, BC V2C 0C8, Canada.
Biology (Basel). 2022 Oct 12;11(10):1495. doi: 10.3390/biology11101495.
With the emergence of single-cell RNA sequencing (scRNA-seq) technology, scientists can examine gene expression at single-cell resolution. Analysis of scRNA-seq data poses its own challenges, which stem from the high dimensionality of the data. Machine learning methods offer a way to perform gene (feature) selection on high-dimensional scRNA-seq data. Although several machine learning methods, such as penalized regression, appear suitable for feature selection, their performance has not been rigorously compared across data sets, each of which poses its own challenges. In this paper, we therefore analyzed and compared multiple penalized regression methods for scRNA-seq data. For the scRNA-seq data sets we analyzed, the results show that sparse group lasso (SGL) outperforms the other six methods (ridge, lasso, elastic net, drop lasso, group lasso, and big lasso) as measured by the area under the receiver operating characteristic curve (AUC) and computation time. Building on these findings, we proposed a new algorithm for feature selection using penalized regression methods. The proposed algorithm works by first selecting a small subset of genes and then applying SGL to that subset to select the differentially expressed genes in scRNA-seq data. Because it uses hierarchical clustering to group genes, the proposed method does not require domain-specific knowledge to supply gene grouping information. In addition, the proposed algorithm provided consistently better AUC on the data sets used.
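The core idea described in the abstract, grouping genes by hierarchical clustering and then fitting a sparse group lasso, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`group_genes`, `sgl_prox`, `fit_sgl_logistic`), the correlation distance with average linkage, the penalty weights `lam` and `alpha`, the proximal-gradient solver, and the synthetic data are all hypothetical choices made for this example. The penalty used is the standard SGL form, (1 - alpha) * lam * sum_g sqrt(p_g) * ||beta_g||_2 + alpha * lam * ||beta||_1, and the initial gene-subset selection step is omitted.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.metrics import roc_auc_score


def group_genes(X, n_groups):
    """Cluster genes (columns of X) by correlation distance.

    Assumes no gene has zero variance, since correlation is undefined there.
    Returns one integer group label per gene.
    """
    Z = linkage(X.T, method="average", metric="correlation")
    return fcluster(Z, t=n_groups, criterion="maxclust")


def sgl_prox(beta, groups, step, lam, alpha):
    """Proximal operator of the sparse group lasso penalty."""
    # Lasso part: elementwise soft-thresholding.
    out = np.sign(beta) * np.maximum(np.abs(beta) - step * alpha * lam, 0.0)
    # Group part: shrink each group's Euclidean norm, zeroing weak groups.
    for g in np.unique(groups):
        idx = groups == g
        norm = np.linalg.norm(out[idx])
        thresh = step * (1.0 - alpha) * lam * np.sqrt(idx.sum())
        out[idx] = 0.0 if norm <= thresh else out[idx] * (1.0 - thresh / norm)
    return out


def fit_sgl_logistic(X, y, groups, lam=0.05, alpha=0.5, n_iter=500):
    """Fit SGL-penalized logistic regression by proximal gradient descent."""
    n, p = X.shape
    beta, b0 = np.zeros(p), 0.0
    # Step size 1/L, where L bounds the curvature of the mean logistic loss.
    design = np.hstack([X, np.ones((n, 1))])
    step = 4.0 * n / np.linalg.norm(design, 2) ** 2
    for _ in range(n_iter):
        prob = 1.0 / (1.0 + np.exp(-(X @ beta + b0)))
        resid = prob - y
        b0 -= step * resid.mean()  # the intercept is not penalized
        beta = sgl_prox(beta - step * (X.T @ resid) / n, groups, step, lam, alpha)
    return beta, b0


# Synthetic stand-in for an expression matrix: 200 cells, 3 blocks of 20
# correlated genes. Only the first block drives the binary label.
rng = np.random.default_rng(0)
n_cells, block = 200, 20
factors = rng.normal(size=(n_cells, 3))
X = np.repeat(factors, block, axis=1) + 0.5 * rng.normal(size=(n_cells, 3 * block))
y = (factors[:, 0] + 0.3 * rng.normal(size=n_cells) > 0).astype(float)

groups = group_genes(X, n_groups=3)
beta, b0 = fit_sgl_logistic(X, y, groups)
auc = roc_auc_score(y, X @ beta + b0)
print(f"genes kept: {np.count_nonzero(beta)} of {beta.size}, training AUC: {auc:.3f}")
```

The AUC here is computed on the training data only to keep the sketch short. A real comparison of the kind the abstract describes would evaluate on held-out cells and tune `lam` and `alpha` by cross-validation.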