Pierre-Jean Morgane, Mauger Florence, Deleuze Jean-François, Le Floch Edith
Centre National de Recherche en Génomique Humaine, CEA, Université de Paris-Saclay, Evry, France.
Bioinformatics. 2022 Jan 27;38(4):900-907. doi: 10.1093/bioinformatics/btab786.
It is increasingly common to perform multi-omics analyses that explore the genome at several levels rather than a single one. Through integrative statistical methods, multi-omics data can reveal new biological processes, potential biomarkers and subgroups in a cohort. Matrix factorization (MF) is an unsupervised statistical method that clusters individuals and also reveals relevant omics variables from the various blocks.
Here, we present PIntMF (Penalized Integrative Matrix Factorization), an MF model with sparsity, positivity and equality constraints. To induce sparsity in the model, we use a classical Lasso penalization on the variable and individual matrices. For the matrix of samples, sparsity helps the clustering, while normalization of the inferred coefficients (which satisfies the equality constraint) is added to improve interpretation. Moreover, we add automatic tuning of the sparsity parameters using the glmnet package. We also propose three criteria to help the user choose the number of latent variables. PIntMF was compared with other state-of-the-art integrative methods, including feature selection techniques, on both synthetic and real data. PIntMF succeeds in finding relevant clusters as well as relevant variables in two types of simulated data (correlated and uncorrelated). Next, PIntMF was applied to two real datasets (diet and cancer), where it revealed interpretable clusters linked to the available clinical data. Our method outperforms the existing ones on two criteria (clustering and variable selection). We show that PIntMF is an easy, fast and powerful tool to extract patterns and cluster samples from multi-omics data.
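To make the constraints described above concrete, the following is a minimal Python sketch of a sparse, non-negative integrative factorization in which every block X_k is approximated by W H_k with a shared sample matrix W. It is not the PIntMF implementation: the function name `sparse_integrative_mf`, the fixed penalty `lam`, and the proximal-gradient updates are assumptions chosen for illustration, whereas the abstract states that the actual method tunes the sparsity parameters automatically with glmnet. The equality constraint is approximated here by rescaling each row of W to sum to one after a non-negative projection, and for brevity the Lasso penalty is applied only to the variable matrices.

```python
import numpy as np


def soft_threshold(a, t):
    """Proximal operator of the L1 norm (the Lasso shrinkage step)."""
    return np.sign(a) * np.maximum(np.abs(a) - t, 0.0)


def sparse_integrative_mf(blocks, n_latent, lam=0.1, n_iter=200, seed=0):
    """Illustrative sketch: factorize each block X_k as W @ H_k.

    W (samples x latent) is shared, non-negative, with rows summing to 1.
    Each H_k (latent x variables) is made sparse by an L1 penalty `lam`.
    All names and update rules are hypothetical, not PIntMF's own.
    """
    rng = np.random.default_rng(seed)
    n = blocks[0].shape[0]
    sizes = [b.shape[1] for b in blocks]
    X = np.hstack(blocks)  # concatenate the omics blocks column-wise

    W = rng.random((n, n_latent))
    W /= W.sum(axis=1, keepdims=True)
    H = np.zeros((n_latent, X.shape[1]))

    for _ in range(n_iter):
        # H step: one proximal-gradient (ISTA) step on the Lasso problem.
        step = 1.0 / (np.linalg.norm(W, 2) ** 2 + 1e-12)
        H = soft_threshold(H - step * W.T @ (W @ H - X), step * lam)

        # W step: gradient step, projection onto W >= 0, then row rescaling.
        step = 1.0 / (np.linalg.norm(H, 2) ** 2 + 1e-12)
        W = np.maximum(W - step * (W @ H - X) @ H.T, 0.0)
        row_sums = W.sum(axis=1, keepdims=True)
        safe_sums = np.where(row_sums > 0, row_sums, 1.0)
        # Rows that collapsed to zero fall back to uniform weights.
        W = np.where(row_sums > 0, W / safe_sums, 1.0 / n_latent)

    H_blocks = np.split(H, np.cumsum(sizes)[:-1], axis=1)
    return W, H_blocks


# Toy usage on two random "omics" blocks measured on the same 20 samples.
demo_rng = np.random.default_rng(1)
block_a = demo_rng.random((20, 8))
block_b = demo_rng.random((20, 5))
W_demo, H_demo = sparse_integrative_mf([block_a, block_b], n_latent=3, lam=0.05)
clusters = W_demo.argmax(axis=1)  # assign each sample to its dominant latent variable
```

Because each row of W is non-negative and sums to one, its entries can be read as membership weights, which is one way to see why the normalization helps interpretation; the non-zero entries of each H_k indicate the variables retained in that block.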
An R package is available at https://github.com/mpierrejean/pintmf.
Supplementary data are available at Bioinformatics online.