Danning Rebecca, Hu Frank B, Lin Xihong
Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA 02215.
Department of Nutritional Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA 02215.
Proc Natl Acad Sci U S A. 2025 Apr 29;122(17):e2423341122. doi: 10.1073/pnas.2423341122. Epub 2025 Apr 23.
Disease and behavior subtype identification is of significant interest in biomedical research. However, in many settings, subtype discovery is limited by a lack of robust statistical clustering methods appropriate for binary data. Here, we introduce LACE-UP [latent class analysis ensembled with UMAP (uniform manifold approximation and projection) and PCA (principal components analysis)], an ensemble machine-learning method for clustering multidimensional binary data that does not require prespecifying the number of clusters and is robust to realistic data settings, such as the correlation of variables observed from the same individual and the inclusion of variables unrelated to the underlying subtype. The method ensembles latent class analysis, a model-based clustering method; principal components analysis, a spectral signal processing method; and UMAP, a cutting-edge model-free dimensionality reduction algorithm. In simulations, LACE-UP outperforms gold-standard techniques across a variety of realistic scenarios, including in the presence of correlated and extraneous data. We apply LACE-UP to dietary behavior data from the UK Biobank to demonstrate its power to uncover interpretable dietary subtypes that are associated with lipids and cardiovascular risk.
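To make the model-based component named in the abstract concrete, here is a minimal sketch of latent class analysis on binary data, written as a mixture of independent Bernoulli variables fitted by expectation-maximization. It is not the LACE-UP algorithm: it leaves out the PCA and UMAP components and the ensembling step, and unlike LACE-UP it requires the number of classes to be fixed in advance. The function names `fit_lca` and `simulate`, the item-endorsement probabilities, and the sample sizes are all hypothetical choices for illustration and do not come from the paper.

```python
import numpy as np


def fit_lca(X, n_classes, n_iter=200, seed=0, eps=1e-9):
    """Fit a latent class model (mixture of independent Bernoullis) by EM.

    Illustrative sketch only; assumes X is an (n, d) array of 0/1 values.
    Returns class weights, per-class item probabilities, and hard labels.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    weights = np.full(n_classes, 1.0 / n_classes)
    # Random starting probabilities break the symmetry between classes.
    probs = rng.uniform(0.25, 0.75, size=(n_classes, d))
    for _ in range(n_iter):
        # E-step: log joint probability of each row with each class.
        log_p = (
            X @ np.log(probs + eps).T
            + (1.0 - X) @ np.log(1.0 - probs + eps).T
            + np.log(weights + eps)
        )
        log_p -= log_p.max(axis=1, keepdims=True)  # numerical stability
        resp = np.exp(log_p)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: update class weights and item probabilities.
        nk = resp.sum(axis=0)
        weights = nk / n
        probs = (resp.T @ X) / (nk[:, None] + eps)
    return weights, probs, resp.argmax(axis=1)


def simulate(n_per_class=200, seed=1):
    """Draw binary data from two hypothetical subtypes with opposite profiles."""
    rng = np.random.default_rng(seed)
    profiles = np.array([[0.9] * 5 + [0.1] * 5, [0.1] * 5 + [0.9] * 5])
    truth = np.repeat([0, 1], n_per_class)
    X = (rng.random((truth.size, 10)) < profiles[truth]).astype(float)
    return X, truth


X, truth = simulate()
weights, probs, labels = fit_lca(X, n_classes=2)
# Class labels are arbitrary, so score agreement up to a label swap.
agreement = max(np.mean(labels == truth), np.mean(labels != truth))
print(f"agreement with simulated subtypes: {agreement:.3f}")
```

The simulated subtypes here are well separated and the items are conditionally independent within each class, which is the setting where plain latent class analysis is expected to work. The abstract's motivation for ensembling is the harder case where variables are correlated within individuals or unrelated to the subtype.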