Division of Biostatistics and Epidemiology, Cincinnati Children's Hospital Medical Center, Cincinnati, Ohio, USA.
Division of Statistics and Data Science, Department of Mathematical Sciences, University of Cincinnati, Cincinnati, Ohio, USA.
Stat Med. 2022 Feb 20;41(4):681-697. doi: 10.1002/sim.9279. Epub 2021 Dec 12.
In omics experiments, estimation and variable selection can involve thousands of proteins/genes observed from a relatively small number of subjects. Many regression regularization procedures have been developed for estimation and variable selection in such high-dimensional problems. However, approaches have predominantly focused on linear regression models that ignore correlation arising from long sequences of repeated measurements on the outcome. Our work is motivated by the need to identify proteomic biomarkers that improve the prediction of rapid lung-function decline for individuals with cystic fibrosis (CF) lung disease. We extend four Bayesian penalized regression approaches for a Gaussian linear mixed effects model with nonstationary covariance structure to account for the complicated structure of longitudinal lung function data while simultaneously estimating unknown parameters and selecting important protein isoforms to improve predictive performance. Different types of shrinkage priors are evaluated to induce variable selection in a fully Bayesian framework. The approaches are studied with simulations. We apply the proposed method to real proteomics and lung-function outcome data from our motivating CF study, identifying a set of relevant clinical/demographic predictors and a proteomic biomarker for rapid decline of lung function. We also illustrate the methods on CD4 yeast cell-cycle genomic data, confirming that the proposed method identifies transcription factors that have been highlighted in the literature for their importance as cell cycle transcription factors.
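As a rough illustration of the idea summarized above, the following minimal sketch shows how a shrinkage prior pulls regression coefficients toward zero when the outcome is measured repeatedly and the within-subject errors are correlated. It is not the authors' implementation. It assumes a plain Gaussian (ridge-type) prior with a fixed scale `tau` in place of the four priors studied in the paper, treats the covariance parameters as known, and uses a random intercept plus a Brownian-motion term plus measurement error as one example of a nonstationary covariance. All function names, parameter values, and the simulated data are hypothetical.

```python
import numpy as np

def subject_covariance(times, sd_intercept=1.0, sd_bm=0.5, sd_noise=0.3):
    """Assumed within-subject covariance: random intercept + Brownian motion + noise.

    The Brownian-motion term sd_bm^2 * min(s, t) depends on the actual times,
    not only on the lag |s - t|, which is what makes the covariance nonstationary.
    """
    t = np.asarray(times, dtype=float)
    n = t.size
    random_intercept = sd_intercept**2 * np.ones((n, n))
    brownian = sd_bm**2 * np.minimum.outer(t, t)
    noise = sd_noise**2 * np.eye(n)
    return random_intercept + brownian + noise

def posterior_mean(X, y, V_blocks, tau):
    """Posterior mean of beta under beta ~ N(0, tau^2 I) with known covariance.

    Computes (X' V^{-1} X + I / tau^2)^{-1} X' V^{-1} y, accumulating the
    sums one subject (one covariance block) at a time.
    """
    p = X.shape[1]
    precision = np.eye(p) / tau**2
    rhs = np.zeros(p)
    start = 0
    for V in V_blocks:
        m = V.shape[0]
        Xi, yi = X[start:start + m], y[start:start + m]
        Vinv_Xi = np.linalg.solve(V, Xi)   # V^{-1} X_i without forming the inverse
        precision += Xi.T @ Vinv_Xi
        rhs += Vinv_Xi.T @ yi              # X_i' V^{-1} y_i (V is symmetric)
        start += m
    return np.linalg.solve(precision, rhs)

# Simulated toy data (hypothetical): many predictors, few subjects.
rng = np.random.default_rng(0)
n_subjects, n_visits, p = 20, 5, 200
times = np.arange(1, n_visits + 1)
V = subject_covariance(times)
V_blocks = [V] * n_subjects

Z = rng.normal(size=(n_subjects, p))       # one predictor profile per subject
X = np.repeat(Z, n_visits, axis=0)         # repeated across that subject's visits
beta_true = np.zeros(p)
beta_true[:3] = 2.0                        # only three predictors matter
errors = np.concatenate(
    [rng.multivariate_normal(np.zeros(n_visits), V) for _ in range(n_subjects)]
)
y = X @ beta_true + errors

beta_tight = posterior_mean(X, y, V_blocks, tau=0.1)   # strong shrinkage
beta_loose = posterior_mean(X, y, V_blocks, tau=10.0)  # weak shrinkage
print(np.linalg.norm(beta_tight), np.linalg.norm(beta_loose))
```

A Gaussian prior shrinks all coefficients but sets none exactly to zero. Priors designed for variable selection place more mass near zero, and a fully Bayesian analysis would also estimate the covariance parameters and the prior scale rather than fixing them as this sketch does.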