Masica David L, Sosnay Patrick R, Raraigh Karen S, Cutting Garry R, Karchin Rachel
Department of Biomedical Engineering and Institute for Computational Medicine, The Johns Hopkins University, Baltimore, MD, USA.
McKusick-Nathans Institute of Genetic Medicine.
Hum Mol Genet. 2015 Apr 1;24(7):1908-17. doi: 10.1093/hmg/ddu607. Epub 2014 Dec 8.
Predicting the impact of genetic variation on human health remains an important and difficult challenge. Often, algorithmic classifiers are tasked with predicting binary traits (e.g. positive or negative for a disease) from missense variation. Though useful, this arrangement is limiting and contrived, because human diseases often comprise a spectrum of severities, rather than a discrete partitioning of patient populations. Furthermore, labeling variants as causal or benign can be error prone, which is problematic for training supervised learning algorithms (the so-called garbage in, garbage out phenomenon). We explore the potential value of training classifiers using continuous-valued quantitative measurements, rather than binary traits. Using 20 variants from cystic fibrosis transmembrane conductance regulator (CFTR) nucleotide-binding domains and six quantitative measures of cystic fibrosis (CF) severity, we trained classifiers to predict CF severity from CFTR variants. Employing cross validation, classifier prediction and measured clinical/functional values were significantly correlated for four of six quantitative traits (correlation P-values from 1.35 × 10⁻⁴ to 4.15 × 10⁻³). Classifiers were also able to stratify variants by three clinically relevant risk categories with 85-100% accuracy, depending on which of the six quantitative traits was used for training. Finally, we characterized 11 additional CFTR variants using clinical sweat chloride testing, two functional assays, or all three diagnostics, and validated our classifier using blind prediction. Predictions were within the measured sweat chloride range for seven of eight variants, and captured the differential impact of specific variants on the two functional assays. This work demonstrates a promising and novel framework for assessing the impact of genetic variation.
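The evaluation idea in the abstract (train on a continuous trait, predict each held-out variant under cross validation, then correlate predictions with measurements) can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the abstract does not name the learning algorithm or the features, so a one-feature least-squares line stands in for the classifier, leave-one-out stands in for the cross-validation scheme, and the `feature` and `trait` values are invented toy numbers, not data from the study.

```python
from math import sqrt


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def leave_one_out_predictions(xs, ys):
    """Predict each held-out value from a model trained on all other points."""
    preds = []
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        slope, intercept = fit_line(train_x, train_y)
        preds.append(slope * xs[i] + intercept)
    return preds


def pearson_r(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


# Hypothetical toy data: one invented feature score per variant and an
# invented continuous severity measurement. Not taken from the study.
feature = [0.1, 0.3, 0.4, 0.6, 0.7, 0.9]
trait = [22.0, 41.0, 48.0, 70.0, 77.0, 101.0]

predicted = leave_one_out_predictions(feature, trait)
r = pearson_r(predicted, trait)
print(f"leave-one-out Pearson r = {r:.3f}")
```

Because every prediction comes from a model that never saw the held-out variant, the correlation reflects generalization rather than fit to the training set, which is the point of reporting cross-validated correlations for a continuous trait.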