MRC Biostatistics Unit, Institute of Public Health, Robinson Way, Cambridge CB2 0SR, UK.
Nucleic Acids Res. 2013 Feb 1;41(3):1450-63. doi: 10.1093/nar/gks1339. Epub 2012 Dec 28.
Cell type-specific gene expression in humans involves complex interactions between regulatory factors and DNA at enhancers and promoters. Mapping studies of expression quantitative trait loci (eQTLs), transcription factors (TFs) and chromatin markers have become widely used tools for identifying gene regulatory elements, but predicting their target genes remains a major challenge. Here, we integrate genome-wide data on TF-binding sites, chromatin markers and functional annotations to predict the genes associated with human eQTLs. Using a random forest classifier, we found that genomic proximity plus five TF and chromatin features can predict >90% of target genes within 1 megabase of eQTLs. Although it is regularly used to map target genes, proximity is not a good indicator of eQTL targets for genes 150 kilobases away; insulators, TF co-occurrence, open chromatin and functional similarities between TFs and genes are better indicators. Using all six features in the classifier achieved an area under the specificity and sensitivity curve of 0.91, compared with at most 0.75 for any single feature. We hope this study will not only provide validation of eQTL-mapping studies, but also provide insight into the molecular mechanisms by which genetic variation can influence gene expression.
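The comparison of 0.91 against 0.75 rests on an area-under-the-curve score. As a minimal sketch, assuming that the "specificity and sensitivity curve" is the conventional receiver operating characteristic (ROC) curve, the area can be computed as the fraction of (true target, non-target) pairs that a score ranks correctly. The function name `roc_auc`, the variable names and the toy numbers below are illustrative assumptions, not data or code from the study.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via the pairwise-ranking identity.

    Returns the fraction of (positive, negative) pairs in which the
    positive example has the higher score; ties count as one half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up candidate genes near one eQTL: 1 = true target, 0 = not a target.
labels = [1, 0, 1, 0, 0, 1]
# Hypothetical score from a single feature (e.g. proximity alone).
single_feature_score = [0.9, 0.8, 0.2, 0.3, 0.1, 0.4]
# Hypothetical score from a classifier that combines several features.
combined_score = [0.9, 0.65, 0.7, 0.2, 0.1, 0.6]

auc_single = roc_auc(labels, single_feature_score)
auc_combined = roc_auc(labels, combined_score)
print(f"single feature AUC: {auc_single:.3f}")
print(f"combined AUC:       {auc_combined:.3f}")
```

An area of 0.5 corresponds to random ranking and 1.0 to perfect separation, which is why a rise from 0.75 to 0.91 indicates that the combined features rank true targets above non-targets considerably more often.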