Using epigenomics data to predict gene expression in lung cancer.

Authors

Li Jeffery, Ching Travers, Huang Sijia, Garmire Lana X

Publication information

BMC Bioinformatics. 2015;16 Suppl 5(Suppl 5):S10. doi: 10.1186/1471-2105-16-S5-S10. Epub 2015 Mar 18.

Abstract

BACKGROUND

Epigenetic alterations are known to correlate with changes in gene expression in various diseases, including cancers. However, quantitative models that accurately predict the up- or down-regulation of gene expression are currently lacking.

METHODS

A new machine learning-based method of gene expression prediction is developed in the context of lung cancer. The method uses Illumina Infinium HumanMethylation450K BeadChip CpG methylation array data from paired lung cancer and adjacent normal tissues in The Cancer Genome Atlas (TCGA), together with histone modification marker ChIP-Seq data from the ENCODE project, to predict differential expression in RNA-Seq data from TCGA lung cancers. It considers a comprehensive list of 1424 features spanning four categories: CpG methylation, histone H3 methylation modification, nucleotide composition, and conservation. Various feature selection and classification methods are compared, and the best model is selected by 10-fold cross-validation on the training data set.
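As a rough illustration of the feature selection step, the sketch below scores features with a simplified ReliefF-style update for two classes (for example, up- versus down-regulated genes). It is not the authors' implementation: the function name `relieff_scores`, the Manhattan distance on range-scaled features, the choice of `k`, and the toy data are all assumptions made for this example.

```python
def relieff_scores(X, y, k=3):
    """Score features with a simplified ReliefF-style update (two classes).

    X: list of rows, each a list of numeric feature values.
    y: list of class labels, one per row.
    Returns one weight per feature; larger means more discriminative.
    """
    n, d = len(X), len(X[0])

    # Scale differences by each feature's range so weights are comparable.
    spans = []
    for j in range(d):
        col = [row[j] for row in X]
        spans.append((max(col) - min(col)) or 1.0)

    def diff(a, b, j):
        return abs(a[j] - b[j]) / spans[j]

    def dist(a, b):
        # Manhattan distance over scaled features (an assumption of this sketch).
        return sum(diff(a, b, j) for j in range(d))

    weights = [0.0] * d
    for i in range(n):
        # Rank all other samples by distance to sample i.
        others = sorted((m for m in range(n) if m != i),
                        key=lambda m: dist(X[i], X[m]))
        hits = [m for m in others if y[m] == y[i]][:k]    # nearest same-class
        misses = [m for m in others if y[m] != y[i]][:k]  # nearest other-class
        for j in range(d):
            # Penalize features that differ among neighbors of the same class.
            if hits:
                weights[j] -= sum(diff(X[i], X[m], j) for m in hits) / (len(hits) * n)
            # Reward features that differ between neighbors of different classes.
            if misses:
                weights[j] += sum(diff(X[i], X[m], j) for m in misses) / (len(misses) * n)
    return weights


# Toy data (hypothetical): feature 0 separates the classes, feature 1 is noise.
X = [[0.10, 0.50], [0.20, 0.10], [0.15, 0.90],
     [0.90, 0.40], [0.80, 0.95], [0.85, 0.20]]
y = [0, 0, 0, 1, 1, 1]
scores = relieff_scores(X, y, k=2)
```

In a pipeline like the one described, the top-ranked features would then be passed to a classifier, and the ranking would be recomputed inside each cross-validation fold so that held-out data does not influence the selection.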

RESULTS

The best model, comprising 67 features, is obtained with ReliefF-based feature selection and random forest classification, with AUC = 0.864 from 10-fold cross-validation on the training set and AUC = 0.836 on the testing set. The selected features cover all four data types, with histone H3 methylation modification (32 features) and CpG methylation (15 features) being the most abundant. In drop-off tests that remove the features of one data type at a time, removing the CpG methylation features causes the largest reduction in model performance. In the best model, 19 selected features are from the promoter regions (TSS200 and TSS1500), the highest count among all locations relative to transcripts. Sequentially dropping CpG methylation features by region on the protein-coding transcripts shows that promoter regions contribute most to the accurate prediction of gene expression.
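For readers less familiar with the AUC figures quoted above, the following minimal sketch computes AUC as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The function name `roc_auc` and the toy labels and scores are assumptions for illustration and are not taken from the paper.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: list of 0/1 class labels (1 = positive class).
    scores: list of classifier scores, higher meaning "more positive".
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0   # positive ranked above negative
            elif p == q:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Toy example (hypothetical): 3 of the 4 positive/negative pairs are ordered correctly.
auc = roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.1])  # 0.75
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, which gives a scale for reading the reported cross-validation and test values.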

CONCLUSIONS

By considering a comprehensive list of epigenomic and genomic features, we have constructed an accurate model to predict transcriptomic differential expression, exemplified in lung cancer.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/897a/4402699/aba2c0c24d70/1471-2105-16-S5-S10-1.jpg
