Shihab Hashem A, Rogers Mark F, Campbell Colin, Gaunt Tom R
MRC Integrative Epidemiology Unit (IEU), University of Bristol, Bristol, UK.
Intelligent Systems Laboratory, University of Bristol, Bristol, UK.
Bioinformatics. 2017 Jun 15;33(12):1751-1757. doi: 10.1093/bioinformatics/btx028.
A major cause of autosomal dominant disease is haploinsufficiency, whereby a single copy of a gene is not sufficient to maintain the normal function of that gene. A large proportion of existing methods for predicting haploinsufficiency incorporate biological networks, e.g. protein-protein interaction networks, which have recently been shown to introduce study bias. As a result, these methods tend to perform best on well-studied genes and underperform on less-studied genes. The advent of large genome sequencing consortia, such as the 1000 Genomes Project, the NHLBI Exome Sequencing Project and the Exome Aggregation Consortium, creates an urgent need for unbiased haploinsufficiency prediction methods.
Here, we describe a machine learning approach, called HIPred, that integrates genomic and evolutionary information from ENSEMBL with functional annotations from the Encyclopaedia of DNA Elements consortium and the NIH Roadmap Epigenomics Project to predict haploinsufficiency without the study bias described above. We benchmark HIPred on several datasets and show that our unbiased method performs as well as, and in most cases outperforms, existing biased algorithms.
HIPred scores for all gene identifiers are available at: https://github.com/HAShihab/HIPred
Contact: h.shihab@bristol.ac.uk or tom.gaunt@bristol.ac.uk.
Supplementary data are available at Bioinformatics online.