Thomas Philippe, Durek Pawel, Solt Illés, Klinger Bertram, Witzel Franziska, Schulthess Pascal, Mayer Yvonne, Tikk Domonkos, Blüthgen Nils, Leser Ulf
Humboldt-Universität zu Berlin, Institute for Computer Science, Knowledge Management in Bioinformatics, 10099 Berlin, Germany, Institute of Pathology, Charité-Universitätsmedizin Berlin, Deutsches Rheuma Forschungszentrum, Charitéplatz 1, 10117 Berlin, Germany, Department of Telecommunications and Media Informatics, Budapest University of Technology and Economics, 1117 Budapest, Hungary and Integrative Research Institute for the Life Sciences, Humboldt Universität zu Berlin, Philippstr. 13 Haus 18, 10115 Berlin, Germany.
Bioinformatics. 2015 Apr 15;31(8):1258-66. doi: 10.1093/bioinformatics/btu795. Epub 2014 Nov 29.
A highly interlinked network of transcription factors (TFs) orchestrates the context-dependent expression of human genes. ChIP-chip experiments that interrogate the binding of particular TFs to genomic regions are used to reconstruct gene regulatory networks at genome scale, but are plagued by high false-positive rates. Meanwhile, a large body of knowledge on high-quality regulatory interactions remains largely unexplored, as it is available only in natural language descriptions scattered over millions of scientific publications. Such data are hard to extract, and regulatory databases currently contain, taken together, only 503 regulatory relations between human TFs.
We developed a text-mining-assisted workflow to systematically extract knowledge about regulatory interactions between human TFs from the biological literature. We applied this workflow to the entire Medline, which helped us to identify more than 45 000 sentences potentially describing such relationships. We ranked these sentences by a machine-learning approach. The top 2500 sentences contained ∼900 sentences that describe relations already known in databases. By manually curating the remaining 1625 top-ranking sentences, we obtained more than 300 validated regulatory relationships that were not previously present in any regulatory database. Full-text curation allowed us to obtain detailed information on the strength of the experimental evidence supporting a relationship.
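The triage step described above (rank candidate sentences, keep the top-ranked ones, set aside those whose relation is already in a database, and pass the rest to curators) can be illustrated with a minimal sketch. This is not the authors' implementation: the `Candidate` class, the `triage` function, the `score` field and the placeholder TF names are all hypothetical, and the scores stand in for whatever the machine-learning ranker would produce.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """One candidate sentence (hypothetical structure, for illustration only)."""
    sentence: str
    regulator: str
    target: str
    score: float  # assumed confidence from some ranking model


def triage(candidates, known_relations, top_k):
    """Keep the top_k highest-scoring candidates and split them into
    relations already in a database and relations left for manual curation."""
    ranked = sorted(candidates, key=lambda c: c.score, reverse=True)[:top_k]
    already_known = [c for c in ranked if (c.regulator, c.target) in known_relations]
    to_curate = [c for c in ranked if (c.regulator, c.target) not in known_relations]
    return already_known, to_curate


# Toy input with placeholder names; none of this is real data.
candidates = [
    Candidate("TF_A activates TF_B.", "TF_A", "TF_B", 0.95),
    Candidate("TF_C represses TF_D.", "TF_C", "TF_D", 0.80),
    Candidate("TF_E was measured alongside TF_F.", "TF_E", "TF_F", 0.10),
]
known_relations = {("TF_A", "TF_B")}

already_known, to_curate = triage(candidates, known_relations, top_k=2)
print([c.sentence for c in already_known])
print([c.sentence for c in to_curate])
```

In this toy run the low-scoring third sentence falls outside the top-ranked set, the first is recognised as already known, and only the second would reach a curator.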
We increased the curated information about the human core transcriptional network by >60% compared with the current content of regulatory databases. We observed improved performance when using the network for disease gene prioritization compared with the state of the art.
The web service is freely accessible at http://fastforward.sys-bio.net/.
leser@informatik.hu-berlin.de or nils.bluethgen@charite.de
Supplementary data are available at Bioinformatics online.