Bravo Àlex, Piñero Janet, Queralt-Rosinach Núria, Rautschka Michael, Furlong Laura I
Research Programme on Biomedical Informatics (GRIB), IMIM, DCEXS, Universitat Pompeu Fabra, Barcelona, Spain.
BMC Bioinformatics. 2015 Feb 21;16:55. doi: 10.1186/s12859-015-0472-9.
Current biomedical research needs to exploit the large amount of information reported in scientific publications. Automated text mining approaches, in particular those aimed at finding relationships between entities, are key to identifying actionable knowledge in free-text repositories. We present the BeFree system, which identifies relationships between biomedical entities, with a special focus on genes and their associated diseases.
By exploiting morpho-syntactic information of the text, BeFree is able to identify gene-disease, drug-disease and drug-target associations with state-of-the-art performance. The application of BeFree to real-case scenarios shows its effectiveness in extracting information relevant for translational research. We show the value of the gene-disease associations extracted by BeFree through a number of analyses and integration with other data sources. BeFree succeeds in identifying genes associated with depression, a major cause of morbidity worldwide, that are not present in other public resources. Moreover, large-scale extraction and analysis of gene-disease associations, and their integration with current biomedical knowledge, provided insights into the kind of information that can be found in the literature, and raised challenges regarding data prioritization and curation. We found that only a small proportion of the gene-disease associations discovered with BeFree are collected in expert-curated databases. Thus, there is a pressing need for strategies other than manual curation to review, prioritize and curate text-mining data and incorporate it into domain-specific databases. We present our strategy for data prioritization and discuss its implications for supporting biomedical research and applications.
BeFree is a novel text mining system that performs competitively in identifying gene-disease, drug-disease and drug-target associations. Our analyses show that mining only a small fraction of MEDLINE yields a large dataset of gene-disease associations, and that only a small proportion of this dataset (2%) is recorded in curated resources, which raises several issues of data prioritization and curation. We propose the joint analysis of text-mined data and expert-curated data as a suitable approach both to assess data quality and to highlight novel and interesting information.
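The comparison between text-mined and curated associations described above can be illustrated with a small sketch. This is not BeFree's implementation: the function names, the representation of an association as a `(gene, disease)` pair, and the identifiers in the toy data are all assumptions made for illustration only.

```python
def curated_fraction(text_mined, curated):
    """Share of text-mined (gene, disease) pairs that also appear in the curated set."""
    mined = set(text_mined)
    if not mined:
        return 0.0
    return len(mined & set(curated)) / len(mined)


def split_by_curation(text_mined, curated):
    """Split text-mined pairs into those backed by curation and those that are not."""
    mined, known = set(text_mined), set(curated)
    return mined & known, mined - known


# Toy data with made-up gene identifiers (hypothetical, not taken from the paper).
text_mined = {
    ("GENE_A", "depression"),
    ("GENE_B", "depression"),
    ("GENE_C", "asthma"),
    ("GENE_D", "asthma"),
}
curated = {
    ("GENE_A", "depression"),
    ("GENE_E", "asthma"),
}

fraction = curated_fraction(text_mined, curated)
supported, novel = split_by_curation(text_mined, curated)
print(f"{fraction:.0%} of text-mined associations are in the curated set")
print("candidates for review:", sorted(novel))
```

In a joint analysis of this kind, the pairs in `supported` give a rough check on the quality of the text-mined data, while the pairs in `novel` are the ones that would need prioritization before curation.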