Yu Shirui, Dong Peng, Li Junlian, Tang Xiaoli, Li Xiaoying
National Science Library (Chengdu), Chinese Academy of Sciences, Chengdu, 610041, China.
Department of Information Resources Management, School of Economics and Management, University of Chinese Academy of Sciences, Beijing, 100190, China.
BMC Med Inform Decis Mak. 2025 Mar 18;25(1):136. doi: 10.1186/s12911-025-02893-0.
Biomedical semantic relationship extraction can reveal important biomedical entities and the semantic relationships between them, providing a crucial foundation for biomedical knowledge discovery, clinical decision making, and other artificial intelligence applications. Identifying causal relationships between diseases is a significant research field, since it expedites the identification of underlying disease pathogenesis mechanisms and promotes better disease prevention and treatment. SemRep is an effective tool for semantic relationship extraction in the biomedical field, but it is not accurate enough for disease causality extraction, which creates challenges for downstream tasks. In this study, we propose an optimization strategy for SemRep to enhance its accuracy in disease causality extraction.
This study aims to optimize the disease causality extraction of the SemRep tool by constructing a semantic predicate vocabulary that precisely expresses disease causality, in order to support the automatic extraction of disease causality knowledge from biomedical literature. The proposed method involves four steps. First, we obtained a collection of semantic feature words expressing disease causality, based on existing causality predicate studies and on the disease causality pairs extracted from SemMedDB. Second, we constructed a disease causality semantic predicate vocabulary by filtering and evaluating these clue words using quantitative comparisons. Third, we extracted disease causality pairs from the biomedical literature using 36 semantic predicates with an accuracy greater than 80% for more meaningful knowledge discovery. Finally, we conducted knowledge discovery based on the extracted disease causality triples, covering unidirectional disease causality, bidirectional disease causality, and two specific types of disease causality: primary disease causality and rare disease causality.
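The distinction between unidirectional and bidirectional causality in the final step can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that each extracted triple is a `(cause, predicate, effect)` tuple, that a "type" is a distinct disease pair, and that a pair counts as bidirectional when it has been extracted in both directions. The function name `split_by_direction` and the example triples are hypothetical.

```python
from collections import Counter


def split_by_direction(triples):
    """Group (cause, predicate, effect) triples by disease pair.

    Returns two dicts mapping a disease pair to its number of triples:
    - unidirectional: ordered (cause, effect) pairs seen in one direction only
    - bidirectional: unordered pairs (stored sorted) seen in both directions
    This grouping is an assumption made for illustration.
    """
    # Count how many triples support each ordered (cause, effect) pair.
    pair_counts = Counter((cause, effect) for cause, _, effect in triples)

    unidirectional = {}
    bidirectional = {}
    for (cause, effect), n in pair_counts.items():
        if (effect, cause) in pair_counts:
            # The reverse direction was also extracted: merge both directions.
            key = tuple(sorted((cause, effect)))
            bidirectional[key] = bidirectional.get(key, 0) + n
        else:
            unidirectional[(cause, effect)] = n
    return unidirectional, bidirectional


# Hypothetical triples, for illustration only.
example_triples = [
    ("obesity", "leads to", "type 2 diabetes"),
    ("obesity", "results in", "type 2 diabetes"),
    ("depression", "causes", "insomnia"),
    ("insomnia", "causes", "depression"),
]
uni, bi = split_by_direction(example_triples)
print(uni)
print(bi)
```

Under these assumptions, the number of keys in each dictionary corresponds to the number of types and the sum of the values to the number of triples.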
We obtained a disease causality semantic predicate vocabulary containing 50 textual predicates with an accuracy above 40%. The 36 semantic predicates from the 60% accuracy group were used for disease causality extraction, yielding 259,434 disease causality pairs for subsequent knowledge discovery. Among them, we found 176,010 unidirectional disease causality triples of 92,557 types and 83,424 bidirectional disease causality triples of 6,084 types. Two other types of disease causality, primary disease causality and rare disease causality, were also discovered.
The novelty of this research is that the proposed method enhances the disease causality extraction of the SemRep tool, resulting in more accurate and comprehensive disease causality extraction. It also facilitates automatic disease causality extraction from large-scale biomedical literature. Additionally, the quantified causality predicate vocabulary makes it possible to customize extraction with respect to accuracy and comprehensiveness, allowing disease causality to be extracted flexibly according to the actual circumstances.
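The customization described above amounts to choosing an accuracy cut-off over the quantified vocabulary. The following minimal Python sketch shows the idea; it is not taken from the paper. The function name `select_predicates`, the predicate phrases, and all accuracy values are invented for illustration and are not the paper's data.

```python
def select_predicates(vocabulary, min_accuracy):
    """Return the predicates whose measured accuracy meets the cut-off.

    vocabulary: dict mapping a predicate phrase to its accuracy in [0, 1].
    A higher cut-off favors accuracy; a lower one favors coverage.
    """
    return sorted(p for p, acc in vocabulary.items() if acc >= min_accuracy)


# Hypothetical vocabulary: phrases and accuracies are made up for this sketch.
hypothetical_vocabulary = {
    "secondary to": 0.91,
    "leads to": 0.85,
    "results in": 0.78,
    "is associated with": 0.45,
}

strict = select_predicates(hypothetical_vocabulary, 0.80)
broad = select_predicates(hypothetical_vocabulary, 0.40)
print(strict)
print(broad)
```

A stricter threshold keeps fewer predicates and should yield cleaner causality pairs, while a looser threshold admits more predicates and more candidate pairs at the cost of precision.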