Rewriting and suppressing UMLS terms for improved biomedical term identification.

Authors

Hettne Kristina M, van Mulligen Erik M, Schuemie Martijn J, Schijvenaars Bob Ja, Kors Jan A

Affiliation

Department of Medical Informatics, Erasmus University Medical Center, Rotterdam, the Netherlands.

Publication information

J Biomed Semantics. 2010 Mar 31;1(1):5. doi: 10.1186/2041-1480-1-5.

DOI:10.1186/2041-1480-1-5

PMID:20618981

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC2895736/

Abstract

BACKGROUND

Identification of terms is essential for biomedical text mining. We concentrate here on the use of vocabularies for term identification, specifically the Unified Medical Language System (UMLS). To make the UMLS more suitable for biomedical text mining, we implemented and evaluated nine term rewrite rules and eight term suppression rules. The rules rely on UMLS properties identified in previous work by others, together with an additional set of new properties discovered by our group during our work with the UMLS. Our work complements the earlier work in that we measure the impact of the different rules on the number of terms identified in a MEDLINE corpus. The number of uniquely identified terms and their frequency in MEDLINE were computed before and after applying the rules. For every rule, the 50 most frequently found terms and a sample of 100 randomly selected terms were evaluated.
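The general idea of rewriting and suppressing vocabulary terms can be sketched as follows. This is a minimal illustration only: the abstract does not list the individual rules, so the two rules shown (a comma-inversion rewrite and a short-or-numeric suppression) and all function names are hypothetical assumptions, not the rules evaluated in the paper or the behavior of the authors' tool.

```python
from typing import List, Optional


def rewrite_inversion(term: str) -> Optional[str]:
    """Hypothetical rewrite rule: turn 'Failure, Heart' into 'Heart Failure'.

    Only terms with exactly one comma and two non-empty parts are rewritten.
    """
    parts = [p.strip() for p in term.split(",")]
    if len(parts) == 2 and all(parts):
        return f"{parts[1]} {parts[0]}"
    return None


def should_suppress(term: str) -> bool:
    """Hypothetical suppression rule: drop one-character or purely numeric terms."""
    stripped = term.strip()
    return len(stripped) < 2 or stripped.isdigit()


def apply_rules(terms: List[str]) -> List[str]:
    """Suppress undesired terms, then add rewritten variants of the rest."""
    kept: List[str] = []
    for term in terms:
        if should_suppress(term):
            continue  # the term is removed from the vocabulary
        kept.append(term)
        variant = rewrite_inversion(term)
        if variant is not None and variant not in kept:
            kept.append(variant)  # the original term is kept alongside the variant
    return kept


if __name__ == "__main__":
    print(apply_rules(["Failure, Heart", "5", "Aspirin"]))
```

In this sketch the original term is retained and the rewritten form is added as an extra variant, which mirrors the abstract's description of rewrite rules generating additional synonyms and spelling variants.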

RESULTS

Five of the nine rewrite rules were found to generate additional synonyms and spelling variants that correctly corresponded to the meaning of the original terms, and seven of the eight suppression rules were found to suppress only undesired terms. Using the five rewrite rules that passed our evaluation, we were able to identify 1,117,772 new occurrences of 14,784 rewritten terms in MEDLINE. Without the rewriting, we recognized 651,268 terms belonging to 397,414 concepts; with rewriting, we recognized 666,053 terms belonging to 410,823 concepts, which is an increase of 2.8% in the number of terms and an increase of 3.4% in the number of concepts recognized. Using the seven suppression rules, a total of 257,118 undesired terms were suppressed in the UMLS, notably decreasing its size. In the corpus, 7,397 terms were suppressed.

CONCLUSIONS

We recommend applying the five rewrite rules and seven suppression rules that passed our evaluation when the UMLS is to be used for biomedical term identification in MEDLINE. A software tool to apply these rules to the UMLS is freely available at http://biosemantics.org/casper.
