Hettne Kristina M, van Mulligen Erik M, Schuemie Martijn J, Schijvenaars Bob JA, Kors Jan A
Department of Medical Informatics, Erasmus University Medical Center, Rotterdam, the Netherlands.
J Biomed Semantics. 2010 Mar 31;1(1):5. doi: 10.1186/2041-1480-1-5.
Identification of terms is essential for biomedical text mining. We concentrate here on the use of vocabularies for term identification, specifically the Unified Medical Language System (UMLS). To make the UMLS more suitable for biomedical text mining, we implemented and evaluated nine term rewrite rules and eight term suppression rules. The rules rely on UMLS properties identified in previous work by others, together with an additional set of properties that our group discovered while working with the UMLS. Our work complements the earlier work in that we measure the impact of the different rules on the number of terms identified in a MEDLINE corpus. The number of uniquely identified terms and their frequency in MEDLINE were computed before and after applying the rules. For every rule, the 50 most frequently found terms and a sample of 100 randomly selected terms were evaluated.
Five of the nine rewrite rules were found to generate additional synonyms and spelling variants that correctly corresponded to the meaning of the original terms, and seven of the eight suppression rules were found to suppress only undesired terms. Using the five rewrite rules that passed our evaluation, we were able to identify 1,117,772 new occurrences of 14,784 rewritten terms in MEDLINE. Without the rewriting, we recognized 651,268 terms belonging to 397,414 concepts; with rewriting, we recognized 666,053 terms belonging to 410,823 concepts, which is an increase of 2.8% in the number of terms and an increase of 3.4% in the number of concepts recognized. Using the seven suppression rules, a total of 257,118 undesired terms were suppressed in the UMLS, notably decreasing its size. In the corpus, 7,397 terms were suppressed.
We recommend applying the five rewrite rules and seven suppression rules that passed our evaluation when the UMLS is to be used for biomedical term identification in MEDLINE. A software tool to apply these rules to the UMLS is freely available at http://biosemantics.org/casper.
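The before-and-after measurement described above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the two rewrite rules (syntactic inversion and possessive removal), the single suppression rule (terms ending in "NOS"), the function names, and the toy vocabulary and documents are hypothetical and are not taken from the abstract, which does not list the individual rules. The sketch only shows the general shape of suppressing terms, generating rewritten variants, and counting matches in a corpus with and without the rules.

```python
import re
from collections import Counter

# Hypothetical rewrite rules: each returns a new variant, or None if not applicable.
def rewrite_syntactic_inversion(term):
    """'Failure, Renal' -> 'Renal Failure' (only when there is exactly one comma)."""
    parts = [p.strip() for p in term.split(",")]
    if len(parts) == 2 and all(parts):
        return f"{parts[1]} {parts[0]}"
    return None

def rewrite_possessive(term):
    """"Alzheimer's disease" -> 'Alzheimer disease'."""
    variant = re.sub(r"'s\b", "", term)
    return variant if variant != term else None

# Hypothetical suppression rule: returns True if the term should be dropped.
def suppress_underspecified(term):
    """Drop terms ending in 'NOS' (not otherwise specified)."""
    return term.upper().endswith(" NOS")

REWRITE_RULES = [rewrite_syntactic_inversion, rewrite_possessive]
SUPPRESS_RULES = [suppress_underspecified]

def apply_rules(terms):
    """Suppress undesired terms, then add rewritten variants of the rest."""
    kept = [t for t in terms if not any(rule(t) for rule in SUPPRESS_RULES)]
    expanded = set(kept)
    for term in kept:
        for rule in REWRITE_RULES:
            variant = rule(term)
            if variant:
                expanded.add(variant)
    return expanded

def count_occurrences(terms, documents):
    """Case-insensitive whole-term matching; returns occurrences per term."""
    counts = Counter()
    for term in terms:
        pattern = re.compile(r"(?<!\w)" + re.escape(term.lower()) + r"(?!\w)")
        for doc in documents:
            hits = len(pattern.findall(doc.lower()))
            if hits:
                counts[term] += hits
    return counts

if __name__ == "__main__":
    vocabulary = ["Failure, Renal", "Alzheimer's disease", "Fracture NOS"]
    documents = [
        "Renal failure was observed.",
        "Alzheimer disease and renal failure.",
    ]
    before = count_occurrences(vocabulary, documents)
    after = count_occurrences(apply_rules(vocabulary), documents)
    print("occurrences before:", sum(before.values()))
    print("occurrences after: ", sum(after.values()))
```

On this toy input the original vocabulary matches nothing, whereas the rewritten variants "Renal Failure" and "Alzheimer disease" are found in the documents. A real evaluation would additionally need manual review of the matched terms, as the abstract describes, because a rewrite can also produce variants that no longer carry the original meaning.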