Unified Medical Language System term occurrences in clinical notes: a large-scale corpus analysis.

Affiliation

Division of Biomedical Statistics and Informatics, Mayo Clinic, Rochester, MN 55905, USA.

Publication information

J Am Med Inform Assoc. 2012 Jun;19(e1):e149-56. doi: 10.1136/amiajnl-2011-000744. Epub 2012 Apr 4.

Abstract

OBJECTIVE

To characterise empirical instances of Unified Medical Language System (UMLS) Metathesaurus term strings in a large clinical corpus, and to illustrate what types of term characteristics are generalisable across data sources.

DESIGN

Based on the occurrences of UMLS terms in a 51 million document corpus of Mayo Clinic clinical notes, this study computes statistics about the terms' string attributes, source terminologies, semantic types and syntactic categories. Term occurrences in 2010 i2b2/VA text were also mapped; eight example filters were designed from the Mayo-based statistics and applied to i2b2/VA data.

RESULTS

For the corpus analysis, negligible numbers of mapped terms in the Mayo corpus had over six words or 55 characters. Of source terminologies in the UMLS, the Consumer Health Vocabulary and Systematized Nomenclature of Medicine-Clinical Terms (SNOMED-CT) had the best coverage in Mayo clinical notes at 106426 and 94788 unique terms, respectively. Of 15 semantic groups in the UMLS, seven groups accounted for 92.08% of term occurrences in Mayo data. Syntactically, over 90% of matched terms were in noun phrases. For the cross-institutional analysis, using five example filters on i2b2/VA data reduces the actual lexicon to 19.13% of the size of the UMLS and only sees a 2% reduction in matched terms.

CONCLUSION

The corpus statistics presented here are instructive for building lexicons from the UMLS. Features intrinsic to Metathesaurus terms (well formedness, length and language) generalise easily across clinical institutions, but term frequencies should be adapted with caution. The semantic groups of mapped terms may differ slightly from institution to institution, but they differ greatly when moving to the biomedical literature domain.
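The length observation in the results (few mapped terms exceed six words or 55 characters) suggests one simple kind of lexicon filter. The sketch below is illustrative only: the abstract does not define the study's filters, so the function names, the whitespace-based word count, and the sample terms are all assumptions, not the authors' implementation.

```python
# Illustrative length-based lexicon filter (hypothetical; not the study's code).
# The thresholds come from the abstract: six words and 55 characters.

def within_length_limits(term, max_words=6, max_chars=55):
    """Return True if the term has at most max_words whitespace-separated
    words and at most max_chars characters (whitespace splitting is an
    assumption; the study's tokenization is not described in the abstract)."""
    return len(term.split()) <= max_words and len(term) <= max_chars


def filter_lexicon(terms, max_words=6, max_chars=55):
    """Keep only the terms that satisfy both length limits."""
    return [t for t in terms if within_length_limits(t, max_words, max_chars)]


# Made-up sample terms, for demonstration only.
sample_terms = [
    "chest pain",
    "myocardial infarction",
    "structure of left anterior descending coronary artery of heart wall",
]

kept = filter_lexicon(sample_terms)
# Fraction of the sample lexicon that survives the filter.
retained_fraction = len(kept) / len(sample_terms)
```

A filter of this kind depends only on the term string itself, which is the type of intrinsic feature the conclusion describes as generalising across institutions. Frequency-based filters would need institution-specific counts.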

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1a94/3392861/2e5652889280/amiajnl-2011-000744fig1.jpg
