Improved biomedical word embeddings in the transformer era.

Affiliations

Department of Computer Science, University of Kentucky, United States of America.

Division of Biomedical Informatics, Department of Internal Medicine, University of Kentucky, United States of America; Department of Computer Science, University of Kentucky, United States of America.

Publication information

J Biomed Inform. 2021 Aug;120:103867. doi: 10.1016/j.jbi.2021.103867. Epub 2021 Jul 18.

Abstract

BACKGROUND

Recent natural language processing (NLP) research is dominated by neural network methods that employ word embeddings as basic building blocks. Pre-training with neural methods that capture local and global distributional properties (e.g., skip-gram, GloVe) using free text corpora is often used to embed both words and concepts. Pre-trained embeddings are typically leveraged in downstream tasks using various neural architectures that are designed to optimize task-specific objectives that might further tune such embeddings.
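
As a concrete illustration of the pre-training step described above, the following minimal sketch trains skip-gram word vectors with gensim on a toy tokenized corpus. The corpus and hyperparameters are placeholders for illustration only, not the configuration used in this work.

```python
# Minimal skip-gram pre-training sketch (gensim). The corpus and
# hyperparameters are illustrative placeholders, not the paper's setup.
from gensim.models import Word2Vec

# Toy tokenized "citations" standing in for a biomedical free-text corpus.
corpus = [
    ["metformin", "reduces", "blood", "glucose", "in", "type", "2", "diabetes"],
    ["insulin", "resistance", "is", "central", "to", "type", "2", "diabetes"],
]

model = Word2Vec(
    sentences=corpus,
    vector_size=100,  # embedding dimensionality
    window=5,         # local context window size
    sg=1,             # 1 = skip-gram (0 would be CBOW)
    min_count=1,
    epochs=20,
)

# The resulting static vectors can seed downstream neural models.
print(model.wv["diabetes"][:5])
```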

OBJECTIVE

Despite advances in contextualized language model based embeddings, static word embeddings still form an essential starting point in BioNLP research and applications. They are useful in low resource settings and in lexical semantics studies. Our main goal is to build improved biomedical word embeddings and make them publicly available for downstream applications.

METHODS

We jointly learn word and concept embeddings by first using the skip-gram method and further fine-tuning them with correlational information manifesting in co-occurring Medical Subject Heading (MeSH) concepts in biomedical citations. This fine-tuning is accomplished with the transformer-based BERT architecture in the two-sentence input mode with a classification objective that captures MeSH pair co-occurrence. We conduct evaluations of these tuned static embeddings using multiple datasets for word relatedness developed by previous efforts.
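
To make the fine-tuning setup concrete, below is a minimal sketch of BERT's two-sentence (text pair) input mode with a binary classification head, scoring whether two MeSH concepts co-occur in a citation. The checkpoint name, the way concepts are verbalized as "sentences", and the toy label are assumptions for illustration, not the authors' exact pipeline.

```python
# Hedged sketch: BERT two-sentence input with a classification objective over
# a MeSH concept pair. Checkpoint, verbalization, and label are assumptions.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

# Two MeSH descriptors rendered as the two "sentences" of the pair input.
mesh_a = "Diabetes Mellitus, Type 2"
mesh_b = "Metformin"

inputs = tokenizer(mesh_a, mesh_b, return_tensors="pt")  # [CLS] A [SEP] B [SEP]
labels = torch.tensor([1])  # toy label: 1 = the pair co-occurs in a citation

outputs = model(**inputs, labels=labels)
outputs.loss.backward()  # gradients also reach the input embedding table,
                         # which is where static vectors would be fine-tuned
```

After such pair-classification training, the tuned rows of the input embedding table can be exported and used again as static word and concept vectors.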

RESULTS

Both in qualitative and quantitative evaluations we demonstrate that our methods produce improved biomedical embeddings in comparison with other static embedding efforts. Without selectively culling concepts and terms (as was pursued by previous efforts), we believe we offer the most exhaustive evaluation of biomedical embeddings to date with clear performance improvements across the board.
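
The word-relatedness evaluations mentioned here generally follow a standard protocol: compute cosine similarity between the embeddings of each term pair and report the Spearman correlation against human ratings. The sketch below shows that generic protocol; the term pairs, ratings, and random vectors are made up for illustration.

```python
# Generic word-relatedness evaluation sketch: Spearman correlation between
# embedding cosine similarities and (made-up) human relatedness ratings.
import numpy as np
from scipy.stats import spearmanr

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(0)
embeddings = {  # stand-in vectors; in practice these come from the model
    t: rng.normal(size=100) for t in ["aspirin", "ibuprofen", "stroke", "banana"]
}

pairs = [  # (term1, term2, human rating) -- illustrative values only
    ("aspirin", "ibuprofen", 8.5),
    ("aspirin", "stroke", 6.0),
    ("aspirin", "banana", 1.0),
]

model_scores = [cosine(embeddings[a], embeddings[b]) for a, b, _ in pairs]
human_scores = [r for _, _, r in pairs]

rho, _ = spearmanr(model_scores, human_scores)
print(f"Spearman rho = {rho:.3f}")
```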

CONCLUSION

We repurposed a transformer architecture (typically used to generate dynamic embeddings) to improve static biomedical word embeddings using concept correlations. We provide our code and embeddings for public use for downstream applications and research endeavors: https://github.com/bionlproc/BERT-CRel-Embeddings.
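
If the released vectors follow the common word2vec text format, they could be loaded roughly as sketched below; the file name is a placeholder, so the repository's README should be consulted for the actual artifact names and format.

```python
# Hedged loading sketch: assumes a standard word2vec text-format file.
# The file name is a placeholder, not a confirmed artifact of the repository.
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format(
    "bert_crel_embeddings.txt",  # placeholder path; check the repo's README
    binary=False,
)

# Nearest neighbors by cosine similarity for a query term.
print(vectors.most_similar("metformin", topn=5))
```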

Similar articles

1. Improved biomedical word embeddings in the transformer era. J Biomed Inform. 2021 Aug;120:103867. doi: 10.1016/j.jbi.2021.103867. Epub 2021 Jul 18.
2. A comparison of word embeddings for the biomedical natural language processing. J Biomed Inform. 2018 Nov;87:12-20. doi: 10.1016/j.jbi.2018.09.008. Epub 2018 Sep 12.
3. Use of word and graph embedding to measure semantic relatedness between Unified Medical Language System concepts. J Am Med Inform Assoc. 2020 Oct 1;27(10):1538-1546. doi: 10.1093/jamia/ocaa136.
4. Evaluating semantic relations in neural word embeddings with biomedical and general domain knowledge bases. BMC Med Inform Decis Mak. 2018 Jul 23;18(Suppl 2):65. doi: 10.1186/s12911-018-0630-x.
5. BioWordVec, improving biomedical word embeddings with subword information and MeSH. Sci Data. 2019 May 10;6(1):52. doi: 10.1038/s41597-019-0055-0.
6. The Impact of Specialized Corpora for Word Embeddings in Natural Langage Understanding. Stud Health Technol Inform. 2020 Jun 16;270:432-436. doi: 10.3233/SHTI200197.
7. Language with vision: A study on grounded word and sentence embeddings. Behav Res Methods. 2024 Sep;56(6):5622-5646. doi: 10.3758/s13428-023-02294-z. Epub 2023 Dec 19.
8. Domain specific word embeddings for natural language processing in radiology. J Biomed Inform. 2021 Jan;113:103665. doi: 10.1016/j.jbi.2020.103665. Epub 2020 Dec 15.
9. Explaining Contextualized Word Embeddings in Biomedical Research - A Qualitative Investigation. Stud Health Technol Inform. 2022 Jun 29;295:289-292. doi: 10.3233/SHTI220719.

Cited by

1. CSpace: a concept embedding space for biomedical applications. Bioinformatics. 2025 Jul 1;41(7). doi: 10.1093/bioinformatics/btaf376.
2. AirSeg: Learnable Interconnected Attention Framework for Robust Airway Segmentation. J Imaging Inform Med. 2025 May 22. doi: 10.1007/s10278-025-01545-z.
3. Quality of word and concept embeddings in targetted biomedical domains. Heliyon. 2023 Jun 2;9(6):e16818. doi: 10.1016/j.heliyon.2023.e16818. eCollection 2023 Jun.
4. Year 2021: COVID-19, Information Extraction and BERTization among the Hottest Topics in Medical Natural Language Processing. Yearb Med Inform. 2022 Aug;31(1):254-260. doi: 10.1055/s-0042-1742547. Epub 2022 Dec 4.
5. Artificial Intelligence in Pharmacovigilance: An Introduction to Terms, Concepts, Applications, and Limitations. Drug Saf. 2022 May;45(5):407-418. doi: 10.1007/s40264-022-01156-5. Epub 2022 May 17.
