
Jointly learning word embeddings using a corpus and a knowledge base.

Affiliations

Department of Computer Science, University of Liverpool, Liverpool, United Kingdom.

Kawarabayashi ERATO Large Graph Project, Tokyo, Japan.

Publication Information

PLoS One. 2018 Mar 12;13(3):e0193094. doi: 10.1371/journal.pone.0193094. eCollection 2018.

Abstract

Methods for representing the meaning of words in vector spaces, using purely the information distributed in text corpora, have proved to be very valuable in various text mining and natural language processing (NLP) tasks. However, these methods still disregard the valuable semantic relational structure between words in co-occurring contexts. Such semantic relational structures are encoded in manually created knowledge bases (KBs) such as ontologies and semantic lexicons, where the meanings of words are represented by defining the various relationships that exist among those words. We combine the knowledge in both a corpus and a KB to learn better word embeddings. Specifically, we propose a joint word representation learning method that uses the knowledge in a KB while simultaneously predicting the co-occurrences of two words in a corpus context. In particular, we use the corpus to define our objective function, subject to relational constraints derived from the KB. We further utilise the corpus co-occurrence statistics to propose two novel approaches, Nearest Neighbour Expansion (NNE) and Hedged Nearest Neighbour Expansion (HNE), which dynamically expand the KB and thereby derive additional constraints that guide the optimisation process. Our experimental results over a wide range of benchmark tasks demonstrate that the proposed method statistically significantly improves the accuracy of the learnt word embeddings, outperforming a corpus-only baseline as well as a number of previously proposed methods that incorporate both corpora and KBs, on both semantic similarity prediction and word analogy detection tasks.
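The abstract does not spell out the objective function. As a rough illustration of the joint-learning idea, the sketch below assumes a GloVe-style weighted least-squares corpus loss plus a penalty that pulls KB-related word pairs together; the names (joint_loss, kb_pairs, lam) and the squared-distance regulariser are illustrative assumptions, not the authors' exact formulation.

```python
# Minimal sketch (NOT the paper's exact method): a GloVe-style corpus
# loss combined with a soft KB constraint term.
import numpy as np

rng = np.random.default_rng(0)

V, d = 1000, 50                          # vocabulary size, embedding dimension
W = rng.normal(scale=0.1, size=(V, d))   # target-word embeddings
C = rng.normal(scale=0.1, size=(V, d))   # context-word embeddings
b_w = np.zeros(V)                        # word biases
b_c = np.zeros(V)                        # context biases

def glove_weight(x, x_max=100.0, alpha=0.75):
    """Standard GloVe weighting that damps very frequent pairs."""
    return min(1.0, (x / x_max) ** alpha)

def joint_loss(cooc, kb_pairs, lam=0.1):
    """cooc: iterable of (i, j, count) corpus co-occurrences.
    kb_pairs: iterable of (i, j) word pairs related in the KB.
    lam: weight of the KB penalty (an assumed hyperparameter)."""
    # Corpus part: weighted least squares on log co-occurrence counts.
    corpus = 0.0
    for i, j, x in cooc:
        err = W[i] @ C[j] + b_w[i] + b_c[j] - np.log(x)
        corpus += glove_weight(x) * err ** 2
    # KB part: a soft version of the relational constraints, pulling
    # embeddings of KB-related words towards each other.
    kb = sum(np.sum((W[i] - W[j]) ** 2) for i, j in kb_pairs)
    return corpus + lam * kb

# Example call with two co-occurrence triples and one KB-related pair.
print(joint_loss([(0, 1, 12.0), (2, 3, 4.0)], kb_pairs=[(0, 2)]))
```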

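Likewise, only the names NNE and HNE are given above. The hedged sketch below shows one plausible reading of the dynamic-expansion step, assuming candidate neighbours are ranked by cosine similarity between co-occurrence rows; the function expand_kb, the cut-off k, and the hedge threshold are assumptions rather than the paper's definitions.

```python
# Hedged sketch of KB expansion from corpus statistics (assumed
# reading of NNE/HNE, not the paper's exact procedure).
import numpy as np

def expand_kb(kb_pairs, cooc, k=3, hedge=None):
    """kb_pairs: set of (i, j) KB-related word-index pairs.
    cooc: (V, V) raw co-occurrence count matrix.
    k: neighbours considered per word (assumed, not from the paper).
    hedge: optional similarity threshold; only neighbours at least this
    similar are admitted, loosely mimicking the more conservative HNE."""
    # Score word pairs by cosine similarity of their co-occurrence rows.
    unit = cooc / (np.linalg.norm(cooc, axis=1, keepdims=True) + 1e-12)
    sim = unit @ unit.T
    np.fill_diagonal(sim, -np.inf)       # a word is not its own neighbour

    expanded = set(kb_pairs)
    for i, _ in kb_pairs:
        for j in np.argsort(sim[i])[::-1][:k]:   # top-k corpus neighbours
            if hedge is None or sim[i, j] >= hedge:
                expanded.add((i, int(j)))
    return expanded
```

Pairs produced this way could then feed the joint objective above, e.g. by passing expand_kb(kb_pairs, cooc) as the kb_pairs argument of joint_loss.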

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/b710/5847320/bcd684d77dec/pone.0193094.g001.jpg
