NLE: Lexical Semantics and the Distributional Hypothesis

爱格白

Last modified 2022-01-21 07:06:20

First published 2022-01-21 05:19:31

Article link: https://blog.csdn.net/zj71hmvx/article/details/122612325

Lexical semantics

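A single word can have several parts of speech and several senses.
These are this week's reading notes, with added material on IC (Information Content) and the LCS.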

lexical semantic relations

The semantic relations between words include:

synonymy
Two words are synonymous if they can be substituted in all possible contexts without changing the meaning of the utterance.
fast == quickly
antonymy
Words which are opposite in meaning
hot ≠ cold
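hyponymy / hypernymy
A dog is a kind of animal, so:
dog is a hyponym of animal
animal is a hypernym of dog
Words that share a hypernym are called co-hyponyms
These relations show up in concept trees or concept hierarchies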
Linguistic terms which capture the idea of class inclusion
meronymy / holonymy
A meronym denotes a part and a holonym denotes the whole
For example:
finger is a meronym of hand which is its holonym.
engine is a meronym of car which is its holonym.
topical relatedness

WordNet

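WordNet is a lexical network organized around synonymy and hyponymy
Its core unit is the synset, a set of synonyms
Synsets are linked to each other through hypernym and hyponym relations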

semantic similarity

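Semantic similarity here is computed over WordNet
The simplest measure uses path distance in WordNet: 1 / (1 + distance between the two concepts), but it is not very reliable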
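
To make the path-based measure concrete, here is a minimal sketch over a toy hypernym tree. The tree, the concept names, and the helper names (`neighbours`, `path_distance`, `path_similarity`) are all made up for illustration; this is not real WordNet data or the WordNet API.

```python
from collections import deque

# Toy hypernym links (child -> parent). Hypothetical data, not real WordNet.
HYPERNYM = {
    "dog": "canine",
    "wolf": "canine",
    "canine": "animal",
    "cat": "feline",
    "feline": "animal",
}


def neighbours(concept):
    """Concepts one hypernym/hyponym edge away from `concept`."""
    linked = {child for child, parent in HYPERNYM.items() if parent == concept}
    if concept in HYPERNYM:
        linked.add(HYPERNYM[concept])
    return linked


def path_distance(c1, c2):
    """Number of edges on the shortest path between two concepts (BFS)."""
    queue = deque([(c1, 0)])
    seen = {c1}
    while queue:
        node, dist = queue.popleft()
        if node == c2:
            return dist
        for nxt in neighbours(node):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    raise ValueError(f"no path between {c1!r} and {c2!r}")


def path_similarity(c1, c2):
    """1 / (1 + distance), as in the note above."""
    return 1 / (1 + path_distance(c1, c2))
```

In this toy tree, dog and wolf are two edges apart, and dog and cat are four edges apart, so the first pair scores higher. One reason the measure is unreliable is that every edge counts the same, however specific or general the concepts it links.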

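lowest common subsumer (LCS), also called the least common subsumer
The LCS of two concepts is the most specific concept that is a hypernym (ancestor) of both. My own understanding is that it is their lowest common ancestor.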
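
A small sketch of finding the LCS in a tree-shaped taxonomy, assuming each concept has a single hypernym. The taxonomy and the function names are hypothetical; real WordNet allows multiple hypernyms, which this sketch does not handle.

```python
# Toy hypernym links (child -> parent). Hypothetical data, not real WordNet.
HYPERNYM = {
    "dog": "canine",
    "wolf": "canine",
    "canine": "animal",
    "cat": "feline",
    "feline": "animal",
}


def ancestors(concept):
    """The concept itself followed by its hypernyms, most specific first."""
    chain = [concept]
    while chain[-1] in HYPERNYM:
        chain.append(HYPERNYM[chain[-1]])
    return chain


def lowest_common_subsumer(c1, c2):
    """Most specific concept that subsumes both c1 and c2, or None."""
    seen = set(ancestors(c1))
    for node in ancestors(c2):  # walks upward, so the first hit is the lowest
        if node in seen:
            return node
    return None
```

Here the LCS of dog and wolf is canine, while the LCS of dog and cat is the more general animal.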

Another approach computes similarity from Information Content (IC)
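
The notes do not spell out the IC formula, so the sketch below rests on two assumptions: the usual definition IC(c) = -log P(c), and the Resnik-style measure that scores two concepts by the IC of their LCS. The measure used in the course may differ. The taxonomy and the probabilities are invented for illustration.

```python
import math

# Toy hypernym links (child -> parent). Hypothetical data, not real WordNet.
HYPERNYM = {
    "dog": "canine",
    "wolf": "canine",
    "canine": "animal",
    "cat": "feline",
    "feline": "animal",
}

# Hypothetical concept probabilities. A concept's probability includes all of
# its hyponyms, so the root has probability 1.
PROB = {
    "animal": 1.0,
    "canine": 0.5,
    "feline": 0.5,
    "dog": 0.25,
    "wolf": 0.25,
    "cat": 0.5,
}


def information_content(concept):
    """Assumed definition: IC(c) = -log2 P(c). Rare concepts carry more information."""
    return -math.log2(PROB[concept])


def ancestors(concept):
    """The concept itself followed by its hypernyms, most specific first."""
    chain = [concept]
    while chain[-1] in HYPERNYM:
        chain.append(HYPERNYM[chain[-1]])
    return chain


def lowest_common_subsumer(c1, c2):
    """Most specific concept that subsumes both c1 and c2."""
    seen = set(ancestors(c1))
    return next(node for node in ancestors(c2) if node in seen)


def resnik_similarity(c1, c2):
    """Assumed measure: similarity is the IC of the lowest common subsumer."""
    return information_content(lowest_common_subsumer(c1, c2))
```

With these made-up numbers, dog and wolf share the fairly specific subsumer canine, so they score above dog and cat, whose only shared subsumer is the root, which carries no information.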

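So the final similarity between two words is the maximum similarity over all pairs of their senses (concepts):
wordsim(w1, w2) = max sim(c1, c2), taken over c1 ∈ senses(w1) and c2 ∈ senses(w2)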
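
The max over sense pairs could be sketched as follows. The sense inventory, the concept similarity scores, and the names `SENSES`, `CONCEPT_SIM` and `concept_sim` are hypothetical stand-ins for whatever concept-level measure is used.

```python
from itertools import product

# Hypothetical sense inventory: word -> list of concepts (senses).
SENSES = {
    "bank": ["bank.river", "bank.finance"],
    "cash": ["cash.money"],
}

# Hypothetical concept-level similarity scores.
CONCEPT_SIM = {
    ("bank.river", "cash.money"): 0.1,
    ("bank.finance", "cash.money"): 0.5,
}


def concept_sim(c1, c2):
    """Symmetric lookup of the toy concept similarity."""
    if c1 == c2:
        return 1.0
    return CONCEPT_SIM.get((c1, c2), CONCEPT_SIM.get((c2, c1), 0.0))


def wordsim(w1, w2):
    """Maximum concept similarity over all pairs of senses of w1 and w2."""
    return max(concept_sim(c1, c2) for c1, c2 in product(SENSES[w1], SENSES[w2]))
```

Here wordsim("bank", "cash") picks the financial sense of bank, because that pairing gives the highest score.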

Distributional Hypothesis

Words that occur in the same contexts tend to have similar meanings (Harris, 1954).

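A context window determines which neighbouring words count as the context of a target word. For example, with a window of one word on each side:
“My name is Agnesia”
My: {name: 1}
name: {My: 1, is: 1}
…
and so on. The window size has to be chosen in advance.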
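
A minimal sketch of counting context-window co-occurrences for the sentence above. The function name `cooccurrence_counts` and the whitespace tokenization are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def cooccurrence_counts(tokens, window_size=1):
    """Map each target word to a Counter of the words within its window."""
    counts = defaultdict(Counter)
    for i, target in enumerate(tokens):
        start = max(0, i - window_size)
        end = min(len(tokens), i + window_size + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                counts[target][tokens[j]] += 1
    return counts


counts = cooccurrence_counts("My name is Agnesia".split(), window_size=1)
for word, features in counts.items():
    print(word, dict(features))
```

A larger `window_size` captures more topical context, while a small one captures mostly the immediate neighbours.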

cosine similarity

Cosine similarity measures the angle θ between two word vectors: sim(w1, w2) = cos(θ)
Formula:

cos(θ) = (w1 · w2) / (|w1| |w2|)

Example from the lecture slides (image not available).
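
As a rough sketch, cosine similarity is the dot product of the two vectors divided by the product of their lengths. The version below assumes sparse vectors stored as dicts from feature to weight; that representation and the name `cosine_similarity` are choices made for this example.

```python
import math


def cosine_similarity(v1, v2):
    """Cosine of the angle between two sparse vectors (feature -> weight dicts)."""
    dot = sum(weight * v2.get(feature, 0) for feature, weight in v1.items())
    norm1 = math.sqrt(sum(w * w for w in v1.values()))
    norm2 = math.sqrt(sum(w * w for w in v2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm1 * norm2)
```

Because the result is normalized by vector length, a frequent word and a rare word can still come out as similar if their context counts point in the same direction.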

(Positive) Pointwise Mutual Information (PPMI)

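PPMI is positive pointwise mutual information.
Frequency and/or simple conditional probability do not capture the intuition that some features are more informative than others
For example, "the" and "is" contribute little to the meaning of a sentence
PMI measures the amount of information gained by seeing a word and a feature together
Features that co-occur with the target word more often than chance would predict therefore get more weight in the similarity computation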

Formula:
I(w, f) = log( P(w, f) / (P(w) P(f)) )
Worked by hand, the calculation feels a bit like cross-multiplication.

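PMI has no lower bound: it goes to negative infinity when a word and a feature never co-occur. Positive PMI (PPMI) does not.
When I(w, f) > 0, the PPMI value is I(w, f); when I(w, f) <= 0, the PPMI value is 0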
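
A sketch of turning co-occurrence counts into PPMI weights, assuming the standard definition I(w, f) = log2( P(w, f) / (P(w) P(f)) ) with probabilities estimated from the counts. The base-2 logarithm, the function name `ppmi`, and the toy counts are assumptions made for illustration.

```python
import math


def ppmi(counts):
    """counts: {word: {feature: count}} -> {word: {feature: PPMI weight}}."""
    total = sum(sum(feats.values()) for feats in counts.values())
    word_totals = {w: sum(feats.values()) for w, feats in counts.items()}
    feature_totals = {}
    for feats in counts.values():
        for f, n in feats.items():
            feature_totals[f] = feature_totals.get(f, 0) + n

    weights = {}
    for w, feats in counts.items():
        weights[w] = {}
        for f, n in feats.items():
            if n == 0:
                weights[w][f] = 0.0  # PMI would be -infinity; PPMI clips it to 0
                continue
            # P(w,f) / (P(w) P(f)) simplifies to n * total / (count(w) * count(f))
            pmi = math.log2(n * total / (word_totals[w] * feature_totals[f]))
            weights[w][f] = max(pmi, 0.0)
    return weights


# Hypothetical counts: "the" occurs with both words, "eat" only with "apple".
toy_counts = {
    "apple": {"eat": 3, "the": 1},
    "car": {"drive": 3, "the": 1},
}
weights = ppmi(toy_counts)
```

With these toy counts, "eat" gets a positive weight for "apple", while "the", which is spread evenly across both words, gets a weight of 0.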

challenges in distributional semantics

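Distributional representations are of words, not of word senses
mixture of senses in distributional neighbourhoods
A distributional neighbourhood tends to reflect the predominant sense
Generally, the nearest neighbours of a word tend to be its synonyms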
Sparsity of distributional data:
Zipf’s Law: “The product of the frequency of a word and its rank is approximately constant.”
So a few words are very frequent and most words are rare, which leaves most co-occurrence counts at zero.
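
To check Zipf's Law on a corpus, one could rank words by frequency and look at rank × frequency. The sketch below only shows the bookkeeping; the function name `rank_frequency_products` is hypothetical, and the relationship only becomes visible on a large corpus, not on a toy token list.

```python
from collections import Counter


def rank_frequency_products(tokens):
    """Return (word, rank, frequency, rank * frequency), most frequent first."""
    ranked = Counter(tokens).most_common()
    return [
        (word, rank, freq, rank * freq)
        for rank, (word, freq) in enumerate(ranked, start=1)
    ]


# Tiny hypothetical token list, only to show the shape of the output.
for row in rank_frequency_products("a a a a b b c".split()):
    print(row)
```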