Word Vector Source Code Walkthrough (4.6): Evaluation in hyperwords

Sailing_ZhaoZhe

Category: Word Vectors

Article link: https://blog.csdn.net/u011793737/article/details/77967534

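hyperwords ships with two evaluation tasks. One is the analogy task, which earlier posts in this series already introduced; the other is the similarity task. The similarity task predates the analogy task and is the most direct way people have come up with to measure the quality of word vectors.

The similarity evaluation works as follows. The dataset contains a large number of word pairs, each with a human-assigned score for how similar the two words are. Our word vectors can score the same pairs, and we want the human scores and the word-vector scores to agree. There is one catch: the two kinds of scores are not on the same scale, so they cannot be compared directly. The usual solution is Spearman's rank correlation, which makes the scale irrelevant. Suppose humans rate (car, train) higher than (dog, table), and the word vectors also rate (car, train) higher than (dog, table). Then, whatever the actual numbers are, we consider the word vectors consistent with human judgment. In short, only the final ranking matters.

Now let's look at the code that evaluates word vectors on the similarity task. Each line of a similarity test set holds a word pair followed by the similarity of the two words. The function below reads the dataset into memory. ws_eval.py is as follows: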

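def read_test_set(path):
    """Read a similarity test set: one "word1 word2 score" triple per line."""
    test = []
    with open(path) as f:
        for line in f:
            x, y, sim = line.strip().lower().split()
            # Parse the score as a number so that it is ranked numerically
            # (the listing as published kept the raw string).
            test.append(((x, y), float(sim)))
    return test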

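The main function first reads the test set into memory. It then obtains the (wrapped) word vectors through the create_representation function in the representation_factory module, which we covered earlier. Finally, the evaluate function takes the test set and the word vectors as input and returns the Spearman correlation.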

from docopt import docopt
from representations.representation_factory import create_representation

def main():
    args = docopt("""
    Usage:
        ws_eval.py [options] <representation> <representation_path> <task_path>

    Options:
        --neg NUM    Number of negative samples; subtracts its log from PMI (only applicable to PPMI) [default: 1]
        --w+c        Use ensemble of word and context vectors (not applicable to PPMI)
        --eig NUM    Weighted exponent of the eigenvalue matrix (only applicable to SVD) [default: 0.5]
    """)

    data = read_test_set(args['<task_path>'])     # load the similarity test set
    representation = create_representation(args)  # wrapped word vectors
    correlation = evaluate(representation, data)  # Spearman correlation
    # One formatted string prints identically under Python 2 and Python 3.
    print('%s %s \t%0.3f' % (args['<representation>'], args['<representation_path>'], correlation))

Finally, let's see how the Spearman correlation is computed.

from scipy.stats import spearmanr

def evaluate(representation, data):
    # data is the test set: a list of ((x, y), human_score) items.
    results = []
    for (x, y), sim in data:
        # Pair the model's similarity for (x, y) with the human score.
        results.append((representation.similarity(x, y), sim))
    # zip(*results) unpacks the list of pairs into two tuples:
    # all model scores and all human scores.
    actual, expected = zip(*results)
    return spearmanr(actual, expected)[0]  # [0]: correlation, [1]: p-value
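
To make the "only the ranking matters" point concrete, here is a minimal pure-Python sketch of Spearman's correlation computed as the Pearson correlation of the ranks. It is not how scipy implements spearmanr; the helper names (average_ranks, pearson, spearman) and all scores are made up for illustration.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2.0 + 1.0
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(xs, ys):
    """Plain Pearson correlation of two equally long sequences."""
    n = float(len(xs))
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical scores: human ratings on a 0-10 scale, model scores in [0, 1].
human = [9.0, 7.5, 3.1, 0.5]
model = [0.82, 0.64, 0.30, 0.11]
rho_same_order = spearman(human, model)

# Swap the model's top two scores so that one pair is ranked differently.
model_swapped = [0.64, 0.82, 0.30, 0.11]
rho_one_swap = spearman(human, model_swapped)
```

The two score lists live on very different scales, yet the first comparison yields a perfect correlation because both orderings agree; the correlation only drops once the ordering itself changes.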

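Next comes the code that evaluates word vectors on the analogy task. First, a recap of the task: given man, woman and husband, we want to find wife. The method is to find the word vector closest to vec(woman) + vec(husband) - vec(man), that is, the one with the highest similarity to it. With unit-length vectors and dot-product (cosine) similarity, this is equivalent to finding the word w that maximizes sim(w, woman) + sim(w, husband) - sim(w, man). So we can compute the similarities between words up front and later find w from the similarities alone.

Going back to the example, what we really want is a w that is close to woman, close to husband, and far from man. We can vary the formula slightly by turning the plus into a multiplication and the minus into a division: argmax sim(w, woman) * sim(w, husband) / sim(w, man). The word found this way is likewise close to woman and husband and far from man. On sparse word vectors the multiplicative form works much better.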
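
The two objectives can be tried out on a toy example. The sketch below is only an illustration under stated assumptions: the five 4-dimensional vectors are made up, solve is a hypothetical helper, and the shift of cosines into [0, 1] plus the small eps in the denominator mirror what the hyperwords code shown later does for dense vectors.

```python
import numpy as np

words = ["man", "woman", "husband", "wife", "table"]
index = {w: i for i, w in enumerate(words)}

# Made-up vectors with dimensions [male, female, spouse, object].
vectors = np.array([
    [1.0, 0.0, 0.0, 0.0],  # man
    [0.0, 1.0, 0.0, 0.0],  # woman
    [1.0, 0.0, 1.0, 0.0],  # husband
    [0.0, 1.0, 1.0, 0.0],  # wife
    [0.0, 0.0, 0.0, 1.0],  # table
])
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)  # unit length

sims = vectors.dot(vectors.T)  # cosine similarity between every pair of words


def solve(a, a_, b, eps=0.01):
    """Answer 'a is to a_ as b is to ?' with the additive and multiplicative forms."""
    sa, sa_, sb = sims[index[a]], sims[index[a_]], sims[index[b]]
    add_scores = sa_ + sb - sa
    # Shift cosines from [-1, 1] into [0, 1] before multiplying and dividing.
    pa, pa_, pb = (sa + 1) / 2, (sa_ + 1) / 2, (sb + 1) / 2
    mul_scores = pa_ * pb / (pa + eps)
    # Exclude the three query words from the candidates.
    for w in (a, a_, b):
        add_scores[index[w]] = -np.inf
        mul_scores[index[w]] = -np.inf
    return words[int(np.argmax(add_scores))], words[int(np.argmax(mul_scores))]


best_add, best_mul = solve("man", "woman", "husband")
```

On this toy data both forms pick wife. Real embeddings are far noisier, which is where the choice between the two forms starts to matter.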

analogy_eval.py is as follows:

from docopt import docopt
from representations.representation_factory import create_representation

def main():
    args = docopt("""
    Usage:
        analogy_eval.py [options] <representation> <representation_path> <task_path>

    Options:
        --neg NUM    Number of negative samples; subtracts its log from PMI (only applicable to PPMI) [default: 1]
        --w+c        Use ensemble of word and context vectors (not applicable to PPMI)
        --eig NUM    Weighted exponent of the eigenvalue matrix (only applicable to SVD) [default: 0.5]
    """)

    data = read_test_set(args['<task_path>'])     # read the analogy dataset
    xi, ix = get_vocab(data)                      # build a vocabulary for the analogy dataset
    representation = create_representation(args)  # wrapped word vectors
    # Accuracy under the additive and the multiplicative analogy formulas.
    accuracy_add, accuracy_mul = evaluate(representation, data, xi, ix)
    # One formatted string prints identically under Python 2 and Python 3.
    print('%s %s \t%0.3f \t%0.3f' % (args['<representation>'], args['<representation_path>'], accuracy_add, accuracy_mul))

Next is the read_test_set function, which reads the analogy dataset into memory.

def read_test_set(path):
    test = []
    with open(path) as f:
        for line in f:
            # Each line holds one analogy: four whitespace-separated words.
            analogy = line.strip().lower().split()
            test.append(analogy)
    return test

A vocabulary is then built from the analogy dataset.

def get_vocab(data):
    vocab = set()
    for analogy in data:
        vocab.update(analogy)
    vocab = sorted(vocab)
    # Return the word-to-index dict and the sorted word list (index-to-word).
    return dict([(a, i) for i, a in enumerate(vocab)]), vocab
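
As a quick usage illustration, here is what the two return values look like on a tiny, made-up dataset. The function is repeated (with a dict comprehension) so that the snippet runs on its own; the two analogies are hypothetical.

```python
def get_vocab(data):
    vocab = set()
    for analogy in data:
        vocab.update(analogy)
    vocab = sorted(vocab)
    return {w: i for i, w in enumerate(vocab)}, vocab


# Two hypothetical analogy questions that share some words.
data = [
    ["man", "woman", "husband", "wife"],
    ["man", "woman", "king", "queen"],
]
xi, ix = get_vocab(data)

# xi maps a word to its row in the similarity matrix, ix maps a row back to the word.
row_of_man = xi["man"]
word_in_row = ix[row_of_man]
```

Because the vocabulary is deduplicated, the similarity matrix built later has one row per distinct analogy word rather than one row per occurrence.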

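Now for the evaluate function. It first computes the word-to-word similarities, then iterates over every analogy quadruple (a, a_, b, b_). The guess function uses the similarity matrix derived from the word vectors to guess the fourth word; each correct guess adds one to the count, and the final result is the accuracy. The logic is simple. The key parts are the two functions prepare_similarities and guess.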

def evaluate(representation, data, xi, ix):
    sims = prepare_similarities(representation, ix)
    correct_add = 0.0
    correct_mul = 0.0
    for a, a_, b, b_ in data:
        # Guess the fourth word with both formulas and compare with the answer b_.
        b_add, b_mul = guess(representation, sims, xi, a, a_, b)
        if b_add == b_:
            correct_add += 1
        if b_mul == b_:
            correct_mul += 1
    return correct_add/len(data), correct_mul/len(data)

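Below is the prepare_similarities function, which produces the similarity matrix. Each row corresponds to one word of the analogy dataset, and each column holds that word's similarity to one word in the word-vector vocabulary.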

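import numpy as np

def prepare_similarities(representation, vocab):
    # Look up the vector of every word in the analogy vocabulary.
    # Words missing from the representation fall back to row 0.
    vocab_representation = representation.m[[representation.wi[w] if w in representation.wi else 0 for w in vocab]]
    # sims[i, j]: similarity between analogy word i and word j of the full vocabulary.
    sims = vocab_representation.dot(representation.m.T)

    # The dummy substitution below does not seem to affect the result:
    # sims has already been computed by the time vocab_representation is modified.
    dummy = None
    for w in vocab:
        if w not in representation.wi:
            dummy = representation.represent(w)
            break
    if dummy is not None:
        for i, w in enumerate(vocab):
            if w not in representation.wi:
                vocab_representation[i] = dummy

    if type(sims) is not np.ndarray:
        # Sparse representation: convert the similarity matrix to a dense array.
        sims = np.array(sims.todense())
    else:
        # Dense representation: shift the similarities (cosines in [-1, 1]
        # for normalized vectors) into [0, 1].
        sims = (sims+1)/2
    return sims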
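
One plausible reason for the (sims+1)/2 shift on dense vectors is that the multiplicative formula misbehaves on negative similarities. The sketch below uses made-up cosine values and a hypothetical helper mul_score to illustrate this; the 0.01 term is the same small constant that the guess function adds to the denominator.

```python
def mul_score(sim_a_, sim_b, sim_a, eps=0.01):
    """Multiplicative analogy score for one candidate word."""
    return sim_a_ * sim_b / (sim_a + eps)


def shift(cosine):
    """Map a cosine from [-1, 1] into [0, 1]."""
    return (cosine + 1) / 2


# Hypothetical cosines of two candidates to a_, b and a.
good = (0.5, 0.4, 0.2)    # similar to both a_ and b
bad = (-0.5, -0.4, 0.2)   # dissimilar to both a_ and b

# On raw cosines the two negatives multiply to a positive number,
# so the bad candidate scores exactly as high as the good one.
raw_good = mul_score(*good)
raw_bad = mul_score(*bad)

# After the shift every factor is non-negative and the good candidate wins.
shifted_good = mul_score(*[shift(c) for c in good])
shifted_bad = mul_score(*[shift(c) for c in bad])
```

For sparse PPMI vectors the entries are non-negative, so their dot products cannot be negative in the first place, which would explain why that branch only converts the matrix to a dense array.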

With sims in hand, we can predict the fourth word from the first three. Below is the guess function, which follows the formulas given earlier exactly.

def guess(representation, sims, xi, a, a_, b):
    # Similarity rows of the three given words.
    sa = sims[xi[a]]
    sa_ = sims[xi[a_]]
    sb = sims[xi[b]]

    add_sim = -sa+sa_+sb  # additive formula
    # b_ cannot be a, a_ or b, so zero out their scores.
    if a in representation.wi:
        add_sim[representation.wi[a]] = 0
    if a_ in representation.wi:
        add_sim[representation.wi[a_]] = 0
    if b in representation.wi:
        add_sim[representation.wi[b]] = 0
    b_add = representation.iw[np.nanargmax(add_sim)]

    # Multiplicative formula; the 0.01 keeps the denominator away from zero.
    mul_sim = sa_*sb*np.reciprocal(sa+0.01)
    if a in representation.wi:
        mul_sim[representation.wi[a]] = 0
    if a_ in representation.wi:
        mul_sim[representation.wi[a_]] = 0
    if b in representation.wi:
        mul_sim[representation.wi[b]] = 0
    b_mul = representation.iw[np.nanargmax(mul_sim)]

    return b_add, b_mul  # the words found by the two formulas
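
The zeroing of the three input words matters more than it may look. The sketch below uses a small, made-up similarity matrix to show the additive formula returning one of the inputs when nothing is masked; the words and numbers are hypothetical, and the masking with 0 mirrors the listing above.

```python
import numpy as np

words = ["king", "man", "woman", "queen"]
index = {w: i for i, w in enumerate(words)}

# Hypothetical similarities: row i holds the similarity of word i to every word.
sims = np.array([
    [1.0, 0.6, 0.3, 0.6],  # king
    [0.6, 1.0, 0.4, 0.3],  # man
    [0.3, 0.4, 1.0, 0.5],  # woman
    [0.6, 0.3, 0.5, 1.0],  # queen
])

a, a_, b = "man", "woman", "king"
add_sim = -sims[index[a]] + sims[index[a_]] + sims[index[b]]

# Without masking, the best-scoring word is one of the inputs.
unmasked = words[int(np.nanargmax(add_sim))]

# Zero out the inputs, as guess does, and pick again.
for w in (a, a_, b):
    add_sim[index[w]] = 0
masked = words[int(np.nanargmax(add_sim))]
```

Here the unmasked answer is woman, simply because a word is maximally similar to itself, while the masked answer is queen.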
