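Comparing String Similarity with Word Vectors in Java_Sequence Model, Week 2, Programming Assignment 1: Operations on Word Vectors [Cosine Similarity, Word Analogy, Debiasing Word Vectors]...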

Latest recommended article published on 2024-04-01 17:40:21

Dale Dai


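Tags: comparing string similarity with word vectors in Java

Copyright notice: This is an original article by the author, licensed under CC 4.0 BY-SA. When reposting, please include a link to the original source and this notice.

Article link: https://blog.csdn.net/weixin_42130786/article/details/114785234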


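This article shows how to use pretrained GloVe word vectors to compute the cosine similarity between two words, solve word analogy problems, and reduce gender bias in word vectors through neutralization and equalization. It covers computing cosine similarity, implementing a function for the word analogy task, and the algorithms for neutralizing and equalizing gendered words.

Summary generated by CSDN using AI.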

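1. Operations on word vectors

Training word embeddings is very resource-intensive, so ML practitioners usually load a pretrained set of embeddings instead of training their own.

Tasks:

Load pretrained word vectors and measure similarity using cosine similarity

Use word embeddings to solve word analogy problems such as "Man is to Woman as King is to __."

Modify word embeddings to reduce their gender bias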

import numpy as np

from w2v_utils import *

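Load the word vectors. This assignment uses 50-dimensional GloVe vectors to represent words; the call below loads word_to_vec_map.

words, word_to_vec_map = read_glove_vecs('data/glove.6B.50d.txt')  # the embedding vectors are pretrained (already known)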

print(list(words)[:10])

print(word_to_vec_map['mauzac'])

['1945gmt', 'mauzac', 'kambojas', '4-b', 'wakan', 'lorikeet', 'paratroops', 'wittkower', 'messageries', 'oliver']

[ 0.049225 -0.36274 -0.31555 -0.2424 -0.58761 0.27733

0.059622 -0.37908 -0.59505 0.78046 0.3348 -0.90401

0.7552 -0.30247 0.21053 0.03027 0.22069 0.40635

0.11387 -0.79478 -0.57738 0.14817 0.054704 0.973

-0.22502 1.3677 0.14288 0.83708 -0.31258 0.25514

-1.2681 -0.41173 0.0058966 -0.64135 0.32456 -0.84562

-0.68853 -0.39517 -0.17035 -0.54659 0.014695 0.073697

0.1433 -0.38125 0.22585 -0.70205 0.9841 0.19452

-0.21459 0.65096 ]

The loaded data:

words: the set of words in the vocabulary.

word_to_vec_map: a dictionary mapping each word to its GloVe vector representation.
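To make the two return values concrete, here is a minimal sketch of what a loader like read_glove_vecs might do. It assumes the usual GloVe text layout (one word per line followed by its space-separated vector components). The function name load_glove_text and the toy 3-dimensional data are hypothetical and are not the helper shipped in w2v_utils.

```python
import io
import numpy as np

def load_glove_text(lines):
    """Hypothetical loader: parse GloVe-style text lines into (words, word_to_vec_map)."""
    words = set()
    word_to_vec_map = {}
    for line in lines:
        parts = line.strip().split()
        if not parts:
            continue  # skip blank lines
        word = parts[0]
        words.add(word)
        # The remaining fields are the vector components, stored as floats.
        word_to_vec_map[word] = np.array(parts[1:], dtype=np.float64)
    return words, word_to_vec_map

# Made-up 3-dimensional data standing in for data/glove.6B.50d.txt.
sample = io.StringIO("cat 0.1 0.2 0.3\ndog 0.2 0.1 0.4\n")
toy_words, toy_map = load_glove_text(sample)
print(sorted(toy_words))
print(toy_map["cat"].shape)
```

Passing an iterable of lines rather than a file path keeps the sketch testable without the real data file; an open file handle can be passed in the same way.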

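Embedding vectors vs one-hot vectors

One-hot vectors do not capture how similar two words are: every one-hot vector is at the same Euclidean distance from every other one-hot vector.

Embedding vectors such as GloVe vectors carry much more useful information about the meaning of individual words.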
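The equal-distance property of one-hot vectors can be checked numerically. The sketch below is illustrative only: the vocabulary size of 5 and the variable names are assumptions, not part of the assignment.

```python
import numpy as np

vocab_size = 5                 # assumed toy vocabulary size
one_hot = np.eye(vocab_size)   # row i is the one-hot vector of word i

# Pairwise Euclidean distances between all one-hot vectors.
diff = one_hot[:, None, :] - one_hot[None, :, :]
dist = np.sqrt(np.sum(diff ** 2, axis=-1))

# Every pair of distinct one-hot vectors differs in exactly two positions,
# so every off-diagonal distance equals sqrt(2).
off_diagonal = dist[~np.eye(vocab_size, dtype=bool)]
print(np.allclose(off_diagonal, np.sqrt(2)))
```

Because all the distances are identical, a one-hot encoding cannot say that one pair of words is closer in meaning than any other pair.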

The following shows how to use GloVe vectors to measure the similarity between two words.

1.1 Cosine similarity

To measure how similar two words are, we need a way to measure how similar their two embedding vectors are. Given two vectors \(u\) and \(v\), cosine similarity is defined as follows:

\[\text{CosineSimilarity}(u, v) = \frac{u \cdot v}{||u||_2 \, ||v||_2} = \cos(\theta) \tag{1}\]

\(u \cdot v\) is the dot product (inner product) of the two vectors

\(||u||_2\) is the norm (length) of the vector \(u\)

\(\theta\) is the angle between \(u\) and \(v\)

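Cosine similarity depends on the angle between \(u\) and \(v\).

If \(u\) and \(v\) are very similar (they point in nearly the same direction), \(\cos(\theta)\) is close to 1.

If \(u\) and \(v\) are dissimilar, \(\cos(\theta)\) takes a smaller value, down to -1 for vectors pointing in opposite directions.

**Figure 1**: The cosine of the angle between two vectors is a measure of their similarity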
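A quick numerical check makes these ranges concrete. The sketch below uses made-up 3-dimensional vectors and a hypothetical helper named cos_sim built on np.linalg.norm. It is an illustration of formula (1), not the graded solution.

```python
import numpy as np

def cos_sim(u, v):
    """Hypothetical helper: cosine similarity per formula (1)."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

a = np.array([1.0, 2.0, 3.0])
same_direction = 2.0 * a                 # same direction, different length
orthogonal = np.array([3.0, 0.0, -1.0])  # dot product with a: 3 + 0 - 3 = 0
opposite = -a                            # exactly the opposite direction

print(cos_sim(a, same_direction))  # close to 1
print(cos_sim(a, orthogonal))      # close to 0
print(cos_sim(a, opposite))        # close to -1
```

Scaling a vector does not change the result, which is why cosine similarity compares direction rather than length.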

Exercise: implement the function cosine_similarity() to compute the similarity between two word vectors.

Reminder: the norm of \(u\) is defined as \(||u||_2 = \sqrt{\sum_{i=1}^{n} u_i^2}\)

Hint: np.dot, np.sum, and np.sqrt are useful here.

# GRADED FUNCTION: cosine_similarity

def cosine_similarity(u, v):
    """
    Cosine similarity reflects the degree of similarity between u and v

    Arguments:
        u -- a word vector of shape (n,)
        v -- a word vector of shape (n,)

    Returns:
        cosine_similarity -- the cosine similarity between u and v defined by the formula above.
    """
    ### START CODE HERE ###
    # Compute the dot product between u and v (≈1 line)
    dot = np.sum(u * v)
    # Compute the L2 norm of u (≈1 line)
    norm_u = np.sqrt(np.sum(np.square(u)))
    # Compute the L2 norm of v (≈1 line)
    norm_v = np.sqrt(np.sum(np.square(v)))
    # Compute the cosine similarity defined by formula (1) (≈1 line)
    cosine_similarity = dot / (norm_u * norm_v)
    ### END CODE HERE ###

    return cosine_similarity

Test:

father = word_to_vec_map["father"]

mother = word_to_vec_map["mother"]

ball = word_to_vec_map["ball"]

crocodile = word_to_vec_map["crocodile"]

france = word_to_vec_map["fra
