
05. Sequence Models W2. Natural Language Processing & Word Embeddings (Assignments: Word Vectors + Emoji Generation)

  

Date: 2020-10-04


Table of Contents

Assignment 1:
1. Cosine similarity
2. Word analogy
3. Debiasing word vectors
  3.1 Neutralizing bias for non-gender-specific words
  3.2 Equalization algorithm for gender-specific words
Assignment 2: Emojify
1. Baseline model: Emojifier-V1
  1.1 Dataset
  1.2 Model overview
  1.3 Implementing Emojifier-V1
  1.4 Testing on the training set
2. Emojifier-V2: Using LSTMs in Keras
  2.1 Model overview
  2.2 Keras and mini-batching
  2.3 The Embedding layer
  2.4 Building Emojifier-V2

Quiz: see the companion blog post

Notes: W2. Natural Language Processing & Word Embeddings

Assignment 1:

Load pre-trained word vectors and measure similarity using the cosine of the angle between them, $\cos(\theta)$

Use word embeddings to solve word analogy problems

Modify word embeddings to reduce gender bias

import numpy as np
from w2v_utils import *

This assignment uses 50-dimensional GloVe vectors to represent words:

words, word_to_vec_map = read_glove_vecs('data/glove.6B.50d.txt')
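Here `read_glove_vecs` comes from the course-provided `w2v_utils` helper. For readers without that module, a minimal sketch of what it is assumed to do, based on the standard GloVe text format (one word followed by its float components per line); the helper's actual implementation may differ:

def read_glove_vecs_sketch(glove_file):
    # Hypothetical stand-in for w2v_utils.read_glove_vecs.
    # Each line of glove.6B.50d.txt looks like: <word> <v1> <v2> ... <v50>
    words = set()
    word_to_vec_map = {}
    with open(glove_file, 'r', encoding='utf-8') as f:
        for line in f:
            parts = line.strip().split()
            words.add(parts[0])
            word_to_vec_map[parts[0]] = np.array(parts[1:], dtype=np.float64)
    return words, word_to_vec_map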

1. Cosine similarity

$\text{CosineSimilarity}(u, v) = \frac{u \cdot v}{\|u\|_2 \, \|v\|_2} = \cos(\theta)$

where $\|u\|_2 = \sqrt{\sum_{i=1}^{n} u_i^2}$.
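As a quick sanity check of the formula: for $u = (1, 0)$ and $v = (1, 1)$, $u \cdot v = 1$, $\|u\|_2 = 1$, and $\|v\|_2 = \sqrt{2}$, so the similarity is $1/\sqrt{2} \approx 0.707$, the cosine of the 45° angle between the two vectors. Identical directions give 1, orthogonal vectors give 0, and opposite directions give -1.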

# GRADED FUNCTION: cosine_similarity

def cosine_similarity(u, v):
    """
    Cosine similarity reflects the degree of similarity between u and v

    Arguments:
        u -- a word vector of shape (n,)
        v -- a word vector of shape (n,)

    Returns:
        cosine_similarity -- the cosine similarity between u and v defined by the formula above.
    """
    ### START CODE HERE ###
    # Compute the dot product between u and v (≈1 line)
    dot = np.dot(u, v)

    # Compute the L2 norm of u (≈1 line)
    norm_u = np.linalg.norm(u)

    # Compute the L2 norm of v (≈1 line)
    norm_v = np.linalg.norm(v)

    # Compute the cosine similarity defined by formula (1) (≈1 line)
    cosine_similarity = dot / (norm_u * norm_v)
    ### END CODE HERE ###

    return cosine_similarity
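A quick way to exercise the finished function (word pairs chosen here for illustration; the exact numbers depend on the GloVe vectors loaded above):

father = word_to_vec_map["father"]
mother = word_to_vec_map["mother"]
ball = word_to_vec_map["ball"]
crocodile = word_to_vec_map["crocodile"]

# Related words should score close to 1, unrelated words closer to 0
print("cosine_similarity(father, mother) =", cosine_similarity(father, mother))
print("cosine_similarity(ball, crocodile) =", cosine_similarity(ball, crocodile))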

2. Word analogy

For example: man : woman --> king : queen
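Formally, given the embeddings $e_a$, $e_b$, $e_c$, the task is to find the word $d$ whose embedding best satisfies $e_b - e_a \approx e_d - e_c$, i.e. the word maximizing $\text{CosineSimilarity}(e_b - e_a,\ e_d - e_c)$, which is exactly what the loop below computes.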

# GRADED FUNCTION: complete_analogy

def complete_analogy(word_a, word_b, word_c, word_to_vec_map):
    """
    Performs the word analogy task as explained above: a is to b as c is to ____.

    Arguments:
        word_a -- a word, string
        word_b -- a word, string
        word_c -- a word, string
        word_to_vec_map -- dictionary that maps words to their corresponding vectors.

    Returns:
        best_word -- the word such that v_b - v_a is close to v_best_word - v_c, as measured by cosine similarity
    """
    # convert words to lower case
    word_a, word_b, word_c = word_a.lower(), word_b.lower(), word_c.lower()

    ### START CODE HERE ###
    # Get the word embeddings e_a, e_b and e_c (≈1-3 lines)
    e_a, e_b, e_c = word_to_vec_map[word_a], word_to_vec_map[word_b], word_to_vec_map[word_c]
    ### END CODE HERE ###

    words = word_to_vec_map.keys()
    max_cosine_sim = -100  # Initialize max_cosine_sim to a large negative number
    best_word = None       # Initialize best_word with None, it will help keep track of the word to output

    # loop over the whole word vector set
    for w in words:
        # to avoid best_word being one of the input words, skip them
        if w in [word_a, word_b, word_c]:
            continue

        ### START CODE HERE ###
        # Compute cosine similarity between the vector (e_b - e_a) and the vector ((w's vector representation) - e_c) (≈1 line)
        cosine_sim = cosine_similarity(e_b - e_a, word_to_vec_map[w] - e_c)

        # If the cosine_sim is more than the max_cosine_sim seen so far,
        # then: set the new max_cosine_sim to the current cosine_sim and the best_word to the current word (≈3 lines)
        if cosine_sim > max_cosine_sim:
            max_cosine_sim = cosine_sim
            best_word = w
        ### END CODE HERE ###

    return best_word

Test:

triads_to_try = [('italy', 'italian', 'spain'), ('india', 'delhi', 'japan'), ('man', 'woman', 'boy'), ('small', 'smaller', 'large')]

for triad in triads_to_try:
    print('{} -> {} :: {} -> {}'.format(*triad, complete_analogy(*triad, word_to_vec_map)))

Output:

italy -> italian :: spain -> spanish

india -> delhi :: japan -> tokyo

man -> woman :: boy -> girl

small -> smaller :: large -> larger

Additional tests:

good -> ok :: bad -> oops

father -> dad :: mother -> mom

3. Debiasing word vectors

We examine gender bias reflected in word embeddings and explore algorithms for reducing it. First, construct a rough gender axis g as the difference between the embeddings of "woman" and "man":

g = word_to_vec_map['woman'] - word_to_vec_map['man']

print(g)

Output: a 50-dimensional vector

[-0.087144 0.2182 -0.40986 -0.03922 -0.1032 0.94165

-0.06042 0.32988 0.46144 -0.35962 0.31102 -0.86824

0.96006 0.01073 0.24337 0.08193 -1.02722 -0.21122

0.695044 -0.00222 0.29106 0.5053 -0.099454 0.40445

0.30181 0.1355 -0.0606 -0.07131 -0.19245 -0.06115

-0.3204 0.07165 -0.13337 -0.25068714 -0.14293 -0.224957

-0.149 0.048882 0.12191 -0.27362 -0.165476 -0.20426

0.54376 -0.271425 -0.10245 -0.32108 0.2516 -0.33455

-0.04371 0.01258 ]

print('List of names and their similarities with constructed vector:')

# girls' and boys' names
name_list = ['john', 'marie', 'sophie', 'ronaldo', 'priya', 'rahul', 'danielle', 'reza', 'katy', 'yasmin']

for w in name_list:
    print(w, cosine_similarity(word_to_vec_map[w], g))

Output:

List of names and their similarities with constructed vector:

john -0.23163356145973724

marie 0.315597935396073

sophie 0.31868789859418784

ronaldo -0.31244796850329437

priya 0.17632041839009402

rahul -0.16915471039231716

danielle 0.24393299216283895

reza -0.07930429672199553

katy 0.2831068659572615

yasmin 0.2331385776792876

As you can see, female first names tend to have a positive cosine similarity with the constructed vector g, while male first names tend to have a negative one: the difference vector g really does capture a gender direction.
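As a rough sketch of the neutralization step listed in the table of contents (section 3.1), the idea used in the course assignment (following Bolukbasi et al., 2016) is to project a non-gender-specific word's embedding onto the bias axis g and subtract that component, leaving a vector orthogonal to g. The function name below is illustrative, not the graded one:

def neutralize_sketch(word, g, word_to_vec_map):
    # Remove the component of the word's embedding that lies along
    # the bias axis g, so the result has zero projection onto g.
    e = word_to_vec_map[word]
    e_biascomponent = (np.dot(e, g) / np.sum(g * g)) * g
    return e - e_biascomponent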
