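Contents
Assignment 1:
1. Cosine similarity
2. Word analogy
3. Debiasing word vectors
3.1 Neutralizing bias for non-gender-specific words
3.2 Equalization algorithm for gender-specific words
Assignment 2: Emojify (emoji generation)
1. Baseline model: Emojifier-V1
1.1 Dataset
1.2 Model overview
1.3 Implementing Emojifier-V1
1.4 Evaluating on the training set
2. Emojifier-V2: Using LSTMs in Keras
2.1 Model overview
2.2 Keras and mini-batching
2.3 The Embedding layer
2.4 Building Emojifier-V2
Quiz: see the reference blog post
Notes: W2. Natural Language Processing and Word Embeddings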
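Assignment 1:
- Load pretrained word vectors and measure similarity with cosine similarity
- Use word embeddings to solve word analogy problems
- Modify word embeddings to reduce gender bias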
import numpy as np
from w2v_utils import *
This assignment represents words with 50-dimensional GloVe vectors
words, word_to_vec_map = read_glove_vecs('data/glove.6B.50d.txt')
1. Cosine similarity
The cosine similarity of two vectors u and v is
$$\text{CosineSimilarity}(u, v) = \frac{u \cdot v}{\lVert u \rVert_2 \, \lVert v \rVert_2} = \cos(\theta) \tag{1}$$
where u · v is the dot product of the two vectors, ‖u‖₂ is the L2 norm of u, and θ is the angle between u and v.
# GRADED FUNCTION: cosine_similarity
def cosine_similarity(u, v):
    """
    Cosine similarity reflects the degree of similarity between u and v
    Arguments:
    u -- a word vector of shape (n,)
    v -- a word vector of shape (n,)
    Returns:
    cosine_similarity -- the cosine similarity between u and v defined by the formula above.
    """
    ### START CODE HERE ###
    # Compute the dot product between u and v (≈1 line)
    dot = np.dot(u, v)
    # Compute the L2 norm of u (≈1 line)
    norm_u = np.linalg.norm(u)
    # Compute the L2 norm of v (≈1 line)
    norm_v = np.linalg.norm(v)
    # Compute the cosine similarity defined by formula (1) (≈1 line)
    cosine_similarity = dot / (norm_u * norm_v)
    ### END CODE HERE ###
    return cosine_similarity
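To get a feel for the range of values, here is a minimal sketch on toy 2-D vectors (made up for illustration, not GloVe vectors). It redefines a compact version of the function so the snippet runs on its own.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between u and v."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Toy vectors chosen only to show the three extreme cases.
same_direction = cosine_similarity(np.array([1.0, 2.0]), np.array([2.0, 4.0]))
orthogonal = cosine_similarity(np.array([1.0, 0.0]), np.array([0.0, 3.0]))
opposite = cosine_similarity(np.array([1.0, 1.0]), np.array([-2.0, -2.0]))

print(same_direction, orthogonal, opposite)  # close to 1, 0 and -1
```

The value depends only on the angle between the vectors, not on their lengths, which is why scaling a vector does not change the result.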
2. Word analogy
For example: man : woman --> king : queen
# GRADED FUNCTION: complete_analogy
def complete_analogy(word_a, word_b, word_c, word_to_vec_map):
    """
    Performs the word analogy task as explained above: a is to b as c is to ____.
    Arguments:
    word_a -- a word, string
    word_b -- a word, string
    word_c -- a word, string
    word_to_vec_map -- dictionary that maps words to their corresponding vectors.
    Returns:
    best_word -- the word such that v_b - v_a is close to v_best_word - v_c, as measured by cosine similarity
    """
    # convert words to lower case
    word_a, word_b, word_c = word_a.lower(), word_b.lower(), word_c.lower()
    ### START CODE HERE ###
    # Get the word embeddings v_a, v_b and v_c (≈1-3 lines)
    e_a, e_b, e_c = word_to_vec_map[word_a], word_to_vec_map[word_b], word_to_vec_map[word_c]
    ### END CODE HERE ###
    words = word_to_vec_map.keys()
    max_cosine_sim = -100   # Initialize max_cosine_sim to a large negative number
    best_word = None        # Initialize best_word with None, it will help keep track of the word to output
    # loop over the whole word vector set
    for w in words:
        # to avoid best_word being one of the input words, pass on them.
        if w in [word_a, word_b, word_c]:
            continue
        ### START CODE HERE ###
        # Compute cosine similarity between the vector (e_b - e_a) and the vector ((w's vector representation) - e_c) (≈1 line)
        cosine_sim = cosine_similarity(e_b - e_a, word_to_vec_map[w] - e_c)
        # If the cosine_sim is more than the max_cosine_sim seen so far,
        # then: set the new max_cosine_sim to the current cosine_sim and the best_word to the current word (≈3 lines)
        if cosine_sim > max_cosine_sim:
            max_cosine_sim = cosine_sim
            best_word = w
        ### END CODE HERE ###
    return best_word
Test:
triads_to_try = [('italy', 'italian', 'spain'), ('india', 'delhi', 'japan'), ('man', 'woman', 'boy'), ('small', 'smaller', 'large')]
for triad in triads_to_try:
    print('{} -> {} :: {} -> {}'.format(*triad, complete_analogy(*triad, word_to_vec_map)))
Output:
italy -> italian :: spain -> spanish
india -> delhi :: japan -> tokyo
man -> woman :: boy -> girl
small -> smaller :: large -> larger
Additional tests:
good -> ok :: bad -> oops
father -> dad :: mother -> mom
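The loop in complete_analogy calls cosine_similarity once per vocabulary word, which is slow on a large vocabulary. As a rough sketch of an alternative, the same search can be done with one matrix operation. The function name complete_analogy_vectorized and the toy 3-D embeddings below are hypothetical and exist only for illustration.

```python
import numpy as np

def complete_analogy_vectorized(word_a, word_b, word_c, word_to_vec_map):
    """Find w maximizing cos(e_b - e_a, e_w - e_c), skipping the input words."""
    candidates = [w for w in word_to_vec_map if w not in (word_a, word_b, word_c)]
    E = np.stack([word_to_vec_map[w] for w in candidates])    # (n_words, dim)
    target = word_to_vec_map[word_b] - word_to_vec_map[word_a]
    diffs = E - word_to_vec_map[word_c]                       # e_w - e_c for every candidate
    sims = diffs @ target / (np.linalg.norm(diffs, axis=1) * np.linalg.norm(target))
    return candidates[int(np.argmax(sims))]

# Toy embeddings: axis 1 plays the role of "gender", axis 2 of "royalty".
toy_map = {
    "man":   np.array([1.0, 0.0, 0.0]),
    "woman": np.array([1.0, 1.0, 0.0]),
    "king":  np.array([1.0, 0.0, 1.0]),
    "queen": np.array([1.0, 1.0, 1.0]),
    "apple": np.array([0.0, -1.0, 0.5]),
}
best = complete_analogy_vectorized("man", "woman", "king", toy_map)
print(best)  # queen
```

This sketch assumes no candidate has exactly the same vector as word_c, since that would make one of the norms zero.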
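3. Debiasing word vectors
This part examines the gender bias reflected in word embeddings and explores algorithms for reducing it
g = word_to_vec_map['woman'] - word_to_vec_map['man']
print(g)
Output: a 50-dimensional vector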
[-0.087144 0.2182 -0.40986 -0.03922 -0.1032 0.94165
-0.06042 0.32988 0.46144 -0.35962 0.31102 -0.86824
0.96006 0.01073 0.24337 0.08193 -1.02722 -0.21122
0.695044 -0.00222 0.29106 0.5053 -0.099454 0.40445
0.30181 0.1355 -0.0606 -0.07131 -0.19245 -0.06115
-0.3204 0.07165 -0.13337 -0.25068714 -0.14293 -0.224957
-0.149 0.048882 0.12191 -0.27362 -0.165476 -0.20426
0.54376 -0.271425 -0.10245 -0.32108 0.2516 -0.33455
-0.04371 0.01258 ]
print('List of names and their similarities with constructed vector:')
# girls and boys name
name_list = ['john', 'marie', 'sophie', 'ronaldo', 'priya', 'rahul', 'danielle', 'reza', 'katy', 'yasmin']
for w in name_list:
    print(w, cosine_similarity(word_to_vec_map[w], g))
Output:
List of names and their similarities with constructed vector:
john -0.23163356145973724
marie 0.315597935396073
sophie 0.31868789859418784
ronaldo -0.31244796850329437
priya 0.17632041839009402
rahul -0.16915471039231716
danielle 0.24393299216283895
reza -0.07930429672199553
katy 0.2831068659572615
yasmin 0.2331385776792876
As we can see:
- female names tend to have a positive cosine similarity with the vector g,
- while male names tend to have a negative one. The result looks reasonable.
Now try some other words
print('Other words and their similarities:')
word_list = ['lipstick', 'guns', 'science', 'arts', 'literature', 'warrior', 'doctor', 'tree', 'receptionist',
             'technology', 'fashion', 'teacher', 'engineer', 'pilot', 'computer', 'singer']
for w in word_list:
    print(w, cosine_similarity(word_to_vec_map[w], g))
Output:
Other words and their similarities:
lipstick 0.2769191625638267
guns -0.1888485567898898
science -0.06082906540929701
arts 0.008189312385880337
literature 0.06472504433459932
warrior -0.20920164641125288
doctor 0.11895289410935041
tree -0.07089399175478091
receptionist 0.3307794175059374
technology -0.13193732447554302
fashion 0.03563894625772699
teacher 0.17920923431825664
engineer -0.0803928049452407
pilot 0.0010764498991916937
computer -0.10330358873850498
singer 0.1850051813649629
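These results reflect certain gender stereotypes. For example, "computer" is closer to "man", while "literature" is closer to "woman".
Below we see how to reduce the bias in these vectors with the algorithm proposed by Bolukbasi et al. (2016).
Note that some word pairs, such as "actor"/"actress" or "grandmother"/"grandfather", should remain gender-specific, while other words such as "receptionist" or "technology" should be neutral, i.e. unrelated to gender. When debiasing, you must treat these two kinds of words differently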
3.1 Neutralizing bias for non-gender-specific words
Given the bias axis g, the bias component of a word vector e and its neutralized version are
$$e^{bias\_component} = \frac{e \cdot g}{\lVert g \rVert_2^2}\, g, \qquad e^{debiased} = e - e^{bias\_component}$$
def neutralize(word, g, word_to_vec_map):
    """
    Removes the bias of "word" by projecting it on the space orthogonal to the bias axis.
    This function ensures that gender neutral words are zero in the gender subspace.
    Arguments:
    word -- string indicating the word to debias
    g -- numpy-array of shape (50,), corresponding to the bias axis (such as gender)
    word_to_vec_map -- dictionary mapping words to their corresponding vectors.
    Returns:
    e_debiased -- neutralized word vector representation of the input "word"
    """
    ### START CODE HERE ###
    # Select word vector representation of "word". Use word_to_vec_map. (≈ 1 line)
    e = word_to_vec_map[word]
    # Compute e_biascomponent using the formula given above. (≈ 1 line)
    e_biascomponent = np.dot(e, g) / np.linalg.norm(g)**2 * g
    # Neutralize e by subtracting e_biascomponent from it
    # e_debiased should be equal to its orthogonal projection. (≈ 1 line)
    e_debiased = e - e_biascomponent
    ### END CODE HERE ###
    return e_debiased
Test:
e = "receptionist"
print("cosine similarity between " + e + " and g, before neutralizing: ", cosine_similarity(word_to_vec_map["receptionist"], g))
e_debiased = neutralize("receptionist", g, word_to_vec_map)
print("cosine similarity between " + e + " and g, after neutralizing: ", cosine_similarity(e_debiased, g))
Output:
cosine similarity between receptionist and g, before neutralizing:  0.3307794175059374
cosine similarity between receptionist and g, after neutralizing:  -2.099120994400013e-17
After neutralizing, the similarity between "receptionist" and the gender axis is essentially 0: the word leans toward neither "man" nor "woman"
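The projection step can be checked by hand on a tiny example. The sketch below uses made-up 2-D vectors and a hypothetical helper neutralize_vector that takes the vector directly instead of looking a word up.

```python
import numpy as np

def neutralize_vector(e, g):
    """Remove from e its component along the bias axis g."""
    e_bias = np.dot(e, g) / np.dot(g, g) * g   # projection of e onto g
    return e - e_bias

g = np.array([1.0, 1.0])   # toy bias axis
e = np.array([3.0, 1.0])   # toy word vector

e_debiased = neutralize_vector(e, g)
print(e_debiased)              # [ 1. -1.]
print(np.dot(e_debiased, g))   # 0.0, so nothing is left along g
```

Here e · g = 4 and ‖g‖² = 2, so the bias component is 2g = (2, 2) and the remainder (1, -1) is orthogonal to g.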
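3.2 Equalization algorithm for gender-specific words
Next, debiasing is applied to word pairs such as "actress" and "actor". Equalization is applied to pairs of words that should differ only through the gender property. As a concrete example, suppose "actress" is closer to "babysitter" than "actor" is. Neutralizing "babysitter" reduces the gender stereotype associated with babysitting, but it still does not guarantee that "actor" and "actress" are equidistant from "babysitter". The equalization algorithm takes care of this.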
def equalize(pair, bias_axis, word_to_vec_map):
    """
    Debias gender specific words by following the equalize method described in the figure above.
    Arguments:
    pair -- pair of strings of gender specific words to debias, e.g. ("actress", "actor")
    bias_axis -- numpy-array of shape (50,), vector corresponding to the bias axis, e.g. gender
    word_to_vec_map -- dictionary mapping words to their corresponding vectors
    Returns
    e_1 -- word vector corresponding to the first word
    e_2 -- word vector corresponding to the second word
    """
    ### START CODE HERE ###
    # Step 1: Select word vector representation of "word". Use word_to_vec_map. (≈ 2 lines)
    w1, w2 = pair[0], pair[1]
    e_w1, e_w2 = word_to_vec_map[w1], word_to_vec_map[w2]
    # Step 2: Compute the mean of e_w1 and e_w2 (≈ 1 line)
    mu = (e_w1 + e_w2) / 2
    # Step 3: Compute the projections of mu over the bias axis and the orthogonal axis (≈ 2 lines)
    mu_B = np.dot(mu, bias_axis) / np.linalg.norm(bias_axis)**2 * bias_axis
    mu_orth = mu - mu_B
    # Step 4: Use equations (7) and (8) to compute e_w1B and e_w2B (≈2 lines)
    e_w1B = np.dot(e_w1, bias_axis) / np.linalg.norm(bias_axis)**2 * bias_axis
    e_w2B = np.dot(e_w2, bias_axis) / np.linalg.norm(bias_axis)**2 * bias_axis
    # Step 5: Adjust the Bias part of e_w1B and e_w2B using the formulas (9) and (10) given above (≈2 lines)
    # The denominator is the L2 norm ||(e_w - mu_orth) - mu_B||, a scalar, not an element-wise absolute value
    scale = np.sqrt(np.abs(1 - np.linalg.norm(mu_orth)**2))
    corrected_e_w1B = scale * (e_w1B - mu_B) / np.linalg.norm(e_w1 - mu_orth - mu_B)
    corrected_e_w2B = scale * (e_w2B - mu_B) / np.linalg.norm(e_w2 - mu_orth - mu_B)
    # Step 6: Debias by equalizing e1 and e2 to the sum of their corrected projections (≈2 lines)
    e1 = corrected_e_w1B + mu_orth
    e2 = corrected_e_w2B + mu_orth
    ### END CODE HERE ###
    return e1, e2
Test:
print("cosine similarities before equalizing:")
print("cosine_similarity(word_to_vec_map[\"man\"], gender) = ", cosine_similarity(word_to_vec_map["man"], g))
print("cosine_similarity(word_to_vec_map[\"woman\"], gender) = ", cosine_similarity(word_to_vec_map["woman"], g))
print()
e1, e2 = equalize(("man", "woman"), g, word_to_vec_map)
print("cosine similarities after equalizing:")
print("cosine_similarity(e1, gender) = ", cosine_similarity(e1, g))
print("cosine_similarity(e2, gender) = ", cosine_similarity(e2, g))
Output:
cosine similarities before equalizing:
cosine_similarity(word_to_vec_map["man"], gender) = -0.11711095765336832
cosine_similarity(word_to_vec_map["woman"], gender) = 0.35666618846270376
cosine similarities after equalizing:
cosine_similarity(e1, gender) = -0.7165727525843935
cosine_similarity(e2, gender) = 0.7396596474928909
After equalizing, the two similarities have opposite signs and similar magnitudes. The values above were recorded with an element-wise np.abs(...) in the Step 5 denominator; with the L2 norm that formulas (9) and (10) call for, as in the code above, the two magnitudes are equal up to floating-point error
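The symmetry produced by equalization can be verified on small vectors. The following sketch applies the same steps to two made-up 3-D vectors; equalize_vectors and cosine are hypothetical helpers that work on raw vectors, and the denominator in the correction step is taken to be the L2 norm.

```python
import numpy as np

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def equalize_vectors(e_w1, e_w2, bias_axis):
    """Make e_w1 and e_w2 differ only along bias_axis, symmetrically."""
    project = lambda x: np.dot(x, bias_axis) / np.dot(bias_axis, bias_axis) * bias_axis
    mu = (e_w1 + e_w2) / 2
    mu_B = project(mu)
    mu_orth = mu - mu_B
    scale = np.sqrt(abs(1 - np.dot(mu_orth, mu_orth)))
    c1 = scale * (project(e_w1) - mu_B) / np.linalg.norm(e_w1 - mu_orth - mu_B)
    c2 = scale * (project(e_w2) - mu_B) / np.linalg.norm(e_w2 - mu_orth - mu_B)
    return c1 + mu_orth, c2 + mu_orth

g = np.array([1.0, 0.0, 0.0])      # toy bias axis
e_a = np.array([0.5, 0.3, 0.2])    # toy word vectors
e_b = np.array([-0.1, 0.4, 0.1])

e1, e2 = equalize_vectors(e_a, e_b, g)
print(e1, e2)
print(cosine(e1, g), cosine(e2, g))  # same magnitude, opposite sign
```

Both outputs share the same component orthogonal to g and mirror each other along g, so any neutralized word is equidistant from the two.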
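Assignment 2: Emojify (emoji generation)
Build an Emojifier using word vector representations
An Emojifier makes your messages more expressive. Thanks to word vectors, even when a word in your sentence never appeared with a given emoji in the training set, the model can still learn to use that emoji for it.
- Import packages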
import numpy as np
from emo_utils import *
import emoji
import matplotlib.pyplot as plt
%matplotlib inline
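1. Baseline model: Emojifier-V1
1.1 Dataset
X: 132 sentences (strings). Y: integer labels from 0 to 4, giving the emoji that goes with each sentence
- Load the dataset: training set (132 examples, matching the 132/132 shown in the training log further down) and test set (56 examples)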
X_train, Y_train = read_csv('data/train_emoji.csv')
X_test, Y_test = read_csv('data/tesss.csv')
maxLen = len(max(X_train, key=len).split())
print(max(X_train, key=len).split())
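Output:
['I', 'am', 'so', 'impressed', 'by', 'your', 'dedication', 'to', 'this', 'project']
The longest sentence has 10 words
- Inspect the dataset
index = 3
print(X_train[index], label_to_emoji(Y_train[index]))
Output: Miss you so much ❤️
1.2 Model overview
For convenience, convert the integer labels Y into a one-hot representation of shape (m, 5)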
Y_oh_train = convert_to_one_hot(Y_train, C = 5)
Y_oh_test = convert_to_one_hot(Y_test, C = 5)
index = 52
print(Y_train[index], "is converted into one hot", Y_oh_train[index])
Output:
3 is converted into one hot [0. 0. 0. 1. 0.]
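convert_to_one_hot comes from emo_utils and its source is not shown here. A plausible equivalent, sketched with the same np.eye indexing trick that appears later in this post, looks like this (the name to_one_hot is hypothetical):

```python
import numpy as np

def to_one_hot(labels, num_classes):
    """Row i of the result is the one-hot encoding of labels[i]."""
    labels = np.asarray(labels).reshape(-1)
    return np.eye(num_classes)[labels]   # pick rows of the identity matrix

Y = np.array([0, 3, 4])
Y_oh = to_one_hot(Y, 5)
print(Y_oh[1])                   # [0. 0. 0. 1. 0.]
print(np.argmax(Y_oh, axis=1))   # [0 3 4], recovers the integer labels
```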
1.3 Implementing Emojifier-V1
Use pretrained 50-dimensional GloVe embeddings
word_to_index, index_to_word, word_to_vec_map = read_glove_vecs('data/glove.6B.50d.txt')
- Check that everything loaded correctly
word = "cucumber"
index = 289846
print("the index of", word, "in the vocabulary is", word_to_index[word])
print("the", str(index) + "th word in the vocabulary is", index_to_word[index])
Output:
the index of cucumber in the vocabulary is 113317
the 289846th word in the vocabulary is potatos
Implement sentence_to_avg():
- Convert each sentence to lower case and split it into words
- Represent each word by its GloVe vector, then average the vectors over the sentence
# GRADED FUNCTION: sentence_to_avg
def sentence_to_avg(sentence, word_to_vec_map):
    """
    Converts a sentence (string) into a list of words (strings). Extracts the GloVe representation of each word
    and averages its value into a single vector encoding the meaning of the sentence.
    Arguments:
    sentence -- string, one training example from X
    word_to_vec_map -- dictionary mapping every word in a vocabulary into its 50-dimensional vector representation
    Returns:
    avg -- average vector encoding information about the sentence, numpy-array of shape (50,)
    """
    ### START CODE HERE ###
    # Step 1: Split sentence into list of lower case words (≈ 1 line)
    words = sentence.lower().split()
    # Initialize the average word vector, should have the same shape as your word vectors.
    avg = np.zeros(word_to_vec_map[words[0]].shape)
    # Step 2: average the word vectors. You can loop over the words in the list "words".
    for w in words:
        avg += word_to_vec_map[w]
    avg /= len(words)
    ### END CODE HERE ###
    return avg
Test:
avg = sentence_to_avg("Morrocan couscous is my favorite dish", word_to_vec_map)
print("avg = ", avg)
Output:
avg = [-0.008005 0.56370833 -0.50427333 0.258865 0.55131103 0.03104983
-0.21013718 0.16893933 -0.09590267 0.141784 -0.15708967 0.18525867
0.6495785 0.38371117 0.21102167 0.11301667 0.02613967 0.26037767
0.05820667 -0.01578167 -0.12078833 -0.02471267 0.4128455 0.5152061
0.38756167 -0.898661 -0.535145 0.33501167 0.68806933 -0.2156265
1.797155 0.10476933 -0.36775333 0.750785 0.10282583 0.348925
-0.27262833 0.66768 -0.10706167 -0.283635 0.59580117 0.28747333
-0.3366635 0.23393817 0.34349183 0.178405 0.1166155 -0.076433
0.1445417 0.09808667]
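sentence_to_avg raises a KeyError as soon as a sentence contains a word missing from the GloVe vocabulary. One possible way to make it more forgiving is to skip unknown words, as in the sketch below. This is an assumption about desired behavior, not part of the assignment; sentence_to_avg_safe and the toy 2-D vectors are hypothetical.

```python
import numpy as np

def sentence_to_avg_safe(sentence, word_to_vec_map, dim):
    """Average the vectors of the known words; unknown words are ignored."""
    vectors = [word_to_vec_map[w] for w in sentence.lower().split() if w in word_to_vec_map]
    if not vectors:              # empty sentence or no known word at all
        return np.zeros(dim)
    return np.mean(vectors, axis=0)

toy_map = {
    "i":    np.array([1.0, 0.0]),
    "love": np.array([0.0, 1.0]),
    "you":  np.array([1.0, 1.0]),
}
avg = sentence_to_avg_safe("I love you xyzzy", toy_map, dim=2)
print(avg)  # mean of the three known vectors, "xyzzy" is skipped
```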
The model
After processing each sentence with sentence_to_avg(), run forward propagation, compute the loss, then backpropagate to update the parameters
# GRADED FUNCTION: model
def model(X, Y, word_to_vec_map, learning_rate = 0.01, num_iterations = 400):
    """
    Model to train word vector representations in numpy.
    Arguments:
    X -- input data, numpy array of sentences as strings, of shape (m, 1)
    Y -- labels, numpy array of integers between 0 and 4, numpy-array of shape (m, 1)
    word_to_vec_map -- dictionary mapping every word in a vocabulary into its 50-dimensional vector representation
    learning_rate -- learning_rate for the stochastic gradient descent algorithm
    num_iterations -- number of iterations
    Returns:
    pred -- vector of predictions, numpy-array of shape (m, 1)
    W -- weight matrix of the softmax layer, of shape (n_y, n_h)
    b -- bias of the softmax layer, of shape (n_y,)
    """
    np.random.seed(1)
    # Define number of training examples
    m = Y.shape[0]                          # number of training examples
    n_y = 5                                 # number of classes
    n_h = 50                                # dimensions of the GloVe vectors
    # Initialize parameters using Xavier initialization
    W = np.random.randn(n_y, n_h) / np.sqrt(n_h)
    b = np.zeros((n_y,))
    # Convert Y to Y_onehot with n_y classes
    Y_oh = convert_to_one_hot(Y, C = n_y)
    # Optimization loop
    for t in range(num_iterations):         # Loop over the number of iterations
        for i in range(m):                  # Loop over the training examples
            ### START CODE HERE ### (≈ 4 lines of code)
            # Average the word vectors of the words from the i'th training example
            avg = sentence_to_avg(X[i], word_to_vec_map)
            # Forward propagate the avg through the softmax layer
            z = np.dot(W, avg) + b
            a = softmax(z)
            # Compute cost using the i'th training label's one hot representation and "a" (the output of the softmax)
            cost = -np.sum(Y_oh[i] * np.log(a))
            ### END CODE HERE ###
            # Compute gradients
            dz = a - Y_oh[i]
            dW = np.dot(dz.reshape(n_y, 1), avg.reshape(1, n_h))
            db = dz
            # Update parameters with Stochastic Gradient Descent
            W = W - learning_rate * dW
            b = b - learning_rate * db
        if t % 100 == 0:
            print("Epoch: " + str(t) + " --- cost = " + str(cost))
            pred = predict(X, Y, W, b, word_to_vec_map)
    return pred, W, b
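The gradient line dz = a - Y_oh[i] relies on a standard result: for a softmax output with cross-entropy loss, the gradient of the loss with respect to z is a - y. The sketch below checks this numerically with central differences. The softmax here is defined locally as an assumption (the one used in model() comes from emo_utils), and the values of z and y are arbitrary.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))   # shift for numerical stability
    return e / e.sum()

def cross_entropy(z, y_onehot):
    return -np.sum(y_onehot * np.log(softmax(z)))

z = np.array([0.2, -1.0, 0.5, 0.0, 1.3])   # arbitrary logits for 5 classes
y = np.array([0.0, 0.0, 1.0, 0.0, 0.0])    # true class is 2

analytic = softmax(z) - y                  # the formula used in model()

eps = 1e-6
numeric = np.zeros_like(z)
for k in range(z.size):
    step = np.zeros_like(z)
    step[k] = eps
    numeric[k] = (cross_entropy(z + step, y) - cross_entropy(z - step, y)) / (2 * eps)

max_diff = np.max(np.abs(analytic - numeric))
print(max_diff)   # tiny, so the two gradients agree
```

From dz, the chain rule gives dW as the outer product of dz and avg, and db = dz, exactly as in the training loop.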
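1.4 Evaluating on the training set
# Train first so that W and b are defined (this prints the cost every 100 epochs)
pred, W, b = model(X_train, Y_train, word_to_vec_map)
print("Training set:")
pred_train = predict(X_train, Y_train, W, b, word_to_vec_map)
print('Test set:')
pred_test = predict(X_test, Y_test, W, b, word_to_vec_map)
Output:
Training set:
Accuracy: 0.9772727272727273
Test set:
Accuracy: 0.8571428571428571
Random guessing would be right 20% of the time on average (1/5), so the model does quite well given that it only has 132 training examples
Let's try a few sentences of our own:
- The training set contains "I love you" with the label ❤️
- We check what happens with "adore", a word that never appears in the training set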
X_my_sentences = np.array(["i adore you", "i love you", "funny lol", "lets play with a ball", "food is ready", "not feeling happy"])
Y_my_labels = np.array([[0], [0], [2], [1], [4],[3]])
pred = predict(X_my_sentences, Y_my_labels , W, b, word_to_vec_map)
print_predictions(X_my_sentences, pred)
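Output:
Accuracy: 0.8333333333333334 (5/6; only the last one is wrong)
i adore you ❤️ (adore has an embedding similar to love)
i love you ❤️
funny lol ?
lets play with a ball ⚾
food is ready ?
not feeling happy ? (misclassified: averaging word vectors ignores word order, so the model cannot capture combinations such as "not ... happy")
Error analysis: printing the confusion matrix helps show which examples the model gets wrong. A confusion matrix shows how often examples whose true label is one class are labelled by the algorithm as a different class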
import pandas as pd
print(Y_test.shape)
print(' '+ label_to_emoji(0)+ ' ' + label_to_emoji(1) + ' ' + label_to_emoji(2)+ ' ' + label_to_emoji(3)+' ' + label_to_emoji(4))
print(pd.crosstab(Y_test, pred_test.reshape(56,), rownames=['Actual'], colnames=['Predicted'], margins=True))
plot_confusion_matrix(Y_test, pred_test)
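To make the idea concrete without pandas, a confusion matrix can also be counted by hand. This is a minimal sketch on made-up labels for 3 classes; confusion_matrix is a hypothetical local helper, not the plot_confusion_matrix from emo_utils.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, num_classes):
    """cm[t, p] counts the examples with true label t that were predicted as p."""
    cm = np.zeros((num_classes, num_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[int(t), int(p)] += 1
    return cm

y_true = np.array([0, 0, 1, 2, 2, 2])   # toy ground truth
y_pred = np.array([0, 1, 1, 2, 0, 2])   # toy predictions

cm = confusion_matrix(y_true, y_pred, 3)
accuracy = np.trace(cm) / cm.sum()      # correct predictions sit on the diagonal
print(cm)
print(accuracy)
```

Rows are true labels and columns are predictions, the same orientation as the pd.crosstab call above; every off-diagonal entry is a specific kind of mistake.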
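2. Emojifier-V2: Using LSTMs in Keras
Let's build an LSTM model that takes word sequences as input, so that it can take word order into account. Emojifier-V2 still represents words with pretrained word embeddings, but feeds them into an LSTM whose job is to predict the most appropriate emoji.
- Import packages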
import numpy as np
np.random.seed(0)
from keras.models import Model
from keras.layers import Dense, Input, Dropout, LSTM, Activation
from keras.layers.embeddings import Embedding
from keras.preprocessing import sequence
from keras.initializers import glorot_uniform
np.random.seed(1)
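2.1 Model overview
2.2 Keras and mini-batching
To train on mini-batches, every sentence in a batch must have the same length. Sentences shorter than the maximum length are padded at the end with zero vectors
2.3 The Embedding layer
https://keras.io/zh/layers/embeddings/
- First, fill in the vocabulary index of every word of every sentence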
# GRADED FUNCTION: sentences_to_indices
def sentences_to_indices(X, word_to_index, max_len):
    """
    Converts an array of sentences (strings) into an array of indices corresponding to words in the sentences.
    The output shape should be such that it can be given to `Embedding()` (described in Figure 4).
    Arguments:
    X -- array of sentences (strings), of shape (m, 1)
    word_to_index -- a dictionary containing the each word mapped to its index
    max_len -- maximum number of words in a sentence. You can assume every sentence in X is no longer than this.
    Returns:
    X_indices -- array of indices corresponding to words in the sentences from X, of shape (m, max_len)
    """
    m = X.shape[0]                                   # number of training examples
    ### START CODE HERE ###
    # Initialize X_indices as a numpy matrix of zeros and the correct shape (≈ 1 line)
    X_indices = np.zeros((m, max_len))
    for i in range(m):                               # loop over training examples
        # Convert the ith training sentence in lower case and split is into words. You should get a list of words.
        sentence_words = X[i].lower().split()
        # Initialize j to 0
        j = 0
        # Loop over the words of sentence_words
        for w in sentence_words:
            # Set the (i,j)th entry of X_indices to the index of the correct word.
            X_indices[i, j] = word_to_index[w]
            # Increment j to j + 1
            j = j + 1
    ### END CODE HERE ###
    return X_indices
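What the padding looks like is easiest to see on a tiny vocabulary. The sketch below follows the same logic on a made-up word_to_index. The helper name to_padded_indices is hypothetical, and two details are assumptions added here: word indices start at 1 so that 0 can stand for padding, and sentences longer than max_len are truncated.

```python
import numpy as np

def to_padded_indices(sentences, word_to_index, max_len):
    """Map each sentence to a row of word indices, zero-padded on the right."""
    X_indices = np.zeros((len(sentences), max_len), dtype=int)
    for i, sentence in enumerate(sentences):
        # Truncation to max_len is an addition of this sketch.
        for j, w in enumerate(sentence.lower().split()[:max_len]):
            X_indices[i, j] = word_to_index[w]
    return X_indices

toy_index = {"funny": 1, "lol": 2, "i": 3, "love": 4, "you": 5}
padded = to_padded_indices(["funny lol", "I love you"], toy_index, max_len=4)
print(padded)
# [[1 2 0 0]
#  [3 4 5 0]]
```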
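Implement pretrained_embedding_layer():
- Initialize the embedding matrix, paying attention to its shape
- Fill the embedding matrix with the vectors taken from word_to_vec_map
- Define the Keras embedding layer with trainable = False so that it is not trained; with True, the algorithm is allowed to modify the embedding values
- Set the layer's weights equal to the embedding matrix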
# GRADED FUNCTION: pretrained_embedding_layer
def pretrained_embedding_layer(word_to_vec_map, word_to_index):
    """
    Creates a Keras Embedding() layer and loads in pre-trained GloVe 50-dimensional vectors.
    Arguments:
    word_to_vec_map -- dictionary mapping words to their GloVe vector representation.
    word_to_index -- dictionary mapping from words to their indices in the vocabulary (400,001 words)
    Returns:
    embedding_layer -- pretrained layer Keras instance
    """
    vocab_len = len(word_to_index) + 1                  # adding 1 to fit Keras embedding (requirement)
    emb_dim = word_to_vec_map["cucumber"].shape[0]      # define dimensionality of your GloVe word vectors (= 50)
    ### START CODE HERE ###
    # Initialize the embedding matrix as a numpy array of zeros of shape (vocab_len, dimensions of word vectors = emb_dim)
    emb_matrix = np.zeros((vocab_len, emb_dim))
    # Set each row "index" of the embedding matrix to be the word vector representation of the "index"th word of the vocabulary
    for word, index in word_to_index.items():
        emb_matrix[index, :] = word_to_vec_map[word]
    # Define Keras embedding layer with the correct output/input sizes and make it non-trainable: use Embedding(...) with trainable=False.
    embedding_layer = Embedding(vocab_len, emb_dim, trainable=False)
    ### END CODE HERE ###
    # Build the embedding layer, it is required before setting the weights of the embedding layer. Do not modify the "None".
    embedding_layer.build((None,))
    # Set the weights of the embedding layer to the embedding matrix. Your layer is now pretrained.
    embedding_layer.set_weights([emb_matrix])
    return embedding_layer
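Conceptually, a frozen embedding layer is a table lookup: each index in the input selects one row of the embedding matrix. The sketch below reproduces that lookup in plain NumPy with made-up 2-D vectors (no Keras involved), assuming as above that index 0 is reserved for padding.

```python
import numpy as np

toy_vectors = {"a": np.array([0.1, 0.2]), "b": np.array([0.3, 0.4])}
toy_index = {"a": 1, "b": 2}             # index 0 stays unused, so it can mean "padding"

vocab_len = len(toy_index) + 1           # one extra row for index 0
emb_dim = 2
emb_matrix = np.zeros((vocab_len, emb_dim))
for word, index in toy_index.items():
    emb_matrix[index, :] = toy_vectors[word]

batch = np.array([[1, 2, 0],             # "a b <pad>"
                  [2, 0, 0]])            # "b <pad> <pad>"
embedded = emb_matrix[batch]             # fancy indexing performs the lookup
print(embedded.shape)                    # (2, 3, 2): (batch, max_len, emb_dim)
```

Padded positions come out as all-zero vectors because row 0 of the matrix is never filled.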
2.4 Building Emojifier-V2
https://keras.io/zh/layers/core/#input
https://keras.io/zh/layers/embeddings/#embedding
https://keras.io/zh/layers/recurrent/#lstm
https://keras.io/zh/layers/core/#dropout
https://keras.io/zh/layers/core/#dense
https://keras.io/zh/activations/
https://keras.io/zh/models/about-keras-models/#model
# GRADED FUNCTION: Emojify_V2
def Emojify_V2(input_shape, word_to_vec_map, word_to_index):
    """
    Function creating the Emojify-v2 model's graph.
    Arguments:
    input_shape -- shape of the input, usually (max_len,)
    word_to_vec_map -- dictionary mapping every word in a vocabulary into its 50-dimensional vector representation
    word_to_index -- dictionary mapping from words to their indices in the vocabulary (400,001 words)
    Returns:
    model -- a model instance in Keras
    """
    ### START CODE HERE ###
    # Define sentence_indices as the input of the graph, it should be of shape input_shape and dtype 'int32' (as it contains indices).
    sentence_indices = Input(input_shape, dtype='int32')
    # Create the embedding layer pretrained with GloVe Vectors (≈1 line)
    embedding_layer = pretrained_embedding_layer(word_to_vec_map, word_to_index)
    # Propagate sentence_indices through your embedding layer, you get back the embeddings
    embeddings = embedding_layer(sentence_indices)
    # Propagate the embeddings through an LSTM layer with 128-dimensional hidden state
    # Be careful, the returned output should be a batch of sequences.
    X = LSTM(128, return_sequences=True)(embeddings)
    # Add dropout with a probability of 0.5
    X = Dropout(rate=0.5)(X)
    # Propagate X through another LSTM layer with 128-dimensional hidden state
    # Be careful, the returned output should be a single hidden state, not a batch of sequences.
    X = LSTM(128, return_sequences=False)(X)
    # Add dropout with a probability of 0.5
    X = Dropout(rate=0.5)(X)
    # Propagate X through a Dense layer to get back a batch of 5-dimensional vectors.
    X = Dense(5)(X)
    # Add a softmax activation
    X = Activation('softmax')(X)
    # Create Model instance which converts sentence_indices into X.
    model = Model(inputs=sentence_indices, outputs=X)
    ### END CODE HERE ###
    return model
- Create the model
model = Emojify_V2((maxLen,), word_to_vec_map, word_to_index)
model.summary()
Output:
Model: "model_1"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_3 (InputLayer) (None, 10) 0
_________________________________________________________________
embedding_4 (Embedding) (None, 10, 50) 20000050
_________________________________________________________________
lstm_3 (LSTM) (None, 10, 128) 91648
_________________________________________________________________
dropout_1 (Dropout) (None, 10, 128) 0
_________________________________________________________________
lstm_4 (LSTM) (None, 128) 131584
_________________________________________________________________
dropout_2 (Dropout) (None, 128) 0
_________________________________________________________________
dense_1 (Dense) (None, 5) 645
_________________________________________________________________
activation_1 (Activation) (None, 5) 0
=================================================================
Total params: 20,223,927
Trainable params: 223,877
Non-trainable params: 20,000,050 (note: 400,001 words × 50 embedding dimensions)
_________________________________________________________________
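The parameter counts in the summary can be reproduced by hand. An LSTM layer has four gates, each with a weight matrix over the concatenated input and hidden state plus a bias, so it has 4 × (units × (input_dim + units) + units) parameters. The helpers below are hypothetical names for this arithmetic.

```python
def lstm_params(input_dim, units):
    """4 gates, each with weights for [input, hidden state] plus a bias."""
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    """One weight per input-output pair plus one bias per output."""
    return input_dim * units + units

embedding = 400001 * 50            # frozen GloVe matrix (non-trainable)
lstm_1 = lstm_params(50, 128)      # input: 50-d embeddings
lstm_2 = lstm_params(128, 128)     # input: 128-d outputs of the first LSTM
dense = dense_params(128, 5)
trainable = lstm_1 + lstm_2 + dense
total = embedding + trainable

print(embedding, lstm_1, lstm_2, dense, trainable, total)
# 20000050 91648 131584 645 223877 20223927
```

These numbers match the Param # column and the totals printed by model.summary().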
- Compile the model
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
- Train the model
Convert X and Y to the required format
X_train_indices = sentences_to_indices(X_train, word_to_index, maxLen)
Y_train_oh = convert_to_one_hot(Y_train, C = 5)
Train
model.fit(X_train_indices, Y_train_oh, epochs = 50, batch_size = 32, shuffle=True)
Output:
WARNING:tensorflow:From c:\program files\python37\lib\site-packages\keras\backend\tensorflow_backend.py:422: The name tf.global_variables is deprecated. Please use tf.compat.v1.global_variables instead.
Epoch 1/50
132/132 [==============================] - 1s 5ms/step - loss: 1.6088 - accuracy: 0.1970
Epoch 2/50
132/132 [==============================] - 0s 582us/step - loss: 1.5221 - accuracy: 0.3636
Epoch 3/50
132/132 [==============================] - 0s 574us/step - loss: 1.4762 - accuracy: 0.3939
(omitted)
Epoch 49/50
132/132 [==============================] - 0s 597us/step - loss: 0.0115 - accuracy: 1.0000
Epoch 50/50
132/132 [==============================] - 0s 582us/step - loss: 0.0182 - accuracy: 0.9924
Accuracy on the training set is close to 100%
- Evaluate on the test set
X_test_indices = sentences_to_indices(X_test, word_to_index, max_len = maxLen)
Y_test_oh = convert_to_one_hot(Y_test, C = 5)
loss, acc = model.evaluate(X_test_indices, Y_test_oh)
print()
print("Test accuracy = ", acc)
Output:
56/56 [==============================] - 0s 2ms/step
Test accuracy = 0.875
Accuracy on the test set is 87.5%
- Inspect the misclassified examples
# This code allows you to see the mislabelled examples
X_test_indices = sentences_to_indices(X_test, word_to_index, maxLen)
pred = model.predict(X_test_indices)
for i in range(len(X_test)):
    num = np.argmax(pred[i])      # predicted class = index of the largest softmax output
    if num != Y_test[i]:
        print('Expected emoji:' + label_to_emoji(Y_test[i]) + ' prediction: ' + X_test[i] + label_to_emoji(num).strip())
Output:
Expected emoji:? prediction: work is hard ?
Expected emoji:? prediction: This girl is messing with me ❤️
Expected emoji:? prediction: work is horrible ?
Expected emoji:? prediction: any suggestions for dinner ?
Expected emoji:? prediction: you brighten my day ❤️
Expected emoji:? prediction: go away ⚾
Expected emoji:? prediction: I did not have breakfast ❤️
- Try your own examples
# Change the sentence below to see your prediction. Make sure all the words are in the Glove embeddings.
x_test = np.array(['not feeling happy'])
X_test_indices = sentences_to_indices(x_test, word_to_index, maxLen)
print(x_test[0] +' '+ label_to_emoji(np.argmax(model.predict(X_test_indices))))
not feeling happy ? (this time the LSTM handles combinations such as "not ... happy")
not very happy ?
very happy ?
i really love my wife ❤️
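Summary:
- If you have an NLP task with a very small training set, word embeddings can help your algorithm significantly. They allow the model to handle words in the test set that never appeared in the training set
- Training sequence models in Keras (and in most other deep learning frameworks) requires attention to a few details:
  - To use mini-batches, sequences must be padded so that all examples in a mini-batch have the same length
  - An Embedding() layer can be initialized with pretrained values. These values can be kept fixed or trained further on your dataset. If the dataset is small, further training is usually not worth it
  - LSTM() has a flag called return_sequences that decides whether to return every hidden state or only the last one
  - Dropout() can be used right after LSTM() to regularize the network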
Article URL: https://michael.blog.csdn.net/article/details/108902060
My CSDN blog: https://michael.blog.csdn.net/