Part 2: Emojify
Welcome to the second assignment of this week. You will use word vectors to build an Emojifier.
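Have you ever wanted to make your text messages more expressive? The emojifier app will help you do that. Instead of writing "Congratulations on the promotion! Lets get coffee and talk. Love you!", the emojifier can automatically turn this into "Congratulations on the promotion! 👍 Lets get coffee and talk. ☕️ Love you! ❤️"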
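Alternatively, if you are not interested in emojis yourself but have friends who send you wild text messages overloaded with emojis, you can also use the emojifier to reply to them.
You will implement a model that takes a sentence as input (for example, "Let's go see the baseball game tonight!") and finds the most appropriate emoji for it (⚾️). In many emoji interfaces, you need to remember that ❤️ is the "heart" symbol rather than the "love" symbol. With word vectors, however, even if your training set explicitly associates only a few words with a particular emoji, your algorithm can generalize and associate related words in the test set with the same emoji, even when those words never appear in the training set. This lets you build an accurate classifier mapping sentences to emojis even with a small training set.
In this exercise, you will start with a baseline model that uses word embeddings (Emojifier-V1), then build a more sophisticated model that additionally incorporates an LSTM (Emojifier-V2).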
Imports
import numpy as np
from emo_utils import *
import emoji
import matplotlib.pyplot as plt
%matplotlib inline
Useful functions in emo_utils
import csv
import numpy as np
import emoji
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix
def read_glove_vecs(glove_file):
    # GloVe files contain non-ASCII tokens, so read them explicitly as UTF-8
    with open(glove_file, 'r', encoding='utf-8') as f:
        words = set()
        word_to_vec_map = {}
        for line in f:
            # Each line is: word followed by its vector components
            line = line.strip().split()
            curr_word = line[0]
            words.add(curr_word)
            word_to_vec_map[curr_word] = np.array(line[1:], dtype=np.float64)

        # Indices are assigned in sorted word order, starting at 1
        i = 1
        words_to_index = {}
        index_to_words = {}
        for w in sorted(words):
            words_to_index[w] = i
            index_to_words[i] = w
            i = i + 1
    return words_to_index, index_to_words, word_to_vec_map
def softmax(x):
    """Compute softmax values for each sets of scores in x."""
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum()

def read_csv(filename = 'data/emojify_data.csv'):
    phrase = []
    emoji = []
    with open (filename) as csvDataFile:
        csvReader = csv.reader(csvDataFile)
        for row in csvReader:
            phrase.append(row[0])
            emoji.append(row[1])
    X = np.asarray(phrase)
    Y = np.asarray(emoji, dtype=int)
    return X, Y

def convert_to_one_hot(Y, C):
    Y = np.eye(C)[Y.reshape(-1)]
    return Y
emoji_dictionary = {
    "0": "\u2764\uFE0F",  # :heart: prints a black instead of red heart depending on the font
    "1": ":baseball:",
    "2": ":smile:",
    "3": ":disappointed:",
    "4": ":fork_and_knife:"}

def label_to_emoji(label):
    """
    Converts a label (int or string) into the corresponding emoji code (string) ready to be printed
    """
    # Note: emoji >= 2.0 removed use_aliases; with those versions pass language='alias' instead
    return emoji.emojize(emoji_dictionary[str(label)], use_aliases=True)

def print_predictions(X, pred):
    print()
    for i in range(X.shape[0]):
        print(X[i], label_to_emoji(int(pred[i])))
def plot_confusion_matrix(y_actu, y_pred, title='Confusion matrix', cmap=plt.cm.gray_r):
    df_confusion = pd.crosstab(y_actu, y_pred.reshape(y_pred.shape[0],), rownames=['Actual'], colnames=['Predicted'], margins=True)
    df_conf_norm = df_confusion / df_confusion.sum(axis=1)
    plt.matshow(df_confusion, cmap=cmap) # imshow
    #plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(df_confusion.columns))
    plt.xticks(tick_marks, df_confusion.columns, rotation=45)
    plt.yticks(tick_marks, df_confusion.index)
    #plt.tight_layout()
    plt.ylabel(df_confusion.index.name)
    plt.xlabel(df_confusion.columns.name)
def predict(X, Y, W, b, word_to_vec_map):
    """
    Given X (sentences) and Y (emoji indices), predict emojis and compute the accuracy of your model over the given set.

    Arguments:
    X -- input data containing sentences, numpy array of shape (m, None)
    Y -- labels, containing index of the label emoji, numpy array of shape (m, 1)

    Returns:
    pred -- numpy array of shape (m, 1) with your predictions
    """
    m = X.shape[0]
    pred = np.zeros((m, 1))
    for j in range(m):  # Loop over the examples in X
        # Split jth example (sentence) into list of lower case words
        words = X[j].lower().split()
        # Average words' vectors
        avg = np.zeros((50,))
        for w in words:
            avg += word_to_vec_map[w]
        avg = avg/len(words)
        # Forward propagation
        Z = np.dot(W, avg) + b
        A = softmax(Z)
        pred[j] = np.argmax(A)
    print("Accuracy: " + str(np.mean((pred[:] == Y.reshape(Y.shape[0],1)[:]))))
    return pred
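1 Baseline model: Emojifier-V1
1.1 The emoji dataset
Let's start by building a simple classifier.
We have a small dataset (X, Y):
- X contains 127 sentences
- Y contains an integer label between 0 and 4 giving the emoji for each sentence
Load the dataset below. The training set has 127 examples and the test set has 56.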
X_train, Y_train = read_csv('data/train_emoji.csv')
X_test, Y_test = read_csv('data/tesss.csv')
maxLen = len(max(X_train, key=len).split())
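One caveat about maxLen: max(X_train, key=len) picks the sentence with the most characters, which is not necessarily the sentence with the most words. The sketch below illustrates the difference on two made-up sentences (the sentences array is hypothetical, not taken from the dataset) and shows a word-count-based alternative.

```python
import numpy as np

# Toy sentences (hypothetical, not taken from train_emoji.csv)
sentences = np.asarray(["congratulations everybody", "I am so happy for you"])

# Approach used above: take the longest sentence by character count, then count its words
by_chars = len(max(sentences, key=len).split())

# Alternative: maximum number of words over all sentences
by_words = max(len(s.split()) for s in sentences)

print(by_chars, by_words)
```

If the two values agree on the real training set, the original expression is fine; the alternative only matters when a short-worded sentence happens to have the most characters.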
Inspect an example from the training set:
index = 1
print(X_train[index], label_to_emoji(Y_train[index]))
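1.2 Overview of Emojifier-V1
In this part, we implement a model called "Emojifier-V1".
The input to the model is a sentence and the output is a probability vector of shape (1,5), which is then passed through an argmax layer to obtain the most likely emoji.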
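The pipeline described above can be sketched as follows. This is only an illustration under assumptions: W and b are random stand-ins for trained parameters, avg is a random stand-in for an averaged 50-dimensional sentence vector, and the probabilities are kept as a flat vector of shape (5,) rather than (1,5).

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum()

rng = np.random.default_rng(0)
n_h, n_y = 50, 5                      # embedding size, number of emoji classes

avg = rng.standard_normal(n_h)        # stand-in for an averaged sentence vector
W = rng.standard_normal((n_y, n_h))   # stand-in for trained weights
b = np.zeros(n_y)                     # stand-in for trained biases

z = np.dot(W, avg) + b                # linear scores, shape (5,)
a = softmax(z)                        # class probabilities, sum to 1
label = int(np.argmax(a))             # index of the most likely emoji

print(a.shape, label)
```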
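To get the labels into a format suitable for training a softmax classifier, we convert Y from shape (m, 1) to a "one-hot" representation of shape (m, 5). In the conversion below, Y_oh stands for "Y-one-hot".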
Y_oh_train = convert_to_one_hot(Y_train, C = 5)
Y_oh_test = convert_to_one_hot(Y_test, C = 5)
index = 50
print(Y_train[index], "is converted into one hot", Y_oh_train[index])
# 0 is converted into one hot [ 1. 0. 0. 0. 0.]
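convert_to_one_hot works by indexing the identity matrix: row k of np.eye(C) is exactly the one-hot vector for label k. A small sketch with made-up labels (the array Y below is illustrative only), which also shows that argmax recovers the original labels:

```python
import numpy as np

Y = np.array([0, 3, 1])            # made-up integer labels
C = 5                              # number of classes

# Pick row Y[i] of the C x C identity matrix for every example i
Y_oh = np.eye(C)[Y.reshape(-1)]
print(Y_oh)

# argmax along each row inverts the encoding
recovered = np.argmax(Y_oh, axis=1)
print(recovered)
```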
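The data is ready, so we can now implement the model.
1.3 Implementing Emojifier-V1
First, each input sentence is converted to word vectors, which are then averaged. We again use 50-dimensional GloVe embeddings.
Load word_to_vec_map, which contains all the vector representations.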
word_to_index, index_to_word, word_to_vec_map = read_glove_vecs('data/glove.6B.50d.txt')
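The loaded objects are:
- word_to_index: dictionary mapping each word to its index in the vocabulary (400,001 words, with indices from 0 to 400,000; note that read_glove_vecs as listed above assigns indices starting from 1)
- index_to_word: dictionary mapping each index to its word in the vocabulary
- word_to_vec_map: dictionary mapping each word to its GloVe vector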
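To make the three mappings concrete, here is a minimal sketch that mirrors the logic of read_glove_vecs on a tiny in-memory sample. The three words and their 3-dimensional vectors are made up for illustration; the real GloVe vectors used here are 50-dimensional.

```python
import io
import numpy as np

# Made-up sample in GloVe text format: a word followed by its vector components
sample = io.StringIO("cat 0.1 0.2 0.3\napple 0.4 0.5 0.6\nbaseball 0.7 0.8 0.9\n")

word_to_vec_map = {}
for line in sample:
    parts = line.strip().split()
    word_to_vec_map[parts[0]] = np.array(parts[1:], dtype=np.float64)

# Indices are assigned in sorted word order, starting at 1 (as in read_glove_vecs)
word_to_index = {w: i for i, w in enumerate(sorted(word_to_vec_map), start=1)}
index_to_word = {i: w for w, i in word_to_index.items()}

print(word_to_index)
print(index_to_word[2])
print(word_to_vec_map["cat"])
```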
Take a look at the data:
word = "cucumber"
index = 289846
print("the index of", word, "in the vocabulary is", word_to_index[word])
print("the", str(index) + "th word in the vocabulary is", index_to_word[index])
# the index of cucumber in the vocabulary is 113317
# the 289846th word in the vocabulary is potatos
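Exercise: implement sentence_to_avg()
- Convert each sentence to lowercase, then split it into a list of words (X.lower() and X.split() may be useful)
- For each word in the sentence, look up its GloVe vector, then average these vectors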
# GRADED FUNCTION: sentence_to_avg
def sentence_to_avg(sentence, word_to_vec_map):
    """
    Converts a sentence (string) into a list of words (strings). Extracts the GloVe representation of each word
    and averages its value into a single vector encoding the meaning of the sentence.

    Arguments:
    sentence -- string, one training example from X
    word_to_vec_map -- dictionary mapping every word in a vocabulary into its 50-dimensional vector representation

    Returns:
    avg -- average vector encoding information about the sentence, numpy-array of shape (50,)
    """
    ### START CODE HERE ###
    # Step 1: Split sentence into list of lower case words (≈ 1 line)
    words = sentence.lower().split()
    # Initialize the average word vector, should have the same shape as your word vectors.
    avg = np.zeros((50,))
    # Step 2: average the word vectors. You can loop over the words in the list "words".
    for w in words:
        avg += word_to_vec_map[w]
    avg = avg/len(words)
    ### END CODE HERE ###
    return avg
avg = sentence_to_avg("Morrocan couscous is my favorite dish", word_to_vec_map)
print("avg = ", avg)
# avg = [-0.008005 0.56370833 -0.50427333 0.258865 0.55131103 0.03104983
# -0.21013718 0.16893933 -0.09590267 0.141784 -0.15708967 0.18525867
# 0.6495785 0.38371117 0.21102167 0.11301667 0.02613967 0.26037767
# 0.05820667 -0.01578167 -0.12078833 -0.02471267 0.4128455 0.5152061
# 0.38756167 -0.898661 -0.535145 0.33501167 0.68806933 -0.2156265
# 1.797155 0.10476933 -0.36775333 0.750785 0.10282583 0.348925
# -0.27262833 0.66768 -0.10706167 -0.283635 0.59580117 0.28747333
# -0.3366635 0.23393817 0.34349183 0.178405 0.1166155 -0.076433
# 0.1445417 0.09808667]
Expected output
key | value |
---|---|
avg | [-0.008005 0.56370833 -0.50427333 0.258865 0.55131103 0.03104983 -0.21013718 0.16893933 -0.09590267 0.141784 -0.15708967 0.18525867 0.6495785 0.38371117 0.21102167 0.11301667 0.02613967 0.26037767 0.05820667 -0.01578167 -0.12078833 -0.02471267 0.4128455 0.5152061 0.38756167 -0.898661 -0.535145 0.33501167 0.68806933 -0.2156265 1.797155 0.10476933 -0.36775333 0.750785 0.10282583 0.348925 -0.27262833 0.66768 -0.10706167 -0.283635 0.59580117 0.28747333 -0.3366635 0.23393817 0.34349183 0.178405 0.1166155 -0.076433 0.1445417 0.09808667] |
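Note that sentence_to_avg as written raises a KeyError if a word is missing from word_to_vec_map, and yields NaN values for an empty sentence because of the division by zero. The following is a hedged sketch of a more tolerant variant, not part of the graded exercise: the name sentence_to_avg_safe, the dim parameter, and the toy 3-dimensional toy_map are all hypothetical, and silently skipping unknown words is only one possible policy.

```python
import numpy as np

def sentence_to_avg_safe(sentence, word_to_vec_map, dim=50):
    """Average the vectors of the known words; return zeros if none are known."""
    words = sentence.lower().split()
    # Keep only words that have a vector (assumption: unknown words are skipped)
    vectors = [word_to_vec_map[w] for w in words if w in word_to_vec_map]
    if not vectors:
        return np.zeros((dim,))
    return np.mean(vectors, axis=0)

# Toy 3-dimensional vocabulary, made up for illustration
toy_map = {"i": np.array([1.0, 0.0, 0.0]), "love": np.array([0.0, 1.0, 0.0])}

avg = sentence_to_avg_safe("I love couscous", toy_map, dim=3)  # "couscous" is skipped
print(avg)
```

On sentences whose words are all in the vocabulary, this variant returns the same average as the graded function.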
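Model
You now have all the pieces needed for model(). After sentence_to_avg(), the average must be passed through forward propagation, the cost computed, and the gradients backpropagated to update the parameters.
Exercise: implement model()
Assuming Yoh ("Y one hot") is the one-hot representation of the output labels, the following equations are used for forward propagation and for computing the cross-entropy cost.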