NLP: Automatically Matching Emoji to Sentence Meaning


冲动老少年


Tags: emoji, word embeddings, Andrew Ng, sequence models

Article link: https://blog.csdn.net/u013093426/article/details/82829380


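This article is based on the Week 2 programming exercise of Course 5 (Sequence Models) in Andrew Ng's Deep Learning series.

0. Background

When sending text messages, we often use emoji to express how we feel at the moment; for example, ❤️ stands for "love". Picking the right emoji from the emoji panel takes time, though, so this program recognizes the meaning of a sentence automatically and matches a suitable emoji to it.

The third-party libraries, dataset, and helper scripts needed for this article can be downloaded from the link in the original post.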

import numpy as np
from emo_utils import *
import emoji
import matplotlib.pyplot as plt

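1. Emojifier-V1

1.1 Dataset and emoji set

We start by building a simple baseline classifier. The code below loads a small dataset: X contains 127 sentences, and Y contains one integer label from 0 to 4 per sentence, identifying the emoji for that sentence, as shown in the figure in the original post: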

X_train, Y_train = read_csv('data/train_emoji.csv')
X_test, Y_test = read_csv('data/test_emoji.csv')

maxLen = len(max(X_train, key=len).split())
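
The maxLen expression above picks the sentence with the most characters and then counts its words, which usually, but not always, gives the largest word count. The toy sentences below are invented for illustration and are not from train_emoji.csv; this is only a sketch of the difference:

```python
# Toy sentences, invented for illustration (not from train_emoji.csv).
sentences = ["extraordinarily unbelievable", "a b c d e"]

# Longest sentence by character count, then count its words (as in maxLen above).
by_chars = len(max(sentences, key=len).split())

# Largest word count over all sentences.
by_words = max(len(s.split()) for s in sentences)

print(by_chars, by_words)
```

If the value is later used as a padding length, the second form is the safer choice.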

Let's look at a specific example from the dataset:

index = 1
print(X_train[index], label_to_emoji(Y_train[index]))

I am proud of your achievements ὠ4

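Note: when this code was run in IDLE, some emoji came out garbled, as in the output above. The first suspicion was an encoding mismatch between the emoji library and IDLE, so the emoji were written out explicitly as Unicode escapes (code points taken from reference (1)). Four-digit "\u" escapes then displayed correctly, but five-digit ones did not. The explanation lies in Python's escape syntax: "\u" consumes exactly four hexadecimal digits, so "\u1F604" is parsed as U+1F60 (ὠ) followed by the character "4". Code points above U+FFFF need the eight-digit "\U" form, as used in the dictionary below. Whether the emoji then renders correctly still depends on the console and font.

emoji_dictionary = {"0": "\u2764",
                    "1": "\u26BE",
                    "2": "\U0001F604",
                    "3": "\U0001F61E",
                    "4": "\U0001F374"}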

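The behavior of Python's two Unicode escape forms can be checked directly. The short sketch below uses only built-in string handling; the variable names are made up for illustration:

```python
# "\u" reads exactly four hex digits; a fifth digit becomes a separate character.
four_digit = "\u1F604"
# "\U" reads exactly eight hex digits and can reach code points above U+FFFF.
eight_digit = "\U0001F604"

print(len(four_digit), [hex(ord(c)) for c in four_digit])
print(len(eight_digit), hex(ord(eight_digit)))
```

The first string has two characters (U+1F60 and the digit 4), while the second is the single code point U+1F604.
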
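1.2 Overview of Emojifier-V1

As the model diagram shows, the input to the model is the sequence of words in a sentence, and the output is a probability vector of shape (1, 5). The labels Y therefore need to be converted into a one-hot representation of shape (m, 5).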

Y_oh_train = convert_to_one_hot(Y_train, C=5)
Y_oh_test = convert_to_one_hot(Y_test, C=5)

index = 50
print(Y_train[index], "is converted into one hot", Y_oh_train[index])

0 is converted into one hot [1. 0. 0. 0. 0.]
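
For readers without emo_utils at hand, the conversion can be sketched with NumPy alone. The function to_one_hot below is a hypothetical stand-in, not the actual convert_to_one_hot helper, and the labels are made up:

```python
import numpy as np

def to_one_hot(labels, num_classes):
    """Hypothetical stand-in: row i is the one-hot vector for labels[i]."""
    labels = np.asarray(labels, dtype=int).reshape(-1)
    # Indexing the identity matrix by label selects the matching one-hot row.
    return np.eye(num_classes)[labels]

toy_labels = [0, 3, 4]  # made-up labels in the range 0-4
one_hot = to_one_hot(toy_labels, num_classes=5)
print(one_hot.shape)
print(one_hot[1])
```

Each row contains a single 1 at the position of its label, so m labels yield an (m, 5) matrix.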

1.3 Implementing Emojifier-V1

To represent each word of the input sentence as a vector, we use a pre-trained 50-dimensional GloVe word embedding.

word_to_index, index_to_word, word_to_vec_map = read_glove_vecs('data/glove.6B.50d.txt')

word = "cucumber"
index = 289846
print("the index of", word, "in the vocabulary is", word_to_index[word])
print("the", str(index) + "th word in the vocabulary is", index_to_word[index])

the index of cucumber in the vocabulary is 113317
the 289846th word in the vocabulary is potatos
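
The GloVe text file stores one word per line followed by its vector components. The sketch below shows how a loader of this kind could build the three mappings. load_vectors is a hypothetical name, the 3-dimensional sample data is made up, and the choice of sorted-order indices starting at 1 is an assumption rather than a statement about read_glove_vecs:

```python
import io
import numpy as np

def load_vectors(lines):
    """Hypothetical loader for GloVe-style text: a word, then its vector components."""
    word_to_vec = {}
    for line in lines:
        parts = line.strip().split()
        if not parts:
            continue  # skip blank lines
        word_to_vec[parts[0]] = np.array(parts[1:], dtype=np.float64)
    # Assumption: indices follow sorted word order and start at 1.
    word_to_idx = {w: i for i, w in enumerate(sorted(word_to_vec), start=1)}
    idx_to_word = {i: w for w, i in word_to_idx.items()}
    return word_to_idx, idx_to_word, word_to_vec

# Made-up 3-dimensional vectors standing in for glove.6B.50d.txt.
sample = io.StringIO("the 0.1 0.2 0.3\ncat 0.4 0.5 0.6\n")
w2i, i2w, w2v = load_vectors(sample)
print(w2i["cat"], i2w[2], w2v["the"])
```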

Next, we average the word vectors of the input sentence into a single vector, the avg vector in the model diagram.

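def sentence_to_avg(sentence, word_to_vec_map):
    # Lowercase the sentence and split it into words.
    words = sentence.lower().split()
    # Accumulator for the sum of the 50-dimensional GloVe vectors.
    avg = np.zeros((50,))
    for w in words:
        # Raises KeyError if w is not in the GloVe vocabulary.
        avg += word_to_vec_map[w]
    # Divide the sum by the number of words to get the mean vector.
    avg = avg / len(words)

    return avg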

avg = sentence_to_avg("Morrocan couscous is my favorite dish", word_to_vec_map)
print("avg = ", avg)

avg =  [-0.008005    0.56370833 -0.50427333  0.258865    0.55131103  0.03104983
 -0.21013718  0.16893933 -0.09590267  0.141784   -0.15708967  0.18525867
  0.6495785   0.38371117  0.21102167  0.11301667  0.02613967  0.26037767
  0.05820667 -0.01578167 -0.12078833 -0.02471267  0.4128455   0.5152061
  0.38756167 -0.898661
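
The sentence_to_avg function above fails on any word that is missing from the vocabulary and assumes 50 dimensions. A possible variation, shown here only as a sketch with made-up 2-dimensional vectors, skips unknown words and takes the dimension as a parameter. average_known_words is a hypothetical name and is not part of the exercise code:

```python
import numpy as np

def average_known_words(sentence, word_to_vec, dim):
    """Average the vectors of words found in word_to_vec; unknown words are skipped."""
    vectors = [word_to_vec[w] for w in sentence.lower().split() if w in word_to_vec]
    if not vectors:
        # No known words at all: fall back to the zero vector.
        return np.zeros(dim)
    return np.mean(vectors, axis=0)

# Made-up 2-dimensional embeddings for illustration.
toy_vectors = {"good": np.array([1.0, 3.0]), "food": np.array([3.0, 1.0])}
avg_toy = average_known_words("Good food zzz", toy_vectors, dim=2)
print(avg_toy)
```

Skipping unknown words changes the denominator of the average, so results can differ from the original function whenever a sentence contains out-of-vocabulary words.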
