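Andrew Ng deeplearning Lesson5 Week2: Natural Language Processing and Word Embeddings, Word Embedding Operations + Emojify (automatically adding emoji to text)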

pu扑朔迷离

Categories: Tensorflow, DeepLearning. Tags: Andrew Ng (吴恩达), deeplearning, keras, word embeddings

Article link: https://blog.csdn.net/bluehatihati/article/details/90770455


Word embeddings
Assignment 1: Word embedding operations
Assignment 2: Emojify (automatically adding emoji to text)

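Word embeddings

[Figure]
As the figure above shows, and as the name suggests, every word is represented as a vector. Instead of the rather rigid one-hot encoding, the representation of a word is made "soft": a dense vector of real-valued features.

The reason it is called an embedding: imagine a 300-dimensional space. Take a word such as orange. It corresponds to a 300-dimensional feature vector, so the word is embedded at one point in that 300-dimensional space, and apple is embedded at another point in the same space. That is the intuition behind the term.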
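
To make the contrast concrete, here is a minimal sketch comparing one-hot vectors with dense vectors. The three-dimensional vectors and their axis meanings are made up by hand for illustration; real embeddings are learned and have, for example, 300 dimensions.

```python
import numpy as np

vocab = ["orange", "apple", "king"]

# One-hot: one dimension per vocabulary word, a single 1 at the word's index.
one_hot = {w: np.eye(len(vocab))[i] for i, w in enumerate(vocab)}

# Dense toy embeddings (hypothetical axes: [fruit, royalty, edible]).
embedding = {
    "orange": np.array([0.9, 0.0, 0.8]),
    "apple":  np.array([0.8, 0.1, 0.9]),
    "king":   np.array([0.0, 0.9, 0.0]),
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Distinct one-hot vectors are always orthogonal, so they carry no similarity.
print(cosine(one_hot["orange"], one_hot["apple"]))
# Dense vectors can place related words close together.
print(cosine(embedding["orange"], embedding["apple"]))
print(cosine(embedding["orange"], embedding["king"]))
```

With one-hot vectors every pair of different words looks equally unrelated, while the dense vectors let orange sit closer to apple than to king.
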

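Using word embeddings

To summarize, these are the steps for transfer learning with word embeddings.

1. Learn word embeddings from a very large text corpus, or download pre-trained word embeddings. Many are available online, and they come with licenses.

2. Transfer the embeddings to your new task, which has only a small labeled training set, for example by representing your words with the 300-dimensional embeddings. One benefit is that a compact 300-dimensional vector replaces the original 10,000-dimensional one-hot vector.

3. When you train a model on the new task, say named entity recognition with only a small labeled dataset, you can choose whether to keep fine-tuning, that is, adjusting the word embeddings with the new data. (Usually you do not fine-tune.)
(In practice this is only done when the dataset in step 2 is large. If your labeled dataset is not very large, it is usually not worth the effort to fine-tune the embeddings.)

(Finally, there is an interesting relationship between word embeddings and face encodings.)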

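[Figure]

This result shows that the main difference between man and woman is gender, and that, according to the vector representation, the main difference between king and queen is also gender. That is why the two difference vectors come out (approximately) the same.
This is one of the most remarkable and influential results in the word-embedding field, and the idea has helped researchers build a deeper understanding of word embeddings.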
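
The analogy can be sketched in a few lines. The vectors below are hand-made toy values (hypothetical axes: gender, royalty, age), not real GloVe vectors, and complete_analogy is a name chosen here for illustration.

```python
import numpy as np

# Toy vectors; hypothetical axes: [gender, royalty, age].
vectors = {
    "man":   np.array([-1.0,  0.0, 0.5]),
    "woman": np.array([ 1.0,  0.0, 0.5]),
    "king":  np.array([-1.0,  1.0, 0.7]),
    "queen": np.array([ 1.0,  1.0, 0.7]),
    "apple": np.array([ 0.0, -0.5, 0.1]),
}

def cosine_similarity(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def complete_analogy(a, b, c, vectors):
    """Find w such that a is to b as c is to w.

    Picks the word whose vector is most similar to e_c + (e_b - e_a),
    skipping the three input words.
    """
    target = vectors[c] + vectors[b] - vectors[a]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine_similarity(vectors[w], target))

print(complete_analogy("man", "woman", "king", vectors))
```

In this toy setup e_man - e_woman equals e_king - e_queen exactly; with learned embeddings the two differences are only approximately equal, which is why the search uses a similarity measure instead of an exact match.
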

Embedding matrix

[Figure]
See the figure.
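
Since the figure is not reproduced here, a small sketch of the idea as presented in the course: the embedding matrix E has one column per vocabulary word, and multiplying E by a one-hot vector selects that word's column. The sizes below are toy values (the course example uses a 300 x 10,000 matrix).

```python
import numpy as np

vocab_size, emb_dim = 10, 4                   # toy sizes
rng = np.random.default_rng(0)
E = rng.normal(size=(emb_dim, vocab_size))    # one column per word

j = 6                                         # index of some word in the vocabulary
o_j = np.zeros(vocab_size)
o_j[j] = 1.0                                  # one-hot vector for word j

e_j = E @ o_j                                 # matrix-vector product picks out column j
print(np.allclose(e_j, E[:, j]))              # same values as a direct column lookup
```

In practice the multiplication is never carried out, because the one-hot vector is almost all zeros; frameworks look the column up by index instead, which is what an embedding layer does.
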

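Basic model for learning word embeddings

[Figure]
In practice it is more common to use a fixed history window. For example, you always predict the next word given the previous four words (item 1 in the figure above); the 4 here is a hyperparameter of the algorithm. This is how the model copes with very long or very short sentences: it always looks only at the previous 4 words, so it uses these 4 words (item 2) and ignores the earlier ones (item 3). With a 4-word history window, the network feeds a 1200-dimensional feature vector (4 words x 300 dimensions) into the hidden layer (item 4) and then predicts the output with a softmax.
There are many possible choices of context, but a fixed history window means you can handle sentences of any length, because the input dimension is always the same. The parameters of the model are the matrix E, where the same E is used for every word instead of a different matrix per position, together with the layer weights (item 5). You can use backpropagation with gradient descent to maximize the likelihood of the training set, repeatedly predicting the next word in the corpus from the 4 words that precede it.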
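
A rough sketch of the forward pass of such a fixed-window model, with untrained random weights. The vocabulary size and hidden size are toy values chosen here; only the window of 4 and the 300-dimensional embeddings come from the text above.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, emb_dim, window, hidden = 50, 300, 4, 16   # vocab/hidden sizes are toy values

E = rng.normal(scale=0.1, size=(emb_dim, vocab_size))  # shared embedding matrix, one column per word
W1 = rng.normal(scale=0.1, size=(window * emb_dim, hidden))
b1 = np.zeros(hidden)
W2 = rng.normal(scale=0.1, size=(hidden, vocab_size))
b2 = np.zeros(vocab_size)

def next_word_probs(context_ids):
    """Softmax distribution over the next word given the last `window` word indices."""
    x = E[:, context_ids].T.reshape(-1)   # 4 embeddings of size 300 -> one 1200-dim vector
    h = np.tanh(x @ W1 + b1)              # hidden layer
    logits = h @ W2 + b2
    exp = np.exp(logits - logits.max())   # numerically stable softmax
    return exp / exp.sum()

sentence = [3, 17, 8, 42, 5, 29, 11]          # word indices of a longer sentence
probs = next_word_probs(sentence[-window:])   # only the last 4 words are used
print(probs.shape, probs.sum())
```

However long the sentence is, the input to the hidden layer always has 4 x 300 = 1200 entries, and the same E is used for all four positions.
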

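Word2Vec

[Figure]
We construct a supervised learning problem: given a context word, predict a target word chosen at random within plus or minus 10 words (or plus or minus 5 words) of it. Clearly this is not an easy learning problem, because many different words can appear within 10 words of orange. But the goal of setting up this supervised problem is not to solve the problem itself; it is to use it to learn a good word-embedding model.
Model:
[Figure]
Drawback: the softmax (items 3 and 6 in the figure above) is slow to compute, because its denominator sums over every word in the vocabulary.
Solutions: a hierarchical softmax classifier, and negative sampling.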
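
A minimal sketch of how such context-target pairs could be generated. The function name skipgram_pairs and the choice of one target per context word are assumptions made here for illustration.

```python
import random

def skipgram_pairs(tokens, window=5, seed=0):
    """For each context word, sample one target uniformly from within +/- window positions."""
    rng = random.Random(seed)
    pairs = []
    for i, context in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        positions = [j for j in range(lo, hi) if j != i]   # neighbours, excluding the word itself
        if positions:
            pairs.append((context, tokens[rng.choice(positions)]))
    return pairs

tokens = "I want a glass of orange juice to go along with my cereal".split()
for context, target in skipgram_pairs(tokens, window=5)[:5]:
    print(context, "->", target)
```

Each pair is one training example: the context word is the input and the sampled neighbour is the label.
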
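
For reference, the skip-gram softmax as presented in the course, where e_c is the embedding of the context word, \theta_t is the output parameter vector of target word t, and V is the vocabulary size:

```latex
p(t \mid c) = \frac{\exp\left(\theta_t^{\top} e_c\right)}{\sum_{j=1}^{V} \exp\left(\theta_j^{\top} e_c\right)}
```

The sum in the denominator runs over all V words for every training example, which is where the cost comes from.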

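Hierarchical classifier: uses the idea of a decision tree, as in the figure below.
[Figure]
Negative sampling: this algorithm constructs a new supervised learning problem. Given a pair of words, such as orange and juice, predict whether it is a context-target pair.
To build the data, we sample a context word and a target word, here orange and juice, and label the pair 1: it is a positive example. A pair such as orange and king is a negative example, and we label it 0.

Meanwhile, if you want to make progress on an NLP problem, downloading someone else's word vectors and improving on them is also a good approach.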
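
A small sketch of building the labelled examples for one positive pair. The helper name, the toy vocabulary, k=4, and uniform sampling that excludes the context and target words are simplifying assumptions made here.

```python
import random

def negative_sampling_examples(context, target, vocab, k=4, seed=0):
    """Return one positive (label 1) and k negative (label 0) examples for a context word."""
    rng = random.Random(seed)
    examples = [(context, target, 1)]                        # the observed pair
    candidates = [w for w in vocab if w not in (context, target)]
    for word in rng.sample(candidates, k):                   # k random words from the vocabulary
        examples.append((context, word, 0))
    return examples

vocab = ["orange", "juice", "king", "book", "the", "of", "a", "glass"]
for example in negative_sampling_examples("orange", "juice", vocab, k=4):
    print(example)
```

Each example then trains a binary (sigmoid) classifier, so one update touches k + 1 output units instead of a softmax over the whole vocabulary.
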

GloVe word vectors

GloVe stands for global vectors for word representation.
[Figure]
See the paper for details.
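
As a reminder, the GloVe objective in the notation used in the course, where X_{ij} counts how often word i appears in the context of word j, and f is a weighting function with f(0) = 0 so that pairs that never co-occur are skipped:

```latex
\min_{\theta,\, e,\, b,\, b'} \; \sum_{i=1}^{V} \sum_{j=1}^{V} f(X_{ij}) \left( \theta_i^{\top} e_j + b_i + b'_j - \log X_{ij} \right)^2
```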

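Sentiment analysis

[Figure]
The task is shown above; the model is as follows.

Once you have learned word embeddings, or downloaded them, you can quickly build a quite effective NLP system.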
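
The model figure is not reproduced here. One simple model of this kind, sketched under the assumption that it matches the one in the figure, averages the word vectors of a sentence and feeds the average to a softmax classifier. The word vectors and weights below are random stand-ins, and the 5 output classes are an assumed example (such as star ratings).

```python
import numpy as np

rng = np.random.default_rng(0)
emb_dim, n_classes = 50, 5

# Stand-in for pre-trained vectors; in practice these would be loaded, not random.
word_to_vec = {w: rng.normal(size=emb_dim) for w in ["the", "dessert", "is", "excellent"]}

W = rng.normal(scale=0.1, size=(n_classes, emb_dim))   # untrained softmax weights
b = np.zeros(n_classes)

def predict_proba(sentence):
    """Average the word vectors of the sentence, then apply a softmax layer."""
    words = sentence.lower().split()
    avg = np.mean([word_to_vec[w] for w in words], axis=0)
    z = W @ avg + b
    exp = np.exp(z - z.max())
    return exp / exp.sum()

print(predict_proba("The dessert is excellent"))
```

Averaging ignores word order, so a sentence like "not good" is hard for this model; a sequence model such as an LSTM addresses that.
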

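Debiasing

Removing bias (for example gender bias) works as follows:
[Figure]

Words that are not inherently gendered (for example programmer)

[Figure]

Take 50-dimensional vectors, for example, and apply PCA so that the x-axis is the bias axis (say positive is male and negative is female). Removing the bias from a target word vector then means projecting it onto the axes orthogonal to the bias axis, so that its coordinate along the bias axis is 0.

Words that are inherently gendered (for example grandfather and grandmother)

Equalize each such pair so that the two words differ only along the gender axis and sit at the same distance from the axis orthogonal to it, as in the figure below.
[Figure]
Method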
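
The neutralizing step can be sketched as a projection. The bias direction g and the word vector are toy values chosen here; in the assignment g is computed from real word vectors.

```python
import numpy as np

def neutralize(e, g):
    """Remove from e its component along the bias direction g."""
    e_bias = (e @ g) / (g @ g) * g    # projection of e onto g
    return e - e_bias                 # what remains is orthogonal to g

g = np.array([1.0, 0.0, 0.0])              # toy bias axis
e_programmer = np.array([0.3, 0.5, -0.2])  # toy vector with a non-zero bias component
e_debiased = neutralize(e_programmer, g)

print(e_debiased, e_debiased @ g)
```

After the projection the vector has no component along g, so its similarity to the bias direction is zero while the other coordinates are unchanged.
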

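Assignment 1: Word embedding operations

The main problem I ran into was that the word-embedding vectors were wrong, so I downloaded a new set from the internet.

Assignment 2: Emojify (automatically adding emoji to text)

The training-set and test-set CSV files for this assignment were both wrong, so I downloaded them again from the internet.

Emojify V2

Model diagram:
[Figure]

Keras Embedding layer

Embedding model diagram:
[Figure]
Look up the index of each word in the vocabulary and pad with zeros (this step is done beforehand).
https://keras.io/layers/embeddings/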

Embedding(vocab_len, emb_dim, weights=[emb_matrix], trainable = False)

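vocab_len is the input dimension (the size of the vocabulary, here 400001); emb_dim is the output dimension (the length of each word vector, here 50).
weights: the pre-trained embedding matrix.
What it does: the embedding layer turns positive integers (indices) into dense vectors of fixed size, e.g. [[4],[20]] -> [[0.25,0.1],[0.6,-0.2]]
Index 4 -> the embedding vector of word 4, [0.25,0.1]
Index 20 -> the embedding vector of word 20, [0.6,-0.2]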
https://blog.csdn.net/jiangpeng59/article/details/77533309

https://blog.csdn.net/yyhhlancelot/article/details/86534793
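
A NumPy-only sketch of the two steps described above: turning sentences into zero-padded index arrays, then looking up the vectors. The toy vocabulary and the 6 x 2 matrix are made up for illustration; the assignment uses 400,001 rows of 50-dimensional GloVe vectors.

```python
import numpy as np

def sentences_to_indices(sentences, word_to_index, max_len):
    """Map each sentence to a row of word indices, padded with zeros up to max_len."""
    indices = np.zeros((len(sentences), max_len), dtype="int32")
    for i, sentence in enumerate(sentences):
        for j, word in enumerate(sentence.lower().split()[:max_len]):
            indices[i, j] = word_to_index[word]
    return indices

# Toy vocabulary; index 0 is left free so that it can serve as padding.
word_to_index = {"i": 1, "love": 2, "you": 3, "funny": 4, "lol": 5}
emb_matrix = np.arange(6 * 2, dtype=float).reshape(6, 2)   # toy (vocab_len, emb_dim) matrix

X = sentences_to_indices(["I love you", "funny lol"], word_to_index, max_len=4)
embedded = emb_matrix[X]     # what an embedding layer computes: one row lookup per index

print(X)
print(embedded.shape)        # (number of sentences, max_len, emb_dim)
```

The lookup adds one trailing dimension: a batch of shape (m, max_len) becomes (m, max_len, emb_dim).
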

The assignment source code follows.

import numpy as np
from keras.layers import Embedding

def pretrained_embedding_layer(word_to_vec_map, word_to_index):
    """
    Creates a Keras Embedding() layer and loads in pre-trained GloVe 50-dimensional vectors.
    
    Arguments:
    word_to_vec_map -- dictionary mapping words to their GloVe vector representation.
    word_to_index -- dictionary mapping from words to their indices in the vocabulary (400,001 words)

    Returns:
    embedding_layer -- pretrained layer Keras instance
    """
    
    vocab_len = len(word_to_index) + 1                  # adding 1 to fit Keras embedding (requirement) =400001
    
    emb_dim = word_to_vec_map["cucumber"].shape[0]      # define dimensionality of your GloVe word vectors (= 50)
    
    ### START CODE HERE ###
    # Initialize the embedding matrix as a numpy array of zeros of shape (vocab_len, dimensions of word vectors = emb_dim)
    emb_matrix = np.zeros((vocab_len,emb_dim))
    
    # Set each row "index" of the embedding matrix to be the word vector representation of the "index"th word of the vocabulary
    for word, index in word_to_index.items():
        emb_matrix[index, :] = word_to_vec_map[word]

    # Define Keras embedding layer with the correct input/output sizes and make it non-trainable (trainable=False). Use Embedding(...).
    embedding_layer = Embedding(vocab_len, emb_dim, weights=[emb_matrix], trainable = False)
    # Input dimension: vocabulary size (here 400001). Output dimension: length of each word vector (here 50).
    ### END CODE HERE ###

    # Build the embedding layer, it is required before setting the weights of the embedding layer. Do not modify the "None".
    embedding_layer.build((None,))
    
    # Set the weights of the embedding layer to the embedding matrix. Your layer is now pretrained.
    embedding_layer.set_weights([emb_matrix])
    
    return embedding_layer

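Keras LSTM layer

[Figure]
To build the network in the figure above, the output of each LSTM layer has to be configured.
return_sequences=True makes the layer output one hidden state per time step, so the output has as many time steps as the input (the output is a sequence). return_sequences=False returns only the hidden state of the last time step, which corresponds to the single output at the top of the figure above; it is used only on the last LSTM layer.

X = LSTM(128, return_sequences=True)(embeddings)
X = LSTM(128, return_sequences=False)(X)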
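
To see what the flag changes in terms of shapes, here is a NumPy stand-in. It is a plain tanh recurrence with random weights, not Keras's LSTM, and simple_rnn is a hypothetical helper written for this note; only the shape behaviour is meant to carry over.

```python
import numpy as np

def simple_rnn(x, units, return_sequences, seed=0):
    """Plain tanh recurrence over x of shape (batch, timesteps, features)."""
    rng = np.random.default_rng(seed)
    batch, timesteps, features = x.shape
    Wx = rng.normal(scale=0.1, size=(features, units))
    Wh = rng.normal(scale=0.1, size=(units, units))
    h = np.zeros((batch, units))
    outputs = []
    for t in range(timesteps):
        h = np.tanh(x[:, t, :] @ Wx + h @ Wh)   # new hidden state at step t
        outputs.append(h)
    # Either every hidden state (a sequence) or only the last one.
    return np.stack(outputs, axis=1) if return_sequences else h

embeddings = np.zeros((32, 10, 50))                          # (batch, max_len, emb_dim)
seq = simple_rnn(embeddings, 128, return_sequences=True)     # (32, 10, 128)
last = simple_rnn(seq, 128, return_sequences=False)          # (32, 128)
print(seq.shape, last.shape)
```

The first layer must return a sequence because the second recurrent layer expects a 3-D input; the second returns a single vector per sentence, which is what the final Dense layer needs.
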

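Source code:

from keras.layers import Input, LSTM, Dropout, Dense, Activation
from keras.models import Model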

def Emojify_V2(input_shape, word_to_vec_map, word_to_index):
    """
    Function creating the Emojify-v2 model's graph.
    
    Arguments:
    input_shape -- shape of the input, usually (max_len,)
    word_to_vec_map -- dictionary mapping every word in a vocabulary into its 50-dimensional vector representation
    word_to_index -- dictionary mapping from words to their indices in the vocabulary (400,001 words)

    Returns:
    model -- a model instance in Keras
    """
    
    ### START CODE HERE ###
    # Define sentence_indices as the input of the graph, it should be of shape input_shape and dtype 'int32' (as it contains indices).
    sentence_indices = Input(input_shape,dtype="int32")
    
    # Create the embedding layer pretrained with GloVe Vectors (≈1 line)
    embedding_layer = pretrained_embedding_layer(word_to_vec_map, word_to_index)
    
    # Propagate sentence_indices through your embedding layer, you get back the embeddings
    embeddings = embedding_layer(sentence_indices) 
    
    # Propagate the embeddings through an LSTM layer with 128-dimensional hidden state
    # Be careful, the returned output should be a batch of sequences.
    X = LSTM(128,return_sequences=True)(embeddings)
    # Add dropout with a probability of 0.5
    X = Dropout(0.5)(X)
    # Propagate X through another LSTM layer with 128-dimensional hidden state
    # Be careful, the returned output should be a single hidden state, not a batch of sequences.
    X = LSTM(128, return_sequences=False)(X)
    # Add dropout with a probability of 0.5
    X = Dropout(0.5)(X)
    # Propagate X through a Dense layer (no activation here) to get back a batch of 5-dimensional vectors.
    # The softmax is applied once, in the next step; applying it in both places would squash the outputs twice.
    X = Dense(5)(X)
    # Add a softmax activation
    X = Activation('softmax')(X)
    
    # Create Model instance which converts sentence_indices into X.
    model = Model(inputs = sentence_indices, outputs = X, name='Emojify_V2')
    
    ### END CODE HERE ###
    
    return model