nlp-tutorial code annotations 1-2: a brief introduction to word vectors, Word2Vec, and Skip-gram

yqy2001

Article link: https://blog.csdn.net/yqy2001/article/details/104661201

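Series note: this series annotates the code of nlp-tutorial, whose original project is hosted on GitHub under the name nlp-tutorial. Each article in the series consists of an introduction to the relevant concepts followed by detailed code annotations.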

one-hot

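In traditional natural language processing, words are usually represented as one-hot vectors. A one-hot vector is a sparse vector with a single 1 and 0 everywhere else, and its dimension equals the number of words in the vocabulary. For example, with the vocabulary [cat, dog, fish], the word dog is represented as [0, 1, 0].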

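One-hot vectors have several drawbacks:
1. Any two distinct one-hot vectors are orthogonal, so they cannot express similarity between words, and what the model learns about one word generalizes poorly to others;
2. When the vocabulary is large, one-hot vectors have a very high dimension, which brings the curse of dimensionality and makes the model harder to train.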
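The two drawbacks above can be checked on a small case. The following is a minimal sketch in plain Python; the four-word vocabulary and the helper names `one_hot` and `dot` are hypothetical, chosen only for illustration.

```python
def one_hot(index, size):
    """Return a list of length `size` with a 1 at `index` and 0 elsewhere."""
    vec = [0] * size
    vec[index] = 1
    return vec


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


# Hypothetical toy vocabulary, chosen only for illustration.
vocab = ["cat", "dog", "fish", "milk"]
word_to_index = {w: i for i, w in enumerate(vocab)}
vectors = {w: one_hot(i, len(vocab)) for w, i in word_to_index.items()}

print(vectors["dog"])                       # [0, 1, 0, 0]
print(dot(vectors["cat"], vectors["dog"]))  # 0: distinct words are orthogonal
print(dot(vectors["cat"], vectors["cat"]))  # 1: a word only matches itself
```

Every vector is as long as the vocabulary, and the dot product between any two different words is 0, so "cat" looks exactly as unrelated to "dog" as it does to "milk".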

Word vectors

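A word vector is a dense vector that describes the features of a word. It represents relationships between words efficiently and has far fewer dimensions than a one-hot vector. The Word2Vec algorithm for training word vectors is introduced next.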
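To illustrate how dense vectors can express relationships between words, here is a small sketch that compares words by cosine similarity. The 2-dimensional vectors below are hand-picked, hypothetical values, not trained embeddings, and `cosine_similarity` is a helper written for this example.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hand-picked, hypothetical 2-dimensional vectors (not trained values).
embeddings = {
    "cat":   [0.9, 0.8],
    "dog":   [0.8, 0.9],
    "apple": [-0.7, 0.6],
}

sim_cat_dog = cosine_similarity(embeddings["cat"], embeddings["dog"])
sim_cat_apple = cosine_similarity(embeddings["cat"], embeddings["apple"])
print(sim_cat_dog > sim_cat_apple)  # True
```

Unlike one-hot vectors, two dense vectors can point in nearly the same direction, so a similarity score can separate related words from unrelated ones.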

Word2Vec and Skip-gram

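In Word2Vec, given a word, the task is to predict a target word chosen at random from within a fixed distance to its left and right.
The learning procedure is described below; this model is called the Skip-gram model:
1. First, extract (center word, target word) pairs from the training set. The target word can be any random word near the center word.
For example, suppose the training set contains the sentence "I want a glass of orange juice to go along with my cereal." Pick any word as the center word (also called the context word c), say orange, then choose target words within a window to its left and right. Possible picks include juice, glass, and my.
2. Use the context word c as the input $x$ and the target word t as the corresponding output $y$;
3. Look up the word's embedding vector in the embedding matrix, then feed that vector into a softmax unit to obtain a probability distribution over the predicted word;
4. Compute the loss and update the parameters. The loss is the cross-entropy between the predicted distribution $\hat{y}$ and the one-hot target $y$: $L(\hat{y}, y) = -\sum_{i} y_i \log \hat{y}_i$.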
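Steps 3 and 4 can be sketched numerically for a single (context, target) pair. This is a minimal illustration in plain Python rather than the tutorial's PyTorch code. The scores are hypothetical numbers standing in for the output of the embedding and weight layers, and `softmax` and `cross_entropy` are helpers written for this example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, target_index):
    """Cross-entropy against a one-hot target reduces to -log p[target]."""
    return -math.log(probs[target_index])


# Hypothetical scores over a 4-word vocabulary for one context word c.
scores = [2.0, 1.0, 0.1, -1.0]
probs = softmax(scores)

loss_likely = cross_entropy(probs, 0)    # target is the highest-scoring word
loss_unlikely = cross_entropy(probs, 3)  # target is the lowest-scoring word
print(loss_likely < loss_unlikely)       # True
```

Because the target is one-hot, only the probability assigned to the true target word contributes to the loss, so training pushes that probability up.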

Code implementation

The PyTorch code with detailed annotations follows:
(The source code comes from the nlp-tutorial project on GitHub; project page: nlp-tutorial)

import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from torch.autograd import Variable    # Variable is deprecated since PyTorch 0.4; plain tensors work in its place
import matplotlib.pyplot as plt        # plotting

dtype = torch.FloatTensor

# Toy corpus of short sentences
sentences = [ "i like dog", "i like cat", "i like animal",
              "dog cat animal", "apple cat dog like", "dog fish milk like",
              "dog cat eyes like", "i like apple", "apple i hate",
              "apple i movie book music like", "cat dog hate", "cat dog like"]     # dataset
word_sequence = " ".join(sentences).split()          # join all sentences with spaces, then split on whitespace to get one flat word sequence
word_list = " ".join(sentences).split()              # same as above
word_list = list(set(word_list))                     # set() removes duplicate words, list() turns the result into the vocabulary list
word_dict = {w: i for i, w in enumerate(word_list)}  # mapping from word to index

# Word2Vec Parameter
batch_size = 20                             # number of [center word, context word] pairs per training step
embedding_size = 2                          # dimension of the word vectors
voc_size = len(word_list)                   # number of words in the vocabulary

def random_batch(data, size): # data is the list of [center word, context word] pairs; size is the batch size
    random_inputs = []        # empty lists for the inputs and their labels
    random_labels = []
    random_index = np.random.choice(range(len(data)), size, replace=False)   # sample `size` distinct indices into the pair list
    
    for i in random_index:
        random_inputs.append(np.eye(voc_size)[data[i][0]])  # input: one-hot vector of the sampled center word
        random_labels.append(data[i][1])                    # label: index of a context word of that center word
    
    return random_inputs, random_labels
    
# Make skip-grams with a window size of one
skip_grams = []                                        # list of [center word, context word] pairs built from the dataset
for i in range(1, len(word_sequence) - 1):             # every word except the first and last of the flat sequence
    target = word_dict[word_sequence[i]]               # center word
    context = [word_dict[word_sequence[i - 1]], word_dict[word_sequence[i + 1]]] # context: the immediate left and right neighbours
    
    for w in context:                                  # add each [center word, context word] pair to skip_grams
        skip_grams.append([target, w])            
        
# Model
class Word2Vec(nn.Module):
    def __init__(self):
        super(Word2Vec, self).__init__()               # initialize the nn.Module parent class
        # W and WT are independent parameters; WT is not the transpose of W
        self.W = nn.Parameter(-2 * torch.rand(voc_size, embedding_size) + 1)   # word-vector matrix to be learned, initialized uniformly in (-1, 1]
        self.WT = nn.Parameter(-2 * torch.rand(embedding_size, voc_size) + 1)  # output weight matrix, same initialization
        
    def forward(self, X):
        # X holds one-hot vectors, shape: [batch_size, voc_size]
        hidden_layer = torch.matmul(X, self.W)             # converts each one-hot input into its word vector, shape: [batch_size, embedding_size]
        output_layer = torch.matmul(hidden_layer, self.WT) # scores for each candidate context word, shape: [batch_size, voc_size]
        return output_layer
        
model = Word2Vec()

criterion = nn.CrossEntropyLoss()                     # cross-entropy loss (applies log-softmax to the raw scores internally)
optimizer = optim.Adam(model.parameters(), lr=0.001)  # Adam optimizer

# Training
for epoch in range(5000):
    input_batch, target_batch = random_batch(skip_grams, batch_size)  # sample a random batch of inputs and matching labels
    
    # Variable wrappers are unnecessary since PyTorch 0.4, so build tensors directly
    input_batch = torch.tensor(np.array(input_batch), dtype=torch.float32)  # [batch_size, voc_size]
    target_batch = torch.tensor(target_batch, dtype=torch.long)             # [batch_size]
    
    optimizer.zero_grad()                                     # clear accumulated gradients before each step
    output = model(input_batch)                               # forward pass on input_batch
    
    # output : [batch_size, voc_size], target_batch : [batch_size] (LongTensor, not one-hot)
    loss = criterion(output, target_batch)                    # compute the loss
    if (epoch + 1)%1000 == 0:                                 # print the loss every 1000 steps
        print('Epoch:', '%04d' % (epoch + 1), 'cost =', '{:.6f}'.format(loss.item()))
        
    loss.backward()                # backpropagation
    optimizer.step()               # parameter update
    
W, WT = model.parameters()                       # parameters come back in registration order: W, then WT
for i, label in enumerate(word_list):            # plot each word at its 2-D word vector
    x, y = float(W[i][0]), float(W[i][1])
    plt.scatter(x, y)
    plt.annotate(label, xy=(x, y), xytext=(5, 2), textcoords='offset points', ha='right', va='bottom')
plt.show()
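One detail of the forward pass is worth spelling out: multiplying a one-hot vector by W simply selects one row of W, which is why W can be read as the table of word vectors. The sketch below checks this in plain Python. The 3×2 matrix values and the helper `matmul_row` are hypothetical and are not part of the tutorial code.

```python
def matmul_row(x, W):
    """Multiply a row vector x (length V) by a matrix W (V rows, D columns)."""
    return [sum(x[i] * W[i][j] for i in range(len(W))) for j in range(len(W[0]))]


# Hypothetical 3-word vocabulary with 2-dimensional word vectors.
W = [[0.1, 0.2],
     [0.3, 0.4],
     [0.5, 0.6]]
x = [0, 1, 0]  # one-hot vector for the word with index 1

hidden = matmul_row(x, W)
print(hidden == W[1])  # True: the product is just row 1 of W
```

For large vocabularies the multiplication is usually replaced by a direct row lookup, which is what PyTorch's `nn.Embedding` layer provides. The tutorial keeps the explicit matrix product, presumably to make the mechanics visible.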