word2vec: Word Vector Models

平平无奇的小天才而已

Article link: https://blog.csdn.net/weixin_45879692/article/details/129670826

Paper - Distributed Representations of Words and Phrases and their Compositionality(2013)

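Word Vectors

Word vectors turn text into numbers: each word is described by a numeric vector that captures its features. In general, a higher-dimensional vector can encode more information, which tends to make downstream computations more reliable, although it also means more parameters to train.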

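CBOW (Continuous Bag of Words)

CBOW (the continuous bag-of-words model) predicts the center word given its known context. Compared with NNLM (the neural network language model), it removes the non-linear hidden layer and replaces the concatenation of the input vectors with their sum.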

Example:

Take the sentence: Apples are red and sweet

Input: the word vectors of Apples, are, and, sweet

Desired output: red

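CBOW algorithm:

Input layer: a one-hot tensor of shape C×V, where C is the number of context words (usually an even number; we assume 4) and V is the vocabulary size (we assume 5000). Each row of the tensor is the one-hot vector of one context word, for example "Apples", "are", "and", "sweet".
Hidden layer: a parameter tensor W1 of shape V×N, usually called the word embedding, where N is the length of each word vector (we assume 128). This layer is a linear projection with no activation function. Multiplying the input tensor by W1 gives a tensor of shape C×N. Because the center word is inferred from all context words together, the C rows are summed into a single 1×N vector, which serves as a hidden representation of the whole context.
Output layer: another parameter tensor of shape N×V. Multiplying the 1×N hidden vector by it gives a 1×V vector that holds a score for every candidate center word given the context. Normalizing these scores with softmax yields the predicted probability distribution over the center word.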
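
The three layers above can be sketched in a few lines of NumPy. This is only a shape walk-through under assumptions: the sizes C=4, V=5000, N=128 come from the text, while the context word ids and the random initial weights are made up for illustration and nothing is trained.

```python
import numpy as np

rng = np.random.default_rng(0)

V, N, C = 5000, 128, 4            # vocabulary size, embedding size, context words (from the text)
context_ids = [10, 42, 7, 256]    # hypothetical ids for "Apples", "are", "and", "sweet"

W1 = rng.normal(scale=0.01, size=(V, N))  # input word-embedding matrix, V x N
W2 = rng.normal(scale=0.01, size=(N, V))  # output parameter matrix, N x V

# Input layer: C x V one-hot tensor, one row per context word.
X = np.zeros((C, V))
X[np.arange(C), context_ids] = 1.0

# Hidden layer: (C x V) @ (V x N) -> C x N, then sum the C rows -> 1 x N.
hidden = (X @ W1).sum(axis=0, keepdims=True)

# Output layer: (1 x N) @ (N x V) -> 1 x V scores, normalized with softmax.
scores = hidden @ W2
scores = scores - scores.max()    # subtract the max for numerical stability
probs = np.exp(scores) / np.exp(scores).sum()

print(X.shape, hidden.shape, probs.shape)  # (4, 5000) (1, 128) (1, 5000)
```

Multiplying a one-hot row by W1 simply selects one row of W1, so the hidden vector equals the sum of the four selected embedding rows.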

Skip-Gram

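Skip-gram reverses the inputs and outputs of CBOW: given the center word, it predicts the context words.

The core idea of skip-gram (SG) is, for each chosen center word, to predict as accurately as possible the probability distribution of the words that may appear around it. Concretely, SG first randomly initializes a vector for every word, then predicts the probability of each neighboring word, and finally maximizes the probability of the neighbors that are actually observed.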
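
Written as a formula, this objective can be stated as follows, using the notation of the paper cited at the top. The symbols are assumptions of this sketch rather than names used elsewhere in this post: T is the number of words in the training sequence, c is the window radius, v is the input (center) vector of a word, v' is its output (context) vector, and V is the vocabulary size.

```latex
\max \; \frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-c \le j \le c \\ j \ne 0}} \log p(w_{t+j} \mid w_t),
\qquad
p(w_O \mid w_I) = \frac{\exp\left( {v'_{w_O}}^{\top} v_{w_I} \right)}{\sum_{w=1}^{V} \exp\left( {v'_{w}}^{\top} v_{w_I} \right)}
```

The denominator sums over the whole vocabulary, which is what makes the full softmax expensive for large V.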

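In natural language processing, a word vector (word embedding) is a way of representing a word: each word becomes a point in an N-dimensional space, that is, a vector in a high-dimensional space. This turns computation over natural language into computation over vectors.

In the word vector task shown in the figure below, each word (such as queen or king) is first mapped to a vector in a high-dimensional space, and these vectors capture, to some extent, the semantics of the words. Computing the distance between the vectors then gives a measure of how related the words are, so that a computer can compute over natural language the way it computes over numbers.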
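
One common way to measure how close two word vectors are is cosine similarity. The sketch below assumes that measure (the text only says "distance") and uses made-up 3-dimensional vectors purely for illustration; they are not trained embeddings.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1 means same direction."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up vectors, for illustration only.
vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.8, 0.9, 0.1],
    "apple": [0.1, 0.0, 0.9],
}

sim_king_queen = cosine_similarity(vectors["king"], vectors["queen"])
sim_king_apple = cosine_similarity(vectors["king"], vectors["apple"])
print(sim_king_queen > sim_king_apple)  # True
```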

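Words are discrete symbols, such as "I", "love", and "artificial intelligence". How do we turn each discrete word into a vector? Typically we maintain a lookup table like the one in the figure below. Each row stores the vector of one particular word, and the first element of each row is the word itself, which lets us map between words and vectors (for example, "I" maps to [0.3, 0.5, 0.7, 0.9, -0.2, 0.03]). Given any word or group of words, we can look them up in this table and replace them with their vectors; this look-up-and-replace process is called Embedding Lookup.

The same process can be implemented with a dictionary. If computational efficiency were not a concern, a dictionary would be a good choice. Neural network computation, however, is expensive and usually relies on specialized hardware (such as GPUs) to train fast enough. Computation on a GPU is organized around tensors, so in practice the Embedding Lookup has to be expressed as a tensor operation, as shown in the figure below.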
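
A minimal sketch of the dictionary version is shown here. Only the vector for "I" comes from the text; the other vectors and the helper name `embedding_lookup` are made up for illustration.

```python
# Toy lookup table. Only the vector for "I" is taken from the text;
# the remaining entries are hypothetical.
embedding_table = {
    "I": [0.3, 0.5, 0.7, 0.9, -0.2, 0.03],
    "love": [0.1, -0.4, 0.2, 0.6, 0.0, 0.5],
    "AI": [-0.3, 0.8, 0.1, -0.1, 0.4, 0.2],
}

def embedding_lookup(words, table):
    """Replace each word with its vector; raises KeyError for unknown words."""
    return [table[word] for word in words]

vectors = embedding_lookup(["I", "love", "AI"], embedding_table)
print(len(vectors), len(vectors[0]))  # 3 6
```

This works well on a CPU for small vocabularies, but it performs one Python-level lookup per word, which is the part a tensor formulation replaces.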

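For the sentence "I, love, artificial, intelligence", Embedding Lookup is turned into a tensor computation as follows:

Look each word up in a dictionary to convert it to an ID (usually an integer greater than or equal to 0). The word-to-ID mapping can be defined as needed (in the figure above, I => 1, artificial => 2, love => 3, …).
Convert each ID into a fixed-length vector. If the vocabulary contains 5000 words, the word "I" is represented by a 5000-dimensional vector. Because the ID of "I" is 1, the first element of this vector is 1 and all the others are 0 ([1, 0, 0, …, 0]); likewise, for "artificial" the second element is 1 and the rest are 0. Each word is now represented by a vector. Since exactly one element of each vector is 1 and the rest are 0, this process is called One-Hot Encoding.
After One-Hot Encoding, the sentence "I, love, artificial, intelligence" becomes a tensor of shape 4×5000, denoted V. It has 4 rows and 5000 columns; from top to bottom, the rows are the one-hot encodings of "I", "love", "artificial", and "intelligence". Finally, V is multiplied by a dense tensor W of shape 5000×128 (5000 is the vocabulary size and 128 is the size of each word vector). The product is a 4×128 tensor, which completes the conversion of words to vectors.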
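
The steps above can be checked numerically. In this sketch the shapes 4×5000 and 5000×128 come from the text; the ids are 0-based and partly hypothetical (the text uses 1-based ids and does not give one for "intelligence"), and W is filled with random values.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim = 5000, 128

# Dense embedding tensor W, 5000 x 128 (random values for illustration).
W = rng.normal(size=(vocab_size, embed_dim))

# Hypothetical 0-based ids for "I", "love", "artificial", "intelligence".
ids = np.array([0, 2, 1, 3])

# One-hot tensor V, 4 x 5000: row r has a single 1 at column ids[r].
V = np.zeros((len(ids), vocab_size))
V[np.arange(len(ids)), ids] = 1.0

# Embedding Lookup as a tensor product: (4 x 5000) @ (5000 x 128) -> 4 x 128.
E = V @ W

print(E.shape)                  # (4, 128)
print(np.allclose(E, W[ids]))   # True
```

The second check shows that the matrix product is equivalent to picking rows of W by id, which is why frameworks usually implement the lookup as an indexing operation instead of a real multiplication.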

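In the figure, love is the target (center) word and the other words are its context; the task is to compute the probabilities of those other words.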

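skip-gram algorithm:

Input Layer: takes a one-hot tensor V as the network input, holding the one-hot representation of the center word of the current sentence.
Hidden Layer: multiplies V by a word embedding tensor and uses the result as the hidden output, a tensor of shape [batch_size, embedding_size] that holds the word vector of the center word.
Output Layer: multiplies the hidden output by another word embedding tensor, giving a tensor of shape [batch_size, vocab_size]. After a softmax, this tensor is the prediction of the context given the current center word. The word vector model is trained on this softmax output.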
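
A shape-level sketch of these three layers in NumPy follows. The sizes, the center word ids, and the names `W_in` and `W_out` are assumptions chosen for illustration; the weights are random and nothing is trained here.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax with the usual max subtraction for stability."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
vocab_size, embed_dim, batch_size = 8, 2, 3      # small made-up sizes

W_in = rng.normal(size=(vocab_size, embed_dim))   # first word embedding tensor
W_out = rng.normal(size=(embed_dim, vocab_size))  # second word embedding tensor

# Input layer: one-hot rows for a batch of hypothetical center word ids.
center_ids = np.array([1, 4, 6])
X = np.zeros((batch_size, vocab_size))
X[np.arange(batch_size), center_ids] = 1.0

# Hidden layer: [batch_size, vocab_size] @ [vocab_size, embed_dim].
hidden = X @ W_in

# Output layer: scores over the vocabulary, then softmax.
probs = softmax(hidden @ W_out)

print(hidden.shape, probs.shape)  # (3, 2) (3, 8)
```

Each row of `probs` is a distribution over the vocabulary, that is, the predicted probability of every word being a context word of the corresponding center word.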

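In practice, a sliding window (usually of odd length) scans the current sentence from left to right. Each scanned fragment is treated as a short sentence: the word in the middle is the center word, and the remaining words are its context.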
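
The sliding window can be sketched in plain Python. The function name `skip_gram_pairs` and the choice to skip positions where a full window does not fit are assumptions of this sketch, not something the text specifies.

```python
def skip_gram_pairs(tokens, window_size=3):
    """Return (center, context) pairs from every full window of odd length."""
    if window_size % 2 == 0:
        raise ValueError("window_size must be odd so that a middle word exists")
    half = window_size // 2
    pairs = []
    # Slide the window from left to right over the sentence.
    for start in range(len(tokens) - window_size + 1):
        window = tokens[start:start + window_size]
        center = window[half]
        for position, word in enumerate(window):
            if position != half:          # every non-middle word is context
                pairs.append((center, word))
    return pairs

pairs = skip_gram_pairs("Apples are red and sweet".split(), window_size=3)
for pair in pairs:
    print(pair)
```

With a window of length 3 the sentence yields three windows and six pairs, with "are", "red", and "and" each serving once as the center word.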

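The CBOW and Skip-Gram models

Skip-Gram Code Implementation

The implementation generates (input, label) pairs from the corpus, defines the model, and trains the model on those pairs.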

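The content is organized into the following parts:

1. Data preparation: define the corpus, then clean, normalize, and tokenize it

2. Hyperparameters: learning rate, number of training iterations, window size, embedding size

3. Generating the training data: build the vocabulary, one-hot encode the words, and build the id-to-word and word-to-id dictionaries

4. Model training: run the encoded words through the forward pass, compute the loss, and use backpropagation to adjust the weights

5. Conclusion: obtain the word vectors and find similar words

6. Further improvements: speed up training with negative sampling and hierarchical softmax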
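
As a rough illustration of item 6, the negative sampling loss for a single (center, context) pair can be sketched as below. This is a sketch under assumptions: the function and variable names are hypothetical, the vectors are random, and the step of drawing negative words from a noise distribution is left out.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def negative_sampling_loss(v_center, u_context, u_negatives):
    """Loss for one pair: -log s(u_o . v_c) - sum_k log s(-u_k . v_c)."""
    positive_term = np.log(sigmoid(u_context @ v_center))
    negative_term = np.log(sigmoid(-(u_negatives @ v_center))).sum()
    return float(-(positive_term + negative_term))

rng = np.random.default_rng(0)
dim, num_negatives = 16, 5
v_c = rng.normal(size=dim)                     # center word vector
u_o = rng.normal(size=dim)                     # true context word vector
u_neg = rng.normal(size=(num_negatives, dim))  # vectors of sampled negative words

loss = negative_sampling_loss(v_c, u_o, u_neg)
print(loss > 0)  # True
```

Each update touches only the true context word and a handful of negatives instead of the whole vocabulary, which is where the speed-up over the full softmax comes from.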

Corpus

# Corpus
sentences = ["apple banana fruit", "banana orange fruit", "orange banana fruit",
             "dog cat animal", "cat monkey animal", "monkey dog animal"]

Processing the data

# Process the corpus.
# join() concatenates the sentences into one string separated by spaces;
# split() then splits that string on whitespace into a flat list of words:
# ['apple', 'banana', 'fruit', 'banana', 'orange', 'fruit', 'orange', 'banana', 'fruit',
#  'dog', 'cat', 'animal', 'cat', 'monkey', 'animal', 'monkey', 'dog', 'animal']
word_sequence = " ".join(sentences).split()
# Remove duplicate words. sorted() keeps the ids identical on every run
# (the iteration order of a bare set of strings can change between runs).
# ['animal', 'apple', 'banana', 'cat', 'dog', 'fruit', 'monkey', 'orange']
word_list = sorted(set(word_sequence))
# Build the vocabulary: map each word to an integer id
# {'animal': 0, 'apple': 1, 'banana': 2, 'cat': 3, 'dog': 4, 'fruit': 5, 'monkey': 6, 'orange': 7}
word_dict = {w: i for i, w in enumerate(word_list)}
# Vocabulary size (8 for this corpus)
voc_size = len(word_list)

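Building the training data

# Use a window of 1: only the word before and the word after the center word.
# word_sequence is the concatenation of all sentences, so some pairs span sentence boundaries.
skip_grams = []  # training data
for i in range(1, len(word_sequence) - 1):  # range() is half-open: [start, stop)
    target = word_dict[word_sequence[i]]  # id of the center word
    # ids of the context words (previous and next word)
    context = [word_dict[word_sequence[i - 1]], word_dict[word_sequence[i + 1]]]
    for w in context:
        # Store one [center, context] pair per context word.
        # Example: target=4 with previous id 3 and next id 1 stores [4, 3] and [4, 1]
        skip_grams.append([target, w])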

Building the model

class Word2Vec(nn.Module):
    def __init__(self):
        # Initialize the parent class
        super().__init__()
        # Two independent weight matrices: WT is NOT the transpose of W.
        # voc_size is 8 for this corpus.
        self.W = nn.Linear(voc_size, embedding_size, bias=False)  # voc_size -> embedding_size
        self.WT = nn.Linear(embedding_size, voc_size, bias=False)  # embedding_size -> voc_size

    # Forward pass
    def forward(self, X):
        # X : [batch_size, voc_size]
        hidden_layer = self.W(X)  # hidden_layer : [batch_size, embedding_size]
        output_layer = self.WT(hidden_layer)  # output_layer : [batch_size, voc_size]
        return output_layer

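Batching: converting to one-hot form

def random_batch():
    random_inputs = []
    random_labels = []
    # Pick batch_size distinct pair indices at random
    random_index = np.random.choice(range(len(skip_grams)), batch_size, replace=False)

    for i in random_index:
        # Input: one-hot vector of the center word (a row of the identity matrix)
        random_inputs.append(np.eye(voc_size)[skip_grams[i][0]])  # target
        # Label: id of the context word (a class index, not one-hot)
        random_labels.append(skip_grams[i][1])  # context word

    return random_inputs, random_labels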

Training

# Training loop: 5000 iterations, one random mini-batch per iteration
for epoch in range(5000):
    # Draw a mini-batch
    input_batch, target_batch = random_batch()
    # Stack the one-hot rows into one array before building the tensor
    input_batch = torch.tensor(np.array(input_batch), dtype=torch.float32)
    target_batch = torch.LongTensor(target_batch)
    # print(input_batch)
    # print("target_batch")
    # print(target_batch)

    # Reset the gradients accumulated by the optimizer
    optimizer.zero_grad()
    # Model output (unnormalized scores)
    output = model(input_batch)

    # output : [batch_size, voc_size], target_batch : [batch_size] (LongTensor, not one-hot)
    loss = criterion(output, target_batch)
    # Print the iteration number and the loss every 1000 iterations
    if (epoch + 1) % 1000 == 0:
        print('Epoch:', '%04d' % (epoch + 1), 'cost =', '{:.6f}'.format(loss.item()))
    # Backpropagation
    loss.backward()
    # Update the parameters
    optimizer.step()
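
The loop above passes raw scores and integer class ids to the loss, with no explicit softmax. A cross-entropy loss of this kind applies log-softmax internally and averages the negative log-probability of the target class over the batch. The NumPy sketch below reproduces that computation; the function name is hypothetical and the inputs are toy values.

```python
import numpy as np

def cross_entropy_from_logits(logits, targets):
    """Mean over the batch of -log softmax(logits)[target]."""
    shifted = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(targets)), targets].mean())

# Toy batch: 2 samples, 8 classes, all scores equal, so the prediction is uniform.
logits = np.zeros((2, 8))
targets = np.array([3, 5])   # class ids, not one-hot vectors
loss = cross_entropy_from_logits(logits, targets)
print(abs(loss - np.log(8)) < 1e-12)  # True
```

A uniform prediction over 8 words gives a loss of log(8), about 2.08, which is roughly the value to expect at the start of training on this 8-word vocabulary.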

Complete code

import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import matplotlib.pyplot as plt


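# Draw a random mini-batch
def random_batch():
    random_inputs = []
    random_labels = []
    # Pick batch_size distinct pair indices at random
    random_index = np.random.choice(range(len(skip_grams)), batch_size, replace=False)

    for i in random_index:
        # Input: one-hot vector of the center word (a row of the identity matrix)
        random_inputs.append(np.eye(voc_size)[skip_grams[i][0]])  # target
        # Label: id of the context word (a class index, not one-hot)
        random_labels.append(skip_grams[i][1])  # context word

    return random_inputs, random_labels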


# Model
class Word2Vec(nn.Module):
    def __init__(self):
        # Initialize the parent class
        super().__init__()
        # Two independent weight matrices: WT is NOT the transpose of W.
        # voc_size is 8 for this corpus.
        self.W = nn.Linear(voc_size, embedding_size, bias=False)  # voc_size -> embedding_size
        self.WT = nn.Linear(embedding_size, voc_size, bias=False)  # embedding_size -> voc_size

    # Forward pass
    def forward(self, X):
        # X : [batch_size, voc_size]
        hidden_layer = self.W(X)  # hidden_layer : [batch_size, embedding_size]
        output_layer = self.WT(hidden_layer)  # output_layer : [batch_size, voc_size]
        return output_layer


if __name__ == '__main__':
    # Number of samples per mini-batch
    batch_size = 2
    # Dimension of each word vector
    embedding_size = 2
    # Corpus
    sentences = ["apple banana fruit", "banana orange fruit", "orange banana fruit",
                 "dog cat animal", "cat monkey animal", "monkey dog animal"]
    # Data processing: flatten the corpus into one list of words
    # ['apple', 'banana', 'fruit', 'banana', 'orange', 'fruit', 'orange', 'banana', 'fruit',
    #  'dog', 'cat', 'animal', 'cat', 'monkey', 'animal', 'monkey', 'dog', 'animal']
    word_sequence = " ".join(sentences).split()
    # Remove duplicate words. sorted() keeps the ids identical on every run
    # (the iteration order of a bare set of strings can change between runs).
    # ['animal', 'apple', 'banana', 'cat', 'dog', 'fruit', 'monkey', 'orange']
    word_list = sorted(set(word_sequence))
    # Map each word to an integer id
    # {'animal': 0, 'apple': 1, 'banana': 2, 'cat': 3, 'dog': 4, 'fruit': 5, 'monkey': 6, 'orange': 7}
    word_dict = {w: i for i, w in enumerate(word_list)}
    # print(word_dict)
    # Vocabulary size: 8
    voc_size = len(word_list)

    # Build (center, context) training pairs with a window of 1.
    # word_sequence is the concatenation of all sentences, so some pairs span sentence boundaries.
    skip_grams = []  # training data
    for i in range(1, len(word_sequence) - 1):  # range() is half-open: [start, stop)
        target = word_dict[word_sequence[i]]  # id of the center word
        # ids of the context words (previous and next word)
        context = [word_dict[word_sequence[i - 1]], word_dict[word_sequence[i + 1]]]
        for w in context:
            skip_grams.append([target, w])
    # print(word_sequence)
    # print(skip_grams)
    model = Word2Vec()
    # Loss function: cross-entropy
    criterion = nn.CrossEntropyLoss()
    # Optimizer: Adam with learning rate 0.001
    optimizer = optim.Adam(model.parameters(), lr=0.001)
    # Training loop: 5000 iterations, one random mini-batch per iteration
    for epoch in range(5000):
        # Draw a mini-batch
        input_batch, target_batch = random_batch()
        # Stack the one-hot rows into one array before building the tensor
        input_batch = torch.tensor(np.array(input_batch), dtype=torch.float32)
        target_batch = torch.LongTensor(target_batch)
        # print(input_batch)
        # print("target_batch")
        # print(target_batch)

        # Reset the gradients accumulated by the optimizer
        optimizer.zero_grad()
        # Model output (unnormalized scores)
        output = model(input_batch)

        # output : [batch_size, voc_size], target_batch : [batch_size] (LongTensor, not one-hot)
        loss = criterion(output, target_batch)
        # Print the iteration number and the loss every 1000 iterations
        if (epoch + 1) % 1000 == 0:
            print('Epoch:', '%04d' % (epoch + 1), 'cost =', '{:.6f}'.format(loss.item()))
        # Backpropagation
        loss.backward()
        # Update the parameters
        optimizer.step()

    # for i, label in enumerate(word_list):
    #     W, WT = model.parameters()
    #     x, y = W[0][i].item(), W[1][i].item()
    #     plt.scatter(x, y)
    #     plt.annotate(label, xy=(x, y), xytext=(5, 2), textcoords='offset points', ha='right', va='bottom')
    # plt.show()