Preface
This article implements word prediction with both the CBOW and Skip-Gram models. The figure below shows the structure of the two models:
I. The CBOW Model
1. Introduction to the CBOW model
What the CBOW model does: given the x words on each side of a position in the text, it predicts the word in the middle (x is the number of context words on each side and is adjustable; the code below uses two words on each side to predict the middle word).
The CBOW model takes the surrounding context (t - 1, t + 1) into account; its full name is the Continuous Bag-of-Words Model. Given the context words, it predicts the probability of the target word. As shown in the figure, the Input layer holds the given words, $h_1, \dots, h_N$ is the hidden vector built from their word vectors (also called the input word vectors), and the Output layer is the network's output; applying softmax to the Output layer yields the probability of each candidate word given this input.
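Written out as formulas, this is the classic CBOW formulation (the code below follows the same idea, but sums rather than averages the context vectors and adds a small hidden ReLU layer): with $2C$ context words $w_1, \dots, w_{2C}$, the projection vector and the output probability are

$$h = \frac{1}{2C}\sum_{i=1}^{2C} v_{w_i}, \qquad p(w_t \mid w_1, \dots, w_{2C}) = \frac{\exp(u_{w_t}^{\top} h)}{\sum_{w \in V} \exp(u_w^{\top} h)}$$

where $v_w$ are the input word vectors, $u_w$ are the output-layer vectors, and $V$ is the vocabulary.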
2. CBOW模型实现
Step 1: take any short piece of English text, split it into words, collect the unique words into the set word, and build the index dictionaries word_to_ix and ix_to_word.
import torch
import torch.nn as nn
text = """People who truly loved once are far more likely to love again.
Difficult circumstances serve as a textbook of life for people.
The best preparation for tomorrow is doing your best today.
The reason why a great man is great is that he resolves to be a great man.
The shortest way to do many things is to only one thing at a time.
Only they who fulfill their duties in everyday matters will fulfill them on great occasions.
I go all out to deal with the ordinary life.
I can stand up once again on my own.
Never underestimate your power to change yourself.""".split()
word = set(text)        # vocabulary: the unique tokens in the text
word_size = len(word)   # vocabulary size
word_to_ix = {word: ix for ix, word in enumerate(word)}   # token -> index
ix_to_word = {ix: word for ix, word in enumerate(word)}   # index -> token
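A quick sanity check on the two dictionaries (the exact index values depend on the set's iteration order, so they vary from run to run; note also that splitting on whitespace keeps punctuation attached to tokens such as 'today.'):
print(word_size)                         # number of distinct tokens
print(word_to_ix['truly'])               # some index in [0, word_size)
print(ix_to_word[word_to_ix['truly']])   # 'truly': the two dictionaries are inverses of each other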
Note: enumerate() is a Python built-in function.
enumerate means to list items one by one, keeping a count.
Its argument is any iterable object (such as a list or a string).
It is mostly used inside for loops to get a running count, because it yields the index and the value together; use it whenever you need both, as in the small example below.
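For instance, a tiny standalone example of enumerate (unrelated to the model code):
fruits = ['apple', 'banana', 'cherry']
for ix, value in enumerate(fruits):
    print(ix, value)
# 0 apple
# 1 banana
# 2 cherry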
Step 2: define the helpers: make_context_vector turns a context window into a tensor of word indices, and the CBOW class defines the model;
def make_context_vector(context, word_to_ix):
    idxs = [word_to_ix[w] for w in context]        # map each context word to its index
    return torch.tensor(idxs, dtype=torch.long)

EMDEDDING_DIM = 100  # dimension of the word vectors

# Build (context, target) pairs: the two words on each side predict the middle word
data = []
for i in range(2, len(text) - 2):
    context = [text[i - 2], text[i - 1],
               text[i + 1], text[i + 2]]
    target = text[i]
    data.append((context, target))
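# A quick look at the pairs that were just built (the tuple below follows directly
# from the text above; the tensor's index values depend on the set's iteration order):
print(data[0])
# (['People', 'who', 'loved', 'once'], 'truly')
print(make_context_vector(data[0][0], word_to_ix))
# a length-4 tensor of word indices, dtype=torch.long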
class CBOW(torch.nn.Module):
    def __init__(self, word_size, embedding_dim):
        super(CBOW, self).__init__()
        self.embeddings = nn.Embedding(word_size, embedding_dim)   # input word vectors
        self.linear1 = nn.Linear(embedding_dim, 128)
        self.activation_function1 = nn.ReLU()
        self.linear2 = nn.Linear(128, word_size)                   # one score per vocabulary word
        self.activation_function2 = nn.LogSoftmax(dim=-1)          # log-probabilities, for NLLLoss

    def forward(self, inputs):
        # Sum the context-word vectors into a single (1, embedding_dim) vector
        embeds = sum(self.embeddings(inputs)).view(1, -1)
        out = self.linear1(embeds)
        out = self.activation_function1(out)
        out = self.linear2(out)
        out = self.activation_function2(out)
        return out

    def get_word_emdedding(self, word):
        word = torch.tensor([word_to_ix[word]])
        return self.embeddings(word).view(1, -1)
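Before training, a quick shape check on a throw-away instance of the class (the variable name _check is only for this demonstration; it is separate from the model trained in the next step):
_check = CBOW(word_size, EMDEDDING_DIM)
out = _check(make_context_vector(data[0][0], word_to_ix))
print(out.shape)   # (1, word_size): one log-probability per vocabulary word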
Step 3: build the model and start training;
model = CBOW(word_size, EMDEDDING_DIM)
loss_function = nn.NLLLoss()   # expects log-probabilities, which LogSoftmax above provides
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)

# Start training
for epoch in range(100):
    total_loss = 0
    for context, target in data:
        context_vector = make_context_vector(context, word_to_ix)
        log_probs = model(context_vector)
        total_loss += loss_function(log_probs, torch.tensor([word_to