Preface
This article implements word prediction with both the CBOW and Skip-Gram models. The figure below shows the structure of the two models:
I. The CBOW Model
1. Introduction to the CBOW model
What the CBOW model does: given the x words on each side of a position in the text, it predicts the word in the middle (x is the number of context words on each side and is adjustable; the code below uses two words on each side to predict the middle word).
The CBOW model takes the surrounding context (t - 1, t + 1) into account; its full name is the Continuous Bag-of-Words Model. Given the context words, it predicts the probability of the target word. As shown in the figure, the Input layer holds the given words, $h_1, \dots, h_N$ is the hidden vector built from their word vectors (also called the input word vectors), and the Output layer is the network's output; applying softmax to the Output layer yields the probability of each candidate word given this input.
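Written out as formulas, this is the classic CBOW formulation (the code below follows the same idea, but sums rather than averages the context vectors and adds a small hidden ReLU layer): with $2C$ context words $w_1, \dots, w_{2C}$, the projection vector and the output probability are

$$h = \frac{1}{2C}\sum_{i=1}^{2C} v_{w_i}, \qquad p(w_t \mid w_1, \dots, w_{2C}) = \frac{\exp(u_{w_t}^{\top} h)}{\sum_{w \in V} \exp(u_w^{\top} h)}$$

where $v_w$ are the input word vectors, $u_w$ are the output-layer vectors, and $V$ is the vocabulary.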
2. CBOW模型实现
Step 1: take any short piece of English text, split it into words, collect the unique words into the set word, and build the index dictionaries word_to_ix and ix_to_word.
import torch
import torch.nn as nn
text = """People who truly loved once are far more likely to love again.
Difficult circumstances serve as a textbook of life for people.
The best preparation for tomorrow is doing your best today.
The reason why a great man is great is that he resolves to be a great man.
The shortest way to do many things is to only one thing at a time.
Only they who fulfill their duties in everyday matters will fulfill them on great occasions.
I go all out to deal with the ordinary life.
I can stand up once again on my own.
Never underestimate your power to change yourself.""".split()
word = set(text)        # vocabulary: the unique tokens in the text
word_size = len(word)   # vocabulary size
word_to_ix = {word: ix for ix, word in enumerate(word)}   # token -> index
ix_to_word = {ix: word for ix, word in enumerate(word)}   # index -> token
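A quick sanity check on the two dictionaries (the exact index values depend on the set's iteration order, so they vary from run to run; note also that splitting on whitespace keeps punctuation attached to tokens such as 'today.'):
print(word_size)                         # number of distinct tokens
print(word_to_ix['truly'])               # some index in [0, word_size)
print(ix_to_word[word_to_ix['truly']])   # 'truly': the two dictionaries are inverses of each other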
Note: enumerate() is a Python built-in function.
enumerate means to list items one by one, keeping a count.
Its argument is any iterable object (such as a list or a string).
It is mostly used inside for loops to get a running count, because it yields the index and the value together; use it whenever you need both, as in the small example below.
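For instance, a tiny standalone example of enumerate (unrelated to the model code):
fruits = ['apple', 'banana', 'cherry']
for ix, value in enumerate(fruits):
    print(ix, value)
# 0 apple
# 1 banana
# 2 cherry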
Step 2: define the helpers: make_context_vector turns a context window into a tensor of word indices, and the CBOW class defines the model;
def make_context_vector(context, word_to_ix):
    idxs = [word_to_ix[w] for w in context]        # map each context word to its index
    return torch.tensor(idxs, dtype=torch.long)

EMDEDDING_DIM = 100  # dimension of the word vectors

# Build (context, target) pairs: the two words on each side predict the middle word
data = []
for i in range(2, len(text) - 2):
    context = [text[i - 2], text[i - 1],
               text[i + 1], text[i + 2]]
    target = text[i]
    data.append((context, target))
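# A quick look at the pairs that were just built (the tuple below follows directly
# from the text above; the tensor's index values depend on the set's iteration order):
print(data[0])
# (['People', 'who', 'loved', 'once'], 'truly')
print(make_context_vector(data[0][0], word_to_ix))
# a length-4 tensor of word indices, dtype=torch.long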
class CBOW(torch.nn.Module):
    def __init__(self, word_size, embedding_dim):
        super(CBOW, self).__init__()
        self.embeddings = nn.Embedding(word_size, embedding_dim)   # input word vectors
        self.linear1 = nn.Linear(embedding_dim, 128)
        self.activation_function1 = nn.ReLU()
        self.linear2 = nn.Linear(128, word_size)                   # one score per vocabulary word
        self.activation_function2 = nn.LogSoftmax(dim=-1)          # log-probabilities, for NLLLoss

    def forward(self, inputs):
        # Sum the context-word vectors into a single (1, embedding_dim) vector
        embeds = sum(self.embeddings(inputs)).view(1, -1)
        out = self.linear1(embeds)
        out = self.activation_function1(out)
        out = self.linear2(out)
        out = self.activation_function2(out)
        return out

    def get_word_emdedding(self, word):
        word = torch.tensor([word_to_ix[word]])
        return self.embeddings(word).view(1, -1)
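Before training, a quick shape check on a throw-away instance of the class (the variable name _check is only for this demonstration; it is separate from the model trained in the next step):
_check = CBOW(word_size, EMDEDDING_DIM)
out = _check(make_context_vector(data[0][0], word_to_ix))
print(out.shape)   # (1, word_size): one log-probability per vocabulary word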
Step 3: build the model and start training;
model = CBOW(word_size, EMDEDDING_DIM)
loss_function = nn.NLLLoss()   # expects log-probabilities, which LogSoftmax above provides
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)

# Start training
for epoch in range(100):
    total_loss = 0
    for context, target in data:
        context_vector = make_context_vector(context, word_to_ix)
        log_probs = model(context_vector)
        total_loss += loss_function(log_probs, torch.tensor([word_to