Building an N-gram Model in PyTorch for Word Prediction

The N-gram model assumes that a word is predicted from only the N-1 words before it, rather than from all of the preceding words.

So, for the 1-gram model (every word is independent of the others):

P(w1, w2, …, wn) = P(w1) P(w2|w1) P(w3|w1, w2) … P(wn|w1, …, wn-1)
                 ≈ P(w1) P(w2) P(w3) … P(wn)

For the 2-gram model (each word depends only on the single word immediately before it):

P(w1, w2, …, wn) = P(w1) P(w2|w1) P(w3|w1, w2) … P(wn|w1, …, wn-1)
                 ≈ P(w1) P(w2|w1) P(w3|w2) P(w4|w3) … P(wn|wn-1)

For the 3-gram model (each word depends only on the two words immediately before it):

P(w1, w2, …, wn) = P(w1) P(w2|w1) P(w3|w1, w2) … P(wn|w1, …, wn-1)
                 ≈ P(w1) P(w2|w1) P(w3|w1, w2) P(w4|w2, w3) … P(wn|wn-2, wn-1)

Here, P(w2|w1) = (number of times w1 is followed by w2 in the corpus) / (number of times w1 appears in the corpus).

For this conditional probability, the traditional approach is to count word frequencies in the corpus and estimate it from those counts, as in the ratio above. Here we replace the counting with word embeddings: we maximize the conditional probability in order to optimize the word vectors, and then use the trained model to make predictions.
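To make the count-based estimate concrete, here is a minimal sketch (the tiny corpus below is made up purely for illustration) that computes P(w2|w1) as the ratio of bigram counts to unigram counts:

from collections import Counter

# toy corpus, made up for illustration only
corpus = "the cat sat on the mat the cat slept".split()

unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))

def bigram_prob(w1, w2):
    # P(w2 | w1) = count(w1 w2) / count(w1)
    return bigram_counts[(w1, w2)] / unigram_counts[w1]

print(bigram_prob("the", "cat"))  # 2/3: "the" occurs 3 times and is followed by "cat" twice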

The code is explained below:

  • First, add the imports; the n-gram model itself is defined in net.py:
import torch
from torch import nn, optim
import net

# net.py: definition of the n-gram model
class n_gram(nn.Module):
    def __init__(self, vocab_size, context_size, n_dim):
        super(n_gram, self).__init__()

        self.embed = nn.Embedding(vocab_size, n_dim)   # embedding table of shape (vocab_size, n_dim)
        self.classify = nn.Sequential(
            nn.Linear(context_size * n_dim, 128),
            nn.ReLU(True),
            nn.Linear(128, vocab_size)
        )

    def forward(self, x):
        voc_embed = self.embed(x)          # look up the embeddings: (context_size, n_dim)
        voc_embed = voc_embed.view(1, -1)  # concatenate the context vectors: (1, context_size * n_dim)
        out = self.classify(voc_embed)     # scores over the vocabulary: (1, vocab_size)
        return out
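As a quick shape check, here is a minimal usage sketch (the vocabulary size and the two context indices below are toy values chosen only for illustration, and it assumes the class above is saved as net.py):

import torch
import net  # net.py contains the n_gram class shown above

# toy sizes, chosen only to illustrate the tensor shapes
model = net.n_gram(vocab_size=20, context_size=2, n_dim=10)
context = torch.LongTensor([3, 7])   # indices of the two context words
scores = model(context)
print(scores.shape)                  # torch.Size([1, 20]): one score per word in the vocabulary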
  • Define the hyperparameters and the corpus
CONTEXT_SIZE = 2  # use the previous 2 words as context (i.e. a trigram model)
EMBEDDING_DIM = 10  # dimension of the word vectors

test_sentence = """We always knew our daughter Kendall was 
                going be a performer of some sort. 
                She entertained people in our small town 
                by putting on shows on our front porch when 
                she was only three or four. Blonde-haired, 
                blue-eyed, and beautiful, she sang like a 
                little angel and mesmerized1 everyone.""".split()

trigram = [((test_sentence[i], test_sentence[i+1]), test_sentence[i+2])
            for i in range(len(test_sentence)-2)]

Here CONTEXT_SIZE = 2 means we use the previous two words to predict the current word, and EMBEDDING_DIM is the dimensionality of the word embeddings.

Next we build the training set by grouping the words in threes: the first two words serve as the input and the third is the prediction target, as shown in the example below.
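For instance, the first two training pairs built from the sentence above are:

print(trigram[:2])
# [(('We', 'always'), 'knew'), (('always', 'knew'), 'our')]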

  • Encode the words, representing each word by an integer index; only then can they be passed to nn.Embedding to obtain word vectors.
# map each word to an integer index; the word embeddings are built on top of these indices
vocb = set(test_sentence)  # use set to remove duplicate words
word_to_idx = {word: i for i, word in enumerate(vocb)}
idx_to_word = {word_to_idx[word]: word for word in word_to_idx}
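To illustrate how a context pair is turned into model input, a minimal sketch building on the variables defined above (the printed indices depend on the iteration order of the set, so they vary from run to run):

context, target = trigram[0]  # (('We', 'always'), 'knew')
context_idx = torch.LongTensor([word_to_idx[w] for w in context])
print(context_idx)                        # e.g. tensor([12,  5]); exact values vary per run
print(idx_to_word[word_to_idx[target]])   # 'knew'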
  • Initialize the model and define the loss function and optimizer
model = net.n_gram(len(word_to_idx), CONTEXT_SIZE, EMBEDDING_DIM)
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=1e-2, weight_decay=1e-5)
  • Training
for epoch in range(100):
    train_loss = 0
    for word, label in trigram:
        word = torch.LongTensor([word_to_idx[i] for i in word])  # indices of the two context words
        label = torch.LongTensor([word_to_idx[label]])
        # forward pass
        out = model(word)
        loss = criterion(out, label)
        train_loss += loss.item()
        # backward pass
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    if (epoch + 1) % 20 == 0:
        print('epoch: {}, Loss: {:.6f}'.format(epoch + 1, train_loss / len(trigram)))

  • Testing
model.eval()  # switch to evaluation mode
word, label = trigram[15]
print('\ninput:{}'.format(word))
print('label:{}'.format(label))
word = torch.LongTensor([word_to_idx[i] for i in word])
out = model(word)
pred_label_idx = out.max(1)[1].item()  # index of the highest-scoring word in row 0
predict_word = idx_to_word[pred_label_idx]  # map the index back to the word
print('real word is {}, predicted word is {}'.format(label, predict_word))

The predicted word matches the label. Although this is evaluated on the training set, it still shows to some extent that this small model can handle the N-gram word-prediction task.

  • Full code:
import torch
from torch import nn, optim
import net

CONTEXT_SIZE = 2  # use the previous 2 words as context (i.e. a trigram model)
EMBEDDING_DIM = 10  # dimension of the word vectors

test_sentence = """We always knew our daughter Kendall was 
                going be a performer of some sort. 
                She entertained people in our small town 
                by putting on shows on our front porch when 
                she was only three or four. Blonde-haired, 
                blue-eyed, and beautiful, she sang like a 
                little angel and mesmerized1 everyone.""".split()


trigram = [((test_sentence[i], test_sentence[i+1]), test_sentence[i+2])
            for i in range(len(test_sentence)-2)]

# map each word to an integer index; the word embeddings are built on top of these indices
vocb = set(test_sentence)  # use set to remove duplicate words
word_to_idx = {word: i for i, word in enumerate(vocb)}
idx_to_word = {word_to_idx[word]: word for word in word_to_idx}

model = net.n_gram(len(word_to_idx), CONTEXT_SIZE, EMBEDDING_DIM)
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=1e-2, weight_decay=1e-5)


for epoch in range(100):
    train_loss = 0
    for word, label in trigram:
        word = torch.LongTensor([word_to_idx[i] for i in word])  # indices of the two context words
        label = torch.LongTensor([word_to_idx[label]])
        # forward pass
        out = model(word)
        loss = criterion(out, label)
        train_loss += loss.item()
        # backward pass
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    if (epoch + 1) % 20 == 0:
        print('epoch: {}, Loss: {:.6f}'.format(epoch + 1, train_loss / len(trigram)))

model.eval()  # switch to evaluation mode
word, label = trigram[15]
print('\ninput:{}'.format(word))
print('label:{}'.format(label))
word = torch.LongTensor([word_to_idx[i] for i in word])
out = model(word)
pred_label_idx = out.max(1)[1].item()  # index of the highest-scoring word in row 0
predict_word = idx_to_word[pred_label_idx]  # map the index back to the word
print('real word is {}, predicted word is {}'.format(label, predict_word))

 
