Python Learning Notes: N-Gram, predicting the next word from a few preceding words in a text

On the theory behind N-Gram models

The blog post 《自然语言处理NLP中的N-gram模型》 explains the idea in great detail with many examples; reading it carefully is enough to fully understand the underlying theory.

Here we implement an N-Gram model that, given two consecutive words at some position in a text, predicts the word that follows.
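Before the PyTorch version, a minimal count-based sketch of the same trigram idea (my own illustration, not part of the original code) makes the principle concrete: count how often each (w1, w2) pair is followed by w3, then predict the most frequent continuation.

from collections import Counter, defaultdict

text = "when forty winters shall besiege thy brow and dig deep trenches".split()
counts = defaultdict(Counter)
for w1, w2, w3 in zip(text, text[1:], text[2:]):
    counts[(w1, w2)][w3] += 1

print(counts[('forty', 'winters')].most_common(1))  # [('shall', 1)]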

import torch
import torch.nn.functional as F
from torch import nn, optim
from torch.autograd import Variable

I'm working on learning PyTorch, so the implementation here uses PyTorch.

Next, preprocess the data:

CONTEXT_SIZE = 2
EMBEDDING_DIM = 100
# We will use Shakespeare Sonnet 2
test_sentence = """When forty winters shall besiege thy brow,
And dig deep trenches in thy beauty's field,
Thy youth's proud livery so gazed on now,
Will be a totter'd weed of small worth held:
Then being asked, where all thy beauty lies,
Where all the treasure of thy lusty days;
To say, within thine own deep sunken eyes,
Were an all-eating shame, and thriftless praise.
How much more praise deserv'd thy beauty's use,
If thou couldst answer 'This fair child of mine
Shall sum my count, and make my old excuse,'
Proving his beauty by succession thine!
This were to be new made when thou art old,
And see thy blood warm when thou feel'st it cold.""".split()

trigram = [((test_sentence[i], test_sentence[i + 1]), test_sentence[i + 2])
           for i in range(len(test_sentence) - 2)]

vocb = set(test_sentence)
word_to_idx = {word: i for i, word in enumerate(vocb)}
idx_to_word = {word_to_idx[word]: word for word in word_to_idx}

CONTEXT_SIZE is the size of the input context; here two words are used to predict the next one, so it is set to 2.

EMBEDDING_DIM is the dimensionality of the word embeddings. For capturing word semantics, values of 50 or more are common; opinions vary, and while larger dimensions often help, bigger is not always better. See the relevant articles for more guidance. I use 100.

test_sentence is a passage of text I picked (Shakespeare's Sonnet 2).

trigram stores the text as a list of pairs of the form ((word_i, word_i+1), word_i+2) for i from 0 to n-3 (n being the number of words), which is convenient to feed to the model.

vocb is the set of unique words, used to build the vocabulary.

word_to_idx is the dictionary mapping each word to an id.

idx_to_word is the dictionary mapping each id back to its word.
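As a quick check of the preprocessing (my own addition, not part of the original code), printing the first training example and the vocabulary size shows what the model will consume:

print(trigram[0])        # (('When', 'forty'), 'winters')
print(len(word_to_idx))  # number of unique words in the sonnet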

class NgramModel(nn.Module):
    def __init__(self, vocb_size, context_size, n_dim):
        super(NgramModel, self).__init__()
        # __init__ sets how many inputs and outputs each layer has
        self.n_word = vocb_size
        self.embedding = nn.Embedding(self.n_word, n_dim)
        self.linear1 = nn.Linear(context_size * n_dim, 128)
        self.linear2 = nn.Linear(128, self.n_word)

    def forward(self, x):
        emb = self.embedding(x)
        emb = emb.view(1, -1)
        out = self.linear1(emb)
        out = F.relu(out)  # activation function
        out = self.linear2(out)
        log_prob = F.log_softmax(out, dim=1)
        return log_prob

Defining the N-Gram model:

self.linear1 can be seen as a hidden layer that maps the concatenated embeddings to a 128-dimensional output.
self.linear2 is the output layer that maps that 128-dimensional vector to one score per vocabulary word.

For an explanation of the activation function and log_softmax, see this article:
《PyTorch学习笔记——softmax和log_softmax的区别、CrossEntropyLoss() 与 NLLLoss() 的区别、log似然代价函数》
Since I use F.log_softmax here, the loss is computed with nn.NLLLoss() later on.
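As a side note, here is a minimal sketch (my own, with hypothetical logits and target tensors) of why log_softmax pairs with nn.NLLLoss(): applying NLLLoss to log_softmax output gives the same value as applying CrossEntropyLoss to the raw scores.

import torch
import torch.nn.functional as F
from torch import nn

logits = torch.randn(1, 5)  # raw scores for a toy vocabulary of 5 words
target = torch.tensor([2])  # index of the correct word

loss_a = nn.NLLLoss()(F.log_softmax(logits, dim=1), target)
loss_b = nn.CrossEntropyLoss()(logits, target)
print(torch.allclose(loss_a, loss_b))  # True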

ngrammodel = NgramModel(len(word_to_idx), CONTEXT_SIZE, EMBEDDING_DIM)
criterion = nn.NLLLoss()
optimizer = optim.SGD(ngrammodel.parameters(), lr=1e-3)

This step initializes the NgramModel, the nn.NLLLoss() criterion, and the optimizer.
criterion is used to compute the loss.
optimizer performs the parameter updates after the backward pass.
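Before training, a quick sanity check (my own addition, not in the original post) is to feed one context pair through the untrained model and confirm the output shape is (1, vocabulary size), i.e. one log-probability per word:

context, _ = trigram[0]
context_ids = torch.LongTensor([word_to_idx[w] for w in context])
with torch.no_grad():
    log_probs = ngrammodel(context_ids)
print(log_probs.shape)  # torch.Size([1, len(word_to_idx)])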

for epoch in range(100):
    print('epoch: {}'.format(epoch + 1))
    print('*' * 10)
    running_loss = 0
    for data in trigram:
        word, label = data
        word = Variable(torch.LongTensor([word_to_idx[i] for i in word]))
        label = Variable(torch.LongTensor([word_to_idx[label]]))
        # forward
        out = ngrammodel(word)
        loss = criterion(out, label)
        running_loss += loss.item()
        # backward
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    print('Loss: {:.6f}'.format(running_loss / len(trigram)))  # average loss per training example

The code above trains the model.

The loop runs 100 epochs; each epoch prints its number and the average loss for that epoch.

Each item taken from trigram is one prepared example: word holds the two input words and label holds the target word.

The next two lines convert word and label from strings to their dictionary ids.

The forward pass feeds the input through the model, compares the output with the target to get the loss, and accumulates the loss.

The backward pass zeroes the optimizer's gradients, calls loss.backward(), and then calls optimizer.step().
The optimizer updates the parameters based on the backpropagated gradients, so optimizer.step() must be called after loss.backward().
For the reasoning behind this calling order, see 《Pytorch optimizer.step() 和loss.backward()和scheduler.step()的关系与区别 (Pytorch 代码讲解)》 and 《Pytorch手册》.
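As a standalone illustration of that order (a minimal sketch using a hypothetical toy model, not the NgramModel above):

import torch
from torch import nn, optim

toy = nn.Linear(4, 2)                       # hypothetical tiny model
opt = optim.SGD(toy.parameters(), lr=1e-3)

x = torch.randn(1, 4)
y = torch.tensor([1])
loss = nn.CrossEntropyLoss()(toy(x), y)

opt.zero_grad()   # clear gradients left over from the previous step
loss.backward()   # compute gradients of the loss w.r.t. the parameters
opt.step()        # update the parameters using those gradients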

word, label = trigram[1]
word = Variable(torch.LongTensor([word_to_idx[i] for i in word]))
out = ngrammodel(word)
_, predict_label = torch.max(out, 1)
predict_word = idx_to_word[predict_label.item()]
print('real word is {}, predict word is {}'.format(label, predict_word))

After training, the code above is used to make a prediction.
It takes trigram[1], where word is the pair of context words and label is the target word.
word is converted to ids and fed to the model.
torch.max picks the output with the highest score.
That highest-scoring id is then mapped back to a word through the dictionary.

Finally, the predicted word and the target word are printed.
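As an optional extension (my own sketch, not in the original post), torch.topk can show the top three candidate words instead of only the single best one, reusing the out and idx_to_word objects from the code above:

topk_scores, topk_ids = torch.topk(out, k=3, dim=1)
top_words = [idx_to_word[i.item()] for i in topk_ids[0]]
print('top-3 predictions:', top_words)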
Complete code:

import torch
import torch.nn.functional as F
from torch import nn, optim
from torch.autograd import Variable

CONTEXT_SIZE = 2
EMBEDDING_DIM = 100
# We will use Shakespeare Sonnet 2
test_sentence = """When forty winters shall besiege thy brow,
And dig deep trenches in thy beauty's field,
Thy youth's proud livery so gazed on now,
Will be a totter'd weed of small worth held:
Then being asked, where all thy beauty lies,
Where all the treasure of thy lusty days;
To say, within thine own deep sunken eyes,
Were an all-eating shame, and thriftless praise.
How much more praise deserv'd thy beauty's use,
If thou couldst answer 'This fair child of mine
Shall sum my count, and make my old excuse,'
Proving his beauty by succession thine!
This were to be new made when thou art old,
And see thy blood warm when thou feel'st it cold.""".split()

trigram = [((test_sentence[i], test_sentence[i + 1]), test_sentence[i + 2])
           for i in range(len(test_sentence) - 2)]

vocb = set(test_sentence)
word_to_idx = {word: i for i, word in enumerate(vocb)}
idx_to_word = {word_to_idx[word]: word for word in word_to_idx}


class NgramModel(nn.Module):
    def __init__(self, vocb_size, context_size, n_dim):
        super(NgramModel, self).__init__()
        self.n_word = vocb_size
        self.embedding = nn.Embedding(self.n_word, n_dim)
        self.linear1 = nn.Linear(context_size * n_dim, 128)
        self.linear2 = nn.Linear(128, self.n_word)

    def forward(self, x):
        emb = self.embedding(x)
        emb = emb.view(1, -1)
        out = self.linear1(emb)
        out = F.relu(out)  # activation function
        out = self.linear2(out)
        log_prob = F.log_softmax(out, dim=1)
        return log_prob


ngrammodel = NgramModel(len(word_to_idx), CONTEXT_SIZE, EMBEDDING_DIM)
criterion = nn.NLLLoss()
optimizer = optim.SGD(ngrammodel.parameters(), lr=1e-3)

for epoch in range(100):
    print('epoch: {}'.format(epoch + 1))
    print('*' * 10)
    running_loss = 0
    for data in trigram:
        word, label = data
        word = Variable(torch.LongTensor([word_to_idx[i] for i in word]))
        label = Variable(torch.LongTensor([word_to_idx[label]]))
        # forward
        out = ngrammodel(word)
        loss = criterion(out, label)
        running_loss += loss.item()
        # backward
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    print('Loss: {:.6f}'.format(running_loss / len(trigram)))  # average loss per training example

word, label = trigram[1]
word = Variable(torch.LongTensor([word_to_idx[i] for i in word]))
out = ngrammodel(word)
_, predict_label = torch.max(out, 1)
predict_word = idx_to_word[predict_label.item()]
print('real word is {}, predict word is {}'.format(label, predict_word))