An N-gram model assumes that the next word depends only on the previous N-1 words, not on all of the words that came before it.

So for a 1-gram model (every word is independently distributed):

$$P(w_1, w_2, \dots, w_n) = P(w_1)P(w_2 \mid w_1)P(w_3 \mid w_1 w_2) \cdots P(w_n \mid w_1 w_2 \cdots w_{n-1}) \approx P(w_1)P(w_2)P(w_3) \cdots P(w_n)$$

For a 2-gram model (each word depends only on the word immediately to its left):

$$P(w_1, w_2, \dots, w_n) \approx P(w_1)P(w_2 \mid w_1)P(w_3 \mid w_2)P(w_4 \mid w_3) \cdots P(w_n \mid w_{n-1})$$

For a 3-gram model (each word depends on the two words immediately to its left):

$$P(w_1, w_2, \dots, w_n) \approx P(w_1)P(w_2 \mid w_1)P(w_3 \mid w_1 w_2)P(w_4 \mid w_2 w_3) \cdots P(w_n \mid w_{n-2} w_{n-1})$$

The conditional probabilities themselves are estimated from counts, for example:

$$P(w_2 \mid w_1) = \frac{\text{count of } w_1 w_2 \text{ in the corpus}}{\text{count of } w_1 \text{ in the corpus}}$$

The traditional approach is exactly this: count word frequencies in the corpus and estimate the conditional probability from those counts. Here we replace the counting with word embeddings: a small network maps the embeddings of the context words to a score for every word in the vocabulary, and maximizing the conditional probability of the observed next word optimizes the word vectors, which we then use for prediction.
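As a concrete illustration of the count-based estimate, here is a minimal sketch (the corpus is a toy one made up for this example):

from collections import Counter

corpus = "the cat sat on the mat the cat ran".split()  # toy corpus, made up for illustration

unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))

def bigram_prob(w1, w2):
    # count-based estimate of P(w2 | w1)
    return bigram_counts[(w1, w2)] / unigram_counts[w1]

print(bigram_prob('the', 'cat'))  # 2/3: 'the' occurs 3 times, 'the cat' twice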
The code is explained below:
- First the imports. net is the module that defines the n-gram model, shown here:
import torch
from torch import nn, optim
import net
class n_gram(nn.Module):
    def __init__(self, vocab_size, context_size, n_dim):
        super(n_gram, self).__init__()
        self.embed = nn.Embedding(vocab_size, n_dim)  # embedding table of shape (vocab_size, n_dim)
        self.classify = nn.Sequential(
            nn.Linear(context_size * n_dim, 128),
            nn.ReLU(True),
            nn.Linear(128, vocab_size)
        )

    def forward(self, x):
        voc_embed = self.embed(x)  # look up the word embeddings: context_size * n_dim
        voc_embed = voc_embed.view(1, -1)  # concatenate the context vectors: 1 * (context_size * n_dim)
        out = self.classify(voc_embed)  # scores over the vocabulary: 1 * vocab_size
        return out
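As a quick sanity check of the architecture, here is a dummy forward pass through the class defined above (the sizes are arbitrary, chosen only for illustration):

toy_model = n_gram(vocab_size=100, context_size=2, n_dim=10)
x = torch.LongTensor([3, 7])  # indices of two context words
print(toy_model(x).shape)  # torch.Size([1, 100]): one score per vocabulary word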
- Define the hyperparameters and the corpus
CONTEXT_SIZE = 2  # 2-gram: predict from the two preceding words
EMBEDDING_DIM = 10  # dimension of the word vectors
test_sentence = """We always knew our daughter Kendall was
going be a performer of some sort.
She entertained people in our small town
by putting on shows on our front porch when
she was only three or four. Blonde-haired,
blue-eyed, and beautiful, she sang like a
little angel and mesmerized everyone.""".split()
trigram = [((test_sentence[i], test_sentence[i+1]), test_sentence[i+2])
           for i in range(len(test_sentence)-2)]
Here CONTEXT_SIZE = 2 means we predict each word from the two words before it, and EMBEDDING_DIM is the dimension of the word embeddings. We then build the training set by grouping the words in threes: the first two words serve as the input and the third as the prediction target, as the sample below shows.
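For example, the first two entries of trigram are:

print(trigram[:2])
# [(('We', 'always'), 'knew'), (('always', 'knew'), 'our')]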
- Encode the words, representing each one with an integer; only then can they be fed to nn.Embedding to obtain word vectors.
# build the mapping between words and indices, on which the embedding is based
vocb = set(test_sentence)  # use set to remove duplicate words
word_to_idx = {word: i for i, word in enumerate(vocb)}
idx_to_word = {word_to_idx[word]: word for word in word_to_idx}
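A quick sanity check of the two mappings. Note that iteration order over a set is not fixed across runs, so the concrete indices are arbitrary ('Kendall' is just an example word from the corpus):

idx = word_to_idx['Kendall']
assert idx_to_word[idx] == 'Kendall'  # the two dictionaries are inverses of each other
print(len(vocb))  # size of the vocabulary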
- Initialize the model and define the loss and optimizer. Note that nn.CrossEntropyLoss applies log-softmax internally, which is why the network outputs raw scores rather than probabilities.
model = net.n_gram(len(word_to_idx), CONTEXT_SIZE, EMBEDDING_DIM)
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=1e-2, weight_decay=1e-5)
- Training
for epoch in range(100):
    train_loss = 0
    for word, label in trigram:
        word = torch.LongTensor([word_to_idx[i] for i in word])  # encode the two context words
        label = torch.LongTensor([word_to_idx[label]])
        # forward pass
        out = model(word)
        loss = criterion(out, label)
        train_loss += loss.item()
        # backward pass
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    if (epoch + 1) % 20 == 0:
        print('epoch: {}, Loss: {:.6f}'.format(epoch + 1, train_loss / len(trigram)))
- Testing
model = model.eval()
word, label = trigram[15]
print('\ninput:{}'.format(word))
print('label:{}'.format(label))
word = torch.LongTensor([word_to_idx[i] for i in word])
out = model(word)
pred_label_idx = out.max(1)[1].item()  # index of the largest score in the (single) output row
predict_word = idx_to_word[pred_label_idx]  # map the index back to its word
print('real word is {}, predicted word is {}'.format(label, predict_word))
The predicted word matches the label. Although this is on the training set, it still shows to some extent that this small model can handle the N-gram prediction task.
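Since the learned word vectors are the real product of this training, it is worth knowing how to read them back out of the model above; here is a minimal sketch (the choice of 'Kendall' is again just for illustration):

# the learned vectors live in the embedding layer's weight matrix,
# one row per vocabulary word, shape (vocab_size, EMBEDDING_DIM)
embedding_matrix = model.embed.weight.data
vec = embedding_matrix[word_to_idx['Kendall']]
print(vec.shape)  # torch.Size([10]), since EMBEDDING_DIM = 10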
- The complete code:
import torch
from torch import nn, optim
import net

CONTEXT_SIZE = 2  # 2-gram: predict from the two preceding words
EMBEDDING_DIM = 10  # dimension of the word vectors
test_sentence = """We always knew our daughter Kendall was
going be a performer of some sort.
She entertained people in our small town
by putting on shows on our front porch when
she was only three or four. Blonde-haired,
blue-eyed, and beautiful, she sang like a
little angel and mesmerized everyone.""".split()
trigram = [((test_sentence[i], test_sentence[i+1]), test_sentence[i+2])
           for i in range(len(test_sentence)-2)]

# build the mapping between words and indices, on which the embedding is based
vocb = set(test_sentence)  # use set to remove duplicate words
word_to_idx = {word: i for i, word in enumerate(vocb)}
idx_to_word = {word_to_idx[word]: word for word in word_to_idx}

model = net.n_gram(len(word_to_idx), CONTEXT_SIZE, EMBEDDING_DIM)
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=1e-2, weight_decay=1e-5)

for epoch in range(100):
    train_loss = 0
    for word, label in trigram:
        word = torch.LongTensor([word_to_idx[i] for i in word])  # encode the two context words
        label = torch.LongTensor([word_to_idx[label]])
        # forward pass
        out = model(word)
        loss = criterion(out, label)
        train_loss += loss.item()
        # backward pass
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    if (epoch + 1) % 20 == 0:
        print('epoch: {}, Loss: {:.6f}'.format(epoch + 1, train_loss / len(trigram)))

model = model.eval()
word, label = trigram[15]
print('\ninput:{}'.format(word))
print('label:{}'.format(label))
word = torch.LongTensor([word_to_idx[i] for i in word])
out = model(word)
pred_label_idx = out.max(1)[1].item()  # index of the largest score in the (single) output row
predict_word = idx_to_word[pred_label_idx]  # map the index back to its word
print('real word is {}, predicted word is {}'.format(label, predict_word))