https://lena-voita.github.io/nlp_course/language_modeling.html#intro
We now know what neural network language modeling is. Next we apply two specific neural network architectures to the language modeling task: the convolutional neural network (CNN) and the recurrent neural network (RNN).
RNN language modeling
The simplest model is a recurrent model. The goal of RNN language modeling is to predict the likelihood of the next word in a sequence given the previous words in the sequence. The following figure shows a single-layer RNN.
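In probabilistic terms, the model factorizes the probability of a whole sequence into one next-word prediction per step:

P(w_1, …, w_n) = ∏_{t=1}^{n} P(w_t | w_1, …, w_{t−1}),

and the RNN's hidden state is what summarizes the prefix w_1, …, w_{t−1} when predicting w_t.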
Sometimes we use a multi-layer RNN:
from google.colab import drive
drive.mount('/content/drive')
from torch.utils.data import DataLoader, Dataset
import torch
device = "cuda" if torch.cuda.is_available() else "cpu"
# Load and preprocess text data
with open('/content/drive/MyDrive/toy_language_model/text_data.txt', 'r') as f:
    raw_text = f.read()[:1000]
chars = sorted(list(set(raw_text)))
char_to_int = dict((c, i) for i, c in enumerate(chars))
int_to_char = dict((i, c) for i, c in enumerate(chars))
n_chars = len(raw_text)
n_vocab = len(chars)
# Build (input, target) pairs with a sliding window:
# the target is the input sequence shifted by one character
seq_length = 100
dataX = []
dataY = []
for i in range(0, n_chars - seq_length, 1):
    seq_in = raw_text[i:i + seq_length]
    seq_out = raw_text[i + 1:i + seq_length + 1]
    dataX.append([char_to_int[char] for char in seq_in])
    dataY.append([char_to_int[char] for char in seq_out])
class LMDataset(Dataset):
    def __init__(self):
        self.source = dataX
        self.target = dataY

    def __len__(self):
        return len(self.source)

    def __getitem__(self, idx):
        return {"source": torch.tensor(self.source[idx]).to(device),
                "target": torch.tensor(self.target[idx]).to(device)}

train_dataset = LMDataset()
train_loader = DataLoader(train_dataset, batch_size=64, drop_last=True)
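As a quick optional sanity check (not part of the original pipeline), you can pull one batch from the loader and confirm that source and target both have shape (batch_size, seq_length) and that the target really is the source shifted by one character:

batch = next(iter(train_loader))
print(batch["source"].shape, batch["target"].shape)  # torch.Size([64, 100]) for both
# the target is the source shifted one position to the left
print((batch["source"][0, 1:] == batch["target"][0, :-1]).all())  # a True tensor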
import torch
import torch.nn as nn
import numpy as np
# Define the RNN model
class CharRNN(nn.Module):
    def __init__(self, vocab_size, embedding_size, hidden_size, output_size, num_layers=1):
        super(CharRNN, self).__init__()
        self.vocab_size = vocab_size
        self.embedding_size = embedding_size
        self.hidden_size = hidden_size
        self.output_size = output_size
        self.num_layers = num_layers
        # embedding: maps character ids to dense vectors
        self.embedding = nn.Embedding(self.vocab_size, embedding_size)
        # recurrent layer: we use an LSTM
        self.rnn = nn.LSTM(self.embedding_size, self.hidden_size,
                           self.num_layers, batch_first=True)
        # output projection to vocabulary logits
        self.out_fc = nn.Linear(self.hidden_size, self.output_size)
        # dropout applied to the embeddings
        self.dropout = nn.Dropout(0.25)

    def forward(self, input, hidden, cell):
        embedding = self.embedding(input)
        embedding = self.dropout(embedding)
        output, (hidden, cell) = self.rnn(embedding, (hidden, cell))
        out = self.out_fc(output)
        # flatten to (batch * seq_length, vocab_size) for the cross-entropy loss
        return out.view(-1, self.vocab_size), (hidden, cell)

    def init_hidden(self, batch_size):
        hidden = torch.zeros(self.num_layers, batch_size, self.hidden_size).to(device)
        cell = torch.zeros(self.num_layers, batch_size, self.hidden_size).to(device)
        return hidden, cell
# Define the model hyperparameters
input_size = n_vocab
hidden_size = 256
output_size = n_vocab
num_layers = 2
embedding_size = 100
learning_rate = 0.01
# Initialize the model, loss function and optimizer
model = CharRNN(n_vocab, embedding_size, hidden_size, output_size, num_layers).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
# Train the model
for epoch in range(100):
    epoch_loss = 0.0
    hidden, cell = model.init_hidden(batch_size=64)
    for batch in train_loader:
        optimizer.zero_grad()
        inputs = batch["source"]   # (batch_size, seq_length)
        label = batch["target"]    # (batch_size, seq_length)
        # detach the recurrent state so gradients do not flow back into previous batches
        hidden, cell = hidden.detach(), cell.detach()
        output, (hidden, cell) = model(inputs, hidden, cell)
        label = label.view(-1)     # flatten to (batch_size * seq_length,)
        loss = criterion(output, label)
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item()
    if epoch % 10 == 0:
        print('Epoch [{}/{}], Loss: {:.4f}'.format(epoch + 1, 100, epoch_loss))
# Save the model weights
# torch.save(model.state_dict(), 'model_weights.pt')
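After training, the usual way to use a character-level model like this is to generate text one character at a time. The snippet below is a minimal sampling sketch, not part of the original notebook: the generate function and its arguments are illustrative, and it assumes the model, char_to_int and int_to_char defined above, plus a seed string whose characters all appear in the training text.

import torch.nn.functional as F

def generate(seed, length=200, temperature=1.0):
    model.eval()  # disable dropout during sampling
    hidden, cell = model.init_hidden(batch_size=1)
    generated = seed
    input_ids = torch.tensor([[char_to_int[c] for c in seed]]).to(device)
    with torch.no_grad():
        for _ in range(length):
            # logits has shape (seq_len, n_vocab); keep only the last position
            logits, (hidden, cell) = model(input_ids, hidden, cell)
            probs = F.softmax(logits[-1] / temperature, dim=-1)
            next_id = torch.multinomial(probs, num_samples=1).item()
            generated += int_to_char[next_id]
            # feed only the new character back in; the recurrent state carries the history
            input_ids = torch.tensor([[next_id]]).to(device)
    return generated

print(generate("the "))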
CNN language modeling
Convolutional networks can also be used for language modeling. Compared with a classification task, a convolutional language model differs in a few ways:
- prevent information flow from future tokens
  To predict a token, a left-to-right language model must use only the preceding tokens, so make sure your CNN uses nothing but those. For example, we can use a padding strategy to shift the tokens, as in the figure above (see also the sketch after this list).
- do not remove positional information
  Clearly, a CNN used for language modeling can no longer apply pooling, because we need to know the position of every token.
- if you stack many layers, do not forget about residual connections
  With many layers it becomes hard to train a good deep network. To avoid this, we use residual connections.
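To make the first point concrete, here is one possible way (a sketch, not the course's implementation) to keep a 1D convolution causal in PyTorch: pad the sequence with kernel_size − 1 positions on the left only, so the output at position t never sees tokens to its right. The class name CausalConv1d and its parameters are illustrative.

import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    # a 1D convolution that never looks at future tokens (left padding only)
    def __init__(self, channels, kernel_size):
        super().__init__()
        self.pad = kernel_size - 1
        self.conv = nn.Conv1d(channels, channels, kernel_size)

    def forward(self, x):                # x: (batch, channels, seq_len)
        x = F.pad(x, (self.pad, 0))      # pad the time dimension on the left only
        return self.conv(x)              # output keeps shape (batch, channels, seq_len)

x = torch.randn(2, 8, 10)                # toy input: 2 sequences, 8 channels, 10 positions
print(CausalConv1d(8, kernel_size=3)(x).shape)  # torch.Size([2, 8, 10])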
Receptive field:
When you use a convolutional model without global pooling, your model inevitably has a fixed-size context. This seems undesirable: a fixed context size is exactly what we disliked about n-gram models.
However, while a typical context size for an n-gram model is 1-4 tokens, the context of a convolutional model can be quite long. As the figure shows, with only 3 convolutional layers and a small kernel size of 3, a network already has a context of 7 tokens. If you stack many layers, you can get a very large context length.
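As a quick check of those numbers: for stride-1, non-dilated convolutions, each layer widens the receptive field by kernel_size − 1, so a stack of L layers sees 1 + L · (kernel_size − 1) tokens. The helper below is purely illustrative:

def receptive_field(num_layers, kernel_size):
    # stride-1, non-dilated convolutions: each layer adds (kernel_size - 1) positions
    return 1 + num_layers * (kernel_size - 1)

print(receptive_field(3, 3))   # 7, matching the example above
print(receptive_field(10, 3))  # 21 tokens with ten layers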
Residual connections: train deep networks easily
To cover a longer context you need many layers. Unfortunately, when stacking a large number of layers, you can run into trouble propagating gradients from the top of a deep network down to the bottom. To avoid this problem, we can use residual connections.
Residual connections are very simple: they add the input of a block to its output. This way, the gradient with respect to the input flows not only indirectly through the block but also directly through the sum.
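A minimal sketch of this idea in PyTorch (the wrapped block is arbitrary here; in a convolutional language model it would be a causal convolutional block):

import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, block):
        super().__init__()
        self.block = block

    def forward(self, x):
        # gradients reach x both through the block and directly through the sum
        return x + self.block(x)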
There is also a slightly more advanced kind of residual connection called a highway connection. Instead of simply adding the input and the output, it uses a gate, somewhat like the gates in an LSTM.
The figure above shows a convolutional network with residual connections. Usually we place residual connections around blocks of several layers; remember that we need many layers to get a decent receptive field.
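For completeness, here is a sketch of the highway-style connection mentioned above, assuming the block preserves the feature dimension d; the class name HighwayConnection is illustrative. A learned sigmoid gate decides, per feature, how much of the block's output to keep and how much of the input to pass through unchanged:

import torch
import torch.nn as nn

class HighwayConnection(nn.Module):
    def __init__(self, block, d):
        super().__init__()
        self.block = block
        self.gate = nn.Linear(d, d)

    def forward(self, x):
        g = torch.sigmoid(self.gate(x))          # gate values in (0, 1), like an LSTM gate
        return g * self.block(x) + (1 - g) * x   # mix the block output with the input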