We now know what neural network language modeling is. Next, we apply two specific neural network models, the convolutional neural network (CNN) and the recurrent neural network (RNN), to the language modeling task.
RNN language modeling
The simplest model is the recurrent model. The goal of RNN language modeling is to predict the probability of the next word in a sequence given the previous words in that sequence. The following figure shows a single-layer RNN.
Sometimes we can use a multi-layer RNN:
from google.colab import drive
from torch.utils.data import DataLoader, Dataset
import torch

# Mount Google Drive so the data file below is reachable (Colab only)
drive.mount('/content/drive')
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load and preprocess text data (keep only the first 1000 characters)
with open('/content/drive/MyDrive/toy_language_model/text_data.txt', 'r') as f:
    raw_text = f.read()[:1000]

# Build the character vocabulary and lookup tables in both directions
chars = sorted(list(set(raw_text)))
char_to_int = dict((c, i) for i, c in enumerate(chars))
int_to_char = dict((i, c) for i, c in enumerate(chars))
n_chars = len(raw_text)
n_vocab = len(chars)

# Slide a window over the text: the target is the source shifted by one character
seq_length = 100
dataX = []
dataY = []
for i in range(0, n_chars - seq_length, 1):
    seq_in = raw_text[i:i + seq_length]
    seq_out = raw_text[i+1: i + seq_length + 1]
    dataX.append([char_to_int[char] for char in seq_in])
    dataY.append([char_to_int[char] for char in seq_out])

class LMDataset(Dataset):
    def __init__(self):
        self.source = dataX
        self.target = dataY
    def __len__(self):
        return len(self.source)
    def __getitem__(self, idx):
        return {"source": torch.tensor(self.source[idx]).to(device),
                "target": torch.tensor(self.target[idx]).to(device)}

train_dataset = LMDataset()
train_loader = DataLoader(train_dataset, batch_size=64, drop_last=True)
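To make the shifted-by-one targets concrete, here is a minimal sketch of the same sliding-window construction on a made-up string. The text "hello world" and the window size of 4 are illustrative assumptions, not the data used above.

```python
# Toy illustration of the sliding-window pairs (hypothetical text and window size).
text = "hello world"
seq_length = 4

pairs = []
for i in range(len(text) - seq_length):
    source = text[i:i + seq_length]           # characters the model reads
    target = text[i + 1:i + seq_length + 1]   # the same window shifted right by one
    pairs.append((source, target))

for source, target in pairs[:3]:
    print(repr(source), "->", repr(target))
# 'hell' -> 'ello'
# 'ello' -> 'llo '
# 'llo ' -> 'lo w'
```

At every position the target character is simply the next character of the source, which is why a single sequence yields one prediction per time step.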
import torch
import torch.nn as nn
import numpy as np

# Define the RNN model
class CharRNN(nn.Module):
    def __init__(self, vocab_size, embedding_size, hidden_size, output_size, num_layers=1):
        super(CharRNN, self).__init__()
        self.vocab_size = vocab_size
        self.embedding_size = embedding_size
        self.hidden_size = hidden_size
        self.output_size = output_size
        self.num_layers = num_layers
        # define embedding
        self.embedding = nn.Embedding(self.vocab_size, embedding_size)
        # define rnn, we use LSTM
        self.rnn = nn.LSTM(self.embedding_size, self.hidden_size,
                           self.num_layers, batch_first=True)
        # project every hidden state to vocabulary logits
        self.out_fc = nn.Linear(self.hidden_size, self.output_size)
        # define dropout
        self.dropout = nn.Dropout(0.25)

    # define the forward function
    def forward(self, input, hidden, cell):
        embedding = self.embedding(input)
        embedding = self.dropout(embedding)
        output, (hidden, cell) = self.rnn(embedding, (hidden, cell))
        out = self.out_fc(output)
        # flatten to (batch * seq_len, output_size) so it matches the flattened labels
        return out.view(-1, self.output_size), (hidden, cell)

    def init_hidden(self, batch_size):
        # zero initial states, shape (num_layers, batch_size, hidden_size)
        hidden = torch.zeros(self.num_layers, batch_size, self.hidden_size).to(device)
        cell = torch.zeros(self.num_layers, batch_size, self.hidden_size).to(device)
        return hidden, cell
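As a sanity check on tensor shapes, the following sketch runs the same three kinds of layers (embedding, LSTM, linear) on random token ids. All sizes here are hypothetical and chosen only to keep the example small.

```python
import torch
import torch.nn as nn

# Hypothetical sizes, chosen only for this shape check.
batch_size, seq_len = 4, 10
vocab_size, emb_size, hid_size, n_layers = 30, 16, 32, 2

embedding = nn.Embedding(vocab_size, emb_size)
lstm = nn.LSTM(emb_size, hid_size, n_layers, batch_first=True)
out_fc = nn.Linear(hid_size, vocab_size)

tokens = torch.randint(0, vocab_size, (batch_size, seq_len))
h0 = torch.zeros(n_layers, batch_size, hid_size)
c0 = torch.zeros(n_layers, batch_size, hid_size)

emb_out = embedding(tokens)                      # (batch, seq_len, emb_size)
lstm_out, (hn, cn) = lstm(emb_out, (h0, c0))     # (batch, seq_len, hid_size)
logits = out_fc(lstm_out).view(-1, vocab_size)   # (batch * seq_len, vocab_size)

print(emb_out.shape, lstm_out.shape, hn.shape, logits.shape)
```

The flattened logits have one row per character position, so the labels must be flattened the same way before they are passed to the loss.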
# Define the model hyperparameters
input_size = n_vocab
hidden_size = 256
output_size = n_vocab
num_layers = 2
embedding_size = 100
learning_rate = 0.01
# Initialize the model and loss function
model = CharRNN(n_vocab, embedding_size, hidden_size, output_size, num_layers).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
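# Train the model
num_epochs = 100
for epoch in range(num_epochs):
    epoch_loss = 0.0
    hidden, cell = model.init_hidden(batch_size=64)
    for batch in train_loader:
        # Each batch holds 64 sequences, so inputs and labels have shape (64, seq_length)
        inputs = batch["source"]
        label = batch["target"]
        # Detach the carried-over state so gradients stop at the batch boundary
        hidden, cell = hidden.detach(), cell.detach()
        optimizer.zero_grad()
        output, (hidden, cell) = model(inputs, hidden, cell)
        label = label.view(-1)
        loss = criterion(output, label)
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item()
    if epoch % 10 == 0:
        # Report the average loss per batch for this epoch
        print('Epoch [{}/{}], Loss: {:.4f}'.format(epoch+1, num_epochs, epoch_loss / len(train_loader)))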
# Save the model weights
# torch.save(model.state_dict(), '')
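CNN language modeling
Convolutional networks can also be used for language modeling. Compared with classification, a convolutional language model differs in a few ways:
- Prevent information flow from future tokens. To predict a token, a left-to-right language model may use only the previous tokens, so make sure your CNN sees nothing but those tokens. For example, we can pad the sequence on one side to shift the tokens, as shown in the figure above.
- Do not remove positional information. In language modeling the CNN can no longer use pooling, because we need to know the position of every token.
- If you stack many layers, do not forget about residual connections.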
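The first two points can be sketched as a causal 1-D convolution: pad only on the left and skip pooling, so the output keeps one vector per position and position t never sees a later token. This is a minimal sketch, not the implementation behind the figure. The class name `CausalConv1d` and all sizes are assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CausalConv1d(nn.Module):
    """1-D convolution whose output at position t depends only on inputs at positions <= t."""

    def __init__(self, channels, kernel_size):
        super().__init__()
        self.left_pad = kernel_size - 1
        self.conv = nn.Conv1d(channels, channels, kernel_size)

    def forward(self, x):
        # x: (batch, channels, seq_len). Pad on the left only, so no future token is visible.
        x = F.pad(x, (self.left_pad, 0))
        return self.conv(x)


torch.manual_seed(0)
layer = CausalConv1d(channels=8, kernel_size=3)
x = torch.randn(2, 8, 10)
y = layer(x)

# Changing the tokens after position 5 must leave the outputs up to position 5 unchanged.
x_changed = x.clone()
x_changed[:, :, 6:] = torch.randn(2, 8, 4)
y_changed = layer(x_changed)

same_length = y.shape == x.shape
past_unchanged = torch.allclose(y[:, :, :6], y_changed[:, :, :6])
print(same_length, past_unchanged)
```

Because the output at position t is used to predict token t+1, it is fine for it to see token t itself. What matters is that it sees nothing after t.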
Receptive field:
Residual connections: train deep networks easily
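Residual connections are very simple: they add a block's input to its output. As a result, the gradient with respect to the input flows not only indirectly through the block but also directly through the sum.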
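A minimal sketch of this idea in PyTorch follows. The wrapper name `ResidualBlock` and the inner linear layer are assumptions chosen for illustration. The inner block is zeroed out on purpose to show that the gradient still reaches the input through the sum.

```python
import torch
import torch.nn as nn


class ResidualBlock(nn.Module):
    """Wraps any shape-preserving block and adds the block's input to its output."""

    def __init__(self, block):
        super().__init__()
        self.block = block

    def forward(self, x):
        return x + self.block(x)


# An inner block that outputs zeros, so nothing can flow through it.
inner = nn.Linear(4, 4)
nn.init.zeros_(inner.weight)
nn.init.zeros_(inner.bias)

res = ResidualBlock(inner)
x = torch.randn(3, 4, requires_grad=True)
out = res(x)
out.sum().backward()

# The output equals the input, and the gradient of the sum w.r.t. x is all ones:
# it arrived through the skip path alone.
identity_output = torch.allclose(out, x)
direct_gradient = torch.allclose(x.grad, torch.ones_like(x))
print(identity_output, direct_gradient)
```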
A somewhat more advanced variant is the highway connection. Instead of simply adding the input and the output, it uses a gate, somewhat like the gates in an LSTM.
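One common way to write a highway connection is y = g * H(x) + (1 - g) * x, where the gate g comes from a sigmoid. The sketch below follows that form. The linear layers, the ReLU inside H, and the sizes are assumptions for illustration only.

```python
import torch
import torch.nn as nn


class Highway(nn.Module):
    """Highway connection: a learned gate mixes the transformed input with the raw input."""

    def __init__(self, dim):
        super().__init__()
        self.transform = nn.Linear(dim, dim)  # the block H(x)
        self.gate = nn.Linear(dim, dim)       # produces the gate g(x)

    def forward(self, x):
        h = torch.relu(self.transform(x))
        g = torch.sigmoid(self.gate(x))       # sigmoid gate, as in an LSTM
        return g * h + (1.0 - g) * x


torch.manual_seed(0)
layer = Highway(dim=8)
x = torch.randn(4, 8)
out = layer(x)

# With the gate forced shut, the layer passes its input through unchanged.
with torch.no_grad():
    layer.gate.weight.zero_()
    layer.gate.bias.fill_(-100.0)
    closed_out = layer(x)

print(out.shape, torch.allclose(closed_out, x))
```

A plain residual connection always adds the input in full, whereas here the network learns, per dimension, how much of the input to carry over.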
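The figure above shows a convolutional network with residual connections. Typically, we put a residual connection around a block of several layers. Remember that we need many layers to get a decent receptive field.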
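To see why many layers are needed, the receptive field of a stack of convolutions can be computed directly. The sketch below assumes stride 1 and no dilation, in which case each layer with kernel size k widens the field by k - 1 tokens. The function names are hypothetical.

```python
def receptive_field(num_layers, kernel_size):
    """Tokens visible to one output position (stride 1, no dilation assumed)."""
    return 1 + num_layers * (kernel_size - 1)


def layers_needed(target_field, kernel_size):
    """Smallest number of layers whose receptive field covers target_field tokens."""
    # Ceiling division of (target_field - 1) by (kernel_size - 1)
    return -(-(target_field - 1) // (kernel_size - 1))


print(receptive_field(1, 3))    # 3
print(receptive_field(8, 3))    # 17
print(layers_needed(100, 3))    # 50
```

Under these assumptions, covering a 100-token context with kernel size 3 takes 50 layers, which is the depth at which residual connections become important for training.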