Neural Network Language Model
NNLM
Language Model
Models that assign probabilities to sequences of words are called language models. There are primarily two types of Language Models:
- Statistical Language Models: These models use traditional statistical techniques such as N-grams, Hidden Markov Models (HMMs) and certain linguistic rules to learn the probability distribution of words.
- Neural Language Models: These models use different kinds of neural networks to model language and have surpassed statistical language models in effectiveness.
A language model (LM) is the basis for many natural language processing (NLP) tasks. Early NLP systems were built mainly on the basis of manually written rules, which were time-consuming and laborious, and did not cover many linguistic phenomena. It was not until the 1980s that statistical language models were proposed to assign probabilities to a sequence s of N words, i.e.
$$
\begin{aligned} P(s) &= P\left(w_1 w_2 \cdots w_N\right) \\ &= P\left(w_1\right) P\left(w_2 \mid w_1\right) \cdots P\left(w_N \mid w_1 w_2 \cdots w_{N-1}\right) \end{aligned}
$$
- Here $w_i$ stands for the $i^{th}$ word in the sequence $s$. The probability of a sequence of words decomposes into a product of conditional probabilities, each being the probability of the next word given the words that precede it (often called the context history, or simply the context).
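The chain-rule decomposition can be illustrated with a minimal sketch. The conditional probabilities below are made-up toy values, and `sequence_probability` is a hypothetical helper, not part of any library.

```python
# Hypothetical conditional probabilities P(w_i | w_1 ... w_{i-1}) for a toy sentence.
# Each key is the history followed by the word being predicted.
cond_probs = {
    ("the",): 0.5,                  # P(the)
    ("the", "cat"): 0.2,            # P(cat | the)
    ("the", "cat", "sleeps"): 0.1,  # P(sleeps | the cat)
}

def sequence_probability(words, cond_probs):
    """Multiply P(w_i | w_1 ... w_{i-1}) over all positions i (chain rule)."""
    prob = 1.0
    for i in range(1, len(words) + 1):
        prob *= cond_probs[tuple(words[:i])]
    return prob

p = sequence_probability(["the", "cat", "sleeps"], cond_probs)
print(p)  # product of the three conditional probabilities, about 0.01
```

Note that the table needs one entry per distinct history, which is why the number of parameters grows so quickly with sentence length.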
N-gram
- The model above has far too many parameters to learn directly, so an approximation is needed. The N-gram model is the most widely used approximation and was the state of the art before the emergence of NNLM. A (k+1)-gram model is derived from the k-order Markov assumption, which states that the current state depends only on the previous k states, namely:
$$
P\left(w_t \mid w_1 \cdots w_{t-1}\right) \approx P\left(w_t \mid w_{t-k} \cdots w_{t-1}\right)
$$
- The parameters are estimated by maximum likelihood estimation, i.e. from the relative frequencies of N-gram counts in the training corpus.
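As a rough illustration of maximum likelihood estimation for the simplest case $k = 1$ (a bigram model), the estimate is $P(w_t \mid w_{t-1}) = \text{count}(w_{t-1}, w_t) / \text{count}(w_{t-1})$. The sketch below assumes a toy corpus and applies no smoothing; `bigram_mle` is a hypothetical helper name.

```python
from collections import Counter

def bigram_mle(tokens):
    """Return a function giving the MLE estimate P(word | prev) = count(prev, word) / count(prev)."""
    context_counts = Counter(tokens[:-1])                  # how often each word occurs as a context
    bigram_counts = Counter(zip(tokens[:-1], tokens[1:]))  # how often each adjacent pair occurs

    def prob(prev, word):
        if context_counts[prev] == 0:
            return 0.0  # unseen context: no estimate without smoothing
        return bigram_counts[(prev, word)] / context_counts[prev]

    return prob

# Toy corpus (hypothetical)
corpus = "the cat sat on the mat the cat ran".split()
p = bigram_mle(corpus)
print(p("the", "cat"))  # "the" is a context 3 times and is followed by "cat" twice
```

Any pair that never occurs in the corpus gets probability zero here, which is the sparsity problem that smoothing methods and neural models try to address.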
Perplexity
Perplexity (PPL) is an information-theoretic metric for the quality of a probabilistic model and a standard way to evaluate language models. A lower PPL indicates a better model.
The perplexity of the model $m$ is the exponential of its cross entropy:
$$
\text{Perplexity}(m) = 2^{-\sum_{i=1}^{n} p\left(x_{i}\right) \log_{2} m\left(x_{i}\right)}
$$
Note that PPL depends on the corpus: comparing the PPL of two or more language models is only meaningful when they are evaluated on the same corpus.
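A minimal sketch of the perplexity computation follows. It assumes that $p$ is the empirical distribution of the test data, so each of the $n$ test tokens gets weight $1/n$, and that the model probabilities $m(x_i)$ are already known. The function name and the sample values are hypothetical.

```python
import math

def perplexity(token_probs):
    """token_probs: model probabilities m(x_i) assigned to each of the n test tokens."""
    n = len(token_probs)
    # Cross entropy in bits per token, with p(x_i) = 1/n for every test token
    cross_entropy = -sum(math.log2(q) for q in token_probs) / n
    return 2 ** cross_entropy

uniform = perplexity([0.25, 0.25, 0.25, 0.25])  # model is as unsure as a 4-way uniform guess
sharper = perplexity([0.5, 0.5, 0.5, 0.5])      # model puts more mass on the observed tokens
print(uniform, sharper)
```

A model that assigns probability 0.25 to every observed token has a perplexity of 4, which matches the intuition that perplexity is the effective number of equally likely choices per word.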
Objective Function:
$$
L = \frac{1}{T} \sum_{t} \log f\left(w_{t}, w_{t-1}, \ldots, w_{t-n+2}, w_{t-n+1} ; \theta\right) + R(\theta)
$$
Here $f\left(w_{t}, w_{t-1}, \ldots, w_{t-n+1} ; \theta\right)$ is the model's estimate of $P\left(w_t \mid w_{t-n+1} \cdots w_{t-1}\right)$, $\theta$ denotes all parameters of the model, and $R(\theta)$ is the regularization term.
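The objective can be evaluated with a short sketch once the model's probabilities for the target words are available. The text does not specify the form of $R(\theta)$; the code assumes a weight-decay penalty $R(\theta) = -\lambda \sum \theta^2$, which is negative because the objective is maximized. The name `weight_decay` and all sample values are hypothetical.

```python
import math

def objective(target_probs, params, weight_decay=0.0):
    """Average log-likelihood of the T target words plus R(theta).

    Assumption: R(theta) = -weight_decay * sum(theta^2), a weight-decay penalty.
    """
    T = len(target_probs)
    avg_log_likelihood = sum(math.log(p) for p in target_probs) / T
    regularizer = -weight_decay * sum(w * w for w in params)
    return avg_log_likelihood + regularizer

# Two target words predicted with probabilities 0.5 and 0.25 (toy values)
L_plain = objective([0.5, 0.25], params=[1.0, -2.0])
L_reg = objective([0.5, 0.25], params=[1.0, -2.0], weight_decay=0.1)
print(L_plain, L_reg)
```

In practice a framework usually minimizes the negative of this quantity, so the same penalty appears there with a positive sign.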
Network structure
The task is now to take the $n-1$ words $w_{t-n+1}, \ldots, w_{t-1}$ as input and predict the next word $w_t$.
Mathematical notation:
- $C(i)$: the word vector corresponding to word $i$, where $i$ is the index of the word in the whole vocabulary
- $C$: the word-vector matrix, of size $|V| \times m$
- $|V|$: the size of the vocabulary, i.e., the number of distinct words in the corpus
- $m$: the dimension of the word vector, typically greater than 50
- $H$: the weight of the hidden layer
- $d$: the bias of the hidden layer
- $U$: the weight of the output layer
- $b$: the bias of the output layer
- $W$: the weight of the direct connection from the input layer to the output layer
- $h$: the number of neurons in the hidden layer
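To get a feel for the model size, the sketch below counts the parameters implied by this notation. It assumes the shapes used in the code at the end of this section ($H$ is $(n-1)m \times h$, $U$ is $h \times |V|$, $W$ is $(n-1)m \times |V|$); the concrete sizes passed in are hypothetical.

```python
def nnlm_parameter_count(V, m, n, h):
    """Count the free parameters of the NNLM from the shapes of C, H, d, U, b and W."""
    context = (n - 1) * m   # length of the concatenated input vector
    shapes = {
        "C": V * m,         # one m-dimensional vector per vocabulary word
        "H": context * h,   # input -> hidden weights
        "d": h,             # hidden-layer bias
        "U": h * V,         # hidden -> output weights
        "b": V,             # output-layer bias
        "W": context * V,   # direct input -> output weights
    }
    return sum(shapes.values()), shapes

# Hypothetical sizes: 10,000-word vocabulary, 50-dim vectors, 3 context words, 100 hidden units
total, shapes = nnlm_parameter_count(V=10000, m=50, n=4, h=100)
print(total)  # 3025100
```

The total equals $|V|(1 + nm + h) + h(1 + (n-1)m)$, so the count grows linearly in $|V|$ and in $n$, with the terms that scale with $|V|$ dominating.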
Calculation process:
- First convert the $n-1$ input word indices into word vectors, then concatenate the $n-1$ vectors into a single vector of length $(n-1) \cdot m$, denoted by $X$.
- Send $X$ to the hidden layer for computation: $\operatorname{hidden}_{\text{out}} = \tanh(d + X H)$
- There are $|V|$ nodes in the output layer. Each node $y_i$ is the unnormalized score of word $i$ being the next word, which softmax then turns into a probability. The formula for $y$ is $y = b + X W + \operatorname{hidden}_{\text{out}} U$
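The three steps above can be traced by hand with a dependency-free sketch for a single context. The sizes and weight values are toy assumptions chosen so the result is easy to verify, and `matvec` and `nnlm_scores` are hypothetical helper names.

```python
import math

def matvec(x, M):
    """Row vector x (length r) times matrix M (r x c), returning a list of length c."""
    return [sum(x[i] * M[i][j] for i in range(len(x))) for j in range(len(M[0]))]

def nnlm_scores(indices, C, H, d, U, b, W):
    """Compute y = b + X W + tanh(d + X H) U for one context of n-1 word indices."""
    X = [v for i in indices for v in C[i]]  # concatenate the n-1 word vectors
    hidden_out = [math.tanh(dj + s) for dj, s in zip(d, matvec(X, H))]
    direct = matvec(X, W)                   # direct input -> output connection
    via_hidden = matvec(hidden_out, U)      # hidden -> output connection
    return [bi + di + hi for bi, di, hi in zip(b, direct, via_hidden)]

# Toy sizes (hypothetical): |V| = 3, m = 2, n - 1 = 2, h = 2
C = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
H = [[0.0, 0.0]] * 4    # zero weights make the result easy to check by hand
d = [0.0, 0.0]
U = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
b = [0.1, 0.2, 0.3]
W = [[0.0, 0.0, 0.0]] * 4
y = nnlm_scores([0, 2], C, H, d, U, b, W)
print(y)  # with zero H, d and W the scores reduce to the output bias b
```

The output has one score per vocabulary word; these scores are not yet probabilities.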
Network structure (from bottom to top):
- Input layer: a one-hot vector for each word in the context window
- Projection matrix: the purple dashed line indicates that each word is mapped to its word vector by the projection matrix $C$
- Neural network input layer: the concatenation of the word vectors produced by the projection matrix. Its length is the number of words in the context window multiplied by the word-vector dimension
- Neural network hidden layer: a nonlinear mapping with an activation function such as tanh
- Output layer: softmax normalization, which ensures the probabilities sum to 1
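The softmax normalization in the output layer computes $P(w_t = i \mid \text{context}) = e^{y_i} / \sum_j e^{y_j}$. A minimal sketch, with hypothetical scores, is shown below; subtracting the maximum score is a common numerical-stability trick and does not change the result.

```python
import math

def softmax(scores):
    """Turn output scores y into probabilities that sum to 1."""
    shift = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])  # hypothetical scores for a 3-word vocabulary
print(probs, sum(probs))
```

The ordering of the scores is preserved, so the word with the highest score is also the most probable one.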
Code (Python)
import torch
import torch.nn as nn

class NNLM(nn.Module):
    """NNLM model architecture."""

    def __init__(self, num_words, m, n_steps, n_hidden):
        # num_words = |V|, m = word-vector dimension,
        # n_steps = n - 1 context words, n_hidden = h hidden units
        super().__init__()
        self.n_steps, self.m = n_steps, m
        self.C = nn.Embedding(num_embeddings=num_words, embedding_dim=m)  # word-vector table
        self.d = nn.Parameter(torch.randn(n_hidden))                # hidden-layer bias
        self.H = nn.Parameter(torch.randn(n_steps * m, n_hidden))   # input -> hidden weights
        self.U = nn.Parameter(torch.randn(n_hidden, num_words))     # hidden -> output weights
        self.b = nn.Parameter(torch.randn(num_words))               # output-layer bias
        self.W = nn.Parameter(torch.randn(n_steps * m, num_words))  # input -> output weights

    def forward(self, input):
        '''
        input: [batchsize, n_steps]
        x: [batchsize, n_steps*m]
        hidden_out: [batchsize, n_hidden]
        output: [batchsize, num_words]
        '''
        x = self.C(input)                      # look up word vectors: [batchsize, n_steps, m]
        x = x.view(-1, self.n_steps * self.m)  # concatenate the vectors of each example
        hidden_out = torch.tanh(torch.mm(x, self.H) + self.d)  # hidden-layer output
        # Unnormalized output scores; softmax is applied later, e.g. inside the loss
        output = torch.mm(x, self.W) + torch.mm(hidden_out, self.U) + self.b
        return output