Neural Network Language Model (Concepts + Formulas + Code)

hUaleeF

Last modified: 2022-10-23 21:33:01

Category: NLP Learning Notes. Tags: neural networks, language models, machine learning

First published: 2022-10-23 21:11:17

Article link: https://blog.csdn.net/hua_453/article/details/127480808

Neural Network Language Model

NNLM

Language Model

Models that assign probabilities to sequences of words are called language models. There are two main types of language models:
- Statistical language models: these use traditional statistical techniques such as N-grams, Hidden Markov Models (HMM), and certain linguistic rules to learn the probability distribution of words.
- Neural language models: these use different kinds of neural networks to model language and have surpassed statistical language models in effectiveness.

A language model (LM) is the basis for many natural language processing (NLP) tasks. Early NLP systems were built mainly on manually written rules, which were time-consuming and laborious to create and did not cover many linguistic phenomena. It was not until the 1980s that statistical language models were proposed to assign a probability to a sequence $s$ of $N$ words, i.e.

$\begin{aligned} P(s) &=P\left(w_1 w_2 \cdots w_N\right) \\ &=P\left(w_1\right) P\left(w_2 \mid w_1\right) \cdots P\left(w_N \mid w_1 w_2 \cdots w_{N-1}\right) \end{aligned}$

where $w_i$ denotes the $i$-th word of sequence $s$. The probability of a word sequence thus decomposes into a product of conditional probabilities, each being the probability of a word given the words that precede it (often called the context history, or simply the context).
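
As a small illustration of the chain-rule decomposition, the sketch below scores a three-word sequence by multiplying conditional probabilities. The table `cond_prob` and all of its values are invented for this example and do not come from any corpus.

```python
import math

# Hypothetical conditional probabilities P(word | history); the values are invented.
cond_prob = {
    ((), "the"): 0.5,
    (("the",), "cat"): 0.2,
    (("the", "cat"), "sleeps"): 0.1,
}

def sequence_probability(words):
    """Chain rule: P(w_1..w_N) = prod_i P(w_i | w_1..w_{i-1})."""
    log_p = 0.0
    for i, w in enumerate(words):
        history = tuple(words[:i])
        # Sum log-probabilities instead of multiplying, to avoid underflow on long sequences.
        log_p += math.log(cond_prob[(history, w)])
    return math.exp(log_p)

p = sequence_probability(["the", "cat", "sleeps"])  # 0.5 * 0.2 * 0.1
```

Note that the table needs one entry per distinct history, which is why the full model quickly becomes impractical as sequences grow.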

N-gram

Because the full model above has far too many parameters to learn, an approximation is needed. The N-gram model is the most widely used approximation and was the state of the art before NNLM emerged. A (k+1)-gram model is derived from the k-th order Markov assumption, which states that the current word depends only on the previous k words, namely:

$P\left(w_t \mid w_1 \cdots w_{t-1}\right) \approx P\left(w_t \mid w_{t-k} \cdots w_{t-1}\right)$

We use maximum likelihood estimation to estimate the parameters.
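
For an N-gram model, maximum likelihood estimation reduces to relative frequencies of counts. The following is a minimal sketch on a toy corpus; the function name `train_ngram` and the example sentence are assumptions made for illustration, and no smoothing is applied.

```python
from collections import Counter

def train_ngram(tokens, n=2):
    """MLE estimate: P(w | context) = count(context, w) / count(context)."""
    ngram_counts = Counter()
    context_counts = Counter()
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i:i + n - 1])   # the previous n-1 words
        word = tokens[i + n - 1]               # the word being predicted
        ngram_counts[(context, word)] += 1
        context_counts[context] += 1
    return {key: c / context_counts[key[0]] for key, c in ngram_counts.items()}

tokens = "the cat sat on the mat the cat ran".split()
bigram = train_ngram(tokens, n=2)
p_cat_given_the = bigram[(("the",), "cat")]    # "the" occurs 3 times as context, 2 of them before "cat"
```

Any n-gram that never occurs in the corpus gets no entry here, i.e. an implicit probability of zero, which is one of the weaknesses that smoothing methods and neural models try to address.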

Perplexity

Perplexity (PPL) is an information-theoretic metric for the quality of a probabilistic model and a standard way to evaluate language models. A lower PPL indicates a better model.
The perplexity of the model $m$ is the exponential of its cross entropy:

$\text { Perplexity }(m)=2^{-\sum_{i=1}^{n} p\left(x_{i}\right) \log _{2} m\left(x_{i}\right)}$

Here $p$ is the true distribution of the data and $m$ is the distribution given by the model. Note that PPL depends on the corpus: it is only meaningful for comparing two or more language models on the same corpus.
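
In practice the true distribution is unknown, so PPL is commonly estimated on a test sequence of $N$ words by giving each position a weight of $1/N$. The sketch below assumes that empirical form; the function name `perplexity` and the probability lists are made up for illustration.

```python
import math

def perplexity(model_probs):
    """Per-word perplexity: 2 ** (-(1/N) * sum_i log2 m(w_i | context_i)).

    model_probs holds the probability the model assigned to each
    word that actually occurred in the test sequence.
    """
    n = len(model_probs)
    cross_entropy = -sum(math.log2(p) for p in model_probs) / n
    return 2 ** cross_entropy

uniform = perplexity([0.25, 0.25, 0.25, 0.25])   # model always spreads mass over 4 choices
sharper = perplexity([0.5, 0.5, 0.25, 0.25])     # model is more confident on two positions
```

A model that assigns probability 1/4 to every observed word has a perplexity of 4, which matches the intuition of PPL as an effective branching factor; the more confident model scores lower.
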
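
Objective function:
$L=\frac{1}{T} \sum_{t} \log f\left(w_{t}, w_{t-1}, \ldots, w_{t-n+2}, w_{t-n+1} ; \theta\right)+R(\theta)$

where $f(w_t, \ldots, w_{t-n+1}; \theta)$ is the model's estimate of $P(w_t \mid w_{t-n+1} \cdots w_{t-1})$, $T$ is the number of words in the training corpus, $\theta$ denotes all parameters of the model, and $R(\theta)$ is the regularization term.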

Network structure

[Figure: NNLM network structure]

The task is to take the $n-1$ words $w_{t-n+1}, \ldots, w_{t-1}$ as input and predict the next word $w_t$.
Mathematical notation:
- $C(i)$ : the word vector of word $i$, where $i$ is the index of the word in the whole vocabulary
- $C$ : the word-vector table, a matrix of size $|V| \times m$
- $|V|$ : the size of the vocabulary, i.e., the number of distinct words in the corpus
- $m$ : the dimension of the word vector, typically greater than 50
- $H$ : the weight of the hidden layer, of size $(n-1)m \times h$
- $d$ : the bias of the hidden layer, of size $h$
- $U$ : the weight of the output layer, of size $h \times |V|$
- $b$ : the bias of the output layer, of size $|V|$
- $W$ : the weight of the direct connection from the input layer to the output layer, of size $(n-1)m \times |V|$
- $h$ : the number of neurons in the hidden layer

Calculation process:

- First convert the $n - 1$ input word indices into word vectors, then concatenate the $n - 1$ vectors into a single vector of dimension $(n - 1) \cdot m$, denoted by $X$.
- Send $X$ to the hidden layer for computation: $\operatorname{hidden}_{\text {out }}=\tanh (d+X H)$
- There are $|V|$ nodes in the output layer, and each node $y_i$ is the unnormalized score of word $i$ being the next word. The formula for $y$ is $y=b+X W+\operatorname{hidden}_{\text {out }} U$; a softmax over $y$ then turns the scores into probabilities.
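
The calculation can be traced with concrete shapes. The sketch below uses NumPy rather than PyTorch purely to keep the arithmetic visible; the sizes, the random weights, and the context indices are toy values chosen for illustration. The output formula follows the PyTorch code in this post, which adds the output bias $b$ to the direct term and the hidden term.

```python
import numpy as np

rng = np.random.default_rng(0)
V, m, n, h = 10, 4, 3, 5                  # toy vocab size, embedding dim, n-gram order, hidden units

C = rng.normal(size=(V, m))               # word-vector table
H = rng.normal(size=((n - 1) * m, h))     # input -> hidden weights
d = rng.normal(size=h)                    # hidden-layer bias
U = rng.normal(size=(h, V))               # hidden -> output weights
W = rng.normal(size=((n - 1) * m, V))     # direct input -> output weights
b = rng.normal(size=V)                    # output-layer bias

context = np.array([2, 7])                # indices of the n-1 context words (arbitrary)
X = C[context].reshape(-1)                # look up and concatenate: shape ((n-1)*m,)
hidden_out = np.tanh(d + X @ H)           # shape (h,)
y = b + X @ W + hidden_out @ U            # unnormalized scores: shape (V,)

probs = np.exp(y - y.max())               # softmax, shifted by the max for numerical stability
probs /= probs.sum()
```

After the softmax, `probs[i]` is the model's estimate of the probability that word `i` comes next, and the entries sum to 1.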

Network structure (from bottom to top):

- Input layer: a one-hot vector for each word in the context window
- Projection matrix: the purple dashed line indicates that the words are mapped to word vectors by the projection matrix $C$
- Neural network input layer: the concatenation of the word vectors produced by the projection matrix; the size of the input vector is the number of words in the context window multiplied by the word-vector dimension
- Neural network hidden layer: a nonlinear mapping with an activation function such as tanh
- Output layer: softmax normalization to ensure the probabilities sum to 1
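
Multiplying a one-hot vector by the projection matrix simply selects one row of $C$, which is why implementations typically use an index lookup (as `nn.Embedding` does) instead of an explicit matrix product. A minimal check of this equivalence, with a made-up matrix and word index:

```python
import numpy as np

V, m = 6, 3                                        # toy vocabulary size and word-vector dimension
C = np.arange(V * m, dtype=float).reshape(V, m)    # stand-in projection matrix, not trained weights

word_index = 4
one_hot = np.zeros(V)
one_hot[word_index] = 1.0

projected = one_hot @ C        # matrix product with the one-hot vector
looked_up = C[word_index]      # direct row lookup
same = np.allclose(projected, looked_up)
```

The lookup avoids building the $|V|$-dimensional one-hot vector at all, which matters when the vocabulary is large.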

Code (Python)

import torch
import torch.nn as nn

class NNLM(nn.Module):
  # NNLM model architecture
  def __init__(self, num_words, m, n_steps, n_hidden):
    # num_words: vocabulary size |V|; m: word-vector dimension
    # n_steps: number of context words (n - 1); n_hidden: hidden units h
    super().__init__()
    self.n_steps, self.m = n_steps, m
    self.C = nn.Embedding(num_embeddings=num_words, embedding_dim=m)  # word-vector table
    self.d = nn.Parameter(torch.randn(n_hidden))  # hidden-layer bias
    self.H = nn.Parameter(torch.randn(n_steps * m, n_hidden))  # input-to-hidden weights
    self.U = nn.Parameter(torch.randn(n_hidden, num_words))  # hidden-to-output weights
    self.b = nn.Parameter(torch.randn(num_words))  # output-layer bias
    self.W = nn.Parameter(torch.randn(n_steps * m, num_words))  # direct input-to-output weights

  def forward(self, input):
    '''
    input: [batchsize, n_steps] word indices (LongTensor)
    x: [batchsize, n_steps*m]
    hidden_layer: [batchsize, n_hidden]
    output: [batchsize, num_words] unnormalized scores (apply softmax to get probabilities)
    '''
    x = self.C(input)  # look up the word vectors of one batch: [batchsize, n_steps, m]
    x = x.view(-1, self.n_steps * self.m)  # concatenate the context word vectors
    hidden_out = torch.tanh(torch.mm(x, self.H) + self.d)  # hidden-layer output
    output = torch.mm(x, self.W) + torch.mm(hidden_out, self.U) + self.b  # output-layer scores
    return output
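
The `forward` method expects word indices of shape `[batchsize, n_steps]`. One possible way to prepare such inputs is sketched below in plain Python; the helper `make_batch` and the toy sentences are hypothetical and not part of the original post.

```python
def make_batch(sentences, n_steps):
    """Slide a window over each sentence: n_steps context indices -> next-word index."""
    vocab = sorted({w for s in sentences for w in s.split()})
    word2idx = {w: i for i, w in enumerate(vocab)}
    inputs, targets = [], []
    for s in sentences:
        ids = [word2idx[w] for w in s.split()]
        for t in range(n_steps, len(ids)):
            inputs.append(ids[t - n_steps:t])   # the n_steps words before position t
            targets.append(ids[t])              # the word to predict
    return inputs, targets, word2idx

sentences = ["i like dog", "i love coffee", "i hate milk"]   # toy data
inputs, targets, word2idx = make_batch(sentences, n_steps=2)
```

A typical next step would be to wrap `inputs` and `targets` in `torch.LongTensor` and train with `nn.CrossEntropyLoss`, which applies the softmax normalization internally, so the model can return raw scores.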