Neural Network Language Model (Concepts + Formulas + Code)

NNLM

Language Model

  • Models that assign probabilities to sequences of words are called language models. There are primarily two types of Language Models:

    • Statistical Language Models: These models use traditional statistical techniques like N-grams, Hidden Markov Models (HMM) and certain linguistic rules to learn the probability distribution of words.
    • Neural Language Models: They use different kinds of Neural Networks to model language and have surpassed the statistical language models in their effectiveness.
  • A language model (LM) is the basis of many natural language processing (NLP) tasks. Early NLP systems were built mainly on manually written rules, which were time-consuming and laborious and did not cover many linguistic phenomena. It was not until the 1980s that statistical language models were proposed, which assign a probability to a sequence s of N words, i.e.

$$
\begin{aligned}
P(s) &= P\left(w_1 w_2 \cdots w_N\right) \\
&= P\left(w_1\right) P\left(w_2 \mid w_1\right) \cdots P\left(w_N \mid w_1 w_2 \cdots w_{N-1}\right)
\end{aligned}
$$

  • where $w_i$ stands for the $i$-th word in the sequence $s$. The probability of a word sequence can thus be decomposed into the product of the conditional probabilities of each word given the words that precede it (often called the context history, or simply the context), as illustrated below.
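For example, for the hypothetical three-word sentence "the cat sat" (an arbitrary example, not taken from the original text), the decomposition above reads:

$$
P(\text{the cat sat}) = P(\text{the}) \, P(\text{cat} \mid \text{the}) \, P(\text{sat} \mid \text{the, cat})
$$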

N-gram

  • Since the number of parameters in the above model is too large to learn, an approximation method has to be adopted. The N-gram model is the most widely used approximation and was the most advanced model before the emergence of the NNLM. A (k+1)-gram model is derived from the k-th order Markov assumption, which states that the current word depends only on the previous k words, namely:

$$
P\left(w_t \mid w_1 \cdots w_{t-1}\right) \approx P\left(w_t \mid w_{t-k} \cdots w_{t-1}\right)
$$

  • We use maximum likelihood estimation to estimate the parameters; the count-based estimate is sketched below.
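Concretely, the maximum likelihood estimate of an N-gram probability reduces to relative counts over the training corpus. This is the standard count formula, written out here for completeness rather than quoted from the article:

$$
\hat{P}\left(w_t \mid w_{t-k} \cdots w_{t-1}\right)=\frac{\operatorname{count}\left(w_{t-k} \cdots w_{t-1} w_t\right)}{\operatorname{count}\left(w_{t-k} \cdots w_{t-1}\right)}
$$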

Perplexity

  • Perplexity (PPL) is an information-theoretic metric used to measure the quality of a probabilistic model, and it is a standard way to evaluate language models. A lower PPL indicates a better model.

  • The perplexity of a model $m$ is 2 raised to the power of its cross entropy:

$$
\text{Perplexity}(m)=2^{-\sum_{i=1}^{n} p\left(x_{i}\right) \log _{2} m\left(x_{i}\right)}
$$

  • It is worth noting that PPL is corpus-dependent: it should be used to compare two or more language models evaluated on the same corpus (see the computation sketch after this list).

  • Objective function:

    $$
    L=\frac{1}{T} \sum_{t} \log f\left(w_{t}, w_{t-1}, \ldots, w_{t-n+2}, w_{t-n+1} ; \theta\right)+R(\theta)
    $$

    where $\theta$ denotes all parameters of the model, $f$ is the probability the model assigns to $w_t$ given the preceding $n-1$ words, and $R(\theta)$ is the regularization term.
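Returning to the PPL definition above, here is a minimal sketch of the computation. The toy unigram "model" and the held-out token list are made-up values chosen only for illustration, not taken from the article:

```python
import math

# Toy "model": assumed unigram probabilities (hypothetical values for illustration)
model = {"the": 0.4, "cat": 0.3, "sat": 0.2, "down": 0.1}

# Held-out token sequence (also hypothetical)
tokens = ["the", "cat", "sat", "down", "the", "cat"]

# Cross entropy in bits: average negative log2 probability assigned to the tokens
cross_entropy = -sum(math.log2(model[t]) for t in tokens) / len(tokens)

# Perplexity is 2 raised to the cross entropy
perplexity = 2 ** cross_entropy
print(f"cross entropy = {cross_entropy:.3f} bits, perplexity = {perplexity:.3f}")
```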


Network structure

[Figure: NNLM network structure]

  • The task is to take the previous $n-1$ words $w_{t-n+1}, \ldots, w_{t-1}$ as input and predict the next word $w_t$.

  • Mathematical notation:

    • $C(i)$: the word vector corresponding to word $i$, where $i$ is the index of the word in the vocabulary
    • $C$: the word-vector (embedding) matrix, of size $|V| \times m$
    • $|V|$: the size of the vocabulary, i.e., the number of distinct (de-duplicated) words in the corpus
    • $m$: the dimension of the word vectors, typically greater than 50
    • $H$: the weights of the hidden layer
    • $d$: the bias of the hidden layer
    • $U$: the weights of the hidden-to-output connection
    • $b$: the bias of the output layer
    • $W$: the weights of the direct input-to-output connection
    • $h$: the number of neurons in the hidden layer

Calculation process:

  1. First convert the $n-1$ input word indices into word vectors, then concatenate the $n-1$ vectors into a single vector of length $(n-1) \cdot m$ (a $[\text{batch\_size}, (n-1) \cdot m]$ matrix for a batch), denoted by $X$.
  2. Feed $X$ to the hidden layer: $\text{hidden}_{\text{out}} = \tanh(d + XH)$
  3. The output layer has $|V|$ nodes; each node $y_i$ is the score for word $i$ being the next word (turned into a probability by the softmax), where $y = b + XW + \text{hidden}_{\text{out}}\, U$, as written out below.
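Putting step 3 together with the softmax normalization mentioned in the structure list below, the predicted distribution over the next word is (this is the standard NNLM output equation, written out here for clarity):

$$
P\left(w_t = i \mid w_{t-n+1}, \ldots, w_{t-1}\right) = \frac{e^{y_i}}{\sum_{j=1}^{|V|} e^{y_j}}, \qquad y = b + XW + \tanh(d + XH)\, U
$$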

Network structure (from bottom to top):

  • Input layer: a one-hot vector for each word in the context window
  • Projection matrix: the purple dashed lines in the figure indicate that the words are mapped to word vectors by the projection matrix C
  • Neural network input layer: the concatenation of the word vectors produced by the projection matrix; the size of the input vector is the number of words in the context window multiplied by the dimension of the word vectors
  • Neural network hidden layer: a nonlinear mapping with an activation function such as tanh
  • Output layer: softmax normalization to ensure the probabilities sum to 1 (a concrete dimension example is given below)
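To make the dimensions concrete, suppose (hypothetical values chosen only for illustration) there are $n-1 = 2$ context words, word vectors of dimension $m = 10$, $h = 16$ hidden units, and a vocabulary of $|V| = 7$ words. Then, for a single example:

$$
X \in \mathbb{R}^{1 \times 20}, \quad H \in \mathbb{R}^{20 \times 16}, \quad U \in \mathbb{R}^{16 \times 7}, \quad W \in \mathbb{R}^{20 \times 7}, \quad y \in \mathbb{R}^{1 \times 7}
$$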

Code (Python)

import torch
import torch.nn as nn

class NNLM(nn.Module):
  # NNLM model architecture
  def __init__(self):
    super(NNLM, self).__init__()
    self.C = nn.Embedding(num_embeddings = num_words, embedding_dim = m)  # embedding table C: maps word indices to word vectors
    self.d = nn.Parameter(torch.randn(n_hidden).type(dtype))  # hidden layer bias d
    self.H = nn.Parameter(torch.randn(n_steps * m, n_hidden).type(dtype))  # input-to-hidden weights H
    self.U = nn.Parameter(torch.randn(n_hidden, num_words).type(dtype))  # hidden-to-output weights U
    self.b = nn.Parameter(torch.randn(num_words).type(dtype))  # output layer bias b
    self.W = nn.Parameter(torch.randn(n_steps * m, num_words).type(dtype))  # direct input-to-output weights W

  def forward(self, input):
    '''
    input: [batch_size, n_steps]
    x: [batch_size, n_steps * m]
    hidden_out: [batch_size, n_hidden]
    output: [batch_size, num_words]
    '''
    x = self.C(input)  # look up the word vectors for a batch: [batch_size, n_steps, m]
    x = x.view(-1, n_steps * m)  # concatenate the n_steps word vectors of each sample
    hidden_out = torch.tanh(torch.mm(x, self.H) + self.d)  # hidden layer output: tanh(x·H + d)
    output = torch.mm(x, self.W) + torch.mm(hidden_out, self.U) + self.b  # unnormalized scores y = x·W + hidden_out·U + b (softmax is applied in the loss)
    return output
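For completeness, here is a minimal sketch of how the class above could be instantiated and trained. The hyperparameters, the toy corpus, and the batch-construction helper are hypothetical choices made for illustration; they assume the class and imports above are in scope and are not taken from the original article.

```python
import torch
import torch.nn as nn
import torch.optim as optim

# Hypothetical hyperparameters (the globals the NNLM class expects)
n_steps = 2        # number of context words (n - 1)
m = 10             # word vector dimension
n_hidden = 16      # hidden layer size
dtype = torch.FloatTensor

# Toy corpus: predict the last word of each sentence from the first two
sentences = ["i like coffee", "i love milk", "i hate tea"]
vocab = list({w for s in sentences for w in s.split()})
word2idx = {w: i for i, w in enumerate(vocab)}
num_words = len(vocab)

def make_batch():
    # Inputs are the first n_steps word indices, the target is the last word index
    inputs = [[word2idx[w] for w in s.split()[:-1]] for s in sentences]
    targets = [word2idx[s.split()[-1]] for s in sentences]
    return torch.LongTensor(inputs), torch.LongTensor(targets)

model = NNLM()
criterion = nn.CrossEntropyLoss()          # combines log-softmax and negative log-likelihood
optimizer = optim.Adam(model.parameters(), lr=0.01)

inputs, targets = make_batch()
for epoch in range(1000):
    optimizer.zero_grad()
    output = model(inputs)                 # [batch_size, num_words]
    loss = criterion(output, targets)
    loss.backward()
    optimizer.step()

# Predict the next word for each context
pred = model(inputs).argmax(dim=1)
print([vocab[i] for i in pred.tolist()])
```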
