Chapter 7 Neural Networks and Neural Language Models

Reading notes on Speech and Language Processing, 3rd edition (draft)

McCulloch-Pitts neuron (McCulloch and Pitts, 1943)

feed-forward network: the computation proceeds iteratively from one layer of units to the next.

deep learning, because modern networks are often deep (have many layers)

7.1 Units

The building block of a neural network is a single computational unit. A unit takes a set of real valued numbers as input, performs some computation on them, and produces an output.

Given an input vector $x$, a unit has a weight vector $w$ and a scalar bias $b$, so the weighted sum $z$ can be represented as:

$$z = w \cdot x + b$$

Finally, instead of using $z$, a linear function of $x$, as the output, neural units apply a non-linear function $f$ to $z$. We will refer to the output of this function as the activation value for the unit, $a$. Since we are just modeling a single unit, the activation for the node is in fact the final output of the network, which we'll generally call $y$. So the value $y$ is defined as:

$$y = a = f(z)$$
Three popular non-linear functions $f(\cdot)$ are the sigmoid, the tanh, and the rectified linear unit (ReLU):

$$y = \sigma(z) = \frac{1}{1+e^{-z}}$$

$$y = \tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}$$

$$y = \max(z, 0)$$

These activation functions have different properties that make them useful for different language applications or network architectures. For example the rectifier function has nice properties that result from it being very close to linear. In the sigmoid or tanh functions, very high values of z z z result in values of y y y that are saturated, i.e., extremely close to 1, which causes problems for learning. Rectifiers don’t have this problem, since the output of values close to 1 also approaches 1 in a nice gentle linear way. By contrast, the tanh function has the nice properties of being smoothly differentiable and mapping outlier values toward the mean.
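
As a concrete illustration, here is a minimal NumPy sketch of a single sigmoid unit; the weights, bias, and input values are arbitrary illustration numbers, not taken from any particular figure.

```python
import numpy as np

def sigmoid(z):
    """Sigmoid activation: squashes z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def unit_output(x, w, b, activation=sigmoid):
    """Compute a single unit's activation a = f(w·x + b)."""
    z = np.dot(w, x) + b          # weighted sum of inputs plus bias
    return activation(z)          # apply the non-linearity

x = np.array([0.5, 0.6, 0.1])     # example input
w = np.array([0.2, 0.3, 0.9])     # example weights
b = 0.5                           # example bias
print(unit_output(x, w, b))       # sigmoid(0.87) ≈ 0.70
```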

7.2 The XOR problem

It turns out, however, that it’s not possible to build a perceptron to compute logical XOR!

The intuition behind this important result relies on understanding that a perceptron is a linear classifier.

We say that XOR is not a linearly separable function.

7.2.1 The solution: neural networks

While the XOR function cannot be calculated by a single perceptron, it can be calculated by a layered network of units. Goodfellow et al. (2016) compute XOR using two layers of ReLU-based units.
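
As a sketch of that construction, the following uses one standard set of weights (following the construction given by Goodfellow et al. 2016): two ReLU hidden units followed by a single linear output unit.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0)

W = np.array([[1, 1],        # weights of the two hidden ReLU units
              [1, 1]])
c = np.array([0, -1])        # hidden-unit biases
u = np.array([1, -2])        # weights of the linear output unit

def xor(x):
    h = relu(W @ x + c)      # hidden layer
    return int(u @ h)        # output: 0 or 1

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, xor(np.array(x)))   # prints 0, 1, 1, 0
```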

7.3 Feed-Forward Neural Networks

A feed-forward network is a multilayer network in which the units are connected with no cycles; the outputs from units in each layer are passed to units in the next higher layer, and no outputs are passed back to lower layers. (In Chapter 9 we’ll introduce networks with cycles, called recurrent neural networks).

Simple feed-forward networks have three kinds of nodes: input units, hidden units, and output units.

The core of the neural network is the hidden layer formed of hidden units, each of which is a neural unit, taking a weighted sum of its inputs and then applying a non-linearity. In the standard architecture, each layer is fully-connected, meaning that each unit in each layer takes as input the outputs from all the units in the previous layer, and there is a link between every pair of units from two adjacent layers. Thus each hidden unit sums over all the input units.

We represent the parameters for the entire hidden layer by combining the weight vector $w_i$ and bias $b_i$ for each hidden unit $h_i$ into a single weight matrix $W$ and a single bias vector $b$ for the whole layer. Each element $W_{ij}$ of the weight matrix $W$ represents the weight of the connection from the $j$th input unit $x_j$ to the $i$th hidden unit $h_i$.

The advantage of using a single matrix $W$ for the weights of the entire layer is that hidden layer computation for a feedforward network can be done very efficiently with simple matrix operations. In fact, the computation only has three steps: multiplying the weight matrix $W$ by the input vector $x$, adding the bias vector $b$, and applying the activation function $g$ (such as the sigmoid, tanh, or ReLU activation function defined above).

The output of the hidden layer, the vector $h$, is thus the following, using the sigmoid function $\sigma$:

$$h = \sigma(Wx + b)$$

Let's introduce some constants to represent the dimensionalities of these vectors and matrices. We'll refer to the input layer as layer 0 of the network, and use $n_0$ to represent the dimension of an input, so $x$ is a vector of real numbers of dimension $n_0$, or more formally $x \in \mathbb{R}^{n_0}$. Let's call the hidden layer layer 1 and the output layer layer 2. The hidden layer has dimensionality $n_1$, so $h \in \mathbb{R}^{n_1}$ and also $b \in \mathbb{R}^{n_1}$ (since each hidden unit can take a different bias value). And the weight matrix $W$ has dimensionality $W \in \mathbb{R}^{n_1 \times n_0}$.

$$h_i = w_i \cdot x + b_i = \sum_{j=1}^{n_0} w_{ij} x_j + b_i$$

Like the hidden layer, the output layer has a weight matrix (let's call it $U$), but output layers may not have a bias vector $b$, so we'll simplify by eliminating the bias vector in this example. The weight matrix is multiplied by its input vector ($h$) to produce the intermediate output $z$:

$$z = Uh$$

There are $n_2$ output nodes, so $z \in \mathbb{R}^{n_2}$, the weight matrix $U$ has dimensionality $U \in \mathbb{R}^{n_2 \times n_1}$, and element $U_{ij}$ is the weight from unit $h_j$ in the hidden layer to unit $i$ in the output layer.

However, $z$ can't be the output of the classifier, since it's a vector of real-valued numbers, while what we need for classification is a vector of probabilities. There is a convenient function for normalizing a vector of real values, by which we mean converting it to a vector that encodes a probability distribution (all the numbers lie between 0 and 1 and sum to 1): the softmax function. For a vector $z$ of dimensionality $d$, the softmax is defined as:

$$\textrm{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^d e^{z_j}}$$
A neural network is like logistic regression, but (a) with many layers, since a deep neural network is like layer after layer of logistic regression classifiers, and (b) rather than forming the features by feature templates, the prior layers of the network induce the feature representations themselves.
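
Putting the pieces together, here is a minimal NumPy sketch of the full two-layer forward pass; the layer sizes and random weights are arbitrary illustration values.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    z = z - z.max()                 # subtract max for numerical stability
    return np.exp(z) / np.exp(z).sum()

n0, n1, n2 = 4, 3, 2                # input, hidden, and output dimensions
rng = np.random.default_rng(0)
W = rng.normal(size=(n1, n0))       # hidden-layer weights
b = np.zeros(n1)                    # hidden-layer bias
U = rng.normal(size=(n2, n1))       # output-layer weights (no bias here)

x = rng.normal(size=n0)             # an input vector
h = sigmoid(W @ x + b)              # h = sigmoid(Wx + b)
z = U @ h                           # intermediate output z = Uh
y = softmax(z)                      # probability distribution over n2 classes
print(y, y.sum())                   # the probabilities sum to 1
```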

Let's now set up some notation to make it easier to talk about deeper networks of depth more than 2. We'll use superscripts in square brackets to mean layer numbers, starting at 0 for the input layer. So $W^{[1]}$ will mean the weight matrix for the (first) hidden layer, and $b^{[1]}$ will mean the bias vector for the (first) hidden layer. $n_j$ will mean the number of units at layer $j$. We'll use $g(\cdot)$ to stand for the activation function, which will tend to be ReLU or tanh for intermediate layers and softmax for output layers. We'll use $a^{[i]}$ to mean the output from layer $i$, and $z^{[i]}$ to mean the combination of weights and biases $W^{[i]} a^{[i-1]} + b^{[i]}$. The 0th layer is for inputs, so the inputs $x$ we'll refer to more generally as $a^{[0]}$.

Thus we’ll represent a 3-layer net as follows:
$$
\begin{aligned}
z^{[1]} &= W^{[1]} a^{[0]} + b^{[1]} \\
a^{[1]} &= g^{[1]}(z^{[1]}) \\
z^{[2]} &= W^{[2]} a^{[1]} + b^{[2]} \\
a^{[2]} &= g^{[2]}(z^{[2]}) \\
\hat{y} &= a^{[2]}
\end{aligned}
$$
The activation functions $g(\cdot)$ are generally different at the final layer. Thus $g^{[2]}$ might be softmax for multinomial classification or sigmoid for binary classification, while ReLU or tanh might be the activation function $g(\cdot)$ at the internal layers.
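
With this layer-indexed notation the forward pass becomes a simple loop; a sketch, assuming ReLU for internal layers and softmax at the output:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0)

def softmax(z):
    z = z - z.max()
    return np.exp(z) / np.exp(z).sum()

def forward(x, weights, biases):
    """Forward pass for a deep feedforward net.
    weights[i-1], biases[i-1] hold W[i] and b[i] for layers i = 1..L."""
    a = x                                      # a[0] is the input
    for i, (W, b) in enumerate(zip(weights, biases), start=1):
        z = W @ a + b                          # z[i] = W[i] a[i-1] + b[i]
        g = softmax if i == len(weights) else relu
        a = g(z)                               # a[i] = g[i](z[i])
    return a                                   # the estimate y-hat
```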

7.4 Training Neural Nets

First, we'll need a loss function that models the distance between the system output and the gold output, and it's common to use the loss used for logistic regression, the cross-entropy loss. Second, to find the parameters that minimize this loss function, we'll use the gradient descent optimization algorithm. Third, gradient descent requires knowing the gradient of the loss function, the vector that contains the partial derivative of the loss function with respect to each of the parameters. For networks with more than one layer, computing this gradient requires the algorithm called error back-propagation or reverse differentiation.

7.4.1 Loss function

Let $y$ be a vector over the $K$ classes representing the true output probability distribution. The cross-entropy loss here is

$$L_{CE}(\hat{y}, y) = -\sum_{i=1}^K y_i \log \hat{y}_i$$

We can simplify this equation further. Assume this is a hard classification task, meaning that only one class is the correct one, and that there is one output unit in $y$ for each class. If the true class is $i$, then $y$ is a vector where $y_i = 1$ and $y_j = 0, \forall j \neq i$. A vector like this, with one value equal to 1 and the rest 0, is called a one-hot vector. Now let $\hat{y}$ be the vector output from the network. The sum in the equation above will be 0 except for the true class. Hence the cross-entropy loss is simply the negative log probability of the correct class, and we therefore also call it the negative log likelihood loss:

$$L_{CE}(\hat{y}, y) = -\log \hat{y}_i = -\log \frac{e^{z_i}}{\sum_{j=1}^K e^{z_j}}$$
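
Under the one-hot simplification the loss reduces to looking up the probability the network assigned to the correct class; a small sketch:

```python
import numpy as np

def cross_entropy_loss(y_hat, true_class):
    """Negative log likelihood of the correct class under the
    predicted distribution y_hat (one-hot ground truth assumed)."""
    return -np.log(y_hat[true_class])

y_hat = np.array([0.1, 0.7, 0.2])      # softmax output over K = 3 classes
print(cross_entropy_loss(y_hat, 1))    # -log 0.7 ≈ 0.357
```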

7.4.2 Computing the Gradient

For a network with one weight layer and sigmoid output (which is what logistic regression is),

$$\frac{\partial L_{CE}(w,b)}{\partial w_j} = (\hat{y} - y)x_j = (\sigma(w \cdot x + b) - y)x_j$$

Or for a network with one weight layer and softmax output (which is multinomial logistic regression),

$$\frac{\partial L_{CE}}{\partial w_k} = (1\{y=k\} - p(y=k|x))x_k = \left(1\{y=k\} - \frac{e^{w_k \cdot x + b_k}}{\sum_{j=1}^K e^{w_j \cdot x + b_j}}\right)x_k$$
But these derivatives only give correct updates for one weight layer: the last one! For deep networks, computing the gradients for each weight is much more complex, since we are computing the derivative with respect to weight parameters that appear all the way back in the very early layers of the network, even though the loss is computed only at the very end of the network.

The solution to computing this gradient is an algorithm called error backpropagation or backprop (Rumelhart et al., 1986). While backprop was invented specially for neural networks, it turns out to be the same as a more general procedure called backward differentiation, which depends on the notion of computation graphs.

7.4.3 Computation Graphs

A computation graph is a representation of the process of computing a mathematical expression, in which the computation is broken down into separate operations, each of which is modeled as a node in a graph. In a computation graph, each node represents an operation, and each directed edge passes the output of one operation along as the input to the next.

7.4.4 Backward differentiation on computation graphs

The importance of the computation graph comes from the backward pass, which is used to compute the derivatives that we’ll need for the weight update. Backwards differentiation makes use of the chain rule in calculus. In the backward pass, we compute each of these partials along each edge of the graph from right to left, multiplying the necessary partials to result in the final derivative we need. At each node we need to compute the local partial derivative with respect to the parent, multiply it by the partial derivative that is being passed down from the parent, and then pass it to the child.
The derivatives of the activation functions that we'll need are:

$$\frac{d\sigma(z)}{dz} = \sigma(z)(1 - \sigma(z))$$

$$\frac{d\tanh(z)}{dz} = 1 - \tanh^2(z)$$

$$\frac{d\,\mathrm{ReLU}(z)}{dz} = \begin{cases} 0 & \text{for } z < 0 \\ 1 & \text{for } z \ge 0 \end{cases}$$
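
As a small illustration (using the expression $L = c\,(a + 2b)$, with example values chosen here for the sketch), the backward pass just multiplies local partial derivatives from right to left:

```python
# Forward pass through the graph for L = c * (a + 2b)
a, b, c = 3.0, 1.0, -2.0
d = 2 * b                    # node d = 2b
e = a + d                    # node e = a + d
L = c * e                    # output node L = c * e   (L = -10)

# Backward pass: chain rule, right to left
dL_dL = 1.0
dL_de = c * dL_dL            # dL/de = c
dL_dc = e * dL_dL            # dL/dc = e
dL_da = 1.0 * dL_de          # de/da = 1, so dL/da = dL/de
dL_dd = 1.0 * dL_de          # de/dd = 1
dL_db = 2.0 * dL_dd          # dd/db = 2

print(dL_da, dL_db, dL_dc)   # -2.0 -4.0 5.0
```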

7.4.5 More details on learning

Optimization in neural networks is a non-convex optimization problem, more complex than for logistic regression, and for that and other reasons there are many best practices for successful learning.

For logistic regression we can initialize gradient descent with all the weights and biases having the value 0. In neural networks, by contrast, we need to initialize the weights with small random numbers. It’s also helpful to normalize the input values to have 0 mean and unit variance.
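
A brief sketch of these two practices (small random weight initialization and input normalization); the sizes and distributions are arbitrary illustration choices:

```python
import numpy as np

rng = np.random.default_rng(0)

# Small random initial weights; all-zero weights would make every
# hidden unit compute the same function and receive the same gradient.
n0, n1 = 100, 50
W = rng.normal(0.0, 0.01, size=(n1, n0))
b = np.zeros(n1)                          # biases can safely start at zero

# Normalize each input feature to zero mean and unit variance.
X = rng.uniform(0, 10, size=(1000, n0))   # fake training inputs
X_norm = (X - X.mean(axis=0)) / X.std(axis=0)
```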

Various forms of regularization are used to prevent overfitting. One of the most important is dropout: randomly dropping some units and their connections from the network during training (Hinton et al. 2012, Srivastava et al. 2014).
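
One common way to implement this is "inverted" dropout, which rescales the surviving units at training time so nothing needs to change at test time; a minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(h, p_drop=0.5, train=True):
    """Inverted dropout: zero out units with probability p_drop during
    training and rescale the survivors; do nothing at test time."""
    if not train:
        return h
    mask = (rng.random(h.shape) >= p_drop) / (1.0 - p_drop)
    return h * mask

h = rng.normal(size=8)
print(dropout(h))            # roughly half the units zeroed, rest scaled up
```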

Hyperparameter tuning is also important. The parameters of a neural network are the weights $W$ and biases $b$; those are learned by gradient descent. The hyperparameters are things that are set by the algorithm designer and not learned in the same way, although they must be tuned. Hyperparameters include the learning rate $\eta$, the minibatch size, the model architecture (the number of layers, the number of hidden nodes per layer, the choice of activation functions), how to regularize, and so on. Gradient descent itself also has many architectural variants such as Adam (Kingma and Ba, 2015).

Finally, most modern neural networks are built using computation graph formalisms that make all the work of gradient computation and parallelization onto vector-based GPUs (Graphics Processing Units) very easy and natural. PyTorch (Paszke et al., 2017) and TensorFlow (Abadi et al., 2015) are two of the most popular.

7.5 Neural Language Models

Neural net-based language models turn out to have many advantages over the n-gram language models of Chapter 3. Among these are that neural language models don't need smoothing, they can handle much longer histories, and they can generalize over contexts of similar words. For a training set of a given size, a neural language model has much higher predictive accuracy than an n-gram language model. Furthermore, neural language models underlie many of the models we'll introduce for tasks like machine translation, dialog, and language generation.

On the other hand, there is a cost for this improved performance: neural net language models are strikingly slower to train than traditional language models, and so for many tasks an n-gram language model is still the right tool.

In this chapter we’ll describe simple feedforward neural language models, first introduced by Bengio et al. (2003). Modern neural language models are generally not feedforward but recurrent, using the technology that we will introduce in Chapter 9.

A feedforward neural LM is a standard feedforward network that takes as input at time $t$ a representation of some number of previous words ($w_{t-1}$, $w_{t-2}$, etc.) and outputs a probability distribution over possible next words. Thus, like the n-gram LM, the feedforward neural LM approximates the probability of a word given the entire prior context $P(w_t|w_1^{t-1})$ by approximating based on the $N$ previous words:

$$P(w_t|w_1^{t-1}) \approx P(w_t|w_{t-N+1}^{t-1})$$

In the following examples we'll use a 4-gram example, so we'll show a net to estimate the probability $P(w_t = i|w_{t-1}, w_{t-2}, w_{t-3})$.

7.5.1 Embeddings

In neural language models, the prior context is represented by embeddings of the previous words. Representing the prior context as embeddings, rather than by exact words as used in n-gram language models, allows neural language models to generalize to unseen data much better than n-gram language models.

Let's see how this works in practice. Let's assume we have an embedding dictionary $E$ that gives us, for each word in our vocabulary $V$, the embedding for that word, perhaps precomputed by an algorithm like word2vec from Chapter 6.

Fig. 7.12 shows a sketch of this simplified feedforward neural LM with $N=3$; we have a moving window at time $t$ with an embedding vector representing each of the 3 previous words (words $w_{t-1}$, $w_{t-2}$, and $w_{t-3}$). These 3 vectors are concatenated together to produce $x$, the input layer of a neural network whose output is a softmax with a probability distribution over words. Thus $y_{42}$, the value of output node 42, is the probability of the next word $w_t$ being $V_{42}$, the vocabulary word with index 42.

The model shown in Fig. 7.12 is quite sufficient, assuming we learn the embeddings separately by a method like the word2vec methods of Chapter 6. The method of using another algorithm to learn the embedding representations that we use as input words is called pretraining. If those pretrained embeddings are sufficient for your purposes, then this is all you need.

However, often we’d like to learn the embeddings simultaneously with training the network. This is true when whatever task the network is designed for (sentiment classification, or translation, or parsing) places strong constraints on what makes a good representation.

Let’s therefore show an architecture that allows the embeddings to be learned. To do this, we’ll add an extra layer to the network, and propagate the error all the way back to the embedding vectors, starting with embeddings with random values and slowly moving toward sensible representations.

For this to work at the input layer, instead of pretrained embeddings, we're going to represent each of the $N$ previous words as a one-hot vector of length $|V|$, i.e., a one-hot vector with one dimension for each word in the vocabulary.

Fig. 7.13 shows the additional layers needed to learn the embeddings during LM training. Here the $N=3$ context words are represented as 3 one-hot vectors, fully connected to the embedding layer via 3 instantiations of the embedding matrix $E$. Note that we don't want to learn separate weight matrices for mapping each of the 3 previous words to the projection layer; we want one single embedding dictionary $E$ that's shared among these three. That's because over time, many different words will appear as $w_{t-2}$ or $w_{t-1}$, and we'd like to just represent each word with one vector, whichever context position it appears in. The embedding weight matrix $E$ thus has a row for each word, each a vector of $d$ dimensions (the dimension of an embedding), and hence has dimensionality $|V| \times d$.

Let’s walk through the forward pass of Fig. 7.13.

  1. Select three embeddings from E: Given the three previous words, we look up their indices, create 3 one-hot vectors, and then multiply each by the embedding matrix $E$. Consider $w_{t-3}$. The one-hot vector for 'the' (index 35) is multiplied by the embedding matrix $E$, to give the first part of the first hidden layer, called the projection layer. Since each row of the projection layer $e$ is just an embedding for a word, and the input is a one-hot column vector $x_i$ for word $V_i$, the projection layer for input $x_i$ will be $Ex_i = e_i$, the embedding for word $V_i$. We now concatenate the three embeddings for the context words.
  2. Multiply by W: We now multiply by $W$ (and add $b$) and pass through the rectified linear (or other) activation function to get the hidden layer $h$.
  3. Multiply by U: $h$ is now multiplied by $U$.
  4. Apply softmax: After the softmax, each node $y_i$ in the output layer estimates the probability $P(w_t = i|w_{t-1}, w_{t-2}, w_{t-3})$.

In summary, if we use $e$ to represent the projection layer, formed by concatenating the 3 embeddings for the three context words, the equations for a neural language model become:

$$
\begin{aligned}
e &= (Ex_1, Ex_2, Ex_3) \\
h &= \sigma(We + b) \\
z &= Uh \\
\hat{y} &= \mathrm{softmax}(z)
\end{aligned}
$$
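
A minimal NumPy sketch of this forward pass. The vocabulary, embedding, and hidden sizes, the word indices, and the storage of $E$ with one column per word (so that $E x_i$ picks out an embedding) are illustration choices for the sketch, not the book's exact setup.

```python
import numpy as np

V, d, dh, N = 10000, 50, 100, 3          # vocab, embedding, hidden sizes; context length
rng = np.random.default_rng(0)
E = rng.normal(0.0, 0.1, size=(d, V))    # embedding matrix, one column per word
W = rng.normal(0.0, 0.1, size=(dh, N * d))
b = np.zeros(dh)
U = rng.normal(0.0, 0.1, size=(V, dh))

def one_hot(i):
    x = np.zeros(V)
    x[i] = 1.0
    return x

def softmax(z):
    z = z - z.max()
    return np.exp(z) / np.exp(z).sum()

def predict_next(context_ids):
    """Distribution over the next word given the N previous word ids."""
    e = np.concatenate([E @ one_hot(i) for i in context_ids])  # projection layer
    h = np.maximum(W @ e + b, 0)                               # ReLU hidden layer
    return softmax(U @ h)                                      # z = Uh, then softmax

y_hat = predict_next([35, 112, 7])       # hypothetical word indices for the 3 context words
print(y_hat.shape, y_hat.sum())          # (10000,) 1.0
```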

7.5.2 Training the neural language model

To train the model, i.e. to set all the parameters $\theta = E, W, U, b$, we do gradient descent, using error backpropagation on the computation graph to compute the gradient.

Generally, training proceeds by taking as input a very long text, concatenating all the sentences, starting with random weights, and then iteratively moving through the text predicting each word $w_t$. At each word $w_t$, the cross-entropy (negative log likelihood) loss is:

$$L = -\log p(w_t | w_{t-1}, \ldots, w_{t-n+1})$$

The parameters are then updated by gradient descent using the gradient of this loss:

$$\theta_{t+1} = \theta_t - \eta \frac{\partial\, [-\log p(w_t | w_{t-1}, \ldots, w_{t-n+1})]}{\partial \theta}$$
Training the parameters to minimize loss will result not only in an algorithm for language modeling (a word predictor) but also in a new set of embeddings (the matrix $E$) that can be used as word representations for other tasks.

7.6 Summary

  • Neural networks are built out of neural units, originally inspired by human neurons but now simply an abstract computational device.
  • Each neural unit multiplies input values by a weight vector, adds a bias, and then applies a non-linear activation function like sigmoid, tanh, or rectified linear.
  • In a fully-connected, feedforward network, each unit in layer $i$ is connected to each unit in layer $i+1$, and there are no cycles.
  • The power of neural networks comes from the ability of early layers to learn representations that can be utilized by later layers in the network.
  • Neural networks are trained by optimization algorithms like gradient descent.
  • Error back propagation, backward differentiation on a computation graph, is used to compute the gradients of the loss function for a network.
  • Neural language models use a neural network as a probabilistic classifier, to compute the probability of the next word given the previous $n$ words.
  • Neural language models can use pretrained embeddings, or can learn embeddings from scratch in the process of language modeling.