PRML Chapter 5 Neural Networks
5.1 Feed-forward Network Functions
A network with one hidden layer takes the form:
$$y_{k}(x, w) = \sigma\left( \sum_{j=1}^{M}w_{kj}^{(2)}\, h\left(\sum_{i=1}^{D}w_{ji}^{(1)}x_{i} + w_{j0}^{(1)}\right) + w_{k0}^{(2)} \right)$$
where the superscripts $(1)$ and $(2)$ index the layer (they are not powers), $h$ is the hidden-unit activation function and $\sigma$ is the output activation function.
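The forward pass above can be sketched in a few lines of plain Python. This is a minimal illustration, assuming a `tanh` hidden activation and a logistic sigmoid output; the names `forward`, `W1`, `b1`, `W2`, `b2` and the numeric values are made up for the example.

```python
import math

def sigmoid(a):
    """Logistic sigmoid, used here as the output activation."""
    return 1.0 / (1.0 + math.exp(-a))

def forward(x, W1, b1, W2, b2, h=math.tanh, out=sigmoid):
    """Two-layer forward pass.

    W1[j][i] and b1[j] are first-layer weights and biases (M hidden units),
    W2[k][j] and b2[k] are second-layer weights and biases (K outputs).
    """
    z = [h(sum(w * xi for w, xi in zip(W1[j], x)) + b1[j]) for j in range(len(W1))]
    return [out(sum(w * zj for w, zj in zip(W2[k], z)) + b2[k]) for k in range(len(W2))]

# Tiny example: D = 2 inputs, M = 2 hidden units, K = 1 output.
W1 = [[0.5, -0.5], [1.0, 1.0]]
b1 = [0.0, 0.0]
W2 = [[1.0, -1.0]]
b2 = [0.0]
y = forward([0.0, 0.0], W1, b1, W2, b2)
```

With a zero input and zero biases every activation is zero, so the single output equals the sigmoid of zero.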
5.1.1 Weight-space symmetries
5.2 Network Training
The error function that we need to minimize:
$$E(w) = \frac{1}{2}\sum_{n=1}^{N} \| y(x_{n}, w) - t_{n} \|^{2}$$
For a regression problem, assume that the target $t$ has a Gaussian distribution whose mean is given by the network output:
$$p(t | x, w) = N(t | y(x,w), \beta^{-1})$$
The corresponding likelihood function will be
$$p(\mathbf{t} | \mathbf{X}, w, \beta) = \prod_{n=1}^{N}p(t_{n} | x_{n}, w, \beta)$$
Taking the negative logarithm, we obtain the error function:
$$\frac{\beta}{2}\sum_{n=1}^{N}\{ y(x_{n}, w) - t_{n} \}^{2} - \frac{N}{2}\ln\beta + \frac{N}{2} \ln(2\pi)$$
from which we can learn $w$ and $\beta$. Maximizing the likelihood function with respect to $w$ is equivalent to minimizing the sum-of-squares error function given by:
$$E(w) = \frac{1}{2}\sum_{n=1}^{N}\{y(x_{n}, w) - t_{n} \}^{2}$$
Having found $w_{ML}$, the value of $\beta$ can be found by minimizing the negative log likelihood to give
$$\frac{1}{\beta_{ML}} = \frac{1}{N}\sum_{n=1}^{N}\{ y(x_{n}, w_{ML}) - t_{n} \}^{2}$$
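As a small numerical illustration of the two formulas above, the sketch below computes the sum-of-squares error and the maximum-likelihood precision from a list of predictions. The predictions stand in for $y(x_n, w_{ML})$; the function names and data are invented for the example.

```python
def sum_of_squares_error(y, t):
    """E(w) = 1/2 * sum_n (y_n - t_n)^2 for scalar predictions y and targets t."""
    return 0.5 * sum((yn - tn) ** 2 for yn, tn in zip(y, t))

def beta_ml(y, t):
    """Maximum-likelihood noise precision: 1/beta = (1/N) * sum_n (y_n - t_n)^2."""
    n = len(t)
    return n / sum((yn - tn) ** 2 for yn, tn in zip(y, t))

preds = [1.0, 2.0, 3.0, 4.0]       # stand-ins for y(x_n, w_ML)
targets = [1.5, 1.5, 3.5, 3.5]     # every residual has magnitude 0.5
E = sum_of_squares_error(preds, targets)
beta = beta_ml(preds, targets)
```

Here the mean squared residual is 0.25, so the estimated noise variance $1/\beta_{ML}$ is 0.25.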
In the regression case, we can view the network as having an output activation function that is the identity, so that $y_k = a_k$. The corresponding sum-of-squares error function has the property
$$\frac{\partial E}{\partial a_{k}} = y_{k} - t_{k}$$
In the case of binary classification, the conditional distribution of targets is a Bernoulli distribution:
$$p(t|x,w)=y(x,w)^{t}\{1-y(x,w)\}^{1-t}$$
Taking the negative log likelihood gives the cross-entropy error function, where $y_n$ denotes $y(x_n, w)$:
$$E(w) = -\sum_{n=1}^{N}\{ t_{n}\ln y_{n} + (1-t_{n})\ln(1-y_{n}) \}$$
For a multiclass classification problem, the error function is:
$$E(w) = -\sum_{n=1}^{N}\sum_{k=1}^{K}t_{nk}\ln y_{k}(x_{n}, w)$$
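The two cross-entropy errors can be written directly from the formulas. The sketch below evaluates them for a single data point as a function of the output pre-activations. It also checks numerically that the binary case has the same derivative form $\partial E/\partial a = y - t$ as the regression case, a standard result for the sigmoid and cross-entropy pairing that is assumed here rather than derived in these notes. All function names are illustrative.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def binary_cross_entropy(a, t):
    """E = -{t ln y + (1 - t) ln(1 - y)} with y = sigmoid(a), for one data point."""
    y = sigmoid(a)
    return -(t * math.log(y) + (1 - t) * math.log(1 - y))

def softmax(a):
    m = max(a)  # subtract the max for numerical stability
    e = [math.exp(ai - m) for ai in a]
    s = sum(e)
    return [ei / s for ei in e]

def multiclass_cross_entropy(a, t):
    """E = -sum_k t_k ln y_k with y = softmax(a) and a 1-of-K target vector t."""
    y = softmax(a)
    return -sum(tk * math.log(yk) for tk, yk in zip(t, y))

# Central-difference check that dE/da = y - t in the binary case.
a, t, eps = 0.3, 1.0, 1e-6
numeric = (binary_cross_entropy(a + eps, t) - binary_cross_entropy(a - eps, t)) / (2 * eps)
analytic = sigmoid(a) - t

# Two equally likely classes: the error is ln 2.
E_uniform = multiclass_cross_entropy([0.0, 0.0], [1, 0])
```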
5.2.1 Parameter optimization
We choose some initial value $w^{(0)}$ and then move through weight space in a succession of steps:
$$w^{(\tau + 1)} = w^{(\tau)} + \Delta w^{(\tau)}$$
5.2.2 Local quadratic approximation
Consider the Taylor expansion of $E(w)$ around some point $\hat{w}$ in weight space:
$$E(w) \simeq E(\hat{w}) + (w - \hat{w})^{T}b + \frac{1}{2}(w - \hat{w})^{T}H(w - \hat{w})$$
where $b$ is defined to be the gradient of $E$ evaluated at $\hat{w}$, $b \equiv \nabla E|_{w=\hat{w}}$, and $H = \nabla\nabla E$ is the Hessian matrix evaluated at the same point.
When $w^{*}$ is a minimum of the error function, there is no linear term because $\nabla E = 0$ at $w^{*}$. With $H$ evaluated at $w^{*}$, we have:
$$E(w) = E(w^{*}) + \frac{1}{2}(w-w^{*})^{T}H(w-w^{*})$$
5.2.3 Use of gradient information
5.2.4 Gradient descent optimization
Using gradient information, the simplest update takes a small step in the direction of the negative gradient, with learning rate $\eta > 0$:
$$w^{(\tau + 1)} = w^{(\tau)} - \eta\nabla E(w^{(\tau)})$$
On-line gradient descent (also called sequential or stochastic gradient descent) updates the weights using one data point at a time:
$$w^{(\tau+1)} = w^{(\tau)}- \eta\nabla E_{n}(w^{(\tau)})$$
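The two update rules can be compared on a toy problem. The sketch below fits a single weight in the model $y = w x$ by batch and by on-line gradient descent. The data, the learning rate and the function names are assumptions chosen so that both variants converge.

```python
import random

# Toy data generated from t = 2 * x, so the minimizer of E(w) is w = 2.
xs = [0.0, 1.0, 2.0, 3.0]
ts = [2.0 * x for x in xs]

def grad_En(w, x, t):
    """Gradient of E_n(w) = 1/2 * (w * x - t)^2 with respect to the scalar w."""
    return (w * x - t) * x

def batch_gd(w, eta, steps):
    """Each step uses the gradient of the full error E = sum_n E_n."""
    for _ in range(steps):
        w -= eta * sum(grad_En(w, x, t) for x, t in zip(xs, ts))
    return w

def online_gd(w, eta, epochs, seed=0):
    """Each step uses the gradient of a single E_n, visiting points in random order."""
    rng = random.Random(seed)
    data = list(zip(xs, ts))
    for _ in range(epochs):
        rng.shuffle(data)
        for x, t in data:
            w -= eta * grad_En(w, x, t)
    return w

w_batch = batch_gd(0.0, eta=0.05, steps=200)
w_online = online_gd(0.0, eta=0.05, epochs=200)
```

Because the targets here are noise-free, every $E_n$ is minimized at the same $w$, so the on-line variant also settles exactly. With noisy data it would generally fluctuate around the minimum unless $\eta$ is decreased.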
5.3 Error Backpropagation
5.3.1 Evaluation of error-function derivatives
We now derive the backpropagation algorithm for a general network. As a starting point, consider a simple linear model $y_k = \sum_i w_{ki}x_i$ with a sum-of-squares error for pattern $n$, writing $y_{nk} = y_k(x_n, w)$. Its gradient is the product of an output "error" and an input:
$$E_n=\frac{1}{2}\sum_k(y_{nk}-t_{nk})^2 \quad\Rightarrow\quad \frac{\partial E_n}{\partial w_{ji}}=(y_{nj}-t_{nj})x_{ni}$$
In a general feed-forward network, each unit computes a weighted sum of its inputs, and the sum is transformed by a nonlinear activation function:
$$a_j=\sum_i w_{ji}z_i,\qquad z_j=h(a_j)$$
Applying the chain rule, defining $\delta_j \equiv \partial E_n/\partial a_j$ and using $\partial a_j/\partial w_{ji} = z_i$:
$$\frac{\partial E_n}{\partial w_{ji}}=\frac{\partial E_n}{\partial a_j}\frac{\partial a_j}{\partial w_{ji}}=\delta_j z_i$$
We can obtain the backpropagation formula:
$$\delta_j=\frac{\partial E_n}{\partial a_j}=\sum_k\frac{\partial E_n}{\partial a_k}\frac{\partial a_k}{\partial a_j}=h'(a_j)\sum_k w_{kj}\delta_k$$
For all data points, we have:
$$E(w)=\sum_{n=1}^N E_n(w),\qquad \frac{\partial E}{\partial w_{ji}}=\sum_n\frac{\partial E_n}{\partial w_{ji}}$$
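The backpropagation formulas can be checked on a small network. The following is a minimal sketch, assuming `tanh` hidden units, linear outputs, a sum-of-squares error and no biases. The weights and data are arbitrary, and a central-difference estimate is used to confirm one gradient entry.

```python
import math

def forward(x, W1, W2):
    """Forward pass: tanh hidden units, linear outputs (biases omitted for brevity)."""
    a_hidden = [sum(w * xi for w, xi in zip(row, x)) for row in W1]
    z = [math.tanh(a) for a in a_hidden]
    y = [sum(w * zj for w, zj in zip(row, z)) for row in W2]
    return z, y

def error(x, t, W1, W2):
    _, y = forward(x, W1, W2)
    return 0.5 * sum((yk - tk) ** 2 for yk, tk in zip(y, t))

def backprop(x, t, W1, W2):
    """Return (dE_n/dW1, dE_n/dW2) for E_n = 1/2 * sum_k (y_k - t_k)^2."""
    z, y = forward(x, W1, W2)
    delta_out = [yk - tk for yk, tk in zip(y, t)]        # delta_k = y_k - t_k
    delta_hid = [(1 - z[j] ** 2) *                       # h'(a_j) = 1 - tanh(a_j)^2
                 sum(W2[k][j] * delta_out[k] for k in range(len(W2)))
                 for j in range(len(W1))]
    gW2 = [[dk * zj for zj in z] for dk in delta_out]    # dE_n/dw_kj = delta_k * z_j
    gW1 = [[dj * xi for xi in x] for dj in delta_hid]    # dE_n/dw_ji = delta_j * x_i
    return gW1, gW2

x, t = [0.5, -1.0], [0.2]
W1 = [[0.1, -0.3], [0.4, 0.2]]
W2 = [[0.7, -0.5]]
gW1, gW2 = backprop(x, t, W1, W2)

# Central-difference check on one first-layer weight.
eps = 1e-6
W1p = [row[:] for row in W1]
W1p[1][0] += eps
W1m = [row[:] for row in W1]
W1m[1][0] -= eps
numeric = (error(x, t, W1p, W2) - error(x, t, W1m, W2)) / (2 * eps)
```

Summing the per-pattern gradients returned by `backprop` over all data points gives the gradient of the total error $E$.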
5.3.2 A simple example
5.3.3 Efficiency of backpropagation
5.3.4 The Jacobian matrix
To evaluate the Jacobian matrix, whose elements are $J_{ki} = \partial y_k/\partial x_i$, we first write down a backpropagation formula for the derivatives $\partial y_{k}/\partial a_{j}$:
$$\frac{\partial y_{k}}{\partial a_{j}} = \sum_{l}\frac{\partial y_{k}}{\partial a_{l}}\frac{\partial a_{l}}{\partial a_{j}} = h'(a_{j})\sum_{l}w_{lj}\frac{\partial y_{k}}{\partial a_{l}}$$
If we have individual sigmoidal activation functions at each output unit, then
$$\frac{\partial y_{k}}{\partial a_{l}} = \delta_{kl}\,\sigma'(a_{l})$$
whereas for softmax outputs we have:
$$\frac{\partial y_{k}}{\partial a_{l}} = \delta_{kl}y_{k} - y_{k}y_{l}$$
Finally, the elements of the Jacobian matrix are obtained by summing over the units $j$ to which input $i$ sends connections:
$$J_{ki} = \frac{\partial y_k}{\partial x_i} = \sum_{j}w_{ji}\frac {\partial y_{k}}{\partial a_{j}}$$
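For a concrete case, the sketch below evaluates the Jacobian of a two-layer network with `tanh` hidden units and linear outputs (an assumption made to keep the example short). With linear outputs, $\partial y_k/\partial a_l = \delta_{kl}$, so the recursion gives $\partial y_k/\partial a_j = h'(a_j)\,w_{kj}$ for hidden unit $j$. One entry is compared against a finite-difference estimate.

```python
import math

def network(x, W1, W2):
    """tanh hidden layer followed by linear outputs; returns (z, y)."""
    z = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W1]
    return z, [sum(w * zj for w, zj in zip(row, z)) for row in W2]

def jacobian(x, W1, W2):
    """J[k][i] = sum_j W1[j][i] * dy_k/da_j, with dy_k/da_j = h'(a_j) * W2[k][j]."""
    z, _ = network(x, W1, W2)
    return [[sum(W1[j][i] * (1 - z[j] ** 2) * W2[k][j] for j in range(len(W1)))
             for i in range(len(x))]
            for k in range(len(W2))]

x = [0.3, -0.2]
W1 = [[0.5, 0.1], [-0.4, 0.8]]
W2 = [[1.0, -2.0], [0.5, 0.5]]
J = jacobian(x, W1, W2)

# Central-difference estimate of dy_0/dx_1 for comparison.
eps = 1e-6
xp = [x[0], x[1] + eps]
xm = [x[0], x[1] - eps]
numeric = (network(xp, W1, W2)[1][0] - network(xm, W1, W2)[1][0]) / (2 * eps)
```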
5.4 The Hessian Matrix
5.4.1 Diagonal approximation
The diagonal elements of the Hessian can be written:
$$\frac{\partial^2 E_n}{\partial w^2_{ji}}=\frac{\partial^2 E_n}{\partial a_j^2}\, z^2_i$$
Recursive use of the chain rule of differential calculus gives a backpropagation equation of the form:
$$\frac{\partial^2 E_n}{\partial a_j^2}=h'(a_j)^2\sum_k\sum_{k'}w_{kj}w_{k'j} \frac{\partial^2 E_n}{\partial a_k \partial a_{k'}}+h''(a_j)\sum_k w_{kj}\frac{\partial E_n}{\partial a_k}$$
5.4.2 Outer product approximation
For a sum-of-squares error with a single output, write the Hessian matrix in the form:
$$H=\nabla\nabla E=\sum_{n=1}^N\nabla y_n(\nabla y_n)^{T}+\sum_{n=1}^N(y_n-t_n)\nabla\nabla y_n$$
By neglecting the second term, we get the Levenberg-Marquardt approximation (outer product approximation)
$$H\simeq\sum_{n=1}^N b_n b_n^T$$
where $b_n=\nabla y_n=\nabla a_n$, because the output activation function is the identity.
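The outer-product sum is straightforward to accumulate once the per-pattern gradients $b_n$ are available. The sketch below uses a linear model $y = w^{T}x$ as a sanity check, an assumption chosen because then $b_n = x_n$ and $\nabla\nabla y_n = 0$, so the approximation coincides with the exact Hessian. The function names are illustrative.

```python
def outer(u, v):
    """Outer product u v^T as a nested list."""
    return [[ui * vj for vj in v] for ui in u]

def outer_product_hessian(grads):
    """H ~= sum_n b_n b_n^T, where grads[n] is b_n = grad_w y_n as a list."""
    W = len(grads[0])
    H = [[0.0] * W for _ in range(W)]
    for b in grads:
        bb = outer(b, b)
        for i in range(W):
            for j in range(W):
                H[i][j] += bb[i][j]
    return H

# Linear model: b_n = x_n, so H = sum_n x_n x_n^T exactly.
X = [[1.0, 2.0], [3.0, -1.0]]
H = outer_product_hessian(X)
```

For a real network the `grads` list would be filled by one backpropagation pass per data point.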
In the case of the cross-entropy error function for a network with logistic sigmoid output-unit activation functions, the corresponding approximation is given by:
$$H\simeq\sum_{n=1}^N y_n(1-y_n)\,b_nb_n^T$$
5.5 Regularization in Neural Networks
To control the complexity of a neural network, the simplest regularizer is the quadratic, giving a regularized error:
$$\tilde{E}(w)=E(w)+\frac{\lambda}{2} w^Tw$$
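Differentiating the regularized error gives $\nabla\tilde{E} = \nabla E + \lambda w$, so each gradient step also shrinks the weights toward zero. The sketch below shows this on a hypothetical two-weight quadratic error; the error function, $\lambda$ and the learning rate are all made up for the illustration.

```python
def regularized_error(E, w, lam):
    """E_tilde(w) = E(w) + (lam / 2) * w^T w for a weight list w."""
    return E(w) + 0.5 * lam * sum(wi * wi for wi in w)

def regularized_grad(grad_E, w, lam):
    """grad E_tilde(w) = grad E(w) + lam * w."""
    return [gi + lam * wi for gi, wi in zip(grad_E(w), w)]

# Hypothetical unregularized error E(w) = 1/2 * (w_0 - 1)^2 + 1/2 * (w_1 + 2)^2,
# whose minimum is at w = (1, -2).
E = lambda w: 0.5 * (w[0] - 1) ** 2 + 0.5 * (w[1] + 2) ** 2
grad_E = lambda w: [w[0] - 1, w[1] + 2]

w = [0.0, 0.0]
for _ in range(200):
    g = regularized_grad(grad_E, w, lam=1.0)
    w = [wi - 0.1 * gi for wi, gi in zip(w, g)]
```

With $\lambda = 1$ the minimizer moves from $(1, -2)$ to $(0.5, -1)$, halfway toward the origin.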
5.5.1 Consistent Gaussian priors
A regularizer that is invariant under linear transformations of the inputs and outputs is given by:
$$\frac{\lambda_{1}}{2}\sum_{w\in W_{1}}w^{2} + \frac{\lambda_{2}}{2}\sum_{w\in W_{2}}w^{2}$$
where $W_1$ and $W_2$ denote the sets of weights in the first and second layers, with biases excluded from the sums.
5.5.2 Early stopping
Training can be stopped at the point of smallest error with respect to the validation set in order to obtain a network having good generalization performance.
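One common way to implement this is to track the best validation error seen so far and stop after it has failed to improve for a fixed number of steps. The sketch below assumes caller-supplied `step` and `val_error` callables and a `patience` parameter; none of these names come from the text.

```python
def train_with_early_stopping(step, val_error, w0, max_steps, patience):
    """Run `step` repeatedly and return the weights with the smallest validation error.

    `step(w)` returns updated weights and `val_error(w)` returns a float; both are
    hypothetical callables supplied by the caller. Training halts once the
    validation error has not improved for `patience` consecutive steps.
    """
    best_w, best_err, since_best = w0, val_error(w0), 0
    w = w0
    for _ in range(max_steps):
        w = step(w)
        err = val_error(w)
        if err < best_err:
            best_w, best_err, since_best = w, err, 0
        else:
            since_best += 1
            if since_best >= patience:
                break
    return best_w, best_err

# Toy run: the "training" step keeps increasing w, while the validation
# error is smallest at w = 3.
best_w, best_err = train_with_early_stopping(
    step=lambda w: w + 1,
    val_error=lambda w: (w - 3) ** 2,
    w0=0, max_steps=100, patience=2)
```

In the toy run training continues two steps past the validation minimum before stopping, and the weights from the minimum are the ones returned.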
5.5.3 Invariances
5.5.4 Tangent propagation
We can use regularization to encourage models to be invariant to transformations of the input through the technique of tangent propagation. Let the vector that results from acting on $x_n$ by this transformation be denoted by $s(x_{n}, \epsilon)$, defined so that $s(x_n,0)=x_n$. As $\epsilon$ varies, the transformed point sweeps out a curve $M$ in input space. The tangent to this curve is given by the directional derivative $\tau = \partial s/\partial \epsilon$, and the tangent vector at the point $x_n$ is given by:
$$\tau_n=\left.\frac{\partial s(x_n,\epsilon)}{\partial\epsilon}\right|_{\epsilon =0}$$
The derivative of output $k$ with respect to $\epsilon$ is given by:
$$\left.\frac{\partial y_{k}}{\partial \epsilon}\right|_{\epsilon=0} =\sum_{i=1}^D\frac{\partial y_k}{\partial x_i}\left.\frac{\partial x_i}{\partial\epsilon}\right|_{\epsilon=0} =\sum_{i=1}^{D}J_{ki}\tau_{i}$$
The result can be used to modify the standard error function:
$$\tilde{E}=E+\lambda\Omega$$
where $\lambda$ is a regularization coefficient and:
$$\Omega=\frac{1}{2}\sum_n\sum_k\left(\left.\frac{\partial y_{nk}}{\partial \epsilon}\right|_{\epsilon=0}\right)^2=\frac{1}{2}\sum_n\sum_k\left(\sum_{i=1}^D J_{nki}\tau_{ni}\right)^2$$
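To make the regularizer concrete, the sketch below evaluates $\Omega$ for a 2-D rotation, comparing a rotation-invariant function with one that is not. It estimates $\partial y_k/\partial\epsilon$ by central differences instead of using the Jacobian, a simplification for illustration. The choice of rotation as the transformation and all names are assumptions.

```python
import math

def rotate(x, eps):
    """s(x, eps): rotate a 2-D point by angle eps, so that s(x, 0) = x."""
    c, s = math.cos(eps), math.sin(eps)
    return [c * x[0] - s * x[1], s * x[0] + c * x[1]]

def tangent_penalty(y, xs, s=rotate, h=1e-5):
    """Omega = 1/2 * sum_n sum_k (d y_k(s(x_n, eps)) / d eps at eps = 0)^2.

    The derivative is estimated by central differences with step h.
    """
    total = 0.0
    for x in xs:
        yp, ym = y(s(x, h)), y(s(x, -h))
        total += sum(((a - b) / (2 * h)) ** 2 for a, b in zip(yp, ym))
    return 0.5 * total

points = [[0.0, 1.0]]
# Squared radius is unchanged by rotation, so the penalty is (numerically) zero.
omega_invariant = tangent_penalty(lambda x: [x[0] ** 2 + x[1] ** 2], points)
# The first coordinate changes at unit rate under rotation at (0, 1).
omega_sensitive = tangent_penalty(lambda x: [x[0]], points)
```

At the point $(0, 1)$ the tangent vector is $\tau = (-1, 0)$, so for $y(x) = x_1$ the directional derivative is $-1$ and $\Omega = 1/2$.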
5.5.5 Training with transformed data
Consider a transformation governed by a single parameter $\epsilon$ and described by the function $s(x,\epsilon)$. For untransformed inputs, the sum-of-squares error function can be written in the form:
$$E = \frac{1}{2}\iint \{ y(x) - t\}^{2}\,p(t|x)\,p(x)\, dx\, dt$$
If the parameter $\epsilon$ is drawn from a distribution $p(\epsilon)$, then the error function over the transformed data is:
$$\tilde{E} = \frac{1}{2}\iiint \{ y(s(x, \epsilon)) - t\}^{2}\,p(t|x)\,p(x)\,p(\epsilon)\, dx\, dt\, d\epsilon$$
Further assume that $p(\epsilon)$ has zero mean with small variance. After a Taylor expansion in powers of $\epsilon$ and substitution into the mean error function, the average error is
$$\tilde{E} = E + \lambda\Omega$$
where $E$ is the original sum-of-squares error, and the regularization term $\Omega$ takes the form:
$$\Omega = \frac{1}{2}\int \left[ \{ y(x) - E[t|x] \} \{ (\tau')^T\nabla y(x) + \tau^{T}\nabla\nabla y(x)\tau \} + (\tau^{T}\nabla y(x))^{2} \right]p(x)\, dx$$
5.5.6 Convolutional networks
5.5.7 Soft weight sharing
In this part, the hard constraint of equal weights is replaced by a form of regularization in which groups of weights are encouraged to have similar values. Furthermore, the division of weights into groups, the mean weight value for each group, and the spread of values within the groups are all determined as part of the learning process.
5.6 Mixture Density Networks
Develop the model explicitly for Gaussian components, so that:
$$p(t|x) = \sum_{k=1}^{K}\pi_{k}(x)N(t | \mu_{k}(x), \sigma_{k}^{2}(x)\mathbf{I})$$
For independent data, the error function takes the form:
$$E(w) = -\sum_{n=1}^{N}\ln \left\{ \sum_{k=1}^{K}\pi_{k}(x_{n}, w)N(t_{n} | \mu_{k}(x_{n},w), \sigma_{k}^{2}(x_{n}, w)\mathbf{I}) \right\}$$
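The error function can be evaluated directly once the mixing coefficients, means and standard deviations for each input are known. The sketch below does this for scalar targets. The `(pi_k, mu_k, sigma_k)` triples stand in for the network outputs at $x_n$, the network itself is not modeled, and the names are illustrative.

```python
import math

def gaussian_pdf(t, mu, sigma):
    """Univariate N(t | mu, sigma^2)."""
    return math.exp(-0.5 * ((t - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def mdn_error(targets, mix_params):
    """E = -sum_n ln sum_k pi_k N(t_n | mu_k, sigma_k^2).

    mix_params[n] is a list of (pi_k, mu_k, sigma_k) triples, standing in for
    the network outputs at input x_n.
    """
    total = 0.0
    for t, comps in zip(targets, mix_params):
        total -= math.log(sum(pi * gaussian_pdf(t, mu, s) for pi, mu, s in comps))
    return total

# One data point, two equally weighted unit-variance components at -1 and +1.
E = mdn_error([0.0], [[(0.5, -1.0, 1.0), (0.5, 1.0, 1.0)]])
# By symmetry the mixture density at t = 0 equals N(0 | 1, 1).
expected = 0.5 + 0.5 * math.log(2 * math.pi)
```

In a full implementation the mixing coefficients would typically come from a softmax and the standard deviations from an exponential of the raw outputs, so that the constraints on them hold automatically.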
5.7 Bayesian Neural Networks
In this part, we will approximate the posterior distribution by a Gaussian centred at a mode of the true posterior. We will also assume that the covariance of this Gaussian is small, so that the network function is approximately linear in the weights over the region where the posterior is significant.
5.7.1 Posterior parameter distribution
We suppose that the conditional distribution $p(t|x)$ is Gaussian:
$$p(t|x,w,\beta) = N(t | y(x, w), \beta^{-1})$$
We also choose a Gaussian prior distribution over the weights $w$, of the form:
$$p(w | \alpha) = N(w | 0, \alpha^{-1}\mathbf{I})$$
For an i.i.d. data set of $N$ observations $x_1,\ldots,x_N$, with a corresponding set of target values $D=\{t_1,\ldots,t_N\}$, the likelihood function is given by:
$$p(D | w, \beta) = \prod_{n=1}^{N}N(t_{n} | y(x_{n}, w), \beta^{-1})$$
so we can get the posterior distribution:
$$p(w | D, \alpha, \beta) \propto p(w | \alpha)\,p(D | w, \beta)$$
The Gaussian approximation to the posterior is given by:
$$q(w | D) = N(w | w_{MAP}, \mathbf{A}^{-1})$$
Similarly, the predictive distribution is obtained by marginalizing with respect to this posterior distribution:
$$p(t | x, D) = \int p(t | x, w)\,q(w | D)\, dw$$
Making a Taylor series expansion of the network function around $w_{MAP}$ and retaining only the linear terms, we obtain a linear-Gaussian model:
$$p(t| x, w, \beta) \simeq N(t | y(x, w_{MAP}) + g^{T}(w - w_{MAP}), \beta^{-1})$$
We can therefore make use of the general result for the marginal distribution in a linear-Gaussian model to give:
$$p(t | x, D, \alpha, \beta) = N(t | y(x, w_{MAP}), \sigma^{2}(x))$$
where
$$\sigma^{2}(x) = \beta^{-1} + g^{T}\mathbf{A}^{-1}g$$
$$g = \nabla_{w}y(x,w)|_{w = w_{MAP}}$$
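The predictive variance has two parts: the intrinsic noise $\beta^{-1}$ and a term reflecting uncertainty in $w$. The sketch below computes it for given $g$, $\mathbf{A}$ and $\beta$, solving $\mathbf{A}v = g$ rather than forming the inverse. The numeric values are placeholders, and the small Gaussian-elimination helper is only there to keep the example free of dependencies.

```python
def solve(A, b):
    """Solve A v = b by Gaussian elimination with partial pivoting (small dense A)."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    v = [0.0] * n
    for r in reversed(range(n)):
        v[r] = (M[r][n] - sum(M[r][k] * v[k] for k in range(r + 1, n))) / M[r][r]
    return v

def predictive_variance(g, A, beta):
    """sigma^2(x) = 1/beta + g^T A^{-1} g."""
    v = solve(A, g)
    return 1.0 / beta + sum(gi * vi for gi, vi in zip(g, v))

A = [[2.0, 0.0], [0.0, 4.0]]   # placeholder for the posterior precision matrix
g = [1.0, 2.0]                 # placeholder for grad_w y(x, w) at w_MAP
var = predictive_variance(g, A, beta=10.0)
```

Here $g^{T}\mathbf{A}^{-1}g = 1.5$ and $\beta^{-1} = 0.1$, so most of the predictive variance in this made-up case comes from weight uncertainty.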
5.7.2 Hyperparameter optimization
5.7.3 Bayesian neural networks for classification
For a two-class classification problem, the network has a single logistic sigmoid output. The log likelihood function for this model is given by:
$$\ln p(D|w) = \sum_{n=1}^{N}\{t_{n}\ln y_{n} + (1-t_{n})\ln(1-y_{n}) \}$$
The weights $w_{MAP}$ are found by minimizing the regularized error function, which is the negative log likelihood plus a quadratic penalty:
$$E(w) = -\ln p(D|w) + \frac{\alpha}{2}w^{T}w$$
The resulting approximate predictive distribution is
$$p(t=1 | x, D) = \sigma\left(\kappa(\sigma_{a}^{2})\,b^T w_{MAP}\right)$$