Pattern Recognition | PRML Chapter 5 Neural Networks


5.1 Feed-forward Network Functions

A network with one hidden layer takes the following form:

$$y_k(x, w) = \sigma\left( \sum_{j=1}^{M} w_{kj}^{(2)}\, h\left( \sum_{i=1}^{D} w_{ji}^{(1)} x_i + w_{j0}^{(1)} \right) + w_{k0}^{(2)} \right)$$
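
To make the notation concrete, here is a minimal NumPy sketch of this network function, assuming $h = \tanh$ for the hidden units and a logistic sigmoid for $\sigma$ (the shapes, seed, and layer sizes below are illustrative, not from the book; the biases $w_{j0}$, $w_{k0}$ are folded into the vectors `b1`, `b2`):

```python
import numpy as np

def forward(x, W1, b1, W2, b2):
    """One-hidden-layer network: y = sigma(W2 h(W1 x + b1) + b2)."""
    a_hidden = W1 @ x + b1                  # hidden pre-activations a_j
    z = np.tanh(a_hidden)                   # z_j = h(a_j)
    a_out = W2 @ z + b2                     # output pre-activations a_k
    return 1.0 / (1.0 + np.exp(-a_out))     # y_k = sigma(a_k)

rng = np.random.default_rng(0)
D, M, K = 3, 5, 2                           # inputs, hidden units, outputs
W1, b1 = rng.normal(size=(M, D)), np.zeros(M)
W2, b2 = rng.normal(size=(K, M)), np.zeros(K)
print(forward(rng.normal(size=D), W1, b1, W2, b2))
```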

5.1.1 Weight-space symmetries

5.2 Network Training

The error function that we need to minimize:

$$E(w) = \frac{1}{2}\sum_{n=1}^{N} \| y(x_n, w) - t_n \|^2$$

For the regression problem, assume that the target $t$ has a Gaussian distribution with a mean given by the output of the network:

$$p(t \mid x, w) = \mathcal{N}\left(t \mid y(x, w), \beta^{-1}\right)$$

The corresponding likelihood function will be

$$p(\mathbf{t} \mid \mathbf{X}, w, \beta) = \prod_{n=1}^{N} p(t_n \mid x_n, w, \beta)$$

Taking the negative logarithm, we obtain the error function:

$$\frac{\beta}{2}\sum_{n=1}^{N}\{ y(x_n, w) - t_n \}^2 - \frac{N}{2}\ln\beta + \frac{N}{2}\ln(2\pi)$$

from which we can learn $w$ and $\beta$. Maximizing the likelihood function is equivalent to minimizing the sum-of-squares error function given by:

$$E(w) = \frac{1}{2}\sum_{n=1}^{N}\{ y(x_n, w) - t_n \}^2$$

Having found $w_{ML}$, the value of $\beta$ can be found by minimizing the negative log likelihood to give

$$\frac{1}{\beta_{ML}} = \frac{1}{N}\sum_{n=1}^{N}\{ y(x_n, w_{ML}) - t_n \}^2$$
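
Once $w_{ML}$ has been found, this estimate is a one-liner; a sketch, assuming `y_pred` holds the network outputs $y(x_n, w_{ML})$ and `t` the targets:

```python
import numpy as np

def beta_ml(y_pred, t):
    """ML noise precision: 1/beta_ML is the mean squared residual."""
    return 1.0 / np.mean((y_pred - t) ** 2)

print(beta_ml(np.array([0.9, 1.8, 3.1]), np.array([1.0, 2.0, 3.0])))
```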

In the regression case, we can view the network as having an output activation function that is the identity, so that $y_k = a_k$. The corresponding sum-of-squares error function has the property

$$\frac{\partial E}{\partial a_k} = y_k - t_k$$

In the case of binary classification, the conditional distribution of targets is a Bernoulli distribution:

$$p(t \mid x, w) = y(x, w)^t \{1 - y(x, w)\}^{1-t}$$

Taking the negative log likelihood, we obtain a cross-entropy error function:

$$E(w) = -\sum_{n=1}^{N}\{ t_n \ln y_n + (1 - t_n)\ln(1 - y_n) \}$$

For the multiclass classification problem:

$$E(w) = -\sum_{n=1}^{N}\sum_{k=1}^{K} t_{nk} \ln y_k(x_n, w)$$
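
These three error functions translate directly into code; a minimal sketch (the small `eps` guarding the logarithms is a numerical-safety addition, not part of the book's formulas):

```python
import numpy as np

eps = 1e-12  # guards log(0)

def sum_of_squares(y, t):            # regression targets
    return 0.5 * np.sum((y - t) ** 2)

def binary_cross_entropy(y, t):      # t in {0, 1}, y in (0, 1)
    return -np.sum(t * np.log(y + eps) + (1 - t) * np.log(1 - y + eps))

def multiclass_cross_entropy(Y, T):  # Y, T of shape (N, K), T is 1-of-K coded
    return -np.sum(T * np.log(Y + eps))
```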

5.2.1 Parameter optimization

We choose some initial value $w^{(0)}$ and move through weight space in a succession of steps:

$$w^{(\tau+1)} = w^{(\tau)} + \Delta w^{(\tau)}$$

5.2.2 Local quadratic approximation

Consider the Taylor expansion of $E(w)$ around some point $\hat{w}$ in weight space:

$$E(w) \simeq E(\hat{w}) + (w - \hat{w})^T b + \frac{1}{2}(w - \hat{w})^T H (w - \hat{w})$$

where $b$ is defined to be the gradient of $E$ evaluated at $\hat{w}$, $b \equiv \nabla E|_{w=\hat{w}}$, and $H$ is the Hessian matrix $\nabla\nabla E$ evaluated at $\hat{w}$.

When $w^*$ is a minimum of the error function, there is no linear term because $\nabla E = 0$, so we have:

$$E(w) = E(w^*) + \frac{1}{2}(w - w^*)^T H (w - w^*)$$

5.2.3 Use of gradient information

5.2.4 Gradient descent optimization

Using gradient information, batch gradient descent updates the weights with the gradient of the full error function:

$$w^{(\tau+1)} = w^{(\tau)} - \eta \nabla E(w^{(\tau)})$$

On-line gradient descent (also called sequential or stochastic gradient descent) instead updates the weights using one data point at a time:

$$w^{(\tau+1)} = w^{(\tau)} - \eta \nabla E_n(w^{(\tau)})$$
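
The two update rules differ only in which gradient they use; a sketch on a toy least-squares problem (the data, learning rate, and iteration counts are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                       # toy inputs
t = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)
eta = 0.05

w = np.zeros(3)
for tau in range(500):                              # batch: gradient of E
    w -= eta * X.T @ (X @ w - t) / len(X)

w_sgd = np.zeros(3)
for tau in range(500):                              # on-line: gradient of E_n
    n = rng.integers(len(X))
    w_sgd -= eta * (X[n] @ w_sgd - t[n]) * X[n]

print(w, w_sgd)   # both approach the generating weights
```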

5.3 Error Backpropagation

5.3.1 Evaluation of error-function derivatives

We now derive the backpropagation algorithm. As a warm-up, consider a simple linear model whose outputs are linear combinations of the inputs; with a sum-of-squares error, the per-pattern gradient takes a particularly simple form:

$$E_n = \frac{1}{2}\sum_k \left(y_k(x_n, w) - t_{nk}\right)^2 \;\rightarrow\; \frac{\partial E_n}{\partial w_{ji}} = (y_{nj} - t_{nj})\, x_{ni}$$

In a general feed-forward network, each unit computes a weighted sum of its inputs, and the sum is transformed by a nonlinear activation function:

$$a_j = \sum_i w_{ji} z_i, \qquad z_j = h(a_j)$$

Applying the chain rule:

$$\frac{\partial E_n}{\partial w_{ji}} = \frac{\partial E_n}{\partial a_j}\frac{\partial a_j}{\partial w_{ji}} = \delta_j z_i$$

We can obtain the backpropagation formula:

$$\delta_j = \frac{\partial E_n}{\partial a_j} = \sum_k \frac{\partial E_n}{\partial a_k}\frac{\partial a_k}{\partial a_j} = h'(a_j)\sum_k w_{kj}\delta_k$$

For all data points, we have:

$$E(w) = \sum_{n=1}^{N} E_n(w), \qquad \frac{\partial E}{\partial w_{ji}} = \sum_n \frac{\partial E_n}{\partial w_{ji}}$$
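
Putting the forward and backward passes together for the two-layer network of Section 5.1 (assuming $h = \tanh$, identity outputs $y_k = a_k$, and a sum-of-squares error, so that $\delta_k = y_k - t_k$ at the outputs), a sketch of the per-pattern gradient computation:

```python
import numpy as np

def backprop(x, t, W1, b1, W2, b2):
    # Forward pass
    a1 = W1 @ x + b1
    z = np.tanh(a1)                              # z_j = h(a_j)
    y = W2 @ z + b2                              # identity outputs: y_k = a_k
    # Backward pass
    delta_out = y - t                            # delta_k = y_k - t_k
    delta_hid = (1 - z**2) * (W2.T @ delta_out)  # h'(a_j) sum_k w_kj delta_k
    # Gradients dE_n/dw_ji = delta_j z_i (outer products); bias grads = deltas
    return np.outer(delta_hid, x), delta_hid, np.outer(delta_out, z), delta_out
```

Summing these per-pattern gradients over $n$ gives the batch gradient above.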

5.3.2 A simple example

5.3.3 Efficiency of backpropagation

5.3.4 The Jacobian matrix

Consider the evaluation of the Jacobian matrix. First we write down the backpropagation formula to determine the derivatives $\frac{\partial y_k}{\partial a_j}$:

$$\frac{\partial y_k}{\partial a_j} = \sum_l \frac{\partial y_k}{\partial a_l}\frac{\partial a_l}{\partial a_j} = h'(a_j)\sum_l w_{lj}\frac{\partial y_k}{\partial a_l}$$

If we have individual sigmoidal activation functions at each output unit, then

$$\frac{\partial y_k}{\partial a_l} = \delta_{kl}\, \sigma'(a_l)$$

whereas for softmax outputs we have:

$$\frac{\partial y_k}{\partial a_l} = \delta_{kl} y_k - y_k y_l$$

Finally, we can calculate the elements of the Jacobian matrix:

$$J_{ki} = \frac{\partial y_k}{\partial x_i} = \sum_j w_{ji}\frac{\partial y_k}{\partial a_j}$$
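
An analytic Jacobian computed this way is usually checked against finite differences; a generic sketch, where `f` stands for any network forward function (such as the `forward` sketch in Section 5.1):

```python
import numpy as np

def numerical_jacobian(f, x, eps=1e-6):
    """J[k, i] ~= dy_k/dx_i by central differences."""
    y = f(x)
    J = np.zeros((y.size, x.size))
    for i in range(x.size):
        dx = np.zeros_like(x)
        dx[i] = eps
        J[:, i] = (f(x + dx) - f(x - dx)) / (2 * eps)
    return J
```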

5.4 The Hessian Matrix

5.4.1 Diagonal approximation

The diagonal elements of the Hessian can be written:

$$\frac{\partial^2 E_n}{\partial w_{ji}^2} = \frac{\partial^2 E_n}{\partial a_j^2}\, z_i^2$$

Recursively applying the chain rule of differential calculus gives a backpropagation equation of the form:

$$\frac{\partial^2 E_n}{\partial a_j^2} = h'(a_j)^2 \sum_k \sum_{k'} w_{kj} w_{k'j} \frac{\partial^2 E_n}{\partial a_k \partial a_{k'}} + h''(a_j)\sum_k w_{kj}\frac{\partial E_n}{\partial a_k}$$

5.4.2 Outer product approximation

Write the Hessian matrix in the form:

$$H = \nabla\nabla E = \sum_{n=1}^{N}\nabla y_n (\nabla y_n)^T + \sum_{n=1}^{N}(y_n - t_n)\nabla\nabla y_n$$

By neglecting the second term, we get the Levenberg-Marquardt approximation (outer product approximation)

$$H \simeq \sum_{n=1}^{N} b_n b_n^T$$

where $b_n = \nabla y_n = \nabla a_n$.

In the case of the cross-entropy error function for a network with logistic sigmoid output-unit activation functions, the corresponding approximation is given by:

$$H \simeq \sum_{n=1}^{N} y_n (1 - y_n)\, b_n b_n^T$$
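
Given the per-pattern gradient vectors $b_n$, both approximations are simple matrix products; a sketch, assuming `B` is an $(N, W)$ array whose rows are the $b_n$ (e.g. obtained by backpropagation) and `y` holds the corresponding network outputs:

```python
import numpy as np

def outer_product_hessian(B):
    """H ~= sum_n b_n b_n^T, for a sum-of-squares error."""
    return B.T @ B

def outer_product_hessian_xent(B, y):
    """Cross-entropy / logistic-sigmoid version: weight by y_n (1 - y_n)."""
    return (B * (y * (1 - y))[:, None]).T @ B
```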

5.5 Regularization in Neural Networks

To control the complexity of a neural network, the simplest regularizer is the quadratic, giving a regularized error:

$$\tilde{E}(w) = E(w) + \frac{\lambda}{2} w^T w$$
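
In code this adds one term to the error and one to the gradient; a sketch, where `E` and `grad_E` are assumed to come from an unregularized error computation:

```python
import numpy as np

def regularized_error(E, w, lam):
    return E + 0.5 * lam * w @ w        # E~(w) = E(w) + (lambda/2) w^T w

def regularized_grad(grad_E, w, lam):
    return grad_E + lam * w             # gradient of the quadratic term
```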

5.5.1 Consistent Gaussian priors

A regularizer which is invariant under linear transformations of the input and output variables is given by (where $W_1$ and $W_2$ denote the sets of weights in the first and second layers):

$$\frac{\lambda_1}{2}\sum_{w \in W_1} w^2 + \frac{\lambda_2}{2}\sum_{w \in W_2} w^2$$

5.5.2 Early stopping

Training can be stopped at the point of smallest error with respect to the validation set in order to obtain a network having good generalization performance.
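
A sketch of this procedure, assuming hypothetical helpers `train_epoch()` (runs one epoch of training and returns the current weight vector) and `validation_error(w)`; the patience value is illustrative:

```python
import numpy as np

best_err, best_w = np.inf, None
patience, bad_epochs = 10, 0
while bad_epochs < patience:
    w = train_epoch()                    # one pass over the training set
    err = validation_error(w)
    if err < best_err:                   # new validation minimum
        best_err, best_w, bad_epochs = err, w.copy(), 0
    else:                                # validation error increased
        bad_epochs += 1
# best_w holds the network at the smallest validation error
```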

5.5.3 Invariances

5.5.4 Tangent propagation

We can use regularization to encourage models to be invariant to transformations of the input through the technique of tangent propagation. Let the vector that results from acting on $x_n$ by this transformation be denoted by $s(x_n, \epsilon)$, with $s(x, 0) = x$. The tangent to the curve $\mathcal{M}$ swept out by the transformation is given by the directional derivative $\tau = \frac{\partial s}{\partial \epsilon}$, and the tangent vector at the point $x_n$ is given by:

$$\tau_n = \left.\frac{\partial s(x_n, \epsilon)}{\partial \epsilon}\right|_{\epsilon=0}$$

The derivative of output $k$ with respect to $\epsilon$ is given by:

$$\left.\frac{\partial y_k}{\partial \epsilon}\right|_{\epsilon=0} = \left.\sum_{i=1}^{D}\frac{\partial y_k}{\partial x_i}\frac{\partial x_i}{\partial \epsilon}\right|_{\epsilon=0} = \sum_{i=1}^{D} J_{ki}\tau_i$$

This result can be used to modify the standard error function:

$$\tilde{E} = E + \lambda\Omega$$

where $\lambda$ is a regularization coefficient and:

$$\Omega = \frac{1}{2}\sum_n\sum_k\left(\left.\frac{\partial y_{nk}}{\partial \epsilon}\right|_{\epsilon=0}\right)^2 = \frac{1}{2}\sum_n\sum_k\left(\sum_{i=1}^{D} J_{nki}\tau_{ni}\right)^2$$
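
A sketch of how the pieces fit together for one data point, with the tangent vector approximated by a central difference of the transformation `s` and `J` the network Jacobian (e.g. from `numerical_jacobian` above); `s` and `eps` are illustrative:

```python
import numpy as np

def tangent_vector(s, x, eps=1e-4):
    """tau ~= d s(x, e) / d e at e = 0, by central differences."""
    return (s(x, eps) - s(x, -eps)) / (2 * eps)

def omega_term(J, tau):
    """Per-pattern contribution: 0.5 * sum_k (sum_i J_ki tau_i)^2."""
    return 0.5 * np.sum((J @ tau) ** 2)
```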

5.5.5 Training with transformed data

Consider a transformation governed by a single parameter $\epsilon$ and described by the function $s(x, \epsilon)$. A sum-of-squares error function for untransformed inputs can be written in the form:

$$E = \frac{1}{2}\iint \{ y(x) - t \}^2 p(t|x)\, p(x)\, dx\, dt$$

If the parameter $\epsilon$ is drawn from a distribution $p(\epsilon)$, then the error function averaged over the transformed inputs becomes:

$$\tilde{E} = \frac{1}{2}\iiint \{ y(s(x, \epsilon)) - t \}^2 p(t|x)\, p(x)\, p(\epsilon)\, dx\, dt\, d\epsilon$$

Further assume that $p(\epsilon)$ has zero mean with small variance. After a Taylor expansion in $\epsilon$ and substitution into the mean error function, the average error becomes

$$\tilde{E} = E + \lambda\Omega$$

where $E$ is the original sum-of-squares error, and the regularization term $\Omega$ takes the form:

$$\Omega = \frac{1}{2}\int \left[ \{ y(x) - \mathbb{E}[t|x] \}\left\{ (\tau')^T \nabla y(x) + \tau^T \nabla\nabla y(x)\, \tau \right\} + \left(\tau^T \nabla y(x)\right)^2 \right] p(x)\, dx$$

5.5.6 Convolutional networks

5.5.7 Soft weight sharing

In this part, the hard constraint of equal weights is replaced by a form of regularization in which groups of weights are encouraged to have similar values. Furthermore, the division of weights into groups, the mean weight value for each group, and the spread of values within the groups are all determined as part of the learning process.

5.6 Mixture Density Networks

We develop the model explicitly for Gaussian components, so that:

$$p(t|x) = \sum_{k=1}^{K} \pi_k(x)\, \mathcal{N}\left(t \mid \mu_k(x), \sigma_k^2(x)\mathbf{I}\right)$$

For independent data, the error function takes the form:

$$E(w) = -\sum_{n=1}^{N}\ln\left\{ \sum_{k=1}^{K} \pi_k(x_n, w)\, \mathcal{N}\left(t_n \mid \mu_k(x_n, w), \sigma_k^2(x_n, w)\mathbf{I}\right) \right\}$$
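
A per-pattern sketch of this error, assuming the network heads have already produced `pi` (softmax mixing coefficients, shape `(K,)`), `mu` (component means, shape `(K, L)`), and `sigma2` (isotropic variances, shape `(K,)`), with target `t` of shape `(L,)`; the log-sum-exp is a standard numerical-stability device, not part of the book's formula:

```python
import numpy as np

def mdn_nll(pi, mu, sigma2, t):
    L = t.size
    sq_dist = np.sum((mu - t) ** 2, axis=1)       # ||mu_k(x) - t||^2
    log_gauss = -0.5 * (L * np.log(2 * np.pi * sigma2) + sq_dist / sigma2)
    a = np.log(pi) + log_gauss                    # ln { pi_k N(t | mu_k, sigma2_k I) }
    m = a.max()
    return -(m + np.log(np.sum(np.exp(a - m))))   # -ln sum_k exp(a_k)
```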

5.7 Bayesian Neural Networks

In this part, we will approximate the posterior distribution by a Gaussian, centred at a mode of the true posterior. We will also assume that the covariance of this Gaussian is small, so that the network function is approximately linear.

5.7.1 Posterior parameter distribution

We suppose that the conditional distribution $p(t|x)$ is Gaussian:

$$p(t|x, w, \beta) = \mathcal{N}\left(t \mid y(x, w), \beta^{-1}\right)$$

Also, we choose a prior distribution over the weights $w$ that is Gaussian, of the form:

$$p(w|\alpha) = \mathcal{N}\left(w \mid 0, \alpha^{-1}\mathbf{I}\right)$$

For an i.i.d. data set of $N$ observations $x_1, \ldots, x_N$, with a corresponding set of target values $D = \{t_1, \ldots, t_N\}$, the likelihood function is given by:

$$p(D|w, \beta) = \prod_{n=1}^{N}\mathcal{N}\left(t_n \mid y(x_n, w), \beta^{-1}\right)$$

so we obtain the posterior distribution:

$$p(w|D, \alpha, \beta) \propto p(w|\alpha)\, p(D|w, \beta)$$

The Gaussian (Laplace) approximation to the posterior, centred at a mode $w_{MAP}$ with $\mathbf{A}$ the Hessian of the negative log posterior, is given by:

$$q(w|D) = \mathcal{N}\left(w \mid w_{MAP}, \mathbf{A}^{-1}\right)$$

Similarly, the predictive distribution is obtained by marginalizing with respect to this posterior distribution:

$$p(t|x, D) = \int p(t|x, w)\, q(w|D)\, dw$$

Making a Taylor series expansion of the network function around $w_{MAP}$ and retaining only the linear terms, we obtain a linear-Gaussian model:

$$p(t|x, w, \beta) \simeq \mathcal{N}\left(t \mid y(x, w_{MAP}) + g^T(w - w_{MAP}), \beta^{-1}\right)$$

We can therefore make use of the general result for the marginal distribution of a linear-Gaussian model to give:

$$p(t|x, D, \alpha, \beta) = \mathcal{N}\left(t \mid y(x, w_{MAP}), \sigma^2(x)\right)$$

where

$$\sigma^2(x) = \beta^{-1} + g^T \mathbf{A}^{-1} g$$

$$g = \nabla_w y(x, w)|_{w = w_{MAP}}$$
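
Given these ingredients, the predictive variance is a few lines of code; a sketch, assuming `g`, `A`, and `beta` have been computed beforehand (e.g. `g` by backpropagation and `A` from a Hessian approximation such as the outer-product one above):

```python
import numpy as np

def predictive_variance(g, A, beta):
    """sigma^2(x) = 1/beta + g^T A^{-1} g (solve avoids forming A^{-1})."""
    return 1.0 / beta + g @ np.linalg.solve(A, g)
```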

5.7.2 Hyperparameter optimization

5.7.3 Bayesian neural networks for classification

Consider a network with a logistic sigmoid output corresponding to a two-class classification problem. The log likelihood function for this model is given by:

$$\ln p(D|w) = \sum_{n=1}^{N}\{ t_n \ln y_n + (1 - t_n)\ln(1 - y_n) \}$$

Minimizing the regularized error function:

$$E(w) = -\ln p(D|w) + \frac{\alpha}{2} w^T w$$

The resulting approximate predictive distribution is

$$p(t=1 \mid x, D) = \sigma\left(\kappa(\sigma_a^2)\, a_{MAP}(x)\right)$$

where $a_{MAP}(x)$ is the output-unit activation evaluated at $w_{MAP}$, $\sigma_a^2(x) = b^T \mathbf{A}^{-1} b$ with $b = \nabla a(x, w_{MAP})$, and $\kappa(\sigma^2) = (1 + \pi\sigma^2/8)^{-1/2}$.
