Andrew Ng · Machine Learning || chap9 Neural Network : Learning (notes)

9 Neural Network : Learning

9-1 Cost function

Neural Network(classification)

$(x^{(1)},y^{(1)}),(x^{(2)},y^{(2)}),\cdots,(x^{(m)},y^{(m)})$

$L$ = total no. of layers in network

$s_l$ = no. of units (not counting bias unit) in layer $l$

Comparison: binary vs. multi-class ($K$ classes) classification

Binary classification

$y = 0$ or $1$

1 output unit

Multi-class classification(K classes)

$y \in \mathbb{R}^K$

$K$ output units

Cost function

​ Logistic regression:

$J(\theta) = -\frac{1}{m}\left[\sum_{i=1}^{m} y^{(i)}\log h_\theta(x^{(i)}) + (1-y^{(i)})\log(1-h_\theta(x^{(i)}))\right] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2$

​ Neural network:

$h_\Theta(x) \in \mathbb{R}^K, \quad (h_\Theta(x))_i = i^{\text{th}} \text{ output}$

$J(\Theta) = -\frac{1}{m}\left[\sum_{i=1}^{m}\sum_{k=1}^{K} y_k^{(i)}\log(h_\Theta(x^{(i)}))_k + (1-y_k^{(i)})\log(1-(h_\Theta(x^{(i)}))_k)\right] + \frac{\lambda}{2m}\sum_{l=1}^{L-1}\sum_{i=1}^{s_l}\sum_{j=1}^{s_{l+1}}(\Theta_{ji}^{(l)})^2$
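As a concrete sketch of this cost (unofficial; the names `nn_cost` and `sigmoid`, and the convention that each `Theta` matrix carries its bias column first, are my own assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nn_cost(Thetas, X, Y, lam):
    """Regularized neural-network cost J(Theta).

    Thetas : list of weight matrices Theta^(l), bias column first
    X      : (m, n) inputs;  Y : (m, K) one-hot labels
    """
    m = X.shape[0]
    A = X
    for Theta in Thetas:                     # forward propagation
        A = np.hstack([np.ones((m, 1)), A])  # add bias unit a_0 = 1
        A = sigmoid(A @ Theta.T)
    H = A                                    # (m, K) hypothesis h_Theta(x)
    # cross-entropy summed over examples i and classes k
    J = -np.sum(Y * np.log(H) + (1 - Y) * np.log(1 - H)) / m
    # regularization: skip the bias column (j = 0) of every Theta^(l)
    J += lam / (2 * m) * sum(np.sum(T[:, 1:] ** 2) for T in Thetas)
    return J
```

With all-zero weights every output is $g(0)=0.5$, so for one-hot $Y$ the cost reduces to $K\log 2$, a handy sanity check.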

9-2 Backpropagation algorithm


Gradient computation

$\min_\Theta J(\Theta)$

Need code to compute:

  • $J(\Theta)$
  • $\frac{\partial}{\partial\Theta_{ij}^{(l)}}J(\Theta)$

Given one training example (x,y):

​ Forward propagation:

$$\begin{array}{l}
a^{(1)}=x \\
z^{(2)}=\Theta^{(1)}a^{(1)} \\
a^{(2)}=g(z^{(2)}) \quad (\text{add } a_0^{(2)}) \\
z^{(3)}=\Theta^{(2)}a^{(2)} \\
a^{(3)}=g(z^{(3)}) \quad (\text{add } a_0^{(3)}) \\
z^{(4)}=\Theta^{(3)}a^{(3)} \\
a^{(4)}=h_\Theta(x)=g(z^{(4)})
\end{array}$$
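The forward pass above can be sketched in NumPy for one example (names are mine; the network depth follows from however many weight matrices are supplied):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_prop(Thetas, x):
    """Forward propagation for a single example x (1-D feature vector).

    Returns the activations a^(1), ..., a^(L); a bias unit a_0 = 1 is
    prepended before each matrix-vector product, as in the equations above.
    """
    a = x
    activations = [a]
    for Theta in Thetas:
        a = np.concatenate([[1.0], a])  # add a_0 (bias unit)
        z = Theta @ a
        a = sigmoid(z)
        activations.append(a)
    return activations                  # activations[-1] is h_Theta(x)
```

For instance, a 2-input, 3-hidden-unit, 1-output network uses weight shapes $(3,3)$ and $(1,4)$ (one extra column per bias unit).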

Gradient computation: Backpropagation algorithm

Intuition: $\delta_j^{(l)} =$ "error" of node $j$ in layer $l$

For each output unit (layer $L = 4$):

$\delta_j^{(4)} = a_j^{(4)} - y_j$

$$\begin{array}{l}
\delta^{(3)} = (\Theta^{(3)})^T\delta^{(4)} \mathbin{.*} g'(z^{(3)}) \\
\delta^{(2)} = (\Theta^{(2)})^T\delta^{(3)} \mathbin{.*} g'(z^{(2)})
\end{array}$$

(There is no $\delta^{(1)}$, since the input layer has no error term.)
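The $g'(z^{(l)})$ factor above equals $a^{(l)} \mathbin{.*} (1-a^{(l)})$ for the sigmoid, since $g'(z) = g(z)(1-g(z))$. A quick NumPy check of this identity (a sketch; function names are mine):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_gradient(z):
    """g'(z) for the sigmoid: g(z) * (1 - g(z))."""
    g = sigmoid(z)
    return g * (1 - g)
```

Comparing `sigmoid_gradient` against a two-sided finite difference of `sigmoid` confirms the identity numerically.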

Backpropagation algorithm

Training set: $(x^{(1)},y^{(1)}),(x^{(2)},y^{(2)}),\cdots,(x^{(m)},y^{(m)})$

Set $\Delta_{ij}^{(l)} = 0$ (for all $l, i, j$)

$\Delta_{ij}^{(l)}$ is used to compute $\frac{\partial}{\partial\Theta_{ij}^{(l)}}J(\Theta)$

For $i = 1$ to $m$:

Set $a^{(1)} = x^{(i)}$

Perform forward propagation to compute $a^{(l)}$ for $l = 2, 3, \cdots, L$

Using $y^{(i)}$, compute $\delta^{(L)} = a^{(L)} - y^{(i)}$

Compute $\delta^{(L-1)}, \delta^{(L-2)}, \cdots, \delta^{(2)}$

$\Delta_{ij}^{(l)} := \Delta_{ij}^{(l)} + a_j^{(l)}\delta_i^{(l+1)}$

$$\begin{array}{l}
D_{ij}^{(l)} := \frac{1}{m}\Delta_{ij}^{(l)} + \lambda\Theta_{ij}^{(l)} \quad \text{if } j \neq 0 \\
D_{ij}^{(l)} := \frac{1}{m}\Delta_{ij}^{(l)} \quad \text{if } j = 0
\end{array}$$

$\frac{\partial}{\partial\Theta_{ij}^{(l)}}J(\Theta) = D_{ij}^{(l)}$
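The whole loop can be put together as a minimal NumPy sketch (function name and vectorized shapes are my own; note I divide the regularization term by $m$ so that $D$ is exactly the gradient of the $\frac{\lambda}{2m}\sum(\Theta_{ji}^{(l)})^2$ penalty in $J(\Theta)$):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_gradients(Thetas, X, Y, lam):
    """Accumulate Delta^(l) over all m examples and return the D^(l).

    Thetas : list of weight matrices (bias column first)
    X : (m, n) inputs;  Y : (m, K) one-hot labels
    """
    m = X.shape[0]
    Deltas = [np.zeros_like(T) for T in Thetas]
    for x, y in zip(X, Y):
        # forward propagation, keeping each layer's biased activation
        a = x
        acts = []
        for Theta in Thetas:
            a = np.concatenate([[1.0], a])       # add a_0 = 1
            acts.append(a)
            a = sigmoid(Theta @ a)
        delta = a - y                            # delta^(L) = a^(L) - y
        # back-propagate the error and accumulate Delta^(l)
        for l in range(len(Thetas) - 1, -1, -1):
            Deltas[l] += np.outer(delta, acts[l])
            if l > 0:
                # strip bias row; g'(z) = a .* (1 - a)
                delta = (Thetas[l].T @ delta)[1:] * acts[l][1:] * (1 - acts[l][1:])
    Ds = []
    for T, Delta in zip(Thetas, Deltas):
        D = Delta / m
        D[:, 1:] += (lam / m) * T[:, 1:]         # regularize, skip j = 0
        Ds.append(D)
    return Ds
```

With a single zero weight matrix this reduces to the logistic-regression gradient $\frac{1}{m}\sum_i (0.5 - y^{(i)})\,[1, x^{(i)}]$, which is easy to verify by hand.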

9-3 Backpropagation intuition

Forward propagation

$x_1^{(i)}, x_2^{(i)}, \dots \xRightarrow{z^{(2)}=\Theta^{(1)}a^{(1)}+\cdots} z_1, z_2, \cdots \xRightarrow{a^{(2)}=g(z^{(2)})} a_1, a_2, \cdots \Rightarrow \cdots$

What is backpropagation doing?

$J(\Theta) = -\frac{1}{m}\left[\sum_{i=1}^{m} y^{(i)}\log(h_\Theta(x^{(i)})) + (1-y^{(i)})\log(1-h_\Theta(x^{(i)}))\right] + \frac{\lambda}{2m}\sum_{l=1}^{L-1}\sum_{i=1}^{s_l}\sum_{j=1}^{s_{l+1}}(\Theta_{ji}^{(l)})^2$
Focusing on a single example $(x^{(i)}, y^{(i)})$, the case of 1 output unit, and ignoring regularization ($\lambda = 0$):

$\operatorname{cost}(i) = y^{(i)}\log h_\Theta(x^{(i)}) + (1-y^{(i)})\log(1 - h_\Theta(x^{(i)}))$

(Think of $\operatorname{cost}(i) \approx (h_\Theta(x^{(i)}) - y^{(i)})^2$.)

I.e. how well is the network doing on example i?

Backpropagation

$\delta_j^{(l)} =$ "error" of cost for $a_j^{(l)}$ (unit $j$ in layer $l$).

Formally, $\delta_j^{(l)} = \frac{\partial}{\partial z_j^{(l)}}\operatorname{cost}(i)$ (for $j \ge 0$), where $\operatorname{cost}(i) = y^{(i)}\log h_\Theta(x^{(i)}) + (1-y^{(i)})\log(1 - h_\Theta(x^{(i)}))$

9-4 Implementation note: Unrolling parameters

Advanced optimization

function [jval, gradient] = costFunction(theta)
...
optTheta = fminunc(@costFunction, initialTheta, options)

Neural Network(L=4):

$\Theta^{(1)}, \Theta^{(2)}, \Theta^{(3)}$: matrices (Theta1, Theta2, Theta3)

$D^{(1)}, D^{(2)}, D^{(3)}$: matrices (D1, D2, D3)

“unroll” into vectors

Example

$$\begin{array}{l}
s_1 = 10, s_2 = 10, s_3 = 1 \\
\Theta^{(1)} \in \mathbb{R}^{10\times11}, \Theta^{(2)} \in \mathbb{R}^{10\times11}, \Theta^{(3)} \in \mathbb{R}^{1\times11} \\
D^{(1)} \in \mathbb{R}^{10\times11}, D^{(2)} \in \mathbb{R}^{10\times11}, D^{(3)} \in \mathbb{R}^{1\times11}
\end{array}$$

thetaVec = [Theta1(:); Theta2(:); Theta3(:)];
Dvec = [D1(:); D2(:); D3(:)];

Theta1 = reshape(thetaVec(1:110), 10, 11);
Theta2 = reshape(thetaVec(111:220), 10, 11);
Theta3 = reshape(thetaVec(221:231), 1, 11);
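The same unroll/reshape round-trip in NumPy (a sketch; note that Octave's `(:)` is column-major, so `order='F'` is used to mirror it):

```python
import numpy as np

# three weight matrices with the shapes from the example above
Theta1 = np.arange(110.0).reshape(10, 11)
Theta2 = np.arange(110.0, 220.0).reshape(10, 11)
Theta3 = np.arange(220.0, 231.0).reshape(1, 11)

# "unroll" into one long vector (order='F' mimics Octave's column-major (:))
thetaVec = np.concatenate([T.ravel(order='F') for T in (Theta1, Theta2, Theta3)])

# recover the matrices, mirroring the reshape(...) calls
T1 = thetaVec[0:110].reshape(10, 11, order='F')
T2 = thetaVec[110:220].reshape(10, 11, order='F')
T3 = thetaVec[220:231].reshape(1, 11, order='F')
```

Since the same memory order is used for both ravel and reshape, the round-trip recovers each matrix exactly.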

Learning Algorithm

Have initial parameters Θ ( 1 ) , Θ ( 2 ) , Θ ( 3 ) \Theta^{(1)},\Theta^{(2)},\Theta^{(3)} Θ(1),Θ(2),Θ(3)

Unroll to get initialTheta to pass to

fminunc(@costFunction, initialTheta, options)
function [jval, gradientVec] = costFunction(thetaVec)

From thetaVec, get $\Theta^{(1)}, \Theta^{(2)}, \Theta^{(3)}$

Use forward prop/back prop to compute $D^{(1)}, D^{(2)}, D^{(3)}$ and $J(\Theta)$

Unroll $D^{(1)}, D^{(2)}, D^{(3)}$ to get gradientVec

9-5 Gradient checking

Numerical estimation of gradient

$\frac{d}{d\theta}J(\theta) \approx \frac{J(\theta+\epsilon) - J(\theta-\epsilon)}{2\epsilon}$

Implement

gradApprox = (J(theta + EPSILON) - J(theta - EPSILON)) / (2*EPSILON);

Parameter vector $\theta$

$\theta \in \mathbb{R}^n$ (e.g. $\theta$ is the "unrolled" version of $\Theta^{(1)}, \Theta^{(2)}, \Theta^{(3)}$)

$\theta = \theta_1, \theta_2, \theta_3, \cdots, \theta_n$

$$\begin{array}{l}
\frac{\partial}{\partial\theta_1}J(\theta) \approx \frac{J(\theta_1+\epsilon,\theta_2,\theta_3,\cdots,\theta_n) - J(\theta_1-\epsilon,\theta_2,\theta_3,\cdots,\theta_n)}{2\epsilon} \\
\frac{\partial}{\partial\theta_2}J(\theta) \approx \frac{J(\theta_1,\theta_2+\epsilon,\theta_3,\cdots,\theta_n) - J(\theta_1,\theta_2-\epsilon,\theta_3,\cdots,\theta_n)}{2\epsilon} \\
\vdots \\
\frac{\partial}{\partial\theta_n}J(\theta) \approx \frac{J(\theta_1,\theta_2,\theta_3,\cdots,\theta_n+\epsilon) - J(\theta_1,\theta_2,\theta_3,\cdots,\theta_n-\epsilon)}{2\epsilon}
\end{array}$$

for i=1:n,
	thetaPlus=theta;
	thetaPlus(i)=thetaPlus(i)+EPSILON;
	thetaMinus=theta;
	thetaMinus(i)=thetaMinus(i)-EPSILON;
	gradApprox(i)=(J(thetaPlus)-J(thetaMinus))/(2*EPSILON);
end;

Check that gradApprox $\approx$ Dvec
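The check can be demonstrated end-to-end on a toy cost whose exact gradient is known, here $J(\theta)=\sum_j \theta_j^2$ with gradient $2\theta$ (a sketch; function names are my own):

```python
import numpy as np

def J(theta):
    return np.sum(theta ** 2)       # toy cost; exact gradient is 2*theta

def numerical_gradient(J, theta, eps=1e-4):
    """Two-sided difference, one coordinate at a time (as in the loop above)."""
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        theta_plus = theta.copy()
        theta_plus[i] += eps
        theta_minus = theta.copy()
        theta_minus[i] -= eps
        grad[i] = (J(theta_plus) - J(theta_minus)) / (2 * eps)
    return grad
```

In a real network `J` would be the unrolled cost function and the result would be compared against Dvec from backprop.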

Implementation Note:

  • Implement backprop to compute Dvec (unrolled $D^{(1)}, D^{(2)}, D^{(3)}$)
  • Implement numerical gradient check to compute gradApprox
  • Make sure they give similar values
  • Turn off gradient checking; use backprop code for learning

Important:

  • Be sure to disable your gradient checking code before training your classifier. If you run the numerical gradient computation on every iteration of gradient descent (or in the inner loop of costFunction(...)), your code will be very slow.

9-6 Random initialization

Initial value of $\Theta$

For gradient descent and advanced optimization methods, we need an initial value for $\Theta$

optTheta = fminunc(@costFunction,initialTheta,options)

Consider gradient descent

Set initialTheta = zeros(n,1)? ✗ (this does not work)

Zero initialization

$\Theta_{ij}^{(l)} = 0$ for all $i, j, l$.

After each update, the parameters corresponding to inputs going into each of two hidden units are identical, so the hidden units all compute the same function.

Instead, initialize each $\Theta_{ij}^{(l)}$ to a random value in $[-\epsilon, \epsilon]$ (i.e. $-\epsilon \le \Theta_{ij}^{(l)} \le \epsilon$)

E.g.

Theta1 = rand(10,11)*(2*INIT_EPSILON)-INIT_EPSILON;
Theta2 = rand(1,11)*(2*INIT_EPSILON)-INIT_EPSILON;
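A NumPy equivalent of the two Octave lines (a sketch; the value of INIT_EPSILON is an assumption, and the fixed seed is only for reproducibility):

```python
import numpy as np

INIT_EPSILON = 0.12   # assumed small constant; any small value works

rng = np.random.default_rng(0)
# uniform in [-INIT_EPSILON, INIT_EPSILON], same shapes as the Octave example
Theta1 = rng.random((10, 11)) * (2 * INIT_EPSILON) - INIT_EPSILON
Theta2 = rng.random((1, 11)) * (2 * INIT_EPSILON) - INIT_EPSILON
```

Every entry lands strictly inside the $[-\epsilon, \epsilon]$ band, which breaks the symmetry that zero initialization would cause.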

9-7 Putting it together

Training a neural network

Pick a network architecture (connectivity pattern between neurons):

  • No. of input units: dimension of features $x^{(i)}$

  • No. of output units: number of classes

    $y^{(1)}=\begin{bmatrix}1\\0\\0\\0\\0\end{bmatrix}, y^{(2)}=\begin{bmatrix}0\\1\\0\\0\\0\end{bmatrix},\cdots$

  • Reasonable default: 1 hidden layer; or, if >1 hidden layer, have the same no. of hidden units in every layer (usually the more the better).

  1. Randomly initialize weights
  2. Implement forward propagation to get $h_\Theta(x^{(i)})$ for any $x^{(i)}$
  3. Implement code to compute cost function $J(\Theta)$
  4. Implement backprop to compute partial derivatives $\frac{\partial}{\partial\Theta_{jk}^{(l)}}J(\Theta)$

for i = 1:m

Perform forward propagation and backpropagation using example $(x^{(i)}, y^{(i)})$
(Get activations $a^{(l)}$ and delta terms $\delta^{(l)}$ for $l = 2, \cdots, L$)

  5. Use gradient checking to compare $\frac{\partial}{\partial\Theta_{jk}^{(l)}}J(\Theta)$ computed using backpropagation vs. the numerical estimate of the gradient of $J(\Theta)$.
    Then disable gradient checking code.
  6. Use gradient descent or an advanced optimization method with backpropagation to try to minimize $J(\Theta)$ as a function of parameters $\Theta$

9-8 Autonomous driving example
