Andrew Ng · Machine Learning || chap9 Neural Network : Learning (notes)

9 Neural Network : Learning

9-1 Cost function

Neural Network(classification)

$(x^{(1)},y^{(1)}),(x^{(2)},y^{(2)}),\cdots,(x^{(m)},y^{(m)})$

$L$ = total no. of layers in network

$s_l$ = no. of units (not counting bias unit) in layer $l$

Comparison: binary vs. multi-class ($K$ classes) classification

Binary classification

$y = 0$ or $1$

1 output unit

Multi-class classification(K classes)

$y \in \mathbb{R}^K$

$K$ output units

Cost function

​ Logistic regression:

$J(\theta) = -\frac{1}{m}\left[\sum_{i=1}^{m} y^{(i)}\log h_\theta(x^{(i)}) + (1-y^{(i)})\log(1-h_\theta(x^{(i)}))\right] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2$

​ Neural network:

$h_\Theta(x) \in \mathbb{R}^K, \quad (h_\Theta(x))_i = i^{\text{th}} \text{ output}$

$J(\Theta) = -\frac{1}{m}\left[\sum_{i=1}^{m}\sum_{k=1}^{K} y_k^{(i)}\log(h_\Theta(x^{(i)}))_k + (1-y_k^{(i)})\log(1-(h_\Theta(x^{(i)}))_k)\right] + \frac{\lambda}{2m}\sum_{l=1}^{L-1}\sum_{i=1}^{s_l}\sum_{j=1}^{s_{l+1}}(\Theta_{ji}^{(l)})^2$
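As a concrete sketch of this cost (unofficial; the names `nn_cost` and `sigmoid`, and the convention that each `Theta` matrix carries its bias column first, are my own assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nn_cost(Thetas, X, Y, lam):
    """Regularized neural-network cost J(Theta).

    Thetas : list of weight matrices Theta^(l), bias column first
    X      : (m, n) inputs;  Y : (m, K) one-hot labels
    """
    m = X.shape[0]
    A = X
    for Theta in Thetas:                     # forward propagation
        A = np.hstack([np.ones((m, 1)), A])  # add bias unit a_0 = 1
        A = sigmoid(A @ Theta.T)
    H = A                                    # (m, K) hypothesis h_Theta(x)
    # cross-entropy summed over examples i and classes k
    J = -np.sum(Y * np.log(H) + (1 - Y) * np.log(1 - H)) / m
    # regularization: skip the bias column (j = 0) of every Theta^(l)
    J += lam / (2 * m) * sum(np.sum(T[:, 1:] ** 2) for T in Thetas)
    return J
```

With all-zero weights every output is $g(0)=0.5$, so for one-hot $Y$ the cost reduces to $K\log 2$, a handy sanity check.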

9-2 Backpropagation algorithm


Gradient computation

$\min_\Theta J(\Theta)$

Need code to compute:

  • $J(\Theta)$
  • $\frac{\partial}{\partial\Theta_{ij}^{(l)}}J(\Theta)$

Given one training example (x,y):

​ Forward propagation:

$$\begin{array}{l}
a^{(1)}=x \\
z^{(2)}=\Theta^{(1)}a^{(1)} \\
a^{(2)}=g(z^{(2)}) \quad (\text{add } a_0^{(2)}) \\
z^{(3)}=\Theta^{(2)}a^{(2)} \\
a^{(3)}=g(z^{(3)}) \quad (\text{add } a_0^{(3)}) \\
z^{(4)}=\Theta^{(3)}a^{(3)} \\
a^{(4)}=h_\Theta(x)=g(z^{(4)})
\end{array}$$
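The forward pass above can be sketched in NumPy for one example (names are mine; the network depth follows from however many weight matrices are supplied):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_prop(Thetas, x):
    """Forward propagation for a single example x (1-D feature vector).

    Returns the activations a^(1), ..., a^(L); a bias unit a_0 = 1 is
    prepended before each matrix-vector product, as in the equations above.
    """
    a = x
    activations = [a]
    for Theta in Thetas:
        a = np.concatenate([[1.0], a])  # add a_0 (bias unit)
        z = Theta @ a
        a = sigmoid(z)
        activations.append(a)
    return activations                  # activations[-1] is h_Theta(x)
```

For instance, a 2-input, 3-hidden-unit, 1-output network uses weight shapes $(3,3)$ and $(1,4)$ (one extra column per bias unit).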

Gradient computation: Backpropagation algorithm

Intuition: $\delta_j^{(l)} =$ "error" of node $j$ in layer $l$

For each output unit (layer $L = 4$):

$\delta_j^{(4)} = a_j^{(4)} - y_j$

$$\begin{array}{l}
\delta^{(3)} = (\Theta^{(3)})^T\delta^{(4)} \mathbin{.*} g'(z^{(3)}) \\
\delta^{(2)} = (\Theta^{(2)})^T\delta^{(3)} \mathbin{.*} g'(z^{(2)})
\end{array}$$

(There is no $\delta^{(1)}$, since the input layer has no error term.)
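The $g'(z^{(l)})$ factor above equals $a^{(l)} \mathbin{.*} (1-a^{(l)})$ for the sigmoid, since $g'(z) = g(z)(1-g(z))$. A quick NumPy check of this identity (a sketch; function names are mine):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_gradient(z):
    """g'(z) for the sigmoid: g(z) * (1 - g(z))."""
    g = sigmoid(z)
    return g * (1 - g)
```

Comparing `sigmoid_gradient` against a two-sided finite difference of `sigmoid` confirms the identity numerically.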

Backpropagation algorithm

Training set: $(x^{(1)},y^{(1)}),(x^{(2)},y^{(2)}),\cdots,(x^{(m)},y^{(m)})$

Set $\Delta_{ij}^{(l)} = 0$ (for all $l, i, j$)

$\Delta_{ij}^{(l)}$ is used to compute $\frac{\partial}{\partial\Theta_{ij}^{(l)}}J(\Theta)$

For $i = 1$ to $m$:

Set $a^{(1)} = x^{(i)}$

Perform forward propagation to compute $a^{(l)}$ for $l = 2, 3, \cdots, L$

Using $y^{(i)}$, compute $\delta^{(L)} = a^{(L)} - y^{(i)}$

Compute $\delta^{(L-1)}, \delta^{(L-2)}, \cdots, \delta^{(2)}$

$\Delta_{ij}^{(l)} := \Delta_{ij}^{(l)} + a_j^{(l)}\delta_i^{(l+1)}$

$$\begin{array}{l}
D_{ij}^{(l)} := \frac{1}{m}\Delta_{ij}^{(l)} + \lambda\Theta_{ij}^{(l)} \quad \text{if } j \neq 0 \\
D_{ij}^{(l)} := \frac{1}{m}\Delta_{ij}^{(l)} \quad \text{if } j = 0
\end{array}$$

$\frac{\partial}{\partial\Theta_{ij}^{(l)}}J(\Theta) = D_{ij}^{(l)}$
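The whole loop can be put together as a minimal NumPy sketch (function name and vectorized shapes are my own; note I divide the regularization term by $m$ so that $D$ is exactly the gradient of the $\frac{\lambda}{2m}\sum(\Theta_{ji}^{(l)})^2$ penalty in $J(\Theta)$):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_gradients(Thetas, X, Y, lam):
    """Accumulate Delta^(l) over all m examples and return the D^(l).

    Thetas : list of weight matrices (bias column first)
    X : (m, n) inputs;  Y : (m, K) one-hot labels
    """
    m = X.shape[0]
    Deltas = [np.zeros_like(T) for T in Thetas]
    for x, y in zip(X, Y):
        # forward propagation, keeping each layer's biased activation
        a = x
        acts = []
        for Theta in Thetas:
            a = np.concatenate([[1.0], a])       # add a_0 = 1
            acts.append(a)
            a = sigmoid(Theta @ a)
        delta = a - y                            # delta^(L) = a^(L) - y
        # back-propagate the error and accumulate Delta^(l)
        for l in range(len(Thetas) - 1, -1, -1):
            Deltas[l] += np.outer(delta, acts[l])
            if l > 0:
                # strip bias row; g'(z) = a .* (1 - a)
                delta = (Thetas[l].T @ delta)[1:] * acts[l][1:] * (1 - acts[l][1:])
    Ds = []
    for T, Delta in zip(Thetas, Deltas):
        D = Delta / m
        D[:, 1:] += (lam / m) * T[:, 1:]         # regularize, skip j = 0
        Ds.append(D)
    return Ds
```

With a single zero weight matrix this reduces to the logistic-regression gradient $\frac{1}{m}\sum_i (0.5 - y^{(i)})\,[1, x^{(i)}]$, which is easy to verify by hand.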

9-3 Backpropagation intuition

Forward propagation

$x_1^{(i)}, x_2^{(i)}, \dots \xRightarrow{z^{(2)}=\Theta^{(1)}a^{(1)}+\cdots} z_1, z_2, \cdots \xRightarrow{a^{(2)}=g(z^{(2)})} a_1, a_2, \cdots \Rightarrow \cdots$

What is backpropagation doing?

$J(\Theta) = -\frac{1}{m}\left[\sum_{i=1}^{m} y^{(i)}\log(h_\Theta(x^{(i)})) + (1-y^{(i)})\log(1-h_\Theta(x^{(i)}))\right] + \frac{\lambda}{2m}\sum_{l=1}^{L-1}\sum_{i=1}^{s_l}\sum_{j=1}^{s_{l+1}}(\Theta_{ji}^{(l)})^2$
Focusing on a single example $(x^{(i)}, y^{(i)})$, the case of 1 output unit, and ignoring regularization ($\lambda = 0$):

$\operatorname{cost}(i) = y^{(i)}\log h_\Theta(x^{(i)}) + (1-y^{(i)})\log(1 - h_\Theta(x^{(i)}))$

(Think of $\operatorname{cost}(i) \approx (h_\Theta(x^{(i)}) - y^{(i)})^2$.)

I.e. how well is the network doing on example i?

Backpropagation

$\delta_j^{(l)} =$ "error" of cost for $a_j^{(l)}$ (unit $j$ in layer $l$).

Formally, $\delta_j^{(l)} = \frac{\partial}{\partial z_j^{(l)}}\operatorname{cost}(i)$ (for $j \ge 0$), where $\operatorname{cost}(i) = y^{(i)}\log h_\Theta(x^{(i)}) + (1-y^{(i)})\log(1 - h_\Theta(x^{(i)}))$

9-4 Implementation note: Unrolling parameters

Advanced optimization

function [jval, gradient] = costFunction(theta)
...
optTheta = fminunc(@costFunction, initialTheta, options)

Neural Network(L=4):

$\Theta^{(1)}, \Theta^{(2)}, \Theta^{(3)}$: matrices (Theta1, Theta2, Theta3)

$D^{(1)}, D^{(2)}, D^{(3)}$: matrices (D1, D2, D3)

“unroll” into vectors

Example

$$\begin{array}{l}
s_1 = 10, s_2 = 10, s_3 = 1 \\
\Theta^{(1)} \in \mathbb{R}^{10\times11}, \Theta^{(2)} \in \mathbb{R}^{10\times11}, \Theta^{(3)} \in \mathbb{R}^{1\times11} \\
D^{(1)} \in \mathbb{R}^{10\times11}, D^{(2)} \in \mathbb{R}^{10\times11}, D^{(3)} \in \mathbb{R}^{1\times11}
\end{array}$$

thetaVec = [Theta1(:); Theta2(:); Theta3(:)];
Dvec = [D1(:); D2(:); D3(:)];

Theta1 = reshape(thetaVec(1:110), 10, 11);
Theta2 = reshape(thetaVec(111:220), 10, 11);
Theta3 = reshape(thetaVec(221:231), 1, 11);
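The same unroll/reshape round-trip in NumPy (a sketch; note that Octave's `(:)` is column-major, so `order='F'` is used to mirror it):

```python
import numpy as np

# three weight matrices with the shapes from the example above
Theta1 = np.arange(110.0).reshape(10, 11)
Theta2 = np.arange(110.0, 220.0).reshape(10, 11)
Theta3 = np.arange(220.0, 231.0).reshape(1, 11)

# "unroll" into one long vector (order='F' mimics Octave's column-major (:))
thetaVec = np.concatenate([T.ravel(order='F') for T in (Theta1, Theta2, Theta3)])

# recover the matrices, mirroring the reshape(...) calls
T1 = thetaVec[0:110].reshape(10, 11, order='F')
T2 = thetaVec[110:220].reshape(10, 11, order='F')
T3 = thetaVec[220:231].reshape(1, 11, order='F')
```

Since the same memory order is used for both ravel and reshape, the round-trip recovers each matrix exactly.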

Learning Algorithm

Have initial parameters Θ ( 1 ) , Θ ( 2 ) , Θ ( 3 ) \Theta^{(1)},\Theta^{(2)},\Theta^{(3)} Θ(1),Θ(2),Θ(3)

Unroll to get initialTheta to pass to

fminunc(@costFunction, initialTheta, options)
function [jval, gradientVec] = costFunction(thetaVec)

From thetaVec, get $\Theta^{(1)}, \Theta^{(2)}, \Theta^{(3)}$

Use forward prop/back prop to compute $D^{(1)}, D^{(2)}, D^{(3)}$ and $J(\Theta)$

Unroll $D^{(1)}, D^{(2)}, D^{(3)}$ to get gradientVec

9-5 Gradient checking

Numerical estimation of gradient

$\frac{d}{d\theta}J(\theta) \approx \frac{J(\theta+\epsilon) - J(\theta-\epsilon)}{2\epsilon}$

Implement

gradApprox = (J(theta + EPSILON) - J(theta - EPSILON)) / (2*EPSILON);

Parameter vector $\theta$

$\theta \in \mathbb{R}^n$ (e.g. $\theta$ is the "unrolled" version of $\Theta^{(1)}, \Theta^{(2)}, \Theta^{(3)}$)

$\theta = \theta_1, \theta_2, \theta_3, \cdots, \theta_n$

$$\begin{array}{l}
\frac{\partial}{\partial\theta_1}J(\theta) \approx \frac{J(\theta_1+\epsilon,\theta_2,\theta_3,\cdots,\theta_n) - J(\theta_1-\epsilon,\theta_2,\theta_3,\cdots,\theta_n)}{2\epsilon} \\
\frac{\partial}{\partial\theta_2}J(\theta) \approx \frac{J(\theta_1,\theta_2+\epsilon,\theta_3,\cdots,\theta_n) - J(\theta_1,\theta_2-\epsilon,\theta_3,\cdots,\theta_n)}{2\epsilon} \\
\vdots \\
\frac{\partial}{\partial\theta_n}J(\theta) \approx \frac{J(\theta_1,\theta_2,\theta_3,\cdots,\theta_n+\epsilon) - J(\theta_1,\theta_2,\theta_3,\cdots,\theta_n-\epsilon)}{2\epsilon}
\end{array}$$

for i=1:n,
	thetaPlus=theta;
	thetaPlus(i)=thetaPlus(i)+EPSILON;
	thetaMinus=theta;
	thetaMinus(i)=thetaMinus(i)-EPSILON;
	gradApprox(i)=(J(thetaPlus)-J(thetaMinus))/(2*EPSILON);
end;

Check that gradApprox $\approx$ Dvec
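The check can be demonstrated end-to-end on a toy cost whose exact gradient is known, here $J(\theta)=\sum_j \theta_j^2$ with gradient $2\theta$ (a sketch; function names are my own):

```python
import numpy as np

def J(theta):
    return np.sum(theta ** 2)       # toy cost; exact gradient is 2*theta

def numerical_gradient(J, theta, eps=1e-4):
    """Two-sided difference, one coordinate at a time (as in the loop above)."""
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        theta_plus = theta.copy()
        theta_plus[i] += eps
        theta_minus = theta.copy()
        theta_minus[i] -= eps
        grad[i] = (J(theta_plus) - J(theta_minus)) / (2 * eps)
    return grad
```

In a real network `J` would be the unrolled cost function and the result would be compared against Dvec from backprop.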

Implementation Note:

  • Implement backprop to compute Dvec (unrolled $D^{(1)}, D^{(2)}, D^{(3)}$)
  • Implement numerical gradient check to compute gradApprox
  • Make sure they give similar values
  • Turn off gradient checking; use backprop code for learning

Important:

  • Be sure to disable your gradient checking code before training your classifier. If you run the numerical gradient computation on every iteration of gradient descent (or in the inner loop of costFunction(...)), your code will be very slow.

9-6 Random initialization

Initial value of $\Theta$

For gradient descent and advanced optimization methods, we need an initial value for $\Theta$

optTheta = fminunc(@costFunction,initialTheta,options)

Consider gradient descent

Set initialTheta = zeros(n,1)? ✗ (this does not work)

Zero initialization

$\Theta_{ij}^{(l)} = 0$ for all $i, j, l$.

After each update, the parameters corresponding to inputs going into each of two hidden units are identical, so the hidden units all compute the same function.

Instead, initialize each $\Theta_{ij}^{(l)}$ to a random value in $[-\epsilon, \epsilon]$ (i.e. $-\epsilon \le \Theta_{ij}^{(l)} \le \epsilon$)

E.g.

Theta1 = rand(10,11)*(2*INIT_EPSILON)-INIT_EPSILON;
Theta2 = rand(1,11)*(2*INIT_EPSILON)-INIT_EPSILON;
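A NumPy equivalent of the two Octave lines (a sketch; the value of INIT_EPSILON is an assumption, and the fixed seed is only for reproducibility):

```python
import numpy as np

INIT_EPSILON = 0.12   # assumed small constant; any small value works

rng = np.random.default_rng(0)
# uniform in [-INIT_EPSILON, INIT_EPSILON], same shapes as the Octave example
Theta1 = rng.random((10, 11)) * (2 * INIT_EPSILON) - INIT_EPSILON
Theta2 = rng.random((1, 11)) * (2 * INIT_EPSILON) - INIT_EPSILON
```

Every entry lands strictly inside the $[-\epsilon, \epsilon]$ band, which breaks the symmetry that zero initialization would cause.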

9-7 Putting it together

Training a neural network

Pick a network architecture (connectivity pattern between neurons):

  • No. of input units: dimension of features $x^{(i)}$

  • No. of output units: number of classes

    $y^{(1)}=\begin{bmatrix}1\\0\\0\\0\\0\end{bmatrix}, y^{(2)}=\begin{bmatrix}0\\1\\0\\0\\0\end{bmatrix},\cdots$

  • Reasonable default: 1 hidden layer; or, if >1 hidden layer, have the same no. of hidden units in every layer (usually the more the better).

  1. Randomly initialize weights
  2. Implement forward propagation to get $h_\Theta(x^{(i)})$ for any $x^{(i)}$
  3. Implement code to compute cost function $J(\Theta)$
  4. Implement backprop to compute partial derivatives $\frac{\partial}{\partial\Theta_{jk}^{(l)}}J(\Theta)$

for i = 1:m

Perform forward propagation and backpropagation using example $(x^{(i)}, y^{(i)})$
(Get activations $a^{(l)}$ and delta terms $\delta^{(l)}$ for $l = 2, \cdots, L$)

  5. Use gradient checking to compare $\frac{\partial}{\partial\Theta_{jk}^{(l)}}J(\Theta)$ computed using backpropagation vs. the numerical estimate of the gradient of $J(\Theta)$.
    Then disable gradient checking code.
  6. Use gradient descent or an advanced optimization method with backpropagation to try to minimize $J(\Theta)$ as a function of parameters $\Theta$

9-8 Autonomous driving example
