[ML of Andrew Ng] Week 5 Neural Networks: Learning

大庆csdn

Article link: https://blog.csdn.net/mrliudq/article/details/51001166

Week 5 Neural Networks: Learning

Cost function

Classification

Examples: $\{(x^{(1)},y^{(1)}),(x^{(2)},y^{(2)}),\cdots,(x^{(m)},y^{(m)})\}$

$L =$ total number of layers in the network
$s_l =$ number of units (not counting the bias unit) in layer $l$

- Binary classification
  $y \in \{0, 1\}$
- Multi-class classification ($K$ classes)
  $y \in \mathbb{R}^K$ , e.g. for $K = 4$: $\begin{bmatrix} 1 & 0 & 0 & 0\end{bmatrix}^T , \begin{bmatrix} 0 & 1 & 0 & 0 \end{bmatrix}^T , \begin{bmatrix} 0 & 0 & 1 & 0\end{bmatrix}^T , \begin{bmatrix} 0 & 0 & 0 & 1\end{bmatrix}^T$
- In MATLAB, $y$ is usually stored as a class index, so we often need to convert it into such a vector (one row of $Y$ per example):
```
% Y(i,:) becomes the one-hot row vector for label y(i), with y(i) in 1..num_labels
Y = zeros(m,num_labels);
for i=1:m,
    Y(i,y(i)) = 1;
end
```
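The same conversion can also be written without a loop. A minimal sketch, assuming the labels in `y` are integers in `1..num_labels`; the example values are made up for illustration:

```matlab
% Hypothetical example data: 5 labels drawn from 4 classes
y = [2; 4; 1; 3; 2];
num_labels = 4;

% Row y(i) of the identity matrix is the one-hot vector for class y(i)
I = eye(num_labels);
Y = I(y, :);

disp(Y);
```

Each row of `Y` then contains exactly one 1, in the column given by the label.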
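  Forward propagation algorithm

  From $a^{(1)}$ to $a^{(L)}$ :

  $$a^{(1)} = x$$

  and if we have $a^{(i)}$ :

  $$z^{(i+1)} = \Theta^{(i)} a^{(i)}$$

  $$(s_{i+1} \times (s_i+1)) \times ((1+s_i) \times 1) = s_{i+1} \times 1$$

  (where the bias unit $a_0^{(i)} = 1$ must first be added to $a^{(i)}$ )

  $$a^{(i+1)} = g(z^{(i+1)})$$

  Repeating this up to the output layer gives $a^{(L)}$ , and then

  $$h_\Theta(x) = a^{(L)}$$

  $\Theta_{ij}^{(l)}$ is the weight mapping from node $j$ in layer $l$ to node $i$ in layer $l+1$ , where
  $i = 1,2,\cdots,s_{l+1}$ and $j = 1,2,\cdots,s_l+1$ (with MATLAB's 1-based indexing, $j = 1$ is the bias column, i.e. index $0$ in 0-based notation).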
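
The forward pass above can be sketched for a three-layer network as follows. The layer sizes and the fixed weights are made-up values for illustration, and `sigmoid` is defined here as an anonymous function (it is not part of the original notes). Examples are stored as rows of `X`, so the product appears as `a * Theta'` rather than `Theta * a`.

```matlab
% Sigmoid activation g(z)
sigmoid = @(z) 1 ./ (1 + exp(-z));

% Hypothetical sizes: 3 inputs, 4 hidden units, 2 outputs, m = 5 examples
m = 5;
X = reshape(1:15, m, 3) / 15;          % m x 3 input matrix, one example per row
Theta1 = 0.1 * ones(4, 3 + 1);         % s_2 x (s_1 + 1)
Theta2 = -0.1 * ones(2, 4 + 1);        % s_3 x (s_2 + 1)

a1 = [ones(m, 1) X];                   % add the bias unit to layer 1
z2 = a1 * Theta1';                     % m x s_2
a2 = [ones(m, 1) sigmoid(z2)];         % add the bias unit to layer 2
z3 = a2 * Theta2';                     % m x s_3
a3 = sigmoid(z3);                      % h_Theta(x), one row per example

disp(size(a3));                        % size is [5 2]
```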
  
  Cost function
  
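  $$J(\Theta) = -\frac{1}{m}\left[\sum_{i=1}^{m}\sum_{k=1}^{K} y_k^{(i)}\log\left((h_\Theta(x^{(i)}))_k\right) + (1-y_k^{(i)})\log\left(1-(h_\Theta(x^{(i)}))_k\right)\right] + \frac{\lambda}{2m}\sum_{l=1}^{L-1}\sum_{i=2}^{s_l+1}\sum_{j=1}^{s_{l+1}}\left(\Theta_{ji}^{(l)}\right)^2$$

  where $h_\Theta(x) \in \mathbb{R}^K$ and $(h_\Theta(x))_k$ is the $k^{th}$ output. The regularization sum over the column index $i$ starts at 2, so the bias column of each $\Theta^{(l)}$ is not regularized.

  E.g. in MATLAB, for a three-layer network whose output activations are a3: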
```
J = 1/m * sum(sum(-Y.*log(a3)-(1-Y).*log(1-a3)))...
    + lambda/2/m * (sum(sum(Theta1(:,2:end).^2)) + sum(sum(Theta2(:,2:end).^2)));
```
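
One quick sanity check on this expression: with all weights set to zero, every output is $g(0) = 0.5$, so the cost must equal $K \log 2$ regardless of the labels. The sketch below uses made-up sizes and random inputs; only the cost line itself comes from the notes.

```matlab
sigmoid = @(z) 1 ./ (1 + exp(-z));

% Hypothetical sizes: 3 inputs, 4 hidden units, K = 2 classes, m = 5 examples
m = 5; K = 2; lambda = 1;
X = rand(m, 3);
y = [1; 2; 2; 1; 2];
I = eye(K);
Y = I(y, :);                           % one-hot labels, m x K

Theta1 = zeros(4, 4);                  % all-zero weights, used only for this check
Theta2 = zeros(K, 5);

a1 = [ones(m, 1) X];
a2 = [ones(m, 1) sigmoid(a1 * Theta1')];
a3 = sigmoid(a2 * Theta2');            % every entry is 0.5

J = 1/m * sum(sum(-Y.*log(a3) - (1-Y).*log(1-a3))) ...
    + lambda/2/m * (sum(sum(Theta1(:,2:end).^2)) + sum(sum(Theta2(:,2:end).^2)));

fprintf('J = %.4f, K*log(2) = %.4f\n', J, K * log(2));
```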
  Backpropagation algorithm
  
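  Similarly, to minimize $J(\Theta)$ we need to compute $J(\Theta)$ and $\frac{\partial}{\partial \Theta_{ij}^{(l)}} J(\Theta)$ .

  Gradient computation

  Intuition: $\delta^{(l)}_j =$ “error” of node $j$ in layer $l$ .

  From $\delta^{(L)}$ to $\delta^{(2)}$ (there is no $\delta^{(1)}$ ):

  $$\delta^{(L)} = a^{(L)} - y$$

  and if we have $\delta^{(i)}$ :

  $$\delta^{(i-1)} = \left(\Theta^{(i-1)}_{:j}\right)^T \delta^{(i)} \odot g'(z^{(i-1)}), \quad j \neq 0$$

  $$(s_i \times ((s_{i-1}+1)-1))^T \times (s_i \times 1) = s_{i-1} \times 1$$

  (where the bias column $\Theta_0^{(i-1)}$ must be removed, and $\odot$ is the element-wise product, written .* in MATLAB)

  where:

  $$g'(z^{(i)}) = g(z^{(i)})\,(1-g(z^{(i)}))$$

  and we can get $\Delta^{(l)}$ by accumulating over all $m$ training examples, starting from $\Delta^{(l)} = 0$ :

  $$\Delta^{(l)} := \Delta^{(l)} + \delta^{(l+1)} (a^{(l)})^T$$

  $$(s_{l+1} \times 1) \times ((s_l+1) \times 1)^T = s_{l+1} \times (s_l+1)$$

  Then:

  $$D^{(l)} = \frac{1}{m}\Delta^{(l)} + \frac{\lambda}{m}\Theta^{(l)}_{:j}, \quad j \neq 0$$

  (for the bias column $j = 0$ the regularization term is dropped, so $D^{(l)}_{:0} = \frac{1}{m}\Delta^{(l)}_{:0}$ ; the factor $\frac{\lambda}{m}$ is the derivative of the $\frac{\lambda}{2m}$ term in the cost)

  $$\frac{\partial}{\partial \Theta^{(l)}} J(\Theta) = D^{(l)}$$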
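
Putting the delta recursion (including the element-wise $g'(z)$ factor) and the accumulation together for a three-layer network gives roughly the following. This is a sketch under assumed sizes and random data, vectorized over all $m$ examples (rows of `X`) instead of looping over them; the variable names follow the snippets in these notes but are otherwise assumptions.

```matlab
sigmoid = @(z) 1 ./ (1 + exp(-z));

% Hypothetical sizes: 3 inputs, 4 hidden units, K = 2 classes, m = 5 examples
m = 5; K = 2; lambda = 1;
X = rand(m, 3);
y = [1; 2; 2; 1; 2];
I = eye(K);
Y = I(y, :);
Theta1 = rand(4, 4) * 0.24 - 0.12;
Theta2 = rand(K, 5) * 0.24 - 0.12;

% Forward propagation
a1 = [ones(m, 1) X];
z2 = a1 * Theta1';
a2 = [ones(m, 1) sigmoid(z2)];
a3 = sigmoid(a2 * Theta2');

% Backpropagation: delta3 is m x K, delta2 is m x 4 (bias column of Theta2 dropped)
delta3 = a3 - Y;
delta2 = (delta3 * Theta2(:, 2:end)) .* sigmoid(z2) .* (1 - sigmoid(z2));

% Accumulate over all examples at once: Delta(l) = sum_i delta(l+1) * a(l)'
Delta1 = delta2' * a1;                 % 4 x 4, same size as Theta1
Delta2 = delta3' * a2;                 % K x 5, same size as Theta2

% Gradients; the bias column (first column) is not regularized, and
% lambda/m matches the lambda/(2m) regularization term in the cost
Theta1_grad = Delta1 / m + lambda / m * [zeros(4, 1) Theta1(:, 2:end)];
Theta2_grad = Delta2 / m + lambda / m * [zeros(K, 1) Theta2(:, 2:end)];
```

Each gradient matrix has the same shape as the weight matrix it belongs to, which is what makes unrolling them into a single vector straightforward.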
  
  Implementation note: Unrolling parameters

  Example

  $\Theta^{(1)} \in \mathbb{R}^{10 \times 11},\Theta^{(2)} \in \mathbb{R}^{10 \times 11},\Theta^{(3)} \in \mathbb{R}^{1 \times 11}$
  
  Unrolling
```
thetaVec = [Theta1(:);Theta2(:);Theta3(:)];
```
  Reshape
```
Theta1 = reshape(thetaVec(1:110),10,11);
Theta2 = reshape(thetaVec(111:220),10,11);
Theta3 = reshape(thetaVec(221:231),1,11);
```
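
A round trip is an easy way to confirm that unrolling and reshaping are inverses of each other. The sketch below fills the three matrices with random values (an assumption for the demo) and derives the index ranges from the matrix sizes instead of hard-coding them:

```matlab
% Hypothetical matrices with the sizes from the example above
Theta1 = rand(10, 11);
Theta2 = rand(10, 11);
Theta3 = rand(1, 11);

% Unroll into one long column vector (231 elements)
thetaVec = [Theta1(:); Theta2(:); Theta3(:)];

% Compute the index ranges from the sizes instead of hard-coding them
n1 = numel(Theta1);
n2 = numel(Theta2);
R1 = reshape(thetaVec(1:n1), size(Theta1));
R2 = reshape(thetaVec(n1+1:n1+n2), size(Theta2));
R3 = reshape(thetaVec(n1+n2+1:end), size(Theta3));

% The round trip should recover the original matrices exactly
disp(isequal(R1, Theta1) && isequal(R2, Theta2) && isequal(R3, Theta3));
```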
  Gradient checking
```
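% theta is the unrolled parameter vector, n = numel(theta),
% and J is a function handle that returns the cost for a given theta
for i = 1:n,
    thetaPlus = theta;
    thetaPlus(i) = thetaPlus(i) + EPSILON;
    thetaMinus = theta;
    thetaMinus(i) = thetaMinus(i) - EPSILON;
    gradApprox(i) = (J(thetaPlus) - J(thetaMinus))/(2*EPSILON);
end;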
```
  Implementation note:
  - Implement backprop to compute DVec (unrolled).
  - Implement numerical gradient check to compute gradApprox.
  - Make sure they give similar values.
  - Turn off gradient checking and use the backprop code for learning.
  Important:
  - Be sure to disable your gradient checking code before training your classifier. If you run numerical gradient computation on every iteration of gradient descent (or in the inner loop of costFunction(…)), your code will be very slow.
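
As a small, self-contained illustration of the check, the loop can be run on a cost whose gradient is known in closed form. The quadratic cost `J`, the value of `EPSILON`, and the relative-difference measure below are assumptions chosen for the demonstration, not part of the course code:

```matlab
% Toy cost with a known gradient: J(theta) = sum(theta.^2), dJ/dtheta = 2*theta
J = @(theta) sum(theta .^ 2);
theta = [1; -2; 0.5];
analyticGrad = 2 * theta;

EPSILON = 1e-4;
n = numel(theta);
gradApprox = zeros(n, 1);
for i = 1:n
    thetaPlus = theta;
    thetaPlus(i) = thetaPlus(i) + EPSILON;
    thetaMinus = theta;
    thetaMinus(i) = thetaMinus(i) - EPSILON;
    gradApprox(i) = (J(thetaPlus) - J(thetaMinus)) / (2 * EPSILON);
end

% Relative difference; a very small value suggests the two gradients agree
relDiff = norm(gradApprox - analyticGrad) / norm(gradApprox + analyticGrad);
fprintf('relative difference: %g\n', relDiff);
```

For a neural network, `analyticGrad` would be the unrolled backprop gradient and `J` the network's cost function evaluated on the unrolled parameters.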
  Random initialization
  
  With zero initialization:
  After each update, the parameters corresponding to the inputs going into each of the two hidden units are identical, so both hidden units keep computing the same function of the input.

  So we must use random initialization to break the symmetry.
  Initialize each $\Theta_{ij}^{(l)}$ to a random value in $[-\epsilon,\epsilon]$
  E.g.
```
Theta1 = rand(10,11)*(2*INIT_EPSILON) - INIT_EPSILON;
Theta2 = rand(1,11)*(2*INIT_EPSILON) - INIT_EPSILON;
```
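
The symmetry problem can be seen directly by running one forward and backward pass with identical weights. The sketch below uses a made-up data set and a constant initial value of 0.5 (zero is the special case where the hidden-layer gradient vanishes entirely); the two rows of the hidden-layer gradient come out the same, so a gradient step cannot make the two hidden units differ.

```matlab
sigmoid = @(z) 1 ./ (1 + exp(-z));

% Hypothetical data: 3 inputs, 2 hidden units, 1 output, m = 4 examples
m = 4;
X = [0.1 0.7 0.3; 0.9 0.2 0.5; 0.4 0.4 0.8; 0.6 0.1 0.2];
Y = [1; 0; 1; 0];

% Every weight gets the same value, so the two hidden units start out identical
Theta1 = 0.5 * ones(2, 4);
Theta2 = 0.5 * ones(1, 3);

% One forward and backward pass (no regularization)
a1 = [ones(m, 1) X];
z2 = a1 * Theta1';
a2 = [ones(m, 1) sigmoid(z2)];
a3 = sigmoid(a2 * Theta2');
delta3 = a3 - Y;
delta2 = (delta3 * Theta2(:, 2:end)) .* sigmoid(z2) .* (1 - sigmoid(z2));
Theta1_grad = delta2' * a1 / m;

% Both rows of the gradient match, so the hidden units stay identical after the update
disp(Theta1_grad);
```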
  Training a neural network
  
  Pick a network architecture (connectivity pattern between neurons)

  - No. of input units: dimension of the features $x^{(i)}$
  - No. of output units: number of classes

  Reasonable default: 1 hidden layer, or if >1 hidden layer, use the same no. of hidden units in every layer (usually the more the better, at a higher computational cost)

  Steps
  1. Randomly initialize weights
  2. Implement forward propagation to get $h_\Theta(x^{(i)})$ for any $x^{(i)}$
  3. Implement code to compute the cost function $J(\Theta)$
  4. Implement backprop to compute the partial derivatives $\frac{\partial}{\partial \Theta_{ij}^{(l)}} J(\Theta)$ :
     for i = 1:m
     perform forward propagation and backpropagation using example $(x^{(i)},y^{(i)})$
     (get activations $a^{(l)}$ and delta terms $\delta^{(l)}$ for $l=2,\cdots,L$ ).
  5. Use gradient checking to compare $\frac{\partial}{\partial \Theta_{ij}^{(l)}} J(\Theta)$ computed using backpropagation vs. using a numerical estimate of the gradient of $J(\Theta)$ .
     Then disable the gradient checking code.
  6. Use gradient descent or an advanced optimization method with backpropagation to try to minimize $J(\Theta)$ as a function of the parameters $\Theta$
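
The steps above can be strung together in a short script. This is only a sketch: the XOR data, the 2-2-1 architecture, the learning rate `alpha`, and the iteration count are all assumptions for illustration, and plain batch gradient descent stands in for an advanced optimizer. Gradient checking is left out, and whether the run reaches a good minimum depends on the random initialization.

```matlab
sigmoid = @(z) 1 ./ (1 + exp(-z));

% Hypothetical training set: XOR with m = 4 examples
X = [0 0; 0 1; 1 0; 1 1];
Y = [0; 1; 1; 0];
m = size(X, 1);

% Random initialization in [-INIT_EPSILON, INIT_EPSILON]
INIT_EPSILON = 0.5;
Theta1 = rand(2, 3) * (2 * INIT_EPSILON) - INIT_EPSILON;
Theta2 = rand(1, 3) * (2 * INIT_EPSILON) - INIT_EPSILON;

alpha = 0.5;          % learning rate (assumed)
lambda = 0;           % no regularization in this sketch
num_iters = 2000;
J_history = zeros(num_iters, 1);

for iter = 1:num_iters
    % Forward propagation
    a1 = [ones(m, 1) X];
    z2 = a1 * Theta1';
    a2 = [ones(m, 1) sigmoid(z2)];
    a3 = sigmoid(a2 * Theta2');

    % Cost function
    J_history(iter) = 1/m * sum(sum(-Y.*log(a3) - (1-Y).*log(1-a3))) ...
        + lambda/2/m * (sum(sum(Theta1(:,2:end).^2)) + sum(sum(Theta2(:,2:end).^2)));

    % Backpropagation
    delta3 = a3 - Y;
    delta2 = (delta3 * Theta2(:, 2:end)) .* sigmoid(z2) .* (1 - sigmoid(z2));
    Theta1_grad = delta2' * a1 / m + lambda/m * [zeros(2, 1) Theta1(:, 2:end)];
    Theta2_grad = delta3' * a2 / m + lambda/m * [zeros(1, 1) Theta2(:, 2:end)];

    % Gradient descent update
    Theta1 = Theta1 - alpha * Theta1_grad;
    Theta2 = Theta2 - alpha * Theta2_grad;
end

fprintf('cost: %.4f -> %.4f\n', J_history(1), J_history(end));
```

Plotting `J_history` against the iteration number is a simple way to see whether the cost is actually going down.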
