Machine Learning 04 - Neural Networks

I am working through Stanford's machine learning course taught by Andrew Ng and taking notes as I go, so that I can review and consolidate later.
My knowledge is limited, so please bear with any errors or omissions and feel free to point them out.

Week 04

4.1 Model Representation

4.1.1 Origin of model

Neural networks are modelled on the neurons in the brain.

Neuron in the brain

4.1.2 Logistic unit

A basic model of a neural network is as follows:

Logistic unit

Remark :

$$x = \begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ x_3 \end{bmatrix}, \qquad \theta = \begin{bmatrix} \theta_0 \\ \theta_1 \\ \theta_2 \\ \theta_3 \end{bmatrix}$$

$\theta$ is also called the "weights" in neural networks.
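
As a concrete illustration, here is a minimal Octave sketch of a single logistic unit; only the formula $h_\theta(x) = g(\theta^T x)$ comes from the course, the numeric inputs and weights are made-up values:

g = @(z) 1 ./ (1 + exp(-z));     % sigmoid activation function
x = [1; 2.0; -1.5; 0.3];         % [x0; x1; x2; x3], x0 = 1 is the bias unit (made-up values)
theta = [0.1; 0.5; -0.3; 0.8];   % weights [theta0; theta1; theta2; theta3] (made-up values)
h = g(theta' * x);               % output of the logistic unit, a value in (0, 1)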

4.1.3 Neural network

(1) Schematic diagram

neural network

Notation

$s_j$ - the number of units in layer $j$, not counting the bias unit

$a_i^{(j)}$ - the "activation" of unit $i$ in layer $j$

$\Theta^{(j)}$ - the matrix of weights controlling the function mapping from layer $j$ to layer $j+1$, with dimension $s_{j+1} \times (s_j + 1)$

$L$ - the total number of layers in the network
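
For example, in the network drawn above there are $s_1 = 3$ input units, $s_2 = 3$ hidden units, and one output unit, so $\Theta^{(1)}$ has dimension $3 \times 4$ and $\Theta^{(2)}$ has dimension $1 \times 4$.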

(2) Mathematical representation

Layer 2

$$\begin{aligned}
a_1^{(2)} &= g(\Theta_{10}^{(1)} x_0 + \Theta_{11}^{(1)} x_1 + \Theta_{12}^{(1)} x_2 + \Theta_{13}^{(1)} x_3) \\
a_2^{(2)} &= g(\Theta_{20}^{(1)} x_0 + \Theta_{21}^{(1)} x_1 + \Theta_{22}^{(1)} x_2 + \Theta_{23}^{(1)} x_3) \\
a_3^{(2)} &= g(\Theta_{30}^{(1)} x_0 + \Theta_{31}^{(1)} x_1 + \Theta_{32}^{(1)} x_2 + \Theta_{33}^{(1)} x_3)
\end{aligned}$$

Layer 3

$$h_\Theta(x) = a_1^{(3)} = g(\Theta_{10}^{(2)} a_0^{(2)} + \Theta_{11}^{(2)} a_1^{(2)} + \Theta_{12}^{(2)} a_2^{(2)} + \Theta_{13}^{(2)} a_3^{(2)})$$

(3) Vectorization

Layer 1

$$a^{(1)} = \begin{bmatrix} a_0^{(1)} \\ a_1^{(1)} \\ a_2^{(1)} \\ a_3^{(1)} \end{bmatrix} = \begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ x_3 \end{bmatrix} = x$$

Layer 2

$$a^{(2)} = \begin{bmatrix} a_1^{(2)} \\ a_2^{(2)} \\ a_3^{(2)} \end{bmatrix}
= \begin{bmatrix} g(z_1^{(2)}) \\ g(z_2^{(2)}) \\ g(z_3^{(2)}) \end{bmatrix}
= g\left( \begin{bmatrix} z_1^{(2)} \\ z_2^{(2)} \\ z_3^{(2)} \end{bmatrix} \right)
= g(z^{(2)}) = g(\Theta^{(1)} a^{(1)})
= \begin{bmatrix}
g(\Theta_{10}^{(1)} x_0 + \Theta_{11}^{(1)} x_1 + \Theta_{12}^{(1)} x_2 + \Theta_{13}^{(1)} x_3) \\
g(\Theta_{20}^{(1)} x_0 + \Theta_{21}^{(1)} x_1 + \Theta_{22}^{(1)} x_2 + \Theta_{23}^{(1)} x_3) \\
g(\Theta_{30}^{(1)} x_0 + \Theta_{31}^{(1)} x_1 + \Theta_{32}^{(1)} x_2 + \Theta_{33}^{(1)} x_3)
\end{bmatrix}$$

Layer 3

$$h_\Theta(x) = a^{(3)} = g(z^{(3)}) = g(\Theta^{(2)} a^{(2)}) = g(\Theta_{10}^{(2)} a_0^{(2)} + \Theta_{11}^{(2)} a_1^{(2)} + \Theta_{12}^{(2)} a_2^{(2)} + \Theta_{13}^{(2)} a_3^{(2)})$$

Remark :

$h_\Theta(x) \in [0, 1]$, but compared with logistic regression it is not a logistic function of the raw input $x$; the sigmoid is applied to the learned features $a^{(2)}$ instead.

The key to the vectorization is $a^{(j)} = g(z^{(j)}) = g(\Theta^{(j-1)} a^{(j-1)})$, which can be applied layer by layer like a "loop".
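
As an illustration, a minimal Octave sketch of this vectorized forward propagation for the three-layer network above; Theta1, Theta2, and the input column vector x are assumed to already exist, and g is the sigmoid:

g = @(z) 1 ./ (1 + exp(-z));   % sigmoid activation

a1 = [1; x];                   % a(1) = x with the bias unit x0 = 1 added
z2 = Theta1 * a1;              % z(2) = Theta(1) * a(1)
a2 = [1; g(z2)];               % a(2) = g(z(2)) with the bias unit added
z3 = Theta2 * a2;              % z(3) = Theta(2) * a(2)
h  = g(z3);                    % h_Theta(x) = a(3)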

4.1.4 Multiclass classification

To classify data into multiple types, let the hypothesis function return a vector of values.

Similarly, we use the one-vs-all method to solve the multiclass classification problem.

The multiple output units :

one-vs-all
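
A minimal Octave sketch of how the labels and predictions can be handled; num_labels, the integer label y, and the output activations a3 are my own assumed variable names, not symbols from the course slides:

y_vec = zeros(num_labels, 1);  % one-hot vector, e.g. y = 3 with num_labels = 4 gives [0; 0; 1; 0]
y_vec(y) = 1;

[max_val, pred] = max(a3);     % predicted class = output unit with the largest activation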

4.2 Backpropagation

4.2.1 Cost function

Review the cost function of logistic regression :

$$J(\theta) = -\frac{1}{m}\left[\sum_{i=1}^{m} \left( y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log(1 - h_\theta(x^{(i)})) \right)\right] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2$$

In a neural network we have $K$ outputs, that is

$$h_\Theta(x) \in \mathbb{R}^K, \qquad (h_\Theta(x))_i = i^{\text{th}} \text{ output}$$

then the cost function of the neural network is the sum over all $K$ logistic cost functions:

$$J(\Theta) = -\frac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{K}\left[ y_k^{(i)} \log\left((h_\Theta(x^{(i)}))_k\right) + (1 - y_k^{(i)}) \log\left(1 - (h_\Theta(x^{(i)}))_k\right) \right] + \frac{\lambda}{2m}\sum_{l=1}^{L-1}\sum_{i=1}^{s_{l+1}}\sum_{j=1}^{s_l}\left(\Theta_{ij}^{(l)}\right)^2$$

Remark: in the regularization term, the bias weights (the $j = 0$ column of each $\Theta^{(l)}$) are not regularized; the row index $i$ runs over the units of layer $l+1$ and the column index $j$ over the non-bias units of layer $l$.
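
A minimal Octave sketch of this cost for a three-layer network; it assumes H (an m x K matrix whose rows are $h_\Theta(x^{(i)})$) and Y (an m x K matrix of one-hot labels) have already been computed by forward propagation. Note how the first column (bias weights) of each Theta matrix is excluded from the regularization term, as in the remark above:

J = (-1/m) * sum(sum(Y .* log(H) + (1 - Y) .* log(1 - H)));
reg = (lambda/(2*m)) * (sum(sum(Theta1(:, 2:end).^2)) + sum(sum(Theta2(:, 2:end).^2)));
J = J + reg;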

4.2.2 Gradient of cost function and algorithm

In order to use gradient descent or another optimization algorithm, we need to compute $J(\Theta)$ and $\frac{\partial}{\partial \Theta_{ij}^{(l)}} J(\Theta)$.

Let

$$\delta^{(L)} = a^{(L)} - y; \qquad \delta^{(l)} = (\Theta^{(l)})^T \delta^{(l+1)} .* g'(z^{(l)}), \quad 2 \le l \le L-1$$

(for a detailed derivation, see the reference material BP算法的推导过程, i.e. the derivation of the backpropagation algorithm)

then

$$\frac{\partial}{\partial \Theta_{ij}^{(l)}} J(\Theta) = a_j^{(l)} \delta_i^{(l+1)} \qquad (\lambda = 0)$$
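
For the sigmoid activation used here, the derivative $g'$ that appears in the definition of $\delta^{(l)}$ can be computed element-wise from the sigmoid itself:

$$g'(z^{(l)}) = g(z^{(l)}) .* (1 - g(z^{(l)}))$$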

  • Backpropagation algorithm for a neural network - Algorithm 3

Training set $\{(x^{(1)}, y^{(1)}), \cdots, (x^{(m)}, y^{(m)})\}$
Set $\Delta_{ij}^{(l)} = 0$ for all $l, i, j$
For $i = 1$ to $m$:
    Set $a^{(1)} = x^{(i)}$
    Perform forward propagation to compute $a^{(l)}$ for $l = 2, 3, \cdots, L$
    Using $y^{(i)}$, compute $\delta^{(L)} = a^{(L)} - y^{(i)}$
    Compute $\delta^{(L-1)}, \delta^{(L-2)}, \cdots, \delta^{(2)}$
    $\Delta^{(l)} := \Delta^{(l)} + \delta^{(l+1)} (a^{(l)})^T$
$D_{ij}^{(l)} := \frac{1}{m}\left(\Delta_{ij}^{(l)} + \lambda \Theta_{ij}^{(l)}\right)$, if $j \neq 0$
$D_{ij}^{(l)} := \frac{1}{m}\Delta_{ij}^{(l)}$, if $j = 0$

Thus we get $\frac{\partial}{\partial \Theta_{ij}^{(l)}} J(\Theta) = D_{ij}^{(l)}$.
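
A minimal Octave sketch of one pass of this algorithm for the three-layer network; X (m x n input matrix), Y (m x K one-hot labels), Theta1, Theta2 and lambda are assumed to exist, and the variable names are my own, not the course's reference implementation:

g = @(z) 1 ./ (1 + exp(-z));
Delta1 = zeros(size(Theta1));
Delta2 = zeros(size(Theta2));

for i = 1:m
    % forward propagation for example i
    a1 = [1; X(i, :)'];
    z2 = Theta1 * a1;  a2 = [1; g(z2)];
    z3 = Theta2 * a2;  a3 = g(z3);

    % backpropagate the errors
    d3 = a3 - Y(i, :)';
    d2 = Theta2' * d3;
    d2 = d2(2:end) .* g(z2) .* (1 - g(z2));   % drop the bias term, apply g'(z2)

    % accumulate the gradients
    Delta2 = Delta2 + d3 * a2';
    Delta1 = Delta1 + d2 * a1';
end

D1 = Delta1 / m;  D1(:, 2:end) += (lambda/m) * Theta1(:, 2:end);
D2 = Delta2 / m;  D2(:, 2:end) += (lambda/m) * Theta2(:, 2:end);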

4.3 Implementation in Practice

4.3.1 Unrolling parameters

With neural networks we are working with sets of weight matrices; in order to use an advanced optimization function, we need to unroll them into one long vector.

Code : unroll

thetaVector = [ Theta1(:); Theta2(:); Theta3(:); ]
deltaVector = [D1(:); D2(:); D3(:)]

Code : roll

Theta1 = reshape(thetaVector(1:110), 10, 11)
Theta2 = reshape(thetaVector(111:220), 10, 11)
Theta3 = reshape(thetaVector(221:231), 1, 11)
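
These unrolled vectors are what get passed to an advanced optimizer. A hedged sketch of the usual pattern, assuming a user-written costFunction that reshapes the vector back into Theta1/Theta2/Theta3 (using the same indices as above) and returns the cost and the unrolled gradient:

options = optimset('GradObj', 'on', 'MaxIter', 100);
[optTheta, cost] = fminunc(@costFunction, initialTheta, options);

% inside costFunction(thetaVector):
%   Theta1 = reshape(thetaVector(1:110),   10, 11);
%   Theta2 = reshape(thetaVector(111:220), 10, 11);
%   Theta3 = reshape(thetaVector(221:231), 1, 11);
%   ... compute J and the gradients, return the gradients unrolled into one vector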

4.3.2 Gradient checking

In order to ensure that our backpropagation works as intended, we need to check the gradient numerically.

We can approximate the derivative of our cost function with:

$$\frac{\partial J(\Theta)}{\partial \Theta_k^{(j)}} \approx \frac{J(\Theta_1^{(j)}, \cdots, \Theta_k^{(j)} + \epsilon, \cdots, \Theta_n^{(j)}) - J(\Theta_1^{(j)}, \cdots, \Theta_k^{(j)} - \epsilon, \cdots, \Theta_n^{(j)})}{2\epsilon}$$

$\epsilon$ is usually set to $10^{-4}$ to guarantee accuracy.

Code

epsilon = 1e-4;
for i = 1:n
    thetaPlus = theta;
    thetaPlus(i) += epsilon;    % perturb only the i-th parameter upward
    thetaMinus = theta;
    thetaMinus(i) -= epsilon;   % perturb only the i-th parameter downward
    gradApprox(i) = (J(thetaPlus) - J(thetaMinus)) / (2*epsilon);
end
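
Once gradApprox has been computed, it can be compared with the unrolled gradient from backpropagation (deltaVector from the unrolling section above); a very small relative difference suggests backpropagation is correct. A small sketch:

relDiff = norm(gradApprox(:) - deltaVector(:)) / norm(gradApprox(:) + deltaVector(:));
% relDiff should be very small (e.g. around 1e-9) if backpropagation is implemented correctly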

4.3.3 Random initialization

Initializing all the theta weights to zero fails to break symmetry (every hidden unit ends up computing the same function), so instead we initialize theta randomly.

Initialize each $\Theta_{ij}^{(l)}$ to a random value in $[-\epsilon, \epsilon]$.

Code

Theta1 = rand(10,11)*(2*INIT_EPSILON)-INIT_EPSILON;
Theta2 = rand(1,11)*(2*INIT_EPSILON)-INIT_EPSILON;
...
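
The value of INIT_EPSILON above is left open; one common heuristic (used in the course's programming exercise, as far as I remember) scales it with the sizes of the layers adjacent to the weight matrix:

% L_in and L_out are the numbers of units in the layers that Theta connects
INIT_EPSILON = sqrt(6) / sqrt(L_in + L_out);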

4.4 Summary

Pick a Network Architecture

  • number of input units = dimension of the features $x^{(i)}$
  • number of output units = number of classes
  • number of hidden units per layer = usually more is better (but must be balanced against the cost of computation)

Training a Neural Network

  • Randomly initialize the weights
  • Implement forward propagation
  • Implement the cost function
  • Implement backpropagation
  • Gradient checking (remember to disable it afterwards)
  • Use gradient descent or a built-in optimization function to minimize the cost function