- 3 One hidden layer Neural Network
- 3.1&3.2 Neural Networks Overview & Neural Network Representation
- 3.3 Computing a Neural Network’s Output
- 3.4 Vectorizing across multiple examples
- 3.5 Explanation for vectorized implementation
- 3.6 Activation functions
- 3.7 Why do you need non-linear activation functions?
- 3.8 Derivatives of activation functions
- 3.9 Gradient descent for neural networks
- 3.11 Random Initialization
3 One hidden layer Neural Network
Date:2018.3.8
3.1&3.2 Neural Networks Overview & Neural Network Representation
A neural network with a single hidden layer consists of an input layer, a hidden layer, and an output layer.
3.3 Computing a Neural Network’s Output
Like logistic regression, each circle (node) in the diagram really represents two steps of computation: first compute $z$, then compute the activation $a$.
In the hidden layer, unit $i$ computes:
$z_i^{[1]} = w_i^{[1]T} x + b_i^{[1]}, \quad a_i^{[1]} = \sigma(z_i^{[1]})$
Let $W^{[1]}$ stack the row vectors $w_i^{[1]T}$ and $b^{[1]}$ stack the scalars $b_i^{[1]}$, so that:
$z^{[1]} = W^{[1]} x + b^{[1]}, \quad a^{[1]} = \sigma(z^{[1]})$
So the hidden layer's output $a^{[1]}$ feeds the output layer:
$z^{[2]} = W^{[2]} a^{[1]} + b^{[2]}, \quad a^{[2]} = \sigma(z^{[2]}) = \hat{y}$
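The two-step computation above can be sketched in NumPy for a single example. The layer sizes ($n_x = 3$ inputs, $n_h = 4$ hidden units) are illustrative assumptions, not from the notes:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n_x, n_h = 3, 4                                # assumed layer sizes
W1 = rng.standard_normal((n_h, n_x)) * 0.01    # W^[1]: one row per hidden unit
b1 = np.zeros((n_h, 1))                        # b^[1]
W2 = rng.standard_normal((1, n_h)) * 0.01      # W^[2]
b2 = np.zeros((1, 1))                          # b^[2]

x = rng.standard_normal((n_x, 1))              # one input column vector

z1 = W1 @ x + b1        # z^[1] = W^[1] x + b^[1]
a1 = sigmoid(z1)        # a^[1] = sigma(z^[1])
z2 = W2 @ a1 + b2       # z^[2] = W^[2] a^[1] + b^[2]
a2 = sigmoid(z2)        # a^[2] = y-hat, a single probability
```

Each hidden unit's two steps (compute $z$, then the activation) become one row of the matrix products.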
3.4 Vectorizing across multiple examples
Suppose we have $m$ training examples; for $i = 1$ to $m$ we have:
$z^{[1](i)} = W^{[1]} x^{(i)} + b^{[1]}, \quad a^{[1](i)} = \sigma(z^{[1](i)})$
$z^{[2](i)} = W^{[2]} a^{[1](i)} + b^{[2]}, \quad a^{[2](i)} = \sigma(z^{[2](i)})$
Let $X = [x^{(1)}, x^{(2)}, \dots, x^{(m)}] \in \mathbb{R}^{n_x \times m}$ (each column is one example), so we get:
$Z^{[1]} = W^{[1]} X + b^{[1]}, \quad A^{[1]} = \sigma(Z^{[1]})$
$Z^{[2]} = W^{[2]} A^{[1]} + b^{[2]}, \quad A^{[2]} = \sigma(Z^{[2]})$
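A quick sketch of why the vectorized form works (sizes are illustrative): column $i$ of $W^{[1]}X + b^{[1]}$ equals $W^{[1]}x^{(i)} + b^{[1]}$, with $b^{[1]}$ broadcast across columns.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
n_x, n_h, m = 3, 4, 5                      # assumed sizes
W1 = rng.standard_normal((n_h, n_x))
b1 = rng.standard_normal((n_h, 1))
X = rng.standard_normal((n_x, m))          # columns are x^(1) .. x^(m)

# Vectorized: one matrix product handles all m examples; b1 broadcasts.
Z1 = W1 @ X + b1                           # shape (n_h, m)
A1 = sigmoid(Z1)

# Loop version for comparison: column i of Z1 equals W1 @ x^(i) + b1.
for i in range(m):
    z1_i = W1 @ X[:, [i]] + b1
    assert np.allclose(Z1[:, [i]], z1_i)
```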
3.5 Explanation for vectorized implementation
3.6 Activation functions
In the forward-propagation steps of a neural network we need to choose an activation function for each layer.
- Sigmoid function: $f(z) = \dfrac{1}{1 + e^{-z}}$
- tanh function (a shifted, rescaled version of the sigmoid): $f(z) = \dfrac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$

For hidden layers the tanh function is almost always strictly superior to the sigmoid. The exception is binary classification, where you might use the sigmoid activation function for the output layer.

One downside of both the sigmoid and tanh functions is that when $z$ is very large or very small, the slope (derivative) of the function becomes very small, and this can slow down gradient descent.

- ReLU function: $f(z) = \max(0, z)$
- Leaky ReLU function: $f(z) = \max(0.01z, z)$
Some rules of thumb for choosing activation functions:
If your output is a 0/1 value (binary classification), then the sigmoid activation function is very natural for the output layer. In hidden layers we usually use the ReLU or tanh activation function; in practice, with ReLU your neural network will often learn much faster.
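The four activation functions from this section, written as vectorized NumPy one-liners (the sample $z$ values are illustrative):

```python
import numpy as np

def sigmoid(z):    return 1.0 / (1.0 + np.exp(-z))   # squashes to (0, 1)
def tanh(z):       return np.tanh(z)                  # squashes to (-1, 1), zero-centered
def relu(z):       return np.maximum(0.0, z)          # 0 for z < 0, identity otherwise
def leaky_relu(z): return np.maximum(0.01 * z, z)     # slope 0.01 instead of a flat 0

z = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
# sigmoid and tanh saturate (slope ~ 0) at the extremes of this range,
# while ReLU keeps a slope of exactly 1 for all positive z.
outs = {f.__name__: f(z) for f in (sigmoid, tanh, relu, leaky_relu)}
```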
3.7 Why do you need non-linear activation functions?
If you use a linear (identity) activation function $g(z) = z$, then:
$a^{[1]} = W^{[1]} x + b^{[1]}$
$a^{[2]} = W^{[2]} a^{[1]} + b^{[2]} = (W^{[2]} W^{[1]}) x + (W^{[2]} b^{[1]} + b^{[2]})$
So the neural network is just outputting a linear function of the input, and a linear hidden layer is more or less useless. About the only place a linear activation function makes sense is the output layer (e.g., for regression).
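The collapse of stacked linear layers into a single linear map can be checked numerically (sizes are illustrative):

```python
import numpy as np

# With identity activations, two layers collapse into one:
# a2 = W2 (W1 x + b1) + b2 = (W2 W1) x + (W2 b1 + b2).
rng = np.random.default_rng(2)
W1 = rng.standard_normal((4, 3)); b1 = rng.standard_normal((4, 1))
W2 = rng.standard_normal((1, 4)); b2 = rng.standard_normal((1, 1))
x  = rng.standard_normal((3, 1))

two_layer = W2 @ (W1 @ x + b1) + b2
one_layer = (W2 @ W1) @ x + (W2 @ b1 + b2)
assert np.allclose(two_layer, one_layer)   # identical, layer count notwithstanding
```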
Date:2018.3.9
3.8 Derivatives of activation functions
- Sigmoid function: $f'(z) = f(z)\,(1 - f(z))$
- Tanh function: $f'(z) = 1 - (\tanh(z))^2$
- ReLU function: $f'(z) = \begin{cases} 0 & z < 0 \\ 1 & z \ge 0 \end{cases}$
- Leaky ReLU function: $f'(z) = \begin{cases} 0.01 & z < 0 \\ 1 & z \ge 0 \end{cases}$
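These closed-form derivatives can be sanity-checked against central finite differences, staying away from $z = 0$ where ReLU is not differentiable (the test points are arbitrary):

```python
import numpy as np

def sigmoid(z): return 1.0 / (1.0 + np.exp(-z))
def relu(z):    return np.maximum(0.0, z)

z = np.array([-2.0, -0.5, 0.5, 2.0])   # no point at exactly 0
eps = 1e-6

# sigmoid'(z) = sigmoid(z) * (1 - sigmoid(z))
num = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
assert np.allclose(num, sigmoid(z) * (1 - sigmoid(z)), atol=1e-6)

# tanh'(z) = 1 - tanh(z)^2
num = (np.tanh(z + eps) - np.tanh(z - eps)) / (2 * eps)
assert np.allclose(num, 1 - np.tanh(z) ** 2, atol=1e-6)

# relu'(z) = 0 for z < 0, 1 for z > 0
num = (relu(z + eps) - relu(z - eps)) / (2 * eps)
assert np.allclose(num, (z >= 0).astype(float), atol=1e-6)
```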
3.9 Gradient descent for neural networks
Parameters: $W^{[1]}, b^{[1]}, W^{[2]}, b^{[2]}$
Cost function: $J(W^{[1]}, b^{[1]}, W^{[2]}, b^{[2]}) = \frac{1}{m} \sum_{i=1}^{m} \mathcal{L}(\hat{y}^{(i)}, y^{(i)})$
Gradient descent:
Repeat {
    Compute predictions $\hat{y}^{(i)},\ i = 1, \dots, m$
    Compute $dW^{[1]} = \frac{\partial J}{\partial W^{[1]}},\ db^{[1]} = \frac{\partial J}{\partial b^{[1]}},\ dW^{[2]} = \frac{\partial J}{\partial W^{[2]}},\ db^{[2]} = \frac{\partial J}{\partial b^{[2]}}$
    Update $W^{[1]} := W^{[1]} - \alpha\,dW^{[1]},\ b^{[1]} := b^{[1]} - \alpha\,db^{[1]},\ W^{[2]} := W^{[2]} - \alpha\,dW^{[2]},\ b^{[2]} := b^{[2]} - \alpha\,db^{[2]}$
}
So this would be one iteration of gradient descent, and then you repeat it some number of times until your parameters look like they are converging. The key is knowing how to compute these partial derivative terms.
Forward propagation & Back propagation
Forward propagation (vectorized):
$Z^{[1]} = W^{[1]} X + b^{[1]}, \quad A^{[1]} = g^{[1]}(Z^{[1]})$
$Z^{[2]} = W^{[2]} A^{[1]} + b^{[2]}, \quad A^{[2]} = \sigma(Z^{[2]})$
Back propagation:
$dZ^{[2]} = A^{[2]} - Y$
$dW^{[2]} = \frac{1}{m} dZ^{[2]} A^{[1]T}, \quad db^{[2]} = \frac{1}{m} \sum_{i} dZ^{[2](i)}$
$dZ^{[1]} = W^{[2]T} dZ^{[2]} * g^{[1]\prime}(Z^{[1]})$
$dW^{[1]} = \frac{1}{m} dZ^{[1]} X^{T}, \quad db^{[1]} = \frac{1}{m} \sum_{i} dZ^{[1](i)}$
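One full iteration of gradient descent, sketched in NumPy for a tanh hidden layer with a sigmoid output and cross-entropy loss. The layer sizes, data, and learning rate are illustrative assumptions:

```python
import numpy as np

def sigmoid(z): return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(3)
n_x, n_h, m, alpha = 2, 4, 8, 0.1               # assumed sizes and learning rate
X = rng.standard_normal((n_x, m))
Y = (rng.random((1, m)) > 0.5).astype(float)     # synthetic 0/1 labels

W1 = rng.standard_normal((n_h, n_x)) * 0.01; b1 = np.zeros((n_h, 1))
W2 = rng.standard_normal((1, n_h)) * 0.01;   b2 = np.zeros((1, 1))

# Forward propagation
Z1 = W1 @ X + b1; A1 = np.tanh(Z1)
Z2 = W2 @ A1 + b2; A2 = sigmoid(Z2)

# Back propagation
dZ2 = A2 - Y
dW2 = (dZ2 @ A1.T) / m
db2 = np.sum(dZ2, axis=1, keepdims=True) / m
dZ1 = (W2.T @ dZ2) * (1 - A1 ** 2)      # tanh'(z) = 1 - tanh(z)^2
dW1 = (dZ1 @ X.T) / m
db1 = np.sum(dZ1, axis=1, keepdims=True) / m

# Parameter update (one iteration of gradient descent)
W1 -= alpha * dW1; b1 -= alpha * db1
W2 -= alpha * dW2; b2 -= alpha * db2
```

Each gradient has the same shape as the parameter it updates, which is a useful debugging check in practice.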
3.11 Random Initialization
If you initialize the weights to zero, then the hidden units are completely identical, or as we sometimes say, completely symmetric: they compute exactly the same function. No matter how long you train your neural network, the hidden units will still be computing exactly the same function. So initialize your parameters randomly.
w1 = np.random.randn(2, 2) * 0.01   # small random values break the symmetry
b1 = np.zeros((2, 1))               # biases can safely start at zero
We usually prefer to initialize the weights to very small random values. If the weights are too large, then when you compute the activations, $z$ will be very large or very small; with sigmoid or tanh that lands in the saturated region where the slope of the function is tiny, slowing down learning.
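The symmetry problem can be demonstrated directly: with zero weights, every hidden unit produces the same activation on every example (sizes and data below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.standard_normal((3, 5))          # 5 examples, 3 features each

# Zero initialization: all four hidden units compute the same function.
W1 = np.zeros((4, 3)); b1 = np.zeros((4, 1))
A1 = np.tanh(W1 @ X + b1)
assert np.allclose(A1, A1[0])            # every row is identical

# Small random initialization breaks the symmetry.
W1 = rng.standard_normal((4, 3)) * 0.01
A1 = np.tanh(W1 @ X + b1)
assert not np.allclose(A1, A1[0])        # rows now differ
```

Because identical units also receive identical gradients, the zero-initialized rows would stay identical through every gradient-descent step.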