Course1-week3-one hidden layer neural network

3.1 - neural networks overview

Some new notation has been introduced: we'll use a superscript square bracket, e.g. $^{[1]}$, to refer to the layer of the neural network. For instance, $w^{[1]}$ represents the parameters of layer 1. This superscript square bracket is not to be confused with the superscript round bracket, which we use to refer to an individual training example, so $x^{(i)}$ refers to the $i$-th training example.

3.2 - neural network representation


(figure: neural network representation)

3.3 - computing a neural network output

Let's go more deeply into exactly what this neural network computes.


(figures: computing $z$ and $a$ for each hidden unit)

$$z^{[1]}_1 = w^{[1]T}_1 x + b^{[1]}_1,\quad a^{[1]}_1 = \sigma(z^{[1]}_1)$$
$$z^{[1]}_2 = w^{[1]T}_2 x + b^{[1]}_2,\quad a^{[1]}_2 = \sigma(z^{[1]}_2)$$
$$z^{[1]}_3 = w^{[1]T}_3 x + b^{[1]}_3,\quad a^{[1]}_3 = \sigma(z^{[1]}_3)$$
$$z^{[1]}_4 = w^{[1]T}_4 x + b^{[1]}_4,\quad a^{[1]}_4 = \sigma(z^{[1]}_4)$$

Take these four equations and vectorize them.

$$z^{[1]} = W^{[1]}x + b^{[1]} = \begin{bmatrix} \cdots & w^{[1]T}_1 & \cdots \\ \cdots & w^{[1]T}_2 & \cdots \\ \cdots & w^{[1]T}_3 & \cdots \\ \cdots & w^{[1]T}_4 & \cdots \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} + \begin{bmatrix} b^{[1]}_1 \\ b^{[1]}_2 \\ b^{[1]}_3 \\ b^{[1]}_4 \end{bmatrix} = \begin{bmatrix} w^{[1]T}_1 x + b^{[1]}_1 \\ w^{[1]T}_2 x + b^{[1]}_2 \\ w^{[1]T}_3 x + b^{[1]}_3 \\ w^{[1]T}_4 x + b^{[1]}_4 \end{bmatrix} = \begin{bmatrix} z^{[1]}_1 \\ z^{[1]}_2 \\ z^{[1]}_3 \\ z^{[1]}_4 \end{bmatrix} \tag{1}$$

When vectorizing, one rule of thumb is that when we have different nodes in a layer, we stack them vertically.

$$a^{[1]} = \sigma(z^{[1]})$$

So when you have a neural network with one hidden layer, what you need to implement to compute the output is just the four equations below.

$$z^{[1]} = W^{[1]}x + b^{[1]}$$
$$a^{[1]} = \sigma(z^{[1]})$$
$$z^{[2]} = W^{[2]}a^{[1]} + b^{[2]}$$
$$a^{[2]} = \sigma(z^{[2]})$$

Now we have seen how, given a single input feature vector $x$, you can compute the output of this neural network with four lines of code. Similar to what we did for logistic regression, we also want to vectorize across multiple training examples, so we can compute the output of the neural network not just one example at a time, but for the entire training set at a time.
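As a rough sketch of those four lines in numpy (the layer sizes `n_0`, `n_1`, `n_2`, the random parameters, and the example vector `x` below are hypothetical placeholders):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# hypothetical layer sizes: 3 input features, 4 hidden units, 1 output unit
n_0, n_1, n_2 = 3, 4, 1
W1, b1 = np.random.randn(n_1, n_0) * 0.01, np.zeros((n_1, 1))
W2, b2 = np.random.randn(n_2, n_1) * 0.01, np.zeros((n_2, 1))

x = np.random.randn(n_0, 1)        # a single training example, shape (n_x, 1)

# the four lines that compute the network's output
z1 = W1 @ x + b1                   # (n_1, 1)
a1 = sigmoid(z1)                   # (n_1, 1)
z2 = W2 @ a1 + b2                  # (n_2, 1)
a2 = sigmoid(z2)                   # (n_2, 1) -- this is y_hat
```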

3.4 - vectorizing across multiple examples

By stacking up the different training examples in different columns of a matrix, we'll be able to take the four equations from the previous video and, with very little modification, make them work on the whole training set.
With an unvectorized implementation, to compute the output for all training examples you need a loop: for i = 1 to m.

$$x^{(1)} \longrightarrow a^{[2](1)},\quad x^{(2)} \longrightarrow a^{[2](2)},\quad \cdots,\quad x^{(m)} \longrightarrow a^{[2](m)}$$
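A minimal unvectorized sketch of that loop in numpy, assuming the same hypothetical layer sizes and parameters as the single-example sketch above:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

n_0, n_1, n_2, m = 3, 4, 1, 5            # hypothetical sizes and example count
W1, b1 = np.random.randn(n_1, n_0) * 0.01, np.zeros((n_1, 1))
W2, b2 = np.random.randn(n_2, n_1) * 0.01, np.zeros((n_2, 1))
X = np.random.randn(n_0, m)              # one training example per column

# unvectorized: loop over the m examples, "for i = 1 to m"
outputs = []
for i in range(m):
    x_i = X[:, i:i+1]                    # i-th column, shape (n_0, 1)
    z1 = W1 @ x_i + b1
    a1 = sigmoid(z1)
    z2 = W2 @ a1 + b2
    outputs.append(sigmoid(z2))          # a^[2](i)
A2 = np.hstack(outputs)                  # stack the outputs, shape (n_2, m)
```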

And the vectorized version is:

$$Z^{[1]} = W^{[1]}A^{[0]} + b^{[1]}$$
$$A^{[1]} = \sigma(Z^{[1]})$$
$$Z^{[2]} = W^{[2]}A^{[1]} + b^{[2]}$$
$$A^{[2]} = \sigma(Z^{[2]})$$

where:
$$X = A^{[0]} = \begin{bmatrix} x^{(1)} & x^{(2)} & \cdots & x^{(m)} \end{bmatrix} \quad (n_x, m) \tag{2}$$

$$Z^{[1]} = \begin{bmatrix} z^{[1](1)} & z^{[1](2)} & \cdots & z^{[1](m)} \end{bmatrix} \tag{3}$$

$$A^{[1]} = \begin{bmatrix} a^{[1](1)} & a^{[1](2)} & \cdots & a^{[1](m)} \end{bmatrix} \tag{4}$$

So horizontally, the matrices $A$ and $Z$ index across the different training examples; vertically, they index across the different hidden units.
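A minimal numpy sketch of the vectorized forward pass, again with hypothetical layer sizes, where `A0` stacks the m training examples as columns:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

n_0, n_1, n_2, m = 3, 4, 1, 5            # hypothetical sizes
W1, b1 = np.random.randn(n_1, n_0) * 0.01, np.zeros((n_1, 1))
W2, b2 = np.random.randn(n_2, n_1) * 0.01, np.zeros((n_2, 1))
A0 = np.random.randn(n_0, m)             # X = A^[0], shape (n_x, m)

Z1 = W1 @ A0 + b1                        # (n_1, m); b1 is broadcast across columns
A1 = sigmoid(Z1)                         # (n_1, m)
Z2 = W2 @ A1 + b2                        # (n_2, m)
A2 = sigmoid(Z2)                         # (n_2, m): column i is a^[2](i)
```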

3.5 - explanation for vectorized implementation


(figure: justification for the vectorized implementation)

3.6 - activation functions

$$\mathrm{sigmoid}(z) = \frac{1}{1+e^{-z}}$$

$$\tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}$$

It turns out that if you take the activation function to be $\tanh(z)$, this almost always works better than the sigmoid function, because its values lie between -1 and 1, so the mean of the activation values coming out of your hidden layer is closer to 0; this actually makes learning for the next layer a little bit easier. I pretty much never use the sigmoid activation function anymore, with one exception: the output layer, because if y is either 0 or 1, then you want the output to be between 0 and 1 rather than between -1 and 1.

One of the downsides of both the sigmoid function and the tanh function is that if z is very small or very large, then the gradient (the derivative or slope of the function) becomes very small, and this can slow down gradient descent.

$$\mathrm{ReLU}(z) = \max(0, z)$$

$$\mathrm{leaky\ ReLU}(z) = \max(0.01z, z)$$

The advantage of ReLU and leaky ReLU is that for a lot of the space of z, the slope of the activation function is very different from 0, so in practice, using the ReLU activation function your neural network will often learn much faster than with the tanh or sigmoid activation functions.
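A sketch of these four activation functions in numpy (the function names here are just illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def tanh(z):
    # equivalent to np.tanh(z)
    return (np.exp(z) - np.exp(-z)) / (np.exp(z) + np.exp(-z))

def relu(z):
    return np.maximum(0, z)

def leaky_relu(z, alpha=0.01):
    return np.maximum(alpha * z, z)
```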

3.7 - why do you need non-linear activation function

Why does the neural network need a nonlinear activation function? If you use a linear (identity) activation function, your model just computes $\hat{y}$ as a linear function of your input features x.

$$a^{[1]} = w^{[1]}x + b^{[1]}$$
$$a^{[2]} = w^{[2]}a^{[1]} + b^{[2]}$$

so
$$a^{[2]} = w^{[2]}(w^{[1]}x + b^{[1]}) + b^{[2]} = \underbrace{w^{[2]}w^{[1]}}_{w'}x + \underbrace{w^{[2]}b^{[1]} + b^{[2]}}_{b'} = w'x + b'$$

so it is just outputting a linear function of the input x. One place you might use a linear activation function is the output layer, when the problem you face is regression.
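The collapse into a single linear map is easy to check numerically; the sketch below, with arbitrary small shapes, verifies that two linear layers compose to $w'x + b'$ with $w' = w^{[2]}w^{[1]}$ and $b' = w^{[2]}b^{[1]} + b^{[2]}$:

```python
import numpy as np

np.random.seed(0)
n_0, n_1, n_2 = 3, 4, 1                  # hypothetical layer sizes
W1, b1 = np.random.randn(n_1, n_0), np.random.randn(n_1, 1)
W2, b2 = np.random.randn(n_2, n_1), np.random.randn(n_2, 1)
x = np.random.randn(n_0, 1)

# two layers with the identity (linear) activation
a1 = W1 @ x + b1
a2 = W2 @ a1 + b2

# the equivalent single linear layer
W_prime = W2 @ W1
b_prime = W2 @ b1 + b2

assert np.allclose(a2, W_prime @ x + b_prime)   # same output: still linear in x
```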

3.8 - derivatives of activation function

$$a = g(z) = \frac{1}{1+e^{-z}}$$

$$g'(z) = a(1-a) \tag{1}$$

$$a = g(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}$$

$$g'(z) = 1 - a^2 \tag{2}$$

$$g(z) = \max(0, z)$$

$$g'(z) = \begin{cases} 1, & \text{if } z \geq 0 \\ 0, & \text{if } z < 0 \end{cases} \tag{3}$$

$$g(z) = \max(0.01z, z)$$

$$g'(z) = \begin{cases} 1, & \text{if } z \geq 0 \\ 0.01, & \text{if } z < 0 \end{cases} \tag{4}$$
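A numpy sketch of these four derivatives (function names are illustrative):

```python
import numpy as np

def sigmoid_prime(z):
    a = 1 / (1 + np.exp(-z))
    return a * (1 - a)                   # g'(z) = a(1 - a)

def tanh_prime(z):
    a = np.tanh(z)
    return 1 - a ** 2                    # g'(z) = 1 - a^2

def relu_prime(z):
    return (z >= 0).astype(float)        # 1 if z >= 0, else 0

def leaky_relu_prime(z, alpha=0.01):
    return np.where(z >= 0, 1.0, alpha)  # 1 if z >= 0, else alpha
```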

3.9 - gradient descent for neural networks

We will see how to implement gradient descent for a neural network with one hidden layer. The parameters of a neural network with one hidden layer are:

$$W^{[1]}: (n^{[1]}, n^{[0]}),\quad b^{[1]}: (n^{[1]}, 1),\quad W^{[2]}: (n^{[2]}, n^{[1]}),\quad b^{[2]}: (n^{[2]}, 1)$$

Assuming that we are doing binary classification, the cost function is:

$$J(W^{[1]}, b^{[1]}, W^{[2]}, b^{[2]}) = \frac{1}{m}\sum_{i=1}^{m} \mathcal{L}(\hat{y}, y), \quad \text{where } \hat{y} = a^{[2]}$$

If you are doing binary classification, the loss function can be exactly the one you use for logistic regression.
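For instance, the cross-entropy cost over the whole training set might be computed as in the sketch below, where `A2` holds the predictions $\hat{y} = a^{[2]}$ of shape (1, m) and `Y` holds the labels (both arrays here are made-up examples):

```python
import numpy as np

def compute_cost(A2, Y):
    """Cross-entropy cost, averaged over the m training examples."""
    m = Y.shape[1]
    losses = -(Y * np.log(A2) + (1 - Y) * np.log(1 - A2))
    return np.sum(losses) / m

# hypothetical example
A2 = np.array([[0.8, 0.4, 0.9]])         # predictions y_hat = a^[2]
Y  = np.array([[1,   0,   1  ]])         # true labels
cost = compute_cost(A2, Y)
```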

Let’s summarize again the equations for forward propagation:

$$Z^{[1]} = W^{[1]}X + b^{[1]}$$
$$A^{[1]} = g^{[1]}(Z^{[1]})$$
$$Z^{[2]} = W^{[2]}A^{[1]} + b^{[2]}$$
$$A^{[2]} = g^{[2]}(Z^{[2]})$$

back propagation

$$dZ^{[2]} = A^{[2]} - Y,\quad \text{where } Y = [y^{(1)}, y^{(2)}, \cdots, y^{(m)}],\ A^{[2]} = [a^{[2](1)}, a^{[2](2)}, \cdots, a^{[2](m)}]$$
$$dW^{[2]} = \frac{1}{m}\, dZ^{[2]} A^{[1]T}$$
$$db^{[2]} = \frac{1}{m}\, \mathrm{np.sum}(dZ^{[2]}, \text{axis}=1, \text{keepdims}=\text{True})$$
$$dZ^{[1]} = W^{[2]T} dZ^{[2]} * g^{[1]\prime}(Z^{[1]}) \quad \text{(both } W^{[2]T}dZ^{[2]} \text{ and } g^{[1]\prime}(Z^{[1]}) \text{ are } (n^{[1]}, m)\text{, and } * \text{ is element-wise)}$$
$$dW^{[1]} = \frac{1}{m}\, dZ^{[1]} X^T$$
$$db^{[1]} = \frac{1}{m}\, \mathrm{np.sum}(dZ^{[1]}, \text{axis}=1, \text{keepdims}=\text{True})$$
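Putting the forward and backward equations together, one gradient descent step could look roughly like this sketch, assuming a tanh hidden layer, a sigmoid output layer, and hypothetical layer sizes, data, and learning rate:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

n_0, n_1, n_2, m = 3, 4, 1, 50
lr = 0.1                                   # learning rate (hypothetical)
W1, b1 = np.random.randn(n_1, n_0) * 0.01, np.zeros((n_1, 1))
W2, b2 = np.random.randn(n_2, n_1) * 0.01, np.zeros((n_2, 1))
X = np.random.randn(n_0, m)                # made-up inputs
Y = (np.random.rand(1, m) > 0.5).astype(float)   # made-up binary labels

# forward propagation
Z1 = W1 @ X + b1
A1 = np.tanh(Z1)                           # g^[1] = tanh
Z2 = W2 @ A1 + b2
A2 = sigmoid(Z2)                           # g^[2] = sigmoid

# backpropagation
dZ2 = A2 - Y
dW2 = (dZ2 @ A1.T) / m
db2 = np.sum(dZ2, axis=1, keepdims=True) / m
dZ1 = (W2.T @ dZ2) * (1 - A1 ** 2)         # g^[1]'(Z1) = 1 - tanh(Z1)^2
dW1 = (dZ1 @ X.T) / m
db1 = np.sum(dZ1, axis=1, keepdims=True) / m

# gradient descent parameter update
W1 -= lr * dW1; b1 -= lr * db1
W2 -= lr * dW2; b2 -= lr * db2
```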

3.10 - backpropagation intuition

backpropagation for logistic regression


(figure: backpropagation for logistic regression)
backpropagation for neural network

(figure: backpropagation for the neural network)
equations for backpropagation

(figure: equations for backpropagation)

3.11 - random initialization

When you train a neural network, it's important to initialize the weights randomly. For logistic regression it was okay to initialize the weights to zero, but for a neural network, if you initialize the weights to all zeros and then apply gradient descent, it won't work.

If you initialize the neural network with:

$$W^{[1]} = \begin{bmatrix} 0 & 0 \\ 0 & 0 \end{bmatrix},\quad W^{[2]} = \begin{bmatrix} 0 & 0 \end{bmatrix}$$

then the two hidden units are completely identical, so they compute exactly the same function, but we want different units to compute different things.

W1 = np.random.randn(n_1, n_0) * 0.01   # small random values break symmetry; randn takes the dimensions as separate arguments
b1 = np.zeros((n_1, 1))                 # biases can safely be initialized to zero; shape (n_1, 1)

Where does this constant 0.01 come from? Why 0.01, and not 100 or 1000? It turns out that we usually prefer to initialize the parameters to very small random values. If you use a sigmoid or tanh activation function and the parameters are too large, then when you compute the activation values $z^{[1]} = W^{[1]}x + b^{[1]}$, $a^{[1]} = g(z^{[1]})$, a very big $W$ makes $z$ very big, so you start off training with very large values of $z$, which causes the tanh and sigmoid activation functions to saturate, thus slowing down learning.
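A minimal sketch of initializing all the parameters of the one-hidden-layer network this way (the layer sizes passed in are hypothetical):

```python
import numpy as np

def initialize_parameters(n_0, n_1, n_2):
    # small random weights break symmetry; biases can safely start at zero
    W1 = np.random.randn(n_1, n_0) * 0.01
    b1 = np.zeros((n_1, 1))
    W2 = np.random.randn(n_2, n_1) * 0.01
    b2 = np.zeros((n_2, 1))
    return W1, b1, W2, b2

W1, b1, W2, b2 = initialize_parameters(n_0=3, n_1=4, n_2=1)
```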
