Course1-week3-one hidden layer neural network

3.1 - neural networks overview

Some new notation has been introduced: we'll use a superscript square bracket, e.g. $^{[1]}$, to refer to the layer of the neural network. For instance, $w^{[1]}$ represents the parameters of layer 1. This superscript square bracket is not to be confused with the superscript round bracket, which we use to refer to an individual training example, so $x^{(i)}$ refers to the $i$-th training example.

3.2 - neural network representation


(figure: neural network representation)

3.3 - computing a neural network output

Let's go more deeply into exactly what this neural network computes.


(figures: computing $z$ and $a$ for each hidden unit)

$$z^{[1]}_1 = w^{[1]T}_1 x + b^{[1]}_1,\quad a^{[1]}_1 = \sigma(z^{[1]}_1)$$
$$z^{[1]}_2 = w^{[1]T}_2 x + b^{[1]}_2,\quad a^{[1]}_2 = \sigma(z^{[1]}_2)$$
$$z^{[1]}_3 = w^{[1]T}_3 x + b^{[1]}_3,\quad a^{[1]}_3 = \sigma(z^{[1]}_3)$$
$$z^{[1]}_4 = w^{[1]T}_4 x + b^{[1]}_4,\quad a^{[1]}_4 = \sigma(z^{[1]}_4)$$

Take these four equations and vectorize them.

$$z^{[1]} = W^{[1]}x + b^{[1]} = \begin{bmatrix} \cdots & w^{[1]T}_1 & \cdots \\ \cdots & w^{[1]T}_2 & \cdots \\ \cdots & w^{[1]T}_3 & \cdots \\ \cdots & w^{[1]T}_4 & \cdots \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} + \begin{bmatrix} b^{[1]}_1 \\ b^{[1]}_2 \\ b^{[1]}_3 \\ b^{[1]}_4 \end{bmatrix} = \begin{bmatrix} w^{[1]T}_1 x + b^{[1]}_1 \\ w^{[1]T}_2 x + b^{[1]}_2 \\ w^{[1]T}_3 x + b^{[1]}_3 \\ w^{[1]T}_4 x + b^{[1]}_4 \end{bmatrix} = \begin{bmatrix} z^{[1]}_1 \\ z^{[1]}_2 \\ z^{[1]}_3 \\ z^{[1]}_4 \end{bmatrix} \tag{1}$$

When vectorizing, one rule of thumb is that when we have different nodes in a layer, we stack them vertically.

$$a^{[1]} = \sigma(z^{[1]})$$

So when you have a neural network with one hidden layer, what you need to implement to compute the output is just the four equations below.

$$z^{[1]} = W^{[1]}x + b^{[1]}$$
$$a^{[1]} = \sigma(z^{[1]})$$
$$z^{[2]} = W^{[2]}a^{[1]} + b^{[2]}$$
$$a^{[2]} = \sigma(z^{[2]})$$

Now we have seen how, given a single input feature vector $x$, you can compute the output of this neural network with four lines of code. Similar to what we did for logistic regression, we also want to vectorize across multiple training examples, so we can compute the output of the neural network not just one example at a time, but for the entire training set at a time.
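As a rough sketch of those four lines in numpy (the layer sizes `n_0`, `n_1`, `n_2`, the random parameters, and the example vector `x` below are hypothetical placeholders):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# hypothetical layer sizes: 3 input features, 4 hidden units, 1 output unit
n_0, n_1, n_2 = 3, 4, 1
W1, b1 = np.random.randn(n_1, n_0) * 0.01, np.zeros((n_1, 1))
W2, b2 = np.random.randn(n_2, n_1) * 0.01, np.zeros((n_2, 1))

x = np.random.randn(n_0, 1)        # a single training example, shape (n_x, 1)

# the four lines that compute the network's output
z1 = W1 @ x + b1                   # (n_1, 1)
a1 = sigmoid(z1)                   # (n_1, 1)
z2 = W2 @ a1 + b2                  # (n_2, 1)
a2 = sigmoid(z2)                   # (n_2, 1) -- this is y_hat
```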

3.4 - vectorizing across multiple examples

By stacking up the different training examples in different columns of a matrix, we'll be able to take the four equations from the previous video and, with very little modification, make them work on the whole training set.
With an unvectorized implementation, to compute the output for all training examples you need a loop: for i = 1 to m.

$$x^{(1)} \longrightarrow a^{[2](1)},\quad x^{(2)} \longrightarrow a^{[2](2)},\quad \cdots,\quad x^{(m)} \longrightarrow a^{[2](m)}$$
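A minimal unvectorized sketch of that loop in numpy, assuming the same hypothetical layer sizes and parameters as the single-example sketch above:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

n_0, n_1, n_2, m = 3, 4, 1, 5            # hypothetical sizes and example count
W1, b1 = np.random.randn(n_1, n_0) * 0.01, np.zeros((n_1, 1))
W2, b2 = np.random.randn(n_2, n_1) * 0.01, np.zeros((n_2, 1))
X = np.random.randn(n_0, m)              # one training example per column

# unvectorized: loop over the m examples, "for i = 1 to m"
outputs = []
for i in range(m):
    x_i = X[:, i:i+1]                    # i-th column, shape (n_0, 1)
    z1 = W1 @ x_i + b1
    a1 = sigmoid(z1)
    z2 = W2 @ a1 + b2
    outputs.append(sigmoid(z2))          # a^[2](i)
A2 = np.hstack(outputs)                  # stack the outputs, shape (n_2, m)
```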

And the vectorized version is:

$$Z^{[1]} = W^{[1]}A^{[0]} + b^{[1]}$$
$$A^{[1]} = \sigma(Z^{[1]})$$
$$Z^{[2]} = W^{[2]}A^{[1]} + b^{[2]}$$
$$A^{[2]} = \sigma(Z^{[2]})$$

where:
$$X = A^{[0]} = \begin{bmatrix} x^{(1)} & x^{(2)} & \cdots & x^{(m)} \end{bmatrix} \quad (n_x, m) \tag{2}$$

$$Z^{[1]} = \begin{bmatrix} z^{[1](1)} & z^{[1](2)} & \cdots & z^{[1](m)} \end{bmatrix} \tag{3}$$

$$A^{[1]} = \begin{bmatrix} a^{[1](1)} & a^{[1](2)} & \cdots & a^{[1](m)} \end{bmatrix} \tag{4}$$

So horizontally, the matrices $A$ and $Z$ index across the different training examples; vertically, they index across the different hidden units.
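A minimal numpy sketch of the vectorized forward pass, again with hypothetical layer sizes, where `A0` stacks the m training examples as columns:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

n_0, n_1, n_2, m = 3, 4, 1, 5            # hypothetical sizes
W1, b1 = np.random.randn(n_1, n_0) * 0.01, np.zeros((n_1, 1))
W2, b2 = np.random.randn(n_2, n_1) * 0.01, np.zeros((n_2, 1))
A0 = np.random.randn(n_0, m)             # X = A^[0], shape (n_x, m)

Z1 = W1 @ A0 + b1                        # (n_1, m); b1 is broadcast across columns
A1 = sigmoid(Z1)                         # (n_1, m)
Z2 = W2 @ A1 + b2                        # (n_2, m)
A2 = sigmoid(Z2)                         # (n_2, m): column i is a^[2](i)
```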

3.5 - explanation for vectorized implementation


(figure: justification for the vectorized implementation)

3.6 - activation functions

$$\mathrm{sigmoid}(z) = \frac{1}{1+e^{-z}}$$

$$\tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}$$

It turns out that if you take the activation function to be $\tanh(z)$, this almost always works better than the sigmoid function, because its values lie between -1 and 1, so the mean of the activation values coming out of your hidden layer is closer to 0; this actually makes learning for the next layer a little bit easier. I pretty much never use the sigmoid activation function anymore, with one exception: the output layer, because if y is either 0 or 1, then you want the output to be between 0 and 1 rather than between -1 and 1.

One of the downsides of both the sigmoid function and the tanh function is that if z is very small or very large, then the gradient (the derivative or slope of the function) becomes very small, and this can slow down gradient descent.

$$\mathrm{ReLU}(z) = \max(0, z)$$

$$\mathrm{leaky\ ReLU}(z) = \max(0.01z, z)$$

The advantage of ReLU and leaky ReLU is that for a lot of the space of z, the slope of the activation function is very different from 0, so in practice, using the ReLU activation function your neural network will often learn much faster than with the tanh or sigmoid activation functions.
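A sketch of these four activation functions in numpy (the function names here are just illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def tanh(z):
    # equivalent to np.tanh(z)
    return (np.exp(z) - np.exp(-z)) / (np.exp(z) + np.exp(-z))

def relu(z):
    return np.maximum(0, z)

def leaky_relu(z, alpha=0.01):
    return np.maximum(alpha * z, z)
```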

3.7 - why do you need non-linear activation function

Why does the neural network need a nonlinear activation function? If you use a linear (identity) activation function, your model just computes $\hat{y}$ as a linear function of your input features x.

$$a^{[1]} = w^{[1]}x + b^{[1]}$$
$$a^{[2]} = w^{[2]}a^{[1]} + b^{[2]}$$

so
$$a^{[2]} = w^{[2]}(w^{[1]}x + b^{[1]}) + b^{[2]} = \underbrace{w^{[2]}w^{[1]}}_{w'}x + \underbrace{w^{[2]}b^{[1]} + b^{[2]}}_{b'} = w'x + b'$$

so it is just outputting a linear function of the input x. One place you might use a linear activation function is the output layer, when the problem you face is regression.
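The collapse into a single linear map is easy to check numerically; the sketch below, with arbitrary small shapes, verifies that two linear layers compose to $w'x + b'$ with $w' = w^{[2]}w^{[1]}$ and $b' = w^{[2]}b^{[1]} + b^{[2]}$:

```python
import numpy as np

np.random.seed(0)
n_0, n_1, n_2 = 3, 4, 1                  # hypothetical layer sizes
W1, b1 = np.random.randn(n_1, n_0), np.random.randn(n_1, 1)
W2, b2 = np.random.randn(n_2, n_1), np.random.randn(n_2, 1)
x = np.random.randn(n_0, 1)

# two layers with the identity (linear) activation
a1 = W1 @ x + b1
a2 = W2 @ a1 + b2

# the equivalent single linear layer
W_prime = W2 @ W1
b_prime = W2 @ b1 + b2

assert np.allclose(a2, W_prime @ x + b_prime)   # same output: still linear in x
```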

3.8 - derivatives of activation function

$$a = g(z) = \frac{1}{1+e^{-z}}$$

$$g'(z) = a(1-a) \tag{1}$$

$$a = g(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}$$

$$g'(z) = 1 - a^2 \tag{2}$$

$$g(z) = \max(0, z)$$

$$g'(z) = \begin{cases} 1, & \text{if } z \geq 0 \\ 0, & \text{if } z < 0 \end{cases} \tag{3}$$

$$g(z) = \max(0.01z, z)$$

$$g'(z) = \begin{cases} 1, & \text{if } z \geq 0 \\ 0.01, & \text{if } z < 0 \end{cases} \tag{4}$$
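A numpy sketch of these four derivatives (function names are illustrative):

```python
import numpy as np

def sigmoid_prime(z):
    a = 1 / (1 + np.exp(-z))
    return a * (1 - a)                   # g'(z) = a(1 - a)

def tanh_prime(z):
    a = np.tanh(z)
    return 1 - a ** 2                    # g'(z) = 1 - a^2

def relu_prime(z):
    return (z >= 0).astype(float)        # 1 if z >= 0, else 0

def leaky_relu_prime(z, alpha=0.01):
    return np.where(z >= 0, 1.0, alpha)  # 1 if z >= 0, else alpha
```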

3.9 - gradient descent for neural networks

We will see how to implement gradient descent for a neural network with one hidden layer. The parameters of a neural network with one hidden layer are:

$$W^{[1]}: (n^{[1]}, n^{[0]}),\quad b^{[1]}: (n^{[1]}, 1),\quad W^{[2]}: (n^{[2]}, n^{[1]}),\quad b^{[2]}: (n^{[2]}, 1)$$

Assuming that we are doing binary classification, the cost function is:

$$J(W^{[1]}, b^{[1]}, W^{[2]}, b^{[2]}) = \frac{1}{m}\sum_{i=1}^{m} \mathcal{L}(\hat{y}, y), \quad \text{where } \hat{y} = a^{[2]}$$

If you are doing binary classification, the loss function can be exactly the one you use for logistic regression.
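For instance, the cross-entropy cost over the whole training set might be computed as in the sketch below, where `A2` holds the predictions $\hat{y} = a^{[2]}$ of shape (1, m) and `Y` holds the labels (both arrays here are made-up examples):

```python
import numpy as np

def compute_cost(A2, Y):
    """Cross-entropy cost, averaged over the m training examples."""
    m = Y.shape[1]
    losses = -(Y * np.log(A2) + (1 - Y) * np.log(1 - A2))
    return np.sum(losses) / m

# hypothetical example
A2 = np.array([[0.8, 0.4, 0.9]])         # predictions y_hat = a^[2]
Y  = np.array([[1,   0,   1  ]])         # true labels
cost = compute_cost(A2, Y)
```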

Let’s summarize again the equations for forward propagation:

$$Z^{[1]} = W^{[1]}X + b^{[1]}$$
$$A^{[1]} = g^{[1]}(Z^{[1]})$$
$$Z^{[2]} = W^{[2]}A^{[1]} + b^{[2]}$$
$$A^{[2]} = g^{[2]}(Z^{[2]})$$

back propagation

$$dZ^{[2]} = A^{[2]} - Y,\quad \text{where } Y = [y^{(1)}, y^{(2)}, \cdots, y^{(m)}],\ A^{[2]} = [a^{[2](1)}, a^{[2](2)}, \cdots, a^{[2](m)}]$$
$$dW^{[2]} = \frac{1}{m}\, dZ^{[2]} A^{[1]T}$$
$$db^{[2]} = \frac{1}{m}\, \mathrm{np.sum}(dZ^{[2]}, \text{axis}=1, \text{keepdims}=\text{True})$$
$$dZ^{[1]} = W^{[2]T} dZ^{[2]} * g^{[1]\prime}(Z^{[1]}) \quad \text{(both } W^{[2]T}dZ^{[2]} \text{ and } g^{[1]\prime}(Z^{[1]}) \text{ are } (n^{[1]}, m)\text{, and } * \text{ is element-wise)}$$
$$dW^{[1]} = \frac{1}{m}\, dZ^{[1]} X^T$$
$$db^{[1]} = \frac{1}{m}\, \mathrm{np.sum}(dZ^{[1]}, \text{axis}=1, \text{keepdims}=\text{True})$$
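Putting the forward and backward equations together, one gradient descent step could look roughly like this sketch, assuming a tanh hidden layer, a sigmoid output layer, and hypothetical layer sizes, data, and learning rate:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

n_0, n_1, n_2, m = 3, 4, 1, 50
lr = 0.1                                   # learning rate (hypothetical)
W1, b1 = np.random.randn(n_1, n_0) * 0.01, np.zeros((n_1, 1))
W2, b2 = np.random.randn(n_2, n_1) * 0.01, np.zeros((n_2, 1))
X = np.random.randn(n_0, m)                # made-up inputs
Y = (np.random.rand(1, m) > 0.5).astype(float)   # made-up binary labels

# forward propagation
Z1 = W1 @ X + b1
A1 = np.tanh(Z1)                           # g^[1] = tanh
Z2 = W2 @ A1 + b2
A2 = sigmoid(Z2)                           # g^[2] = sigmoid

# backpropagation
dZ2 = A2 - Y
dW2 = (dZ2 @ A1.T) / m
db2 = np.sum(dZ2, axis=1, keepdims=True) / m
dZ1 = (W2.T @ dZ2) * (1 - A1 ** 2)         # g^[1]'(Z1) = 1 - tanh(Z1)^2
dW1 = (dZ1 @ X.T) / m
db1 = np.sum(dZ1, axis=1, keepdims=True) / m

# gradient descent parameter update
W1 -= lr * dW1; b1 -= lr * db1
W2 -= lr * dW2; b2 -= lr * db2
```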

3.10 - backpropagation intuition

backpropagation for logistic regression


(figure: backpropagation for logistic regression)
backpropagation for neural network

(figure: backpropagation for the neural network)
equations for backpropagation

(figure: equations for backpropagation)

3.11 - random initialization

When you train a neural network, it's important to initialize the weights randomly. For logistic regression it was okay to initialize the weights to zero, but for a neural network, if you initialize the weights to all zeros and then apply gradient descent, it won't work.

If you initialize the neural network with:

$$W^{[1]} = \begin{bmatrix} 0 & 0 \\ 0 & 0 \end{bmatrix},\quad W^{[2]} = \begin{bmatrix} 0 & 0 \end{bmatrix}$$

then the two hidden units are completely identical, so they compute exactly the same function, but we want different units to compute different things.

W1 = np.random.randn(n_1, n_0) * 0.01   # small random values break symmetry; randn takes the dimensions as separate arguments
b1 = np.zeros((n_1, 1))                 # biases can safely be initialized to zero; shape (n_1, 1)

Where does this constant 0.01 come from? Why 0.01, and not 100 or 1000? It turns out that we usually prefer to initialize the parameters to very small random values. If you use a sigmoid or tanh activation function and the parameters are too large, then when you compute the activation values $z^{[1]} = W^{[1]}x + b^{[1]}$, $a^{[1]} = g(z^{[1]})$, a very big $W$ makes $z$ very big, so you start off training with very large values of $z$, which causes the tanh and sigmoid activation functions to saturate, thus slowing down learning.
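A minimal sketch of initializing all the parameters of the one-hidden-layer network this way (the layer sizes passed in are hypothetical):

```python
import numpy as np

def initialize_parameters(n_0, n_1, n_2):
    # small random weights break symmetry; biases can safely start at zero
    W1 = np.random.randn(n_1, n_0) * 0.01
    b1 = np.zeros((n_1, 1))
    W2 = np.random.randn(n_2, n_1) * 0.01
    b2 = np.zeros((n_2, 1))
    return W1, b1, W2, b2

W1, b1, W2, b2 = initialize_parameters(n_0=3, n_1=4, n_2=1)
```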
