DL-1-week3-One hidden layer Neural Network

3 One hidden layer Neural Network


Date:2018.3.8

3.1&3.2 Neural Networks Overview & Neural Network Representation

A neural network with a single hidden layer consists of an input layer, a hidden layer, and an output layer.

3.3 Computing a Neural Network’s Output

As in logistic regression, each circle (node) in the network diagram represents two steps of computation: first compute z, then compute the activation a.

In the hidden layer:

$z^{[1]}_i = \omega^{[1]T}_i x + b^{[1]}_i$
$a^{[1]}_i = \sigma(z^{[1]}_i)$

Let
$W^{[1]} = \begin{pmatrix} \omega^{[1]T}_1 \\ \omega^{[1]T}_2 \\ \omega^{[1]T}_3 \\ \omega^{[1]T}_4 \end{pmatrix}$
$z^{[1]} = (z^{[1]}_1, z^{[1]}_2, z^{[1]}_3, z^{[1]}_4)^T$

So the hidden layer's output, and then the network's output, are:
$z^{[1]} = W^{[1]} x + b^{[1]}$
$a^{[1]} = \sigma(z^{[1]})$
$z^{[2]} = W^{[2]} a^{[1]} + b^{[2]}$
$a^{[2]} = \sigma(z^{[2]})$
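A minimal numpy sketch of this forward computation for a single example x; the layer sizes (3 inputs, 4 hidden units, 1 output) and the variable names are assumptions for illustration, not fixed by the course:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Assumed layer sizes for illustration: 3 inputs, 4 hidden units, 1 output unit.
n_x, n_h, n_y = 3, 4, 1
W1 = np.random.randn(n_h, n_x) * 0.01   # W[1], shape (4, 3)
b1 = np.zeros((n_h, 1))                 # b[1], shape (4, 1)
W2 = np.random.randn(n_y, n_h) * 0.01   # W[2], shape (1, 4)
b2 = np.zeros((n_y, 1))                 # b[2], shape (1, 1)

x = np.random.randn(n_x, 1)             # one input example as a column vector

z1 = W1 @ x + b1                        # z[1] = W[1] x + b[1]
a1 = sigmoid(z1)                        # a[1] = sigma(z[1])
z2 = W2 @ a1 + b2                       # z[2] = W[2] a[1] + b[2]
a2 = sigmoid(z2)                        # a[2] = sigma(z[2]), the prediction y-hat
```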

3.4 Vectorizing across multiple examples

Suppose we have m training examples. For i = 1 to m, we have:

$z^{[1](i)} = W^{[1]} x^{(i)} + b^{[1]}$
$a^{[1](i)} = \sigma(z^{[1](i)})$
$z^{[2](i)} = W^{[2]} a^{[1](i)} + b^{[2]}$
$a^{[2](i)} = \sigma(z^{[2](i)})$

Let $X = [x^{(1)}, x^{(2)}, \ldots, x^{(m)}] \in \mathbb{R}^{n_x \times m}$, so we get:
$Z^{[1]} = W^{[1]} X + b^{[1]}$
$A^{[1]} = \sigma(Z^{[1]})$
$Z^{[2]} = W^{[2]} A^{[1]} + b^{[2]}$
$A^{[2]} = \sigma(Z^{[2]})$
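The same forward pass vectorized across all m examples, as a sketch that reuses `sigmoid` and the parameters W1, b1, W2, b2 from the single-example sketch above (m = 5 is just an assumption); numpy broadcasting adds b1 and b2 to every column:

```python
m = 5                         # assumed number of training examples
X = np.random.randn(n_x, m)   # X = [x(1), ..., x(m)], shape (n_x, m)

Z1 = W1 @ X + b1              # Z[1] = W[1] X + b[1], shape (n_h, m)
A1 = sigmoid(Z1)              # A[1] = sigma(Z[1])
Z2 = W2 @ A1 + b2             # Z[2] = W[2] A[1] + b[2], shape (n_y, m)
A2 = sigmoid(Z2)              # A[2]: one prediction per column
```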

3.5 Explanation for vectorized implementation

(Figure: illustration of the vectorized implementation across the m training examples.)

3.6 Activation functions

In the forward propagation steps of a neural network we need to choose activation functions.

  • sigmoid function
    $f(z) = \frac{1}{1 + e^{-z}}$
  • tanh function (a shifted and rescaled version of the sigmoid function)

    $f(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$

    The tanh function is almost always strictly superior to the sigmoid for hidden units. The exception is binary classification, where you might still use the sigmoid activation function for the output layer so that the output lies between 0 and 1.
    One downside of both the sigmoid function and the tanh function is that if z is either very large or very small, the derivative (slope) of the function becomes very small, and this can slow down gradient descent.

  • ReLU function

    $f(z) = \max(0, z)$

  • Leaky ReLU function
    $f(z) = \max(0.01z,\ z)$

(Figure: plots of the activation functions.)

Some rules of thumb for choosing activation functions: if the output is a 0/1 value (binary classification), then the sigmoid activation function is a natural choice for the output layer. In the hidden layers we usually use the ReLU or tanh activation function; in practice, a network with ReLU will often learn much faster. A short numpy sketch of these four functions is given below.
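A minimal numpy sketch of the four activation functions above (the function names are mine):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))      # squashes z into (0, 1)

def tanh(z):
    return np.tanh(z)                    # squashes z into (-1, 1), zero-centered

def relu(z):
    return np.maximum(0.0, z)            # max(0, z), element-wise

def leaky_relu(z, slope=0.01):
    return np.maximum(slope * z, z)      # max(0.01*z, z), element-wise
```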

3.7 Why do you need non-linear activation functions?

If you use a linear (identity) activation function, then:

$a^{[1]} = z^{[1]} = \omega^{[1]} x + b^{[1]}$
$a^{[2]} = z^{[2]} = \omega^{[2]} a^{[1]} + b^{[2]} = \omega^{[2]} (\omega^{[1]} x + b^{[1]}) + b^{[2]} = (\omega^{[2]} \omega^{[1]}) x + (\omega^{[2]} b^{[1]} + b^{[2]})$

So the neural network is just outputting a linear function of the input, and a linear hidden layer is more or less useless. The only layer where a linear activation function is sometimes used is the output layer.
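A quick numerical check of this point (the layer sizes and names are chosen arbitrarily for illustration): with identity activations, the two layers collapse into a single linear map.

```python
import numpy as np

np.random.seed(0)
W1, b1 = np.random.randn(4, 3), np.random.randn(4, 1)
W2, b2 = np.random.randn(1, 4), np.random.randn(1, 1)
x = np.random.randn(3, 1)

# two layers with the identity ("linear") activation
a1 = W1 @ x + b1
a2 = W2 @ a1 + b2

# the equivalent single linear layer
W, b = W2 @ W1, W2 @ b1 + b2
assert np.allclose(a2, W @ x + b)   # same output for any x
```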


Date:2018.3.9

3.8 Derivatives of activation functions

  • Sigmoid function
    $f'(z) = f(z)\,(1 - f(z))$
  • Tanh function
    $f'(z) = 1 - (\tanh(z))^2$
  • ReLU function
    $f'(z) = \begin{cases} 0 & z < 0 \\ 1 & z \ge 0 \end{cases}$
  • Leaky ReLU function
    $f'(z) = \begin{cases} 0.01 & z < 0 \\ 1 & z \ge 0 \end{cases}$
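These derivatives written as numpy code; a sketch with names of my own choosing, assuming z is a numpy array:

```python
import numpy as np

def sigmoid_prime(z):
    a = 1.0 / (1.0 + np.exp(-z))
    return a * (1.0 - a)                 # f'(z) = f(z)(1 - f(z))

def tanh_prime(z):
    return 1.0 - np.tanh(z) ** 2         # f'(z) = 1 - tanh(z)^2

def relu_prime(z):
    return (z >= 0).astype(float)        # 0 for z < 0, 1 for z >= 0

def leaky_relu_prime(z, slope=0.01):
    return np.where(z < 0, slope, 1.0)   # 0.01 for z < 0, 1 for z >= 0
```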

3.9 Gradient descent for neural networks

Parameters:

$\omega^{[1]}, b^{[1]}, \omega^{[2]}, b^{[2]}$, with layer sizes $n_x = n^{[0]},\ n^{[1]},\ n^{[2]} = 1$

Cost function:
$J(\omega^{[1]}, b^{[1]}, \omega^{[2]}, b^{[2]}) = \frac{1}{m} \sum_{i=1}^{m} L(\hat{y}^{(i)}, y^{(i)})$

Gradient descent:
Repeat {
    Compute the predictions $\hat{y}^{(i)}$ for $i = 1, \ldots, m$
    $d\omega^{[l]} = \frac{\partial J}{\partial \omega^{[l]}}, \quad db^{[l]} = \frac{\partial J}{\partial b^{[l]}}$ for $l = 1, 2$
    $\omega^{[l]} := \omega^{[l]} - \alpha \, d\omega^{[l]}$
    $b^{[l]} := b^{[l]} - \alpha \, db^{[l]}$
}
So this is one iteration of gradient descent; you then repeat it some number of times until the parameters look like they are converging. The key is knowing how to compute these partial derivative terms, which is what forward propagation and back propagation give us.
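As a concrete illustration, here is a minimal sketch of one such iteration for this two-layer network, assuming a tanh hidden layer and a sigmoid output trained with the cross-entropy loss; the function name and layout are mine, but the update formulas are the standard ones for that choice of activations:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent_step(X, Y, W1, b1, W2, b2, alpha=0.01):
    """One iteration: forward propagation, back propagation, parameter update.
    X: (n_x, m) inputs, Y: (1, m) labels in {0, 1}."""
    m = X.shape[1]

    # forward propagation
    Z1 = W1 @ X + b1
    A1 = np.tanh(Z1)
    Z2 = W2 @ A1 + b2
    A2 = sigmoid(Z2)                      # predictions y-hat

    # back propagation
    dZ2 = A2 - Y                          # dJ/dZ2 for sigmoid + cross-entropy
    dW2 = (dZ2 @ A1.T) / m
    db2 = dZ2.sum(axis=1, keepdims=True) / m
    dZ1 = (W2.T @ dZ2) * (1 - A1 ** 2)    # tanh'(Z1) = 1 - A1^2
    dW1 = (dZ1 @ X.T) / m
    db1 = dZ1.sum(axis=1, keepdims=True) / m

    # gradient descent update
    W1 -= alpha * dW1; b1 -= alpha * db1
    W2 -= alpha * dW2; b2 -= alpha * db2
    return W1, b1, W2, b2
```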

3.11 Random Initialization

If you initialize the weights to zero, then the hidden units are completely identical, or, as we sometimes say, completely symmetric, which means they compute exactly the same function. No matter how long you train the neural network, the hidden units will still be computing exactly the same function, so the extra units add nothing. Therefore, initialize the weight parameters randomly.

import numpy as np
w1 = np.random.randn(2, 2) * 0.01   # randn takes the dimensions as separate arguments
b1 = np.zeros((2, 1))               # np.zeros, not np.zero; biases may start at zero

We usually prefer to initialize the weights to very small random values. If the weights are too large, then z will be very large or very small when you compute the activation values, and with sigmoid or tanh activations you end up on the flat parts of the curve where the slope (gradient) is very small, which slows down learning.
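Putting this together, a minimal sketch of initializing every parameter of the network for layer sizes n_x, n_h, n_y (the helper and its name are mine, not from the course):

```python
import numpy as np

def initialize_parameters(n_x, n_h, n_y, scale=0.01):
    """Small random weights break the symmetry; biases can start at zero."""
    return {
        "W1": np.random.randn(n_h, n_x) * scale,
        "b1": np.zeros((n_h, 1)),
        "W2": np.random.randn(n_y, n_h) * scale,
        "b2": np.zeros((n_y, 1)),
    }
```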
