- 3 One hidden layer Neural Network
- 3.1&3.2 Neural Networks Overview & Neural Network Representation
- 3.3 Computing a Neural Network’s Output
- 3.4 Vectorizing across multiple examples
- 3.5 Explanation for vectorized implementation
- 3.6 Activation functions
- 3.7 Why do you need non-linear activation functions?
- 3.8 Derivatives of activation functions
- 3.9 Gradient descent for neural networks
- 3.11 Random Initialization
3 One hidden layer Neural Network
Date:2018.3.8
3.1&3.2 Neural Networks Overview & Neural Network Representation
A neural network with a single hidden layer consists of an input layer, a hidden layer, and an output layer.
3.3 Computing a Neural Network’s Output
Like logistic regression, each circle (node) in the diagram really represents two steps of computation: first compute $z$, then compute the activation $a$.
In the hidden layer, unit $i$ computes:
$z_i^{[1]} = w_i^{[1]T} x + b_i^{[1]}, \quad a_i^{[1]} = \sigma(z_i^{[1]})$
Let $W^{[1]}$ stack the row vectors $w_i^{[1]T}$ and $b^{[1]}$ stack the scalars $b_i^{[1]}$, so that:
$z^{[1]} = W^{[1]} x + b^{[1]}, \quad a^{[1]} = \sigma(z^{[1]})$
So the hidden layer's output $a^{[1]}$ feeds the output layer:
$z^{[2]} = W^{[2]} a^{[1]} + b^{[2]}, \quad a^{[2]} = \sigma(z^{[2]}) = \hat{y}$
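The two-step computation above can be sketched in NumPy for a single example. The layer sizes ($n_x = 3$ inputs, $n_h = 4$ hidden units) are illustrative assumptions, not from the notes:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n_x, n_h = 3, 4                                # assumed layer sizes
W1 = rng.standard_normal((n_h, n_x)) * 0.01    # W^[1]: one row per hidden unit
b1 = np.zeros((n_h, 1))                        # b^[1]
W2 = rng.standard_normal((1, n_h)) * 0.01      # W^[2]
b2 = np.zeros((1, 1))                          # b^[2]

x = rng.standard_normal((n_x, 1))              # one input column vector

z1 = W1 @ x + b1        # z^[1] = W^[1] x + b^[1]
a1 = sigmoid(z1)        # a^[1] = sigma(z^[1])
z2 = W2 @ a1 + b2       # z^[2] = W^[2] a^[1] + b^[2]
a2 = sigmoid(z2)        # a^[2] = y-hat, a single probability
```

Each hidden unit's two steps (compute $z$, then the activation) become one row of the matrix products.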
3.4 Vectorizing across multiple examples
Suppose we have $m$ training examples; for $i = 1$ to $m$ we have:
$z^{[1](i)} = W^{[1]} x^{(i)} + b^{[1]}, \quad a^{[1](i)} = \sigma(z^{[1](i)})$
$z^{[2](i)} = W^{[2]} a^{[1](i)} + b^{[2]}, \quad a^{[2](i)} = \sigma(z^{[2](i)})$
Let $X = [x^{(1)}, x^{(2)}, \dots, x^{(m)}] \in \mathbb{R}^{n_x \times m}$ (each column is one example), so we get:
$Z^{[1]} = W^{[1]} X + b^{[1]}, \quad A^{[1]} = \sigma(Z^{[1]})$
$Z^{[2]} = W^{[2]} A^{[1]} + b^{[2]}, \quad A^{[2]} = \sigma(Z^{[2]})$
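A quick sketch of why the vectorized form works (sizes are illustrative): column $i$ of $W^{[1]}X + b^{[1]}$ equals $W^{[1]}x^{(i)} + b^{[1]}$, with $b^{[1]}$ broadcast across columns.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
n_x, n_h, m = 3, 4, 5                      # assumed sizes
W1 = rng.standard_normal((n_h, n_x))
b1 = rng.standard_normal((n_h, 1))
X = rng.standard_normal((n_x, m))          # columns are x^(1) .. x^(m)

# Vectorized: one matrix product handles all m examples; b1 broadcasts.
Z1 = W1 @ X + b1                           # shape (n_h, m)
A1 = sigmoid(Z1)

# Loop version for comparison: column i of Z1 equals W1 @ x^(i) + b1.
for i in range(m):
    z1_i = W1 @ X[:, [i]] + b1
    assert np.allclose(Z1[:, [i]], z1_i)
```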
3.5 Explanation for vectorized implementation
3.6 Activation functions
In the forward-propagation steps of a neural network we need to choose an activation function for each layer.
- Sigmoid function: $f(z) = \dfrac{1}{1 + e^{-z}}$
- tanh function (a shifted, rescaled version of the sigmoid): $f(z) = \dfrac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$

For hidden layers the tanh function is almost always strictly superior to the sigmoid. The exception is binary classification, where you might use the sigmoid activation function for the output layer.

One downside of both the sigmoid and tanh functions is that when $z$ is very large or very small, the slope (derivative) of the function becomes very small, and this can slow down gradient descent.

- ReLU function: $f(z) = \max(0, z)$
- Leaky ReLU function: $f(z) = \max(0.01z, z)$
Some rules of thumb for choosing activation functions:
If your output is a 0/1 value (binary classification), then the sigmoid activation function is very natural for the output layer. In hidden layers we usually use the ReLU or tanh activation function; in practice, with ReLU your neural network will often learn much faster.
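The four activation functions from this section, written as vectorized NumPy one-liners (the sample $z$ values are illustrative):

```python
import numpy as np

def sigmoid(z):    return 1.0 / (1.0 + np.exp(-z))   # squashes to (0, 1)
def tanh(z):       return np.tanh(z)                  # squashes to (-1, 1), zero-centered
def relu(z):       return np.maximum(0.0, z)          # 0 for z < 0, identity otherwise
def leaky_relu(z): return np.maximum(0.01 * z, z)     # slope 0.01 instead of a flat 0

z = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
# sigmoid and tanh saturate (slope ~ 0) at the extremes of this range,
# while ReLU keeps a slope of exactly 1 for all positive z.
outs = {f.__name__: f(z) for f in (sigmoid, tanh, relu, leaky_relu)}
```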
3.7 Why do you need non-linear activation functions?
If you use a linear (identity) activation function $g(z) = z$, then:
$a^{[1]} = W^{[1]} x + b^{[1]}$
$a^{[2]} = W^{[2]} a^{[1]} + b^{[2]} = (W^{[2]} W^{[1]}) x + (W^{[2]} b^{[1]} + b^{[2]})$
So the neural network is just outputting a linear function of the input, and a linear hidden layer is more or less useless. About the only place a linear activation function makes sense is the output layer (e.g., for regression).
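The collapse of stacked linear layers into a single linear map can be checked numerically (sizes are illustrative):

```python
import numpy as np

# With identity activations, two layers collapse into one:
# a2 = W2 (W1 x + b1) + b2 = (W2 W1) x + (W2 b1 + b2).
rng = np.random.default_rng(2)
W1 = rng.standard_normal((4, 3)); b1 = rng.standard_normal((4, 1))
W2 = rng.standard_normal((1, 4)); b2 = rng.standard_normal((1, 1))
x  = rng.standard_normal((3, 1))

two_layer = W2 @ (W1 @ x + b1) + b2
one_layer = (W2 @ W1) @ x + (W2 @ b1 + b2)
assert np.allclose(two_layer, one_layer)   # identical, layer count notwithstanding
```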
Date:2018.3.9
3.8 Derivatives of activation functions
- Sigmoid function: $f'(z) = f(z)\,(1 - f(z))$
- Tanh function: $f'(z) = 1 - (\tanh(z))^2$
- ReLU function: $f'(z) = \begin{cases} 0 & z < 0 \\ 1 & z \ge 0 \end{cases}$
- Leaky ReLU function: $f'(z) = \begin{cases} 0.01 & z < 0 \\ 1 & z \ge 0 \end{cases}$
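These closed-form derivatives can be sanity-checked against central finite differences, staying away from $z = 0$ where ReLU is not differentiable (the test points are arbitrary):

```python
import numpy as np

def sigmoid(z): return 1.0 / (1.0 + np.exp(-z))
def relu(z):    return np.maximum(0.0, z)

z = np.array([-2.0, -0.5, 0.5, 2.0])   # no point at exactly 0
eps = 1e-6

# sigmoid'(z) = sigmoid(z) * (1 - sigmoid(z))
num = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
assert np.allclose(num, sigmoid(z) * (1 - sigmoid(z)), atol=1e-6)

# tanh'(z) = 1 - tanh(z)^2
num = (np.tanh(z + eps) - np.tanh(z - eps)) / (2 * eps)
assert np.allclose(num, 1 - np.tanh(z) ** 2, atol=1e-6)

# relu'(z) = 0 for z < 0, 1 for z > 0
num = (relu(z + eps) - relu(z - eps)) / (2 * eps)
assert np.allclose(num, (z >= 0).astype(float), atol=1e-6)
```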
3.9 Gradient descent for neural networks
Parameters: $W^{[1]}, b^{[1]}, W^{[2]}, b^{[2]}$
Cost function: $J(W^{[1]}, b^{[1]}, W^{[2]}, b^{[2]}) = \frac{1}{m} \sum_{i=1}^{m} \mathcal{L}(\hat{y}^{(i)}, y^{(i)})$
Gradient descent:
Repeat {
    Compute predictions $\hat{y}^{(i)},\ i = 1, \dots, m$
    Compute $dW^{[1]} = \frac{\partial J}{\partial W^{[1]}},\ db^{[1]} = \frac{\partial J}{\partial b^{[1]}},\ dW^{[2]} = \frac{\partial J}{\partial W^{[2]}},\ db^{[2]} = \frac{\partial J}{\partial b^{[2]}}$
    Update $W^{[1]} := W^{[1]} - \alpha\,dW^{[1]},\ b^{[1]} := b^{[1]} - \alpha\,db^{[1]},\ W^{[2]} := W^{[2]} - \alpha\,dW^{[2]},\ b^{[2]} := b^{[2]} - \alpha\,db^{[2]}$
}
So this would be one iteration of gradient descent, and then you repeat it some number of times until your parameters look like they are converging. The key is knowing how to compute these partial derivative terms.
Forward propagation & Back propagation
Forward propagation (vectorized):
$Z^{[1]} = W^{[1]} X + b^{[1]}, \quad A^{[1]} = g^{[1]}(Z^{[1]})$
$Z^{[2]} = W^{[2]} A^{[1]} + b^{[2]}, \quad A^{[2]} = \sigma(Z^{[2]})$
Back propagation:
$dZ^{[2]} = A^{[2]} - Y$
$dW^{[2]} = \frac{1}{m} dZ^{[2]} A^{[1]T}, \quad db^{[2]} = \frac{1}{m} \sum_{i} dZ^{[2](i)}$
$dZ^{[1]} = W^{[2]T} dZ^{[2]} * g^{[1]\prime}(Z^{[1]})$
$dW^{[1]} = \frac{1}{m} dZ^{[1]} X^{T}, \quad db^{[1]} = \frac{1}{m} \sum_{i} dZ^{[1](i)}$
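One full iteration of gradient descent, sketched in NumPy for a tanh hidden layer with a sigmoid output and cross-entropy loss. The layer sizes, data, and learning rate are illustrative assumptions:

```python
import numpy as np

def sigmoid(z): return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(3)
n_x, n_h, m, alpha = 2, 4, 8, 0.1               # assumed sizes and learning rate
X = rng.standard_normal((n_x, m))
Y = (rng.random((1, m)) > 0.5).astype(float)     # synthetic 0/1 labels

W1 = rng.standard_normal((n_h, n_x)) * 0.01; b1 = np.zeros((n_h, 1))
W2 = rng.standard_normal((1, n_h)) * 0.01;   b2 = np.zeros((1, 1))

# Forward propagation
Z1 = W1 @ X + b1; A1 = np.tanh(Z1)
Z2 = W2 @ A1 + b2; A2 = sigmoid(Z2)

# Back propagation
dZ2 = A2 - Y
dW2 = (dZ2 @ A1.T) / m
db2 = np.sum(dZ2, axis=1, keepdims=True) / m
dZ1 = (W2.T @ dZ2) * (1 - A1 ** 2)      # tanh'(z) = 1 - tanh(z)^2
dW1 = (dZ1 @ X.T) / m
db1 = np.sum(dZ1, axis=1, keepdims=True) / m

# Parameter update (one iteration of gradient descent)
W1 -= alpha * dW1; b1 -= alpha * db1
W2 -= alpha * dW2; b2 -= alpha * db2
```

Each gradient has the same shape as the parameter it updates, which is a useful debugging check in practice.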
3.11 Random Initialization
If you initialize the weights to zero, then the hidden units are completely identical, or as we sometimes say, completely symmetric: they compute exactly the same function. No matter how long you train your neural network, the hidden units will still be computing exactly the same function. So initialize your parameters randomly.
w1 = np.random.randn(2, 2) * 0.01   # small random values break the symmetry
b1 = np.zeros((2, 1))               # biases can safely start at zero
We usually prefer to initialize the weights to very small random values. If the weights are too large, then when you compute the activations, $z$ will be very large or very small; with sigmoid or tanh that lands in the saturated region where the slope of the function is tiny, slowing down learning.
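The symmetry problem can be demonstrated directly: with zero weights, every hidden unit produces the same activation on every example (sizes and data below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.standard_normal((3, 5))          # 5 examples, 3 features each

# Zero initialization: all four hidden units compute the same function.
W1 = np.zeros((4, 3)); b1 = np.zeros((4, 1))
A1 = np.tanh(W1 @ X + b1)
assert np.allclose(A1, A1[0])            # every row is identical

# Small random initialization breaks the symmetry.
W1 = rng.standard_normal((4, 3)) * 0.01
A1 = np.tanh(W1 @ X + b1)
assert not np.allclose(A1, A1[0])        # rows now differ
```

Because identical units also receive identical gradients, the zero-initialized rows would stay identical through every gradient-descent step.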