Introduction to Activation Functions

Many people have questions like: What is an activation function? Why are there so many different activation functions? In this post, I will introduce the activation functions commonly used in neural networks.
Convolutional neural networks consist of many neurons stacked in layers. For simplicity, let's consider a single neuron.

What is an Activation Function?

So what does a simple artificial neuron do? The operations a neuron performs are essentially multiplications and summations, which are linear, and it then produces an output.
First, it computes a weighted sum of its inputs and adds a bias. On its own, this means the neuron can only represent linear functions.
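As a minimal sketch of this linear step (the input, weights, and bias below are made-up values, purely for illustration):

```python
import numpy as np

# Hypothetical inputs, weights, and bias for a single neuron
x = np.array([0.5, -1.2, 3.0])   # input vector
w = np.array([0.8, 0.1, -0.4])   # weights
b = 0.2                          # bias

# The neuron first computes a weighted sum of its inputs and adds a bias.
# Without an activation function, this is a purely linear (affine) mapping.
z = np.dot(w, x) + b
print(z)
```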

Essentially, a neural network fits the data by continually updating its parameters.
Therefore, there has to be a way to add a non-linear property to this computation, so that neural networks can approximate non-linear, complex functions.

It is the activation function that introduces this non-linearity into the model, and this is what makes it possible for deep learning models to fit data well.
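To see why the non-linearity matters, here is a small sketch (random toy weights, no real data): stacking linear layers without an activation collapses into a single linear layer, while inserting a non-linearity does not.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)

# Two hypothetical linear layers (weights chosen at random for illustration)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)

# Stacking linear layers without an activation is still linear:
# W2 @ (W1 @ x + b1) + b2 == (W2 @ W1) @ x + (W2 @ b1 + b2)
deep_linear = W2 @ (W1 @ x + b1) + b2
single_linear = (W2 @ W1) @ x + (W2 @ b1 + b2)
print(np.allclose(deep_linear, single_linear))  # True

# Inserting a non-linearity (here tanh) between the layers breaks this
# collapse, which is what lets deep networks fit non-linear functions.
nonlinear = W2 @ np.tanh(W1 @ x + b1) + b2
```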

Common activation functions
1. Sigmoid
The sigmoid function squashes its input into the range (0, 1), which keeps the signals passed between layers bounded. However, it has three drawbacks.
1. Saturated neurons "kill" the gradients. The derivative of the sigmoid is at most 1/4, so during backpropagation the gradient is scaled down by a factor of at least 4 at every sigmoid layer, and by far more when the neuron is saturated. The deeper the network, the more likely the gradient is to vanish (see the sketch after this list).

2. The second problem is that the sigmoid's outputs are not zero-centered. If the inputs to a neuron are all positive, then the gradients flowing back onto the weights W are either all positive or all negative, which forces zig-zagging updates and makes gradient descent inefficient.
3. Another disadvantage is that the exponential in the sigmoid makes both the forward pass and the gradient computation relatively expensive.
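A small NumPy sketch of the sigmoid and its derivative makes the saturation point concrete: the derivative peaks at 0.25, so each sigmoid layer scales the backpropagated gradient by at most 1/4, and by far less in the saturated tails.

```python
import numpy as np

def sigmoid(x):
    # Squashes any real input into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # d/dx sigmoid(x) = sigmoid(x) * (1 - sigmoid(x)), at most 0.25 (at x = 0)
    s = sigmoid(x)
    return s * (1.0 - s)

print(sigmoid_grad(0.0))    # 0.25, the maximum possible gradient
print(sigmoid_grad(10.0))   # ~4.5e-05: saturated, gradient nearly "killed"
```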

2. Tanh
Compared to the sigmoid function, tanh squashes its input into the range [-1, 1] and its outputs are zero-centered.
Because the output range is larger, the gradient shrinks more slowly during backpropagation, which alleviates vanishing gradients. However, the gradient still decays as it propagates, so the vanishing-gradient problem is not actually solved.
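For comparison, a short NumPy sketch: tanh is zero-centered and its derivative peaks at 1 rather than 0.25, but it still saturates for large |x|.

```python
import numpy as np

def tanh_grad(x):
    # d/dx tanh(x) = 1 - tanh(x)^2, at most 1 (at x = 0)
    return 1.0 - np.tanh(x) ** 2

print(np.tanh(0.0))      # 0.0 -> outputs are zero-centered
print(tanh_grad(0.0))    # 1.0, a larger peak gradient than the sigmoid's 0.25
print(tanh_grad(10.0))   # ~8.2e-09: still saturates, so gradients can still vanish
```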

3. ReLU
Currently, the most successful and widely used activation function is the Rectified Linear Unit (ReLU), thanks to its simplicity and effectiveness.

In practice, deep networks with ReLUs are easier to optimize and converge much faster than networks with sigmoid or tanh units, because the gradient flows without saturating whenever the input to the ReLU is positive. What's more, the ReLU is very cheap to compute.

Although ReLU largely solves the vanishing-gradient problem, a bad weight initialization or a large learning rate can still lead to exploding gradients. Moreover, after a bad parameter update some neurons can become "dead", meaning they are never activated again. Finally, the output of the ReLU is also not zero-centered.
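A minimal NumPy sketch of ReLU and its gradient, illustrating both why gradients do not saturate for positive inputs and how a neuron whose pre-activation is always negative receives zero gradient and stops updating (a "dead" neuron):

```python
import numpy as np

def relu(x):
    # f(x) = max(0, x): cheap to compute, no exponentials
    return np.maximum(0.0, x)

def relu_grad(x):
    # Gradient is exactly 1 for x > 0 (no saturation) and 0 for x <= 0
    return (x > 0).astype(float)

z = np.array([-3.0, -0.5, 0.0, 2.0, 7.0])
print(relu(z))        # [0.  0.  0.  2.  7.]
print(relu_grad(z))   # [0.  0.  0.  1.  1.]

# If a neuron's pre-activation is negative for every input (for example after a
# bad update with a large learning rate), its gradient is always 0 and its
# weights never change again: the neuron is "dead".
```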

4. Leaky ReLU
To address the "dead neuron" problem, Leaky ReLU introduces a hyperparameter α: when the input is negative, the output becomes αx instead of 0, so some of the signal (and its gradient) is retained and the output distribution is less skewed. This parameter is usually set to a small empirical value.
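A sketch of Leaky ReLU with a typical small slope (α = 0.01 here is purely an illustrative choice): negative inputs keep a small non-zero gradient, so the neuron can recover.

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # Same as ReLU for x > 0, but lets a small fraction of negative values through
    return np.where(x > 0, x, alpha * x)

def leaky_relu_grad(x, alpha=0.01):
    # Gradient is 1 for x > 0 and alpha (not 0) for x <= 0, so neurons cannot "die"
    return np.where(x > 0, 1.0, alpha)

z = np.array([-3.0, 2.0])
print(leaky_relu(z))       # [-0.03  2.  ]
print(leaky_relu_grad(z))  # [0.01 1.  ]
```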

5. Randomized ReLU
Compared with Leaky ReLU, the slope α in Randomized ReLU is sampled from a random (typically uniform) distribution during training and then fixed, usually to the mean of that distribution, at test time.
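A rough sketch of the idea; the uniform range [1/8, 1/3] below is only a commonly cited default and should be treated as an assumption:

```python
import numpy as np

rng = np.random.default_rng(0)

def rrelu(x, lower=1/8, upper=1/3, training=True):
    if training:
        # During training, alpha is sampled from a uniform distribution
        alpha = rng.uniform(lower, upper, size=x.shape)
    else:
        # At test time, alpha is fixed to the mean of that distribution
        alpha = (lower + upper) / 2.0
    return np.where(x > 0, x, alpha * x)

z = np.array([-2.0, 3.0])
print(rrelu(z, training=True))   # negative part scaled by a random alpha
print(rrelu(z, training=False))  # negative part scaled by the fixed mean alpha
```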

6. Parametric ReLU
Parametric ReLU often converges faster than ReLU and Leaky ReLU because the slope α is a learnable parameter that is updated during training.
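A conceptual sketch of PReLU: α has its own gradient and is updated together with the weights. The single gradient-descent step below uses a toy pre-activation and learning rate chosen only for illustration; the initial value 0.25 follows the common PReLU initialization.

```python
import numpy as np

def prelu(x, a):
    # Like Leaky ReLU, but the slope a on the negative side is learned
    return np.where(x > 0, x, a * x)

alpha = 0.25   # initial slope
lr = 0.1       # toy learning rate

# Toy backward step for a single negative pre-activation z with upstream gradient g:
# d prelu(z)/d alpha = z for z <= 0, and 0 for z > 0
z, g = -2.0, 1.0
grad_alpha = g * (z if z <= 0 else 0.0)
alpha -= lr * grad_alpha   # alpha is updated by gradient descent like any weight

print(alpha)                                # 0.45 after this single toy step
print(prelu(np.array([-2.0, 3.0]), alpha))  # negative side now uses the updated slope
```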

7. Exponential Linear Unit (ELU)
The Exponential Linear Unit keeps all the benefits of ReLU while producing outputs that are closer to zero-centered. Moreover, unlike Leaky ReLU, its saturation on the negative side adds some robustness to noise. The downside is that the exponential makes it somewhat more expensive to compute.
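A sketch of ELU with the commonly used α = 1 (an illustrative choice): negative inputs saturate smoothly toward -α instead of being cut to zero, at the cost of an exponential.

```python
import numpy as np

def elu(x, alpha=1.0):
    # x for x > 0; alpha * (exp(x) - 1) for x <= 0, saturating toward -alpha
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

z = np.array([-5.0, -1.0, 0.0, 2.0])
print(elu(z))  # approximately [-0.993 -0.632  0.     2.   ]
```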

Activation function table
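For reference, the standard definitions of the functions discussed above are (α denotes the negative-side slope or scale hyperparameter):

$$
\begin{array}{ll}
\text{Sigmoid} & \sigma(x) = \dfrac{1}{1 + e^{-x}} \\[2mm]
\text{Tanh} & \tanh(x) = \dfrac{e^{x} - e^{-x}}{e^{x} + e^{-x}} \\[2mm]
\text{ReLU} & f(x) = \max(0, x) \\[2mm]
\text{Leaky / Randomized / Parametric ReLU} & f(x) = \begin{cases} x, & x > 0 \\ \alpha x, & x \le 0 \end{cases} \\[2mm]
\text{ELU} & f(x) = \begin{cases} x, & x > 0 \\ \alpha\,(e^{x} - 1), & x \le 0 \end{cases}
\end{array}
$$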
