One hidden layer Neural Network - Activation functions

These are my notes from studying section 3.6, "Activation functions", of Andrew Ng's Coursera course "Neural Networks & Deep Learning". It covers the different types of activation functions used in NNs and the rules of thumb for choosing among them in practice. Sharing it here in the hope that it helps!
When building a NN, one of the choices you have to make is which activation functions to use in the hidden layers as well as in the output layer. Besides the sigmoid activation function, other choices sometimes work much better.

  • sigmoid function

It's defined as:

\sigma(z)=\frac{1}{1+e^{-z}}

Its plot looks like:

figure-1
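
To make the definition concrete, here's a minimal NumPy sketch of the sigmoid function and its derivative (the helper names below are mine, not from the course):

```python
import numpy as np

def sigmoid(z):
    # sigma(z) = 1 / (1 + e^{-z}); squashes any real z into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(z):
    # d/dz sigma(z) = sigma(z) * (1 - sigma(z)); close to 0 when |z| is large
    s = sigmoid(z)
    return s * (1.0 - s)

z = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print(sigmoid(z))             # values in (0, 1), 0.5 at z = 0
print(sigmoid_derivative(z))  # near 0 at both extremes
```
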
  • tanh function

Also called the hyperbolic tangent function. It's defined as:

tanh(z)=\frac{e^{z}-e^{-z}}{e^{z}+e^{-z}}

Its plot looks like:

figure-2

It's actually a scaled and shifted version of the sigmoid function (\tanh(z)=2\sigma(2z)-1): it passes through the point (0,0) and its output lies between -1 and 1.

It turns out that if you replace the sigmoid function in the hidden layer with the tanh function, it almost always works better. This is because the mean of the activations coming out of your hidden layer is closer to 0. When you train a learning algorithm, you sometimes center the data so that it has zero mean; using tanh instead of sigmoid has a similar effect, which makes learning for the next layer a little bit easier. One exception is the output layer, where we still use the sigmoid function: since y\in \left \{ 0,1 \right \}, it makes sense to have 0\leq \hat{y}\leq 1
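
To see the centering effect in action, here's a small sketch (the random data, layer size, and initialization scale are arbitrary assumptions, not from the course) comparing the mean hidden-layer activation under tanh and sigmoid:

```python
import numpy as np

np.random.seed(0)
X = np.random.randn(3, 100)        # 3 input features, 100 examples (arbitrary)
W1 = np.random.randn(4, 3) * 0.01  # 4 hidden units, small random initialization
b1 = np.zeros((4, 1))

Z1 = W1 @ X + b1
A_tanh = np.tanh(Z1)
A_sigmoid = 1.0 / (1.0 + np.exp(-Z1))

print(A_tanh.mean())     # roughly 0: tanh activations are centered
print(A_sigmoid.mean())  # roughly 0.5: sigmoid activations are not
```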

  •  ReLU

One downside of both the sigmoid function and the tanh function is that when z is very large or very small, the gradient (slope) of the function is close to 0. This can slow down gradient descent.

ReLU (Rectified Linear Unit) is one of the popular choices of activation function that mitigates this issue, and it's defined as below:

ReLU(z)=\max(0,z)

Its plot looks like this:

figure-4

Note that for half of the range of z, the slope of ReLU is 0. But in practice, enough of the hidden units will have z greater than 0, so learning can still be quite fast.
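
Here's a minimal NumPy sketch of ReLU and the slope used during backpropagation (taking the slope at z = 0 to be 0 is a common convention, my assumption rather than something stated in the notes):

```python
import numpy as np

def relu(z):
    # ReLU(z) = max(0, z), applied element-wise
    return np.maximum(0, z)

def relu_grad(z):
    # Slope is 1 for z > 0 and 0 for z <= 0 (convention at z = 0)
    return (z > 0).astype(float)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(z))       # [0.  0.  0.  0.5 2. ]
print(relu_grad(z))  # [0.  0.  0.  1.  1. ]
```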

  • Leaky ReLU

One disadvantage of ReLU is that its derivative is equal to 0 when z is negative. A variant that addresses this is called Leaky ReLU. It usually works better than ReLU, although it's just not used as much in practice.

Here's the definition:

LeakyReLU(z)=\max(0.01z,z)

The plot looks like:
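
A corresponding sketch for Leaky ReLU; the 0.01 slope matches the definition above, though many implementations expose it as a tunable parameter:

```python
import numpy as np

def leaky_relu(z, alpha=0.01):
    # LeakyReLU(z) = max(alpha * z, z): small non-zero slope for z < 0
    return np.maximum(alpha * z, z)

def leaky_relu_grad(z, alpha=0.01):
    # Slope is 1 for z > 0 and alpha otherwise, so the gradient never dies completely
    return np.where(z > 0, 1.0, alpha)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(leaky_relu(z))       # [-0.02  -0.005  0.     0.5    2.   ]
print(leaky_relu_grad(z))  # [0.01   0.01    0.01   1.     1.   ]
```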

Rules of thumb for choosing activation functions:

  • For binary classification, the sigmoid function is very natural for the output layer.
  • For all other units in the hidden layers, ReLU is increasingly the default choice of activation function
  • If you're not sure which activation function works best, try them all and evaluate them on a validation set (see the sketch below)
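
As a concrete illustration of the last point, here's a sketch that trains the same small network with a few different hidden-layer activations and scores each on a held-out validation set. It uses scikit-learn's MLPClassifier and a toy dataset purely for convenience; neither is part of the course, which builds networks from scratch in NumPy.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Toy binary-classification data, split into training and validation sets
X, y = make_moons(n_samples=1000, noise=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0)

# 'logistic' is scikit-learn's name for the sigmoid activation;
# Leaky ReLU isn't offered, so only these three are compared.
for activation in ["logistic", "tanh", "relu"]:
    clf = MLPClassifier(hidden_layer_sizes=(10,), activation=activation,
                        max_iter=2000, random_state=0)
    clf.fit(X_train, y_train)
    print(activation, clf.score(X_val, y_val))  # validation accuracy
```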

