Improving the way neural networks learn


Why is learning slow with a sigmoid neuron and the quadratic cost function?

The quadratic cost function is given by

C = \frac{(y-a)^2}{2} \tag{1}

where a is the neuron's output, a = σ(z), with z = wx + b. Using the chain rule to differentiate with respect to the weight and bias we get
\frac{\partial C}{\partial w} = (a-y)\,\sigma'(z)\,x = a\,\sigma'(z) \tag{2}

\frac{\partial C}{\partial b} = (a-y)\,\sigma'(z) = a\,\sigma'(z) \tag{3}

where I have substituted x = 1 and y = 0.
Recall the shape of the σ function:
[Figure: plot of the sigmoid function σ(z)]
We can see from this graph that when the neuron's output is close to 1, the curve gets very flat, and so σ'(z) gets very small. Equations (2) and (3) then tell us that ∂C/∂w and ∂C/∂b get very small.
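To make the slowdown concrete, here is a minimal numerical sketch (my own, not from the original text) of Equations (2) and (3) for a single sigmoid neuron with training input x = 1 and desired output y = 0; the starting weights and biases are arbitrary, and the second pair is chosen so the neuron starts out badly saturated:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    return sigmoid(z) * (1.0 - sigmoid(z))

# Single sigmoid neuron, training input x = 1, desired output y = 0,
# quadratic cost C = (y - a)^2 / 2.
x, y = 1.0, 0.0

for w, b in [(0.6, 0.9), (2.0, 2.0)]:   # the second pair starts badly saturated
    z = w * x + b
    a = sigmoid(z)
    dC_dw = (a - y) * sigmoid_prime(z) * x   # Equation (2)
    dC_db = (a - y) * sigmoid_prime(z)       # Equation (3)
    print(f"a = {a:.3f}, dC/dw = {dC_dw:.4f}, dC/db = {dC_db:.4f}")
```

The saturated neuron has the larger output error, yet its gradients are roughly an order of magnitude smaller, which is exactly the learning slowdown described above.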

Using the quadratic cost when we have linear neurons in the output layer. Suppose that we have a many-layer multi-neuron network. Suppose all the neurons in the final layer are linear neurons, meaning that the sigmoid activation function is not applied, and the outputs are simply a^L_j = z^L_j. Show that if we use the quadratic cost function then the output error δ^L for a single training example x is given by

\delta^L = a^L - y

Then use this expression to show that the partial derivatives with respect to the weights and biases in the output layer are given by

\frac{\partial C}{\partial w^L_{jk}} = \frac{1}{n} \sum_x a^{L-1}_k \left( a^L_j - y_j \right), \qquad
\frac{\partial C}{\partial b^L_j} = \frac{1}{n} \sum_x \left( a^L_j - y_j \right).

This shows that if the output neurons are linear neurons then the quadratic cost will not give rise to any problems with a learning slowdown. In this case the quadratic cost is, in fact, an appropriate cost function to use.
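A sketch of the argument (my own working, using the standard backpropagation definition δ^L_j = ∂C/∂z^L_j): for a single training example the quadratic cost is C = \frac{1}{2} \sum_j (y_j - a^L_j)^2, and with linear output neurons a^L_j = z^L_j, so

\delta^L_j = \frac{\partial C}{\partial z^L_j} = \frac{\partial C}{\partial a^L_j} = a^L_j - y_j, \qquad \text{i.e. } \delta^L = a^L - y.

Averaging over the n training inputs and using \partial z^L_j / \partial w^L_{jk} = a^{L-1}_k and \partial z^L_j / \partial b^L_j = 1 then gives the two partial derivatives above. Since no σ'(z) factor ever appears, there is nothing to shrink the gradients when the outputs saturate.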

Sigmoid + cross-entropy cost function

The cross-entropy cost function is defined by

C = -\frac{1}{n} \sum_x \left[ y \ln a + (1-y) \ln(1-a) \right] \tag{4}

where n is the total number of items of training data, the sum is over all training inputs, x, and y is the corresponding desired output.
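As a quick illustration, here is a small sketch (my own; the training data below are made-up values) that computes the cost (4) for a single sigmoid neuron with one weight:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy_cost(w, b, xs, ys):
    """Equation (4) for a single sigmoid neuron: average over the training inputs."""
    a = sigmoid(w * xs + b)
    return -np.mean(ys * np.log(a) + (1.0 - ys) * np.log(1.0 - a))

# Hypothetical toy training set: inputs xs with binary desired outputs ys.
xs = np.array([0.2, 0.7, 1.0, 1.5])
ys = np.array([0.0, 1.0, 1.0, 0.0])
print(cross_entropy_cost(w=0.5, b=-0.3, xs=xs, ys=ys))
```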

To compute the partial derivative of the cross-entropy cost with respect to the weights, we substitute a = σ(z) into (4) and apply the chain rule twice, obtaining:

\frac{\partial C}{\partial w_j} = -\frac{1}{n} \sum_x \left( \frac{y}{\sigma(z)} - \frac{1-y}{1-\sigma(z)} \right) \frac{\partial \sigma}{\partial w_j} \tag{5}

= -\frac{1}{n} \sum_x \left( \frac{y}{\sigma(z)} - \frac{1-y}{1-\sigma(z)} \right) \sigma'(z)\, x_j. \tag{6}

Putting everything over a common denominator and simplifying this becomes:
\frac{\partial C}{\partial w_j} = \frac{1}{n} \sum_x \frac{\sigma'(z)\, x_j}{\sigma(z)\left(1-\sigma(z)\right)} \left( \sigma(z) - y \right). \tag{7}

Using the definition of the sigmoid function, σ(z) = 1/(1 + e^{-z}), and a little algebra we can show that σ'(z) = σ(z)(1 − σ(z)).
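Spelling out that bit of algebra:

\sigma'(z) = \frac{d}{dz}\, \frac{1}{1+e^{-z}} = \frac{e^{-z}}{(1+e^{-z})^2} = \frac{1}{1+e^{-z}} \cdot \frac{e^{-z}}{1+e^{-z}} = \sigma(z)\left(1-\sigma(z)\right),

since e^{-z}/(1+e^{-z}) = 1 − 1/(1+e^{-z}) = 1 − σ(z).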
We see that the σ'(z) and σ(z)(1 − σ(z)) terms cancel in Equation (7), and it simplifies to become:
\frac{\partial C}{\partial w_j} = \frac{1}{n} \sum_x x_j \left( \sigma(z) - y \right). \tag{8}

This is a beautiful expression. It tells us that the rate at which the weight learns is controlled by σ(z) − y, i.e., by the error in the output. The larger the error, the faster the neuron will learn. In particular, it avoids the learning slowdown caused by the σ'(z) term in the analogous equation for the quadratic cost, Equation (2).

In a similar way, we can compute the partial derivative for the bias.

\frac{\partial C}{\partial b} = \frac{1}{n} \sum_x \left( \sigma(z) - y \right). \tag{9}
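Equations (8) and (9) are easy to check numerically. The sketch below (my own, reusing the toy single-neuron setup with made-up data) compares the analytic gradients, which contain no σ'(z) factor, against central-difference estimates of the cost:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical toy training set for a single sigmoid neuron with one weight.
xs = np.array([0.2, 0.7, 1.0, 1.5])
ys = np.array([0.0, 1.0, 1.0, 0.0])
w, b = 0.5, -0.3

def cost(w, b):
    a = sigmoid(w * xs + b)
    return -np.mean(ys * np.log(a) + (1.0 - ys) * np.log(1.0 - a))

# Analytic gradients from Equations (8) and (9).
a = sigmoid(w * xs + b)
dC_dw = np.mean(xs * (a - ys))
dC_db = np.mean(a - ys)

# Numerical check by central differences.
eps = 1e-6
num_dw = (cost(w + eps, b) - cost(w - eps, b)) / (2 * eps)
num_db = (cost(w, b + eps) - cost(w, b - eps)) / (2 * eps)
print(dC_dw, num_dw)  # the two values should agree to many decimal places
print(dC_db, num_db)
```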

It’s easy to generalize the cross-entropy to many-neuron multi-layer networks. In particular, suppose y = y_1, y_2, … are the desired values at the output neurons, i.e., the neurons in the final layer, while a^L_1, a^L_2, … are the actual output values. Then we define the cross-entropy by

C = -\frac{1}{n} \sum_x \sum_j \left[ y_j \ln a^L_j + (1-y_j) \ln\left(1-a^L_j\right) \right].
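In code, this many-neuron cost might look like the following sketch (my own, not taken from any particular library); np.nan_to_num guards against log(0) when an activation saturates at exactly 0 or 1:

```python
import numpy as np

def cross_entropy_cost(aL, y):
    """Cross-entropy for one training example: aL and y are vectors of
    output activations and desired outputs for the final layer."""
    return np.sum(np.nan_to_num(-y * np.log(aL) - (1.0 - y) * np.log(1.0 - aL)))

def total_cost(outputs, targets):
    """Average the per-example cost over the n training examples."""
    return sum(cross_entropy_cost(aL, y) for aL, y in zip(outputs, targets)) / len(outputs)
```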

Softmax + log-likelihood cost

In a softmax layer we apply the so-called softmax function to the z^L_j. According to this function, the activation a^L_j of the j-th output neuron is

a^L_j = \frac{e^{z^L_j}}{\sum_k e^{z^L_k}}, \tag{10}

where in the denominator we sum over all the output neurons.
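A small sketch of Equation (10) (my own); subtracting the maximum z^L_k before exponentiating is a standard numerical-stability trick and leaves the result unchanged, since the shift cancels in the ratio:

```python
import numpy as np

def softmax(zL):
    """Equation (10): activations of a softmax output layer."""
    exps = np.exp(zL - np.max(zL))   # shift for numerical stability
    return exps / np.sum(exps)

z = np.array([2.0, 1.0, 0.1])
a = softmax(z)
print(a, a.sum())   # activations are positive and sum to 1
```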

The log-likelihood cost:

C \equiv -\ln a^L_y. \tag{11}

The partial derivatives:

\frac{\partial C}{\partial b^L_j} = a^L_j - y_j \tag{12}

\frac{\partial C}{\partial w^L_{jk}} = a^{L-1}_k \left( a^L_j - y_j \right) \tag{13}

These expressions ensure that we will not encounter a learning slowdown. In fact, it’s useful to think of a softmax output layer with log-likelihood cost as being quite similar to a sigmoid output layer with cross-entropy cost.
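Here is a numerical sanity check of Equations (11) and (12) (my own sketch, with arbitrary example numbers). Since z^L_j = \sum_k w^L_{jk} a^{L-1}_k + b^L_j, we have ∂C/∂b^L_j = ∂C/∂z^L_j, so differentiating with respect to the z^L_j is enough:

```python
import numpy as np

def softmax(z):
    exps = np.exp(z - np.max(z))
    return exps / np.sum(exps)

def log_likelihood_cost(z, y_index):
    """Equation (11): C = -ln a^L_y, where y_index is the true class."""
    return -np.log(softmax(z)[y_index])

z = np.array([0.5, 2.0, -1.0])   # arbitrary weighted inputs to the softmax layer
y_index = 1                       # true class
a = softmax(z)
one_hot = np.zeros_like(z)
one_hot[y_index] = 1.0

analytic = a - one_hot            # Equation (12): a^L_j - y_j
eps = 1e-6
numeric = np.array([
    (log_likelihood_cost(z + eps * np.eye(len(z))[j], y_index)
     - log_likelihood_cost(z - eps * np.eye(len(z))[j], y_index)) / (2 * eps)
    for j in range(len(z))
])
print(analytic)
print(numeric)   # should agree with analytic to roughly 1e-9
```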

Given this similarity, should you use a sigmoid output layer and cross-entropy, or a softmax output layer and log-likelihood? In fact, in many situations both approaches work well. As a more general point of principle, softmax plus log-likelihood is worth using whenever you want to interpret the output activations as probabilities. That’s not always a concern, but can be useful with classification problems (like MNIST) involving disjoint classes.

Overfitting

In general, one of the best ways of reducing overfitting is to increase the size of the training data. With enough training data it is difficult for even a very large network to overfit.
