I’m currently working on machine learning applications, and a problem that is rarely mentioned in papers but occurs frequently in practice is numerical overflow in the sigmoid (aka logistic) function and in its big sister, softmax.
Sigmoid
As a reminder: $\sigma(x) = \frac{1}{1 + \exp(-x)}$
Its derivative: $\frac{d}{dx}\sigma(x) = (1 - \sigma(x))\,\sigma(x)$
The problem here is $\exp$, which quickly goes to infinity, even though the result of $\sigma$ is restricted to the interval $[0, 1]$. The solution: the sigmoid can be expressed in terms of $\tanh$: $\sigma(x) = \frac{1}{2}\left(1 + \tanh\left(\frac{x}{2}\right)\right)$.
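To make the tanh form concrete, here is a minimal NumPy sketch (the function names `sigmoid_naive` and `sigmoid_stable` are my own, not from any particular library). The naive version triggers overflow warnings for large negative inputs, while the tanh version simply saturates:

```python
import numpy as np

def sigmoid_naive(x):
    # Direct formula: exp(-x) overflows for large negative x.
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_stable(x):
    # Equivalent tanh form: tanh saturates at +/-1 instead of overflowing.
    return 0.5 * (1.0 + np.tanh(0.5 * x))

x = np.array([-1000.0, 0.0, 1000.0])
print(sigmoid_naive(x))   # RuntimeWarning: overflow encountered in exp
print(sigmoid_stable(x))  # [0.  0.5 1. ]
```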
Softmax
Softmax, which is defined as $\mathrm{softmax}_i(a) = \frac{\exp(a_i)}{\sum_j \exp(a_j)}$ (where $a$ is a vector), is a little more complicated. The key here is to express softmax in terms of the logsumexp function, $\mathrm{logsumexp}(a) = \log\left(\sum_i \exp(a_i)\right)$, for which good, non-overflowing implementations are usually available.
Then, we have $\mathrm{softmax}(a) = \exp(a - \mathrm{logsumexp}(a))$.
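As a sketch, assuming SciPy is available (`scipy.special.logsumexp` is one such stable implementation; internally it shifts the inputs by $\max(a)$ before exponentiating):

```python
import numpy as np
from scipy.special import logsumexp

def softmax_naive(a):
    # exp(a_i) overflows once any a_i exceeds roughly 709 for float64.
    e = np.exp(a)
    return e / e.sum()

def softmax_stable(a):
    # softmax(a) = exp(a - logsumexp(a)); the shift keeps all exponents <= 0.
    return np.exp(a - logsumexp(a))

a = np.array([1000.0, 1001.0, 1002.0])
print(softmax_naive(a))   # [nan nan nan] plus overflow warnings
print(softmax_stable(a))  # [0.09003057 0.24472847 0.66524096]
```

If SciPy is not at hand, logsumexp can be hand-rolled the same way: $\mathrm{logsumexp}(a) = \max(a) + \log\left(\sum_i \exp(a_i - \max(a))\right)$.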
As a bonus: the partial derivatives of softmax are analogous to the sigmoid's derivative; the diagonal entries of its Jacobian are $\frac{\partial}{\partial a_i}\mathrm{softmax}_i(a) = (1 - \mathrm{softmax}_i(a))\,\mathrm{softmax}_i(a)$.
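As a quick sanity check of that formula (a sketch reusing the stable softmax from above; `softmax_diag_grad` is a hypothetical helper, not a library function), the diagonal entries of the softmax Jacobian can be compared against a central finite difference:

```python
import numpy as np
from scipy.special import logsumexp

def softmax_stable(a):
    # Stable softmax via the logsumexp shift, as above.
    return np.exp(a - logsumexp(a))

def softmax_diag_grad(a):
    # Diagonal of the softmax Jacobian: d softmax_i(a) / d a_i.
    s = softmax_stable(a)
    return (1.0 - s) * s

a = np.array([0.5, -1.0, 2.0])
i, eps = 0, 1e-6
e_i = np.eye(len(a))[i]
numeric = (softmax_stable(a + eps * e_i)[i] - softmax_stable(a - eps * e_i)[i]) / (2 * eps)
print(softmax_diag_grad(a)[i], numeric)  # the two values should agree closely
```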