I’m currently working on machine learning applications, and a problem that is rarely mentioned in papers but occurs frequently in practice is numerical overflow in the sigmoid (a.k.a. logistic) function and in its big sister, softmax.

Sigmoid

As a reminder: $\sigma(x) = \frac{1}{1 + \exp(-x)}$

Its derivative: $\frac{d}{dx}\sigma(x) = \bigl(1 - \sigma(x)\bigr)\,\sigma(x)$

The problem here is exp, which quickly goes to infinity, even though the result of σ is restricted to the interval [0, 1]. The solution: the sigmoid can be expressed in terms of tanh: $\sigma(x) = \frac{1}{2}\left(1 + \tanh\left(\frac{x}{2}\right)\right)$.
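
To make this concrete, here is a minimal NumPy sketch (NumPy and the function names are my own choices for illustration, not something prescribed by the identity itself):

```python
import numpy as np

def sigmoid_naive(x):
    # 1 / (1 + exp(-x)): exp(-x) overflows to inf for large negative x
    # (NumPy emits a RuntimeWarning), even though the true result is just 0 there.
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_tanh(x):
    # sigmoid(x) = (1 + tanh(x / 2)) / 2; tanh saturates smoothly at +/-1
    # instead of overflowing.
    return 0.5 * (1.0 + np.tanh(0.5 * x))

x = np.array([-1000.0, 0.0, 1000.0])
print(sigmoid_tanh(x))   # [0.  0.5 1. ] -- no warnings
print(sigmoid_naive(x))  # same values, but with an overflow warning at x = -1000
```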

Softmax

Softmax, which is defined as $\mathrm{softmax}_i(a) = \frac{\exp(a_i)}{\sum_j \exp(a_j)}$ (where $a$ is a vector), is a little more complicated. The key here is to express softmax in terms of the logsumexp function, $\mathrm{logsumexp}(a) = \log\bigl(\sum_i \exp(a_i)\bigr)$, for which good, non-overflowing implementations are usually available.
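
If you are curious what such an implementation does internally, the standard trick is to shift the inputs by their maximum before exponentiating. A minimal sketch of my own (in practice, something like scipy.special.logsumexp already handles this):

```python
import numpy as np

def logsumexp(a):
    # Shifting by max(a) makes every exponent <= 0: exp() may underflow to 0,
    # but it can no longer overflow. The shift is added back outside the log.
    a = np.asarray(a, dtype=float)
    m = np.max(a)
    return m + np.log(np.sum(np.exp(a - m)))

print(logsumexp([1000.0, 1001.0, 1002.0]))  # ~1002.41, no overflow
```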

Then, we have $\mathrm{softmax}(a) = \exp(a - \mathrm{logsumexp}(a))$.
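
Putting the two together, a sketch of a non-overflowing softmax (assuming SciPy is available for the logsumexp part):

```python
import numpy as np
from scipy.special import logsumexp  # an existing non-overflowing implementation

def softmax(a):
    # softmax(a) = exp(a - logsumexp(a)); every exponent is <= 0, so exp cannot overflow.
    a = np.asarray(a, dtype=float)
    return np.exp(a - logsumexp(a))

print(softmax([1000.0, 1001.0, 1002.0]))
# ~[0.09  0.245 0.665]; the naive exp(a_i) / sum_j exp(a_j) returns nan here (inf / inf)
```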

As a bonus: the partial derivatives of softmax with respect to the matching inputs (the diagonal of its Jacobian) are analogous to the sigmoid's derivative, i.e. $\frac{\partial}{\partial a_i}\,\mathrm{softmax}_i(a) = \bigl(1 - \mathrm{softmax}_i(a)\bigr)\,\mathrm{softmax}_i(a)$.
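
This can be sanity-checked numerically with a central finite difference; a small sketch of my own, reusing the stable softmax from above:

```python
import numpy as np
from scipy.special import logsumexp

def softmax(a):
    # Stable softmax via the identity above: softmax(a) = exp(a - logsumexp(a)).
    return np.exp(np.asarray(a, dtype=float) - logsumexp(a))

a = np.array([0.5, -1.2, 3.0])
i, eps = 0, 1e-6

# Diagonal Jacobian entry from the formula above.
s = softmax(a)
analytic = (1.0 - s[i]) * s[i]

# Central finite difference along the i-th coordinate as a sanity check.
e = np.zeros_like(a)
e[i] = eps
numeric = (softmax(a + e)[i] - softmax(a - e)[i]) / (2.0 * eps)

print(analytic, numeric)  # the two values should agree to many decimal places
```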