# MLlib - Optimization Module - Gradient
@(Hadoop & Spark)[machine learning|algorithm|statistics|Spark]
Topic: Gradient - LogisticGradient
## Inference process

- probability

  For a K-class problem, multinomial logistic regression treats class 0 as the pivot and models the class probabilities as

  $$P(y=0|x, w) = \frac{1}{1 + \sum_{i=1}^{K-1} \exp(x w_i)}$$

  $$P(y=1|x, w) = \frac{\exp(x w_1)}{1 + \sum_{i=1}^{K-1} \exp(x w_i)}$$

  $$\cdots$$

  $$P(y=K-1|x, w) = \frac{\exp(x w_{K-1})}{1 + \sum_{i=1}^{K-1} \exp(x w_i)}$$

- loss function

  The loss on a single instance is the negative log-likelihood:

  $$l(w, x) = -\log P(y|x, w) = -\alpha(y) \log P(y=0|x, w) - (1-\alpha(y)) \log P(y|x, w)$$

  $$= \log\Big(1 + \sum_{i=1}^{K-1} \exp(x w_i)\Big) - (1-\alpha(y))\, x w_y = \log\Big(1 + \sum_{i=1}^{K-1} \exp(margins_i)\Big) - (1-\alpha(y))\, margins_y$$

  where $\alpha(i) = 1$ if $i = 0$, $\alpha(i) = 0$ if $i \neq 0$, and $margins_i = x w_i$.

- first derivative

  $$\frac{\partial l(w, x)}{\partial w_{ij}} = \left(\frac{\exp(x w_i)}{1 + \sum_{k=1}^{K-1} \exp(x w_k)} - (1-\alpha(y))\,\delta_{y,i}\right) x_j = multiplier_i \cdot x_j$$

  where $\delta_{i,j} = 1$ if $i = j$, $\delta_{i,j} = 0$ if $i \neq j$, and

  $$multiplier_i = \frac{\exp(margins_i)}{1 + \sum_{k=1}^{K-1} \exp(margins_k)} - (1-\alpha(y))\,\delta_{y,i}$$
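To make the formulas concrete, here is a minimal, self-contained Scala sketch of the naive computation; the object and method names are mine for illustration, not MLlib's API. Class 0 is the pivot, so only the other K−1 classes carry weight vectors and multipliers.

```scala
// Illustrative sketch (not MLlib's actual code) of the naive computation
// above: margins, loss, and the per-class gradient multipliers.
object NaiveLogisticGradient {
  def lossAndMultipliers(
      x: Array[Double],              // feature vector
      label: Int,                    // y in {0, ..., K-1}; class 0 is the pivot
      weights: Array[Array[Double]]  // K-1 rows; row i-1 holds w_i for class i
  ): (Double, Array[Double]) = {
    // margins_i = x . w_i
    val margins = weights.map(w => w.zip(x).map { case (wi, xi) => wi * xi }.sum)
    val expMargins = margins.map(math.exp) // overflows once margins_i > ~709.78
    val denom = 1.0 + expMargins.sum

    // l(w, x) = log(1 + sum_i exp(margins_i)) - (1 - alpha(y)) * margins_y
    val loss = math.log(denom) - (if (label == 0) 0.0 else margins(label - 1))

    // multiplier_i = exp(margins_i) / (1 + sum_k exp(margins_k)) - (1 - alpha(y)) * delta_{y,i};
    // the gradient with respect to w_i is then multiplier_i * x.
    val multipliers = expMargins.zipWithIndex.map { case (e, i) =>
      e / denom - (if (label == i + 1) 1.0 else 0.0)
    }
    (loss, multipliers)
  }
}
```

Evaluating `math.exp(710)` already returns `Double.PositiveInfinity`, so a single outlier far from the hyperplane makes `denom` infinite and every multiplier NaN; the next section shows the standard fix.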
## Arithmetic overflow

When max(margins) > 0, arithmetic overflow can occur: for double precision, $\exp(m)$ becomes infinity once $m$ exceeds roughly 709.78, and the resulting infinity/infinity turns the multipliers into NaN. So the loss function and the multiplier need to be rewritten as below:

$$l(w, x) = \log\Big(1 + \sum_{i=1}^{K-1} \exp(margins_i)\Big) - (1-\alpha(y))\, margins_y$$

$$= \log\Big(\exp(-maxMargin) + \sum_{i=1}^{K-1} \exp(margins_i - maxMargin)\Big) + maxMargin - (1-\alpha(y))\, margins_y$$

$$= \log(1 + sum) + maxMargin - (1-\alpha(y))\, margins_y$$

$$multiplier_i = \frac{\exp(margins_i)}{1 + \sum_{k=1}^{K-1} \exp(margins_k)} - (1-\alpha(y))\,\delta_{y,i} = \frac{\exp(margins_i - maxMargin)}{1 + sum} - (1-\alpha(y))\,\delta_{y,i}$$

where $maxMargin = \max_i margins_i$ and $sum = \exp(-maxMargin) + \sum_{i=1}^{K-1} \exp(margins_i - maxMargin) - 1$. Every exponent is now at most zero, so each exponential lies in $(0, 1]$ and overflow cannot happen.
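The following sketch applies that rewriting. It is a simplified reading of the trick used by MLlib's `LogisticGradient`, not the actual source; clamping the shift at zero folds the max(margins) ≤ 0 case into the same code path.

```scala
// Sketch of the overflow-safe variant: shift every margin by maxMargin
// before exponentiating, so each exponent is at most zero.
object StableLogisticGradient {
  def lossAndMultipliers(
      margins: Array[Double], // margins_i = x . w_i, precomputed; index i-1 holds class i
      label: Int              // y in {0, ..., K-1}
  ): (Double, Array[Double]) = {
    // Only a positive max margin can overflow, so clamp the shift at zero;
    // with a zero shift this reduces exactly to the naive formulas.
    val maxMargin = math.max(margins.max, 0.0)

    // sum = exp(-maxMargin) + sum_i exp(margins_i - maxMargin) - 1,
    // so that 1 + sum = exp(-maxMargin) * (1 + sum_i exp(margins_i)).
    val sum = math.exp(-maxMargin) + margins.map(m => math.exp(m - maxMargin)).sum - 1.0

    // loss = log(1 + sum) + maxMargin - (1 - alpha(y)) * margins_y
    val loss = math.log1p(sum) + maxMargin -
      (if (label == 0) 0.0 else margins(label - 1))

    // multiplier_i = exp(margins_i - maxMargin) / (1 + sum) - (1 - alpha(y)) * delta_{y,i}
    val multipliers = margins.zipWithIndex.map { case (m, i) =>
      math.exp(m - maxMargin) / (1.0 + sum) - (if (label == i + 1) 1.0 else 0.0)
    }
    (loss, multipliers)
  }
}
```

`math.log1p(sum)` is used instead of `math.log(1.0 + sum)` to keep precision when `sum` is close to zero; MLlib's implementation (the multiclass branch of `LogisticGradient.compute` in `org.apache.spark.mllib.optimization`) makes the same choice while accumulating `multiplier_i * x` directly into the gradient vector.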
## Reference

- Trevor Hastie, Robert Tibshirani, and Jerome Friedman, *The Elements of Statistical Learning: Data Mining, Inference, and Prediction*, 2nd Edition (downloadable from http://statweb.stanford.edu/~tibs/ElemStatLearn/). Eq. (4.17) on page 119 gives the formula of the multinomial logistic regression model.