Section 4: A Remarkable Coincidence - the Loss Function of Logistic Regression
Suppose a correct prediction scores 1 point and an incorrect one scores 0 points, with label $\in \{1, -1\}$.
Then, for the $i$-th training example, if the true label is $1$, the probability of scoring 1 point is $\frac{1}{1+\exp(-\vec{w}^T\vec{x_i})}$.
If the true label is $-1$, the probability of scoring 1 point is $\frac{\exp(-\vec{w}^T\vec{x_i})}{1+\exp(-\vec{w}^T\vec{x_i})} = \frac{1}{1+\exp(\vec{w}^T\vec{x_i})}$.
Combining the two cases, the probability of scoring 1 point is $P(accurate) = \frac{1}{1+\exp(-{\color{red}y_i}\vec{w}^T\vec{x_i})}$.
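As a sanity check, here is a minimal sketch of this unified probability (assuming NumPy; the names `p_accurate`, `w`, `x_i`, `y_i` are illustrative):

```python
import numpy as np

def p_accurate(w, x_i, y_i):
    """Probability of 'scoring 1 point', i.e. of predicting label y_i
    correctly: 1 / (1 + exp(-y_i * w^T x_i)).

    This single expression covers both the y_i = 1 and y_i = -1
    cases derived above.
    """
    return 1.0 / (1.0 + np.exp(-y_i * np.dot(w, x_i)))

# The two cases match the separate formulas above:
w = np.array([0.5, -1.0])
x_i = np.array([2.0, 1.0])
print(p_accurate(w, x_i, +1))  # 1 / (1 + exp(-w^T x_i))
print(p_accurate(w, x_i, -1))  # 1 / (1 + exp(+w^T x_i))
```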
$$
\begin{aligned}
Loss &= \text{negative sum of log accuracy} \\
&= -\sum_{i=1}^{n}\log\big(P(accurate)\big) \\
&= -\sum_{i=1}^{n}\log\left(\frac{1}{1+\exp(-y_i\vec{w}^T\vec{x_i})}\right) \\
&= \sum_{i=1}^{n}\log\big[\,1+\exp(-y_i\vec{w}^T\vec{x_i})\,\big]
\end{aligned}
$$
where $n$ is the batch_size.
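A minimal sketch of this batch loss under the same assumptions (NumPy; `X` is a hypothetical $n \times d$ matrix of examples and `y` the vector of $\pm 1$ labels):

```python
import numpy as np

def logistic_loss(w, X, y):
    """Loss = sum_i log(1 + exp(-y_i * w^T x_i)) over a batch of n examples."""
    margins = y * (X @ w)                 # y_i * w^T x_i, shape (n,)
    return np.sum(np.log1p(np.exp(-margins)))
```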
If we use SGD (stochastic gradient descent), then $n = 1$. Taking the derivative of the loss with respect to $\vec w$:
$$
\begin{aligned}
\frac{\partial{Loss}}{\partial{\vec w}} &= \frac{\exp(-y_i\vec{w}^T\vec{x_i})\,(-y_i\vec{x_i})}{1+\exp(-y_i\vec{w}^T\vec{x_i})} \\
&= (-y_i\vec{x_i})\,P(not\ accurate)
\end{aligned}
$$
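For the SGD case ($n = 1$) this gradient formula translates directly into code; a sketch reusing the hypothetical `p_accurate` from above:

```python
def gradient(w, x_i, y_i):
    """dLoss/dw = (-y_i * x_i) * P(not accurate), for a single example."""
    p_not_accurate = 1.0 - p_accurate(w, x_i, y_i)
    return -y_i * x_i * p_not_accurate
```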
Remember the weight-update rule of gradient descent? $\Rightarrow\ \vec w = \vec w - \alpha d$
where $\alpha$ is the learning rate and $d$ is the gradient, i.e., the $(-y_i\vec{x_i})\,P(not\ accurate)$ we just computed.
Weight update: $\vec w = \vec w + \alpha\, P(not\ accurate)\,y_i\vec{x_i}$ (the two minus signs cancel, so the update moves $\vec w$ toward $y_i\vec{x_i}$, scaled by how likely the model is to be wrong).
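Putting the pieces together, one SGD step looks like the sketch below (again using the hypothetical helpers above; `alpha` is a placeholder learning rate):

```python
def sgd_step(w, x_i, y_i, alpha=0.1):
    """One SGD update: w <- w - alpha * gradient.

    Since the gradient is (-y_i * x_i) * P(not accurate), the update
    moves w toward y_i * x_i, weighted by the current probability of
    a wrong prediction.
    """
    p_not_accurate = 1.0 - p_accurate(w, x_i, y_i)
    return w + alpha * p_not_accurate * y_i * x_i
```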
In general, the logistic loss is written as ${\color{#FF7256}L(y, f(x))=\log[\,1+\exp(-y\,f(x))\,]}$. That is, in logistic regression, $f(x)=\vec w^T\vec x$.
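This general form can be written once and then specialized; a sketch (assuming NumPy, where `np.logaddexp(0, z)` computes $\log(1+\exp(z))$ without overflow):

```python
import numpy as np

def logistic_loss_general(y, f_x):
    """L(y, f(x)) = log(1 + exp(-y * f(x))), computed stably."""
    return np.logaddexp(0.0, -y * f_x)

# In logistic regression, f(x) = w^T x:
w = np.array([0.5, -1.0])
x = np.array([2.0, 1.0])
print(logistic_loss_general(1, np.dot(w, x)))  # log(1 + exp(-w^T x))
```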