Copyright notice: This is an original article by the author, licensed under CC 4.0 BY-SA. When reposting, please include the original source link and this notice.
Original link: https://blog.csdn.net/junxing2018_wu/article/details/117520184
Logistic Regression
Properties
- Targets binary classification problems
- Derived from conditional probability
- The composition of a linear function ($y = w^Tx + b$) and the logistic function ($y = \frac{1}{1+e^{-x}}$)
- A linear classifier (it is the decision boundary that makes LR linear)
Computation process

For a binary classification problem:

$$p(y=1 \mid x, w, b) = \frac{1}{1+e^{-(w^Tx+b)}}, \qquad p(y=0 \mid x, w, b) = \frac{e^{-(w^Tx+b)}}{1+e^{-(w^Tx+b)}}$$

The two cases can be merged into a single expression:

$$p(y \mid x, w, b) = p(y=1 \mid x, w, b)^{y}\,[1 - p(y=1 \mid x, w, b)]^{1-y}$$
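As a quick numerical sanity check of the merged form, the sketch below evaluates both branch probabilities for a single sample (the weight, bias, and feature values here are made up for illustration):

```python
import numpy as np

def sigmoid(z):
    # logistic function: 1 / (1 + e^{-z})
    return 1.0 / (1.0 + np.exp(-z))

def p_y_given_x(y, x, w, b):
    # merged form: p(y|x,w,b) = p1^y * (1 - p1)^(1 - y)
    p1 = sigmoid(w @ x + b)
    return p1**y * (1.0 - p1)**(1 - y)

w = np.array([0.5, -1.0])   # hypothetical weights
b = 0.1                     # hypothetical bias
x = np.array([2.0, 1.0])    # hypothetical sample

p1 = p_y_given_x(1, x, w, b)
p0 = p_y_given_x(0, x, w, b)
# the two cases are a valid probability distribution: p1 + p0 == 1
```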
Objective function

Suppose we have a dataset $D = \{(x_i, y_i)\}_{i=1}^{n}$ with $x_i \in \mathbb{R}^d$ and $y_i \in \{0, 1\}$.

We then maximize the likelihood (maximum likelihood estimation, MLE):

$$
\begin{aligned}
\hat{w}_{MLE}, \hat{b}_{MLE} &= \mathop{\arg\max}_{w,b} \prod_{i=1}^{n} p(y_i \mid x_i, w, b) \\
&= \mathop{\arg\max}_{w,b} \sum_{i=1}^{n} \log p(y_i \mid x_i, w, b) \\
&= \mathop{\arg\min}_{w,b} -\sum_{i=1}^{n} \log p(y_i \mid x_i, w, b) \\
&= \mathop{\arg\min}_{w,b} -\sum_{i=1}^{n} \log\left( p(y_i=1 \mid x_i, w, b)^{y_i} \, [1 - p(y_i=1 \mid x_i, w, b)]^{1-y_i} \right) \\
&= \mathop{\arg\min}_{w,b} -\sum_{i=1}^{n} \left( y_i \log p(y_i=1 \mid x_i, w, b) + (1-y_i) \log(1 - p(y_i=1 \mid x_i, w, b)) \right) \\
&= \mathop{\arg\min}_{w,b} -\sum_{i=1}^{n} \left( y_i \log \sigma(w^Tx_i+b) + (1-y_i) \log(1 - \sigma(w^Tx_i+b)) \right)
\end{aligned}
$$
Write $L(w,b) = -\sum_{i=1}^{n} \left( y_i \log \sigma(w^Tx_i+b) + (1-y_i) \log(1 - \sigma(w^Tx_i+b)) \right)$.

Then

$$
\begin{aligned}
\frac{\partial L(w,b)}{\partial w} &= -\sum_{i=1}^{n} \left( y_i \cdot \frac{\sigma(w^Tx_i+b)\,[1-\sigma(w^Tx_i+b)]}{\sigma(w^Tx_i+b)} \cdot x_i + (1-y_i) \cdot \frac{-\sigma(w^Tx_i+b)\,[1-\sigma(w^Tx_i+b)]}{1-\sigma(w^Tx_i+b)} \cdot x_i \right) \\
&= -\sum_{i=1}^{n} \left( y_i\,[1-\sigma(w^Tx_i+b)] + (y_i-1)\,\sigma(w^Tx_i+b) \right) x_i \\
&= -\sum_{i=1}^{n} [y_i - \sigma(w^Tx_i+b)]\, x_i \\
&= \sum_{i=1}^{n} [\sigma(w^Tx_i+b) - y_i]\, x_i
\end{aligned}
$$

$$
\begin{aligned}
\frac{\partial L(w,b)}{\partial b} &= -\sum_{i=1}^{n} \left( y_i \cdot \frac{\sigma(w^Tx_i+b)\,[1-\sigma(w^Tx_i+b)]}{\sigma(w^Tx_i+b)} + (1-y_i) \cdot \frac{-\sigma(w^Tx_i+b)\,[1-\sigma(w^Tx_i+b)]}{1-\sigma(w^Tx_i+b)} \right) \\
&= -\sum_{i=1}^{n} \left( y_i\,[1-\sigma(w^Tx_i+b)] + (y_i-1)\,\sigma(w^Tx_i+b) \right) \\
&= -\sum_{i=1}^{n} [y_i - \sigma(w^Tx_i+b)] \\
&= \sum_{i=1}^{n} [\sigma(w^Tx_i+b) - y_i]
\end{aligned}
$$
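The loss and gradient formulas above are easy to verify numerically. The sketch below (with made-up toy data and parameter values) computes $L(w,b)$ and the analytic gradients, then checks $\partial L/\partial b$ against a central finite-difference estimate:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w, b, X, y):
    # L(w,b) = -sum_i [ y_i log p_i + (1 - y_i) log(1 - p_i) ]
    p = sigmoid(X @ w + b)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def grads(w, b, X, y):
    # dL/dw = sum_i (p_i - y_i) x_i ;  dL/db = sum_i (p_i - y_i)
    p = sigmoid(X @ w + b)
    return X.T @ (p - y), np.sum(p - y)

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 2))                    # 8 toy samples, 2 features
y = (X[:, 0] + X[:, 1] > 0).astype(float)
w, b = np.array([0.3, -0.2]), 0.05

gw, gb = grads(w, b, X, y)
eps = 1e-6
fd_b = (loss(w, b + eps, X, y) - loss(w, b - eps, X, y)) / (2 * eps)
# analytic dL/db should match the finite-difference estimate
```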
Notes:
- ${\sigma(x)}' = \sigma(x)(1-\sigma(x))$
- In the final expression, $\sigma(w^Tx_i + b)$ is the predicted value and $y_i$ is the true label. So during gradient descent we keep comparing each sample's prediction with its label and use the difference to update $w$, eventually learning a good $w$ and $b$; in effect, the predictions are driven ever closer to the true values.
- Note that when the given data are linearly separable, the parameters of logistic regression may diverge to infinity (an overfitting phenomenon; a regularization term is needed).
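For the regularization mentioned in the last note, an L2 penalty is a common choice: minimizing $L(w,b) + \frac{\lambda}{2}\|w\|^2$ only adds $\lambda w$ to the gradient with respect to $w$. A minimal sketch ($\lambda$ and the toy values are illustrative, not from the original text):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grads_l2(w, b, X, y, lam):
    # gradients of L(w,b) + (lam/2)*||w||^2 ; the penalty only touches dL/dw
    p = sigmoid(X @ w + b)
    return X.T @ (p - y) + lam * w, np.sum(p - y)

# toy check: the regularized gradient differs from the plain one by exactly lam*w
X = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([1.0, 0.0])
w, b, lam = np.array([0.5, -0.5]), 0.0, 0.1
gw_plain, _ = grads_l2(w, b, X, y, 0.0)
gw_reg, _ = grads_l2(w, b, X, y, lam)
```

The penalty keeps $\|w\|$ bounded, which is exactly what prevents the blow-up on separable data.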
Gradient descent
- Initialize $w^0$, $b^0$
- Set the number of epochs $m$ and the learning rate $\eta$
- Iterate $t$ from 0 to $m$:
  $$w^{t+1} = w^{t} - \eta \cdot \sum_{i=1}^{n} [\sigma((w^t)^Tx_i + b^t) - y_i]\, x_i$$
  $$b^{t+1} = b^{t} - \eta \cdot \sum_{i=1}^{n} [\sigma((w^t)^Tx_i + b^t) - y_i]$$
- Stopping conditions:
  - $\left| L_t(w,b) - L_{t+1}(w,b) \right| < \epsilon$
  - $\left\| w^{t} - w^{t-1} \right\| < \epsilon$
  - validation data (early stopping)
  - fixed iteration (maximum number of iterations)
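Putting the update rules and stopping conditions together, a minimal batch gradient-descent loop might look like the following sketch (the toy data and the values of $\eta$, $m$, $\epsilon$ are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w, b, X, y):
    # clip to avoid log(0) when predictions saturate
    p = np.clip(sigmoid(X @ w + b), 1e-12, 1 - 1e-12)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def fit(X, y, eta=0.01, m=10000, eps=1e-8):
    # batch gradient descent with the |L_t - L_{t+1}| < eps stopping rule
    w, b = np.zeros(X.shape[1]), 0.0
    prev = loss(w, b, X, y)
    for _ in range(m):
        p = sigmoid(X @ w + b)
        w = w - eta * (X.T @ (p - y))    # w^{t+1} = w^t - eta * sum_i (p_i - y_i) x_i
        b = b - eta * np.sum(p - y)      # b^{t+1} = b^t - eta * sum_i (p_i - y_i)
        cur = loss(w, b, X, y)
        if abs(prev - cur) < eps:        # loss has stopped improving
            break
        prev = cur
    return w, b

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
y = (X @ np.array([1.0, -2.0]) > 0).astype(float)   # labels from a known separator
w, b = fit(X, y)
acc = np.mean((sigmoid(X @ w + b) > 0.5) == (y == 1))
```

Since the toy labels here are noise-free (linearly separable), this is precisely the setting where $\|w\|$ would keep growing without the stopping rules or a regularization term.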