Logistic Regression
Theory
Although it is called logistic regression, it performs classification, not regression.
Binary Classification
Assume $y_i \in \{0, 1\}$. Logistic regression's task is then to estimate $p(y_i = 1 \mid x_i) = \mathrm{sigmoid}(\omega x_i + b) = \frac{1}{1 + \exp(-(\omega x_i + b))}$.
Using maximum likelihood, we obtain the cost function:
$$L(x_i, \omega) = -\log\left(p(y_i=1)^m \cdot p(y_i=0)^n\right) + r(\omega) = \min_{\omega} C \sum_{i=1}^n \left(-y_i \log(p(y_i=1)) - (1 - y_i) \log(1 - p(y_i=1))\right) + r(\omega).$$
- where $r(\omega)$ denotes the regularization term:
  - None: $0$
  - l1: $\|w\|_1$
  - l2: $\frac{1}{2}\|w\|_2^2 = \frac{1}{2} w^T w$
  - ElasticNet: $\frac{1 - \rho}{2} w^T w + \rho \|w\|_1$
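As a concrete check, here is a minimal NumPy sketch of this binary cost with the l2 regularizer (the helper names sigmoid and binary_lr_loss are my own, not from any library):

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def binary_lr_loss(w, b, X, y, C=1.0):
    """The cost above with the l2 regularizer: C * NLL + 0.5 * w.T @ w."""
    p = sigmoid(X @ w + b)                               # p(y_i = 1 | x_i)
    eps = 1e-12                                          # guard against log(0)
    nll = -(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps)).sum()
    return C * nll + 0.5 * w @ w

# toy check: at w = 0 every p is 0.5, so the loss is n * log(2)
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))
y = (X[:, 0] > 0).astype(float)
print(binary_lr_loss(np.zeros(3), 0.0, X, y))            # ~5.545 = 8 * log(2)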
Multiclass Classification
Binary classification extends to the multiclass case, known as multinomial logistic regression.
Assume $y_i \in \{0, 1, \ldots, K-1\}$. Multinomial logistic regression's task is then to estimate $p(y_i = k \mid x_i) = \mathrm{softmax}(\omega x_i + b)_k = \frac{\exp(\omega_k x_i + b_k)}{\sum_{j=0}^{K-1} \exp(\omega_j x_i + b_j)}$.
Using maximum likelihood, we obtain the cost function:
$$\min_{\omega} -C \sum_{i=1}^n \sum_{k=0}^{K-1} [y_i = k] \log(p(y_i = k \mid x_i)) + r(\omega).$$
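A matching NumPy sketch of the multiclass cost (again with hypothetical helper names, using the l2 penalty for $r(\omega)$):

import numpy as np

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)   # subtract row max for stability
    e = np.exp(Z)
    return e / e.sum(axis=1, keepdims=True)

def multinomial_lr_loss(W, b, X, y, C=1.0):
    """-C * sum_i log p(y_i | x_i) + 0.5 * ||W||^2, as in the cost above."""
    P = softmax(X @ W + b)                 # P[i, k] = p(y_i = k | x_i)
    picked = P[np.arange(len(y)), y]       # [y_i = k] picks one entry per row
    return -C * np.log(picked + 1e-12).sum() + 0.5 * (W ** 2).sum()

# toy check with K = 3: at W = 0 every class gets 1/3, loss = n * log(3)
X = np.ones((6, 4))
y = np.array([0, 1, 2, 0, 1, 2])
print(multinomial_lr_loss(np.zeros((4, 3)), np.zeros(3), X, y))  # ~6.592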
Practice
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
X, y = load_iris(return_X_y=True)
X.shape, y.shape
((150, 4), (150,))
The “lbfgs” solver is used by default for its robustness. For large datasets the “saga” solver is usually faster.
- The choice of solver depends on the penalty chosen. Penalties supported by each solver (an elastic-net example follows this list):
- ‘newton-cg’ - [‘l2’, ‘none’]
- ‘lbfgs’ - [‘l2’, ‘none’]
- ‘liblinear’ - [‘l1’, ‘l2’]
- ‘sag’ - [‘l2’, ‘none’]
- ‘saga’ - [‘elasticnet’, ‘l1’, ‘l2’, ‘none’]
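For example, the elastic-net penalty is only available with the saga solver and needs the extra l1_ratio argument, which plays the role of $\rho$ in the ElasticNet formula above (a minimal sketch reusing the X, y loaded earlier):

# elastic-net requires solver='saga'; l1_ratio is the rho above
# (l1_ratio=0 is pure l2, l1_ratio=1 is pure l1)
enet = LogisticRegression(penalty='elasticnet', solver='saga',
                          l1_ratio=0.5, max_iter=5000)
enet.fit(X, y)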
clf = LogisticRegression(
penalty='l2', # {‘l1’, ‘l2’, ‘elasticnet’, ‘none’}, default=’l2’
C=1.0, # default=1.0, smaller values specify stronger regularization
class_weight=None, # dict or ‘balanced’, default=None,
random_state=0,
solver='saga', # {‘newton-cg’, ‘lbfgs’, ‘liblinear’, ‘sag’, ‘saga’}, default=’lbfgs’
max_iter=100, # int, default=100,
)
clf.fit(X, y)
clf.predict(X[:2, :])
array([0, 0])
clf.predict_proba(X[:2, :])
array([[9.82352427e-01, 1.76474401e-02, 1.32380082e-07],
[9.53945904e-01, 4.60532191e-02, 8.76762218e-07]])
clf.score(X, y)
0.9866666666666667
Common Questions
- Why does LR use cross-entropy loss instead of MSE as its loss function?
The cross-entropy loss and its derivative are:
- Loss function: $C = -\frac{1}{n} \sum \left( y \log(\hat{y}) + (1 - y) \log(1 - \hat{y}) \right)$
- Derivative: $\frac{dC}{d\omega} = \frac{1}{n} \sum (\hat{y} - y) x$
The MSE loss and its derivative are:
- Loss function: $C = \frac{(y - \hat{y})^2}{2}$
- Derivative: $\frac{dC}{d\omega} = \frac{dC}{d\hat{y}} \cdot \frac{d\hat{y}}{dz} \cdot \frac{dz}{d\omega} = (\hat{y} - y)\, \sigma'(z)\, x$
With MSE, the gradient carries the factor $\sigma'(z)$, which is close to 0 when $z$ is very large or very small, so the loss converges slowly even when the prediction is badly wrong; with cross entropy the $\sigma'(z)$ factor cancels and the gradient stays proportional to the error $\hat{y} - y$.
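A quick numeric sketch of this saturation effect (toy values, not library code):

import numpy as np

# MSE's gradient carries sigma'(z) = sigma(z)(1 - sigma(z)); cross entropy's does not
sigmoid = lambda z: 1 / (1 + np.exp(-z))
for z in [0.0, 5.0, 10.0]:
    y_hat, y, x = sigmoid(z), 0.0, 1.0     # confidently wrong prediction
    ce_grad = (y_hat - y) * x
    mse_grad = (y_hat - y) * y_hat * (1 - y_hat) * x
    print(f"z={z:5.1f}  ce_grad={ce_grad:.4f}  mse_grad={mse_grad:.6f}")
# at z = 10 the CE gradient stays ~1.0 while the MSE gradient is ~4.5e-05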
- Can LR's parameters all be initialized to zero?
Yes. The cross-entropy gradient $(\hat{y} - y)x$ does not vanish just because $\omega = 0$ (every prediction starts at 0.5), so the parameters update normally.
Multi-layer neural networks cannot do this, because the gradient of $w_1$ depends on $w_2$, $w_3$, and so on; all-zero initialization then drives these gradients to 0 and blocks the parameter updates.
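A small sketch of the single-layer case (toy data, hypothetical names):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(16, 3))
y = (X[:, 0] > 0).astype(float)

w = np.zeros(3)                        # all-zero initialization
p = 1 / (1 + np.exp(-(X @ w)))         # every prediction starts at 0.5
grad = X.T @ (p - y) / len(y)          # cross-entropy gradient
print(grad)                            # nonzero, so w moves off zero immediately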
- What are LR's pros and cons?
- Pros
  - Simple model, highly interpretable
- Cons
  - A linear model, so it cannot solve nonlinear problems
  - No automatic feature crossing; substantial manual feature engineering is required