Logistic regression is a linear model for classification; note that the dependent variable (target) is categorical.
Binary classification
Assume $Y=\{0,1\}$. The predicted probability of the positive class, $P(y_i=1|X_i)$, is:
$$\hat{p}(X_i)=\operatorname{expit}(X_i w + w_0)=\frac{1}{1+\exp(-X_i w - w_0)}$$
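As a quick check of the formula above, the predicted probability can be computed directly with `scipy.special.expit`; the weights and sample below are made-up illustrative values, not fitted ones.

```python
import numpy as np
from scipy.special import expit

# Illustrative (assumed) weights w, intercept w0, and one sample X_i
w = np.array([0.5, -0.25])
w0 = 0.1
X_i = np.array([2.0, 1.0])

# expit(t) = 1 / (1 + exp(-t))
p_hat = expit(X_i @ w + w0)

# identical to the explicit form in the equation
p_manual = 1.0 / (1.0 + np.exp(-(X_i @ w) - w0))
assert np.isclose(p_hat, p_manual)
```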
Its objective function is:
$$\min_w C\sum_{i=1}^n\left(-y_i\log(\hat{p}(X_i))-(1-y_i)\log(1-\hat{p}(X_i))\right)+r(w)$$
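The data-dependent part of this objective is the binary cross-entropy. A minimal numpy sketch evaluating it on made-up data, with $r(w)$ taken to be the $\ell_2$ penalty $\frac{1}{2}w^\top w$ (one of the available choices):

```python
import numpy as np
from scipy.special import expit

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))        # made-up features
y = rng.integers(0, 2, size=100)     # made-up 0/1 labels
w = rng.normal(size=3)               # arbitrary (unfitted) weights
w0 = 0.0
C = 1.0

p = expit(X @ w + w0)
# sum_i [ -y_i log p_i - (1 - y_i) log(1 - p_i) ], scaled by C
nll = C * np.sum(-y * np.log(p) - (1 - y) * np.log(1 - p))
# add the l2 choice of r(w)
objective = nll + 0.5 * w @ w
```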
where $r(w)$ is the regularization term. sklearn offers four choices:
For ElasticNet, $\rho$ corresponds to the `l1_ratio` parameter. When $\rho=1$, ElasticNet is equivalent to the $\ell_1$ penalty; when $\rho=0$, it is equivalent to the $\ell_2$ penalty.
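This mapping shows up directly in `LogisticRegression(penalty="elasticnet", l1_ratio=...)`. A small sketch on synthetic data (all values illustrative) comparing the two ends of the `l1_ratio` range:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# made-up classification data
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# l1_ratio=1.0 -> pure l1; l1_ratio=0.0 -> pure l2
clf_l1 = LogisticRegression(penalty="elasticnet", solver="saga",
                            l1_ratio=1.0, C=0.1, max_iter=5000)
clf_l2 = LogisticRegression(penalty="elasticnet", solver="saga",
                            l1_ratio=0.0, C=0.1, max_iter=5000)
clf_l1.fit(X, y)
clf_l2.fit(X, y)

# the l1 end of the path tends to zero out more coefficients
print((clf_l1.coef_ == 0).sum(), (clf_l2.coef_ == 0).sum())
```

Note that `penalty="elasticnet"` requires `solver="saga"`.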
Multiclass classification
Assume $Y=\{1,\cdots,K\}$. `predict_proba` gives the predicted probability $P(y_i=k|X_i)$ that sample $i$ belongs to class $k$:
$$\hat{p}_k(X_i)=\frac{\exp(X_i W_k+W_{0,k})}{\sum_{l=0}^{K-1}\exp(X_i W_l+W_{0,l})}$$
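The formula above is the softmax function. A minimal numpy sketch, with illustrative (made-up) decision scores $X_i W_k + W_{0,k}$ for $K=3$ classes:

```python
import numpy as np

def softmax(scores):
    # subtract the max for numerical stability; this does not change the result
    z = scores - scores.max()
    e = np.exp(z)
    return e / e.sum()

# illustrative decision scores for K = 3 classes
scores = np.array([2.0, 1.0, 0.1])
p = softmax(scores)
# the class probabilities sum to 1, and larger scores get larger probabilities
```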
The objective function is:
$$\min_W -C\sum_{i=1}^n\sum_{k=0}^{K-1}[y_i=k]\log(\hat{p}_k(X_i))+r(W)$$
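The indicator $[y_i=k]$ simply selects the log-probability of each sample's true class, so the double sum reduces to one log term per sample. A sketch with made-up predicted probabilities:

```python
import numpy as np

rng = np.random.default_rng(0)
n, K = 50, 4
# made-up predicted class probabilities; each row sums to 1
P = rng.dirichlet(np.ones(K), size=n)
y = rng.integers(0, K, size=n)   # made-up true labels in {0, ..., K-1}
C = 1.0

# sum_i sum_k [y_i = k] log p_k(X_i)  ==  sum_i log p_{y_i}(X_i)
loss = -C * np.sum(np.log(P[np.arange(n), y]))
```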
The four choices for $r(W)$ here are analogous to the binary case.
Solvers for LogisticRegression in sklearn
1. `solver="liblinear"` uses a coordinate descent (CD) algorithm, but it does not handle multiclass problems well.
2. `solver="sag"` uses Stochastic Average Gradient descent; it is faster on large datasets.
3. `solver="saga"` is a variant of `"sag"` that also supports $\ell_1$ regularization and `penalty="elasticnet"`; it is faster on large samples.
4. `solver="lbfgs"` is an optimization algorithm that approximates the Broyden–Fletcher–Goldfarb–Shanno algorithm. It is the default because it is robust across a wide variety of training sets, but it can perform poorly on datasets with one-hot encoded categorical features.
5. `solver="newton-cholesky"` is a very good choice when `n_samples` >> `n_features`, but it only supports $\ell_2$ regularization.
`"lbfgs"`, `"newton-cg"` and `"sag"` only support $\ell_2$ regularization or no regularization; they are faster on high-dimensional data and also perform better on multiclass problems.
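These solver/penalty constraints are enforced when the model is fit. A quick illustration on synthetic data (all values made up):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=100, n_features=5, random_state=0)

# lbfgs (the default) supports only l2 or no penalty
LogisticRegression(solver="lbfgs", penalty="l2").fit(X, y)

# requesting l1 with lbfgs is rejected at fit time
raised = False
try:
    LogisticRegression(solver="lbfgs", penalty="l1").fit(X, y)
except ValueError:
    raised = True

# saga, by contrast, supports l1 (and elasticnet)
clf = LogisticRegression(solver="saga", penalty="l1", max_iter=5000).fit(X, y)
```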
A summary is given in the table below:
Hyperparameter selection
`LogisticRegressionCV` is generally used to select the optimal hyperparameters `C` and `l1_ratio`. In general, the `"newton-cg"`, `"sag"`, `"saga"` and `"lbfgs"` solvers are faster for high-dimensional data.
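A minimal sketch of `LogisticRegressionCV` searching over `C` (and over `l1_ratio`, which is only searched when `penalty="elasticnet"`); the data and candidate grids below are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

clf = LogisticRegressionCV(
    Cs=[0.01, 0.1, 1.0, 10.0],   # candidate values of C
    l1_ratios=[0.0, 0.5, 1.0],   # searched because penalty="elasticnet"
    penalty="elasticnet",
    solver="saga",               # saga is required for elasticnet
    cv=3,
    max_iter=5000,
)
clf.fit(X, y)
print("best C:", clf.C_, "best l1_ratio:", clf.l1_ratio_)
```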
Example: MNIST digit classification task
We use the SAGA solver, which handles large datasets quickly, with $\ell_1$ regularization for classification.
import time

import matplotlib.pyplot as plt
import numpy as np

from sklearn.datasets import fetch_openml
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.utils import check_random_state

# Turn down for faster convergence
t0 = time.time()
train_samples = 5000

# Load data from https://www.openml.org/d/554
X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)

random_state = check_random_state(0)
permutation = random_state.permutation(X.shape[0])
X = X[permutation]
y = y[permutation]
X = X.reshape((X.shape[0], -1))

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=train_samples, test_size=10000
)

# Standardize the data before fitting
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Turn up tolerance for faster convergence
clf = LogisticRegression(C=50.0 / train_samples, penalty="l1", solver="saga", tol=0.1)
clf.fit(X_train, y_train)
sparsity = np.mean(clf.coef_ == 0) * 100
score = clf.score(X_test, y_test)
# print('Best C % .4f' % clf.C_)
print("Sparsity with L1 penalty: %.2f%%" % sparsity)
print("Test score with L1 penalty: %.4f" % score)

coef = clf.coef_.copy()
plt.figure(figsize=(10, 5))
scale = np.abs(coef).max()
for i in range(10):
    l1_plot = plt.subplot(2, 5, i + 1)
    l1_plot.imshow(
        coef[i].reshape(28, 28),
        interpolation="nearest",
        cmap=plt.cm.RdBu,
        vmin=-scale,
        vmax=scale,
    )
    l1_plot.set_xticks(())
    l1_plot.set_yticks(())
    l1_plot.set_xlabel("Class %i" % i)
plt.suptitle("Classification vector for...")

run_time = time.time() - t0
print("Example run in %.3f s" % run_time)
The output is: