Logistic Regression
Largely excerpted from the Machine Learning notes (https://github.com/fengdu78/Coursera-ML-AndrewNg-Notes).
1. Overview
Logistic regression, also called log-odds regression, is suited to fitting binary (0/1) outputs; it is a classification model.
2. Model
A linear model alone does not handle classification well, so a logistic function is introduced on top of the linear model.
The hypothesis of the logistic regression model is:
$$
h_\theta(x) = g(\theta^T X)
$$
where $X$ denotes the feature vector and $g$ denotes the logistic function. The Sigmoid function is a commonly used logistic function, given by:
$$
g(z) = \frac{1}{1 + e^{-z}}
$$
In logistic regression:
- when $h_\theta(x) \geq 0.5$, predict $y = 1$;
- when $h_\theta(x) < 0.5$, predict $y = 0$.

From the shape of the Sigmoid curve it is easy to see that:
- when $z = 0$, $g(z) = 0.5$;
- when $z > 0$, $g(z) > 0.5$;
- when $z < 0$, $g(z) < 0.5$.
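To make the hypothesis and the 0.5 decision rule concrete, here is a minimal NumPy sketch (the names `sigmoid`, `predict_label`, `theta`, and `x` are illustrative, not part of the original notes):

```python
import numpy as np

def sigmoid(z):
    """Logistic (Sigmoid) function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_label(theta, x):
    """Hypothesis h_theta(x) = g(theta^T x) followed by the 0.5 threshold."""
    h = sigmoid(np.dot(theta, x))
    return 1 if h >= 0.5 else 0

theta = np.array([0.0, 1.0, -1.0])  # hypothetical parameters
x = np.array([1.0, 2.0, 0.5])       # x_0 = 1 plays the role of the intercept
print(sigmoid(0.0))                 # 0.5: z = 0 is exactly the decision boundary
print(predict_label(theta, x))      # 1, because theta^T x = 1.5 > 0
```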
3. Cost Function
Let the dataset and model be as follows:
$$
\{(x^{(1)},y^{(1)}),(x^{(2)},y^{(2)}),\dots,(x^{(m)},y^{(m)})\} \\
x = \begin{bmatrix} x_{0} \\ x_{1} \\ \vdots \\ x_{n} \end{bmatrix}, \quad x_0 = 1, \quad y \in \{0,1\} \\
h_{\theta}(x) = \frac{1}{1 + e^{-\theta^T x}}
$$
For a linear model, the cost function is the sum of squared errors. In principle the same definition could be reused for logistic regression, but once $h_{\theta}(x) = \frac{1}{1 + e^{-\theta^T x}}$ is substituted into a cost function defined this way, the resulting cost function is non-convex.
A non-convex cost function has many local minima, which hinders gradient descent from finding the global minimum. The cost function of linear regression is:
$$
J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \frac{1}{2}\left(h_{\theta}(x^{(i)}) - y^{(i)}\right)^2
$$
We therefore redefine the cost function of logistic regression as:
$$
J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \mathrm{Cost}\left(h_{\theta}(x^{(i)}), y^{(i)}\right)
$$
where:
$$
\mathrm{Cost}(h_{\theta}(x), y) = \begin{cases}
-\log(h_{\theta}(x)) & \text{if } y=1 \\
-\log(1-h_{\theta}(x)) & \text{if } y=0
\end{cases}
$$
The relationship between $h_{\theta}(x)$ and $\mathrm{Cost}(h_{\theta}(x), y)$ behaves as follows.
The constructed $\mathrm{Cost}(h_{\theta}(x), y)$ has the following property: when the actual label is $y=1$ and $h_{\theta}(x)$ is also 1, the error is 0; when $y=1$ but $h_{\theta}(x)$ is not 1, the error grows as $h_{\theta}(x)$ shrinks. Conversely, when the actual label is $y=0$ and $h_{\theta}(x)$ is also 0, the error is 0; when $y=0$ but $h_{\theta}(x)$ is not 0, the error grows as $h_{\theta}(x)$ grows. For example, with $y=1$, a prediction of $h_{\theta}(x)=0.99$ costs $-\log(0.99)\approx 0.01$, while $h_{\theta}(x)=0.01$ costs $-\log(0.01)\approx 4.6$ (natural logarithm).
The constructed $\mathrm{Cost}(h_{\theta}(x), y)$ can be written more compactly as:
$$
\mathrm{Cost}(h_{\theta}(x), y) = -y \times \log(h_{\theta}(x)) - (1-y) \times \log(1- h_{\theta}(x))
$$
Substituting this expression into the cost function gives:
$$
J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \left[ -y^{(i)} \log(h_{\theta}(x^{(i)})) - (1-y^{(i)}) \log(1- h_{\theta}(x^{(i)})) \right]
$$
that is:
$$
J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log(h_{\theta}(x^{(i)})) + (1-y^{(i)}) \log(1- h_{\theta}(x^{(i)})) \right]
$$
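The final form of $J(\theta)$ translates almost line-for-line into NumPy. A minimal vectorized sketch (the names `cost`, `X`, `y`, and `theta` are illustrative; in practice the `np.log` calls should be guarded against predictions of exactly 0 or 1):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    """Cross-entropy cost J(theta) for logistic regression.

    X: (m, n+1) design matrix whose first column is all ones (x_0 = 1)
    y: (m,) vector of 0/1 labels
    theta: (n+1,) parameter vector
    """
    m = len(y)
    h = sigmoid(X @ theta)  # h_theta(x^(i)) for every sample
    return -(1.0 / m) * np.sum(y * np.log(h) + (1.0 - y) * np.log(1.0 - h))
```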
Gradient descent can now be used to find the parameters that minimize the cost function. The update rule is:
$$
\theta_j = \theta_j - \alpha \frac{\partial}{\partial \theta_j}J(\theta)
$$
Taking the derivative gives:
$$
\theta_j = \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m}\left(h_{\theta}(x^{(i)})-y^{(i)}\right)x_j^{(i)}
$$
The derivation is as follows.
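A standard derivation, using $g'(z) = g(z)\,(1-g(z))$ for the Sigmoid and hence $\frac{\partial}{\partial \theta_j} h_{\theta}(x) = h_{\theta}(x)\,(1-h_{\theta}(x))\,x_j$:

$$
\begin{aligned}
\frac{\partial}{\partial \theta_j} J(\theta)
&= -\frac{1}{m} \sum_{i=1}^{m} \left[ \frac{y^{(i)}}{h_{\theta}(x^{(i)})} - \frac{1-y^{(i)}}{1-h_{\theta}(x^{(i)})} \right] \frac{\partial}{\partial \theta_j} h_{\theta}(x^{(i)}) \\
&= -\frac{1}{m} \sum_{i=1}^{m} \frac{y^{(i)} - h_{\theta}(x^{(i)})}{h_{\theta}(x^{(i)})\,(1-h_{\theta}(x^{(i)}))} \cdot h_{\theta}(x^{(i)})\,(1-h_{\theta}(x^{(i)}))\, x_j^{(i)} \\
&= \frac{1}{m} \sum_{i=1}^{m} \left( h_{\theta}(x^{(i)}) - y^{(i)} \right) x_j^{(i)}
\end{aligned}
$$

Substituting this gradient into $\theta_j = \theta_j - \alpha \frac{\partial}{\partial \theta_j}J(\theta)$ yields the update formula above.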
4. Simple Code Implementation
```python
import numpy as np

# Hand-rolled logistic regression trained with batch gradient descent
class LogisticRegression():
    def __init__(self, a=0.01, iter_num=1000):
        self.w = None             # parameter vector w
        self.a = a                # learning rate
        self.iter_num = iter_num  # number of gradient-descent iterations

    # Preprocess the data: prepend a column of ones and convert to matrices
    def data_process(self, X, Y):
        X = np.mat(np.c_[np.ones(len(X)), X])  # shape (m, dim)
        Y = np.mat(Y).reshape(-1, 1)           # shape (m, 1)
        return X, Y

    # Randomly initialize the parameter vector w
    def init_w(self, num_dim):
        return np.mat(np.random.random(num_dim)).reshape(-1, 1)  # shape (num_dim, 1)

    # Sigmoid function
    def sigmoid(self, z):
        return 1 / (1 + np.exp(-z))

    # Compute h_theta(x) for every sample
    def predict_train(self, X):
        z = X * self.w
        y = self.sigmoid(z)
        return y.reshape(-1, 1)  # shape (m, 1)

    # One gradient-descent update of w
    def update_w(self, y_true, y_pred, X):
        e = X.T * (y_pred - y_true)  # gradient summed over samples; the 1/m factor is absorbed into the learning rate
        new_w = self.w - self.a * e  # theta_j = theta_j - alpha * gradient
        return new_w.reshape(-1, 1)  # shape (num_dim, 1)

    # Fit by gradient descent (1000 iterations by default)
    def fit(self, X, Y):
        X, Y = self.data_process(X, Y)                           # preprocess X and Y
        self.w = self.init_w(num_dim=X.shape[1]).reshape(-1, 1)  # initialize the parameters
        for i in range(self.iter_num):                           # gradient-descent loop
            y_pred = self.predict_train(X).reshape(-1, 1)        # compute predictions
            old_w = self.w                                       # remember the previous w
            self.w = self.update_w(Y, y_pred, X)                 # update w
            if np.sum(np.abs(old_w - self.w)) < 1e-6:            # stop once w has effectively converged
                break

    # Predict probabilities for new data
    def predict(self, X):
        X = np.mat(np.c_[np.ones(len(X)), X])  # prepend the column of ones (bias term)
        return self.predict_train(X)           # predicted probabilities

    # Score function: returns accuracy
    def score(self, X, Y):
        y_pred = self.predict(X)  # predicted probabilities
        y_pred_0 = [1 if item >= 0.5 else 0 for item in y_pred]
        from sklearn.metrics import accuracy_score
        return accuracy_score(Y, y_pred_0)  # accuracy

if __name__ == '__main__':
    # Build a toy binary classification dataset
    X = np.random.randn(100, 9)
    Y = np.sum(X ** 2 + -20 * X, axis=1) + np.random.randn(100)
    Y = np.array([1 if y >= Y.mean() else 0 for y in Y])
    # Initialize the model
    clf = LogisticRegression(a=0.01, iter_num=1000)
    clf.fit(X, Y)           # fit
    print(clf.score(X, Y))  # accuracy on the training data
```
5. Implementing LogisticRegression with the sklearn Library
5.1 Basic Usage
```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer            # breast cancer dataset
from sklearn.linear_model import LogisticRegression as LR
from sklearn.model_selection import train_test_split

X = load_breast_cancer().data
Y = load_breast_cancer().target
X = pd.DataFrame(X)
xtrain, xtest, ytrain, ytest = train_test_split(X, Y, test_size=0.3, random_state=420)  # split the dataset

# Instantiate the model
lr = LR(penalty='l2',        # L2 regularization
        solver='liblinear',  # coordinate-descent based solver
        C=0.5,
        max_iter=1000,
        )
lr.fit(xtrain, ytrain)
print('test accuracy:', lr.score(xtest, ytest))
print('w:', lr.coef_)
```
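As a quick follow-up (reusing the `lr` and `xtest` objects above), the fitted estimator also exposes class probabilities, and `predict` simply applies the 0.5 threshold to the positive-class probability:

```python
# Probability of class 0 and class 1 for the first five test samples
print(lr.predict_proba(xtest[:5]))
# Hard labels obtained by thresholding the positive-class probability at 0.5
print(lr.predict(xtest[:5]))
```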
5.2 Grid Search for the Best Parameters
```python
# Grid search for the best hyperparameters
from sklearn.model_selection import GridSearchCV   # grid search
from sklearn.preprocessing import StandardScaler   # standardization

data = pd.DataFrame(load_breast_cancer().data, columns=load_breast_cancer().feature_names)
data['label'] = Y

# Split the dataset
Xtrain, Xtest, Ytrain, Ytest = train_test_split(X, Y, test_size=0.3, random_state=420)

# Standardize the training and test sets (the scaler is fit on the training set only)
std = StandardScaler().fit(Xtrain)
Xtrain_ = std.transform(Xtrain)
Xtest_ = std.transform(Xtest)

# Under the L2 penalty, search for the best C and solver
p = {
    'C': list(np.linspace(0.05, 1, 19)),
    'solver': ['liblinear', 'sag', 'newton-cg', 'lbfgs']  # the last three are gradient-based and do not support the L1 penalty
}

# Instantiate the model
model = LR(penalty='l2', max_iter=10000)
GS = GridSearchCV(model, p, cv=5)
GS.fit(Xtrain_, Ytrain)
print('best_score', GS.best_score_)
print('best_params', GS.best_params_)
'''
best_score 0.9874371859296482
best_params {'C': 0.10277777777777777, 'solver': 'liblinear'}
'''
```
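Since `GridSearchCV` refits the best parameter combination on the whole training split by default (`refit=True`), the tuned model can then be evaluated on the held-out standardized test set, for example:

```python
best_lr = GS.best_estimator_  # LR refit with the best C and solver
print('test accuracy:', best_lr.score(Xtest_, Ytest))
```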
5.3 Parameter Settings
class sklearn.linear_model.LogisticRegression(penalty='l2', *, dual=False, tol=0.0001, C=1.0, fit_intercept=True, intercept_scaling=1, class_weight=None, random_state=None, solver='lbfgs', max_iter=100, multi_class='auto', verbose=0, warm_start=False, n_jobs=None, l1_ratio=None) (https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html?highlight=logisticregression#sklearn.linear_model.LogisticRegression)
penalty : {'l1', 'l2', 'elasticnet', 'none'}, default='l2'
- 'none': no penalty is added;
- 'l2': add a L2 penalty term and it is the default choice;
- 'l1': add a L1 penalty term;
- 'elasticnet': both L1 and L2 penalty terms are added.
dual : bool, default=False
Dual or primal formulation. Dual formulation is only implemented for l2 penalty with liblinear solver. Prefer dual=False when n_samples > n_features.
tol : float, default=1e-4
Tolerance for stopping criteria.
C : float, default=1.0
Inverse of regularization strength; must be a positive float. Like in support vector machines, smaller values specify stronger regularization.
fit_intercept : bool, default=True
Specifies if a constant (a.k.a. bias or intercept) should be added to the decision function.
class_weight : dict or ‘balanced’, default=None
Weights associated with classes in the form {class_label: weight}. If not given, all classes are supposed to have weight one.
The "balanced" mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as n_samples / (n_classes * np.bincount(y)).
Note that these weights will be multiplied with sample_weight (passed through the fit method) if sample_weight is specified.
random_state : int, RandomState instance, default=None
Used when solver == 'sag', 'saga' or 'liblinear' to shuffle the data.
solver : {‘newton-cg’, ‘lbfgs’, ‘liblinear’, ‘sag’, ‘saga’}, default=’lbfgs’
Algorithm to use in the optimization problem. Default is ‘lbfgs’. To choose a solver, you might want to consider the following aspects:
- For small datasets, ‘liblinear’ is a good choice, whereas ‘sag’ and ‘saga’ are faster for large ones;
- For multiclass problems, only ‘newton-cg’, ‘sag’, ‘saga’ and ‘lbfgs’ handle multinomial loss;
- ‘liblinear’ is limited to one-versus-rest schemes.
Warning: The choice of the algorithm depends on the penalty chosen: Supported penalties by solver:
- ‘newton-cg’ - [‘l2’, ‘none’]
- ‘lbfgs’ - [‘l2’, ‘none’]
- ‘liblinear’ - [‘l1’, ‘l2’]
- ‘sag’ - [‘l2’, ‘none’]
- ‘saga’ - [‘elasticnet’, ‘l1’, ‘l2’, ‘none’]
Note : ‘sag’ and ‘saga’ fast convergence is only guaranteed on features with approximately the same scale.
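A small illustration of the penalty/solver compatibility listed above (a sketch reusing `LR`, `Xtrain_`, and `Ytrain` from section 5.2): 'elasticnet' works only with the 'saga' solver and requires `l1_ratio`, while an unsupported combination raises a `ValueError`.

```python
# 'elasticnet' penalty: only the 'saga' solver supports it, and l1_ratio must be given
enet_lr = LR(penalty='elasticnet', solver='saga', l1_ratio=0.5, max_iter=10000)
enet_lr.fit(Xtrain_, Ytrain)

# An unsupported combination, e.g. L1 with 'lbfgs', raises a ValueError:
# LR(penalty='l1', solver='lbfgs').fit(Xtrain_, Ytrain)
```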
max_iter : int, default=100
Maximum number of iterations taken for the solvers to converge.
multi_class : {‘auto’, ‘ovr’, ‘multinomial’}, default=’auto’
If the option chosen is ‘ovr’, then a binary problem is fit for each label. For ‘multinomial’ the loss minimised is the multinomial loss fit across the entire probability distribution, even when the data is binary. ‘multinomial’ is unavailable when solver=’liblinear’. ‘auto’ selects ‘ovr’ if the data is binary, or if solver=’liblinear’, and otherwise selects ‘multinomial’.
These are personal study notes only; please contact me for removal if anything here infringes.