1. Relationship and differences between logistic regression and linear regression
2. The principle of logistic regression
3. Deriving and optimizing the logistic regression loss function
4. Regularization and model evaluation metrics
5. Strengths and weaknesses of logistic regression
6. Handling class imbalance
7. Using sklearn
Appendix: code
(If you spot any mistakes, corrections are appreciated!)
1. Relationship and differences between logistic regression and linear regression
Relationship: taking the logarithm of the output $y$ as the target that the linear model approximates, i.e. $\ln y = w^Tx+b$, gives "log-linear regression"; logistic regression likewise applies a link function, the log-odds $\ln\frac{y}{1-y}$, to the output of a linear model. Both are still linear regression in form, but they realize a nonlinear mapping from the input space to the output space, and both are special cases of the generalized linear model.
Difference: logistic regression solves classification problems, while linear regression solves regression problems.
2. The principle of logistic regression
Logit regression (logistic regression): it uses the prediction of a linear regression model to approximate the log-odds of the true label.
The linear regression model is
$$y = w^Tx+b$$
Composing it with the sigmoid function
$$y=\frac{1}{1+e^{-z}},\quad z=w^Tx+b$$
yields
$$y=\frac{1}{1+e^{-(w^Tx+b)}}$$
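As a minimal sketch of this mapping (the names `w`, `b`, and `x` below are illustrative, not from the original):

```python
import numpy as np

def sigmoid(z):
    # Squash a real-valued score z = w^T x + b into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# A score of 0 lands exactly on the decision boundary (probability 0.5)
w = np.array([2.0, -1.0])
b = 1.0
x = np.array([1.0, 2.5])
print(sigmoid(w @ x + b))  # estimated probability of the positive class
```

The output can be read as $P(y=1\mid x)$; thresholding it at 0.5 gives the predicted class.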
3. Deriving and optimizing the logistic regression loss function
Logistic regression typically classifies into the two labels 0 and 1, so the probability is
$$p(y|x;\theta)=\phi(z)^y\cdot(1-\phi(z))^{(1-y)}$$
where
$$z=\theta^Tx$$
(with the bias $b$ absorbed into $\theta$ via a constant feature).
The likelihood function is then, writing $h_\theta(x^{(i)})=\phi(\theta^Tx^{(i)})$,
$$L(\theta)=\prod_{i=1}^m(h_\theta(x^{(i)}))^{y^{(i)}}(1-h_\theta(x^{(i)}))^{1-y^{(i)}}$$
and the loss function is the negative of the log-likelihood:
$$J(\theta)=-\ln L(\theta)=-\sum_{i=1}^m\left(y^{(i)}\log(h_\theta(x^{(i)}))+(1-y^{(i)})\log(1-h_\theta(x^{(i)}))\right)$$
The gradient of the log-likelihood $\ln L(\theta)$ is
$$\frac{\partial \ln L(\theta)}{\partial\theta_j}=\sum_{i=1}^m\left(\frac{y^{(i)}}{h_\theta(x^{(i)})}-\frac{1-y^{(i)}}{1-h_\theta(x^{(i)})}\right)\cdot\frac{\partial h_\theta(x^{(i)})}{\partial\theta_j}$$
$$=\sum_{i=1}^m\left(\frac{y^{(i)}}{g(\theta^Tx^{(i)})}-\frac{1-y^{(i)}}{1-g(\theta^Tx^{(i)})}\right)\cdot\frac{\partial g(\theta^Tx^{(i)})}{\partial\theta_j}$$
$$=\sum_{i=1}^m\left(\frac{y^{(i)}}{g(\theta^Tx^{(i)})}-\frac{1-y^{(i)}}{1-g(\theta^Tx^{(i)})}\right)\cdot g(\theta^Tx^{(i)})(1-g(\theta^Tx^{(i)}))\cdot\frac{\partial \theta^Tx^{(i)}}{\partial\theta_j}$$
$$=\sum_{i=1}^m\left(y^{(i)}(1-g(\theta^Tx^{(i)}))-(1-y^{(i)})\cdot g(\theta^Tx^{(i)})\right)\cdot x_j^{(i)}$$
$$=\sum_{i=1}^m\left(y^{(i)}-h_\theta(x^{(i)})\right)\cdot x_j^{(i)}$$
Thus the logistic regression parameter $\theta$ is solved iteratively (gradient ascent, stepping along the positive gradient). Batch update:
$$\theta_j=\theta_j+\alpha\sum_{i=1}^m(y^{(i)}-h_\theta(x^{(i)}))\cdot x_j^{(i)}$$
Stochastic update, one sample $i$ at a time:
$$\theta_j=\theta_j+\alpha(y^{(i)}-h_\theta(x^{(i)}))\cdot x_j^{(i)}$$
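The derived gradient $\sum_i(y^{(i)}-h_\theta(x^{(i)}))x^{(i)}$ can be checked numerically against finite differences of the log-likelihood. A sketch on made-up data (`X`, `y`, `theta` are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(theta, X, y):
    h = sigmoid(X @ theta)
    return np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = (rng.random(20) < 0.5).astype(float)
theta = rng.normal(size=3)

# Analytic gradient from the derivation: sum_i (y_i - h_i) x_i
analytic = X.T @ (y - sigmoid(X @ theta))

# Central finite differences, one coordinate at a time
eps = 1e-6
numeric = np.array([
    (log_likelihood(theta + eps * e, X, y)
     - log_likelihood(theta - eps * e, X, y)) / (2 * eps)
    for e in np.eye(3)
])
print(np.max(np.abs(analytic - numeric)))  # should be near zero
```

Agreement between the two confirms the chain of equalities above.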
4. Regularization and model evaluation metrics
To combat overfitting, L1 and L2 regularization are commonly used.
L1:
$$J(\theta)=-\ln L(\theta)+\alpha|\theta|$$
L2:
$$J(\theta)=-\ln L(\theta)+\frac{1}{2}\alpha|\theta|^2$$
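The effect of the L2 term on training can be seen directly: differentiating the penalty adds a $-\alpha\theta$ shrinkage term to the ascent update. A numpy-only sketch on synthetic data (all names here, including the penalty weight `lam`, are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit(X, y, lam, alpha=0.01, iters=5000):
    # Gradient ascent on the L2-penalized log-likelihood:
    # theta <- theta + alpha * (X^T (y - h) - lam * theta)
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        theta += alpha * (X.T @ (y - sigmoid(X @ theta)) - lam * theta)
    return theta

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = (X @ np.array([2.0, -1.0, 0.5]) > 0).astype(float)

theta_weak = fit(X, y, lam=0.01)    # weak regularization
theta_strong = fit(X, y, lam=10.0)  # strong regularization
print(np.linalg.norm(theta_weak), np.linalg.norm(theta_strong))
```

The stronger penalty pulls the weight vector toward zero, trading some fit for lower variance.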
Model evaluation metrics: ROC and AUC.
The y-axis of the ROC curve is the true positive rate (TPR), and the x-axis is the false positive rate (FPR):
$$TPR=\frac{TP}{TP+FN},\quad FPR=\frac{FP}{TN+FP}$$
The area under the ROC curve is the AUC; the larger the area, the better the learner's performance.
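For concreteness, the two rates computed from an illustrative confusion matrix (the counts below are made up):

```python
# Hypothetical confusion-matrix counts
TP, FN, FP, TN = 40, 10, 5, 45

TPR = TP / (TP + FN)  # true positive rate, y-axis of the ROC curve
FPR = FP / (TN + FP)  # false positive rate, x-axis of the ROC curve
print(TPR, FPR)  # 0.8 0.1
```

Sweeping the classification threshold traces out one (FPR, TPR) point per threshold; those points form the ROC curve.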
5. Strengths and weaknesses of logistic regression
Strengths: logistic regression models the class probability directly, with no need to assume a data distribution in advance, avoiding the problems an inaccurate distributional assumption would cause; it yields not just a predicted class but also an approximate probability; and its loss is a convex function differentiable to any order, which makes finding the optimum convenient.
Weaknesses: outliers can heavily distort the model, it cannot handle missing values, and its classification accuracy is often mediocre.
6. Handling class imbalance
A basic strategy for learning under class imbalance is "rescaling".
Its implementations fall roughly into three categories. The first "undersamples" the negative (majority) examples in the training set, removing some negatives until the positive and negative counts are close, and then trains. The second "oversamples" the positive (minority) examples, adding positives until the classes are balanced, and then trains. The third trains directly on the original data, but at prediction time folds the observed positive-to-negative ratio (the observed odds $\frac{m^+}{m^-}$) into the decision; this is called "threshold-moving".
"Rescaling" is also the basis of cost-sensitive learning: replace $m^-/m^+$ with $cost^+/cost^-$, where $cost^+$ is the cost of misclassifying a positive as a negative and $cost^-$ the cost of misclassifying a negative as a positive.
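Threshold-moving can be sketched in a few lines (the counts and the predicted probability below are made up): the classifier's predicted odds are rescaled by the observed negative-to-positive ratio, so the effective decision threshold becomes the base rate instead of 0.5.

```python
# Hypothetical class counts and prediction
m_pos, m_neg = 100, 900  # observed positives and negatives in training data
y_hat = 0.2              # classifier's predicted probability of the positive class

odds = y_hat / (1 - y_hat)
rescaled_odds = odds * m_neg / m_pos  # rescaling / threshold-moving
predict_positive = rescaled_odds > 1
# A plain 0.5 threshold would predict negative; after rescaling, 0.2 exceeds
# the observed base rate m_pos/(m_pos+m_neg) = 0.1, so we predict positive.
print(predict_positive)  # True
```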
7. Using sklearn
Constructor:
`LogisticRegression(penalty='l2', dual=False, tol=0.0001, C=1.0, fit_intercept=True, intercept_scaling=1, class_weight=None, random_state=None, solver='liblinear', max_iter=100, multi_class='ovr', verbose=0, warm_start=False, n_jobs=1)`
Usage example:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
import numpy as np
data = np.loadtxt("data.txt", delimiter=",")
train_x, test_x, train_y, test_y = train_test_split(data[:, 0:2], data[:, 2], test_size=0.3)
model = LogisticRegression()
print(train_x.shape, train_y.shape, test_x.shape, test_y.shape)
model.fit(train_x, train_y)
print(len(test_y[model.predict(test_x) == test_y]))  # number of correct predictions
Appendix: code
Logistic regression code:

import matplotlib.pyplot as plt
import numpy as np

'''Sigmoid function,
applied to the linear score y = x0*w0 + x1*w1 + ... + xn*wn
'''
def sigmoid(x):
    return 1.0 / (1 + np.exp(-x))

'''
Train the logistic regression model
'''
def train_logRegres(train_x, train_y, opts):
    """
    :param train_x: training samples (inputs)
    :param train_y: training labels (outputs)
    :param opts: parameters: 'alpha' is the step size, 'maxIter' the number of iterations
    :return: the learned weights
    """
    numSamples, numFeatures = np.shape(train_x)
    alpha = opts['alpha']      # step size
    maxIter = opts['maxIter']  # number of iterations
    # initialize all weights to 1
    weights = np.ones((numFeatures, 1))
    for k in range(maxIter):
        output = sigmoid(train_x.dot(weights))
        diff = train_y - output
        # batch gradient ascent: theta = theta + alpha * X^T (y - h)
        weights = weights + np.dot(alpha * train_x.T, diff)
    return weights

'''Test the logistic regression model'''
def test_LogRegres(weights, test_x, test_y):
    numSamples, numFeatures = np.shape(test_x)
    matchCount = 0
    for i in range(numSamples):
        predict = sigmoid(np.dot(test_x[i, :], weights))[0] > 0.5
        if predict == bool(test_y[i, 0]):
            matchCount += 1
    accuracy = float(matchCount) / numSamples
    return accuracy

'''Load the data, e.g.:
-0.017612,14.053064,0
-1.395634,4.662541,1
-0.752157,6.538620,0
-1.322371,7.152853,0
0.423363,11.054677,0
0.406704,7.067335,1
0.667394,12.741452,0
-2.460150,6.866805,1
0.569411,9.548755,0
'''
def loadFile():
    return np.loadtxt("data.txt", delimiter=",")

'''Run logistic regression end to end'''
def logRegresMain():
    print("step 1: loading data...")
    data = loadFile()
    m = data.shape[0] * 2 // 3
    train_x, train_y = data[:m, :2], data[:m, 2:]
    test_x, test_y = data[m:, :2], data[m:, 2:]
    print("step 2: training...")
    alpha = 0.0000001
    maxIter = 200000
    opts = {'alpha': alpha, 'maxIter': maxIter}
    optimalWeights = train_logRegres(train_x, train_y, opts)
    print("weight", optimalWeights)
    ## step 3: testing
    print("step 3: testing...")
    accuracy = test_LogRegres(optimalWeights, test_x, test_y)
    ## step 4: show the result
    print("step 4: show the result...")
    print('The classify accuracy is: %.3f%%' % (accuracy * 100))

if __name__ == '__main__':
    logRegresMain()