The explanation of logistic regression in 《统计学习方法》 (Statistical Learning Methods) comes down to two formulas, which are not very intuitive; I personally found it hard to see how the model actually performs classification. After taking Andrew Ng's deep learning course, where logistic regression is explained in a very accessible way, I am organizing the material here as notes.
1 Logistic regression is a classification model
The name "regression" is purely historical, so don't let it bother you. Since this is a classification model, assume the following setup:
The data are $T=\{(x_i,y_i),\,i=1,\dots,n\}$ with $x_i\in R^d$; in a binary classification problem $y\in\{0,1\}$. Now consider the following linearly separable example:
The simplest model is to fit a straight line that separates the two classes. In this problem $f(x^0,x^1)=x^0+x^1-1=0$ (the red line) is a good decision boundary. To classify a sample $x=(x^0,x^1)^T$: if $x^0+x^1-1>0$, predict $y=1$; otherwise predict $y=0$.
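As a quick illustration, here is a minimal sketch of classifying by the sign of the linear function; the boundary is the one above, but the sample points are made up for illustration:

```python
import numpy as np

# Decision rule for the boundary x0 + x1 - 1 = 0:
# predict y = 1 on the positive side, y = 0 otherwise.
def classify(x):
    x0, x1 = x
    return 1 if x0 + x1 - 1 > 0 else 0

samples = np.array([[2.0, 1.5],   # far on the positive side
                    [0.3, 0.2],   # on the negative side
                    [0.1, 0.1]])  # far on the negative side
print([classify(x) for x in samples])  # expected: [1, 0, 0]
```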
The problem is that points farther from the decision boundary should be classified with higher confidence, whereas with this linear rule every point on a given side of the boundary produces exactly the same output.
2 The sigmoid function
To address this, we make a small modification: instead of producing just two values, map $f_\theta(x)$ into the interval $(0,1)$. The sigmoid function does exactly this.
$$sigmoid(z) = \frac{1}{1+\exp(-z)}$$
The sigmoid function maps $z\ge 0$ into $[0.5,1)$, approaching 1 as $z$ grows, and maps $z\le 0$ into $(0,0.5]$, approaching 0 as $z$ decreases.
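A tiny sketch (the input values are arbitrary) confirming this mapping behaviour:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Large positive inputs approach 1, large negative inputs approach 0,
# and z = 0 maps exactly to 0.5.
for z in [-5.0, -1.0, 0.0, 1.0, 5.0]:
    print(f"sigmoid({z:+.1f}) = {sigmoid(z):.4f}")
```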
The adjusted hypothesis function becomes:
$$f_\theta(x)= \frac{1}{1+\exp\big(-(\theta_0x^0+\theta_1x^1)\big)}$$
If we used least squares, the model would be:
$$\arg\min_{\theta} \quad loss(\theta)= \frac{1}{n}\sum_{i=1}^{n}\big(y_i-f_\theta(x_i)\big)^2$$
Because the sigmoid is nonlinear, this $loss(\theta)$ is non-convex and has many local minima, so the objective is hard to optimize. To fix this, we switch to a different objective function.
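A minimal numerical sketch of this claim, using made-up one-dimensional data: sweep a single parameter $\theta$ and inspect the second differences of each loss along the sweep; negative values indicate non-convex regions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Made-up 1-D data: negative x -> label 0, positive x -> label 1.
x = np.array([-2.0, -1.0, 1.5, 3.0])
y = np.array([0.0, 0.0, 1.0, 1.0])

thetas = np.linspace(-6, 6, 481)
sq_loss  = np.array([np.mean((y - sigmoid(t * x)) ** 2) for t in thetas])
log_loss = np.array([-np.mean(y * np.log(sigmoid(t * x))
                              + (1 - y) * np.log(1 - sigmoid(t * x)))
                     for t in thetas])

# Second differences approximate the curvature along the sweep.
print("squared loss has negative curvature somewhere:", np.any(np.diff(sq_loss, 2) < -1e-9))
print("log loss has negative curvature somewhere:   ", np.any(np.diff(log_loss, 2) < -1e-9))
# Expected: True for the squared loss, False for the log loss.
```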
3 logistic loss
The non-convexity comes from the exponential inside the sigmoid; to cancel its effect we introduce a logarithm, giving a new objective function:
$$loss(\theta)=\begin{cases} -\log\big(f_\theta(x_i)\big) & \text{if } y_i=1 \\ -\log\big(1-f_\theta(x_i)\big) & \text{if } y_i=0 \end{cases}$$
The two branches are the usual $-\log$ curves: the worse the prediction, the larger the loss. Combining the two cases into a single general form:
$$loss(\theta) = -\frac{1}{n} \sum_{i=1}^{n}\Big[y_i \log\big(f_\theta(x_i)\big)+(1-y_i)\log\big(1-f_\theta(x_i)\big)\Big]$$
This is the objective function of logistic regression.
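A small sketch of this objective (the labels and predicted probabilities are made up), showing that confident correct predictions cost little while confident wrong ones cost a lot:

```python
import numpy as np

def logistic_loss(y, f):
    """Average cross-entropy between labels y and predicted probabilities f."""
    return -np.mean(y * np.log(f) + (1 - y) * np.log(1 - f))

y = np.array([1.0, 1.0, 0.0, 0.0])
good_pred = np.array([0.9, 0.8, 0.1, 0.2])   # close to the true labels
bad_pred  = np.array([0.2, 0.3, 0.8, 0.9])   # far from the true labels

print(logistic_loss(y, good_pred))  # small, roughly 0.16
print(logistic_loss(y, bad_pred))   # large, roughly 1.7
```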
4 Solving for the parameters
Once the objective is fixed, the next step is to estimate the model parameters $\theta$. We can use gradient descent, whose update rule is:
$$\theta = \theta - \alpha\,\frac{\partial\, loss(\theta)}{\partial \theta}$$
For every $\theta_j$, $j=1,\cdots,d$, first compute the derivative of the hypothesis:
$$\frac{\partial f_\theta(x_i)}{\partial \theta_j}=\left(\frac{1}{1+\exp(-x_i\theta)}\right)' = \big(f_\theta(x_i)\big)^2\exp(-x_i\theta)\,x_i^{j}$$
Since $\exp(-x_i\theta)=\frac{1-f_\theta(x_i)}{f_\theta(x_i)}$, this simplifies to $\frac{\partial f_\theta(x_i)}{\partial \theta_j}=f_\theta(x_i)\big(1-f_\theta(x_i)\big)x_i^j$, which is the form used in the derivation below.
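A quick finite-difference sketch (with an arbitrary sample and parameter vector) checking this simplified derivative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([1.2, -0.7])        # arbitrary sample
theta = np.array([0.5, -1.5])    # arbitrary parameters
j, eps = 0, 1e-6

f = sigmoid(x @ theta)
analytic = f * (1 - f) * x[j]    # f(1-f) * x^j

theta_plus = theta.copy();  theta_plus[j] += eps
theta_minus = theta.copy(); theta_minus[j] -= eps
numeric = (sigmoid(x @ theta_plus) - sigmoid(x @ theta_minus)) / (2 * eps)

print(analytic, numeric)  # the two values should agree to ~6 decimal places
```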
$$\begin{aligned} \frac{\partial\, loss(\theta)}{\partial \theta_j} & = -\frac{1}{n} \sum_{i=1}^{n}\Big[y_i \big(\log f_\theta(x_i)\big)'+(1-y_i)\big(\log(1-f_\theta(x_i))\big)'\Big]\\ & = -\frac{1}{n} \sum_{i=1}^{n}\Big[\frac{y_i }{f_\theta(x_i)}\big(f_\theta(x_i)\big)'+\frac{1-y_i }{1-f_\theta(x_i)}\big(1-f_\theta(x_i)\big)'\Big]\\ & = -\frac{1}{n} \sum_{i=1}^{n}\Big[\frac{y_i }{f_\theta(x_i)}\big(f_\theta(x_i)\big)^2\exp(-x_i\theta)\,x_i^{j}-\frac{1-y_i }{1-f_\theta(x_i)}\big(f_\theta(x_i)\big)^2\exp(-x_i\theta)\,x_i^{j}\Big]\\ & =-\frac{1}{n} \sum_{i=1}^{n}\Big[\frac{y_i }{f_\theta(x_i)}f_\theta(x_i)\big(1-f_\theta(x_i)\big)x_i^j-\frac{1-y_i }{1-f_\theta(x_i)}f_\theta(x_i)\big(1-f_\theta(x_i)\big)x_i^j\Big]\\ & =-\frac{1}{n} \sum_{i=1}^{n}\Big[y_ix_i^j-y_if_\theta(x_i)x_i^j-f_\theta(x_i)x_i^j+y_if_\theta(x_i)x_i^j\Big]\\ & = \frac{1}{n} \sum_{i=1}^{n}\big(f_\theta(x_i)-y_i\big)x_i^j \end{aligned}$$
So the gradient descent update for each coordinate is:
$$\theta_j = \theta_j - \alpha\,\frac{1}{n} \sum_{i=1}^{n}\big(f_\theta(x_i)-y_i\big)x_i^j$$
In matrix form, writing $x$ for the matrix that stacks all samples as rows (the $\frac{1}{n}$ factor can be absorbed into the learning rate $\alpha$):
$$\theta=\theta-\alpha\big(x^Tf_\theta(x)-x^Ty\big)$$
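Before writing the full training loop, here is a small sketch (random synthetic data and helper names of my own choosing) verifying the derived gradient $\frac{1}{n}\sum_i\big(f_\theta(x_i)-y_i\big)x_i^j$ against finite differences of the loss:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(theta, X, y):
    f = sigmoid(X @ theta)
    return -np.mean(y * np.log(f) + (1 - y) * np.log(1 - f))

def grad(theta, X, y):
    # Analytic gradient derived above: (1/n) * X^T (f - y)
    return X.T @ (sigmoid(X @ theta) - y) / len(y)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = (X @ np.array([1.0, -2.0, 0.5]) > 0).astype(float)
theta = rng.normal(size=3)

# Compare each analytic partial derivative with a central finite difference.
eps = 1e-6
numeric = np.zeros(3)
for j in range(3):
    e = np.zeros(3); e[j] = eps
    numeric[j] = (loss(theta + e, X, y) - loss(theta - e, X, y)) / (2 * eps)

print(np.max(np.abs(grad(theta, X, y) - numeric)))  # should be ~1e-9 or smaller
```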
5 Implementation
```python
import numpy as np
import matplotlib.pyplot as plt
import random

# Generate a linearly separable toy dataset: label 1 if x0 + x1 >= 1.
n = 200
x_train = np.array([[random.uniform(-1, 3) for i in range(n)],
                    [random.uniform(-1, 3) for i in range(n)]]).T
y_train = np.array([[1 if x + y >= 1 else 0 for x, y in x_train]]).T

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def LogisticReg(x_train, y_train):
    Max_iterations = 10000
    Threshold = 1e-3
    Learning_rate = 0.0003
    n, d = x_train.shape
    # Prepend a column of ones so that beta[0] acts as the intercept.
    design_mat = np.concatenate([np.ones((n, 1)), x_train], axis=1)
    beta = np.zeros((d + 1, 1))
    iterations = 0
    err = 1
    Err_record = []
    while err >= Threshold and iterations <= Max_iterations:
        # Gradient of the logistic loss: X^T (sigmoid(X beta) - y)
        gradient = design_mat.T @ sigmoid(design_mat @ beta) - design_mat.T @ y_train
        beta = beta - Learning_rate * gradient
        fit = sigmoid(design_mat @ beta)
        # Mean absolute error between fitted probabilities and labels,
        # used here as a simple stopping criterion.
        err = np.abs(fit - y_train).mean()
        iterations = iterations + 1
        Err_record.append(err)
    print(Learning_rate, iterations, err)
    print(beta)
    fig = plt.figure()
    plt.plot(range(len(Err_record)), Err_record)
    return beta, fit, Err_record

beta, fit, Err_record = LogisticReg(x_train, y_train)

# Plot the two classes and the learned decision boundary:
# beta0 + beta1*x0 + beta2*x1 = 0  =>  x1 = -beta0/beta2 - x0*beta1/beta2.
fig = plt.figure()
positive = x_train[(y_train == 1)[:, 0], :]
negative = x_train[(y_train == 0)[:, 0], :]
plt.scatter(positive[:, 0], positive[:, 1], color="red")
plt.scatter(negative[:, 0], negative[:, 1], color="green")
xtic = np.arange(-1, 2, 0.01)
plt.plot(xtic, -beta[0] / beta[2] - xtic * beta[1] / beta[2])

# Threshold the fitted probabilities at 0.5 to get class predictions.
predict = np.array([[1 if x > 0.5 else 0 for x in fit]]).T
accuracy = 1 - np.abs(predict - y_train).mean()
print(accuracy)
```
>> 1.00  # the linearly separable problem is classified with 100% accuracy
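As an optional sanity check (assuming scikit-learn is installed), the same data can be fed to sklearn's LogisticRegression; its coefficients won't match beta exactly because sklearn applies L2 regularization by default, but the resulting accuracy should be comparable:

```python
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression()
clf.fit(x_train, y_train.ravel())           # sklearn expects a 1-D label vector
print(clf.intercept_, clf.coef_)            # compare direction/signs with beta
print(clf.score(x_train, y_train.ravel()))  # should also be at (or near) 1.0
```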
6 Summary
After estimating $\hat{\theta}$, the decision function is:
$$f(x)=\frac{1}{1+\exp(-x^T\hat{\theta})}$$
f(x) is a continuous function, which is perhaps why the model is called "regression" (my guess). Then, given $x$, the conditional probability that $y=1$ is
$$p(y=1|x)=f(x)= \frac{1}{1+\exp(-x^T\hat{\theta})}= \frac{\exp(x^T\hat{\theta})}{1+\exp(x^T\hat{\theta})}$$
Since $y$ takes only two values,
$$p(y=0|x)=1-p(y=1|x)=\frac{1}{1+\exp(x^T\hat{\theta})}$$
These are exactly the two formulas given in 《统计学习方法》.
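A final sketch evaluating both posterior probabilities for a new point and checking that they sum to 1; the fitted parameters and sample below are hypothetical, following the intercept-column convention from the implementation above:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical fitted parameters [intercept, theta1, theta2] and a new sample.
beta_hat = np.array([-1.0, 1.0, 1.0])
x_new = np.array([1.0, 0.8, 0.9])             # leading 1 matches the intercept column

p1 = sigmoid(x_new @ beta_hat)                # p(y = 1 | x), first formula
p0 = 1.0 / (1.0 + np.exp(x_new @ beta_hat))   # p(y = 0 | x), second formula
print(p1, p0, p1 + p0)                        # the two probabilities sum to 1
```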