Logistic Regression
1.1 Purpose
An algorithm for solving classification problems (despite the "regression" in its name).
1.2 Classes
Goal: find a model that separates the two tumor classes, benign and malignant (i.e., fit a classification boundary).
Q: Why is the boundary a straight line rather than a curve?
A: It can be a curve in general, but logistic regression produces a linear boundary.
$x_1$: size
$x_2$: probability
$h(x,\theta)$: the "behind-the-scenes" function that actually determines whether the tumor is malignant or benign.
$$h(x) = w_1x_1 + w_2x_2 + b = \boldsymbol{\theta}^T\boldsymbol{x}$$
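As a side note, the compact form $\boldsymbol{\theta}^T\boldsymbol{x}$ absorbs the bias $b$ by prepending a constant 1 to the input. A minimal sketch of my own (the weight values are arbitrary examples):

```python
import numpy as np

w1, w2, b = 0.5, -1.2, 0.3           # example weights (arbitrary values)
x = np.array([2.0, 4.0])             # one sample with features x1, x2

h_explicit = w1 * x[0] + w2 * x[1] + b

theta = np.array([b, w1, w2])        # theta = [b, w1, w2]
x_aug = np.concatenate(([1.0], x))   # x     = [1, x1, x2]
h_compact = theta @ x_aug            # theta^T x

assert np.isclose(h_explicit, h_compact)
```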
We want to map this output into $[0,1]$ to make classification convenient.
Sigmoid function:
$$y = \frac{1}{1+e^{-x}}$$
Range: $(0,1)$
Derivative: $y'=y(1-y)$
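A quick numerical sanity check of these two properties (a sketch of my own, not part of the original notes):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Check the derivative identity y' = y(1 - y) against a central finite difference.
x = np.linspace(-5, 5, 11)
y = sigmoid(x)
eps = 1e-6
numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)
analytic = y * (1 - y)
assert np.allclose(numeric, analytic, atol=1e-8)
print(y.min(), y.max())  # outputs stay strictly inside (0, 1)
```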
1.3 Model selection:
$$h_\theta(x) = \frac{1}{1+e^{-\boldsymbol{\theta}^T\boldsymbol{x}}}$$
Determine the form of the loss function (goal: make the predictions as close as possible to the ground truth).
Derivation:
Take the maximum likelihood function and add a negative sign:
$$P(y=1|x;\theta)=h_\theta(x)$$
$$P(y=0|x;\theta)=1-h_\theta(x)$$
Trick: combine the two cases into a single expression:
$$L=\prod_{i=1}^m P^{(i)}=\prod_{i=1}^m \left\{h_\theta(x)^{y}\,[1-h_\theta(x)]^{1-y}\right\}$$
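To see why the exponent trick works: $h^y(1-h)^{1-y}$ reduces to $h$ when $y=1$ and to $1-h$ when $y=0$. A one-line check with an illustrative value of my own:

```python
import numpy as np

h = 0.8  # some predicted probability
for y in (0, 1):
    combined = h**y * (1 - h)**(1 - y)   # the single-expression form
    expected = h if y == 1 else 1 - h    # the two separate cases
    assert np.isclose(combined, expected)
```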
$$\log(L)=\sum_{i=1}^m \left\{y\log\big(h_\theta(x)\big)+(1-y)\log\big[1-h_\theta(x)\big]\right\}$$
$$Loss=-\log(L)=-\sum_{i=1}^m \left\{y\log\big(h_\theta(x)\big)+(1-y)\log\big[1-h_\theta(x)\big]\right\}$$
Note: this is the result for the binary case; with more classes it becomes the (general) cross-entropy.
Q: Why can't we use a non-convex loss?
A: Gradient-based methods such as SGD are only guaranteed to find the global optimum on a convex function; on a non-convex loss they can get stuck in local minima.
1.4 Derivation of the gradient:
Combining the form derived above (with the logs) and averaging over the $m$ samples gives the final cost function:
$$J(\theta) = -\frac{1}{m}\sum_{i=1}^m\left[y^{(i)} \cdot \log\big(h_\theta(x^{(i)})\big)+(1-y^{(i)}) \cdot \log\big(1-h_\theta(x^{(i)})\big)\right]$$
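A minimal NumPy sketch of this cost function (the names `cost` and `eps` are mine; the clipping guards against $\log(0)$):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y, eps=1e-12):
    """Binary cross-entropy J(theta); X carries a leading column of ones for the bias."""
    h = sigmoid(X @ theta)
    h = np.clip(h, eps, 1 - eps)  # avoid log(0)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

# Tiny example with made-up numbers:
X = np.array([[1.0, 0.5], [1.0, 2.0], [1.0, -1.0]])
y = np.array([1.0, 1.0, 0.0])
print(cost(np.zeros(2), X, y))  # log(2) ≈ 0.693 for an uninformed model
```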
Take the partial derivative $\frac{\partial J}{\partial \theta}$.
The first factor $y^{(i)}$ is the label value, so it carries through directly; then apply the chain rule to the log terms (the constant $\frac{1}{m}$ is dropped below, since it can be absorbed into the learning rate):
$$\frac{\partial J}{\partial \theta}=-\sum_{i=1}^m\left[y^{(i)}\cdot\frac{1}{h_\theta(x^{(i)})}\cdot \frac{\partial h_\theta(x^{(i)})}{\partial\theta}+(1-y^{(i)})\cdot\frac{-1}{1-h_\theta(x^{(i)})}\cdot\frac{\partial h_\theta(x^{(i)})}{\partial\theta}\right]$$
$$=-\sum_{i=1}^m\left[\frac{y^{(i)}}{h_\theta(x^{(i)})}-\frac{1-y^{(i)}}{1-h_\theta(x^{(i)})}\right]\cdot\frac{\partial h_\theta(x^{(i)})}{\partial\theta}$$
By the sigmoid derivative property $y'=y(1-y)$, combined with the model function $h_\theta(x) = \frac{1}{1+e^{-\boldsymbol{\theta}^T\boldsymbol{x}}}$, we obtain:
$$\frac{\partial h_\theta(x^{(i)})}{\partial\theta}=h_\theta(x^{(i)})\cdot\big[1-h_\theta(x^{(i)})\big]\cdot x^{(i)}$$
Note that the partial derivative here is taken with $\theta$ as the variable, so the trailing factor $x^{(i)}$ must not be forgotten.
Plugging this result into the derivative above and combining over a common denominator gives:
$$\frac{\partial J}{\partial\theta}=\sum_{i=1}^m\big(h_\theta(x^{(i)})-y^{(i)}\big)\cdot x^{(i)}$$
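A quick way to trust this closed-form gradient is to compare it against finite differences of $J$ on random data. A sketch of my own (all names and data are illustrative; it uses the un-averaged sum form to match the formula above):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def J(theta, X, y):
    h = np.clip(sigmoid(X @ theta), 1e-12, 1 - 1e-12)
    return -np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))

def grad(theta, X, y):
    return X.T @ (sigmoid(X @ theta) - y)   # sum_i (h - y) * x

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = (rng.random(50) < 0.5).astype(float)
theta = rng.normal(size=3)

# Central finite difference along each coordinate of theta.
eps = 1e-6
numeric = np.array([
    (J(theta + eps * e, X, y) - J(theta - eps * e, X, y)) / (2 * eps)
    for e in np.eye(3)
])
assert np.allclose(numeric, grad(theta, X, y), atol=1e-4)
```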
This gradient has the same form as the result for linear regression, whose gradient is:
$$\frac{\partial J}{\partial\theta}=\frac{1}{m}\cdot\sum_{i=1}^m\big(h_\theta(x^{(i)})-y^{(i)}\big)\cdot x^{(i)}$$
2. Multi-Class Case
- one vs all (one-vs-rest):
  model 1: split off the triangle class; the remaining two classes are treated as one
  model 2: …
  model 3: …
  Train 3 models in total.
  Advantage: only K models are needed.
  Disadvantage: the samples are imbalanced; each model mostly optimizes accuracy on the majority class, so the triangle class is trained poorly because its gradient contribution is small.
- one vs one:
  Train a model for every pair of classes.
  Advantage: each training run uses less data (only two classes at a time), so there is no imbalance problem.
  Disadvantage: far too many training runs: $C_K^2$ models.
- Cross entropy, for $k$ classes (a minimal sketch follows this list):
  $$Loss = -\sum_{j=1}^m\sum_{i=1}^k y_i^{(j)}\log\big(P_i^{(j)}\big)$$
  where $y_i^{(j)}$ is the one-hot label and $P_i^{(j)}$ the predicted probability of class $i$ for sample $j$.
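A minimal sketch of this $k$-class cross-entropy, assuming the probabilities $P$ come from a softmax over some scores (the function names and example logits are mine):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)   # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(P, y_onehot, eps=1e-12):
    # Loss = -sum_j sum_i y_i^(j) * log(P_i^(j))
    return -np.sum(y_onehot * np.log(np.clip(P, eps, 1.0)))

# Tiny example: m = 2 samples, k = 3 classes, made-up logits.
logits = np.array([[2.0, 0.5, -1.0],
                   [0.1, 0.2,  3.0]])
y_onehot = np.array([[1, 0, 0],
                     [0, 0, 1]], dtype=float)
print(cross_entropy(softmax(logits), y_onehot))
```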
3. Implementation from Scratch (python)
```python
import numpy as np
import matplotlib.pyplot as plt
import random

# Create the dataset: two Gaussian blobs, 100 negatives and 200 positives
data_0 = np.random.multivariate_normal(mean=[3,4], cov=[[3,0],[0,1]], size=100)
data_1 = np.random.multivariate_normal(mean=[7,6], cov=[[3,0],[0,2]], size=200)
y_0 = np.array([0]*100)
y_1 = np.array([1]*200)
data_x = np.vstack((data_0, data_1))
data_y = np.hstack((y_0, y_1))
plt.scatter(data_x[:,0], data_x[:,1], c=data_y)
plt.show()

# Shuffle the samples
data = list(zip(data_x, data_y))  # zip pairs each feature row with its label
random.shuffle(data)
data_x, data_y = zip(*data)
data_x = np.array(data_x)
data_y = np.array(data_y)

# Accuracy on the training set
def acc(y_pred):
    y_predict = np.around(y_pred)  # rounding trick: the default threshold is 0.5
    ## data_y: (300,)  y_predict: (300,1)
    correct_rate = np.mean(np.equal(y_predict, data_y[:,np.newaxis]))  # element-wise comparison
    return correct_rate

# sigmoid
def sigmoid(x):
    return np.array(1/(1+np.exp(-x)))

# Forward pass of the model
def predict(w, b):
    # data_x: (300,2) -- pay close attention to the shapes
    # w: (1,2)
    # b: (1,1)
    hidden = np.dot(data_x, w.T) + b
    return sigmoid(hidden)

# Backward pass: gradients of the loss
def gradients(y_pred):
    # data_y: (300,) -> (300,1) via data_y[:,np.newaxis]
    # y_pred: (300,1)
    # data_x: (300,2)
    grad_w = np.sum((y_pred - data_y[:,np.newaxis]) * data_x, axis=0, keepdims=True)
    grad_b = np.sum((y_pred - data_y[:,np.newaxis]), axis=0, keepdims=True)
    assert grad_w.shape == (1,2)
    assert grad_b.shape == (1,1)
    return grad_w, grad_b

learning_rate = 1e-2  # too large and exp overflows; too small and training stalls
iterations = 1000
w = np.random.rand(1,2)
b = np.random.rand(1,1)

# Visualization
from IPython import display
for i in range(iterations):
    y_pred = predict(w, b)
    grad_w, grad_b = gradients(y_pred)
    w = w - learning_rate * grad_w
    b = b - learning_rate * grad_b
    if i % 10 == 0:
        w1 = w[0][0]
        w2 = w[0][1]
        b1 = b[0][0]  # read out the scalar without overwriting the parameter b
        line_x = np.linspace(-10, 20, 100)
        # decision boundary: theta^T x = 0, i.e. w1*x1 + w2*x2 + b = 0
        line_y = (-b1 - w1*line_x) / w2
        display.clear_output(wait=True)  # clear the previous frame before redrawing
        plt.plot(line_x, line_y)
        plt.xlim((-5, 15))  # fix the axes so the plot doesn't jump around
        plt.ylim((0, 10))
        plt.scatter(data_x[:,0], data_x[:,1], c=data_y)
        plt.title(f'iteration: {i}, acc: {round(acc(y_pred),3)}')
        plt.pause(0.1)
plt.show()
```
Boundary-plotting formula: the decision boundary is where $h_\theta(x)=0.5$, i.e. $w_1x_1+w_2x_2+b=0$, which gives
$$y= \frac{-b-w_1 x}{w_2}$$
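As an optional cross-check (assuming scikit-learn is installed, and continuing from the script above), the learned parameters can be compared against its `LogisticRegression` fit on the same data:

```python
from sklearn.linear_model import LogisticRegression

# The fitted coefficients should roughly match w and b from the loop above
# (up to optimizer and regularization differences).
clf = LogisticRegression().fit(data_x, data_y)
print(clf.coef_, clf.intercept_)
```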