Theory
The logistic function:

$$g(z)=\frac{1}{1+e^{-z}}\tag{1}$$
Visualizing the logistic function:
```python
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(-10, 10, 1000)  # the number of points must be an int, not 1e6
y = 1 / (1 + np.exp(-x))        # logistic function, Equation (1)
plt.plot(x, y)
plt.show()
```
The hypothesis function:
$$h_{\theta}(x)=g(\theta^TX)=\frac{1}{1+e^{-\theta^TX}}\tag{2}$$
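As a quick numerical reading of Equation (2), here is a minimal sketch; the parameter values and the sample below are made up for illustration:

```python
import numpy as np

def sigmoid(z):
    # Equation (1): g(z) = 1 / (1 + e^{-z})
    return 1 / (1 + np.exp(-z))

# Hypothetical parameters; the leading 1 in x is the intercept term
theta = np.array([-1.0, 0.5, 0.5])
x = np.array([1.0, 2.0, 3.0])

h = sigmoid(theta @ x)  # Equation (2): h_theta(x) = g(theta^T x)
print(h)                # ≈ 0.82, read as P(y=1 | x; theta)
```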
The cost function:
$$J(\theta)=\frac{1}{m}\sum_{i=1}^m \mathrm{Cost}(h_{\theta}(x^{(i)}),y^{(i)})\tag{3}$$
where
$$\mathrm{Cost}(h_{\theta}(x^{(i)}),y^{(i)}) = \begin{cases} -\log(h_{\theta}(x)), & \text{if } y=1 \\ -\log(1-h_{\theta}(x)), & \text{if } y=0 \end{cases}\tag{4}$$
$$J(\theta)=-\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log\left(h_\theta(x^{(i)})\right)+\left(1-y^{(i)}\right)\log\left(1-h_\theta(x^{(i)})\right)\right]\tag{5}$$
Plotting the two branches of the cost function:
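A minimal sketch, in the same style as the earlier visualization, to draw the curves for both cases of Equation (4):

```python
import matplotlib.pyplot as plt
import numpy as np

h = np.linspace(0.001, 0.999, 1000)  # predicted probability h_theta(x)
plt.plot(h, -np.log(h), label="y=1: -log(h)")
plt.plot(h, -np.log(1 - h), label="y=0: -log(1-h)")
plt.xlabel("h_theta(x)")
plt.ylabel("Cost")
plt.legend()
plt.show()
```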
From the plot, we can see:
- When $y=1$: if the prediction is $h_{\theta}(x)=1$, the cost is 0, which is what we want (the cost is minimal when the model predicts perfectly); the farther the prediction is from 1, the larger the cost, which is also what we want.
- Similarly, when $y=0$: the cost reaches its minimum when the prediction is 0, and grows as the prediction moves away from 0.
Deriving the cost function (via maximum likelihood estimation):
The hypothesis $h_{\theta}(x)$ gives the probability that the label is 1, so:
$$P(y=1|x;\theta)=h_{\theta}(x)\\ P(y=0|x;\theta)=1-h_{\theta}(x)\tag{6}$$
Combining Equation (6) into a single expression:
$$P(y|x;\theta)=h_{\theta}(x)^y\,(1-h_{\theta}(x))^{1-y}\tag{7}$$
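Substituting the two possible labels confirms that Equation (7) reproduces both cases of Equation (6):

$$y=1:\; P=h_{\theta}(x)^{1}(1-h_{\theta}(x))^{0}=h_{\theta}(x),\qquad y=0:\; P=h_{\theta}(x)^{0}(1-h_{\theta}(x))^{1}=1-h_{\theta}(x)$$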
The likelihood function is:
$$L(\theta)=\prod_{i=1}^m P(y^{(i)}|x^{(i)};\theta)=\prod_{i=1}^m\left(h_{\theta}(x^{(i)})\right)^{y^{(i)}}\left(1-h_{\theta}(x^{(i)})\right)^{1-y^{(i)}}\tag{8}$$
And the log-likelihood:
$$l(\theta)=\log(L(\theta))=\sum_{i=1}^{m}\left[y^{(i)}\log\left(h_\theta(x^{(i)})\right)+\left(1-y^{(i)}\right)\log\left(1-h_\theta(x^{(i)})\right)\right]\tag{9}$$
Maximum likelihood estimation picks the $\theta$ that maximizes the likelihood. Defining the cost function as $J(\theta)=-\frac{1}{m}l(\theta)$, maximizing $l(\theta)$ is therefore equivalent to minimizing $J(\theta)$.
The gradient descent update rule:
$$\theta_j=\theta_j-\alpha\frac{\partial}{\partial \theta_j}J(\theta)=\theta_j-\alpha\frac{1}{m}\sum_{i=1}^m\left(h_{\theta}(x^{(i)})-y^{(i)}\right)x_j^{(i)}\tag{10}$$
In matrix form:
$$\theta=\theta-\alpha\frac{1}{m}X^T\left(g(X\theta)-y\right)\tag{11}$$
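As a quick sanity check that Equations (10) and (11) agree, here is a small sketch; the data shapes and random seed below are arbitrary:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))        # design matrix: m=5 samples, n=3 features
y = rng.integers(0, 2, size=5)
theta = rng.normal(size=3)
m = len(y)

# Equation (11): vectorized gradient
grad_vec = X.T @ (sigmoid(X @ theta) - y) / m

# Equation (10): component-wise sums
grad_loop = np.array([
    sum((sigmoid(X[i] @ theta) - y[i]) * X[i, j] for i in range(m))
    for j in range(3)
]) / m

print(np.allclose(grad_vec, grad_loop))  # True
```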
The derivation is as follows:
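Using the sigmoid identity $g'(z)=g(z)(1-g(z))$, the derivative of the hypothesis is $\frac{\partial}{\partial\theta_j}h_\theta(x^{(i)})=h_\theta(x^{(i)})(1-h_\theta(x^{(i)}))\,x_j^{(i)}$. Differentiating Equation (5):

$$\frac{\partial}{\partial\theta_j}J(\theta)=-\frac{1}{m}\sum_{i=1}^m\left[\frac{y^{(i)}}{h_\theta(x^{(i)})}-\frac{1-y^{(i)}}{1-h_\theta(x^{(i)})}\right]h_\theta(x^{(i)})\left(1-h_\theta(x^{(i)})\right)x_j^{(i)}$$

$$=-\frac{1}{m}\sum_{i=1}^m\left[y^{(i)}\left(1-h_\theta(x^{(i)})\right)-\left(1-y^{(i)}\right)h_\theta(x^{(i)})\right]x_j^{(i)}=\frac{1}{m}\sum_{i=1}^m\left(h_\theta(x^{(i)})-y^{(i)}\right)x_j^{(i)}$$

Stacking all $m$ samples into the design matrix recovers the vectorized form in Equation (11).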
Logistic Regression with a Penalty Term
$$J(\theta)=\frac{1}{m}\sum_{i=1}^{m}\left[-y^{(i)}\log\left(h_\theta(x^{(i)})\right)-\left(1-y^{(i)}\right)\log\left(1-h_\theta(x^{(i)})\right)\right]+\frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2\tag{12}$$
Repeat until convergence:
$$\theta_0:=\theta_0-\alpha\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)})-y^{(i)}\right)x_0^{(i)}$$

$$\theta_j:=\theta_j-\alpha\left[\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)})-y^{(i)}\right)x_j^{(i)}+\frac{\lambda}{m}\theta_j\right],\qquad j=1,2,\dots,n$$
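A sketch of one regularized update step; the helper name `regularized_step` and the data shapes are assumptions for illustration. Note that the intercept $\theta_0$ is not penalized:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def regularized_step(theta, X, y, alpha, lam):
    # One gradient step on Equation (12); X includes a leading column of ones
    m = len(y)
    grad = X.T @ (sigmoid(X @ theta) - y) / m  # unregularized gradient, Equation (10)
    reg = (lam / m) * theta                    # penalty gradient, (lambda/m) * theta_j
    reg[0] = 0                                 # theta_0 is not regularized
    return theta - alpha * (grad + reg)
```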
Python Implementation
```python
import numpy as np

X = np.array([[1, 2], [3, 2], [1, 3], [2, 3], [3, 3], [3, 4],
              [10, 11], [9, 10], [12, 13], [14, 14], [13, 12]])
y = np.array([1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0])

n_samples, n_features = X.shape
X = np.concatenate((np.ones((n_samples, 1)), X), axis=1)  # prepend intercept column
y = y.reshape((n_samples, 1))

max_iter = int(1e4)  # maximum number of iterations
epsilon = 1e-4       # stop once theta changes by less than epsilon between iterations
theta = np.zeros((n_features + 1, 1))  # initialize theta
alpha = 0.0001       # learning rate

for i in range(max_iter):
    # Vectorized update, Equation (11): theta -= alpha/m * X^T (g(X theta) - y)
    theta_next = theta - alpha * X.T @ (1 / (1 + np.exp(-X @ theta)) - y) / n_samples
    if np.abs(theta - theta_next).sum() < epsilon:
        theta = theta_next
        print("converged")
        break
    theta = theta_next
else:
    print("reached max_iter, stopping.")
print(theta)
```