Suppose $\left\{x^{(1)},y^{(1)}\right\},\left\{x^{(2)},y^{(2)}\right\},\dots,\left\{x^{(m)},y^{(m)}\right\}$ are $m$ training samples, where each $x^{(i)}$ is an $n$-dimensional vector ($\forall i\in\{1,2,\dots,m\}$). Let $J(\theta)$ be the objective function, where $\theta$ is the parameter vector the model learns. For logistic regression, the objective is the log-likelihood (which gradient ascent maximizes):
$$\begin{aligned} J(\theta) &=\sum_{i=1}^{m}\left[y^{(i)}\log h_{\theta}(x^{(i)})+(1-y^{(i)})\log(1-h_{\theta}(x^{(i)}))\right] \\ &=\sum_{i=1}^{m}J_{i}(\theta) \end{aligned}$$
where $h_{\theta}(x^{(i)})=\frac{1}{1+e^{-\theta^{T}x^{(i)}}}$.
The gradient ascent update rule is:
$$\theta_{j}:=\theta_{j}+\eta\frac{\partial J(\theta)}{\partial\theta_{j}}$$
where $\eta$ is called the learning rate. By the chain rule,
$$\begin{aligned} \frac{\partial J(\theta)}{\partial\theta_{j}} &=\sum_{i=1}^{m}\frac{\partial J_{i}(\theta)}{\partial h_{\theta}(x^{(i)})}\times\frac{\partial h_{\theta}(x^{(i)})}{\partial(\theta^{T}x^{(i)})}\times\frac{\partial(\theta^{T}x^{(i)})}{\partial\theta_{j}} \\ &=\sum_{i=1}^{m}\left[y^{(i)}\frac{1}{h_{\theta}(x^{(i)})}+(y^{(i)}-1)\frac{1}{1-h_{\theta}(x^{(i)})}\right]\times h_{\theta}(x^{(i)})\left(1-h_{\theta}(x^{(i)})\right)\times x_{j}^{(i)} \\ &=\sum_{i=1}^{m}\left(y^{(i)}-h_{\theta}(x^{(i)})\right)x_{j}^{(i)} \end{aligned}$$
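The closed-form gradient derived above is easy to sanity-check numerically. The sketch below (the toy data, seed, and variable names are assumptions for illustration) compares $\sum_{i}\left(y^{(i)}-h_{\theta}(x^{(i)})\right)x_{j}^{(i)}$ against central finite differences of $J(\theta)$:

```python
import numpy as np

# Hypothetical toy data: m=20 samples, n=3 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = (rng.random(20) < 0.5).astype(float)
theta = rng.normal(size=3)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(theta):
    h = sigmoid(X @ theta)
    return np.sum(y * np.log(h) + (1.0 - y) * np.log(1.0 - h))

# Analytic gradient from the derivation: sum_i (y_i - h_i) * x_i
analytic = X.T @ (y - sigmoid(X @ theta))

# Numerical gradient via central differences, one component at a time
eps = 1e-6
numeric = np.zeros_like(theta)
for j in range(theta.size):
    d = np.zeros_like(theta)
    d[j] = eps
    numeric[j] = (log_likelihood(theta + d) - log_likelihood(theta - d)) / (2.0 * eps)
```

The two gradients should agree to several decimal places, which confirms the chain-rule computation.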
where $x_{j}^{(i)}$ denotes the $j$-th component of the input vector of the $i$-th training sample.
Hence
$$\theta_{j}:=\theta_{j}+\eta\sum_{i=1}^{m}\left(y^{(i)}-h_{\theta}(x^{(i)})\right)x_{j}^{(i)}$$
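The update rule above can be sketched as a vectorized NumPy loop. This is a minimal illustration, not a production implementation; the function name `gradient_ascent` and the toy dataset are assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_ascent(X, y, eta=0.1, n_iters=500):
    """Batch gradient ascent on the log-likelihood.
    X: (m, n) design matrix; y: (m,) labels in {0, 1}."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        h = sigmoid(X @ theta)               # h_theta(x^(i)) for all i at once
        theta = theta + eta * (X.T @ (y - h))  # theta_j += eta * sum_i (y_i - h_i) x_ij
    return theta

# Hypothetical usage on linearly separable 1-D data (first column is a bias term).
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
theta = gradient_ascent(X, y)
preds = (sigmoid(X @ theta) >= 0.5).astype(float)
```

Note that `X.T @ (y - h)` computes all $n$ partial derivatives in one matrix product, so every component of $\theta$ is updated simultaneously, matching the per-component rule above.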
Why take the logarithm before running gradient ascent:
1. It reduces computation. Taking the log turns the product form of the likelihood into a sum, which is also more convenient to differentiate.
2. It prevents floating-point underflow. On a large dataset, a product of many probabilities tends toward 0, which is very bad for subsequent computation.
3. The logarithm is monotonically increasing, so maximizing the log-likelihood maximizes the likelihood itself.
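The underflow point can be demonstrated directly. In this sketch (the random per-sample probabilities are an assumption), the raw product of a few thousand probabilities underflows to exactly 0.0 in float64, while the equivalent sum of logs stays finite:

```python
import numpy as np

# Hypothetical per-sample probabilities, each strictly inside (0, 1).
rng = np.random.default_rng(0)
p = rng.uniform(0.1, 0.9, size=5000)

likelihood = np.prod(p)             # product of 5000 values < 1: underflows to 0.0
log_likelihood = np.sum(np.log(p))  # same quantity in log space: finite and usable
```

Since float64 cannot represent numbers below roughly 1e-308, the product form is useless here, but gradient computations on `log_likelihood` remain well-behaved.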