Starting from the loss function defined for logistic regression in Andrew Ng's machine learning course:
$$
J(\theta)=-\frac{1}{m} \sum_{i=1}^{m}\left[y^{(i)} \log \left(h_{\theta}\left(x^{(i)}\right)\right)+\left(1-y^{(i)}\right) \log \left(1-h_{\theta}\left(x^{(i)}\right)\right)\right]
$$
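This loss can be computed directly. Below is a minimal sketch (assuming NumPy; the function names are illustrative, not from the course materials):

```python
import numpy as np

def sigmoid(z):
    """The logistic function sigma(z) = 1 / (1 + e^-z)."""
    return 1.0 / (1.0 + np.exp(-z))

def logistic_loss(theta, X, y):
    """J(theta) = -1/m * sum[y*log(h) + (1-y)*log(1-h)], with h = sigmoid(X @ theta)."""
    h = sigmoid(X @ theta)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))
```

For example, at `theta = 0` every prediction is 0.5, so the loss is `log 2` regardless of the labels.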
Taking the partial derivative with respect to each parameter $\theta_{j}$, the derivation proceeds as follows:
$$
\begin{aligned}
\frac{\partial}{\partial \theta_{j}} J(\theta) &=\frac{\partial}{\partial \theta_{j}} \frac{-1}{m} \sum_{i=1}^{m}\left[y^{(i)} \log \left(h_{\theta}\left(x^{(i)}\right)\right)+\left(1-y^{(i)}\right) \log \left(1-h_{\theta}\left(x^{(i)}\right)\right)\right] \\
&=-\frac{1}{m} \sum_{i=1}^{m}\left[y^{(i)} \frac{\partial}{\partial \theta_{j}} \log \left(h_{\theta}\left(x^{(i)}\right)\right)+\left(1-y^{(i)}\right) \frac{\partial}{\partial \theta_{j}} \log \left(1-h_{\theta}\left(x^{(i)}\right)\right)\right] \\
&=-\frac{1}{m} \sum_{i=1}^{m}\left[\frac{y^{(i)} \frac{\partial}{\partial \theta_{j}} h_{\theta}\left(x^{(i)}\right)}{h_{\theta}\left(x^{(i)}\right)}+\frac{\left(1-y^{(i)}\right) \frac{\partial}{\partial \theta_{j}}\left(1-h_{\theta}\left(x^{(i)}\right)\right)}{1-h_{\theta}\left(x^{(i)}\right)}\right] \\
&=-\frac{1}{m} \sum_{i=1}^{m}\left[\frac{y^{(i)} \frac{\partial}{\partial \theta_{j}} \sigma\left(\theta^{T} x^{(i)}\right)}{h_{\theta}\left(x^{(i)}\right)}+\frac{\left(1-y^{(i)}\right) \frac{\partial}{\partial \theta_{j}}\left(1-\sigma\left(\theta^{T} x^{(i)}\right)\right)}{1-h_{\theta}\left(x^{(i)}\right)}\right] \\
&=-\frac{1}{m} \sum_{i=1}^{m}\left[\frac{y^{(i)} \sigma\left(\theta^{T} x^{(i)}\right)\left(1-\sigma\left(\theta^{T} x^{(i)}\right)\right) \frac{\partial}{\partial \theta_{j}} \theta^{T} x^{(i)}}{h_{\theta}\left(x^{(i)}\right)}+\frac{-\left(1-y^{(i)}\right) \sigma\left(\theta^{T} x^{(i)}\right)\left(1-\sigma\left(\theta^{T} x^{(i)}\right)\right) \frac{\partial}{\partial \theta_{j}} \theta^{T} x^{(i)}}{1-h_{\theta}\left(x^{(i)}\right)}\right] \\
&=-\frac{1}{m} \sum_{i=1}^{m}\left[\frac{y^{(i)} h_{\theta}\left(x^{(i)}\right)\left(1-h_{\theta}\left(x^{(i)}\right)\right) \frac{\partial}{\partial \theta_{j}} \theta^{T} x^{(i)}}{h_{\theta}\left(x^{(i)}\right)}-\frac{\left(1-y^{(i)}\right) h_{\theta}\left(x^{(i)}\right)\left(1-h_{\theta}\left(x^{(i)}\right)\right) \frac{\partial}{\partial \theta_{j}} \theta^{T} x^{(i)}}{1-h_{\theta}\left(x^{(i)}\right)}\right] \\
&=-\frac{1}{m} \sum_{i=1}^{m}\left[y^{(i)}\left(1-h_{\theta}\left(x^{(i)}\right)\right) x_{j}^{(i)}-\left(1-y^{(i)}\right) h_{\theta}\left(x^{(i)}\right) x_{j}^{(i)}\right] \\
&=-\frac{1}{m} \sum_{i=1}^{m}\left[y^{(i)}\left(1-h_{\theta}\left(x^{(i)}\right)\right)-\left(1-y^{(i)}\right) h_{\theta}\left(x^{(i)}\right)\right] x_{j}^{(i)} \\
&=-\frac{1}{m} \sum_{i=1}^{m}\left[y^{(i)}-y^{(i)} h_{\theta}\left(x^{(i)}\right)-h_{\theta}\left(x^{(i)}\right)+y^{(i)} h_{\theta}\left(x^{(i)}\right)\right] x_{j}^{(i)} \\
&=-\frac{1}{m} \sum_{i=1}^{m}\left[y^{(i)}-h_{\theta}\left(x^{(i)}\right)\right] x_{j}^{(i)} \\
&=\frac{1}{m} \sum_{i=1}^{m}\left[h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right] x_{j}^{(i)}
\end{aligned}
$$
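The derived formula $\frac{1}{m} \sum_{i}\left[h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right] x_{j}^{(i)}$ can be sanity-checked against a finite-difference approximation of the loss. A minimal sketch, assuming NumPy (the data values here are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(theta, X, y):
    """Cross-entropy loss J(theta)."""
    h = sigmoid(X @ theta)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

def grad(theta, X, y):
    """The derived analytic gradient: 1/m * X^T (h - y)."""
    return X.T @ (sigmoid(X @ theta) - y) / len(y)

X = np.array([[1.0, 0.5], [1.0, -1.2], [1.0, 2.0]])
y = np.array([1.0, 0.0, 1.0])
theta = np.array([0.1, -0.3])

# Central-difference approximation of each partial derivative
eps = 1e-6
numeric = np.array([
    (loss(theta + eps * e, X, y) - loss(theta - eps * e, X, y)) / (2 * eps)
    for e in np.eye(len(theta))
])
assert np.allclose(grad(theta, X, y), numeric, atol=1e-6)
```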
Notice that this partial derivative has exactly the same form as the partial derivative of the linear regression loss with respect to $\theta$. The linear regression loss function is defined as:
$$
J(\theta)=\frac{1}{2 m} \sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right)^{2}
$$
Its partial derivative is:
$$
\frac{\partial J(\theta)}{\partial \theta_{j}}=\frac{1}{m} \sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right) \cdot x_{j}^{(i)}
$$
Vectorizing it (with $X$ the design matrix whose rows are the $x^{(i)}$, and $\overrightarrow{x_{j}}$ its $j$-th column):
$$
\frac{\partial J(\theta)}{\partial \theta_{j}}=\frac{1}{m} \overrightarrow{x_{j}}^{T}(X \theta-\vec{y})
$$
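The vectorized form is just the element-wise sum written as a dot product. A small sketch, assuming NumPy (the data is illustrative), checking the two against each other for one coordinate $j$:

```python
import numpy as np

X = np.array([[1.0, 2.0], [1.0, 3.0], [1.0, 5.0]])
y = np.array([2.0, 4.0, 7.0])
theta = np.array([0.5, 1.0])
m = len(y)

residual = X @ theta - y  # h_theta(x^(i)) - y^(i) for all i at once

j = 1
# Element-wise sum: 1/m * sum_i (h(x^(i)) - y^(i)) * x_j^(i)
loop_sum = sum(residual[i] * X[i, j] for i in range(m)) / m
# Vectorized: 1/m * x_j^T (X theta - y)
vectorized = X[:, j] @ residual / m
assert np.isclose(loop_sum, vectorized)
```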
Stacking these over all $j$ gives the full gradient of the loss function:
$$
\nabla J(\theta)=\frac{1}{m} X^{T}(X \theta-\vec{y})
$$
The parameters are then updated using this gradient:
$$
\theta:=\theta-\frac{\alpha}{m} X^{T}(X \theta-\vec{y})
$$
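The update rule above can be sketched as a plain gradient-descent loop for linear regression. This assumes NumPy; the learning rate, iteration count, and data are illustrative choices, not values from the course:

```python
import numpy as np

# Toy data generated from y = 1 + 2x, with a bias column prepended
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])

theta = np.zeros(2)
alpha, m = 0.1, len(y)

for _ in range(5000):
    # theta := theta - (alpha/m) * X^T (X theta - y)
    theta -= (alpha / m) * X.T @ (X @ theta - y)

# theta converges toward the true parameters [1, 2]
```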
For other topics, see: Andrew Ng's Machine Learning course.