3. Logistic Regression (Binary Classification)
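Logistic regression addresses a binary classification problem with labels $y \in \{0,1\}$, so the prediction must be mapped into the interval $(0,1)$, where it can be read as a probability. To do this, we pass the linear score $\theta^Tx$ through the sigmoid function:

$$
h_\theta(x)=\frac{1}{1+e^{-\theta^Tx}}
$$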
As $\theta^Tx \to +\infty$, $h_\theta(x) \to 1$; as $\theta^Tx \to -\infty$, $h_\theta(x) \to 0$. The output can therefore be interpreted as a probability:

$$
\begin{aligned}
P(y=1|x;\theta)&=h_\theta(x)\\
P(y=0|x;\theta)&=1-h_\theta(x)
\end{aligned}
$$
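The mapping from the linear score to the two class probabilities can be sketched in a few lines of plain Python. The function names `sigmoid` and `predict_proba`, the parameter values, and the sample are illustrative assumptions, not part of the original text. The branch on the sign of `z` is one common way to avoid overflow in `exp`.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic function 1 / (1 + exp(-z)), written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z use the equivalent form e^z / (1 + e^z).
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict_proba(theta, x):
    """Return (P(y=0 | x; theta), P(y=1 | x; theta)) for one sample."""
    z = sum(t * v for t, v in zip(theta, x))  # theta^T x
    p1 = sigmoid(z)
    return 1.0 - p1, p1

# Illustrative values only; x[0] = 1 plays the role of the bias input.
theta = [0.5, -1.0, 2.0]
x = [1.0, 2.0, 0.5]
p0, p1 = predict_proba(theta, x)
print(p0, p1)
```

Very large positive scores give values that round to 1 and very large negative scores give values that round to 0, matching the limits described above.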
Equivalently, the two cases can be combined into a single expression (setting $y=1$ recovers the first case and $y=0$ the second):

$$
P(y|x;\theta)=(h_\theta(x))^{y}(1-h_\theta(x))^{1-y}
$$
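As a quick sanity check, the compact form can be evaluated for both labels to confirm that it reduces to the two separate cases. The helper name `bernoulli_prob` and the value `h = 0.8` are assumptions chosen for illustration.

```python
def bernoulli_prob(y: int, h: float) -> float:
    """P(y | x; theta) = h**y * (1 - h)**(1 - y), for y in {0, 1}."""
    return (h ** y) * ((1.0 - h) ** (1 - y))

h = 0.8  # stands in for h_theta(x) of some sample
print(bernoulli_prob(1, h))  # equals h
print(bernoulli_prob(0, h))  # equals 1 - h
```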
We estimate $\theta$ by maximum likelihood. For $m$ independent training samples, the likelihood $L(\theta)$ is the product of the conditional probabilities, and taking the logarithm turns the product into a sum. Here $h(x^{(i)})$ abbreviates $h_\theta(x^{(i)})$ and $\log$ is the natural logarithm:

$$
\begin{aligned}
L(\theta)&=\prod^m_{i=1}P(y^{(i)}|x^{(i)};\theta)\\
l(\theta)&=\log[L(\theta)]\\
&=\sum^m_{i=1}\left\{y^{(i)}\log[h(x^{(i)})]+(1-y^{(i)})\log[1-h(x^{(i)})]\right\}
\end{aligned}
$$
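The log-likelihood sum translates almost directly into code. The following is a minimal sketch in plain Python. The toy data set, the parameter values, and the function names are assumptions for illustration. It also assumes that no prediction is exactly 0 or 1, since `math.log(0)` would raise an error.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def log_likelihood(theta, X, y):
    """Sum over samples of y*log(h) + (1 - y)*log(1 - h)."""
    total = 0.0
    for x_i, y_i in zip(X, y):
        h = sigmoid(sum(t * v for t, v in zip(theta, x_i)))
        total += y_i * math.log(h) + (1 - y_i) * math.log(1.0 - h)
    return total

# Toy data: each row starts with 1 for the bias term.
X = [[1.0, 0.5], [1.0, -1.5], [1.0, 2.0]]
y = [1, 0, 1]
theta = [0.1, 0.8]
print(log_likelihood(theta, X, y))
```

With $\theta = 0$ every prediction is 0.5, so the log-likelihood is $m\log 0.5$. This is a convenient value to test against.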
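To maximize the log-likelihood we use gradient ascent: the gradient points in the direction of steepest increase (likewise, the negative gradient is the direction of steepest decrease, which is what gradient descent follows). With learning rate $\alpha > 0$, each parameter is updated as:

$$
\theta_j:=\theta_j+\alpha\frac{\partial l(\theta)}{\partial \theta_j}
$$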
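The update rule itself is independent of the particular objective, so it can be illustrated on a toy concave function before the gradient of $l(\theta)$ is derived. In this sketch the function `gradient_ascent`, the objective, the step size, and the iteration count are all assumptions chosen for illustration.

```python
def gradient_ascent(grad, theta, alpha=0.1, steps=100):
    """Repeat theta_j := theta_j + alpha * (d l / d theta_j) for every j."""
    theta = list(theta)
    for _ in range(steps):
        g = grad(theta)
        theta = [t + alpha * g_j for t, g_j in zip(theta, g)]
    return theta

# Toy objective l(theta) = -(theta_0 - 3)^2 - (theta_1 + 1)^2,
# which is concave and has its maximum at (3, -1).
def toy_grad(theta):
    return [-2.0 * (theta[0] - 3.0), -2.0 * (theta[1] + 1.0)]

theta_hat = gradient_ascent(toy_grad, [0.0, 0.0])
print(theta_hat)  # close to [3.0, -1.0]
```

If $\alpha$ is too large the iterates can overshoot and diverge. For this toy objective any $\alpha$ below 1 converges.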
So we need the gradient of $l(\theta)$ with respect to $\theta$. The derivation uses the derivative of the sigmoid output with respect to $\theta_j$:

$$
\begin{aligned}
\frac{\partial h(x^{(i)})}{\partial \theta_j}&=\frac{e^{-\theta^Tx^{(i)}}}{(1+e^{-\theta^Tx^{(i)}})^2}x^{(i)}_j\\
&=h(x^{(i)})[1-h(x^{(i)})]x^{(i)}_j
\end{aligned}
$$
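The identity $\partial h/\partial\theta_j = h(1-h)x_j$ can be checked numerically against a central finite difference. This is only a sketch: the sample, the parameters, the step `eps`, and the helper names are illustrative assumptions.

```python
import math

def h(theta, x):
    """Sigmoid of theta^T x (plain form; fine for moderate scores)."""
    z = sum(t * v for t, v in zip(theta, x))
    return 1.0 / (1.0 + math.exp(-z))

def dh_dtheta(theta, x, j):
    """Analytic derivative: h * (1 - h) * x_j."""
    p = h(theta, x)
    return p * (1.0 - p) * x[j]

def dh_dtheta_numeric(theta, x, j, eps=1e-6):
    """Central finite difference in coordinate j."""
    up = list(theta)
    up[j] += eps
    down = list(theta)
    down[j] -= eps
    return (h(up, x) - h(down, x)) / (2.0 * eps)

theta = [0.2, -0.4, 0.7]
x = [1.0, 2.0, -1.0]
for j in range(len(theta)):
    print(j, dh_dtheta(theta, x, j), dh_dtheta_numeric(theta, x, j))
```

The two columns should agree to many decimal places. A visible mismatch would point to an error in the analytic formula.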
- Differentiating element by element:

$$
\begin{aligned}
\frac{\partial l(\theta)}{\partial \theta_j}
&=\sum^m_{i=1}\frac{\partial }{\partial \theta_j}\left\{y^{(i)}\log[h(x^{(i)})]+(1-y^{(i)})\log[1-h(x^{(i)})]\right\}\\
&=\sum^m_{i=1}\left[\left(\frac{y^{(i)}}{h(x^{(i)})}-\frac{1-y^{(i)}}{1-h(x^{(i)})}\right)\frac{\partial h(x^{(i)})}{\partial \theta_j}\right]\\
&=\sum^m_{i=1}\left[\left(y^{(i)}(1-h(x^{(i)}))-(1-y^{(i)})h(x^{(i)})\right)x^{(i)}_j\right]\\
&=\sum^m_{i=1}\left[\left(y^{(i)}-h(x^{(i)})\right)x^{(i)}_j\right]
\end{aligned}
$$

- Differentiating in matrix form:
Let $X$ be the matrix whose $i$-th row is $(x^{(i)})^T$:

$$
X=\left[
\begin{matrix}
(x^{(1)})^T\\
(x^{(2)})^T\\
\vdots\\
(x^{(m)})^T
\end{matrix}
\right],\quad
\theta=\left[
\begin{matrix}
\theta_0\\
\theta_1\\
\vdots\\
\theta_n
\end{matrix}
\right],\quad
y=\left[
\begin{matrix}
y^{(1)}\\
y^{(2)}\\
\vdots\\
y^{(m)}
\end{matrix}
\right]
$$
Then the predictions for all $m$ samples form a vector, where the exponential and the division are applied element-wise:

$$
h_{\theta}(X)=\frac{1}{1+e^{-X\theta}}
$$
So $l(\theta)$ can be written in matrix form. The second line follows from $\log[h_{\theta}(X)]=-\log(1+e^{-X\theta})$ and $\log[1-h_{\theta}(X)]=-X\theta-\log(1+e^{-X\theta})$, where $\mathbf 1$ is the all-ones vector of length $m$:

$$
\begin{aligned}
l(\theta)&=y^T\log[h_{\theta}(X)]+(\mathbf 1-y)^T\log[\mathbf 1-h_{\theta}(X)]\\
&=(y-\mathbf 1)^TX\theta-\mathbf 1^T\log(1+e^{-X\theta})
\end{aligned}
$$
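The simplification from the sum form to $(y-\mathbf 1)^TX\theta-\mathbf 1^T\log(1+e^{-X\theta})$ is easy to get wrong by a sign, so a numerical comparison of the two forms is a useful check. The sketch below uses plain Python lists in place of matrices. The data and the function names are illustrative assumptions, and the direct use of `exp(-z)` assumes moderate scores.

```python
import math

def scores(theta, X):
    """X theta, one linear score per row of X."""
    return [sum(t * v for t, v in zip(theta, row)) for row in X]

def log_likelihood_sum(theta, X, y):
    """Original form: sum of y*log(h) + (1 - y)*log(1 - h)."""
    total = 0.0
    for z, y_i in zip(scores(theta, X), y):
        h = 1.0 / (1.0 + math.exp(-z))
        total += y_i * math.log(h) + (1 - y_i) * math.log(1.0 - h)
    return total

def log_likelihood_matrix(theta, X, y):
    """Simplified form: (y - 1)^T X theta - 1^T log(1 + exp(-X theta))."""
    z = scores(theta, X)
    l1 = sum((y_i - 1) * z_i for y_i, z_i in zip(y, z))
    l2 = sum(math.log(1.0 + math.exp(-z_i)) for z_i in z)
    return l1 - l2

X = [[1.0, 0.5], [1.0, -1.5], [1.0, 2.0]]
y = [1, 0, 1]
theta = [0.1, 0.8]
print(log_likelihood_sum(theta, X, y), log_likelihood_matrix(theta, X, y))
```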
Let $l_1=(y-\mathbf 1)^TX\theta$ and $l_2=\mathbf 1^T\log(1+e^{-X\theta})$, so that $l=l_1-l_2$. The differential is:

$$
d(l)=d(l_1)-d(l_2)
$$

Since $l_1$ is linear in $\theta$:

$$
d(l_1)=(y-\mathbf 1)^TXd(\theta)
$$
Next we compute $d(l_2)$. Let $w=1+e^{a}$ and $a=-X\theta$, with $\frac{1}{w}$ denoting the element-wise reciprocal:

$$
\begin{aligned}
d(l_2)&=\operatorname{tr}\left[\mathbf 1^Td[\log(w)]\right]\\
&=\operatorname{tr}\left[\mathbf 1^T\left(\frac{1}{w}\odot d(w)\right)\right]\\
&=\operatorname{tr}\left[\left(\mathbf 1\odot\frac{1}{w}\right)^T d(w)\right]\\
&=\operatorname{tr}\left[\left(\frac{1}{w}\right)^T d(w)\right]=\operatorname{tr}\left[\left(\frac{\partial l_2}{\partial w}\right)^T d(w)\right]
\end{aligned}
$$

From this we obtain:

$$
\frac{\partial l_2}{\partial w}=\frac{1}{w}
$$
Furthermore, since $d(w)=e^a\odot d(a)$:

$$
\begin{aligned}
d(l_2)&=\operatorname{tr}\left[\left(\frac{\partial l_2}{\partial w}\right)^T d(w)\right]\\
&=\operatorname{tr}\left[\left(\frac{\partial l_2}{\partial w}\right)^T\left( e^a \odot d(a)\right)\right]\\
&=\operatorname{tr}\left[\left(\frac{\partial l_2}{\partial w}\odot e^a\right)^T d(a)\right]=\operatorname{tr}\left[\left(\frac{\partial l_2}{\partial a}\right)^T d(a)\right]
\end{aligned}
$$

From this we obtain:

$$
\frac{\partial l_2}{\partial a}=\frac{\partial l_2}{\partial w}\odot e^a=\frac{e^a}{w}
$$
Finally, since $d(a)=-Xd(\theta)$:

$$
\begin{aligned}
d(l_2)&=\operatorname{tr}\left[\left(\frac{\partial l_2}{\partial a}\right)^T d(a)\right]\\
&=\operatorname{tr}\left[\left(\frac{\partial l_2}{\partial a}\right)^T (-X)d(\theta)\right]
\end{aligned}
$$

Substituting $a=-X\theta$ and $w=1+e^{-X\theta}$ gives:

$$
d(l_2)=-\left(\frac{e^{-X\theta}}{1+e^{-X\theta}}\right)^TXd(\theta)
$$
Therefore, using $\mathbf 1-\frac{e^{-X\theta}}{1+e^{-X\theta}}=\frac{1}{1+e^{-X\theta}}$:

$$
\begin{aligned}
d(l)=d(l_1)-d(l_2)&=(y-\mathbf 1)^TXd(\theta)+\left(\frac{e^{-X\theta}}{1+e^{-X\theta}}\right)^TXd(\theta)\\
&=\operatorname{tr}\left[\left(y-\frac{1}{1+e^{-X\theta}}\right)^TXd(\theta)\right]=\operatorname{tr}\left[\left(\frac{\partial l}{\partial \theta}\right)^Td(\theta)\right]
\end{aligned}
$$

We finally obtain:

$$
\frac{\partial l}{\partial \theta}=X^T\left(y-\frac{1}{1+e^{-X\theta}}\right)
$$
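Putting the result together, the gradient $X^T(y-h)$ can be plugged into the gradient ascent update to fit $\theta$. The following is a minimal sketch in plain Python, with lists standing in for matrices. The data set, learning rate, step count, and all function names are assumptions for illustration, and the data are deliberately not linearly separable so that a finite maximizer exists.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict(theta, X):
    """h = 1 / (1 + exp(-X theta)), one entry per row of X."""
    return [sigmoid(sum(t * v for t, v in zip(theta, row))) for row in X]

def gradient(theta, X, y):
    """X^T (y - h): entry j is sum_i (y_i - h_i) * X[i][j]."""
    residual = [y_i - h_i for y_i, h_i in zip(y, predict(theta, X))]
    return [sum(r * row[j] for r, row in zip(residual, X))
            for j in range(len(theta))]

def log_likelihood(theta, X, y):
    """Sum of y*log(h) + (1 - y)*log(1 - h) over all samples."""
    return sum(y_i * math.log(h_i) + (1 - y_i) * math.log(1.0 - h_i)
               for y_i, h_i in zip(y, predict(theta, X)))

def fit(X, y, alpha=0.1, steps=5000):
    """Gradient ascent: theta := theta + alpha * X^T (y - h)."""
    theta = [0.0] * len(X[0])
    for _ in range(steps):
        g = gradient(theta, X, y)
        theta = [t + alpha * g_j for t, g_j in zip(theta, g)]
    return theta

# Toy data: first column is the bias input, labels overlap on purpose.
X = [[1.0, 0.5], [1.0, 1.0], [1.0, 1.5], [1.0, 2.5], [1.0, 3.0], [1.0, 3.5]]
y = [0, 1, 0, 1, 0, 1]
theta_hat = fit(X, y)
print(theta_hat)
print(gradient(theta_hat, X, y))  # should be close to the zero vector
```

At the fitted parameters the gradient is approximately zero, which is the first-order condition for a maximum of $l(\theta)$. On perfectly separable data the parameters would keep growing instead of converging, so this sketch would need a stopping rule or regularization in that case.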