Detailed Derivation of the Logistic Regression Formulas (LR Derivation)
0 Introduction and Preliminaries
Logistic regression is essentially linear regression with the function $\sigma(z)=\frac{1}{1+e^{-z}}$ inserted when mapping features to the output.
The linear decision boundary of logistic regression has the form (with the convention $x_0=1$, so the sum starts at $i=0$ and $\theta_0$ is the bias): $f(x)=\theta_0+\theta_1x_1+\dots+\theta_nx_n=\sum_{i=0}^{n}\theta_ix_i=\theta^TX$
Construct the hypothesis (prediction) function: $h_\theta(x)=\sigma(f(x))=\sigma(\theta^TX)=\frac{1}{1+e^{-\theta^TX}}$
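As a quick illustration, the hypothesis can be sketched in NumPy. This is a minimal sketch, not part of the derivation; it assumes the design matrix `X` already carries a leading column of ones so that `theta[0]` plays the role of the bias $\theta_0$:

```python
import numpy as np

def sigmoid(z):
    # sigma(z) = 1 / (1 + e^{-z})
    return 1.0 / (1.0 + np.exp(-z))

def h(theta, X):
    # h_theta(x) = sigma(theta^T x), vectorized over the rows of X.
    # X is assumed to carry a leading column of ones (x_0 = 1),
    # so theta[0] acts as the bias term theta_0.
    return sigmoid(X @ theta)
```

With $\theta=0$ every prediction is $\sigma(0)=0.5$, i.e. the model starts out maximally uncertain.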
The derivative of $\sigma(z)$ (used later):
$$\sigma'(z)=\frac{d}{dz}\frac{1}{1+e^{-z}}=\frac{-1}{(1+e^{-z})^2}\cdot(1+e^{-z})'=\frac{e^{-z}}{(1+e^{-z})^2}$$
$$=\frac{e^{-z}}{1+e^{-z}}\cdot\frac{1}{1+e^{-z}}=\frac{1}{1+e^{-z}}\cdot\left(1-\frac{1}{1+e^{-z}}\right)=\sigma(z)\cdot(1-\sigma(z))$$
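The identity $\sigma'(z)=\sigma(z)(1-\sigma(z))$ is easy to sanity-check numerically, for instance against a central finite difference (a small NumPy sketch; the grid and step size are arbitrary):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-5.0, 5.0, 11)
eps = 1e-6

# Central-difference approximation of sigma'(z)...
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2.0 * eps)
# ...versus the closed form sigma(z) * (1 - sigma(z)) derived above.
closed_form = sigmoid(z) * (1.0 - sigmoid(z))
max_err = np.max(np.abs(numeric - closed_form))
```

At $z=0$ the derivative attains its maximum value $\sigma(0)(1-\sigma(0))=0.25$.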
1 Prediction Function
Since logistic regression is used for binary classification, the label follows a (two-point) Bernoulli distribution:
$p(y=1|x;\theta)=h_\theta(x)$: given $\theta$ and $x$, for a positive sample $y=1$ we want this probability to be close to 1.
$p(y=0|x;\theta)=1-h_\theta(x)$: given $\theta$ and $x$, for a negative sample $y=0$ we want $h_\theta(x)$ to be close to 0, i.e. this probability close to 1.
The two cases combine into one expression:
$$p(y|x;\theta)=[h_\theta(x)]^y[1-h_\theta(x)]^{1-y}$$
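The collapsed form can be checked directly: the exponent $y$ switches between the two cases (a tiny plain-Python sketch):

```python
def bernoulli_p(y, h):
    # p(y|x;theta) = h^y * (1-h)^(1-y):
    # for y = 1 this reduces to h, for y = 0 it reduces to 1 - h.
    return h ** y * (1.0 - h) ** (1 - y)
```

For $h_\theta(x)=0.8$, `bernoulli_p(1, 0.8)` recovers $h_\theta(x)$ and `bernoulli_p(0, 0.8)` recovers $1-h_\theta(x)$, matching the two cases above.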
2 Loss Function
The loss function can be obtained from the maximum likelihood.
For $m$ samples $\{(x^{(1)},y^{(1)}),(x^{(2)},y^{(2)}),\dots,(x^{(m)},y^{(m)})\}$, the likelihood function is:
$$L(\theta)=\prod_{i=1}^{m}p(y^{(i)}|x^{(i)};\theta)=\prod_{i=1}^{m}[h_\theta(x^{(i)})]^{y^{(i)}}[1-h_\theta(x^{(i)})]^{1-y^{(i)}}$$
The log-likelihood is:
$$l(\theta)=\log L(\theta)=\log\prod_{i=1}^{m}p(y^{(i)}|x^{(i)};\theta)=\sum_{i=1}^m\log p(y^{(i)}|x^{(i)};\theta)$$
$$=\sum_{i=1}^m\log\left\{[h_\theta(x^{(i)})]^{y^{(i)}}[1-h_\theta(x^{(i)})]^{1-y^{(i)}}\right\}$$
$$=\sum_{i=1}^m\left[y^{(i)}\log h_\theta(x^{(i)})+(1-y^{(i)})\log(1-h_\theta(x^{(i)}))\right]$$
Maximum likelihood maximizes $l(\theta)$, while a loss function should be minimized, so we multiply $l(\theta)$ by $-1$; and to keep the sample size from affecting the scale of the loss, we also multiply by $\frac{1}{m}$. The loss function is therefore:
$$J(\theta)=-\frac{1}{m}l(\theta)=-\frac{1}{m}\sum_{i=1}^m\left[y^{(i)}\log h_\theta(x^{(i)})+(1-y^{(i)})\log(1-h_\theta(x^{(i)}))\right]$$
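In code, $J(\theta)$ is the familiar binary cross-entropy. A minimal NumPy sketch (the small `eps` guarding $\log 0$ is a numerical addition of mine, not part of the derivation):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy_loss(theta, X, y):
    # J(theta) = -(1/m) * sum_i [ y_i log h_i + (1 - y_i) log(1 - h_i) ]
    # X is (m, n+1) with a leading ones column; y contains 0/1 labels.
    h = sigmoid(X @ theta)
    eps = 1e-12  # numerical guard against log(0)
    return -np.mean(y * np.log(h + eps) + (1.0 - y) * np.log(1.0 - h + eps))
```

At $\theta=0$ every $h_\theta(x^{(i)})=0.5$, so the loss equals $\log 2\approx 0.693$ regardless of the labels.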
3 Parameter Update
Update the parameters $\theta$ with gradient descent: $\theta_j=\theta_j-\alpha\frac{\partial}{\partial\theta_j}J(\theta)\quad(j=0,1,\dots,n)$, where $n$ is the number of features; the $n+1$ parameters are the $n$ feature weights plus one bias term (Andrew Ng's course writes the bias update and the feature-weight updates separately).
Compute the $\frac{\partial}{\partial\theta_j}J(\theta)$ part:
$$\frac{\partial}{\partial\theta_j}J(\theta)=-\frac{1}{m}\frac{\partial l(\theta)}{\partial\theta_j}$$
Then, applying the chain rule term by term to the sum over samples:
$$\frac{\partial l(\theta)}{\partial\theta_j}=\sum_{i=1}^m\frac{\partial l(\theta)}{\partial h_\theta(x^{(i)})}\cdot\frac{\partial h_\theta(x^{(i)})}{\partial(\theta^Tx^{(i)})}\cdot\frac{\partial(\theta^Tx^{(i)})}{\partial\theta_j}$$
$$=\sum_{i=1}^m\left\{\left[y^{(i)}\frac{1}{h_\theta(x^{(i)})}-(1-y^{(i)})\frac{1}{1-h_\theta(x^{(i)})}\right]\cdot\sigma(\theta^Tx^{(i)})\left[1-\sigma(\theta^Tx^{(i)})\right]\cdot x_j^{(i)}\right\}$$
$$=\sum_{i=1}^m\left\{\left[y^{(i)}\frac{1}{h_\theta(x^{(i)})}-(1-y^{(i)})\frac{1}{1-h_\theta(x^{(i)})}\right]\cdot h_\theta(x^{(i)})\left[1-h_\theta(x^{(i)})\right]\cdot x_j^{(i)}\right\}$$
$$=\sum_{i=1}^m\left\{\left[y^{(i)}(1-h_\theta(x^{(i)}))-(1-y^{(i)})h_\theta(x^{(i)})\right]\cdot x_j^{(i)}\right\}$$
$$=\sum_{i=1}^m\left\{\left[y^{(i)}-h_\theta(x^{(i)})\right]\cdot x_j^{(i)}\right\}$$
Here the middle factor uses $\sigma'(z)=\sigma(z)(1-\sigma(z))$ with $h_\theta(x^{(i)})=\sigma(\theta^Tx^{(i)})$, and the last factor uses $\frac{\partial(\theta^Tx^{(i)})}{\partial\theta_j}=x_j^{(i)}$.
Therefore, $\frac{\partial}{\partial\theta_j}J(\theta)=-\frac{1}{m}\sum_{i=1}^m\left[y^{(i)}-h_\theta(x^{(i)})\right]\cdot x_j^{(i)}$
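This closed-form gradient can be verified against finite differences (a sketch on synthetic data; the dataset, seed, and step size are arbitrary illustrative choices):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(theta, X, y):
    # J(theta) as derived above.
    h = sigmoid(X @ theta)
    return -np.mean(y * np.log(h) + (1.0 - y) * np.log(1.0 - h))

def grad(theta, X, y):
    # Analytic gradient: dJ/dtheta_j = -(1/m) sum_i (y_i - h_i) x_ij,
    # computed for all j at once.
    m = X.shape[0]
    return -(X.T @ (y - sigmoid(X @ theta))) / m

rng = np.random.default_rng(0)
X = np.hstack([np.ones((5, 1)), rng.normal(size=(5, 2))])  # leading ones column
y = np.array([0.0, 1.0, 1.0, 0.0, 1.0])
theta = rng.normal(size=3)

# Central-difference estimate of each partial derivative dJ/dtheta_j
eps = 1e-6
numeric = np.array([
    (loss(theta + eps * e, X, y) - loss(theta - eps * e, X, y)) / (2.0 * eps)
    for e in np.eye(3)
])
```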
So the parameter update is
$$\theta_j=\theta_j-\alpha\cdot\left(-\frac{1}{m}\sum_{i=1}^m\left[y^{(i)}-h_\theta(x^{(i)})\right]\cdot x_j^{(i)}\right)$$
$$=\theta_j-\alpha\cdot\frac{1}{m}\sum_{i=1}^m\left[h_\theta(x^{(i)})-y^{(i)}\right]\cdot x_j^{(i)}\quad(j=0,1,\dots,n)$$
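Putting the pieces together, a minimal batch-gradient-descent training loop might look like this (the learning rate, iteration count, and toy data below are illustrative choices, not part of the derivation):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def batch_gradient_descent(X, y, alpha=0.1, n_iters=1000):
    # Each iteration applies the batch update derived above, for all
    # j = 0..n at once:  theta -= alpha * (1/m) * X^T (h - y).
    # X is (m, n+1) with a leading ones column for the bias theta_0.
    m, d = X.shape
    theta = np.zeros(d)
    for _ in range(n_iters):
        grad = X.T @ (sigmoid(X @ theta) - y) / m
        theta -= alpha * grad
    return theta
```

On a tiny linearly separable toy set (negatives at $x_1<0$, positives at $x_1>0$) the fitted $\theta_1$ comes out positive and the model classifies all four points correctly.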
Computing the update over all training samples in each iteration is called batch gradient descent (Batch Gradient Descent); the formula above is exactly that. However, when the sample size is very large, the computation becomes enormous. A more practical algorithm is stochastic gradient descent (Stochastic Gradient Descent, SGD), in which each update uses only a single sample. For a training set with millions of samples, one pass through the data then yields millions of parameter updates, greatly improving efficiency.
The stochastic gradient descent update is:
$$\theta_j=\theta_j-\alpha\cdot\left[h_\theta(x^{(i)})-y^{(i)}\right]\cdot x_j^{(i)}\quad(j=0,1,\dots,n)$$
(Since each update uses a single sample, the $\frac{1}{m}$ factor of the batch formula is dropped.)
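A corresponding SGD sketch, shuffling the samples each epoch and applying one update per sample (the epoch count, step size, and seed are arbitrary illustrative choices):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd(X, y, alpha=0.1, n_epochs=100, seed=0):
    # Each step uses a single sample i:
    #   theta_j -= alpha * (h_theta(x^(i)) - y^(i)) * x_j^(i)
    # applied to all j at once; samples are visited in shuffled order.
    rng = np.random.default_rng(seed)
    m, d = X.shape
    theta = np.zeros(d)
    for _ in range(n_epochs):
        for i in rng.permutation(m):
            theta -= alpha * (sigmoid(X[i] @ theta) - y[i]) * X[i]
    return theta
```

Because each step looks at only one sample, the trajectory is noisier than batch descent, but on large datasets it makes far more updates per pass over the data.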
That completes the detailed derivation of logistic regression. It draws on a fair amount of probability theory; readers who find it unfamiliar may want to review probability and statistics first.