Detailed Derivation of the Logistic Regression Formulas (LR Derivation)

0 Introduction to Logistic Regression and Preliminaries

Logistic regression is essentially linear regression with a sigmoid function $\sigma(z)$ applied when mapping features to the output, where $\sigma(z)=\frac{1}{1+e^{-z}}$.

The linear decision boundary of logistic regression has the form: $f(x)=\theta_0+\theta_1x_1+\dots+\theta_nx_n=\sum_{i=0}^n\theta_ix_i=\theta^TX$ (using the convention $x_0=1$).

Construct the hypothesis (prediction) function: $h_\theta(x)=\sigma(f(x))=\sigma(\theta^TX)=\frac{1}{1+e^{-\theta^TX}}$
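As a quick illustration, here is a minimal NumPy sketch of the sigmoid and the hypothesis function (the names `sigmoid` and `hypothesis`, and the use of a design matrix with a leading column of ones, are illustrative assumptions, not from the original):

```python
import numpy as np

def sigmoid(z):
    # sigma(z) = 1 / (1 + e^{-z})
    return 1.0 / (1.0 + np.exp(-z))

def hypothesis(theta, X):
    # h_theta(x) = sigma(theta^T x); X is (m, n+1) with a leading column of ones for theta_0
    return sigmoid(X @ theta)
```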

Differentiating $\sigma(z)$ (this result will be used later):

$$\sigma'(z)=\frac{d}{dz}\frac{1}{1+e^{-z}}$$
$$=-\frac{1}{(1+e^{-z})^2}\cdot(1+e^{-z})'=\frac{e^{-z}}{(1+e^{-z})^2}$$
$$=\frac{e^{-z}}{1+e^{-z}}\cdot\frac{1}{1+e^{-z}}=\frac{1}{1+e^{-z}}\cdot\left(1-\frac{1}{1+e^{-z}}\right)=\sigma(z)\cdot(1-\sigma(z))$$
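A small numerical check of the identity $\sigma'(z)=\sigma(z)(1-\sigma(z))$, comparing it against a central finite difference (a sketch; the step size and test points are arbitrary choices):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-5.0, 5.0, 11)
eps = 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)   # finite-difference derivative
analytic = sigmoid(z) * (1 - sigmoid(z))                      # sigma(z) * (1 - sigma(z))
print(np.max(np.abs(numeric - analytic)))                     # should be tiny (~1e-11)
```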

1 Prediction Function

Since logistic regression handles binary classification, the label follows a Bernoulli distribution:

$p(y=1|x;\theta)=h_\theta(x)$: given $\theta$ and $x$, for a positive sample ($y=1$) we want this probability to be close to 1.

$p(y=0|x;\theta)=1-h_\theta(x)$: given $\theta$ and $x$, for a negative sample ($y=0$) we want $h_\theta(x)$ to be close to 0, i.e., this probability close to 1.

$$\Longrightarrow\quad p(y|x;\theta)=[h_\theta(x)]^y[1-h_\theta(x)]^{1-y}$$
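The combined form reduces to $h_\theta(x)$ when $y=1$ and to $1-h_\theta(x)$ when $y=0$, as this tiny sketch shows (the value $h=0.75$ is just an illustrative prediction):

```python
def bernoulli_pmf(y, h):
    # p(y | x; theta) = h^y * (1 - h)^(1 - y), where h = h_theta(x)
    return h ** y * (1.0 - h) ** (1 - y)

h = 0.75                    # illustrative predicted probability h_theta(x)
print(bernoulli_pmf(1, h))  # 0.75 = h        (positive sample)
print(bernoulli_pmf(0, h))  # 0.25 = 1 - h    (negative sample)
```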


2 Loss Function

The loss function can be obtained from the maximum likelihood function:

For $m$ samples $\{(x^{(1)},y^{(1)}),(x^{(2)},y^{(2)}),\dots,(x^{(m)},y^{(m)})\}$,

the likelihood function is:

$$L(\theta)=\prod_{i=1}^{m}p(y^{(i)}|x^{(i)};\theta)=\prod_{i=1}^{m}[h_\theta(x^{(i)})]^{y^{(i)}}[1-h_\theta(x^{(i)})]^{1-y^{(i)}}$$

The log-likelihood is:

$$l(\theta)=\log L(\theta)=\log\left[\prod_{i=1}^{m}p(y^{(i)}|x^{(i)};\theta)\right]=\sum_{i=1}^m\log\left[p(y^{(i)}|x^{(i)};\theta)\right]$$

$$=\sum_{i=1}^m\log\left\{[h_\theta(x^{(i)})]^{y^{(i)}}[1-h_\theta(x^{(i)})]^{1-y^{(i)}}\right\}$$

$$=\sum_{i=1}^m\left[y^{(i)}\log(h_\theta(x^{(i)}))+(1-y^{(i)})\log(1-h_\theta(x^{(i)}))\right]$$

Maximum likelihood maximizes $l(\theta)$, whereas a loss function should be minimized, so we multiply $l(\theta)$ by $-1$; to keep the sample size from affecting the scale of the loss, we also multiply by $\frac{1}{m}$. The loss function is therefore:

$$J(\theta)=-\frac{1}{m}l(\theta)=-\frac{1}{m}\sum_{i=1}^m\left[y^{(i)}\log(h_\theta(x^{(i)}))+(1-y^{(i)})\log(1-h_\theta(x^{(i)}))\right]$$
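Below is a minimal sketch of $J(\theta)$ in NumPy, assuming a design matrix `X` with a leading column of ones and labels `y` in $\{0,1\}$; the small `eps` inside the logs is only a numerical guard against $\log(0)$ and is not part of the derivation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy_loss(theta, X, y):
    # J(theta) = -(1/m) * sum_i [ y_i * log(h_i) + (1 - y_i) * log(1 - h_i) ]
    m = len(y)
    h = sigmoid(X @ theta)
    eps = 1e-12  # avoid log(0)
    return -np.sum(y * np.log(h + eps) + (1 - y) * np.log(1 - h + eps)) / m
```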


3 Parameter Update

The parameters $\theta$ are updated with gradient descent: $\theta_j=\theta_j-\alpha\frac{\partial}{\partial\theta_j}J(\theta)\quad(j=0,1,\dots,n)$, where $n$ is the number of features; the $n+1$ parameters are the $n$ feature weights plus one bias term (Andrew Ng's course writes the update of the $n$ feature weights and the bias term separately).

Compute the term $\frac{\partial}{\partial\theta_j}J(\theta)$:

$$\frac{\partial}{\partial\theta_j}J(\theta)=-\frac{1}{m}\frac{\partial l(\theta)}{\partial\theta_j}$$

Then (applying the chain rule and differentiating the sum term by term):

$$\frac{\partial l(\theta)}{\partial\theta_j}=\sum_{i=1}^m\frac{\partial l(\theta)}{\partial h_\theta(x^{(i)})}\cdot\frac{\partial h_\theta(x^{(i)})}{\partial\theta_j}$$

$$=\sum_{i=1}^m\frac{\partial l(\theta)}{\partial h_\theta(x^{(i)})}\cdot\frac{\partial h_\theta(x^{(i)})}{\partial(\theta^Tx^{(i)})}\cdot\frac{\partial(\theta^Tx^{(i)})}{\partial\theta_j}$$

$$=\sum_{i=1}^m\frac{\partial l(\theta)}{\partial h_\theta(x^{(i)})}\cdot\frac{\partial\sigma(\theta^Tx^{(i)})}{\partial(\theta^Tx^{(i)})}\cdot\frac{\partial(\theta^Tx^{(i)})}{\partial\theta_j}$$

$$=\sum_{i=1}^m\left[y^{(i)}\frac{1}{h_\theta(x^{(i)})}-(1-y^{(i)})\frac{1}{1-h_\theta(x^{(i)})}\right]\cdot\sigma(\theta^Tx^{(i)})\left[1-\sigma(\theta^Tx^{(i)})\right]\cdot x_j^{(i)}$$

$$=\sum_{i=1}^m\left[y^{(i)}\frac{1}{h_\theta(x^{(i)})}-(1-y^{(i)})\frac{1}{1-h_\theta(x^{(i)})}\right]\cdot h_\theta(x^{(i)})\left[1-h_\theta(x^{(i)})\right]\cdot x_j^{(i)}$$

$$=\sum_{i=1}^m\left[y^{(i)}(1-h_\theta(x^{(i)}))-(1-y^{(i)})h_\theta(x^{(i)})\right]\cdot x_j^{(i)}$$

$$=\sum_{i=1}^m\left[y^{(i)}-h_\theta(x^{(i)})\right]\cdot x_j^{(i)}$$

Therefore, $\frac{\partial}{\partial\theta_j}J(\theta)=-\frac{1}{m}\sum_{i=1}^m\left[y^{(i)}-h_\theta(x^{(i)})\right]\cdot x_j^{(i)}$.

Hence the parameter update is

$$\theta_j=\theta_j-\alpha\cdot\left(-\frac{1}{m}\sum_{i=1}^m\left[y^{(i)}-h_\theta(x^{(i)})\right]\cdot x_j^{(i)}\right)$$

$$=\theta_j-\alpha\cdot\frac{1}{m}\sum_{i=1}^m\left[h_\theta(x^{(i)})-y^{(i)}\right]\cdot x_j^{(i)}\quad(j=0,1,\dots,n)$$
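A minimal batch gradient descent sketch based on the update above, assuming `X` is an $(m, n+1)$ matrix with a leading column of ones and `y` holds 0/1 labels; `alpha` and `epochs` are arbitrary illustrative hyperparameters:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def batch_gradient_descent(X, y, alpha=0.1, epochs=1000):
    m, d = X.shape
    theta = np.zeros(d)
    for _ in range(epochs):
        h = sigmoid(X @ theta)        # h_theta(x^{(i)}) for all samples at once
        grad = X.T @ (h - y) / m      # (1/m) * sum_i (h_i - y_i) * x_j^{(i)}
        theta -= alpha * grad         # theta_j <- theta_j - alpha * dJ/dtheta_j
    return theta
```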

The update above sums over every sample in the training set before each parameter update; this algorithm is called batch gradient descent (Batch Gradient Descent), and the formula above is exactly the batch update. When the sample size is very large, however, the computation becomes very expensive. A more practical algorithm is stochastic gradient descent (Stochastic Gradient Descent, SGD), in which each update uses only a single sample. For a training set with millions of samples, one pass over the data already yields millions of parameter updates, which greatly improves efficiency.

The stochastic gradient descent update is:

$$\theta_j=\theta_j-\alpha\cdot\left[h_\theta(x^{(i)})-y^{(i)}\right]\cdot x_j^{(i)}\quad(j=0,1,\dots,n)$$
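A corresponding stochastic gradient descent sketch, one update per sample (the shuffling, `alpha`, and `epochs` are illustrative choices, not prescribed by the original):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd(X, y, alpha=0.1, epochs=10, seed=0):
    m, d = X.shape
    theta = np.zeros(d)
    rng = np.random.default_rng(seed)
    for _ in range(epochs):
        for i in rng.permutation(m):                 # visit samples in random order each pass
            h_i = sigmoid(X[i] @ theta)              # prediction for one sample
            theta -= alpha * (h_i - y[i]) * X[i]     # theta_j -= alpha * (h_i - y_i) * x_j^{(i)}
    return theta
```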

This completes the detailed derivation of logistic regression. It draws on a fair amount of probability theory; readers who find it unfamiliar may want to review probability and statistics.
