ML: Gradient Descent for Logistic Regression

ML: Maximum Likelihood Estimation

Probability density (mass) function: a function describing, for each value a random variable can take, the probability associated with that value.

Probability: given a known distribution, compute how likely a sample value is.

Likelihood: given observed samples, find the parameters that best explain the observed data.

Likelihood function: $\mathcal{L}(\mu, \sigma \mid X)=\prod_{i=1}^{N} P\left(x_{i} \mid \mu, \sigma\right)$

Log-likelihood function: $\log \mathcal{L}(\mu, \sigma \mid X)=\sum_{i=1}^{N} \log P\left(x_{i} \mid \mu, \sigma\right)$
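As a quick numeric illustration (a minimal sketch, not part of the original text; the data and function names are made up), the log-likelihood of Gaussian samples is maximized by the sample mean and standard deviation:

```python
import numpy as np
from scipy.stats import norm

# Hypothetical data: draw N samples from a Gaussian with "unknown" mu, sigma.
rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=1000)

def log_likelihood(mu, sigma, x):
    # sum_i log P(x_i | mu, sigma)
    return np.sum(norm.logpdf(x, loc=mu, scale=sigma))

# For a Gaussian, the MLE is the sample mean and the (biased) sample std.
mu_hat, sigma_hat = x.mean(), x.std()
print(log_likelihood(mu_hat, sigma_hat, x))        # largest log-likelihood
print(log_likelihood(mu_hat + 0.5, sigma_hat, x))  # shifting the mean scores lower
```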

Loss function: $J(\theta)=-\sum_{i}^{m}\left[Y \log (\hat{Y})+(1-Y) \log (1-\hat{Y})\right]$. We need the derivative of $J(\theta)$ with respect to each parameter $\theta_{j}$, where $\hat{Y}=\frac{1}{1+e^{-\theta^{T} X}}$.
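A minimal NumPy sketch of this loss (an illustration only; the column-per-sample layout and names such as `cross_entropy` are assumptions, not from the original):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(theta, X, y):
    # X: (n_features, m), one sample per column; y: (m,) labels in {0, 1}
    y_hat = sigmoid(theta @ X)          # \hat{Y} = 1 / (1 + e^{-theta^T X})
    return -np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
```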

Using $\frac{d}{d x} \log _{a}(f(x))=\frac{1}{f(x) \ln a} f^{\prime}(x)$ and substituting $\hat{Y}=\frac{1}{1+e^{-\theta^{T} X}}$ into $\log(\hat{Y})$:

$$\frac{\partial}{\partial \theta_{j}} \log (\hat{Y})=\frac{\partial}{\partial \theta_{j}} \log \left(\frac{1}{1+e^{-\theta^{T} x}}\right)=\frac{\partial}{\partial \theta_{j}}\left(\log (1)-\log \left(1+e^{-\theta^{T} x}\right)\right)$$

$$\frac{\partial}{\partial \theta_{j}} \log (\hat{Y})=\frac{\partial}{\partial \theta_{j}}\left(-\log \left(1+e^{-\theta^{T} x}\right)\right)=-\frac{1}{1+e^{-\theta^{T} x}} \cdot e^{-\theta^{T} x} \cdot\left(-x_{j}\right)=\left(1-\frac{1}{1+e^{-\theta^{T} x}}\right) x_{j}$$

$$\frac{\partial}{\partial \theta_{j}} \log (1-\hat{Y})=\frac{\partial}{\partial \theta_{j}} \log \left(\frac{e^{-\theta^{T} x}}{1+e^{-\theta^{T} x}}\right)=\frac{\partial}{\partial \theta_{j}}\left(-\theta^{T} x-\log \left(1+e^{-\theta^{T} x}\right)\right)$$

$$\frac{\partial}{\partial \theta_{j}} \log (1-\hat{Y})=-x_{j}+x_{j}\left(1-\frac{1}{1+e^{-\theta^{T} x}}\right)=-\frac{1}{1+e^{-\theta^{T} x}} x_{j}$$

Putting these together:

$$\frac{\partial}{\partial \theta_{j}} J(\theta)=-\sum_{i}^{m}\left[y_{i} x_{i j}\left(1-\frac{1}{1+e^{-\theta^{T} x_{i}}}\right)-\left(1-y_{i}\right) x_{i j} \frac{1}{1+e^{-\theta^{T} x_{i}}}\right]$$

Here $i$ indexes the data points and $j$ indexes the features; the input $X$ can be written as:

$$X=\left[\begin{array}{lll}{x_{i=1, j=1}} & {x_{i=2, j=1}} & {x_{i=3, j=1}} \\ {x_{i=1, j=2}} & {x_{i=2, j=2}} & {x_{i=3, j=2}} \\ {x_{i=1, j=3}} & {x_{i=2, j=3}} & {x_{i=3, j=3}}\end{array}\right]$$

For example, for a batch of images, $x_{ij}$ is the $j$-th pixel of the $i$-th image.
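In NumPy terms (an illustrative sketch; the shapes and names are assumptions), this layout puts one sample per column, so a batch of flattened images becomes an array of shape (n_pixels, n_images) and `X[j, i]` is pixel `j` of image `i`:

```python
import numpy as np

batch = np.arange(12).reshape(3, 4)  # pretend batch: 3 images, 4 pixels each
X = batch.T                          # shape (4, 3): column i holds image i
print(X[2, 1])                       # pixel j=2 of image i=1 -> 6
```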

Expanding and simplifying:

$$\frac{\partial}{\partial \theta_{j}} J(\theta)=\sum_{i}^{m}\left(\frac{1}{1+e^{-\theta^{T} x_{i}}}-y_{i}\right) x_{i j}=\sum_{i}^{m}\left(\hat{y}_{i}-y_{i}\right) x_{i j}$$

where $\hat{y}_{i}=\frac{1}{1+e^{-\theta^{T} x_{i}}}$.
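This final form is easy to check numerically (a sketch with synthetic data; `loss`, `grad`, and the shapes are illustrative assumptions): the analytic gradient $X(\hat{y}-y)$ should match a finite-difference approximation of $J(\theta)$.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(theta, X, y):
    y_hat = sigmoid(theta @ X)
    return -np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def grad(theta, X, y):
    # Analytic gradient from the derivation: sum_i (y_hat_i - y_i) * x_ij
    return X @ (sigmoid(theta @ X) - y)

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 50))                  # 3 features, 50 samples
y = (rng.random(50) > 0.5).astype(float)
theta = rng.normal(size=3)

eps = 1e-6
numeric = np.array([
    (loss(theta + eps * np.eye(3)[j], X, y) - loss(theta - eps * np.eye(3)[j], X, y)) / (2 * eps)
    for j in range(3)
])
print(np.allclose(numeric, grad(theta, X, y), atol=1e-4))  # True
```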

Recall the vector forms of $\theta$ and $X$ used earlier: $\theta^{T}=\left[\begin{array}{lll}{\text{bias}} & {\theta_{1}} & {\theta_{2}}\end{array}\right]$, $X=\left[\begin{array}{c}{1} \\ {x_{1}} \\ {x_{2}}\end{array}\right]$

Since the $bias$ entry of $\theta$ pairs with the constant $1$ in $X$, we get: $\frac{\partial}{\partial \, bias} J(\theta)=\sum_{i}^{m}\left(\hat{y}_{i}-y_{i}\right)$

With a learning rate $\eta$, iterate the following updates until convergence:

$$\theta_{j} \leftarrow \theta_{j}-\eta \frac{\partial}{\partial \theta_{j}} J(\theta)$$

$$bias \leftarrow bias-\eta \frac{\partial}{\partial \, bias} J(\theta)$$
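Putting it all together, here is a minimal end-to-end sketch of this training loop on synthetic data (everything here, including the learning rate and the per-batch averaging of the gradient, is an illustrative choice rather than part of the original text):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Synthetic 2-feature data, one sample per column as in the matrix layout above.
rng = np.random.default_rng(0)
m = 200
X = rng.normal(size=(2, m))
X_aug = np.vstack([np.ones(m), X])              # row of 1s pairs with the bias term
true_theta = np.array([0.5, 2.0, -1.0])         # [bias, theta_1, theta_2]
y = (rng.random(m) < sigmoid(true_theta @ X_aug)).astype(float)

theta = np.zeros(3)                             # [bias, theta_1, theta_2]
eta = 0.1
for _ in range(5000):
    y_hat = sigmoid(theta @ X_aug)              # \hat{y}_i for every sample
    grad = X_aug @ (y_hat - y)                  # entry 0 is dJ/d(bias) = sum(y_hat - y)
    theta -= eta * grad / m                     # averaged gradient step (the text uses the raw sum)
print(theta)                                    # should roughly recover true_theta
```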
