Gradient Descent by Hand

Gradient Descent for Logistic Regression

$$
\begin{aligned}
P(y=1\mid x,w) &= \frac{1}{1+e^{-(w^Tx+b)}}\\
\arg\min_{w,b}\; & -\sum_{i=1}^n \Big[\,y_i\log P(y_i=1\mid x_i,w)+(1-y_i)\log\big(1-P(y_i=1\mid x_i,w)\big)\Big]
\end{aligned}
$$

Iterative method

$$
\begin{aligned}
w^{t+1}&=w^t-\eta_t\sum_{i=1}^n\left[\sigma(w^Tx_i+b)-y_i\right]x_i\\
b^{t+1}&=b^t-\eta_t\sum_{i=1}^n\left[\sigma(w^Tx_i+b)-y_i\right]
\end{aligned}
$$

Here $\sigma(z)=\frac{1}{1+e^{-z}}$, and the $w$, $b$ on the right-hand side are the current iterates $w^t$, $b^t$.
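The update rule can be sketched in plain Python as below. This is only a minimal illustration: the toy data set, the step size `eta=0.1`, and the step count are assumptions chosen for the example, not values from the text.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function sigma(z) = 1 / (1 + e^{-z}).
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def logistic_gd(X, y, eta=0.1, steps=2000):
    """Batch gradient descent for logistic regression with labels in {0, 1}.

    eta and steps are illustrative defaults (assumptions), not tuned values.
    """
    d = len(X[0])
    w = [0.0] * d
    b = 0.0
    for _ in range(steps):
        grad_w = [0.0] * d
        grad_b = 0.0
        for xi, yi in zip(X, y):
            # err = sigma(w^T x_i + b) - y_i, the shared factor in both updates.
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - eta * gj for wj, gj in zip(w, grad_w)]
        b -= eta * grad_b
    return w, b

# Hypothetical one-feature data set: small x has label 0, large x has label 1.
X = [[0.0], [1.0], [2.0], [3.0]]
y = [0, 0, 1, 1]
w, b = logistic_gd(X, y)
probs = [sigmoid(w[0] * xi[0] + b) for xi in X]
print(probs)
```

The predicted probabilities should fall below 0.5 for the first two points and above 0.5 for the last two.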

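Complexity of Gradient Descent
Q: What is the complexity of gradient descent?

Quicksort (on average) and merge sort have time complexity $O(n\log n)$. What about gradient descent? It is an iterative process, so the question is how many iterations it takes to guarantee a near-optimal solution, and how that depends on the initial value.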

Convergence analysis of gradient descent: Gradient Descent Algorithm

The gradient descent procedure can be written as

  • Choose an initial point $x_0\in R^d$ and a step size $\eta_t>0$
  • for $i=0,1,\dots$

$$
x_{i+1}=x_i-\eta_t\nabla f(x_i)
$$
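The loop above translates almost line for line into code. The sketch below assumes a fixed step size and a fixed number of steps. The objective (a simple quadratic with minimizer $(3,-1)$) is a hypothetical example, not something from the text.

```python
def gradient_descent(grad, x0, eta, steps):
    """Run x_{i+1} = x_i - eta * grad(x_i) for a fixed number of steps."""
    x = list(x0)
    for _ in range(steps):
        g = grad(x)
        x = [xj - eta * gj for xj, gj in zip(x, g)]
    return x

# Hypothetical objective: f(x) = (x_0 - 3)^2 + 2 * (x_1 + 1)^2, minimized at (3, -1).
def grad_f(x):
    return [2.0 * (x[0] - 3.0), 4.0 * (x[1] + 1.0)]

x_final = gradient_descent(grad_f, x0=[0.0, 0.0], eta=0.1, steps=200)
print(x_final)
```

For this quadratic the gradient is Lipschitz with $L=4$, so `eta=0.1` satisfies the step-size condition $\eta_t\leq 1/L$ used in the analysis.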

Convergence Analysis of Gradient Descent

Theorem 1

Suppose the function $f$ is convex and satisfies the L-Lipschitz condition, and let $x^*=\arg\min_x f(x)$. Then for a step size $\eta_t\leq\frac{1}{L}$, the iterates satisfy
$$
f(x_k)\leq f(x^*)+\frac{\|x_0-x^*\|_2^2}{2\eta_t k}
$$

After $k=\frac{L\|x_0-x^*\|_2^2}{\epsilon}$ iterations with $\eta_t=1/L$, we are guaranteed an $\epsilon$-approximate optimal value.
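As a quick arithmetic check of the theorem, the sketch below computes the iteration count $k$ and the resulting bound on $f(x_k)-f(x^*)$ for $\eta_t=1/L$. The numbers ($L=4$, $\|x_0-x^*\|^2=10$, $\epsilon=0.5$) are made up for illustration.

```python
import math

def iteration_count(L, dist_sq, eps):
    # k = L * ||x_0 - x*||^2 / eps, rounded up to an integer.
    return math.ceil(L * dist_sq / eps)

def bound_gap(L, dist_sq, k):
    # Right-hand side ||x_0 - x*||^2 / (2 * eta * k) with eta = 1 / L.
    eta = 1.0 / L
    return dist_sq / (2.0 * eta * k)

L, dist_sq, eps = 4.0, 10.0, 0.5   # hypothetical values
k = iteration_count(L, dist_sq, eps)
gap = bound_gap(L, dist_sq, k)
print(k, gap)
```

With these values $k=80$ and the bound evaluates to $0.25=\epsilon/2$, which is within the target $\epsilon$.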

Error reduction

Algorithm A: $f(x_{10})\leq f(x^*)+0.1$, $f(x_{100})\leq f(x^*)+0.0001$

Algorithm B: $f(x_{10})\leq f(x^*)+0.1$, $f(x_{100})\leq f(x^*)+0.00000000000001$

Convexity and the L-Lipschitz condition

$$
f(x_k)\leq f(x^*)+\frac{\|x_0-x^*\|_2^2}{2\eta_t k},\qquad k=\frac{L\|x_0-x^*\|_2^2}{\epsilon}
$$

Substituting $k=\frac{L\|x_0-x^*\|_2^2}{\epsilon}$ into the bound, with $\eta_t=\frac{1}{L}$, gives
$$
f(x_k)\leq f(x^*)+O(\epsilon)
$$

$f(x)$ is a convex function: for any $x,y\in R^d$ and $0\leq\lambda\leq1$, $f(\lambda x+(1-\lambda)y)\leq\lambda f(x)+(1-\lambda)f(y)$. For a differentiable $f$ this gives first-order convexity: $f(x)+\nabla f(x)^T(y-x)\leq f(y)$

The L-Lipschitz condition and theorem

A smooth function $f$ satisfies the L-Lipschitz condition if, for any $x,y\in R^d$, we have
$$
\|\nabla f(x)-\nabla f(y)\|\leq L\|x-y\|
$$

e.g., linear regression

$$
f(w)=\frac{1}{n}\|Xw-y\|^2
$$

$$
\begin{aligned}
\|\nabla f(w_1)-\nabla f(w_2)\|&=\frac{2}{n}\|X^T(Xw_1-y)-X^T(Xw_2-y)\|\\
&=\frac{2}{n}\|X^TX(w_1-w_2)\|\\
&\leq\underbrace{\frac{2}{n}\|X^TX\|}_{L}\cdot\|w_1-w_2\|
\end{aligned}
$$
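To make the linear regression example concrete, the sketch below checks the inequality numerically in the simplest case of a single feature, where $X^TX$ is just the scalar $\sum_i x_i^2$. The data and the two weights are hypothetical.

```python
def grad_mse(xs, ys, w):
    """Gradient of f(w) = (1/n) * sum_i (x_i * w - y_i)^2 for a scalar weight w."""
    n = len(xs)
    return (2.0 / n) * sum(x * (x * w - y) for x, y in zip(xs, ys))

def lipschitz_constant(xs):
    """L = (2/n) * X^T X, a scalar when there is only one feature."""
    n = len(xs)
    return (2.0 / n) * sum(x * x for x in xs)

# Hypothetical data and two arbitrary weights.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 7.0]
L = lipschitz_constant(xs)
w1, w2 = 0.5, -1.5
lhs = abs(grad_mse(xs, ys, w1) - grad_mse(xs, ys, w2))
rhs = L * abs(w1 - w2)
print(lhs, rhs)
```

In the one-feature case the bound holds with equality (up to floating-point rounding), because the gradient is linear in $w$ with slope exactly $L$.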

Suppose a function satisfies the L-Lipschitz condition and is convex. Then for any $x,y\in R^d$ we have

$$
f(y)\leq f(x)+\nabla f(x)^T(y-x)+\frac{L}{2}\|y-x\|^2
$$

For a differentiable function $h$ of one variable:

$$
h(1)=h(0)+\int_0^1 h'(\tau)\,d\tau
$$

Define the function $h(\tau)=f(x+\tau(y-x))$, so that

$$
h(1)=f(y),\quad h(0)=f(x)
$$

$$
f(y)=f(x)+\int_0^1\nabla f(x+\tau(y-x))^T(y-x)\,d\tau
$$

$$
\begin{aligned}
f(y)&=f(x)+\int_0^1\nabla f(x+\tau(y-x))^T(y-x)\,d\tau\\
&=f(x)+\nabla f(x)^T(y-x)+\int_0^1\big(\nabla f(x+\tau(y-x))-\nabla f(x)\big)^T(y-x)\,d\tau\\
&\leq f(x)+\nabla f(x)^T(y-x)+\int_0^1 L\|\tau(y-x)\|\cdot\|y-x\|\,d\tau\\
&=f(x)+\nabla f(x)^T(y-x)+\frac{L}{2}\|y-x\|^2
\end{aligned}
$$

The inequality step uses the Cauchy-Schwarz inequality together with the L-Lipschitz condition:

$$
a^Tb\leq\|a\|\cdot\|b\|
$$

Derivation

$$
f(y)\leq f(x)+\nabla f(x)^T(y-x)+\frac{L}{2}\|y-x\|^2
$$

Now apply this with $y=x_{i+1}$ and $x=x_i$:
$$
\begin{aligned}
f(x_{i+1})&\leq f(x_i)+\nabla f(x_i)^T(x_{i+1}-x_i)+\frac{L}{2}\|x_{i+1}-x_i\|^2\\
&=f(x_i)-\eta_t\nabla f(x_i)^T\nabla f(x_i)+\frac{L}{2}\|\eta_t\nabla f(x_i)\|^2\\
&=f(x_i)-\eta_t\|\nabla f(x_i)\|^2+\frac{L\eta_t^2}{2}\|\nabla f(x_i)\|^2\\
&=f(x_i)-\eta_t\left(1-\frac{L\eta_t}{2}\right)\|\nabla f(x_i)\|^2\\
&\leq f(x_i)-\frac{\eta_t}{2}\|\nabla f(x_i)\|^2
\end{aligned}
$$
Note: the last step uses

$$
\eta_t\leq\frac{1}{L}
$$

so that $1-\frac{L\eta_t}{2}\geq\frac{1}{2}$.
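The per-step decrease $f(x_{i+1})\leq f(x_i)-\frac{\eta_t}{2}\|\nabla f(x_i)\|^2$ can be checked numerically. The sketch below does so on a hypothetical quadratic whose gradient is Lipschitz with $L=4$, using $\eta_t=1/L$. The objective and the starting point are assumptions for illustration.

```python
# Hypothetical objective with L = 4 (largest second derivative).
def f(x):
    return (x[0] - 3.0) ** 2 + 2.0 * (x[1] + 1.0) ** 2

def grad_f(x):
    return [2.0 * (x[0] - 3.0), 4.0 * (x[1] + 1.0)]

L = 4.0
eta = 1.0 / L
x = [0.0, 0.0]
decrease_ok = True
for _ in range(20):
    g = grad_f(x)
    x_next = [xj - eta * gj for xj, gj in zip(x, g)]
    g_sq = sum(gj * gj for gj in g)
    # Check f(x_{i+1}) <= f(x_i) - (eta / 2) * ||grad f(x_i)||^2 (small float slack).
    if f(x_next) > f(x) - 0.5 * eta * g_sq + 1e-12:
        decrease_ok = False
    x = x_next
print(decrease_ok)
```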

Derivation

$$
\begin{aligned}
f(x_{i+1})&\leq f(x_i)-\frac{\eta_t}{2}\|\nabla f(x_i)\|^2\\
&\leq f(x^*)+\nabla f(x_i)^T(x_i-x^*)-\frac{\eta_t}{2}\|\nabla f(x_i)\|^2\\
&=f(x^*)+\frac{(x_i-x_{i+1})^T}{\eta_t}(x_i-x^*)-\frac{1}{2\eta_t}\|x_i-x_{i+1}\|^2\\
&=f(x^*)+\frac{1}{2\eta_t}\|x_i-x^*\|^2-\frac{1}{2\eta_t}\Big(\|x_i-x^*\|^2-2\eta_t\nabla f(x_i)^T(x_i-x^*)+\|\eta_t\nabla f(x_i)\|^2\Big)\\
&=f(x^*)+\frac{1}{2\eta_t}\|x_i-x^*\|^2-\frac{1}{2\eta_t}\|x_i-x^*-\eta_t\nabla f(x_i)\|^2\\
&=f(x^*)+\frac{1}{2\eta_t}\Big(\|x_i-x^*\|^2-\|x_{i+1}-x^*\|^2\Big)
\end{aligned}
$$

Note: the second line uses first-order convexity, $f(x_i)\leq f(x^*)+\nabla f(x_i)^T(x_i-x^*)$. The remaining steps use

$$
(a+b)^2=a^2+2ab+b^2,\qquad x_{i+1}=x_i-\eta_t\nabla f(x_i)
$$

Derivation

$$
\begin{aligned}
f(x_{i+1})&\leq f(x^*)+\frac{1}{2\eta_t}\Big(\|x_i-x^*\|^2-\|x_{i+1}-x^*\|^2\Big)\\
f(x_{i+1})-f(x^*)&\leq\frac{1}{2\eta_t}\Big(\|x_i-x^*\|^2-\|x_{i+1}-x^*\|^2\Big)
\end{aligned}
$$

Substituting $i=0,1,\dots,k-1$:

$$
\begin{aligned}
f(x_1)-f(x^*)&\leq\frac{1}{2\eta_t}\Big(\|x_0-x^*\|^2-\|x_1-x^*\|^2\Big)\\
f(x_2)-f(x^*)&\leq\frac{1}{2\eta_t}\Big(\|x_1-x^*\|^2-\|x_2-x^*\|^2\Big)\\
f(x_3)-f(x^*)&\leq\frac{1}{2\eta_t}\Big(\|x_2-x^*\|^2-\|x_3-x^*\|^2\Big)\\
&\;\;\vdots\\
f(x_k)-f(x^*)&\leq\frac{1}{2\eta_t}\Big(\|x_{k-1}-x^*\|^2-\|x_k-x^*\|^2\Big)
\end{aligned}
$$

Summing these up (the right-hand side telescopes):
$$
\begin{aligned}
\sum_{i=1}^k f(x_i)-k f(x^*)&\leq\frac{1}{2\eta_t}\Big(\|x_0-x^*\|^2-\|x_k-x^*\|^2\Big)\\
\sum_{i=1}^k f(x_i)-k\cdot f(x^*)&\leq\frac{1}{2\eta_t}\|x_0-x^*\|^2
\end{aligned}
$$

Note: the following step holds for gradient descent, where $f(x_i)$ never increases. It does not hold otherwise, in particular not for stochastic gradient descent.

$$
\begin{aligned}
f(x_{i+1})&\leq f(x_i)-\frac{\eta_t}{2}\|\nabla f(x_i)\|^2\\
f(x_{i+1})&\leq f(x_i)\\
f(x_k)\leq f(x_{k-1})&\leq f(x_{k-2})\leq\dots\leq f(x_0)\\
k\cdot f(x_k)-k f(x^*)&\leq\sum_{i=1}^k f(x_i)-k f(x^*)\leq\frac{1}{2\eta_t}\|x_0-x^*\|^2\\
f(x_k)-f(x^*)&\leq\frac{1}{2\eta_t k}\|x_0-x^*\|^2
\end{aligned}
$$
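The final bound and the telescoped sum can both be checked numerically. The sketch below runs gradient descent on a hypothetical quadratic with known minimizer $x^*=(3,-1)$ and $f(x^*)=0$, and compares $f(x_k)-f(x^*)$ against $\frac{\|x_0-x^*\|^2}{2\eta_t k}$ at every step. The objective, the step size `eta=0.1`, and the 50-step horizon are assumptions for illustration.

```python
# Hypothetical objective with minimizer x* = (3, -1), f(x*) = 0, and L = 4.
def f(x):
    return (x[0] - 3.0) ** 2 + 2.0 * (x[1] + 1.0) ** 2

def grad_f(x):
    return [2.0 * (x[0] - 3.0), 4.0 * (x[1] + 1.0)]

x_star = [3.0, -1.0]
x0 = [0.0, 0.0]
eta = 0.1                     # satisfies eta <= 1 / L = 0.25
dist_sq = sum((a - b) ** 2 for a, b in zip(x0, x_star))

x = list(x0)
gap_sum = 0.0                 # running sum of f(x_i) - f(x*), i = 1..k
bound_holds = True
monotone = True
prev = f(x)
for k in range(1, 51):
    x = [xj - eta * gj for xj, gj in zip(x, grad_f(x))]
    gap = f(x) - f(x_star)
    gap_sum += gap
    if gap > dist_sq / (2.0 * eta * k):      # final bound on f(x_k) - f(x*)
        bound_holds = False
    if f(x) > prev:                          # f(x_i) must never increase
        monotone = False
    prev = f(x)

sum_bound_holds = gap_sum <= dist_sq / (2.0 * eta)   # telescoped sum
print(bound_holds, monotone, sum_bound_holds)
```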

The problematic case is the stochastic one (stochastic gradient descent).

Convergence derivation
Linear classifier

$$
D=\{(x_1,y_1),(x_2,y_2),\dots,(x_n,y_n)\},\quad y_i\in\{-1,1\},\quad i=1,2,\dots,n
$$

For a linear model with parameters $\theta=\{w,b\}$, the decision boundary is $w^Tx+b=0$

$$
\begin{aligned}
w^Tx_i+b&\geq 0,\quad y_i=1\\
w^Tx_i+b&\leq 0,\quad y_i=-1
\end{aligned}
$$

These two conditions can be combined into
$$
(w^Tx_i+b)\cdot y_i\geq 0
$$
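The combined condition is easy to test in code. The sketch below checks $(w^Tx_i+b)\cdot y_i\geq 0$ for every sample. The function name `correctly_classified` and the toy data are hypothetical, introduced only for this example.

```python
def correctly_classified(w, b, X, y):
    """True if (w^T x_i + b) * y_i >= 0 for every sample; labels are in {-1, +1}."""
    return all(
        (sum(wj * xj for wj, xj in zip(w, xi)) + b) * yi >= 0
        for xi, yi in zip(X, y)
    )

# Hypothetical 2-D data set that the line x_0 + x_1 = 0 separates.
X = [[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-3.0, 0.5]]
y = [1, 1, -1, -1]

good = correctly_classified([1.0, 1.0], 0.0, X, y)     # separating direction
bad = correctly_classified([-1.0, -1.0], 0.0, X, y)    # flipped direction
print(good, bad)
```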

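Please point out any mistakes. The derivations are not guaranteed to be correct. This is purely a hobby, a way of recording the good things in life, one small step at a time!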


  1. Note the two conditions.
