Optimization Week 12: Proximal gradient method and Newton method

1 Constrained descent

1.1 Projected gradient descent (PGD)

Problem

$$\begin{aligned} \min_x \quad & f(x)\\ \text{s.t.} \quad & x\in C \end{aligned}$$

Algorithm

Take a gradient step, then project back onto the feasible set $C$:
$$x_{t+1}=P_C(x_t-\eta \nabla f(x_t))$$

Convergence

  • Smooth: $O(1/t)$
  • Smooth and strongly convex: $O\!\left(\left(1-\frac{m}{M}\right)^t\right)$
  • Step size $\eta=\frac{1}{M}$
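
A minimal NumPy sketch of the PGD update, assuming for illustration that $C$ is the Euclidean ball of radius $R$ (so the projection has a simple closed form); `grad_f`, `R`, and the step size `eta` are placeholders, not part of the notes.

```python
import numpy as np

def project_ball(x, R):
    """Euclidean projection onto C = {x : ||x||_2 <= R}."""
    norm = np.linalg.norm(x)
    return x if norm <= R else (R / norm) * x

def projected_gradient_descent(grad_f, x0, eta, R, n_iters=100):
    """PGD: x_{t+1} = P_C(x_t - eta * grad_f(x_t))."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iters):
        x = project_ball(x - eta * grad_f(x), R)
    return x

# Usage: minimize f(x) = 0.5*||x - b||^2 over the unit ball.
b = np.array([2.0, -1.0])
x_hat = projected_gradient_descent(lambda x: x - b, x0=np.zeros(2), eta=0.5, R=1.0)
```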

1.2 Frank Wolfe method

Algorithm

$$s_{k-1}=\argmin_{s\in C}\nabla f(x_{k-1})^T(s-x_{k-1})=\argmin_{s\in C}\nabla f(x_{k-1})^T s$$
$$x_k=(1-r_k)x_{k-1}+r_k s_{k-1}$$

  • No projection is required
  • $r_k=\frac{2}{k+1}$
  • $x_k$ is always feasible (a convex combination of points in $C$)
  • The subproblem defining $s_{k-1}$ has a linear objective
  • Affine invariant

Convergence

$$f(x_k)-f(x^*)\leq \frac{2m}{k+1}$$
where $m$ is the curvature parameter measuring the non-linearity of $f$: the more non-linear $f$ is, the larger $m$.

Examples

  • 1-norm constraints
  • Polytope constraints

See the notes for details; a sketch of Frank-Wolfe on the 1-norm ball follows below.
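
A minimal sketch of Frank-Wolfe for the 1-norm constraint $C=\{x:\|x\|_1\le\tau\}$, where the linear subproblem has a closed-form solution (a signed, scaled coordinate vector at the largest-magnitude gradient entry); `grad_f` and `tau` are illustrative placeholders, not part of the notes.

```python
import numpy as np

def lmo_l1_ball(grad, tau):
    """Linear minimization oracle over {s : ||s||_1 <= tau}:
    argmin_s grad^T s is attained at a signed, scaled coordinate vector."""
    i = np.argmax(np.abs(grad))
    s = np.zeros_like(grad)
    s[i] = -tau * np.sign(grad[i])
    return s

def frank_wolfe(grad_f, x0, tau, n_iters=100):
    """Frank-Wolfe: x_k = (1 - r_k) x_{k-1} + r_k s_{k-1}, with r_k = 2/(k+1)."""
    x = np.asarray(x0, dtype=float)
    for k in range(1, n_iters + 1):
        s = lmo_l1_ball(grad_f(x), tau)
        r = 2.0 / (k + 1)
        x = (1 - r) * x + r * s  # convex combination, so x stays feasible
    return x

# Usage: least squares restricted to the 1-norm ball of radius tau.
A = np.array([[1.0, 2.0], [3.0, 4.0]])
b = np.array([1.0, 2.0])
x_fw = frank_wolfe(lambda x: A.T @ (A @ x - b), x0=np.zeros(2), tau=1.0)
```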

2 Proximal gradient method

2.1 Motivation

Accelerate the slow convergence of subgradient methods on nonsmooth objectives, for functions that decompose as
$$f(x)=g(x)+h(x)$$
where $g(x)$ is convex and smooth (with $M$-Lipschitz gradient) and $h(x)$ is convex, not smooth, but separable.

2.2 Idea of proximal gradient

$$x_+=\argmin_y\; g(x)+\nabla g(x)^T(y-x)+\frac{1}{2\eta}\|y-x\|^2+h(y)$$
Here $g(x)$ is replaced by its quadratic approximation around $x$, while $h$ is used directly. Completing the square and dropping terms that do not depend on $y$ gives
$$x_+=\argmin_y\; \frac{1}{2\eta}\|y-(x-\eta \nabla g(x))\|^2+h(y)$$
so the minimizer depends on $x$ only through the gradient step $u=x-\eta\nabla g(x)$:
$$x_+(u)=\argmin_y\; \frac{1}{2\eta}\|y-u\|^2+h(y)$$

2.3 Proximal gradient

$$\mathrm{Prox}_{\eta h}(u)=\argmin_y\; \frac{1}{2\eta}\|y-u\|^2+h(y)$$
$$x_+=\mathrm{Prox}_{\eta h}(x-\eta \nabla g(x))$$
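
A minimal sketch of the resulting iteration on a lasso-style objective $g(x)=\frac{1}{2}\|Ax-b\|^2$, $h(x)=\lambda\|x\|_1$, using the soft-thresholding prox from the L1 example in 2.5 below; `A`, `b`, `lam`, and the step-size choice are illustrative assumptions, not part of the notes.

```python
import numpy as np

def prox_l1(u, t):
    """Prox of t*||.||_1: componentwise soft thresholding."""
    return np.sign(u) * np.maximum(np.abs(u) - t, 0.0)

def proximal_gradient(A, b, lam, n_iters=200):
    """Minimize 0.5*||Ax - b||^2 + lam*||x||_1 via
    x_+ = Prox_{eta*h}(x - eta * grad g(x))."""
    x = np.zeros(A.shape[1])
    eta = 1.0 / np.linalg.norm(A, 2) ** 2   # eta <= 1/M, with M = ||A||_2^2
    for _ in range(n_iters):
        grad_g = A.T @ (A @ x - b)          # gradient of the smooth part g
        x = prox_l1(x - eta * grad_g, eta * lam)
    return x

# Usage on a small synthetic problem with a sparse ground truth.
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
b = A @ np.array([1.0, 0.0, -2.0, 0.0, 0.0]) + 0.01 * rng.standard_normal(20)
x_hat = proximal_gradient(A, b, lam=0.1)
```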

2.4 Convergence

Smooth and convex:
With step size $\eta \le \frac{1}{M}$: $f(x_T)-f^*\leq O\!\left(\frac{1}{T}\right)$, i.e. $O\!\left(\frac{1}{\varepsilon}\right)$ iterations to reach accuracy $\varepsilon$.

Acceleration (e.g. Nesterov/FISTA):
$O\!\left(\frac{1}{T^2}\right)$, i.e. $O\!\left(\frac{1}{\sqrt{\varepsilon}}\right)$ iterations.

Smooth and strongly convex:
$O(c^T)$ with $c<1$ (linear/geometric rate), i.e. $O\!\left(\log\frac{1}{\varepsilon}\right)$ iterations.

2.5 Examples

Proximal gradient as a generalization of projected gradient descent (PGD): see the indicator example below.
12.4 Example of prox operator: L1 norm, $h(y)=\|y\|_1$. Componentwise,

$$[\mathrm{Prox}_{\eta h}(u)]_i=\begin{cases} u_i-\eta, & u_i>\eta\\ 0, & -\eta\leq u_i\leq \eta\\ u_i+\eta, & u_i<-\eta \end{cases}$$
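
A short derivation of the cases above from the coordinate-wise prox subproblem, using subgradient optimality (a sketch under the definitions already given):
$$[\mathrm{Prox}_{\eta h}(u)]_i=\argmin_{y_i}\;\frac{1}{2\eta}(y_i-u_i)^2+|y_i|,\qquad 0\in \frac{1}{\eta}(y_i-u_i)+\partial|y_i|\;\Leftrightarrow\; u_i\in y_i+\eta\,\partial|y_i|.$$
If $y_i>0$ then $y_i=u_i-\eta$ (requires $u_i>\eta$); if $y_i<0$ then $y_i=u_i+\eta$ (requires $u_i<-\eta$); otherwise $y_i=0$ with $|u_i|\le\eta$.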

Example of prox operator: quadratic $h$.
Example of prox operator: indicator function of $C$; its prox is the projection onto $C$.
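
Worked forms for these two examples (a sketch, assuming $Q\succeq 0$ for the quadratic case). For the quadratic $h(y)=\frac{1}{2}y^TQy+b^Ty$:
$$\mathrm{Prox}_{\eta h}(u)=\argmin_y\;\frac{1}{2\eta}\|y-u\|^2+\frac{1}{2}y^TQy+b^Ty=(I+\eta Q)^{-1}(u-\eta b).$$
For the indicator $h(y)=I_C(y)$ ($0$ on $C$, $+\infty$ otherwise):
$$\mathrm{Prox}_{\eta h}(u)=\argmin_{y\in C}\;\frac{1}{2\eta}\|y-u\|^2=P_C(u),$$
so $x_+=\mathrm{Prox}_{\eta h}(x-\eta\nabla g(x))=P_C(x-\eta\nabla g(x))$, which is exactly the PGD update; this is the sense in which proximal gradient generalizes PGD.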
