Week 12: Constrained GD, Proximal gradient method
1 Constrained descent
1.1 Projected gradient descent (PGD)
Problem
$$\begin{aligned} \min_x \quad & f(x)\\ \text{s.t.} \quad & x\in C \end{aligned}$$
Algorithm
Then
$$x_{t+1}=P_C(x_t-\eta \nabla f(x_t))$$
Convergence
- Smooth: $O(1/t)$
- Smooth and strongly convex: $O\left(\left(1-\frac{m}{M}\right)^t\right)$
- Step size $\eta=\frac{1}{M}$
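The update above can be sketched in a few lines. The quadratic objective, ball constraint, and step size in this example are illustrative assumptions, not from the notes:

```python
import numpy as np

def project_ball(x, radius=1.0):
    """Euclidean projection P_C onto the ball C = {x : ||x|| <= radius}."""
    norm = np.linalg.norm(x)
    return x if norm <= radius else x * (radius / norm)

def projected_gradient_descent(grad_f, project, x0, eta, steps):
    """Iterate x_{t+1} = P_C(x_t - eta * grad f(x_t))."""
    x = x0
    for _ in range(steps):
        x = project(x - eta * grad_f(x))
    return x

# Toy instance (assumed for illustration): min ||x - c||^2 s.t. ||x|| <= 1,
# with c outside the ball so the constraint is active at the optimum.
c = np.array([3.0, 0.0])
grad_f = lambda x: 2.0 * (x - c)   # gradient is M-Lipschitz with M = 2
x_star = projected_gradient_descent(grad_f, project_ball,
                                    np.zeros(2), eta=0.5, steps=50)
# the solution is the projection of c onto the unit ball, i.e. [1, 0]
```

With $\eta = 1/M$ the gradient step lands exactly on $c$, so a single projection already reaches the optimum here.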
1.2 Frank-Wolfe method
Algorithm
$$S_{k-1}=\arg\min_{s\in C}\nabla f(x_{k-1})^T(s-x_{k-1})=\arg\min_{s\in C}\nabla f(x_{k-1})^T s$$
$$x_k=(1-r_k)x_{k-1}+r_k S_{k-1}$$
- No projection is needed
- Step size $r_k=\frac{2}{k+1}$
- $x_k$ is always feasible (a convex combination of feasible points)
- The objective defining $S_{k-1}$ is linear in $s$
- Affine invariant
Convergence
$$f(x_k)-f(x^*)\leq \frac{2m}{k+1}$$
where $m$ is the curvature parameter measuring the non-linearity of $f$ over $C$: the more non-linear $f$ is, the larger $m$.
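The method can be sketched as below. The $\ell_1$-ball constraint (whose linear minimization oracle returns a signed vertex) and the quadratic objective are assumptions for illustration:

```python
import numpy as np

def lmo_l1_ball(grad, radius=1.0):
    """Linear minimization oracle over {s : ||s||_1 <= radius}: the linear
    objective grad^T s is minimized at a signed vertex of the l1 ball."""
    i = np.argmax(np.abs(grad))
    s = np.zeros_like(grad)
    s[i] = -radius * np.sign(grad[i])
    return s

def frank_wolfe(grad_f, lmo, x0, iters):
    x = x0
    for k in range(1, iters + 1):
        s = lmo(grad_f(x))
        r = 2.0 / (k + 1)           # r_k = 2/(k+1)
        x = (1 - r) * x + r * s     # convex combination: x_k stays feasible
    return x

# Toy instance (assumed): min ||x - c||^2 s.t. ||x||_1 <= 1
c = np.array([2.0, 0.5])
x_fw = frank_wolfe(lambda x: 2.0 * (x - c), lmo_l1_ball, np.zeros(2), iters=100)
```

Each iterate is a convex combination of $\ell_1$-ball vertices, so no projection is ever computed; the per-step work is just one `argmax`.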
Examples
- 1-norm constraints
- Polytope constraints

see note
2 Proximal gradient method
2.1 Motivation
Accelerate the slow convergence of subgradient methods on nonsmooth objectives, for decomposable functions
$$f(x)=g(x)+h(x)$$
where $g(x)$ is convex and smooth (with $M$-Lipschitz gradient), and $h(x)$ is convex and possibly nonsmooth, but separable.
2.2 Idea of proximal gradient
$$x_+=\arg\min_y\ g(x)+\nabla g(x)^T(y-x)+\frac{1}{2\eta}\|y-x\|^2+h(y)$$
Here $g$ is approximated by its quadratic expansion around $x$, while $h$ is kept exactly. Completing the square gives
$$x_+=\arg\min_y\ \frac{1}{2\eta}\|y-(x-\eta \nabla g(x))\|^2+h(y)$$
$$x_+(u)=\arg\min_y\ \frac{1}{2\eta}\|y-u\|^2+h(y)$$
2.3 Proximal gradient
$$\mathrm{Prox}_{\eta h}(u)=\arg\min_y\ \frac{1}{2\eta}\|y-u\|^2+h(y)$$
$$x_+=\mathrm{Prox}_{\eta h}(x-\eta \nabla g(x))$$
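Combining the prox operator with the gradient step gives the full method. A minimal sketch on an assumed lasso instance, $g(x)=\frac{1}{2}\|Ax-b\|^2$ and $h(x)=\lambda\|x\|_1$ (i.e. ISTA); the data, $\lambda$, and iteration count are made up for illustration:

```python
import numpy as np

def soft_threshold(u, t):
    """Prox of t * ||.||_1: componentwise soft-thresholding."""
    return np.sign(u) * np.maximum(np.abs(u) - t, 0.0)

def proximal_gradient(grad_g, prox_h, x0, eta, steps):
    """Iterate x_+ = Prox_{eta h}(x - eta * grad g(x))."""
    x = x0
    for _ in range(steps):
        x = prox_h(x - eta * grad_g(x), eta)
    return x

# Assumed lasso instance: g(x) = 0.5*||Ax - b||^2, h(x) = lam*||x||_1
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
b = A @ np.array([1.0, 0.0, -2.0, 0.0, 0.0]) + 0.01 * rng.standard_normal(20)
lam = 0.5
grad_g = lambda x: A.T @ (A @ x - b)
eta = 1.0 / np.linalg.norm(A, 2) ** 2           # eta = 1/M, M = ||A||_2^2
prox = lambda u, t: soft_threshold(u, lam * t)  # Prox of t*lam*||.||_1
x_hat = proximal_gradient(grad_g, prox, np.zeros(5), eta, steps=2000)
```

A fixed point of the update satisfies the optimality condition of the composite problem, which gives a simple convergence check: applying one more prox-gradient step should leave `x_hat` unchanged.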
2.4 Convergence
- Smooth and convex: with step size $\eta<\frac{1}{M}$, $f(x_T)-f^*\leq O\left(\frac{1}{T}\right)$, i.e. $O\left(\frac{1}{\varepsilon}\right)$ iterations for $\varepsilon$-accuracy
- With acceleration: $O\left(\frac{1}{T^2}\right)$, i.e. $O\left(\frac{1}{\sqrt{\varepsilon}}\right)$ iterations
- Smooth and strongly convex: $O(C^T)$ for some $C<1$, i.e. $O\left(\log\frac{1}{\varepsilon}\right)$ iterations
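The accelerated rate is obtained by taking the prox step at an extrapolated point rather than at the current iterate (FISTA-style momentum). The extrapolation weight $(k-1)/(k+2)$ and the toy $\ell_1$-regularized quadratic below are common choices assumed for illustration:

```python
import numpy as np

def soft_threshold(u, t):
    """Prox of t * ||.||_1 (componentwise)."""
    return np.sign(u) * np.maximum(np.abs(u) - t, 0.0)

def accelerated_prox_grad(grad_g, prox_h, x0, eta, steps):
    """Prox-gradient step taken at an extrapolated point y instead of x."""
    x_prev, y = x0, x0
    for k in range(1, steps + 1):
        x = prox_h(y - eta * grad_g(y), eta)
        y = x + (k - 1.0) / (k + 2.0) * (x - x_prev)  # momentum extrapolation
        x_prev = x
    return x_prev

# Toy instance (assumed): g(x) = 0.5*||x - c||^2, h(x) = 0.2*||x||_1
c = np.array([1.0, -0.05])
x_acc = accelerated_prox_grad(lambda x: x - c,
                              lambda u, t: soft_threshold(u, 0.2 * t),
                              np.zeros(2), eta=1.0, steps=200)
# closed-form solution here: soft_threshold(c, 0.2) = [0.8, 0.0]
```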
2.5 Examples
Prox gradient as a generalization of projected gradient descent (PGD): taking $h$ to be the indicator function of $C$ makes the prox operator the projection $P_C$.
Example of prox operator: L1 norm, $h(y)=\|y\|_1$ (componentwise soft-thresholding)
$$[\mathrm{Prox}_{\eta h}(u)]_i=\begin{cases} u_i-\eta, & u_i>\eta\\ 0, & -\eta\leq u_i\leq \eta\\ u_i+\eta, & u_i<-\eta \end{cases}$$
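The piecewise formula above shrinks each component toward zero by $\eta$ and zeroes out anything within $[-\eta,\eta]$; a quick numeric check (the test vector is arbitrary):

```python
import numpy as np

def soft_threshold(u, eta):
    """Componentwise prox of eta*||.||_1, matching the piecewise formula."""
    return np.sign(u) * np.maximum(np.abs(u) - eta, 0.0)

u = np.array([2.0, 0.3, -1.5])
v = soft_threshold(u, 0.5)  # gives [1.5, 0.0, -1.0]: one case per component
```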
Example of prox operator: Quadratic
Example of prox operator: indicator function (prox = projection)
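For the last example: when $h$ is the indicator of $C$ (zero on $C$, $+\infty$ outside), the $h(y)$ term forces $y\in C$, so the prox minimizes $\|y-u\|^2$ over $C$, which is exactly the projection $P_C(u)$, independent of $\eta$. A sketch with the unit ball as an assumed $C$:

```python
import numpy as np

def prox_indicator_unit_ball(u, eta):
    """Prox of eta * indicator{||y|| <= 1}: reduces to the projection,
    and the result does not depend on eta."""
    n = np.linalg.norm(u)
    return u if n <= 1.0 else u / n

u = np.array([3.0, 4.0])               # ||u|| = 5, outside the ball
p = prox_indicator_unit_ball(u, 0.1)   # [0.6, 0.8], same for any eta
```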