Week 11: Coordinate descent, Subgradient
- 1 Coordinate descent
- 2 Subgradients
- 3 Gradient, subgradient, proximal
- 3.1 Convex: Subgradient method, $O(\frac{1}{\sqrt{t}})$, $O(\frac{1}{\varepsilon^2})$
- 3.2 Convex + decomposable into a smooth part plus nonsmooth but separable parts: Proximal gradient descent, $O(\frac{1}{t})$, $O(\frac{1}{\varepsilon})$
- 3.3 Convex + smooth: Gradient descent, $O(\frac{1}{t})$, $O(\frac{1}{\varepsilon})$
- 3.4 Strongly convex: Subgradient descent, $O(\frac{1}{t})$, $O(\frac{1}{\varepsilon})$
- 3.5 Smooth + strongly convex: Gradient descent, $O((1-m/M)^t)$, $O(\log(1/\varepsilon))$
- If the function is not smooth, use the subgradient method, or proximal gradient if it is decomposable
1 Coordinate descent
1.1 Will it be optimal?
A point $x$ is coordinate-wise optimal when $f(x+\delta e_i)\geq f(x),\ \forall e_i,\delta$.
- Optimal when $f$ is convex and smooth
- Not necessarily optimal when $f$ is nonconvex and smooth
- Not necessarily optimal when $f$ is convex and nonsmooth
- Optimal when $f$ can be decomposed into a convex smooth function plus convex, nonsmooth, and separable functions (see the example below)
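A standard instance of the last case (the canonical example, added here): the lasso objective $f(x)=\frac{1}{2}\|Ax-b\|_2^2+\lambda\|x\|_1$, where the first term is convex and smooth and $\lambda\|x\|_1=\lambda\sum_i|x_i|$ is convex, nonsmooth, and separable, so a coordinate-wise optimum is a global optimum.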
1.2 Algorithm
Exact coordinate minimization:
$$x_i^{t+1}=\argmin_{x_i} f(x_i,\, x_{/i}^t)$$
or a single coordinate gradient step:
$$x_i^{t+1}=x_i^{t}-\eta_t\nabla_i f(x_i^t,\, x_{/i}^t)$$
The update order can be arbitrary, as long as every coordinate is updated infinitely often. A minimal code sketch follows.
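Below is a minimal sketch of the cyclic, gradient-step variant in Python. The full-gradient interface `grad_f` and the least-squares test problem are illustrative assumptions, not from the notes:

```python
import numpy as np

def coordinate_descent(grad_f, x0, eta, n_iters=200):
    """Cyclic coordinate descent: x_i <- x_i - eta * [grad f(x)]_i."""
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(n_iters):
        for i in range(len(x)):           # arbitrary order is fine, as long as
            x[i] -= eta * grad_f(x)[i]    # every coordinate keeps being updated
    return x

# Illustrative test: f(x) = 0.5 * ||Ax - b||^2, so grad f(x) = A^T (Ax - b)
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
b = rng.standard_normal(20)
grad = lambda x: A.T @ (A @ x - b)
eta = 1.0 / np.linalg.norm(A, 2) ** 2     # 1 / Lipschitz constant of grad f
x_hat = coordinate_descent(grad, np.zeros(5), eta)
print(np.linalg.norm(grad(x_hat)))        # ~0: coordinate-wise optimal = optimal here
```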
2 Subgradients
2.1 Subgradients
Convex functions always have subgradients.
$g$ is a subgradient of $f$ at $x$ when:
$$f(y)\geq f(x)+g^T(y-x),\quad\forall y$$
2.2 Subdifferential
$\partial f(x)=\{g: g \text{ is a subgradient of } f \text{ at } x\}$
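As a concrete illustration (the textbook example, not from these notes), take $f(x)=|x|$ on $\mathbb{R}$:
$$\partial f(x)=\begin{cases}\{1\}, & x>0\\ [-1,1], & x=0\\ \{-1\}, & x<0\end{cases}$$
At the kink $x=0$, every slope in $[-1,1]$ gives a supporting line that stays below $|x|$.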
2.3 Properties
- Linearity (for $a_1,a_2\geq 0$): $\partial(a_1f_1+a_2f_2)=a_1\partial f_1+a_2\partial f_2$
- Affine composition: if $g(x)=f(Ax+b)$, then $\partial g(x)=A^T\,\partial f(Ax+b)$ (note the transpose, which maps subgradients of $f$ back to $x$-space)
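For instance, combining both properties (an added illustration): for $g(x)=\|Ax-b\|_1$, any $s\in\partial\|\cdot\|_1(Ax-b)$ gives $A^T s\in\partial g(x)$; when no component of $Ax-b$ is zero, $\partial g(x)=\{A^T\operatorname{sign}(Ax-b)\}$.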
2.4 Optimality conditions
When $f$ is convex:
$$f(x^*)=\min_x f(x)\ \Leftrightarrow\ 0\in\partial f(x^*)$$
where $x$ is unconstrained.
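A standard worked use of this condition (an illustration, not from the notes): minimize $h(x)=\frac{1}{2}(x-y)^2+\lambda|x|$ over $x\in\mathbb{R}$. Requiring $0\in\partial h(x^*)=x^*-y+\lambda\,\partial|x^*|$ and checking the cases $x^*>0$, $x^*<0$, $x^*=0$ yields the soft-thresholding solution
$$x^*=\operatorname{sign}(y)\max(|y|-\lambda,\ 0).$$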
For constrained problems, the KKT conditions should be satisfied; equivalently, in normal-cone form, $0\in\partial f(x^*)+N_C(x^*)$, where $N_C(x^*)$ is the normal cone of the feasible set $C$ at $x^*$. The gradient of the Lagrangian in the smooth KKT conditions is replaced by a subgradient of the Lagrangian.
But a selected subgradient is not necessarily a descent direction!
2.5 Subgradient method
Problem of GD for nonsmooth functions: it may oscillate around a nondifferentiable point, as the tiny demo below shows.
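A tiny illustration, constructed here rather than taken from the notes: fixed-step descent on $f(x)=|x|$ keeps jumping across the kink at $x=0$ instead of converging to it.

```python
# "Gradient" descent on f(x) = |x| with a fixed step: the gradient is
# +1 or -1 away from 0, so the iterates end up bouncing around x = 0.
x, eta = 1.05, 0.2
for t in range(10):
    x -= eta * (1.0 if x > 0 else -1.0)
    print(f"t={t}: x={x:+.2f}")   # settles into alternating +0.05 / -0.15
```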
Subgradient method
$$x_{t+1}=x_t-\eta_t g_t,\quad g_t\in\partial f(x_t)$$
where $g_t$ may not be a descent direction.
Since the iterates need not decrease $f$ monotonically, track the best point so far:
$$f(x_{best}^t)=\min_{s\leq t}f(x_s)$$
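A minimal sketch of the method with diminishing steps $\eta_t=1/\sqrt{t+1}$ and best-iterate tracking; the $\ell_1$ test function and the interface are illustrative assumptions:

```python
import numpy as np

def subgradient_method(f, subgrad, x0, n_iters=5000):
    """x_{t+1} = x_t - eta_t * g_t with g_t in the subdifferential of f at x_t.
    Steps need not descend, so we track f(x_best^t) = min_{s<=t} f(x_s)."""
    x = np.asarray(x0, dtype=float).copy()
    best_x, best_f = x.copy(), f(x)
    for t in range(n_iters):
        g = subgrad(x)
        x = x - g / np.sqrt(t + 1)       # eta_t -> 0 and sum eta_t -> infinity
        if f(x) < best_f:                # keep the best iterate seen so far
            best_x, best_f = x.copy(), f(x)
    return best_x, best_f

# Illustrative test: f(x) = ||x||_1; sign(x) is a valid subgradient everywhere
f = lambda x: np.abs(x).sum()
subgrad = lambda x: np.sign(x)
x_best, f_best = subgradient_method(f, subgrad, np.array([3.0, -2.0]))
print(f_best)   # slowly approaches the optimum f* = 0
```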
Theorem (error bound)
Suppose $f$ is $G$-Lipschitz, i.e. $|f(x)-f(y)|\leq G\|x-y\|$.
For fixed $\eta$:
$$\lim_{t\rightarrow\infty} f(x_{best}^t)\leq f^*+\frac{\eta G^2}{2}$$
So we need step sizes with $\eta_t\rightarrow 0$ and $\sum_i\eta_i\rightarrow\infty$.
Step size and convergence
- With $\eta_t=\frac{1}{\sqrt{t}}$: $f(x_{best}^t)-f^*\leq O(\frac{R^2+G^2}{\sqrt{t}})$, where $R$ is the initial distance to the optimum.
- We need $O(\frac{1}{\varepsilon^2})$ steps to reach $\varepsilon$ accuracy, while GD only needs $O(\frac{1}{\varepsilon})$ steps.
- In general we cannot do better than $O(\frac{1}{\varepsilon^2})$: for any method whose updates satisfy $x_t\in x_0+\mathrm{span}(g_0,\ldots,g_{t-1})$, there exists a $G$-Lipschitz convex function with $f(x_t)-f(x^*)\geq\frac{RG}{2(1+\sqrt{t+1})}$.