Week 11: Coordinate descent, Subgradient
- 1 Coordinate descent
- 2 Subgradients
- 3 Gradient, subgradient, proximal
- 3.1 Convex: Subgradient method, $O(\frac{1}{\sqrt{t}})$, $O(\frac{1}{\varepsilon^2})$
- 3.2 Convex + decomposable into a smooth part plus nonsmooth but separable parts: Proximal gradient descent, $O(\frac{1}{t})$, $O(\frac{1}{\varepsilon})$
- 3.3 Convex + smooth: Gradient descent, $O(\frac{1}{t})$, $O(\frac{1}{\varepsilon})$
- 3.4 Strongly convex: Subgradient descent, $O(\frac{1}{t})$, $O(\frac{1}{\varepsilon})$
- 3.5 Smooth + strongly convex: Gradient descent, $O((1-m/M)^t)$, $O(\log(1/\varepsilon))$
- If the function is not smooth, use the subgradient method, or proximal gradient if it is decomposable
1 Coordinate descent
1.1 Will it be optimal?
A point $x$ is coordinate-wise optimal when $f(x+\delta e_i)\geq f(x),\ \forall e_i,\delta$.
- Optimal when $f$ is convex and smooth
- Not necessarily optimal when $f$ is nonconvex and smooth
- Not necessarily optimal when $f$ is convex and nonsmooth
- Optimal when $f$ can be decomposed into a convex smooth function plus convex, nonsmooth, and separable functions (see the example below)
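A standard instance of the last case (the canonical example, added here): the lasso objective $f(x)=\frac{1}{2}\|Ax-b\|_2^2+\lambda\|x\|_1$, where the first term is convex and smooth and $\lambda\|x\|_1=\lambda\sum_i|x_i|$ is convex, nonsmooth, and separable, so a coordinate-wise optimum is a global optimum.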
1.2 Algorithm
Exact coordinate minimization:
$$x_i^{t+1}=\argmin_{x_i} f(x_i,\, x_{/i}^t)$$
or a single coordinate gradient step:
$$x_i^{t+1}=x_i^{t}-\eta_t\nabla_i f(x_i^t,\, x_{/i}^t)$$
The update order can be arbitrary, as long as every coordinate is updated infinitely often. A minimal code sketch follows.
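Below is a minimal sketch of the cyclic, gradient-step variant in Python. The full-gradient interface `grad_f` and the least-squares test problem are illustrative assumptions, not from the notes:

```python
import numpy as np

def coordinate_descent(grad_f, x0, eta, n_iters=200):
    """Cyclic coordinate descent: x_i <- x_i - eta * [grad f(x)]_i."""
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(n_iters):
        for i in range(len(x)):           # arbitrary order is fine, as long as
            x[i] -= eta * grad_f(x)[i]    # every coordinate keeps being updated
    return x

# Illustrative test: f(x) = 0.5 * ||Ax - b||^2, so grad f(x) = A^T (Ax - b)
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
b = rng.standard_normal(20)
grad = lambda x: A.T @ (A @ x - b)
eta = 1.0 / np.linalg.norm(A, 2) ** 2     # 1 / Lipschitz constant of grad f
x_hat = coordinate_descent(grad, np.zeros(5), eta)
print(np.linalg.norm(grad(x_hat)))        # ~0: coordinate-wise optimal = optimal here
```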
2 Subgradients
2.1 Subgradients
Convex functions always have subgradients.
$g$ is a subgradient of $f$ at $x$ when:
$$f(y)\geq f(x)+g^T(y-x),\quad\forall y$$
2.2 Subdifferential
$\partial f(x)=\{g: g \text{ is a subgradient of } f \text{ at } x\}$
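As a concrete illustration (the textbook example, not from these notes), take $f(x)=|x|$ on $\mathbb{R}$:
$$\partial f(x)=\begin{cases}\{1\}, & x>0\\ [-1,1], & x=0\\ \{-1\}, & x<0\end{cases}$$
At the kink $x=0$, every slope in $[-1,1]$ gives a supporting line that stays below $|x|$.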
2.3 Properties
- Linearity (for $a_1,a_2\geq 0$): $\partial(a_1f_1+a_2f_2)=a_1\partial f_1+a_2\partial f_2$
- Affine composition: if $g(x)=f(Ax+b)$, then $\partial g(x)=A^T\,\partial f(Ax+b)$ (note the transpose, which maps subgradients of $f$ back to $x$-space)
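For instance, combining both properties (an added illustration): for $g(x)=\|Ax-b\|_1$, any $s\in\partial\|\cdot\|_1(Ax-b)$ gives $A^T s\in\partial g(x)$; when no component of $Ax-b$ is zero, $\partial g(x)=\{A^T\operatorname{sign}(Ax-b)\}$.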
2.4 Optimality conditions
When $f$ is convex:
$$f(x^*)=\min_x f(x)\ \Leftrightarrow\ 0\in\partial f(x^*)$$
where $x$ is unconstrained.
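A standard worked use of this condition (an illustration, not from the notes): minimize $h(x)=\frac{1}{2}(x-y)^2+\lambda|x|$ over $x\in\mathbb{R}$. Requiring $0\in\partial h(x^*)=x^*-y+\lambda\,\partial|x^*|$ and checking the cases $x^*>0$, $x^*<0$, $x^*=0$ yields the soft-thresholding solution
$$x^*=\operatorname{sign}(y)\max(|y|-\lambda,\ 0).$$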
For constrained problems, the KKT conditions should be satisfied; equivalently, in normal-cone form, $0\in\partial f(x^*)+N_C(x^*)$, where $N_C(x^*)$ is the normal cone of the feasible set $C$ at $x^*$. The gradient of the Lagrangian in the smooth KKT conditions is replaced by a subgradient of the Lagrangian.
But a selected subgradient is not necessarily a descent direction!
2.5 Subgradient method
Problem of GD for nonsmooth functions: it may oscillate around a nondifferentiable point, as the tiny demo below shows.
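A tiny illustration, constructed here rather than taken from the notes: fixed-step descent on $f(x)=|x|$ keeps jumping across the kink at $x=0$ instead of converging to it.

```python
# "Gradient" descent on f(x) = |x| with a fixed step: the gradient is
# +1 or -1 away from 0, so the iterates end up bouncing around x = 0.
x, eta = 1.05, 0.2
for t in range(10):
    x -= eta * (1.0 if x > 0 else -1.0)
    print(f"t={t}: x={x:+.2f}")   # settles into alternating +0.05 / -0.15
```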
Subgradient method
$$x_{t+1}=x_t-\eta_t g_t,\quad g_t\in\partial f(x_t)$$
where $g_t$ may not be a descent direction.
Since the iterates need not decrease $f$ monotonically, track the best point so far:
$$f(x_{best}^t)=\min_{s\leq t}f(x_s)$$
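A minimal sketch of the method with diminishing steps $\eta_t=1/\sqrt{t+1}$ and best-iterate tracking; the $\ell_1$ test function and the interface are illustrative assumptions:

```python
import numpy as np

def subgradient_method(f, subgrad, x0, n_iters=5000):
    """x_{t+1} = x_t - eta_t * g_t with g_t in the subdifferential of f at x_t.
    Steps need not descend, so we track f(x_best^t) = min_{s<=t} f(x_s)."""
    x = np.asarray(x0, dtype=float).copy()
    best_x, best_f = x.copy(), f(x)
    for t in range(n_iters):
        g = subgrad(x)
        x = x - g / np.sqrt(t + 1)       # eta_t -> 0 and sum eta_t -> infinity
        if f(x) < best_f:                # keep the best iterate seen so far
            best_x, best_f = x.copy(), f(x)
    return best_x, best_f

# Illustrative test: f(x) = ||x||_1; sign(x) is a valid subgradient everywhere
f = lambda x: np.abs(x).sum()
subgrad = lambda x: np.sign(x)
x_best, f_best = subgradient_method(f, subgrad, np.array([3.0, -2.0]))
print(f_best)   # slowly approaches the optimum f* = 0
```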
Theorem (error bound)
Suppose $f$ is $G$-Lipschitz, i.e. $|f(x)-f(y)|\leq G\|x-y\|$.
For fixed $\eta$:
$$\lim_{t\rightarrow\infty} f(x_{best}^t)\leq f^*+\frac{\eta G^2}{2}$$
So we need step sizes with $\eta_t\rightarrow 0$ and $\sum_i\eta_i\rightarrow\infty$.
Step size and convergence
- With $\eta_t=\frac{1}{\sqrt{t}}$: $f(x_{best}^t)-f^*\leq O(\frac{R^2+G^2}{\sqrt{t}})$, where $R$ is the initial distance to the optimum.
- We need $O(\frac{1}{\varepsilon^2})$ steps to reach $\varepsilon$ accuracy, while GD only needs $O(\frac{1}{\varepsilon})$ steps.
- In general we cannot do better than $O(\frac{1}{\varepsilon^2})$: for any method whose updates satisfy $x_t\in x_0+\mathrm{span}(g_0,\ldots,g_{t-1})$, there exists a $G$-Lipschitz convex function with $f(x_t)-f(x^*)\geq\frac{RG}{2(1+\sqrt{t+1})}$.