Convex Optimization Reading Notes (8)

Chapter 9: Unconstrained minimization

9.1 Unconstrained minimization problems

The unconstrained optimization problem is
The unconstrained optimization problem is
$$\text{minimize} \quad f(x)$$
where $f : \mathbf{R}^n \to \mathbf{R}$ is convex and twice continuously differentiable (which implies that $\mathbf{dom}\, f$ is open).
Since $f$ is differentiable and convex, a necessary and sufficient condition for a point $x^*$ to be optimal is
$$\nabla f(x^*) = 0.$$

9.1.1 Examples

9.1.2 Strong convexity and implications

The objective function is strongly convex on $S$, which means that there exists an $m > 0$ such that
$$\nabla^2 f(x) \succeq mI$$
for all $x \in S$. For $x, y \in S$ we have
$$f(y) = f(x) + \nabla f(x)^T (y - x) + \frac{1}{2}(y - x)^T \nabla^2 f(z)(y - x)$$
for some $z$ on the line segment $[x, y]$, which implies
$$f(y) \geq f(x) + \nabla f(x)^T (y - x) + \frac{m}{2}\|y - x\|_2^2$$
for all $x$ and $y$ in $S$.
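Minimizing the right-hand side of this inequality over $y$ (the minimum is attained at $y = x - \frac{1}{m}\nabla f(x)$) gives a bound on the suboptimality of any $x \in S$, which is used repeatedly in the convergence analyses below:
$$p^* \geq f(x) - \frac{1}{2m}\|\nabla f(x)\|_2^2, \qquad \text{equivalently} \qquad \|\nabla f(x)\|_2^2 \geq 2m\,(f(x) - p^*).$$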

Upper bound on $\nabla^2 f(x)$

There also exists a constant $M$ such that
$$\nabla^2 f(x) \preceq MI$$
for all $x \in S$, which implies that
$$f(y) \leq f(x) + \nabla f(x)^T (y - x) + \frac{M}{2}\|y - x\|_2^2$$
for all $x, y \in S$.

9.2 Descent methods

The algorithms described in this chapter produce a minimizing sequence $x^{(k)},\ k = 1, \ldots,$ where
$$x^{(k+1)} = x^{(k)} + t^{(k)} \Delta x^{(k)}$$
and $t^{(k)} > 0$ (except when $x^{(k)}$ is optimal). Here the concatenated symbols $\Delta$ and $x$ that form $\Delta x$ are to be read as a single entity, a vector in $\mathbf{R}^n$ called the step or search direction (even though it need not have unit norm), and $k = 0, 1, \ldots$ denotes the iteration number. The scalar $t^{(k)} \geq 0$ is called the step size or step length at iteration $k$ (even though it is not equal to $\|x^{(k+1)} - x^{(k)}\|$ unless $\|\Delta x^{(k)}\| = 1$).

(General descent method algorithm: original image unavailable.)
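Since the original algorithm listing did not survive, here is a minimal Python sketch of the general descent loop described above; the argument names, stopping rule, and iteration cap are placeholder choices of mine, not prescribed by the text.

```python
import numpy as np

def descent_method(f, grad, x0, direction, line_search, tol=1e-8, max_iter=500):
    """General descent method: x^{(k+1)} = x^{(k)} + t^{(k)} * dx^{(k)}."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        if np.linalg.norm(grad(x)) <= tol:   # stopping criterion: small gradient norm
            break
        dx = direction(x)                    # determine a descent direction dx
        t = line_search(f, grad, x, dx)      # line search: choose a step size t > 0
        x = x + t * dx                       # update the iterate
    return x
```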

Exact line search

One line search method sometimes used in practice is exact line search, in which $t$ is chosen to minimize $f$ along the ray $\{x + t\Delta x \mid t \geq 0\}$:
$$t = \arg\min_{s \geq 0} f(x + s\Delta x)$$

Backtracking line search

Many inexact line search methods have been proposed. One inexact line search method that is very simple and quite effective is called backtracking line search. It depends on two constants $\alpha, \beta$ with $0 < \alpha < 0.5$ and $0 < \beta < 1$: starting from $t = 1$, the step size is repeatedly multiplied by $\beta$ until
$$f(x + t\Delta x) \leq f(x) + \alpha t \nabla f(x)^T \Delta x.$$
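A short Python sketch of backtracking line search as just described; the default values of $\alpha$ and $\beta$ below are only illustrative choices.

```python
def backtracking_line_search(f, grad, x, dx, alpha=0.3, beta=0.8):
    """Backtracking line search: shrink t by beta until the sufficient
    decrease condition f(x + t*dx) <= f(x) + alpha*t*grad(x)^T dx holds."""
    t = 1.0
    fx = f(x)
    slope = grad(x) @ dx   # directional derivative; negative for a descent direction
    while f(x + t * dx) > fx + alpha * t * slope:
        t *= beta
    return t
```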

9.3 Gradient descent method

A natural choice for the search direction is the negative gradient $\Delta x = -\nabla f(x)$. The resulting algorithm is called the gradient algorithm or gradient descent method.

(Gradient descent method algorithm: original image unavailable.)
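A minimal sketch of gradient descent with backtracking line search, reusing `backtracking_line_search` from above; the quadratic test problem at the end is made up purely for illustration.

```python
import numpy as np

def gradient_descent(f, grad, x0, tol=1e-8, max_iter=1000):
    """Gradient descent: search direction dx = -grad(x), step by backtracking."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= tol:   # stop when the gradient is small
            break
        dx = -g                        # negative gradient as search direction
        t = backtracking_line_search(f, grad, x, dx)
        x = x + t * dx
    return x

# Example: minimize a strongly convex quadratic (illustrative data only).
A = np.array([[3.0, 0.5], [0.5, 1.0]])
b = np.array([1.0, -2.0])
f = lambda x: 0.5 * x @ A @ x - b @ x
grad = lambda x: A @ x - b
x_star = gradient_descent(f, grad, np.zeros(2))   # approaches the solution of A x = b
```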

9.3.1 Convergence analysis

Analysis for exact line search

We assume $f$ is strongly convex on $S$, so there are positive constants $m$ and $M$ such that $mI \preceq \nabla^2 f(x) \preceq MI$ for all $x \in S$. Define the function $\tilde{f} : \mathbf{R} \to \mathbf{R}$ by $\tilde{f}(t) = f(x - t\nabla f(x))$. We obtain a quadratic upper bound on $\tilde{f}$:
$$\tilde{f}(t) \leq f(x) - t\|\nabla f(x)\|_2^2 + \frac{Mt^2}{2}\|\nabla f(x)\|_2^2.$$
Minimizing the right-hand side over $t$ (the minimum is attained at $t = 1/M$) gives the value $f(x) - \frac{1}{2M}\|\nabla f(x)\|_2^2$. Since exact line search minimizes $\tilde{f}$ over $t$, we have
$$f(x^+) - p^* \leq f(x) - p^* - \frac{1}{2M}\|\nabla f(x)\|_2^2.$$
Combining this with $\|\nabla f(x)\|_2^2 \geq 2m(f(x) - p^*)$ (the strong convexity bound derived above), we get
$$f(x^+) - p^* \leq \left(1 - \frac{m}{M}\right)(f(x) - p^*).$$
Applying this inequality recursively, we find that
$$f(x^{(k)}) - p^* \leq c^k (f(x^{(0)}) - p^*)$$
where $c = 1 - \frac{m}{M} < 1$, which shows that $f(x^{(k)})$ converges to $p^*$ as $k \to \infty$.

Analysis for backtracking line search

Analogously to the exact line search case, we get
$$f(x^{(k)}) - p^* \leq c^k (f(x^{(0)}) - p^*)$$
where
$$c = 1 - \min\left\{2m\alpha,\ \frac{2\beta\alpha m}{M}\right\} < 1.$$

9.3.2 Examples

9.4 Steepest descent method

The first-order Taylor approximation of $f(x + v)$ around $x$ is
$$f(x + v) \approx \hat{f}(x + v) = f(x) + \nabla f(x)^T v.$$
Let $\|\cdot\|$ be any norm on $\mathbf{R}^n$. We define a normalized steepest descent direction (with respect to the norm $\|\cdot\|$) as
$$\Delta x_{\rm nsd} = \arg\min\{\nabla f(x)^T v \mid \|v\| = 1\}.$$
It is also convenient to consider a steepest descent step $\Delta x_{\rm sd}$ that is unnormalized, obtained by scaling the normalized steepest descent direction in a particular way:
$$\Delta x_{\rm sd} = \|\nabla f(x)\|_* \, \Delta x_{\rm nsd}$$
where $\|\cdot\|_*$ denotes the dual norm.

9.4.1 Steepest descent for Euclidean and quadratic norms

Steepest descent for Euclidean norm

If we take the norm $\|\cdot\|$ to be the Euclidean norm, we find that the steepest descent direction is simply the negative gradient, $\Delta x_{\rm sd} = -\nabla f(x)$.

Steepest descent for quadratic norm

We consider the quadratic norm
$$\|z\|_P = (z^T P z)^{\frac{1}{2}} = \|P^{\frac{1}{2}} z\|_2$$
where $P \in \mathbf{S}_{++}^n$. The normalized steepest descent direction and the steepest descent step are
$$\Delta x_{\rm nsd} = -\left(\nabla f(x)^T P^{-1} \nabla f(x)\right)^{-\frac{1}{2}} P^{-1} \nabla f(x), \qquad \Delta x_{\rm sd} = -P^{-1} \nabla f(x).$$
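A small Python sketch of these two formulas; solving with `P` directly avoids forming $P^{-1}$, and the matrix below is just an illustrative positive definite choice.

```python
import numpy as np

def steepest_descent_quadratic_norm(grad_x, P):
    """Steepest descent directions for the quadratic norm ||z||_P = sqrt(z^T P z)."""
    Pinv_g = np.linalg.solve(P, grad_x)         # P^{-1} grad f(x)
    dx_sd = -Pinv_g                             # unnormalized steepest descent step
    dx_nsd = dx_sd / np.sqrt(grad_x @ Pinv_g)   # normalized so that ||dx_nsd||_P = 1
    return dx_nsd, dx_sd

P = np.array([[2.0, 0.3], [0.3, 1.0]])          # illustrative P in S^2_++
g = np.array([1.0, -1.0])                       # illustrative gradient
dx_nsd, dx_sd = steepest_descent_quadratic_norm(g, P)
```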

9.4.2 Steepest descent for $l_1$-norm

As another example, we consider the steepest descent method for the $l_1$-norm. A normalized steepest descent direction is
$$\Delta x_{\rm nsd} = \arg\min\{\nabla f(x)^T v \mid \|v\|_1 \leq 1\}.$$
Let $i$ be an index for which $\|\nabla f(x)\|_\infty = |(\nabla f(x))_i|$, i.e., a component of the gradient with maximum absolute value. Then a normalized steepest descent direction $\Delta x_{\rm nsd}$ for the $l_1$-norm is given by
$$\Delta x_{\rm nsd} = -{\rm sign}\left(\frac{\partial f(x)}{\partial x_i}\right) e_i.$$
An unnormalized steepest descent step is then
$$\Delta x_{\rm sd} = \Delta x_{\rm nsd}\,\|\nabla f(x)\|_\infty = -\frac{\partial f(x)}{\partial x_i}\, e_i.$$
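In other words, each step changes only the coordinate whose partial derivative has the largest magnitude, which is why this method is sometimes viewed as a coordinate-descent scheme. A brief Python sketch of the unnormalized step:

```python
import numpy as np

def steepest_descent_l1(grad_x):
    """Steepest descent step for the l1-norm: move along the single coordinate
    with the largest-magnitude partial derivative."""
    i = np.argmax(np.abs(grad_x))     # index attaining ||grad f(x)||_inf
    dx_sd = np.zeros_like(grad_x)
    dx_sd[i] = -grad_x[i]             # -(df/dx_i) * e_i
    return dx_sd

dx = steepest_descent_l1(np.array([0.2, -3.0, 1.5]))   # -> array([0., 3., 0.])
```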

9.4.3 Convergence analysis

For the steepest descent method with backtracking line search we have
$$f(x^{(k)}) - p^* \leq c^k (f(x^{(0)}) - p^*)$$
where
$$c = 1 - 2m\alpha\tilde{\gamma}^2 \min\left\{1, \frac{\beta\gamma^2}{M}\right\} < 1,$$
and $\gamma, \tilde{\gamma} \in (0, 1]$ are constants relating the norm to the Euclidean norm, with $\|x\| \geq \gamma\|x\|_2$ and $\|x\|_* \geq \tilde{\gamma}\|x\|_2$ for all $x$.

9.4.4 Discussion and examples

9.5 Newton’s method

9.5.1 The Newton step

For $x \in \mathbf{dom}\, f$, the vector
$$\Delta x_{\rm nt} = -\nabla^2 f(x)^{-1} \nabla f(x)$$
is called the Newton step. Positive definiteness of $\nabla^2 f(x)$ implies that
$$\nabla f(x)^T \Delta x_{\rm nt} = -\nabla f(x)^T \nabla^2 f(x)^{-1} \nabla f(x) < 0$$
unless $\nabla f(x) = 0$, so the Newton step is a descent direction (unless $x$ is optimal).

9.5.2 Newton’s method

Newton's method, as outlined below, is sometimes called the damped Newton method or guarded Newton method, to distinguish it from the pure Newton method, which uses a fixed step size $t = 1$.

(Newton's method algorithm: original image unavailable.)
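A hedged Python sketch of the damped Newton method with backtracking line search and the Newton-decrement stopping rule; the tolerance and iteration cap are illustrative, and `backtracking_line_search` is the helper defined earlier.

```python
import numpy as np

def newton_method(f, grad, hess, x0, eps=1e-10, max_iter=100):
    """Damped Newton method: dx_nt = -hess(x)^{-1} grad(x), step by backtracking,
    stopping when lambda(x)^2 / 2 <= eps."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        H = hess(x)
        dx_nt = -np.linalg.solve(H, g)   # Newton step: solve H dx = -g
        lam_sq = -g @ dx_nt              # squared Newton decrement g^T H^{-1} g
        if lam_sq / 2.0 <= eps:          # stopping criterion
            break
        t = backtracking_line_search(f, grad, x, dx_nt)
        x = x + t * dx_nt
    return x
```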

9.5.3 Convergence analysis

There are numbers $\eta$ and $\gamma$ with $0 < \eta \leq \frac{m^2}{L}$ and $\gamma > 0$, depending on $m$, $M$, $L$, and the line search parameters (here $L$ is a Lipschitz constant for $\nabla^2 f$ on $S$), such that the following hold.

If $\|\nabla f(x^{(k)})\|_2 \geq \eta$, then
$$f(x^{(k+1)}) - f(x^{(k)}) \leq -\gamma.$$
If $\|\nabla f(x^{(k)})\|_2 < \eta$, then the backtracking line search selects $t^{(k)} = 1$ and
$$\frac{L}{2m^2}\|\nabla f(x^{(k+1)})\|_2 \leq \left(\frac{L}{2m^2}\|\nabla f(x^{(k)})\|_2\right)^2.$$
Once the second condition holds at iteration $k$, it holds at all subsequent iterations, and for $l \geq k$ we get
$$f(x^{(l)}) - p^* \leq \frac{1}{2m}\|\nabla f(x^{(l)})\|_2^2 \leq \frac{2m^3}{L^2}\left(\frac{1}{2}\right)^{2^{l-k+1}}.$$
This last inequality shows that convergence is extremely rapid once the second condition is satisfied. This phenomenon is called quadratic convergence.

9.6 Self-concordance

9.6.1 Definition and examples

We start by considering functions on $\mathbf{R}$. A convex function $f : \mathbf{R} \to \mathbf{R}$ is self-concordant if
$$|f'''(x)| \leq 2 f''(x)^{\frac{3}{2}}$$
for all $x \in \mathbf{dom}\, f$.
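As a quick check of the definition, consider the negative logarithm $f(x) = -\log x$ on $\mathbf{R}_{++}$ (the standard example):
$$f''(x) = \frac{1}{x^2}, \qquad f'''(x) = -\frac{2}{x^3}, \qquad |f'''(x)| = \frac{2}{x^3} = 2\left(\frac{1}{x^2}\right)^{3/2} = 2 f''(x)^{3/2},$$
so the defining inequality holds with equality, and $-\log x$ is self-concordant.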

9.6.2 Self-concordant calculus

Scaling and sum

Self-concordance is preserved by scaling by a factor exceeding one: if $f$ is self-concordant and $a \geq 1$, then $af$ is self-concordant. Self-concordance is also preserved by addition: if $f_1, f_2$ are self-concordant, then $f_1 + f_2$ is self-concordant.

Composition with affine function

If $f : \mathbf{R}^n \to \mathbf{R}$ is self-concordant, and $A \in \mathbf{R}^{n\times m}$, $b \in \mathbf{R}^n$, then $f(Ax + b)$ is self-concordant.

Composition with logarithm

Let $g : \mathbf{R} \to \mathbf{R}$ be a convex function with $\mathbf{dom}\, g = \mathbf{R}_{++}$, and
$$|g'''(x)| \leq 3\,\frac{g''(x)}{x}$$
for all $x$. Then
$$f(x) = -\log(-g(x)) - \log x$$
is self-concordant on $\{x \mid x > 0,\ g(x) < 0\}$.

9.6.3 Properties of self-concordant functions

9.6.4 Analysis of Newton’s method for self-concordant functions

We will show that there are numbers $\eta$ and $\gamma > 0$, with $0 < \eta \leq \frac{1}{4}$, that depend only on the line search parameters $\alpha$ and $\beta$, such that the following hold. (Here $\lambda(x) = \left(\nabla f(x)^T \nabla^2 f(x)^{-1} \nabla f(x)\right)^{1/2}$ denotes the Newton decrement.)

If $\lambda(x^{(k)}) \geq \eta$, then
$$f(x^{(k+1)}) - f(x^{(k)}) \leq -\gamma.$$
If $\lambda(x^{(k)}) < \eta$, then the backtracking line search selects $t^{(k)} = 1$ and
$$2\lambda(x^{(k+1)}) \leq \left(2\lambda(x^{(k)})\right)^2.$$
As a consequence, once the second condition holds at iteration $k$, for all $l \geq k$ we have
$$f(x^{(l)}) - p^* \leq \lambda(x^{(l)})^2 \leq \left(\frac{1}{2}\right)^{2^{l-k+1}}.$$

9.6.5 Discussion and numerical examples

9.7 Implementation

9.7.1 Pre-computation for line searches

In the simplest implementation of a line search, $f(x + t\Delta x)$ is evaluated for each value of $t$ in the same way that $f(z)$ is evaluated for any $z \in \mathbf{dom}\, f$. But in some cases we can exploit the fact that $f$ (and its derivatives, in an exact line search) are to be evaluated at many points along the ray $\{x + t\Delta x \mid t \geq 0\}$ to reduce the total computational effort. This usually requires some pre-computation, which is often on the same order as computing $f$ at any point, after which $f$ (and its derivatives) can be computed more efficiently along the ray.
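As a concrete illustration (my example, in the spirit of this subsection rather than copied from it): for a least-squares objective $f(x) = \frac{1}{2}\|Ax - b\|_2^2$ with $A \in \mathbf{R}^{p\times n}$, pre-computing $u = Ax - b$ and $v = A\Delta x$ once per outer iteration makes each line search trial cost only $O(p)$ instead of a full matrix-vector product.

```python
import numpy as np

def line_objective_factory(A, b, x, dx):
    """Pre-compute u = A x - b and v = A dx so that
    f(x + t*dx) = 0.5 * ||u + t*v||^2 is cheap to evaluate for many t."""
    u = A @ x - b          # matrix-vector product (pre-computation)
    v = A @ dx             # matrix-vector product (pre-computation)
    def f_line(t):
        r = u + t * v      # O(p) work per trial step size t
        return 0.5 * (r @ r)
    return f_line

rng = np.random.default_rng(0)                      # illustrative data
A = rng.standard_normal((50, 10)); b = rng.standard_normal(50)
x = np.zeros(10); dx = -(A.T @ (A @ x - b))         # negative gradient direction
f_line = line_objective_factory(A, b, x, dx)
values = [f_line(t) for t in (1.0, 0.5, 0.25)]      # repeated cheap evaluations
```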

9.7.2 Computing the Newton step

To compute the Newton step $\Delta x_{\rm nt}$, we first evaluate and form the Hessian matrix $H = \nabla^2 f(x)$ and the gradient $g = \nabla f(x)$ at $x$. Then we solve the system of linear equations $H\Delta x_{\rm nt} = -g$ to find the Newton step. This set of equations is sometimes called the Newton system (since its solution gives the Newton step) or the normal equations, since the same type of equation arises in solving a least-squares problem.
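Since $H$ is symmetric positive definite under this chapter's assumptions, the usual approach is a Cholesky factorization followed by triangular solves. A short sketch using SciPy; the choice of `scipy.linalg.cho_factor`/`cho_solve` is mine, not prescribed by the notes.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def newton_step(H, g):
    """Solve the Newton system H * dx_nt = -g via Cholesky factorization."""
    c, low = cho_factor(H)            # factor H = L L^T (dense case)
    dx_nt = cho_solve((c, low), -g)   # two triangular solves
    lam_sq = -g @ dx_nt               # squared Newton decrement, a useful by-product
    return dx_nt, lam_sq
```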
