Convex Optimization Reading Notes (8)

Chapter 9: Unconstrained minimization

9.1 Unconstrained minimization problems

The unconstrained optimization problem is
The unconstrained optimization problem is
$$\text{minimize} \quad f(x)$$
where $f : \mathbf{R}^n \to \mathbf{R}$ is convex and twice continuously differentiable (which implies that $\mathbf{dom}\, f$ is open).
Since $f$ is differentiable and convex, a necessary and sufficient condition for a point $x^*$ to be optimal is
$$\nabla f(x^*) = 0.$$

9.1.1 Examples

9.1.2 Strong convexity and implications

The objective function is strongly convex on $S$, which means that there exists an $m > 0$ such that
$$\nabla^2 f(x) \succeq mI$$
for all $x \in S$. For $x, y \in S$ we have
$$f(y) = f(x) + \nabla f(x)^T (y - x) + \frac{1}{2}(y - x)^T \nabla^2 f(z)(y - x)$$
for some $z$ on the line segment $[x, y]$, which implies
$$f(y) \geq f(x) + \nabla f(x)^T (y - x) + \frac{m}{2}\|y - x\|_2^2$$
for all $x$ and $y$ in $S$.
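Minimizing the right-hand side of this inequality over $y$ (the minimum is attained at $y = x - \frac{1}{m}\nabla f(x)$) gives a bound on the suboptimality of any $x \in S$, which is used repeatedly in the convergence analyses below:
$$p^* \geq f(x) - \frac{1}{2m}\|\nabla f(x)\|_2^2, \qquad \text{equivalently} \qquad \|\nabla f(x)\|_2^2 \geq 2m\,(f(x) - p^*).$$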

Upper bound on $\nabla^2 f(x)$

There also exists a constant $M$ such that
$$\nabla^2 f(x) \preceq MI$$
for all $x \in S$, which implies that
$$f(y) \leq f(x) + \nabla f(x)^T (y - x) + \frac{M}{2}\|y - x\|_2^2$$
for all $x, y \in S$.

9.2 Descent methods

The algorithms described in this chapter produce a minimizing sequence $x^{(k)},\ k = 1, \ldots,$ where
$$x^{(k+1)} = x^{(k)} + t^{(k)} \Delta x^{(k)}$$
and $t^{(k)} > 0$ (except when $x^{(k)}$ is optimal). Here the concatenated symbols $\Delta$ and $x$ that form $\Delta x$ are to be read as a single entity, a vector in $\mathbf{R}^n$ called the step or search direction (even though it need not have unit norm), and $k = 0, 1, \ldots$ denotes the iteration number. The scalar $t^{(k)} \geq 0$ is called the step size or step length at iteration $k$ (even though it is not equal to $\|x^{(k+1)} - x^{(k)}\|$ unless $\|\Delta x^{(k)}\| = 1$).

(General descent method algorithm: original image unavailable.)
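Since the original algorithm listing did not survive, here is a minimal Python sketch of the general descent loop described above; the argument names, stopping rule, and iteration cap are placeholder choices of mine, not prescribed by the text.

```python
import numpy as np

def descent_method(f, grad, x0, direction, line_search, tol=1e-8, max_iter=500):
    """General descent method: x^{(k+1)} = x^{(k)} + t^{(k)} * dx^{(k)}."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        if np.linalg.norm(grad(x)) <= tol:   # stopping criterion: small gradient norm
            break
        dx = direction(x)                    # determine a descent direction dx
        t = line_search(f, grad, x, dx)      # line search: choose a step size t > 0
        x = x + t * dx                       # update the iterate
    return x
```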

Exact line search

One line search method sometimes used in practice is exact line search, in which $t$ is chosen to minimize $f$ along the ray $\{x + t\Delta x \mid t \geq 0\}$:
$$t = \arg\min_{s \geq 0} f(x + s\Delta x)$$

Backtracking line search

Many inexact line search methods have been proposed. One inexact line search method that is very simple and quite effective is called backtracking line search. It depends on two constants $\alpha, \beta$ with $0 < \alpha < 0.5$ and $0 < \beta < 1$: starting from $t = 1$, the step size is repeatedly multiplied by $\beta$ until
$$f(x + t\Delta x) \leq f(x) + \alpha t \nabla f(x)^T \Delta x.$$
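A short Python sketch of backtracking line search as just described; the default values of $\alpha$ and $\beta$ below are only illustrative choices.

```python
def backtracking_line_search(f, grad, x, dx, alpha=0.3, beta=0.8):
    """Backtracking line search: shrink t by beta until the sufficient
    decrease condition f(x + t*dx) <= f(x) + alpha*t*grad(x)^T dx holds."""
    t = 1.0
    fx = f(x)
    slope = grad(x) @ dx   # directional derivative; negative for a descent direction
    while f(x + t * dx) > fx + alpha * t * slope:
        t *= beta
    return t
```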

9.3 Gradient descent method

A natural choice for the search direction is the negative gradient $\Delta x = -\nabla f(x)$. The resulting algorithm is called the gradient algorithm or gradient descent method.

(Gradient descent method algorithm: original image unavailable.)
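A minimal sketch of gradient descent with backtracking line search, reusing `backtracking_line_search` from above; the quadratic test problem at the end is made up purely for illustration.

```python
import numpy as np

def gradient_descent(f, grad, x0, tol=1e-8, max_iter=1000):
    """Gradient descent: search direction dx = -grad(x), step by backtracking."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= tol:   # stop when the gradient is small
            break
        dx = -g                        # negative gradient as search direction
        t = backtracking_line_search(f, grad, x, dx)
        x = x + t * dx
    return x

# Example: minimize a strongly convex quadratic (illustrative data only).
A = np.array([[3.0, 0.5], [0.5, 1.0]])
b = np.array([1.0, -2.0])
f = lambda x: 0.5 * x @ A @ x - b @ x
grad = lambda x: A @ x - b
x_star = gradient_descent(f, grad, np.zeros(2))   # approaches the solution of A x = b
```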

9.3.1 Convergence analysis

Analysis for exact line search

We assume $f$ is strongly convex on $S$, so there are positive constants $m$ and $M$ such that $mI \preceq \nabla^2 f(x) \preceq MI$ for all $x \in S$. Define the function $\tilde{f} : \mathbf{R} \to \mathbf{R}$ by $\tilde{f}(t) = f(x - t\nabla f(x))$. We obtain a quadratic upper bound on $\tilde{f}$:
$$\tilde{f}(t) \leq f(x) - t\|\nabla f(x)\|_2^2 + \frac{Mt^2}{2}\|\nabla f(x)\|_2^2.$$
Minimizing the right-hand side over $t$ (the minimum is attained at $t = 1/M$) gives the value $f(x) - \frac{1}{2M}\|\nabla f(x)\|_2^2$. Since exact line search minimizes $\tilde{f}$ over $t$, we have
$$f(x^+) - p^* \leq f(x) - p^* - \frac{1}{2M}\|\nabla f(x)\|_2^2.$$
Combining this with $\|\nabla f(x)\|_2^2 \geq 2m(f(x) - p^*)$ (the strong convexity bound derived above), we get
$$f(x^+) - p^* \leq \left(1 - \frac{m}{M}\right)(f(x) - p^*).$$
Applying this inequality recursively, we find that
$$f(x^{(k)}) - p^* \leq c^k (f(x^{(0)}) - p^*)$$
where $c = 1 - \frac{m}{M} < 1$, which shows that $f(x^{(k)})$ converges to $p^*$ as $k \to \infty$.

Analysis for backtracking line search

Analogously to the exact line search case, we get
$$f(x^{(k)}) - p^* \leq c^k (f(x^{(0)}) - p^*)$$
where
$$c = 1 - \min\left\{2m\alpha,\ \frac{2\beta\alpha m}{M}\right\} < 1.$$

9.3.2 Examples

9.4 Steepest descent method

The first-order Taylor approximation of $f(x + v)$ around $x$ is
$$f(x + v) \approx \hat{f}(x + v) = f(x) + \nabla f(x)^T v.$$
Let $\|\cdot\|$ be any norm on $\mathbf{R}^n$. We define a normalized steepest descent direction (with respect to the norm $\|\cdot\|$) as
$$\Delta x_{\rm nsd} = \arg\min\{\nabla f(x)^T v \mid \|v\| = 1\}.$$
It is also convenient to consider a steepest descent step $\Delta x_{\rm sd}$ that is unnormalized, obtained by scaling the normalized steepest descent direction in a particular way:
$$\Delta x_{\rm sd} = \|\nabla f(x)\|_* \, \Delta x_{\rm nsd}$$
where $\|\cdot\|_*$ denotes the dual norm.

9.4.1 Steepest descent for Euclidean and quadratic norms

Steepest descent for Euclidean norm

If we take the norm $\|\cdot\|$ to be the Euclidean norm, we find that the steepest descent direction is simply the negative gradient, $\Delta x_{\rm sd} = -\nabla f(x)$.

Steepest descent for quadratic norm

We consider the quadratic norm
$$\|z\|_P = (z^T P z)^{\frac{1}{2}} = \|P^{\frac{1}{2}} z\|_2$$
where $P \in \mathbf{S}_{++}^n$. The normalized steepest descent direction and the steepest descent step are
$$\Delta x_{\rm nsd} = -\left(\nabla f(x)^T P^{-1} \nabla f(x)\right)^{-\frac{1}{2}} P^{-1} \nabla f(x), \qquad \Delta x_{\rm sd} = -P^{-1} \nabla f(x).$$
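A small Python sketch of these two formulas; solving with `P` directly avoids forming $P^{-1}$, and the matrix below is just an illustrative positive definite choice.

```python
import numpy as np

def steepest_descent_quadratic_norm(grad_x, P):
    """Steepest descent directions for the quadratic norm ||z||_P = sqrt(z^T P z)."""
    Pinv_g = np.linalg.solve(P, grad_x)         # P^{-1} grad f(x)
    dx_sd = -Pinv_g                             # unnormalized steepest descent step
    dx_nsd = dx_sd / np.sqrt(grad_x @ Pinv_g)   # normalized so that ||dx_nsd||_P = 1
    return dx_nsd, dx_sd

P = np.array([[2.0, 0.3], [0.3, 1.0]])          # illustrative P in S^2_++
g = np.array([1.0, -1.0])                       # illustrative gradient
dx_nsd, dx_sd = steepest_descent_quadratic_norm(g, P)
```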

9.4.2 Steepest descent for $l_1$-norm

As another example, we consider the steepest descent method for the $l_1$-norm. A normalized steepest descent direction is
$$\Delta x_{\rm nsd} = \arg\min\{\nabla f(x)^T v \mid \|v\|_1 \leq 1\}.$$
Let $i$ be an index for which $\|\nabla f(x)\|_\infty = |(\nabla f(x))_i|$, i.e., a component of the gradient with maximum absolute value. Then a normalized steepest descent direction $\Delta x_{\rm nsd}$ for the $l_1$-norm is given by
$$\Delta x_{\rm nsd} = -{\rm sign}\left(\frac{\partial f(x)}{\partial x_i}\right) e_i.$$
An unnormalized steepest descent step is then
$$\Delta x_{\rm sd} = \Delta x_{\rm nsd}\,\|\nabla f(x)\|_\infty = -\frac{\partial f(x)}{\partial x_i}\, e_i.$$
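In other words, each step changes only the coordinate whose partial derivative has the largest magnitude, which is why this method is sometimes viewed as a coordinate-descent scheme. A brief Python sketch of the unnormalized step:

```python
import numpy as np

def steepest_descent_l1(grad_x):
    """Steepest descent step for the l1-norm: move along the single coordinate
    with the largest-magnitude partial derivative."""
    i = np.argmax(np.abs(grad_x))     # index attaining ||grad f(x)||_inf
    dx_sd = np.zeros_like(grad_x)
    dx_sd[i] = -grad_x[i]             # -(df/dx_i) * e_i
    return dx_sd

dx = steepest_descent_l1(np.array([0.2, -3.0, 1.5]))   # -> array([0., 3., 0.])
```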

9.4.3 Convergence analysis

For the steepest descent method with backtracking line search we have
$$f(x^{(k)}) - p^* \leq c^k (f(x^{(0)}) - p^*)$$
where
$$c = 1 - 2m\alpha\tilde{\gamma}^2 \min\left\{1, \frac{\beta\gamma^2}{M}\right\} < 1,$$
and $\gamma, \tilde{\gamma} \in (0, 1]$ are constants relating the norm to the Euclidean norm, with $\|x\| \geq \gamma\|x\|_2$ and $\|x\|_* \geq \tilde{\gamma}\|x\|_2$ for all $x$.

9.4.4 Discussion and examples

9.5 Newton’s method

9.5.1 The Newton step

For $x \in \mathbf{dom}\, f$, the vector
$$\Delta x_{\rm nt} = -\nabla^2 f(x)^{-1} \nabla f(x)$$
is called the Newton step. Positive definiteness of $\nabla^2 f(x)$ implies that
$$\nabla f(x)^T \Delta x_{\rm nt} = -\nabla f(x)^T \nabla^2 f(x)^{-1} \nabla f(x) < 0$$
unless $\nabla f(x) = 0$, so the Newton step is a descent direction (unless $x$ is optimal).

9.5.2 Newton’s method

Newton's method, as outlined below, is sometimes called the damped Newton method or guarded Newton method, to distinguish it from the pure Newton method, which uses a fixed step size $t = 1$.

(Newton's method algorithm: original image unavailable.)
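A hedged Python sketch of the damped Newton method with backtracking line search and the Newton-decrement stopping rule; the tolerance and iteration cap are illustrative, and `backtracking_line_search` is the helper defined earlier.

```python
import numpy as np

def newton_method(f, grad, hess, x0, eps=1e-10, max_iter=100):
    """Damped Newton method: dx_nt = -hess(x)^{-1} grad(x), step by backtracking,
    stopping when lambda(x)^2 / 2 <= eps."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        H = hess(x)
        dx_nt = -np.linalg.solve(H, g)   # Newton step: solve H dx = -g
        lam_sq = -g @ dx_nt              # squared Newton decrement g^T H^{-1} g
        if lam_sq / 2.0 <= eps:          # stopping criterion
            break
        t = backtracking_line_search(f, grad, x, dx_nt)
        x = x + t * dx_nt
    return x
```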

9.5.3 Convergence analysis

There are numbers $\eta$ and $\gamma$ with $0 < \eta \leq \frac{m^2}{L}$ and $\gamma > 0$, depending on $m$, $M$, $L$, and the line search parameters (here $L$ is a Lipschitz constant for $\nabla^2 f$ on $S$), such that the following hold.

If $\|\nabla f(x^{(k)})\|_2 \geq \eta$, then
$$f(x^{(k+1)}) - f(x^{(k)}) \leq -\gamma.$$
If $\|\nabla f(x^{(k)})\|_2 < \eta$, then the backtracking line search selects $t^{(k)} = 1$ and
$$\frac{L}{2m^2}\|\nabla f(x^{(k+1)})\|_2 \leq \left(\frac{L}{2m^2}\|\nabla f(x^{(k)})\|_2\right)^2.$$
Once the second condition holds at iteration $k$, it holds at all subsequent iterations, and for $l \geq k$ we get
$$f(x^{(l)}) - p^* \leq \frac{1}{2m}\|\nabla f(x^{(l)})\|_2^2 \leq \frac{2m^3}{L^2}\left(\frac{1}{2}\right)^{2^{l-k+1}}.$$
This last inequality shows that convergence is extremely rapid once the second condition is satisfied. This phenomenon is called quadratic convergence.

9.6 Self-concordance

9.6.1 Definition and examples

We start by considering functions on $\mathbf{R}$. A convex function $f : \mathbf{R} \to \mathbf{R}$ is self-concordant if
$$|f'''(x)| \leq 2 f''(x)^{\frac{3}{2}}$$
for all $x \in \mathbf{dom}\, f$.
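As a quick check of the definition, consider the negative logarithm $f(x) = -\log x$ on $\mathbf{R}_{++}$ (the standard example):
$$f''(x) = \frac{1}{x^2}, \qquad f'''(x) = -\frac{2}{x^3}, \qquad |f'''(x)| = \frac{2}{x^3} = 2\left(\frac{1}{x^2}\right)^{3/2} = 2 f''(x)^{3/2},$$
so the defining inequality holds with equality, and $-\log x$ is self-concordant.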

9.6.2 Self-concordant calculus

Scaling and sum

Self-concordance is preserved by scaling by a factor exceeding one: if $f$ is self-concordant and $a \geq 1$, then $af$ is self-concordant. Self-concordance is also preserved by addition: if $f_1, f_2$ are self-concordant, then $f_1 + f_2$ is self-concordant.

Composition with affine function

If $f : \mathbf{R}^n \to \mathbf{R}$ is self-concordant, and $A \in \mathbf{R}^{n\times m}$, $b \in \mathbf{R}^n$, then $f(Ax + b)$ is self-concordant.

Composition with logarithm

Let $g : \mathbf{R} \to \mathbf{R}$ be a convex function with $\mathbf{dom}\, g = \mathbf{R}_{++}$, and
$$|g'''(x)| \leq 3\,\frac{g''(x)}{x}$$
for all $x$. Then
$$f(x) = -\log(-g(x)) - \log x$$
is self-concordant on $\{x \mid x > 0,\ g(x) < 0\}$.

9.6.3 Properties of self-concordant functions

9.6.4 Analysis of Newton’s method for self-concordant functions

We will show that there are numbers $\eta$ and $\gamma > 0$, with $0 < \eta \leq \frac{1}{4}$, that depend only on the line search parameters $\alpha$ and $\beta$, such that the following hold. (Here $\lambda(x) = \left(\nabla f(x)^T \nabla^2 f(x)^{-1} \nabla f(x)\right)^{1/2}$ denotes the Newton decrement.)

If $\lambda(x^{(k)}) \geq \eta$, then
$$f(x^{(k+1)}) - f(x^{(k)}) \leq -\gamma.$$
If $\lambda(x^{(k)}) < \eta$, then the backtracking line search selects $t^{(k)} = 1$ and
$$2\lambda(x^{(k+1)}) \leq \left(2\lambda(x^{(k)})\right)^2.$$
As a consequence, once the second condition holds at iteration $k$, for all $l \geq k$ we have
$$f(x^{(l)}) - p^* \leq \lambda(x^{(l)})^2 \leq \left(\frac{1}{2}\right)^{2^{l-k+1}}.$$

9.6.5 Discussion and numerical examples

9.7 Implementation

9.7.1 Pre-computation for line searches

In the simplest implementation of a line search, $f(x + t\Delta x)$ is evaluated for each value of $t$ in the same way that $f(z)$ is evaluated for any $z \in \mathbf{dom}\, f$. But in some cases we can exploit the fact that $f$ (and its derivatives, in an exact line search) are to be evaluated at many points along the ray $\{x + t\Delta x \mid t \geq 0\}$ to reduce the total computational effort. This usually requires some pre-computation, which is often on the same order as computing $f$ at any point, after which $f$ (and its derivatives) can be computed more efficiently along the ray.
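As a concrete illustration (my example, in the spirit of this subsection rather than copied from it): for a least-squares objective $f(x) = \frac{1}{2}\|Ax - b\|_2^2$ with $A \in \mathbf{R}^{p\times n}$, pre-computing $u = Ax - b$ and $v = A\Delta x$ once per outer iteration makes each line search trial cost only $O(p)$ instead of a full matrix-vector product.

```python
import numpy as np

def line_objective_factory(A, b, x, dx):
    """Pre-compute u = A x - b and v = A dx so that
    f(x + t*dx) = 0.5 * ||u + t*v||^2 is cheap to evaluate for many t."""
    u = A @ x - b          # matrix-vector product (pre-computation)
    v = A @ dx             # matrix-vector product (pre-computation)
    def f_line(t):
        r = u + t * v      # O(p) work per trial step size t
        return 0.5 * (r @ r)
    return f_line

rng = np.random.default_rng(0)                      # illustrative data
A = rng.standard_normal((50, 10)); b = rng.standard_normal(50)
x = np.zeros(10); dx = -(A.T @ (A @ x - b))         # negative gradient direction
f_line = line_objective_factory(A, b, x, dx)
values = [f_line(t) for t in (1.0, 0.5, 0.25)]      # repeated cheap evaluations
```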

9.7.2 Computing the Newton step

To compute the Newton step $\Delta x_{\rm nt}$, we first evaluate and form the Hessian matrix $H = \nabla^2 f(x)$ and the gradient $g = \nabla f(x)$ at $x$. Then we solve the system of linear equations $H\Delta x_{\rm nt} = -g$ to find the Newton step. This set of equations is sometimes called the Newton system (since its solution gives the Newton step) or the normal equations, since the same type of equation arises in solving a least-squares problem.
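Since $H$ is symmetric positive definite under this chapter's assumptions, the usual approach is a Cholesky factorization followed by triangular solves. A short sketch using SciPy; the choice of `scipy.linalg.cho_factor`/`cho_solve` is mine, not prescribed by the notes.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def newton_step(H, g):
    """Solve the Newton system H * dx_nt = -g via Cholesky factorization."""
    c, low = cho_factor(H)            # factor H = L L^T (dense case)
    dx_nt = cho_solve((c, low), -g)   # two triangular solves
    lam_sq = -g @ dx_nt               # squared Newton decrement, a useful by-product
    return dx_nt, lam_sq
```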
