Lecture 2: Line Search Methods


1. The Wolfe Conditions

  • The Armijo condition:
    \(\alpha_k\) should give sufficient decrease in the objective function \(f\):
    \[f(x_k+\alpha_k p_k)\leq f(x_k)+c_1\alpha_k\nabla f_k^Tp_k\]
    for some constant \(c_1\in(0,1)\). In practice, \(c_1\) is chosen to be quite small, say \(c_1=10^{-4}\)
  • The curvature condition:
    \[\nabla f(x_k+\alpha_kp_k)^Tp_k\geq c_2\nabla f_k^Tp_k\]
    for some constant \(c_2\in(c_1,1)\)
    Remark: The Armijo condition and the curvature condition together are known as the Wolfe conditions
    Remark: Typical values of \(c_2\) are 0.9 when the search direction \(p_k\) is chosen by a Newton or quasi-Newton method, and 0.1 when \(p_k\) is obtained from a nonlinear conjugate gradient method
    Remark: The Armijo condition is satisfied for all sufficiently small \(\alpha\); the curvature condition rules out unacceptably short steps by requiring \(\alpha\) to be large enough
  • The strong Wolfe conditions:
    \[f(x_k+\alpha_k p_k)\leq f(x_k)+c_1\alpha_k\nabla f_k^Tp_k\] \[|\nabla f(x_k+\alpha_kp_k)^Tp_k|\leq c_2|\nabla f_k^Tp_k|\]
    Remark: The only difference is that we no longer allow \(\nabla f(x_k+\alpha_kp_k)^Tp_k\) to be too positive, hence we exclude step lengths that are far from stationary points of \(\phi(\alpha)=f(x_k+\alpha p_k)\)
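Remark: A minimal Python sketch for checking the strong Wolfe conditions at a candidate step length; the helper name, default constants, and the quadratic test problem below are illustrative assumptions:

import numpy as np

def satisfies_strong_wolfe(f, grad, x, p, alpha, c1=1e-4, c2=0.9):
    # phi(alpha) = f(x + alpha p); phi'(0) = grad f(x)^T p (negative for a descent direction)
    phi0 = f(x)
    dphi0 = grad(x) @ p
    x_new = x + alpha * p
    armijo = f(x_new) <= phi0 + c1 * alpha * dphi0          # sufficient decrease condition
    curvature = abs(grad(x_new) @ p) <= c2 * abs(dphi0)     # strong curvature condition
    return armijo and curvature

# Illustrative test: f(x) = 0.5 ||x||^2, steepest descent direction, unit step
f = lambda x: 0.5 * x @ x
grad = lambda x: x
x = np.array([1.0, -2.0])
print(satisfies_strong_wolfe(f, grad, x, -grad(x), alpha=1.0))   # True: alpha = 1 is the exact minimizer along p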

2. Existence of Intervals for Wolfe Conditions:

Lemma: Suppose that \(f:\mathbb{R}^n\rightarrow\mathbb{R}\) is continuously differentiable. Let \(p_k\) be a descent direction at \(x_k\), and assume that \(f\) is bounded below along the ray
\(\{x_k+\alpha p_k|\alpha>0\}\). Then if \(0<c_1<c_2<1\), there exist intervals of step lengths satisfying the Wolfe conditions and the strong Wolfe conditions
Remark: The Wolfe conditions can be used in most line search methods, and are particularly important in the implementation of quasi-Newton methods


3. The Goldstein Conditions:

\[f(x_k)+(1-c)\alpha_k\nabla f_k^Tp_k\leq f(x_k+\alpha_kp_k)\leq f(x_k)+c\alpha_k\nabla f_k^Tp_k\]
for some constant \(c\in(0,\frac{1}{2})\)
Remark: The Goldstein conditions are often used in Newton-type methods but are not well suited for quasi-Newton methods that maintain a positive definite Hessian approximation.


4. Sufficient Decrease and Backtracking:

choose \(\bar{\alpha}>0, \rho,c\in(0,1)\), set \(\alpha\leftarrow\bar{\alpha}\)
repeat until \(f(x_k+\alpha p_k)\leq f(x_k)+c\alpha\nabla f_k^Tp_k\)
   \(\alpha\leftarrow\rho\alpha\)
end(repeat)
terminate with \(\alpha_k=\alpha\)
Remark: The initial step length \(\bar{\alpha}\) is chosen to be 1 in Newton and quasi-Newton methods, but can have different values in other algorithms such as steepest descent or conjugate gradient
Remark: The contraction factor \(\rho\) may vary from iteration to iteration; we only need to ensure that at each iteration \(\rho\in[\rho_l,\rho_h]\) for some fixed constants \(0<\rho_l<\rho_h<1\)
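Remark: A runnable Python sketch of the backtracking procedure above; the default values \(\bar{\alpha}=1\), \(\rho=0.5\), \(c=10^{-4}\) and the Rosenbrock test problem are illustrative assumptions:

import numpy as np

def backtracking_line_search(f, grad, x, p, alpha_bar=1.0, rho=0.5, c=1e-4):
    # Shrink alpha until the sufficient decrease (Armijo) condition
    # f(x + alpha p) <= f(x) + c * alpha * grad f(x)^T p holds.
    alpha = alpha_bar
    fx = f(x)
    slope = grad(x) @ p                     # directional derivative; negative for a descent direction
    while f(x + alpha * p) > fx + c * alpha * slope:
        alpha *= rho                        # contract the step
    return alpha

# Illustrative use: one steepest descent step on the Rosenbrock function
rosen = lambda x: 100.0 * (x[1] - x[0]**2)**2 + (1.0 - x[0])**2
rosen_grad = lambda x: np.array([-400.0 * x[0] * (x[1] - x[0]**2) - 2.0 * (1.0 - x[0]),
                                 200.0 * (x[1] - x[0]**2)])
x = np.array([-1.2, 1.0])
p = -rosen_grad(x)
print(backtracking_line_search(rosen, rosen_grad, x, p))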


5. Convergence of Line Search Methods:

Theorem: Consider any iteration of the form \(x_{k+1}=x_k+\alpha_kp_k\), where \(p_k\) is a descent direction and \(\alpha_k\) satisfies the Wolfe conditions. Suppose that \(f\) is bounded below in \(\mathbb{R}^n\) and that \(f\) is continuously differentiable in an open set \(\mathcal{N}\) containing the level set \(\mathcal{L}=\{x:f(x)\leq f(x_0)\}\), where \(x_0\) is the starting point of the iteration. Assume also that the gradient \(\nabla f\) is Lipschitz continuous on \(\mathcal{N}\). Then the Zoutendijk condition holds:
\[\sum_{k\geq 0}\cos^2\theta_k\,||\nabla f_k||^2<\infty\]
where \(\cos\theta_k=\frac{-\nabla f_k^Tp_k}{||\nabla f_k||\,||p_k||}\)
Remark: If \(p_k\) is chosen such that there is a positive constant \(\delta\) satisfying \(\cos\theta_k\geq\delta>0\) for all \(k\), then it follows from the Zoutendijk condition that
\[\lim_{k\rightarrow\infty}||\nabla f_k||=0\]
Remark: For Newton-like methods, assume that the matrices \(B_k\) are positive definite with a uniformly bounded condition number, that is, there is a constant \(M\) such that \(||B_k||\,||B_k^{-1}||\leq M\) for all \(k\); then \(\cos\theta_k\geq\frac{1}{M}\), and so \(\lim_{k\rightarrow\infty}||\nabla f_k||=0\)
Remark: For the conjugate gradient method, we can only prove the weaker result \(\liminf_{k\rightarrow\infty}||\nabla f_k||=0\)
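Remark: The bound \(\cos\theta_k\geq\frac{1}{M}\) follows from a short estimate (a standard argument, sketched here under the assumption that \(||\cdot||\) is the spectral norm): with \(p_k=-B_k^{-1}\nabla f_k\) and \(B_k\) symmetric positive definite,
\[\cos\theta_k=\frac{\nabla f_k^TB_k^{-1}\nabla f_k}{||\nabla f_k||\,||B_k^{-1}\nabla f_k||}\geq\frac{\lambda_{min}(B_k^{-1})\,||\nabla f_k||^2}{||\nabla f_k||\,||B_k^{-1}||\,||\nabla f_k||}=\frac{1}{||B_k||\,||B_k^{-1}||}\geq\frac{1}{M}\]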


6. A General Class of Algorithms:

Consider any algorithm for which

  • every iteration produces a decrease in the objective function
  • every m-th iteration is a steepest descent step, with step length chosen to satisfy the Wolfe or
    Goldstein conditions
  • the steepest descent step ensures convergence, while the other \(m-1\) steps can be used to design an algorithm that performs better than steepest descent

7. The Challenge of Designing Algorithms:

Algorithmic strategies that achieve rapid convergence can sometimes conflict with the requirements of global convergence, and vice versa. For example, the steepest descent method is the quintessential globally convergent algorithm, but it is quite slow in practice. On the other hand, the pure Newton iteration converges rapidly when started close enough to a solution, but its steps may not even be descent directions away from the solution. The challenge is to design algorithms that incorporate both properties: good global convergence guarantees and a rapid rate of convergence.


8. Convergence of Steepest Descent Methods:

  • Consider the quadratic function \(f(x)=\frac{1}{2}x^TQx-b^Tx\), where \(Q\) is symmetric and positive definite, and use the exact line search \(\alpha_k=\arg\min_{\alpha}f(x_k-\alpha\nabla f_k)=\frac{\nabla f_k^T\nabla f_k}{\nabla f_k^TQ\nabla f_k}\); then the error norm satisfies
    \[||x_{k+1}-x^*||^2_Q\leq \left(\frac{\lambda_n-\lambda_1}{\lambda_n+\lambda_1}\right)^2||x_k-x^*||^2_Q\]
    where \(0<\lambda_1\leq\cdots\leq\lambda_n\) are the eigenvalues of \(Q\), and \(||x||_Q^2=x^TQx\) (a small numerical check of this bound follows this list)
  • Suppose that \(f:\mathbb{R}^n\rightarrow\mathbb{R}\) is twice continuously differentiable, and that the iterates generated by the steepest descent method with exact line searches converge to a point \(x^*\) where the Hessian matrix \(\nabla^2f(x^*)\) is positive definite; then
    \[f(x_{k+1})-f(x^*)\leq\left(\frac{\lambda_n-\lambda_1}{\lambda_n+\lambda_1}\right)^2(f(x_k)-f(x^*))\]
    where \(0<\lambda_1\leq\cdots\leq\lambda_n\) are the eigenvalues of \(\nabla^2 f(x^*)\)
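Remark: A small numerical sketch of the quadratic case above (the \(2\times 2\) matrix \(Q\), the vector \(b\), and the starting point are illustrative assumptions); it checks that the \(Q\)-norm error contracts by at least the factor \((\frac{\lambda_n-\lambda_1}{\lambda_n+\lambda_1})^2\) at every step:

import numpy as np

Q = np.diag([1.0, 10.0])                       # eigenvalues lambda_1 = 1, lambda_n = 10
b = np.array([1.0, 1.0])
x_star = np.linalg.solve(Q, b)                 # unique minimizer of f(x) = 0.5 x^T Q x - b^T x
bound = ((10.0 - 1.0) / (10.0 + 1.0)) ** 2     # theoretical contraction factor

q_norm_sq = lambda v: v @ Q @ v                # ||v||_Q^2 = v^T Q v

x = np.array([3.0, -2.0])
for k in range(10):
    g = Q @ x - b                              # gradient of the quadratic
    alpha = (g @ g) / (g @ Q @ g)              # exact line search step
    x_new = x - alpha * g
    ratio = q_norm_sq(x_new - x_star) / q_norm_sq(x - x_star)
    print(f"k={k}: ratio={ratio:.4f}  bound={bound:.4f}")
    x = x_new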

9. Convergence of Quasi-Newton Methods:

  • Suppose that \(f:\mathbb{R}^n\rightarrow\mathbb{R}\) is three times continuously differentiable. Consider the iteration \(x_{k+1}=x_k+\alpha_kp_k\), where \(p_k\) is a descent direction and \(\alpha_k\) satisfies the Wolfe conditions with \(c_1\leq\frac{1}{2}\). If the sequence \(\{x_k\}\) converges to a point \(x^*\) such that \(\nabla f(x^*)=0\) and \(\nabla^2f(x^*)\) is positive definite, and if the search direction satisfies
    \[\lim_{k\rightarrow\infty}\frac{||\nabla f_k+\nabla^2f_kp_k||}{||p_k||}=0\]
    then the step length \(\alpha_k=1\) is admissible for all \(k\) greater than a certain index \(k_0\); moreover, if \(\alpha_k=1\) for all \(k>k_0\), then \(\{x_k\}\) converges to \(x^*\) superlinearly
    Remark: If we use a quasi-Newton direction \(p_k=-B_k^{-1}\nabla f_k\), the condition above is equivalent to:
    \[\lim_{k\rightarrow\infty}\frac{||(B_k-\nabla^2f(x^*))p_k||}{||p_k||}=0\]
    so \(B_k\) does not need to converge to \(\nabla^2f(x^*)\); it suffices that \(B_k\) converges to \(\nabla^2f(x^*)\) along the search directions \(p_k\) (see the remark after this list)
  • Suppose that \(f:\mathbb{R}^n\rightarrow\mathbb{R}\) is three times continuously differentiable. Consider the iteration \(x_{k+1}=x_k+p_k\), where \(p_k=-B_k^{-1}\nabla f_k\). If the sequence \(\{x_k\}\) converges to a point \(x^*\) such that \(\nabla f(x^*)=0\) and \(\nabla^2f(x^*)\) is positive definite, then \(\{x_k\}\) converges superlinearly if and only if
    \[\lim_{k\rightarrow\infty}\frac{||(B_k-\nabla^2f(x^*))p_k||}{||p_k||}=0\]
    holds
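Remark: The equivalence claimed in the remark of the first bullet follows from the identity \(\nabla f_k+\nabla^2f_kp_k=(\nabla^2f_k-B_k)p_k\) (using \(\nabla f_k=-B_kp_k\)); since \(x_k\rightarrow x^*\) and \(\nabla^2f\) is continuous, \(\nabla^2f_k\rightarrow\nabla^2f(x^*)\), so the two limits
\[\lim_{k\rightarrow\infty}\frac{||\nabla f_k+\nabla^2f_kp_k||}{||p_k||}=0\qquad\text{and}\qquad\lim_{k\rightarrow\infty}\frac{||(B_k-\nabla^2f(x^*))p_k||}{||p_k||}=0\]
hold or fail together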

10. Convergence of Newton's Method:

Suppose that \(f\) is twice differentiable and that the Hessian \(\nabla^2f(x)\) is Lipschitz continuous in a neighborhood of a solution \(x^*\) at which the second-order sufficient conditions are satisfied. Consider the iteration \(x_{k+1}=x_k+p_k\), where \(p_k=-\nabla^2f_k^{-1}\nabla f_k\). Then

  1. if the starting point \(x_0\) is sufficiently close to \(x^*\), the sequence converges to \(x^*\)
  2. the rate of convergence of \(\{x_k\}\) is quadratic
  3. the sequence of gradient norms \(\{||\nabla f_k||\}\) converges quadratically to zero

11. Coordinate Descent Methods:

  • The coordinate descent method with exact line searches can iterate infinitely without ever approaching a point where the gradient of the objective function vanishes
  • If the coordinate descent method converges to a solution, then its rate of convergence is often much slower than that of the steepest descent method, and the difference between them
    increases with the number of variables
  • It does not require calculation of the gradient \(\nabla f_k\)
  • The speed of convergence can be quite acceptable if the variables are loosely coupled

Remark: A back-and-forth approach cycles through the coordinate directions in the order \(e_1,e_2,\cdots,e_{n-1},e_n,e_{n-1},\cdots,e_2,e_1, e_2,\cdots\)
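Remark: A minimal Python sketch of cyclic coordinate descent with exact line searches on a convex quadratic (the matrix \(Q\), vector \(b\), and sweep count are illustrative assumptions); for a quadratic, the exact minimization along \(e_i\) has a closed form, so no gradient vector is ever formed:

import numpy as np

def coordinate_descent_quadratic(Q, b, x0, sweeps=50):
    # Cyclic coordinate descent on f(x) = 0.5 x^T Q x - b^T x.
    # Minimizing exactly along e_i solves Q[i,i] * x_i = b[i] - sum_{j != i} Q[i,j] * x_j.
    x = x0.astype(float)
    for _ in range(sweeps):
        for i in range(len(x)):                # forward sweep through e_1, ..., e_n
            x[i] = (b[i] - Q[i] @ x + Q[i, i] * x[i]) / Q[i, i]
    return x

Q = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
print(coordinate_descent_quadratic(Q, b, np.zeros(2)))   # approaches Q^{-1} b = [1/11, 7/11]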


12. Step Length Selection Algorithms:

  • Interpolation: start from an initial guess \(\alpha_0\). If \(\phi(\alpha_0)\leq\phi(0)+c_1\alpha_0\phi'(0)\) is not satisfied, form the quadratic approximation \(\phi_q(\alpha)\) to \(\phi\) by interpolating \(\phi(0),\phi'(0)\) and \(\phi(\alpha_0)\); the new value \(\alpha_1\) is defined as the minimizer of this quadratic (written out below). If the sufficient decrease condition is still not satisfied, construct a cubic function interpolating \(\phi(0),\phi'(0),\phi(\alpha_0)\) and \(\phi(\alpha_1)\) and find the minimizer \(\alpha_2\) of this cubic. If necessary, the process is repeated, using a cubic interpolant of \(\phi(0), \phi'(0)\) and the two most recent values of \(\phi\). If any \(\alpha_i\) is either too close to \(\alpha_{i-1}\) or too much smaller than \(\alpha_{i-1}\), reset \(\alpha_i=\frac{\alpha_{i-1}}{2}\)
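Remark: For reference, the quadratic interpolant of \(\phi(0),\phi'(0)\) and \(\phi(\alpha_0)\) and its minimizer have the explicit form (a standard formula, stated here for completeness):
\[\phi_q(\alpha)=\left(\frac{\phi(\alpha_0)-\phi(0)-\alpha_0\phi'(0)}{\alpha_0^2}\right)\alpha^2+\phi'(0)\alpha+\phi(0),\qquad \alpha_1=-\frac{\phi'(0)\alpha_0^2}{2\left[\phi(\alpha_0)-\phi(0)-\phi'(0)\alpha_0\right]}\]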

Remark: The strategy just described assumes that derivative values are significantly more expensive to compute than function values. It is often possible, however, to compute the directional derivative simultaneously with the function, at little additional cost. Accordingly, we can design an alternative strategy based on cubic interpolation of the values of \(\phi\) and \(\phi'\) at the two most recent values of \(\alpha\)


13. The Initial Step Length

  • For Newton and quasi-Newton methods the step \(\alpha_0=1\) should always be used as the initial trial step length
  • For methods such as the steepest descent and conjugate gradient, a popular strategy is to choose the initial guess \(\alpha_0\) so that
    \[\alpha_0\nabla f_k^Tp_k=\alpha_{k-1}\nabla f_{k-1}^Tp_{k-1}\]
    Another useful strategy is to interpolate a quadratic to the data \(f(x_{k-1}), f(x_k)\) and \(\phi'(0)=\nabla f_k^Tp_k\) and to define \(\alpha_0\) to be its minimizer:
    \[\alpha_0=\frac{2(f_k-f_{k-1})}{\phi'(0)}\]
  • It can be shown that if \(x_k\rightarrow x^*\) superlinearly, then the ratio in this expression converges to 1. If we adjust the above equation by setting:
    \[\alpha_0\leftarrow\min(1,1.01\alpha_0)\]
    we find that the unit step length \(\alpha_0=1\) will eventually always be tried and accepted, and the superlinear convergence properties of Newton and quasi-Newton methods will be observed.
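Remark: A small Python sketch of the two initial-guess rules above, including the \(\min(1,1.01\alpha_0)\) safeguard (the function and argument names are illustrative assumptions):

def initial_step_gradient_rule(alpha_prev, gTp_prev, gTp):
    # alpha_0 chosen so that alpha_0 * grad f_k^T p_k = alpha_{k-1} * grad f_{k-1}^T p_{k-1}
    return alpha_prev * gTp_prev / gTp

def initial_step_quadratic_rule(f_prev, f_curr, gTp):
    # Minimizer of the quadratic interpolating f(x_{k-1}), f(x_k) and phi'(0) = grad f_k^T p_k,
    # safeguarded so that the unit step is eventually tried and accepted.
    alpha0 = 2.0 * (f_curr - f_prev) / gTp
    return min(1.0, 1.01 * alpha0)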

14. A Line Search Algorithm for the Wolfe Conditions:

We describe a one-dimensional search procedure that is guaranteed to find a step length satisfying the strong Wolfe conditions.
Line Search Algorithm:
set \(\alpha_0\leftarrow 0\), choose \(\alpha_1>0\) and \(\alpha_{max}\)
\(i\leftarrow 1\)
repeat
  evaluate \(\phi(\alpha_i)\)
  if \(\phi(\alpha_i)>\phi(0)+c_1\alpha_i\phi'(0)\) or (\(\phi(\alpha_i)\geq\phi(\alpha_{i-1})\) and \(i>1\))
    \(\alpha_{*}\leftarrow zoom(\alpha_{i-1},\alpha_i)\) and stop
  evaluate \(\phi'(\alpha_i)\)
  if \(|\phi'(\alpha_i)|\leq -c_2\phi'(0)\)
    set \(\alpha_{*}\leftarrow\alpha_i\) and stop
  if \(\phi'(\alpha_i)\geq 0\)
    set \(\alpha_*\leftarrow zoom(\alpha_i,\alpha_{i-1})\) and stop
  choose \(\alpha_{i+1}\in(\alpha_i,\alpha_{max})\)
  \(i\leftarrow i+1\)
end(repeat)

zoom:
repeat:
  interpolate (using quadratic or cubic interpolation, or bisection) to find a trial step length \(\alpha_j\) between \(\alpha_l\) and \(\alpha_h\)
  evaluate \(\phi(\alpha_j)\)
  if \(\phi(\alpha_j)>\phi(0)+c_1\alpha_j\phi'(0)\) or \(\phi(\alpha_j)\geq\phi(\alpha_l)\)
    \(\alpha_h\leftarrow\alpha_j\)
  else
    evaluate \(\phi'(\alpha_j)\)
    if \(|\phi'(\alpha_j)|\leq-c_2\phi'(0)\)
      set \(\alpha_{*}\leftarrow\alpha_j\) and stop
    if \(\phi'(\alpha_j)(\alpha_h-\alpha_l)\geq 0\)
      \(\alpha_h\leftarrow\alpha_l\)
    \(\alpha_l\leftarrow\alpha_j\)
end(repeat)

Remark: The line search algorithm uses the knowledge that the interval \((\alpha_{i-1},\alpha_i)\) contains step lengths satisfying the strong Wolfe conditions if one of the following conditions is satisfied:

  • \(\alpha_i\) violates the sufficient decrease condition
  • \(\phi(\alpha_i)\geq\phi(\alpha_{i-1})\)
  • \(\phi'(\alpha_i)\geq 0\)

Remark: The last step of the algorithm performs extrapolation to find the next trial value \(\alpha_{i+1}\), or simply sets \(\alpha_{i+1}\) to some constant multiple of \(\alpha_i\)

Remark: The order of the input arguments of \(zoom\) is such that each call has the form \(zoom(\alpha_l,\alpha_h)\), where:

  • the interval bounded by \(\alpha_l\) and \(\alpha_h\) contains step lengths that satisfy the strong Wolfe conditions
  • \(\alpha_l\) is, among all step lengths generated so far that satisfy the sufficient decrease condition, the one giving the smallest function value
  • \(\alpha_h\) is chosen so that \(\phi'(\alpha_l)(\alpha_h-\alpha_l)<0\)

Each iteration of zoom generates an iterate \(\alpha_j\) between \(\alpha_l\) and \(\alpha_h\) , and then replaces one of these endpoints by \(\alpha_j\) in such a way that the three above properties continue to hold.
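Remark: A compact Python sketch of the line search and zoom procedures above; for simplicity this version uses bisection inside zoom instead of interpolation and doubles the trial value for extrapolation, and the default parameters and the one-dimensional test problem are illustrative assumptions:

def line_search_strong_wolfe(phi, dphi, alpha1=1.0, alpha_max=10.0, c1=1e-4, c2=0.9, max_iter=30):
    # phi(a) = f(x_k + a p_k), dphi(a) = grad f(x_k + a p_k)^T p_k, with dphi(0) < 0.
    phi0, dphi0 = phi(0.0), dphi(0.0)

    def zoom(alpha_lo, alpha_hi):
        for _ in range(max_iter):
            alpha_j = 0.5 * (alpha_lo + alpha_hi)            # bisection in place of interpolation
            if phi(alpha_j) > phi0 + c1 * alpha_j * dphi0 or phi(alpha_j) >= phi(alpha_lo):
                alpha_hi = alpha_j
            else:
                d = dphi(alpha_j)
                if abs(d) <= -c2 * dphi0:                    # strong curvature condition
                    return alpha_j
                if d * (alpha_hi - alpha_lo) >= 0:
                    alpha_hi = alpha_lo
                alpha_lo = alpha_j
        return alpha_lo

    alpha_prev, alpha = 0.0, alpha1
    for i in range(1, max_iter + 1):
        if phi(alpha) > phi0 + c1 * alpha * dphi0 or (i > 1 and phi(alpha) >= phi(alpha_prev)):
            return zoom(alpha_prev, alpha)
        d = dphi(alpha)
        if abs(d) <= -c2 * dphi0:
            return alpha
        if d >= 0:
            return zoom(alpha, alpha_prev)
        alpha_prev, alpha = alpha, min(2.0 * alpha, alpha_max)   # extrapolate
    return alpha

# Illustrative one-dimensional test: f(x) = x^4 - 2 x^2, x_k = 2, p_k = -f'(2)
f = lambda x: x**4 - 2.0 * x**2
fp = lambda x: 4.0 * x**3 - 4.0 * x
xk, p = 2.0, -fp(2.0)
phi = lambda a: f(xk + a * p)
dphi = lambda a: fp(xk + a * p) * p
print(line_search_strong_wolfe(phi, dphi))   # 0.125: the step to x = 2 - 0.125*24 = -1, a minimizer of f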

Reprinted from: https://www.cnblogs.com/cihui/p/6402783.html
