Convex Optimization Algorithms - Unconstrained Problems - Descent Methods


Descent Methods

The algorithms described in this chapter produce a minimizing sequence $x^{(k)}$, $k = 1, \dots, K$, where $$x^{(k+1)} = x^{(k)} + t^{(k)} \Delta x,$$ and where

  • $\Delta x$ is a vector in $\mathbf{R}^n$ called the step or search direction;
  • $k = 1, \dots, K$ is the iteration number;
  • one iteration of an algorithm is written $x^+ = x + t\Delta x$, or $x := x + t\Delta x$, in place of $x^{(k+1)} = x^{(k)} + t^{(k)}\Delta x$.

All the methods we study are descent methods, which means that
$$f(x^{(k+1)}) < f(x^{(k)}),$$ except when $x^{(k)}$ is optimal.

From convexity we know that $\nabla f(x^{(k)})^T (y - x^{(k)}) \geq 0$ implies $f(y) \geq f(x^{(k)})$, so the search direction in a descent method must satisfy $$\nabla f(x^{(k)})^T \Delta x^{(k)} < 0,$$ i.e., it must make an acute angle with the negative gradient.

Note that the stopping condition is often of the form $\|\nabla f(x)\|_2 \leq \delta$, where $\delta > 0$ is small.

Exact Line Search

One line search method sometimes used in practice is exact line search, in which $t$ is chosen to minimize $f$ along the ray $\{x + t\Delta x \mid t \geq 0\}$: $$t = \operatorname*{argmin}_{s \geq 0} f(x + s\Delta x).$$
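A minimal MATLAB sketch of exact line search, assuming $f$ is available as a function handle and the one-dimensional minimizer is bracketed in $[0, 1]$ (the objective, point, and bracket below are hypothetical):

```matlab
% Exact line search via one-dimensional minimization (sketch).
f  = @(z) z(1)^2 + 10*z(2)^2;   % example objective (hypothetical)
x  = [1; 1];                    % current point
dx = -[2*x(1); 20*x(2)];        % a descent direction (negative gradient here)
phi = @(s) f(x + s*dx);         % f restricted to the ray {x + s*dx | s >= 0}
t = fminbnd(phi, 0, 1)          % minimize phi over the assumed bracket [0, 1]
```

In practice the bracket would be chosen so that it contains the minimizer of $\phi$.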

Backtracking Line Search

Most line searches used in practice are inexact: the step length is chosen to approximately minimize $f$ along the ray $\{x + t\Delta x \mid t \geq 0\}$, or even just to reduce $f$ 'enough'. One inexact line search method that is very simple and quite effective is called backtracking line search. It depends on two constants $\alpha, \beta$ with $0 < \alpha < 0.5$, $0 < \beta < 1$.
The method proceeds as follows:

given a descent direction $\Delta x$ for $f$ at $x \in \operatorname{dom} f$, constants $\alpha \in (0, 0.5)$, $\beta \in (0, 1)$.
$t := 1$.
while $f(x + t\Delta x) > f(x) + \alpha t \nabla f(x)^T \Delta x$, set $t := \beta t$.

The line search is called backtracking because it starts with unit step size and then reduces it by the factor $\beta$ until the stopping condition $f(x + t\Delta x) \leq f(x) + \alpha t \nabla f(x)^T \Delta x$ holds. Since $\Delta x$ is a descent direction, we have $\nabla f(x)^T \Delta x < 0$, so for small enough $t$ we have $$f(x + t\Delta x) \approx f(x) + t\nabla f(x)^T \Delta x < f(x) + \alpha t \nabla f(x)^T \Delta x,$$ which shows that the backtracking loop always terminates.
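The same loop in MATLAB, as a minimal sketch (the objective and current point are hypothetical; `g` is assumed to hold $\nabla f(x)$):

```matlab
% Backtracking line search (sketch).
alpha = 0.25; beta = 0.5;        % typical constants: 0 < alpha < 0.5, 0 < beta < 1
f  = @(z) z(1)^2 + 10*z(2)^2;    % example objective (hypothetical)
x  = [1; 1];
g  = [2*x(1); 20*x(2)];          % gradient of f at x
dx = -g;                         % descent direction, so g'*dx < 0

t = 1;                           % start with unit step size
while f(x + t*dx) > f(x) + alpha*t*(g'*dx)
    t = beta*t;                  % shrink until the sufficient-decrease condition holds
end
```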


Gradient Descent Method

When estimating the model parameters of machine learning algorithms, the Gradient Descent Method (GDM) and the Least-Squares Method (LSM) are frequently used.

Interpretation of GDM

GDM is iterative and can be pictured as descending a mountain. Suppose someone on a mountain wants an efficient strategy to reach the bottom quickly (i.e., to minimize the objective function). Taking the current position as the base point ($x^{(k)}$), they pick the direction of steepest descent ($\Delta x$) and move a step ($t\Delta x$) along it to reach the next point ($x^{(k+1)}$). The Gradient Ascent Method rests on the same idea, with the direction reversed.

Derivatives

The derivative can be interpreted in the following ways:

  • in the graph of a function, the slope of the tangent line at a point;
  • the rate of change of the function.
(1) Derivative of a function of one variable

As an example, $\frac{\mathrm{d}(x^2)}{\mathrm{d}x} = 2x$.

(2) Partial derivatives of a function of several variables

As an example, $\frac{\partial(x^2 y^2)}{\partial x} = 2xy^2$.

(3) Gradient

The gradient is a vector, denoted $\nabla f$ (or $\operatorname{grad} f$), along which the directional derivative of a function attains its maximum value. In other words, at a given point the function changes fastest along the gradient direction, and the maximum rate of change equals the modulus of the gradient, $|\operatorname{grad} f|$.
For example, denoting $f(x, y, z) = x^2 + y^2 + z^2$, we have $$\nabla f(x, y, z) = \left(\frac{\partial f}{\partial x}, \frac{\partial f}{\partial y}, \frac{\partial f}{\partial z}\right) = (2x, 2y, 2z).$$

An example of computing the gradient of a function in MATLAB is as follows:

```matlab
syms x y z
f = x^2 + y^2 + z^2;
grad = gradient(f, [x, y, z])   % returns the symbolic vector [2*x; 2*y; 2*z]
```

Then, the pseudo-code of GDM is as follows:

given a starting point $x \in \operatorname{dom} f$.
repeat
  1. $\Delta x := -\nabla f(x)$.
  2. Line search: choose a step size $t$ via exact or backtracking line search.
  3. Update: $x := x + t\Delta x$.
until the stopping criterion is satisfied.

  • The stopping criterion is usually of the form $\|\nabla f(x)\|_2 \leq \eta$, where $\eta$ is small and positive.
  • In most implementations, this condition is checked after step 1, rather than after the update. A MATLAB sketch of the whole scheme follows.
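A minimal MATLAB sketch of gradient descent with backtracking line search (the objective, gradient, tolerance, and constants are hypothetical):

```matlab
% Gradient descent with backtracking line search (sketch).
f     = @(z) z(1)^2 + 10*z(2)^2;   % example objective (hypothetical)
gradf = @(z) [2*z(1); 20*z(2)];    % its gradient
x = [1; 1];                        % starting point
eta = 1e-6;                        % stopping tolerance on ||grad f||_2
alpha = 0.25; beta = 0.5;          % backtracking constants

g = gradf(x);
while norm(g) > eta                % stopping check after computing the direction
    dx = -g;                       % step 1: descent direction
    t = 1;                         % step 2: backtracking line search
    while f(x + t*dx) > f(x) + alpha*t*(g'*dx)
        t = beta*t;
    end
    x = x + t*dx;                  % step 3: update
    g = gradf(x);
end
```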

Convergence Analysis of the Gradient Descent Method

We first assume $f$ is strongly convex on $\mathcal{S}$, so there are positive constants $m$ and $M$ such that $m\mathbf{I} \preceq \nabla^2 f(x) \preceq M\mathbf{I}$ for all $x \in \mathcal{S}$. Define the function $\tilde{f}: \mathbf{R} \rightarrow \mathbf{R}$ by $\tilde{f}(t) = f(x - t\nabla f(x))$, i.e., $f$ as a function of the step length $t$ in the negative gradient direction. In the following discussion, we will only consider $t$ for which $x - t\nabla f(x) \in \mathcal{S}$. From $(9.13)$, with $x^{k+1} = x^k - t\nabla f(x^k)$, we obtain a quadratic upper bound on $\tilde{f}$: $$f(x^{k+1}) \leq f(x^k) + \nabla f(x^k)^T\left(-t\nabla f(x^k)\right) + \frac{M}{2}\left\|{-t\nabla f(x^k)}\right\|_2^2 = f(x^k) - t\left(1 - \frac{Mt}{2}\right)\|\nabla f(x^k)\|_2^2. \qquad (9.17)$$

Analysis for exact line search

We now assume that an exact line search is used, and minimize both sides of the inequality $(9.17)$ over $t$. On the lefthand side we get $\tilde{f}(t_{\mathrm{exact}})$, where $t_{\mathrm{exact}}$ is the step length that minimizes $\tilde{f}$. The righthand side is a simple quadratic in $t$, which is minimized by $t = \frac{1}{M}$ and has minimum value $f(x^k) - \frac{1}{2M}\|\nabla f(x^k)\|_2^2$.
Therefore, we have
$$f(x^{k+1}) = \tilde{f}(t_{\mathrm{exact}}) \leq f(x^k) - \frac{1}{2M}\|\nabla f(x^k)\|_2^2.$$
Subtracting $p^*$ from both sides, we get
$$f(x^{k+1}) - p^* \leq f(x^k) - p^* - \frac{1}{2M}\|\nabla f(x^k)\|_2^2.$$

We combine this with $\|\nabla f(x^k)\|_2^2 \geq 2m(f(x^k) - p^*)$ (which follows from $(9.9)$) to conclude
$$f(x^{k+1}) - p^* \leq \left(1 - \frac{m}{M}\right)\left(f(x^k) - p^*\right).$$
Applying this inequality recursively, we find that
$$f(x^k) - p^* \leq \left(1 - \frac{m}{M}\right)^k\left(f(x^0) - p^*\right) = c^k\left(f(x^0) - p^*\right), \qquad (9.18)$$ where $c = 1 - \frac{m}{M} < 1$, which shows that $f(x^k)$ converges to $p^*$ as $k \rightarrow \infty$.
In particular, we must have $f(x^k) - p^* \leq \epsilon$ after at most
$$\frac{\log\left((f(x^0) - p^*)/\epsilon\right)}{\log(1/c)}$$ iterations of the gradient method with exact line search.
The numerator, $\log\left((f(x^0) - p^*)/\epsilon\right)$, can be interpreted as the log of the ratio of the initial suboptimality (i.e., the gap between $f(x^0)$ and $p^*$) to the final suboptimality (i.e., less than $\epsilon$). This term suggests that the number of iterations depends on how good the initial point is and on the required final accuracy.
For a large condition number bound $M/m$, we have
$$\log(1/c) = -\log(1 - m/M) \approx m/M,$$ so our bound on the number of iterations required increases approximately linearly with increasing $M/m$.
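As a quick numerical illustration (with hypothetical numbers): if $M/m = 100$ and the initial suboptimality exceeds the target accuracy by a factor of $(f(x^0) - p^*)/\epsilon = 10^4$, the bound gives roughly $$\frac{\log(10^4)}{m/M} \approx 9.21 \times 100 \approx 921$$ iterations.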

We will see that the gradient method does in fact require a large number of iterations when the Hessian of $f$ near $x^\star$ has a large condition number. Conversely, when the sublevel sets of $f$ are relatively isotropic, so that the condition number bound $M/m$ can be chosen to be relatively small, the bound $(9.18)$ shows that convergence is rapid, since $c$ is small, or at least not too close to one.

Analysis for backtracking line search
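The analysis parallels the exact case; summarizing the standard result (stated here without proof, under the same strong convexity assumptions): the backtracking search terminates either with $t = 1$ or with $t \geq \beta/M$, which yields $$f(x^{k+1}) \leq f(x^k) - \min\{\alpha, \beta\alpha/M\}\,\|\nabla f(x^k)\|_2^2,$$ and hence linear convergence $f(x^{k+1}) - p^* \leq c\left(f(x^k) - p^*\right)$ with $c = 1 - \min\{2m\alpha, 2\beta\alpha m/M\} < 1$.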


Steepest Descent Method

The first-order Taylor approximation of $f(x+v)$ around $x$ is
$$f(x+v) \approx \hat{f}(x+v) = f(x) + \nabla f(x)^T v.$$
The second term on the righthand side, $\nabla f(x)^T v$, is the directional derivative of $f$ at $x$ in the direction $v$. It gives the approximate change in $f$ for a small step $v$.
We now address the question of how to choose $v$ to make the directional derivative as negative (small) as possible.
Let $\|\cdot\|$ be any norm on $\mathbf{R}^n$. We define a normalized steepest descent direction (with respect to the norm $\|\cdot\|$) as
$$\Delta x = \operatorname*{argmin}_{v}\,\{\nabla f(x)^T v \mid \|v\| = 1\}.$$ Note that for the Euclidean norm $\|\cdot\|_2$, the steepest descent direction is $-\nabla f(x)/\|\nabla f(x)\|_2$, so SDM coincides with GDM (up to the scaling of the step). A sketch for two common norms follows.
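A minimal MATLAB sketch of normalized steepest descent directions for the $\ell_2$ and $\ell_1$ norms (the gradient values are hypothetical; for $\ell_1$, the direction moves along the single coordinate with the largest gradient magnitude):

```matlab
% Normalized steepest descent directions (sketch).
g = [3; -1; 2];                 % gradient of f at x (hypothetical values)

% l2 norm: the normalized negative gradient.
dx_l2 = -g / norm(g);

% l1 norm: move along the coordinate with largest |g(i)|.
[~, i] = max(abs(g));
dx_l1 = zeros(size(g));
dx_l1(i) = -sign(g(i));
```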


Gradient Descent Methods Related to Machine Learning

Related Concepts

A. Learning rate / step size:
The learning rate determines the length of each step.

B. Feature:
Features are the input part of a sample. For example, given two samples $(x^{(0)}, y^{(0)})$ and $(x^{(1)}, y^{(1)})$, the output corresponding to the sample feature $x^{(0)}$ is $y^{(0)}$.

C. Hypothesis function:
In supervised learning, the hypothesis function is used to fit the input samples; it is denoted $$h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n.$$

D. Loss function:
$$J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2.$$ A sketch of evaluating $J(\theta)$ follows.
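A minimal MATLAB sketch of evaluating this loss for linear regression, assuming `X` is an $m \times (n+1)$ matrix whose first column is all ones and `y` is an $m \times 1$ vector (the data and parameters below are hypothetical):

```matlab
% Evaluating the loss J(theta) (sketch).
X = [ones(4,1), (1:4)'];        % m = 4 samples: intercept column plus one feature
y = [1; 3; 5; 7];               % targets (hypothetical)
theta = [0; 1];                 % parameters [theta_0; theta_1]
m = size(X,1);
J = (1/(2*m)) * sum((X*theta - y).^2)   % J(theta) as defined above
```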

Batch Gradient Descent Method

A. Denoting the loss (energy) function as $J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$, our objective is $\min J(\theta)$.

B. For a single sample $i$, the partial derivative of the squared error with respect to $\theta_j$ is
$$\begin{aligned} \frac{\partial J(\theta)}{\partial \theta_j} &= \frac{\partial}{\partial \theta_j}\,\frac{1}{2}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 \\ &= 2 \cdot \frac{1}{2}\left(h_\theta(x^{(i)}) - y^{(i)}\right) \cdot \frac{\partial}{\partial \theta_j}\left(h_\theta(x^{(i)}) - y^{(i)}\right) \\ &= \left(h_\theta(x^{(i)}) - y^{(i)}\right) x_j^{(i)} \\ &= -\left(y^{(i)} - h_\theta(x^{(i)})\right) x_j^{(i)}, \end{aligned}$$
denoted as $D = -\left(y^{(i)} - h_\theta(x^{(i)})\right) x_j^{(i)}$.

C. According to the above equation, the update along the negative gradient is
$$\Delta\theta_j = -\alpha D = \alpha\left(y^{(i)} - h_\theta(x^{(i)})\right) x_j^{(i)},$$ where $\alpha$ is the step size (learning rate).
D. The pseudo-code of BGDM is sketched below:
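A minimal MATLAB sketch of batch gradient descent for linear regression; each update uses all $m$ samples (the data, learning rate, and iteration count are hypothetical):

```matlab
% Batch gradient descent for linear regression (sketch).
X = [ones(5,1), (1:5)'];        % m = 5 samples: intercept column plus one feature
y = [2; 4; 6; 8; 10];           % targets (hypothetical)
alpha = 0.05;                   % learning rate (step size)
theta = zeros(2,1);             % parameters [theta_0; theta_1]
m = size(X,1);

for k = 1:1000
    grad  = (1/m) * X' * (X*theta - y);   % gradient of J over the whole batch
    theta = theta - alpha * grad;         % descent update
end
```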


Stochastic Gradient Descent Method

A. First, we rewrite the loss function above as
$$J(\theta) = \frac{1}{m}\sum_{i=1}^{m}\frac{1}{2}\left(y^{(i)} - h_\theta(x^{(i)})\right)^2 = \frac{1}{m}\sum_{i=1}^{m}\operatorname{cost}\left(\theta, (x^{(i)}, y^{(i)})\right),$$ where $\operatorname{cost}\left(\theta, (x^{(i)}, y^{(i)})\right) = \frac{1}{2}\left(y^{(i)} - h_\theta(x^{(i)})\right)^2$.

B. Taking the partial derivative of the cost of a single sample with respect to $\theta_j$, we have the update
$$\Delta\theta_j = \alpha\left(y^{(i)} - h_\theta(x^{(i)})\right) x_j^{(i)}.$$

C. The pseudo-code of SGDM is sketched below:
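A minimal MATLAB sketch of stochastic gradient descent, using the same hypothetical data layout as the batch example above; each update uses a single sample, visited in random order:

```matlab
% Stochastic gradient descent for linear regression (sketch).
X = [ones(5,1), (1:5)'];        % m = 5 samples (hypothetical)
y = [2; 4; 6; 8; 10];
alpha = 0.01;                   % learning rate
theta = zeros(2,1);
m = size(X,1);

for epoch = 1:200
    for i = randperm(m)                        % visit samples in random order
        err   = y(i) - X(i,:)*theta;           % residual of one sample
        theta = theta + alpha * err * X(i,:)'; % update from that single sample
    end
end
```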


Mini-Batch Gradient Descent Method

Combining the characteristics of BGDM and SGDM, we obtain the Mini-Batch Gradient Descent Method (MBGDM): each update uses a small batch of samples rather than all of them (as in BGDM) or just one (as in SGDM).

The pseudo-code of MBGDM is sketched below:
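A minimal MATLAB sketch of mini-batch gradient descent (hypothetical data as above; the batch size `b` is an assumption):

```matlab
% Mini-batch gradient descent for linear regression (sketch).
X = [ones(8,1), (1:8)'];        % m = 8 samples (hypothetical)
y = 2*(1:8)';
alpha = 0.02;                   % learning rate
theta = zeros(2,1);
m = size(X,1);
b = 4;                          % mini-batch size (assumption)

for epoch = 1:500
    idx = randperm(m);                               % shuffle once per epoch
    for s = 1:b:m
        batch = idx(s:min(s+b-1, m));                % indices of one mini-batch
        Xb = X(batch,:);  yb = y(batch);
        grad  = (1/numel(batch)) * Xb' * (Xb*theta - yb);
        theta = theta - alpha * grad;                % descent update on the batch
    end
end
```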

