Convex Optimization Algorithms - Unconstrained Problems - Descent Methods


Descent Methods

The algorithms described in this chapter produce a minimizing sequence $x^{(k)}$, $k = 1, \dots, K$, where $$x^{(k+1)} = x^{(k)} + t^{(k)} \Delta x,$$ and where

  • $\Delta x$ is a vector in $\mathbf{R}^n$ called the step or search direction;
  • $k = 1, \dots, K$ is the iteration number;
  • one iteration of an algorithm is written $x^+ = x + t\Delta x$, or $x := x + t\Delta x$, in place of $x^{(k+1)} = x^{(k)} + t^{(k)}\Delta x$.

All the methods we study are descent methods, which means that
$$f(x^{(k+1)}) < f(x^{(k)}),$$ except when $x^{(k)}$ is optimal.

From convexity we know that $\nabla f(x^{(k)})^T (y - x^{(k)}) \geq 0$ implies $f(y) \geq f(x^{(k)})$, so the search direction in a descent method must satisfy $$\nabla f(x^{(k)})^T \Delta x^{(k)} < 0,$$ i.e., it must make an acute angle with the negative gradient.

Note that the stopping condition is often of the form $\|\nabla f(x)\|_2 \leq \delta$, where $\delta > 0$ is small.

Exact Line Search

One line search method sometimes used in practice is exact line search, in which $t$ is chosen to minimize $f$ along the ray $\{x + t\Delta x \mid t \geq 0\}$: $$t = \operatorname*{argmin}_{s \geq 0} f(x + s\Delta x).$$
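A minimal MATLAB sketch of exact line search, assuming $f$ is available as a function handle and the one-dimensional minimizer is bracketed in $[0, 1]$ (the objective, point, and bracket below are hypothetical):

```matlab
% Exact line search via one-dimensional minimization (sketch).
f  = @(z) z(1)^2 + 10*z(2)^2;   % example objective (hypothetical)
x  = [1; 1];                    % current point
dx = -[2*x(1); 20*x(2)];        % a descent direction (negative gradient here)
phi = @(s) f(x + s*dx);         % f restricted to the ray {x + s*dx | s >= 0}
t = fminbnd(phi, 0, 1)          % minimize phi over the assumed bracket [0, 1]
```

In practice the bracket would be chosen so that it contains the minimizer of $\phi$.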

Backtracking Line Search

Most line searches used in practice are inexact: the step length is chosen to approximately minimize $f$ along the ray $\{x + t\Delta x \mid t \geq 0\}$, or even just to reduce $f$ 'enough'. One inexact line search method that is very simple and quite effective is called backtracking line search. It depends on two constants $\alpha, \beta$ with $0 < \alpha < 0.5$, $0 < \beta < 1$.
The method proceeds as follows:

given a descent direction $\Delta x$ for $f$ at $x \in \operatorname{dom} f$, constants $\alpha \in (0, 0.5)$, $\beta \in (0, 1)$.
$t := 1$.
while $f(x + t\Delta x) > f(x) + \alpha t \nabla f(x)^T \Delta x$, set $t := \beta t$.

The line search is called backtracking because it starts with unit step size and then reduces it by the factor $\beta$ until the stopping condition $f(x + t\Delta x) \leq f(x) + \alpha t \nabla f(x)^T \Delta x$ holds. Since $\Delta x$ is a descent direction, we have $\nabla f(x)^T \Delta x < 0$, so for small enough $t$ we have $$f(x + t\Delta x) \approx f(x) + t\nabla f(x)^T \Delta x < f(x) + \alpha t \nabla f(x)^T \Delta x,$$ which shows that the backtracking loop always terminates.
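The same loop in MATLAB, as a minimal sketch (the objective and current point are hypothetical; `g` is assumed to hold $\nabla f(x)$):

```matlab
% Backtracking line search (sketch).
alpha = 0.25; beta = 0.5;        % typical constants: 0 < alpha < 0.5, 0 < beta < 1
f  = @(z) z(1)^2 + 10*z(2)^2;    % example objective (hypothetical)
x  = [1; 1];
g  = [2*x(1); 20*x(2)];          % gradient of f at x
dx = -g;                         % descent direction, so g'*dx < 0

t = 1;                           % start with unit step size
while f(x + t*dx) > f(x) + alpha*t*(g'*dx)
    t = beta*t;                  % shrink until the sufficient-decrease condition holds
end
```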


Gradient Descent Method

When estimating the model parameters of machine learning algorithms, the Gradient Descent Method (GDM) and the Least-Squares Method (LSM) are frequently used.

Interpretation of GDM

GDM is iterative and can be pictured as descending a mountain. Suppose someone on a mountain wants an efficient strategy to reach the bottom quickly (i.e., to minimize the objective function). Taking the current position as the base point ($x^{(k)}$), they pick the direction of steepest descent ($\Delta x$) and move a step ($t\Delta x$) along it to reach the next point ($x^{(k+1)}$). The Gradient Ascent Method rests on the same idea, with the direction reversed.

Derivatives

The derivative can be interpreted in the following ways:

  • in the graph of a function, the slope of the tangent line at a point;
  • the rate of change of the function.
(1) Derivative of a function of one variable

As an example, $\frac{\mathrm{d}(x^2)}{\mathrm{d}x} = 2x$.

(2) Partial derivatives of a function of several variables

As an example, $\frac{\partial(x^2 y^2)}{\partial x} = 2xy^2$.

(3) Gradient

The gradient is a vector, denoted $\nabla f$ (or $\operatorname{grad} f$), along which the directional derivative of a function attains its maximum value. In other words, at a given point the function changes fastest along the gradient direction, and the maximum rate of change equals the modulus of the gradient, $|\operatorname{grad} f|$.
For example, denoting $f(x, y, z) = x^2 + y^2 + z^2$, we have $$\nabla f(x, y, z) = \left(\frac{\partial f}{\partial x}, \frac{\partial f}{\partial y}, \frac{\partial f}{\partial z}\right) = (2x, 2y, 2z).$$

An example of computing the gradient of a function in MATLAB is as follows:

```matlab
syms x y z
f = x^2 + y^2 + z^2;
grad = gradient(f, [x, y, z])   % returns the symbolic vector [2*x; 2*y; 2*z]
```

Then, the pseudo-code of GDM is as follows:

given a starting point $x \in \operatorname{dom} f$.
repeat
  1. $\Delta x := -\nabla f(x)$.
  2. Line search: choose a step size $t$ via exact or backtracking line search.
  3. Update: $x := x + t\Delta x$.
until the stopping criterion is satisfied.

  • The stopping criterion is usually of the form $\|\nabla f(x)\|_2 \leq \eta$, where $\eta$ is small and positive.
  • In most implementations, this condition is checked after step 1, rather than after the update. A MATLAB sketch of the whole scheme follows.
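A minimal MATLAB sketch of gradient descent with backtracking line search (the objective, gradient, tolerance, and constants are hypothetical):

```matlab
% Gradient descent with backtracking line search (sketch).
f     = @(z) z(1)^2 + 10*z(2)^2;   % example objective (hypothetical)
gradf = @(z) [2*z(1); 20*z(2)];    % its gradient
x = [1; 1];                        % starting point
eta = 1e-6;                        % stopping tolerance on ||grad f||_2
alpha = 0.25; beta = 0.5;          % backtracking constants

g = gradf(x);
while norm(g) > eta                % stopping check after computing the direction
    dx = -g;                       % step 1: descent direction
    t = 1;                         % step 2: backtracking line search
    while f(x + t*dx) > f(x) + alpha*t*(g'*dx)
        t = beta*t;
    end
    x = x + t*dx;                  % step 3: update
    g = gradf(x);
end
```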

Convergence Analysis of the Gradient Descent Method

We first assume $f$ is strongly convex on $\mathcal{S}$, so there are positive constants $m$ and $M$ such that $m\mathbf{I} \preceq \nabla^2 f(x) \preceq M\mathbf{I}$ for all $x \in \mathcal{S}$. Define the function $\tilde{f}: \mathbf{R} \rightarrow \mathbf{R}$ by $\tilde{f}(t) = f(x - t\nabla f(x))$, i.e., $f$ as a function of the step length $t$ in the negative gradient direction. In the following discussion, we will only consider $t$ for which $x - t\nabla f(x) \in \mathcal{S}$. From $(9.13)$, with $x^{k+1} = x^k - t\nabla f(x^k)$, we obtain a quadratic upper bound on $\tilde{f}$: $$f(x^{k+1}) \leq f(x^k) + \nabla f(x^k)^T\left(-t\nabla f(x^k)\right) + \frac{M}{2}\left\|{-t\nabla f(x^k)}\right\|_2^2 = f(x^k) - t\left(1 - \frac{Mt}{2}\right)\|\nabla f(x^k)\|_2^2. \qquad (9.17)$$

Analysis for exact line search

We now assume that an exact line search is used, and minimize both sides of the inequality $(9.17)$ over $t$. On the lefthand side we get $\tilde{f}(t_{\mathrm{exact}})$, where $t_{\mathrm{exact}}$ is the step length that minimizes $\tilde{f}$. The righthand side is a simple quadratic in $t$, which is minimized by $t = \frac{1}{M}$ and has minimum value $f(x^k) - \frac{1}{2M}\|\nabla f(x^k)\|_2^2$.
Therefore, we have
$$f(x^{k+1}) = \tilde{f}(t_{\mathrm{exact}}) \leq f(x^k) - \frac{1}{2M}\|\nabla f(x^k)\|_2^2.$$
Subtracting $p^*$ from both sides, we get
$$f(x^{k+1}) - p^* \leq f(x^k) - p^* - \frac{1}{2M}\|\nabla f(x^k)\|_2^2.$$

We combine this with $\|\nabla f(x^k)\|_2^2 \geq 2m(f(x^k) - p^*)$ (which follows from $(9.9)$) to conclude
$$f(x^{k+1}) - p^* \leq \left(1 - \frac{m}{M}\right)\left(f(x^k) - p^*\right).$$
Applying this inequality recursively, we find that
$$f(x^k) - p^* \leq \left(1 - \frac{m}{M}\right)^k\left(f(x^0) - p^*\right) = c^k\left(f(x^0) - p^*\right), \qquad (9.18)$$ where $c = 1 - \frac{m}{M} < 1$, which shows that $f(x^k)$ converges to $p^*$ as $k \rightarrow \infty$.
In particular, we must have $f(x^k) - p^* \leq \epsilon$ after at most
$$\frac{\log\left((f(x^0) - p^*)/\epsilon\right)}{\log(1/c)}$$ iterations of the gradient method with exact line search.
The numerator, $\log\left((f(x^0) - p^*)/\epsilon\right)$, can be interpreted as the log of the ratio of the initial suboptimality (i.e., the gap between $f(x^0)$ and $p^*$) to the final suboptimality (i.e., less than $\epsilon$). This term suggests that the number of iterations depends on how good the initial point is and on the required final accuracy.
For a large condition number bound $M/m$, we have
$$\log(1/c) = -\log(1 - m/M) \approx m/M,$$ so our bound on the number of iterations required increases approximately linearly with increasing $M/m$.
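As a quick numerical illustration (with hypothetical numbers): if $M/m = 100$ and the initial suboptimality exceeds the target accuracy by a factor of $(f(x^0) - p^*)/\epsilon = 10^4$, the bound gives roughly $$\frac{\log(10^4)}{m/M} \approx 9.21 \times 100 \approx 921$$ iterations.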

We will see that the gradient method does in fact require a large number of iterations when the Hessian of $f$ near $x^\star$ has a large condition number. Conversely, when the sublevel sets of $f$ are relatively isotropic, so that the condition number bound $M/m$ can be chosen to be relatively small, the bound $(9.18)$ shows that convergence is rapid, since $c$ is small, or at least not too close to one.

Analysis for backtracking line search
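The analysis parallels the exact case; summarizing the standard result (stated here without proof, under the same strong convexity assumptions): the backtracking search terminates either with $t = 1$ or with $t \geq \beta/M$, which yields $$f(x^{k+1}) \leq f(x^k) - \min\{\alpha, \beta\alpha/M\}\,\|\nabla f(x^k)\|_2^2,$$ and hence linear convergence $f(x^{k+1}) - p^* \leq c\left(f(x^k) - p^*\right)$ with $c = 1 - \min\{2m\alpha, 2\beta\alpha m/M\} < 1$.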


Steepest Descent Method

The first-order Taylor approximation of $f(x+v)$ around $x$ is
$$f(x+v) \approx \hat{f}(x+v) = f(x) + \nabla f(x)^T v.$$
The second term on the righthand side, $\nabla f(x)^T v$, is the directional derivative of $f$ at $x$ in the direction $v$. It gives the approximate change in $f$ for a small step $v$.
We now address the question of how to choose $v$ to make the directional derivative as negative (small) as possible.
Let $\|\cdot\|$ be any norm on $\mathbf{R}^n$. We define a normalized steepest descent direction (with respect to the norm $\|\cdot\|$) as
$$\Delta x = \operatorname*{argmin}_{v}\,\{\nabla f(x)^T v \mid \|v\| = 1\}.$$ Note that for the Euclidean norm $\|\cdot\|_2$, the steepest descent direction is $-\nabla f(x)/\|\nabla f(x)\|_2$, so SDM coincides with GDM (up to the scaling of the step). A sketch for two common norms follows.
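A minimal MATLAB sketch of normalized steepest descent directions for the $\ell_2$ and $\ell_1$ norms (the gradient values are hypothetical; for $\ell_1$, the direction moves along the single coordinate with the largest gradient magnitude):

```matlab
% Normalized steepest descent directions (sketch).
g = [3; -1; 2];                 % gradient of f at x (hypothetical values)

% l2 norm: the normalized negative gradient.
dx_l2 = -g / norm(g);

% l1 norm: move along the coordinate with largest |g(i)|.
[~, i] = max(abs(g));
dx_l1 = zeros(size(g));
dx_l1(i) = -sign(g(i));
```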


Gradient Descent Methods Related to Machine Learning

Related Concepts

A. Learning rate / step size:
The learning rate determines the length of each step.

B. Feature:
Features are the input part of a sample. For example, given two samples $(x^{(0)}, y^{(0)})$ and $(x^{(1)}, y^{(1)})$, the output corresponding to the sample feature $x^{(0)}$ is $y^{(0)}$.

C. Hypothesis function:
In supervised learning, the hypothesis function is used to fit the input samples; it is denoted $$h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n.$$

D. Loss function:
$$J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2.$$ A sketch of evaluating $J(\theta)$ follows.
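A minimal MATLAB sketch of evaluating this loss for linear regression, assuming `X` is an $m \times (n+1)$ matrix whose first column is all ones and `y` is an $m \times 1$ vector (the data and parameters below are hypothetical):

```matlab
% Evaluating the loss J(theta) (sketch).
X = [ones(4,1), (1:4)'];        % m = 4 samples: intercept column plus one feature
y = [1; 3; 5; 7];               % targets (hypothetical)
theta = [0; 1];                 % parameters [theta_0; theta_1]
m = size(X,1);
J = (1/(2*m)) * sum((X*theta - y).^2)   % J(theta) as defined above
```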

Batch Gradient Descent Method

A. Denoting the loss (energy) function as $J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$, our objective is $\min J(\theta)$.

B. For a single sample $i$, the partial derivative of the squared error with respect to $\theta_j$ is
$$\begin{aligned} \frac{\partial J(\theta)}{\partial \theta_j} &= \frac{\partial}{\partial \theta_j}\,\frac{1}{2}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 \\ &= 2 \cdot \frac{1}{2}\left(h_\theta(x^{(i)}) - y^{(i)}\right) \cdot \frac{\partial}{\partial \theta_j}\left(h_\theta(x^{(i)}) - y^{(i)}\right) \\ &= \left(h_\theta(x^{(i)}) - y^{(i)}\right) x_j^{(i)} \\ &= -\left(y^{(i)} - h_\theta(x^{(i)})\right) x_j^{(i)}, \end{aligned}$$
denoted as $D = -\left(y^{(i)} - h_\theta(x^{(i)})\right) x_j^{(i)}$.

C. According to the above equation, the update along the negative gradient is
$$\Delta\theta_j = -\alpha D = \alpha\left(y^{(i)} - h_\theta(x^{(i)})\right) x_j^{(i)},$$ where $\alpha$ is the step size (learning rate).
D. The pseudo-code of BGDM is sketched below:
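A minimal MATLAB sketch of batch gradient descent for linear regression; each update uses all $m$ samples (the data, learning rate, and iteration count are hypothetical):

```matlab
% Batch gradient descent for linear regression (sketch).
X = [ones(5,1), (1:5)'];        % m = 5 samples: intercept column plus one feature
y = [2; 4; 6; 8; 10];           % targets (hypothetical)
alpha = 0.05;                   % learning rate (step size)
theta = zeros(2,1);             % parameters [theta_0; theta_1]
m = size(X,1);

for k = 1:1000
    grad  = (1/m) * X' * (X*theta - y);   % gradient of J over the whole batch
    theta = theta - alpha * grad;         % descent update
end
```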


Stochastic Gradient Descent Method

A. First, we rewrite the loss function above as
$$J(\theta) = \frac{1}{m}\sum_{i=1}^{m}\frac{1}{2}\left(y^{(i)} - h_\theta(x^{(i)})\right)^2 = \frac{1}{m}\sum_{i=1}^{m}\operatorname{cost}\left(\theta, (x^{(i)}, y^{(i)})\right),$$ where $\operatorname{cost}\left(\theta, (x^{(i)}, y^{(i)})\right) = \frac{1}{2}\left(y^{(i)} - h_\theta(x^{(i)})\right)^2$.

B. Taking the partial derivative of the cost of a single sample with respect to $\theta_j$, we have the update
$$\Delta\theta_j = \alpha\left(y^{(i)} - h_\theta(x^{(i)})\right) x_j^{(i)}.$$

C. The pseudo-code of SGDM is sketched below:
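A minimal MATLAB sketch of stochastic gradient descent, using the same hypothetical data layout as the batch example above; each update uses a single sample, visited in random order:

```matlab
% Stochastic gradient descent for linear regression (sketch).
X = [ones(5,1), (1:5)'];        % m = 5 samples (hypothetical)
y = [2; 4; 6; 8; 10];
alpha = 0.01;                   % learning rate
theta = zeros(2,1);
m = size(X,1);

for epoch = 1:200
    for i = randperm(m)                        % visit samples in random order
        err   = y(i) - X(i,:)*theta;           % residual of one sample
        theta = theta + alpha * err * X(i,:)'; % update from that single sample
    end
end
```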


Mini-Batch Gradient Descent Method

Combining the characteristics of BGDM and SGDM, we obtain the Mini-Batch Gradient Descent Method (MBGDM): each update uses a small batch of samples rather than all of them (as in BGDM) or just one (as in SGDM).

The pseudo-code of MBGDM is sketched below:
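A minimal MATLAB sketch of mini-batch gradient descent (hypothetical data as above; the batch size `b` is an assumption):

```matlab
% Mini-batch gradient descent for linear regression (sketch).
X = [ones(8,1), (1:8)'];        % m = 8 samples (hypothetical)
y = 2*(1:8)';
alpha = 0.02;                   % learning rate
theta = zeros(2,1);
m = size(X,1);
b = 4;                          % mini-batch size (assumption)

for epoch = 1:500
    idx = randperm(m);                               % shuffle once per epoch
    for s = 1:b:m
        batch = idx(s:min(s+b-1, m));                % indices of one mini-batch
        Xb = X(batch,:);  yb = y(batch);
        grad  = (1/numel(batch)) * Xb' * (Xb*theta - yb);
        theta = theta - alpha * grad;                % descent update on the batch
    end
end
```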

