[First order method] Gradient descent

1. Gradient descent

1.1 Model to consider

Consider unconstrained, smooth convex optimization

$$\min_x f(x)$$

i.e., $f$ is convex and differentiable with $\mathrm{dom}(f) = \mathbb{R}^n$. Denote the optimal criterion value by $f^\star = \min_x f(x)$, and a solution by $x^\star$.

Gradient descent: choose an initial value $x^{(0)}$, then repeat

$$x^{(k)} = x^{(k-1)} - t_k \nabla f(x^{(k-1)}), \qquad k = 1, 2, 3, \ldots$$

Stop at some point, e.g., when $\|\nabla f(x^{(k)})\|_2$ is small.
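
As a concrete sketch of the iteration, here is a minimal NumPy implementation; the quadratic test function, step size, and tolerance below are illustrative assumptions, not part of the original notes.

```python
import numpy as np

def gradient_descent(grad, x0, t=0.1, max_iter=1000, tol=1e-8):
    """Fixed-step gradient descent: x^(k) = x^(k-1) - t * grad(x^(k-1))."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:   # stop when the gradient is small
            break
        x = x - t * g
    return x

# Illustrative example: f(x) = 0.5 x^T A x - b^T x, so grad f(x) = A x - b
A = np.array([[3.0, 0.5], [0.5, 1.0]])
b = np.array([1.0, 2.0])
x_hat = gradient_descent(lambda x: A @ x - b, x0=np.zeros(2), t=0.2)
print(x_hat, np.linalg.solve(A, b))   # the two should nearly coincide
```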

1.2 Interpretation

1.2.1 Interpretation via Newton's method

Since $f(x)$ is smooth and convex, the minimizer $x^\star$ is characterized by the first-order condition

$$\nabla f(x^\star) = 0$$

  • If $\nabla f(x) = 0$ can be solved easily, we can obtain $x^\star$ directly from this equation.
  • If $\nabla f(x) = 0$ is difficult to solve, we can use a linear approximation of $\nabla f(x)$ at $x^{(0)}$:
    $$\ell(x) = \nabla f(x^{(0)}) + \nabla^2 f(x^{(0)})\,(x - x^{(0)})$$

    and by setting this linear approximation to zero, we get
    $$x^{(\mathrm{new})} = x^{(0)} - \nabla^2 f(x^{(0)})^{-1} \nabla f(x^{(0)})$$

But for many functions, computing the Hessian $\nabla^2 f(x)$ is difficult or expensive. So we can replace $\nabla^2 f(x^{(0)})$ by $\frac{1}{t} I$:

$$x^{(\mathrm{new})} = x^{(0)} - t\,\nabla f(x^{(0)})$$

The core idea behind gradient descent is therefore to use a linear approximation of $\nabla f(x)$ to find the root of $\nabla f(x)$, which is exactly the Newton-Raphson method: a method for finding successively better approximations to the roots (zeros) of a real-valued function.
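
To make the connection concrete, here is a tiny one-dimensional sketch contrasting a Newton-Raphson step on $f'(x)$ with the gradient step obtained when $f''(x)$ is replaced by $1/t$; the function $f(x) = x^4$ and the step size are made up for illustration.

```python
# Hypothetical example: f(x) = x**4, so f'(x) = 4x**3 and f''(x) = 12x**2.
def newton_step(x):
    return x - (4 * x**3) / (12 * x**2)   # x - f''(x)^{-1} f'(x)

def gradient_step(x, t=0.01):
    return x - t * (4 * x**3)             # x - t f'(x): f''(x) replaced by 1/t

x_newton = x_grad = 2.0
for _ in range(5):
    x_newton = newton_step(x_newton)
    x_grad = gradient_step(x_grad)
print(x_newton, x_grad)                   # both approach the root x* = 0 of f'(x)
```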

1.2.2 Interpretation via quadratic approximation of the original function

The linear approximation of $\nabla f(x)$ can equivalently be regarded as a quadratic approximation of the original function $f(x)$:

$$q(x) = f(x^{(0)}) + \nabla f(x^{(0)})^\top (x - x^{(0)}) + \frac{1}{2}(x - x^{(0)})^\top \nabla^2 f(x^{(0)})\,(x - x^{(0)})$$

which satisfies
$$\nabla q(x) = \ell(x)$$

We can again use $\frac{1}{t} I$ to replace $\nabla^2 f(x^{(0)})$:

$$\tilde f(x) = f(x^{(0)}) + \nabla f(x^{(0)})^\top (x - x^{(0)}) + \frac{1}{2t}(x - x^{(0)})^\top (x - x^{(0)})$$

Then, setting the gradient of $\tilde f(x)$ to zero:

$$\nabla \tilde f(x) = \nabla f(x^{(0)}) + \frac{1}{t}(x - x^{(0)}) = 0$$

we get
$$x^{(\mathrm{new})} = x^{(0)} - t\,\nabla f(x^{(0)})$$
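
As a quick sanity check, the following SymPy sketch (one-dimensional for simplicity; the symbols f0 and g0 stand for $f(x^{(0)})$ and $\nabla f(x^{(0)})$ and are illustrative) minimizes the surrogate $\tilde f$ and recovers exactly the gradient step.

```python
import sympy as sp

x, x0, t, f0, g0 = sp.symbols('x x0 t f0 g0', real=True)
# One-dimensional surrogate: f~(x) = f(x0) + f'(x0)(x - x0) + (1/(2t)) (x - x0)^2
f_tilde = f0 + g0 * (x - x0) + (x - x0)**2 / (2 * t)
x_new = sp.solve(sp.diff(f_tilde, x), x)[0]
print(sp.simplify(x_new))   # -> -g0*t + x0, i.e. the gradient descent update
```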

1.3 How to choose the step size $t_k$

If $t$ is too large, the algorithm may not converge. If it is too small, the algorithm converges too slowly. So how do we choose a suitable $t$?

  • Fixed $t$.
  • Exact line search can be used if $f(x)$ is simple enough:
    $$t = \operatorname*{argmin}_{s \ge 0} f(x - s\,\nabla f(x))$$
  • But in most cases, we use backtracking line search:

(i) Fix $\beta \in (0, 1)$ and $\alpha \in (0, 1/2]$ (in practice, one often chooses $\alpha = 1/2$).
(ii) At each iteration, start with $t = 1$, and while

$$f(x - t\,\nabla f(x)) > f(x) - \alpha t \|\nabla f(x)\|_2^2$$

shrink $t = \beta t$. Otherwise, perform the gradient descent update:
$$x^+ = x - t\,\nabla f(x)$$

From the backtracking condition, we can see that

$$f(x - t\,\nabla f(x)) \le f(x) - \alpha t \|\nabla f(x)\|_2^2 \le f(x)$$

which ensures that each gradient descent step moves in a genuine descent direction, i.e., the objective value never increases. A code sketch is given below.
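
Here is a minimal NumPy sketch of gradient descent with backtracking line search; the quadratic test problem and the parameter values are illustrative assumptions.

```python
import numpy as np

def backtracking_gd(f, grad, x0, alpha=0.5, beta=0.8, max_iter=500, tol=1e-8):
    """Gradient descent where each step size is chosen by backtracking."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        t = 1.0
        # Shrink t until the sufficient-decrease condition holds.
        while f(x - t * g) > f(x) - alpha * t * (g @ g):
            t *= beta
        x = x - t * g
    return x

# Illustrative test problem: f(x) = 0.5 x^T A x - b^T x
A = np.array([[10.0, 2.0], [2.0, 1.0]])
b = np.array([1.0, -1.0])
f = lambda x: 0.5 * x @ A @ x - b @ x
grad = lambda x: A @ x - b
print(backtracking_gd(f, grad, np.zeros(2)), np.linalg.solve(A, b))
```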

1.4 Convergence analysis

Theorem [Lipschitz gradient, fixed $t$]: If $f$ is convex and differentiable with $\mathrm{dom}(f) = \mathbb{R}^n$, and $\nabla f$ is $L$-Lipschitz continuous, then gradient descent with fixed step size $t \le 1/L$ satisfies

$$f(x^{(k)}) - f^\star \le \frac{\|x^{(0)} - x^\star\|_2^2}{2tk}$$

Proof:

  • From the Lipschitz property of $\nabla f$, we have
    $$\|\nabla f(x) - \nabla f(y)\|_2 \le L\|x - y\|_2 \;\Longrightarrow\; f(y) \le f(x) + \nabla f(x)^\top (y - x) + \frac{L}{2}\|y - x\|_2^2$$
  • From the definition of the gradient descent update, we know
    $$x^+ = x - t\,\nabla f(x)$$
  • From the convexity of $f$, we have
    $$f(x^\star) \ge f(x) + \nabla f(x)^\top (x^\star - x)$$

    which can be written as
    $$f(x) \le f(x^\star) + \nabla f(x)^\top (x - x^\star)$$

Combining these three facts, we have

$$
\begin{aligned}
f(x^+) = f(x - t\,\nabla f(x)) &\le f(x) - t\|\nabla f(x)\|_2^2 + \frac{L t^2}{2}\|\nabla f(x)\|_2^2 \\
&= f(x) - \Big(1 - \frac{Lt}{2}\Big) t\,\|\nabla f(x)\|_2^2 \\
&\le f(x) - \frac{t}{2}\|\nabla f(x)\|_2^2 \\
&\le f(x^\star) + \nabla f(x)^\top (x - x^\star) - \frac{t}{2}\|\nabla f(x)\|_2^2 \\
&= f(x^\star) + \frac{1}{2}\Big(\frac{2}{t}(x - x^+)^\top (x - x^\star) - \frac{1}{t}\|x - x^+\|_2^2\Big) \\
&= f(x^\star) + \frac{1}{2t}\Big(2(x - x^+)^\top (x - x^\star) - \|x - x^+\|_2^2\Big) \\
&= f(x^\star) + \frac{1}{2t}\Big(\|x - x^\star\|_2^2 - \|x^+ - x^\star\|_2^2\Big)
\end{aligned}
$$

where the third line uses $t \le 1/L$ and the fifth line substitutes $\nabla f(x) = (x - x^+)/t$.

Applying this bound at iterations $1, \ldots, k$, we have

$$
\begin{aligned}
f(x^{(k)}) &\le f(x^\star) + \frac{1}{2t}\Big(\|x^{(k-1)} - x^\star\|_2^2 - \|x^{(k)} - x^\star\|_2^2\Big) \\
&\;\;\vdots \\
f(x^{(1)}) &\le f(x^\star) + \frac{1}{2t}\Big(\|x^{(0)} - x^\star\|_2^2 - \|x^{(1)} - x^\star\|_2^2\Big)
\end{aligned}
$$

Summing all of these inequalities, the norm terms telescope and we get

$$f(x^{(1)}) + \cdots + f(x^{(k)}) \le k f(x^\star) + \frac{1}{2t}\Big(\|x^{(0)} - x^\star\|_2^2 - \|x^{(k)} - x^\star\|_2^2\Big) \le k f(x^\star) + \frac{1}{2t}\|x^{(0)} - x^\star\|_2^2$$

Then, since the objective values $f(x^{(i)})$ are non-increasing, $f(x^{(k)})$ is bounded by their average:
$$f(x^{(k)}) \le \frac{f(x^{(1)}) + \cdots + f(x^{(k)})}{k} \le f(x^\star) + \frac{\|x^{(0)} - x^\star\|_2^2}{2tk}$$
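
The following NumPy sketch (an illustrative diagonal quadratic with $f^\star = 0$, chosen so the gap is easy to compute) tracks $f(x^{(k)}) - f^\star$ against the bound $\|x^{(0)} - x^\star\|_2^2 / (2tk)$.

```python
import numpy as np

# Illustrative quadratic: f(x) = 0.5 x^T A x, minimized at x* = 0 with f* = 0.
A = np.diag([1.0, 4.0, 9.0])
L = 9.0                                # largest eigenvalue = Lipschitz constant of grad f
t = 1.0 / L
f = lambda z: 0.5 * z @ A @ z
x = np.array([1.0, 1.0, 1.0])
x0 = x.copy()

for k in range(1, 101):
    x = x - t * (A @ x)                # fixed-step gradient descent
    bound = np.sum(x0**2) / (2 * t * k)
    if k % 25 == 0:
        print(f"k={k:3d}  gap={f(x):.2e}  bound={bound:.2e}")
```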

Theorem [Lipschitz gradient, backtracking]: If $f$ is convex and differentiable with $\mathrm{dom}(f) = \mathbb{R}^n$, and $\nabla f$ is $L$-Lipschitz continuous, then gradient descent with backtracking line search satisfies

$$f(x^{(k)}) - f^\star \le \frac{\|x^{(0)} - x^\star\|_2^2}{2 t_{\min} k}$$

where $t_{\min} = \min\{1, \beta/L\}$.

Proof:
The argument is the same as for fixed $t$; we only need a lower bound $t_{\min}$ on the accepted step size.
From the backtracking line search idea, we know there exists a $t_0$ such that for all $t \in (0, t_0]$,

$$f(x - t\,\nabla f(x)) \le f(x) - \frac{t}{2}\|\nabla f(x)\|_2^2$$

so backtracking either accepts $t = 1$ immediately or stops at some $t_{\mathrm{backtrack}} \in (\beta t_0, t_0]$. From the inequality in the last theorem,

$$f(x^+) \le f(x) - \Big(1 - \frac{Lt}{2}\Big) t\,\|\nabla f(x)\|_2^2$$

we see that $t_0 = 1/L$ works.

So the accepted step size is always at least $\min\{1, \beta/L\}$, i.e.,

$$t_{\min} = \min\{1, \beta/L\}$$
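
As a quick empirical check of this bound (the quadratic, $\alpha$, and $\beta$ below are illustrative assumptions), the step sizes accepted by backtracking indeed never fall below $\min\{1, \beta/L\}$.

```python
import numpy as np

# Illustrative quadratic with known L (largest eigenvalue of A).
A = np.diag([1.0, 25.0])
L, alpha, beta = 25.0, 0.5, 0.8
f = lambda z: 0.5 * z @ A @ z
grad = lambda z: A @ z

x = np.array([1.0, 1.0])
accepted = []
for _ in range(20):
    g = grad(x)
    t = 1.0
    while f(x - t * g) > f(x) - alpha * t * (g @ g):
        t *= beta
    accepted.append(t)
    x = x - t * g

print(min(accepted), ">=", min(1.0, beta / L))   # lower bound from the theorem
```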

Theorem [Lipschitz gradient and strong convexity]: If $f$ is $m$-strongly convex and $\nabla f$ is $L$-Lipschitz continuous, then gradient descent with fixed step size $t \le 2/(m + L)$ or with backtracking line search satisfies

$$f(x^{(k)}) - f^\star \le c^k\,\frac{L}{2}\|x^{(0)} - x^\star\|_2^2$$

with $0 < c < 1$.

Proof:

$$
\begin{aligned}
\|x^+ - x^\star\|_2^2 &= \|x - t\,\nabla f(x) - x^\star\|_2^2 \\
&= \|x - x^\star\|_2^2 + t^2\|\nabla f(x)\|_2^2 - 2t\,\nabla f(x)^\top (x - x^\star) \\
&\le \|x - x^\star\|_2^2 + t^2\|\nabla f(x)\|_2^2 - 2t\Big(\frac{mL}{m + L}\|x - x^\star\|_2^2 + \frac{1}{m + L}\|\nabla f(x)\|_2^2\Big) \\
&= \Big(1 - \frac{2tmL}{m + L}\Big)\|x - x^\star\|_2^2 + \Big(t^2 - \frac{2t}{m + L}\Big)\|\nabla f(x)\|_2^2 \\
&\le \Big(1 - \frac{2tmL}{m + L}\Big)\|x - x^\star\|_2^2
\end{aligned}
$$

where the first inequality is the standard coercivity bound $\nabla f(x)^\top (x - x^\star) \ge \frac{mL}{m+L}\|x - x^\star\|_2^2 + \frac{1}{m+L}\|\nabla f(x)\|_2^2$ for $m$-strongly convex $f$ with $L$-Lipschitz gradient, and the last step uses $t \le 2/(m + L)$.

Then we have
$$\|x^{(k)} - x^\star\|_2^2 \le \Big(1 - \frac{2tmL}{m + L}\Big)^k \|x^{(0)} - x^\star\|_2^2$$

and from the $L$-Lipschitz continuity of $\nabla f$ (together with $\nabla f(x^\star) = 0$),
$$f(x^{(k)}) - f(x^\star) \le \nabla f(x^\star)^\top (x^{(k)} - x^\star) + \frac{L}{2}\|x^{(k)} - x^\star\|_2^2 = \frac{L}{2}\|x^{(k)} - x^\star\|_2^2 \le \frac{L}{2}\Big(1 - \frac{2tmL}{m + L}\Big)^k \|x^{(0)} - x^\star\|_2^2$$
so the theorem holds with $c = 1 - \frac{2tmL}{m + L}$.
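
The geometric decay can be seen numerically; the strongly convex quadratic and the step size $t = 2/(m+L)$ below are illustrative assumptions.

```python
import numpy as np

# Illustrative strongly convex quadratic: f(x) = 0.5 x^T A x with x* = 0, f* = 0,
# m = smallest and L = largest eigenvalue of A.
A = np.diag([1.0, 10.0])
m, L = 1.0, 10.0
t = 2.0 / (m + L)
c = 1.0 - 2.0 * t * m * L / (m + L)    # contraction factor from the theorem

f = lambda z: 0.5 * z @ A @ z
x = np.array([1.0, 1.0])
x0 = x.copy()
for k in range(1, 31):
    x = x - t * (A @ x)                # fixed-step gradient descent
    if k % 10 == 0:
        bound = (c**k) * (L / 2) * np.sum(x0**2)
        print(f"k={k:2d}  gap={f(x):.3e}  bound={bound:.3e}")
```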

1.5 Summary

  • For the Lipschitz-gradient case, gradient descent has convergence rate $O(1/k)$;
    i.e., to get $f(x^{(k)}) - f^\star \le \epsilon$, we need $O(1/\epsilon)$ iterations (see the worked example after this list).
  • For the Lipschitz-gradient and strongly convex case, gradient descent has a geometric (linear) convergence rate: the gap shrinks like $c^k$ with $c < 1$, so only $O(\log(1/\epsilon))$ iterations are needed.
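
For a concrete sense of scale (the numbers are chosen only for illustration), suppose $\|x^{(0)} - x^\star\|_2 = 1$, $L = 100$, $t = 1/L$, and target accuracy $\epsilon = 10^{-3}$. Then the fixed-step bound requires roughly

$$k \;\ge\; \frac{\|x^{(0)} - x^\star\|_2^2}{2t\epsilon} \;=\; \frac{1}{2 \cdot (1/100) \cdot 10^{-3}} \;=\; 5 \times 10^4$$

iterations, whereas under strong convexity the gap is multiplied by the fixed factor $c$ each iteration, so the required iteration count grows only like $\log(1/\epsilon)$.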

