AdaGrad - Adaptive Subgradient Methods


Source: https://cs.stanford.edu/~ppasupat/a9online/1107.html

AdaGrad is an optimization method that allows different step sizes for different features. It increases the influence of rare but informative features.

Setup

Let $f: \mathbb{R}^n \to \mathbb{R}$ be a convex function and $X \subseteq \mathbb{R}^n$ be a convex compact subset of $\mathbb{R}^n$. We want to optimize

$$\min_{x \in X} f(x)$$

There is a tradeoff between convergence rate and computation time. Second-order methods (e.g. BFGS) have convergence rate $O(1/e^T)$ but need $O(n^3)$ computation time. We will focus on first-order methods, which need only $O(n)$ computation time. (In machine learning, speed is more important than fine-grained accuracy.)

Proximal Point Algorithm (Gradient Descent)

In the proximal point algorithm, we repeatedly take small steps until we reach the optimum. If $f$ is differentiable, we estimate $f(y)$ with

$$f(y) \approx f(x) + \langle \nabla f(x), y - x \rangle$$

The update rule is

$$x^{(k+1)} = \operatorname*{argmin}_{x \in X} \left\{ f(x^{(k)}) + \langle \nabla f(x^{(k)}), x - x^{(k)} \rangle + \frac{1}{2\alpha_k} \|x - x^{(k)}\|_2^2 \right\} = \operatorname*{argmin}_{x \in X} \left\{ \langle \nabla f(x^{(k)}), x \rangle + \frac{1}{2\alpha_k} \|x - x^{(k)}\|_2^2 \right\}$$

where the second equality drops the terms that do not depend on $x$.

We regularize with step size $\alpha_k$ (usually constant in batch descent) to ensure that $x^{(k+1)}$ is not too far from $x^{(k)}$. For $X = \mathbb{R}^n$ (unconstrained optimization), setting the gradient of the objective to zero and solving for the optimal $x$ gives

$$x^{(k+1)} = x^{(k)} - \alpha_k \nabla f(x^{(k)})$$

The convergence rate is $O(1/T)$. See the proof in the Note.
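As a minimal sketch (not from the original note), the unconstrained update $x^{(k+1)} = x^{(k)} - \alpha_k \nabla f(x^{(k)})$ with a constant step size might be implemented as follows; `grad_f` is a hypothetical callable returning $\nabla f(x)$.

```python
import numpy as np

def gradient_descent(grad_f, x0, alpha=0.1, num_iters=100):
    """Plain (batch) gradient descent: x <- x - alpha * grad_f(x)."""
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(num_iters):
        x = x - alpha * grad_f(x)
    return x

# Example: minimize f(x) = ||x - 3||_2^2, whose gradient is 2 * (x - 3).
x_star = gradient_descent(lambda x: 2.0 * (x - 3.0), x0=np.zeros(2))
print(x_star)  # approaches [3, 3]
```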

Stochastic Gradient Descent

Suppose $f$ can be decomposed into

$$f(x) = \frac{1}{N} \sum_{i=1}^{N} F(x, a_i)$$

where $a_i \in \mathbb{R}^n$ are data points. If $N$ is large, calculating $\nabla f(x)$ becomes expensive. Instead, we approximate

$$\nabla f(x^{(k)}) \approx g^{(k)} := \nabla F(x^{(k)}, a_i) \quad \text{where } i \text{ is chosen at random from } \{1, \dots, N\}$$

Stochastic gradient descent uses the update rule

$$x^{(k+1)} = \operatorname*{argmin}_{x \in X} \left\{ \langle g^{(k)}, x \rangle + \frac{1}{2\alpha_k} \|x - x^{(k)}\|_2^2 \right\}$$

If $i$ is uniformly random, then $\mathbb{E}[g^{(k)} \mid x^{(k)}] = \nabla f(x^{(k)})$, so the estimate is unbiased. However, we should decrease $\alpha_k$ over time (strengthen regularization) to prevent the noisy $g^{(k)}$ from ruining the already-good $x^{(k)}$.

The convergence rate is $O(1/\sqrt{T})$. See the proof in the Note.
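A hedged sketch of this update, assuming the decomposition above, a hypothetical per-example gradient `grad_F(x, a)`, and the common decay choice $\alpha_k = \alpha_0 / \sqrt{k}$:

```python
import numpy as np

def sgd(grad_F, data, x0, alpha0=0.1, num_iters=1000, seed=0):
    """SGD: at each step, use the gradient of one randomly chosen example.

    grad_F(x, a) is assumed to return the gradient of F(x, a) w.r.t. x.
    The step size alpha_k = alpha0 / sqrt(k) shrinks over time to damp the
    noise in the stochastic gradient estimate.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float).copy()
    for k in range(1, num_iters + 1):
        a = data[rng.integers(len(data))]   # i drawn uniformly from {1, ..., N}
        g = grad_F(x, a)                    # unbiased estimate of grad f(x)
        alpha_k = alpha0 / np.sqrt(k)
        x = x - alpha_k * g
    return x

# Toy example: least squares f(x) = (1/N) sum_i (a_i . x - 1)^2
data = np.random.default_rng(1).normal(size=(100, 3))
x_hat = sgd(lambda x, a: 2.0 * (a @ x - 1.0) * a, data, x0=np.zeros(3))
```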

Improving the Regularizer

Consider the problem

$$\min_{x \in \mathbb{R}^2} \; 100x_1^2 + x_2^2$$

Gradient descent converges slowly because the L2 norm $\|x - x^{(k)}\|_2^2$ in the regularizer does not match the geometry of the problem, where changing $x_1$ has much more effect than changing $x_2$.
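As a small numerical illustration (mine, not from the note): the step size must be small enough for the steep $x_1$ direction, so progress along $x_2$ is slow.

```python
import numpy as np

# f(x) = 100*x1^2 + x2^2, gradient = [200*x1, 2*x2]
grad = lambda x: np.array([200.0 * x[0], 2.0 * x[1]])

x = np.array([1.0, 1.0])
alpha = 1.0 / 200.0          # roughly the largest stable step size for the x1 direction
for _ in range(100):
    x = x - alpha * grad(x)

print(x)  # x1 is already 0, but x2 is still about 0.37 after 100 steps
```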

Instead, let's use a matrix norm. For a positive definite matrix $B$, define

$$\|x\|_B^2 := x^T B x \geq 0$$

For simplicity, let $B$ be a diagonal matrix with positive entries, so

$$\|x\|_B^2 = \sum_j b_{jj} x_j^2 \geq 0$$

The update rule becomes

$$x^{(k+1)} = \operatorname*{argmin}_{x \in X} \left\{ \langle \nabla f(x^{(k)}), x \rangle + \frac{1}{2\alpha_k} \|x - x^{(k)}\|_B^2 \right\} = x^{(k)} - \alpha_k B^{-1} \nabla f(x^{(k)}) \quad (\text{if } X = \mathbb{R}^n)$$

If $f(x) = \frac{1}{2} x^T A x$ for some $A$, then by choosing $B = A$ or something similar to $A$, the method converges very quickly. In practice, however, picking such a $B$ can be difficult or impossible (e.g. in the online setting), so we need to estimate $B$ as the method progresses.
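For instance (a sketch of my own, with $A = \operatorname{diag}(200, 2)$ so that $f(x) = \frac{1}{2} x^T A x = 100x_1^2 + x_2^2$), the preconditioned update with $B = A$ removes the bad scaling:

```python
import numpy as np

# f(x) = (1/2) x^T A x = 100*x1^2 + x2^2  with  A = diag(200, 2)
A = np.diag([200.0, 2.0])
grad = lambda x: A @ x

B_inv = np.linalg.inv(A)          # choose B = A (diagonal here, so inverting is cheap)
x = np.array([1.0, 1.0])
alpha = 1.0

x = x - alpha * B_inv @ grad(x)   # x - alpha * B^{-1} A x
print(x)  # [0. 0.] -- the bad scaling is undone, convergence is immediate
```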

Idea: Use the previously computed gradients $g^{(k)}$ to estimate the geometry:

$$\text{large } \frac{\partial f}{\partial x_j} \;\Longrightarrow\; \text{large } A_{jj} \;\Longrightarrow\; \text{large } B_{jj}$$

Another intuition is that we should care more about rare features: for rare features (small accumulated gradient), we should regularize less (use a small $B_{jj}$).

AdaGrad is an extension of SGD. In iteration $k$, define

$$G^{(k)} = \operatorname{diag}\left[ \sum_{i=1}^{k} g^{(i)} \left(g^{(i)}\right)^T \right]^{1/2}$$

i.e., let $G^{(k)}$ be a diagonal matrix with entries

$$G^{(k)}_{jj} = \sqrt{\sum_{i=1}^{k} \left(g^{(i)}_j\right)^2}$$

and use the update rule

$$x^{(k+1)} = \operatorname*{argmin}_{x \in X} \left\{ \langle \nabla f(x^{(k)}), x \rangle + \frac{1}{2\alpha} \|x - x^{(k)}\|_{G^{(k)}}^2 \right\} = x^{(k)} - \alpha \left(G^{(k)}\right)^{-1} \nabla f(x^{(k)}) \quad (\text{if } X = \mathbb{R}^n)$$

The convergence rate is $O(1/\sqrt{T})$, but with a better constant that depends on the geometry. See the proof in the Note.
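A sketch of the diagonal, unconstrained AdaGrad update, reusing the hypothetical per-example gradient `grad_F(x, a)` from the SGD sketch above. Here the stochastic gradient $g^{(k)}$ is used in the step (AdaGrad as an extension of SGD), and a small `eps` is added as a numerical safeguard; neither choice comes from the formula above.

```python
import numpy as np

def adagrad(grad_F, data, x0, alpha=0.1, num_iters=1000, eps=1e-8, seed=0):
    """Diagonal AdaGrad (unconstrained case).

    Keeps a running sum of squared gradient coordinates; coordinate j is
    scaled by 1 / sqrt(sum_i (g_j^(i))^2), so rarely-updated (rare-feature)
    coordinates take larger steps.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float).copy()
    sq_sum = np.zeros_like(x)               # running sum of g_j^2
    for _ in range(num_iters):
        a = data[rng.integers(len(data))]
        g = grad_F(x, a)
        sq_sum += g ** 2
        G_diag = np.sqrt(sq_sum) + eps      # diagonal of G^(k), plus safeguard
        x = x - alpha * g / G_diag          # x - alpha * G^{-1} g
    return x

# Same toy least-squares problem as in the SGD sketch:
data = np.random.default_rng(1).normal(size=(100, 3))
x_hat = adagrad(lambda x, a: 2.0 * (a @ x - 1.0) * a, data, x0=np.zeros(3))
```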

Reference

Paper http://stanford.edu/~jduchi/projects/DuchiHaSi10_colt.pdf

Slides https://web.stanford.edu/~jduchi/projects/DuchiHaSi12_ismp.pdf
