AdaGrad - Adaptive Subgradient Methods

https://cs.stanford.edu/~ppasupat/a9online/1107.html

AdaGrad is an optimization method that allows different step sizes for different features. It increases the influence of rare but informative features.

Setup

Let $f: \mathbb{R}^n \to \mathbb{R}$ be a convex function and $X \subseteq \mathbb{R}^n$ be a convex compact set. We want to solve

$$\min_{x \in X} f(x)$$

There is a tradeoff between convergence rate and computation time. Second-order methods (e.g. Newton's method, or quasi-Newton methods such as BFGS) converge at rate $O(1/e^T)$ but need up to $O(n^3)$ computation time per iteration. We will focus on first-order methods, which need only $O(n)$ computation time per iteration. (In machine learning, speed is often more important than fine-grained accuracy.)

Proximal Point Algorithm (Gradient Descent)

In the proximal point algorithm, we repeatedly take small steps until we reach the optimum. If $f$ is differentiable, we estimate $f(y)$ with the first-order approximation

$$f(y) \approx f(x) + \langle \nabla f(x), y - x \rangle$$

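The update rule is

$$x^{(k+1)} = \arg\min_{x \in X} \left\{ f(x^{(k)}) + \langle \nabla f(x^{(k)}), x - x^{(k)} \rangle + \frac{1}{2\alpha_k} \|x - x^{(k)}\|_2^2 \right\} = \arg\min_{x \in X} \left\{ \langle \nabla f(x^{(k)}), x \rangle + \frac{1}{2\alpha_k} \|x - x^{(k)}\|_2^2 \right\}$$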

We regularize with step size $\alpha_k$ (usually constant in batch descent) to ensure that $x^{(k+1)}$ is not too far from $x^{(k)}$. For $X = \mathbb{R}^n$ (unconstrained optimization), solving for the optimal $x$ gives

$$x^{(k+1)} = x^{(k)} - \alpha_k \nabla f(x^{(k)})$$
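
As a sketch of the step that is skipped here: for $X = \mathbb{R}^n$ the objective in the update rule is differentiable and strictly convex in $x$, so its minimizer is the point where the gradient with respect to $x$ vanishes:

$$\nabla_x \left[ \langle \nabla f(x^{(k)}), x \rangle + \frac{1}{2\alpha_k} \|x - x^{(k)}\|_2^2 \right] = \nabla f(x^{(k)}) + \frac{1}{\alpha_k} \left( x - x^{(k)} \right) = 0 \implies x = x^{(k)} - \alpha_k \nabla f(x^{(k)})$$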

The convergence rate is $O(1/T)$. See the proof in the Note.
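
The unconstrained update can be sketched in a few lines of Python. This is only an illustration: the function names, the constant step size, and the small test problem $f(x) = (x_1 - 3)^2 + (x_2 + 1)^2$ are assumptions made for the example and are not taken from the notes.

```python
def gradient_descent(grad, x0, alpha, num_steps):
    """Unconstrained gradient descent: x <- x - alpha * grad(x)."""
    x = list(x0)
    for _ in range(num_steps):
        g = grad(x)
        x = [xj - alpha * gj for xj, gj in zip(x, g)]
    return x


# Hypothetical test problem: f(x) = (x_1 - 3)^2 + (x_2 + 1)^2,
# which is convex and minimized at (3, -1).
def grad_f(x):
    return [2.0 * (x[0] - 3.0), 2.0 * (x[1] + 1.0)]


x_final = gradient_descent(grad_f, x0=[0.0, 0.0], alpha=0.1, num_steps=200)
print(x_final)  # close to the minimizer (3, -1)
```

With this step size each iteration shrinks the distance to the minimizer by a constant factor, so a couple of hundred steps are more than enough for this toy problem.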

Stochastic Gradient Descent

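Suppose $f$ can be decomposed into

$$f(x) = \frac{1}{N} \sum_{i=1}^{N} F(x, a_i)$$

where $a_i \in \mathbb{R}^n$ are data points. If $N$ is large, calculating $\nabla f(x)$ becomes expensive. Instead, we approximate

$$\nabla f(x^{(k)}) \approx g^{(k)} := \nabla F(x^{(k)}, a_i) \quad \text{where } i \xleftarrow{\text{random}} \{1, \dots, N\}$$

Stochastic gradient descent uses the update rule

$$x^{(k+1)} = \arg\min_{x \in X} \left\{ \langle g^{(k)}, x \rangle + \frac{1}{2\alpha_k} \|x - x^{(k)}\|_2^2 \right\}$$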

If $i$ is uniformly random, $E[g^{(k)} \mid x^{(k)}] = \nabla f(x^{(k)})$, so the estimate is unbiased. However, we should decrease $\alpha_k$ over time (strengthen regularization) to prevent the noisy $g^{(k)}$ from ruining the already-good $x^{(k)}$.

The convergence rate is $O(1/\sqrt{T})$, slower than the $O(1/T)$ of batch gradient descent. See the proof in the Note.
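
A minimal sketch of SGD with a decreasing step size follows. The schedule $\alpha_k = \alpha_0 / \sqrt{k}$, the per-example loss $F(x, a) = \frac{1}{2}\|x - a\|_2^2$ (whose average is minimized at the mean of the data), the synthetic data, and all names are assumptions chosen for illustration.

```python
import math
import random


def sgd(grad_F, data, x0, alpha0, num_steps, seed=0):
    """Unconstrained SGD with step size alpha_k = alpha0 / sqrt(k)."""
    rng = random.Random(seed)
    x = list(x0)
    for k in range(1, num_steps + 1):
        a_i = rng.choice(data)           # i drawn uniformly from {1, ..., N}
        g = grad_F(x, a_i)               # unbiased estimate of grad f(x)
        alpha_k = alpha0 / math.sqrt(k)  # decreasing step size (assumed schedule)
        x = [xj - alpha_k * gj for xj, gj in zip(x, g)]
    return x


# Hypothetical per-example loss F(x, a) = 0.5 * ||x - a||^2,
# so f(x) = (1/N) * sum_i F(x, a_i) is minimized at the data mean.
def grad_F(x, a):
    return [xj - aj for xj, aj in zip(x, a)]


data_rng = random.Random(1)
data = [[data_rng.gauss(1.0, 1.0), data_rng.gauss(-2.0, 1.0)] for _ in range(1000)]
mean = [sum(a[j] for a in data) / len(data) for j in range(2)]

x_sgd = sgd(grad_F, data, x0=[0.0, 0.0], alpha0=0.5, num_steps=20000)
print(x_sgd, mean)  # the SGD iterate should land near the data mean
```

Each step touches a single data point, so the cost per iteration does not depend on $N$. The shrinking step size damps the noise in $g^{(k)}$ as the iterate approaches the optimum.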

Improving the Regularizer

Consider the problem

$$\min_{x \in \mathbb{R}^2} 100 x_1^2 + x_2^2$$

Gradient descent converges slowly because the $L_2$ norm $\|x - x^{(k)}\|_2^2$ in the regularizer does not match the geometry of the problem, where changing $x_1$ has much more effect than changing $x_2$.
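
The slowdown is easy to reproduce. The sketch below runs plain gradient descent on this problem; the starting point, step size, and iteration count are assumptions picked for the demonstration.

```python
def grad_f(x):
    """Gradient of f(x) = 100 * x_1^2 + x_2^2."""
    return [200.0 * x[0], 2.0 * x[1]]


def gradient_descent(x0, alpha, num_steps):
    x = list(x0)
    for _ in range(num_steps):
        g = grad_f(x)
        x = [xj - alpha * gj for xj, gj in zip(x, g)]
    return x


# alpha must stay below 2 / 200 = 0.01, or the x_1 coordinate diverges.
x_slow = gradient_descent([1.0, 1.0], alpha=0.009, num_steps=100)
print(x_slow)  # x_1 is essentially 0, x_2 is still far from 0
```

The steep $x_1$ direction caps the usable step size, and with that step size $x_2$ shrinks by only about 1.8% per iteration, so after 100 steps it has dropped only to roughly 0.16. A single scalar step size cannot serve both coordinates well.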

Instead, let's use a matrix norm. For a positive definite matrix $B$, define

$$\|x\|_B^2 := x^T B x \geq 0$$

For simplicity, let $B$ be a diagonal matrix with positive entries, so that

$$\|x\|_B^2 = \sum_j b_{jj} x_j^2 \geq 0$$

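The update rule becomes

$$x^{(k+1)} = \arg\min_{x \in X} \left\{ \langle \nabla f(x^{(k)}), x \rangle + \frac{1}{2\alpha_k} \|x - x^{(k)}\|_B^2 \right\} = x^{(k)} - \alpha_k B^{-1} \nabla f(x^{(k)}) \quad (\text{if } X = \mathbb{R}^n)$$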

If $f(x) = \frac{1}{2} x^T A x$ for some positive definite $A$, then by choosing $B = A$ or something similar to $A$, the method converges very quickly. In practice, however, picking such a $B$ can be difficult or impossible (e.g. in the online setting), so we need to estimate $B$ as the method progresses.
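
To illustrate, the problem $100 x_1^2 + x_2^2$ can be written as $\frac{1}{2} x^T A x$ with $A = \mathrm{diag}(200, 2)$. The sketch below applies one step with a diagonal $B$; the function names, the starting point, and the choice $\alpha = 1$ are assumptions for the example.

```python
def grad_f(x):
    """Gradient of f(x) = 0.5 * x^T A x with A = diag(200, 2)."""
    return [200.0 * x[0], 2.0 * x[1]]


def preconditioned_step(x, b_diag, alpha):
    """One unconstrained step x <- x - alpha * B^{-1} grad f(x), B diagonal."""
    g = grad_f(x)
    return [xj - alpha * gj / bj for xj, gj, bj in zip(x, g, b_diag)]


a_diag = [200.0, 2.0]  # diagonal of A; here we choose B = A
x_next = preconditioned_step([1.0, 1.0], b_diag=a_diag, alpha=1.0)
print(x_next)  # both coordinates reach the minimizer in a single step
```

With $B = A$ and $\alpha = 1$ the step is $x - A^{-1} A x = 0$, so the quadratic is minimized in one iteration. This only works because $A$ is known in advance, which is exactly what is unavailable in practice.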

Idea: Use the previously computed gradients $g^{(k)}$ to estimate the geometry:

$$\text{large } \frac{\partial f}{\partial x_j} \implies \text{large } A_{jj} \implies \text{large } B_{jj}$$

Another intuition is that we should care more about rare features. For rare features (small accumulated gradient), we should regularize less (use a small $B_{jj}$), which gives them a larger effective step size.

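AdaGrad is an extension of SGD. In iteration $k$, define

$$G^{(k)} = \operatorname{diag}\left[ \sum_{i=1}^{k} g^{(i)} (g^{(i)})^T \right]^{1/2}$$

i.e., let $G^{(k)}$ be a diagonal matrix with entries

$$G^{(k)}_{jj} = \sqrt{\sum_{i=1}^{k} \left( g^{(i)}_j \right)^2}$$

and use the update rule

$$x^{(k+1)} = \arg\min_{x \in X} \left\{ \langle \nabla f(x^{(k)}), x \rangle + \frac{1}{2\alpha} \|x - x^{(k)}\|_{G^{(k)}}^2 \right\} = x^{(k)} - \alpha \left( G^{(k)} \right)^{-1} \nabla f(x^{(k)}) \quad (\text{if } X = \mathbb{R}^n)$$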

The convergence rate is $O(1/\sqrt{T})$, the same order as SGD, but with a better constant depending on the geometry. See the proof in the Note.
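
A minimal sketch of the diagonal AdaGrad update for $X = \mathbb{R}^n$ is given below. The small constant `eps` (added to avoid division by zero), the use of the same gradient for both the accumulator and the step, the exact gradient of $100 x_1^2 + x_2^2$ as the test problem, and all names are assumptions for illustration and are not part of the notes.

```python
import math


def adagrad(grad, x0, alpha, num_steps, eps=1e-8):
    """Diagonal AdaGrad: x_j <- x_j - alpha * g_j / G_jj."""
    x = list(x0)
    sum_sq = [0.0] * len(x)  # running sum of squared gradients per coordinate
    for _ in range(num_steps):
        g = grad(x)
        sum_sq = [s + gj * gj for s, gj in zip(sum_sq, g)]
        # G_jj = sqrt(sum_sq[j]); eps guards against division by zero
        # (an assumption, not part of the update rule in the text).
        x = [xj - alpha * gj / (math.sqrt(s) + eps)
             for xj, gj, s in zip(x, g, sum_sq)]
    return x


def grad_f(x):
    """Gradient of f(x) = 100 * x_1^2 + x_2^2."""
    return [200.0 * x[0], 2.0 * x[1]]


x_ada = adagrad(grad_f, x0=[1.0, 1.0], alpha=0.5, num_steps=100)
print(x_ada)  # both coordinates end up very close to 0
```

On this problem the factor 200 (or 2) appears in both $g_j$ and $G_{jj}$ and cancels, so both coordinates follow essentially the same trajectory even though their curvatures differ by a factor of 100. That is the sense in which the accumulated gradients stand in for the unknown $B$.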

Reference

Paper http://stanford.edu/~jduchi/projects/DuchiHaSi10_colt.pdf

Slides https://web.stanford.edu/~jduchi/projects/DuchiHaSi12_ismp.pdf
