AdaGrad is an optimization method that allows different step sizes for different features. It increases the influence of rare but informative features.
Setup
Let $f : \mathbb{R}^n \to \mathbb{R}$ be a convex function and $X \subseteq \mathbb{R}^n$ be a convex compact set. We want to solve
$$\min_{x \in X} f(x)$$
There is a tradeoff between convergence rate and computation time. Second-order methods (e.g. BFGS) have convergence rate $O(1/e^T)$ but need $O(n^3)$ computation time. We will focus on first-order methods, which need only $O(n)$ computation time. (In machine learning, speed is more important than fine-grained accuracy.)
Proximal Point Algorithm (Gradient Descent)
In the proximal point algorithm, we repeatedly take small steps until we reach the optimum. If $f$ is differentiable, we approximate $f(y)$ with
$$f(y) \approx f(x) + \langle \nabla f(x),\, y - x \rangle$$
The update rule is
$$x^{(k+1)} = \operatorname*{argmin}_{x \in X}\left\{ f(x^{(k)}) + \langle \nabla f(x^{(k)}),\, x - x^{(k)} \rangle + \frac{1}{2\alpha_k}\left\| x - x^{(k)} \right\|_2^2 \right\} = \operatorname*{argmin}_{x \in X}\left\{ \langle \nabla f(x^{(k)}),\, x \rangle + \frac{1}{2\alpha_k}\left\| x - x^{(k)} \right\|_2^2 \right\}$$
We regularize with step size $\alpha_k$ (usually constant in batch descent) to ensure that $x^{(k+1)}$ is not too far from $x^{(k)}$. For $X = \mathbb{R}^n$ (unconstrained optimization), solving for the optimal $x$ gives
$$x^{(k+1)} = x^{(k)} - \alpha_k \nabla f(x^{(k)})$$
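As a quick check (a standard first-order optimality condition, not spelled out in the notes): differentiating the unconstrained objective with respect to $x$ and setting it to zero gives
$$\nabla f(x^{(k)}) + \frac{1}{\alpha_k}\left( x - x^{(k)} \right) = 0 \quad\Longrightarrow\quad x = x^{(k)} - \alpha_k \nabla f(x^{(k)})$$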
The convergence rate is $O(1/T)$. See the proof in the Note.
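A minimal sketch of the unconstrained update rule in Python (the quadratic objective below is a hypothetical example, not from the notes):

import numpy as np

def gradient_descent(grad_f, x0, alpha=0.1, num_steps=100):
    # Proximal point / gradient descent for X = R^n: x <- x - alpha * grad_f(x)
    x = x0.copy()
    for _ in range(num_steps):
        x = x - alpha * grad_f(x)
    return x

# Hypothetical example: f(x) = ||x||^2 / 2, so grad_f(x) = x.
x_opt = gradient_descent(lambda x: x, x0=np.array([3.0, -2.0]))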
Stochastic Gradient Descent
Suppose $f$ can be decomposed into
$$f(x) = \frac{1}{N} \sum_{i=1}^N F(x, a_i)$$
where $a_i \in \mathbb{R}^n$ are data points. If $N$ is large, calculating $\nabla f(x)$ becomes expensive. Instead, we approximate
$$\nabla f(x^{(k)}) \approx g^{(k)} := \nabla F(x^{(k)}, a_i) \quad \text{where } i \xleftarrow{\text{random}} \{1, \dots, N\}$$
Stochastic gradient descent uses the update rule
$$x^{(k+1)} = \operatorname*{argmin}_{x \in X}\left\{ \langle g^{(k)},\, x \rangle + \frac{1}{2\alpha_k}\left\| x - x^{(k)} \right\|_2^2 \right\}$$
If $i$ is uniformly random, $\mathbb{E}[g^{(k)} \mid x^{(k)}] = \nabla f(x^{(k)})$, so the estimate is unbiased. However, we should decrease $\alpha_k$ over time (strengthen regularization) to prevent the noisy $g^{(k)}$ from ruining the already-good $x^{(k)}$.
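Spelling out the expectation (each index $i$ is picked with probability $1/N$):
$$\mathbb{E}\left[ g^{(k)} \mid x^{(k)} \right] = \frac{1}{N} \sum_{i=1}^N \nabla F(x^{(k)}, a_i) = \nabla f(x^{(k)})$$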
The convergence rate is $O(1/\sqrt{T})$. See the proof in the Note.
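A minimal SGD sketch in Python with $\alpha_k \propto 1/\sqrt{k}$ (the least-squares objective and the variable names are hypothetical, not from the notes):

import numpy as np

def sgd(grad_F, data, x0, alpha0=1.0, num_steps=1000):
    # Stochastic gradient descent: step along grad_F(x, a_i) for a random data point a_i,
    # with a decreasing step size alpha_k = alpha0 / sqrt(k + 1).
    x = x0.copy()
    for k in range(num_steps):
        i = np.random.randint(len(data))       # i drawn uniformly from {1, ..., N}
        g = grad_F(x, data[i])                 # unbiased estimate of grad f(x)
        x = x - (alpha0 / np.sqrt(k + 1)) * g
    return x

# Hypothetical example: least squares, F(x, (a, b)) = (a.x - b)^2 / 2.
A = np.random.randn(100, 5)
b = A @ np.ones(5)
data = list(zip(A, b))
x_hat = sgd(lambda x, ab: (ab[0] @ x - ab[1]) * ab[0], data, x0=np.zeros(5))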
Improving the Regularizer
Consider the problem
$$\min_{x \in \mathbb{R}^2} 100 x_1^2 + x_2^2$$
Gradient descent converges slowly because the L2 norm $\left\| x - x^{(k)} \right\|_2^2$ in the regularizer does not match the geometry of the problem, where changing $x_1$ has much more effect than changing $x_2$.
Instead, let's use a matrix norm. For a positive definite matrix $B$, define
$$\|x\|_B^2 := x^T B x \ge 0$$
For simplicity, let $B$ be a diagonal matrix with positive entries, so that
$$\|x\|_B^2 = \sum_j b_{jj} x_j^2 \ge 0$$
The update rule becomes
$$x^{(k+1)} = \operatorname*{argmin}_{x \in X}\left\{ \langle \nabla f(x^{(k)}),\, x \rangle + \frac{1}{2\alpha_k}\left\| x - x^{(k)} \right\|_B^2 \right\} = x^{(k)} - \alpha_k B^{-1} \nabla f(x^{(k)}) \quad (\text{if } X = \mathbb{R}^n)$$
If $f(x) = \frac{1}{2} x^T A x$ for some $A$, then by choosing $B = A$ or something similar to $A$, the method converges very quickly. In practice, however, picking such a $B$ ahead of time can be difficult or impossible (e.g. in the online setting), so we need to estimate $B$ as the method progresses.
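As a concrete sketch (not from the notes), the update $x^{(k+1)} = x^{(k)} - \alpha_k B^{-1} \nabla f(x^{(k)})$ in Python on the quadratic above, with $B = A = \operatorname{diag}(200, 2)$:

import numpy as np

# f(x) = 100*x1^2 + x2^2 = (1/2) x^T A x with A = diag(200, 2)
A_diag = np.array([200.0, 2.0])
grad_f = lambda x: A_diag * x

def preconditioned_gd(grad_f, B_diag, x0, alpha=1.0, num_steps=20):
    # Gradient descent with the ||.||_B regularizer (B diagonal): x <- x - alpha * B^{-1} grad
    x = x0.copy()
    for _ in range(num_steps):
        x = x - alpha * grad_f(x) / B_diag
    return x

# With B = A the iterate jumps to the optimum x = 0 in one step;
# with B = I (plain gradient descent) the step size must stay below 0.01 to avoid divergence.
print(preconditioned_gd(grad_f, B_diag=A_diag, x0=np.array([1.0, 1.0])))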
Idea: Use the previously computed gradients $g^{(1)}, \dots, g^{(k)}$ to estimate the geometry:
$$\text{large } \frac{\partial f}{\partial x_j} \iff \text{large } A_{jj} \iff \text{large } B_{jj}$$
Another intuition is that we should care more about rare features. For rare features (small accumulated gradient), we should regularize less (use a small $B_{jj}$), which corresponds to a larger step size for that feature.
AdaGrad is an extension of SGD. In iteration $k$, define
$$G^{(k)} = \operatorname{diag}\left[ \sum_{i=1}^k g^{(i)} \left(g^{(i)}\right)^T \right]^{1/2}$$
i.e., let $G^{(k)}$ be a diagonal matrix with entries
$$G^{(k)}_{jj} = \sqrt{ \sum_{i=1}^k \left( g^{(i)}_j \right)^2 }$$
and use the update rule
$$x^{(k+1)} = \operatorname*{argmin}_{x \in X}\left\{ \langle g^{(k)},\, x \rangle + \frac{1}{2\alpha}\left\| x - x^{(k)} \right\|_{G^{(k)}}^2 \right\} = x^{(k)} - \alpha \left(G^{(k)}\right)^{-1} g^{(k)} \quad (\text{if } X = \mathbb{R}^n)$$
The convergence rate is $O(1/\sqrt{T})$ but with a better constant that depends on the geometry. See the proof in the Note.
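A minimal AdaGrad sketch in Python, reusing the same hypothetical least-squares setup as in the SGD sketch above (the small eps added to the diagonal of $G^{(k)}$ is a common implementation detail to avoid division by zero, not part of the derivation):

import numpy as np

def adagrad(grad_F, data, x0, alpha=0.1, num_steps=1000, eps=1e-8):
    # AdaGrad: x <- x - alpha * G^{-1} g, where G_jj = sqrt(sum of squared past gradients g_j)
    x = x0.copy()
    sq_grad_sum = np.zeros_like(x)            # running sum of (g_j)^2 over iterations
    for _ in range(num_steps):
        i = np.random.randint(len(data))
        g = grad_F(x, data[i])                # stochastic gradient g^(k)
        sq_grad_sum += g ** 2
        G_diag = np.sqrt(sq_grad_sum) + eps   # diagonal entries of G^(k)
        x = x - alpha * g / G_diag
    return x

# Hypothetical example: same least-squares data as in the SGD sketch.
A = np.random.randn(100, 5)
b = A @ np.ones(5)
data = list(zip(A, b))
x_hat = adagrad(lambda x, ab: (ab[0] @ x - ab[1]) * ab[0], data, x0=np.zeros(5))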
Reference
Paper: http://stanford.edu/~jduchi/projects/DuchiHaSi10_colt.pdf
Slides: https://web.stanford.edu/~jduchi/projects/DuchiHaSi12_ismp.pdf