AdaGrad is an optimization method that allows different step sizes for different features. It increases the influence of rare but informative features.
Setup
Let $f : \mathbb{R}^n \to \mathbb{R}$ be a convex function and $X \subseteq \mathbb{R}^n$ be a convex compact set. We want to solve
$$\min_{x \in X} f(x)$$
There is a tradeoff between convergence rate and computation time. Second-order methods (e.g. BFGS) have convergence rate $O(1/e^T)$ but need $O(n^3)$ computation time. We will focus on first-order methods, which need only $O(n)$ computation time. (In machine learning, speed is more important than fine-grained accuracy.)
Proximal Point Algorithm (Gradient Descent)
In the proximal point algorithm, we repeatedly take small steps until we reach the optimum. If $f$ is differentiable, we approximate $f(y)$ with
$$f(y) \approx f(x) + \langle \nabla f(x),\, y - x \rangle$$
The update rule is
$$x^{(k+1)} = \operatorname*{argmin}_{x \in X}\left\{ f(x^{(k)}) + \langle \nabla f(x^{(k)}),\, x - x^{(k)} \rangle + \frac{1}{2\alpha_k}\left\| x - x^{(k)} \right\|_2^2 \right\} = \operatorname*{argmin}_{x \in X}\left\{ \langle \nabla f(x^{(k)}),\, x \rangle + \frac{1}{2\alpha_k}\left\| x - x^{(k)} \right\|_2^2 \right\}$$
We regularize with step size $\alpha_k$ (usually constant in batch descent) to ensure that $x^{(k+1)}$ is not too far from $x^{(k)}$. For $X = \mathbb{R}^n$ (unconstrained optimization), solving for the optimal $x$ gives
$$x^{(k+1)} = x^{(k)} - \alpha_k \nabla f(x^{(k)})$$
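As a quick check (a standard first-order optimality condition, not spelled out in the notes): differentiating the unconstrained objective with respect to $x$ and setting it to zero gives
$$\nabla f(x^{(k)}) + \frac{1}{\alpha_k}\left( x - x^{(k)} \right) = 0 \quad\Longrightarrow\quad x = x^{(k)} - \alpha_k \nabla f(x^{(k)})$$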
The convergence rate is $O(1/T)$. See the proof in the Note.
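A minimal sketch of the unconstrained update rule in Python (the quadratic objective below is a hypothetical example, not from the notes):

import numpy as np

def gradient_descent(grad_f, x0, alpha=0.1, num_steps=100):
    # Proximal point / gradient descent for X = R^n: x <- x - alpha * grad_f(x)
    x = x0.copy()
    for _ in range(num_steps):
        x = x - alpha * grad_f(x)
    return x

# Hypothetical example: f(x) = ||x||^2 / 2, so grad_f(x) = x.
x_opt = gradient_descent(lambda x: x, x0=np.array([3.0, -2.0]))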
Stochastic Gradient Descent
Suppose $f$ can be decomposed into
$$f(x) = \frac{1}{N} \sum_{i=1}^N F(x, a_i)$$
where $a_i \in \mathbb{R}^n$ are data points. If $N$ is large, calculating $\nabla f(x)$ becomes expensive. Instead, we approximate
$$\nabla f(x^{(k)}) \approx g^{(k)} := \nabla F(x^{(k)}, a_i) \quad \text{where } i \xleftarrow{\text{random}} \{1, \dots, N\}$$
Stochastic gradient descent uses the update rule
$$x^{(k+1)} = \operatorname*{argmin}_{x \in X}\left\{ \langle g^{(k)},\, x \rangle + \frac{1}{2\alpha_k}\left\| x - x^{(k)} \right\|_2^2 \right\}$$
If $i$ is uniformly random, $\mathbb{E}[g^{(k)} \mid x^{(k)}] = \nabla f(x^{(k)})$, so the estimate is unbiased. However, we should decrease $\alpha_k$ over time (strengthen regularization) to prevent the noisy $g^{(k)}$ from ruining the already-good $x^{(k)}$.
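Spelling out the expectation (each index $i$ is picked with probability $1/N$):
$$\mathbb{E}\left[ g^{(k)} \mid x^{(k)} \right] = \frac{1}{N} \sum_{i=1}^N \nabla F(x^{(k)}, a_i) = \nabla f(x^{(k)})$$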
The convergence rate is $O(1/\sqrt{T})$. See the proof in the Note.
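A minimal SGD sketch in Python with $\alpha_k \propto 1/\sqrt{k}$ (the least-squares objective and the variable names are hypothetical, not from the notes):

import numpy as np

def sgd(grad_F, data, x0, alpha0=1.0, num_steps=1000):
    # Stochastic gradient descent: step along grad_F(x, a_i) for a random data point a_i,
    # with a decreasing step size alpha_k = alpha0 / sqrt(k + 1).
    x = x0.copy()
    for k in range(num_steps):
        i = np.random.randint(len(data))       # i drawn uniformly from {1, ..., N}
        g = grad_F(x, data[i])                 # unbiased estimate of grad f(x)
        x = x - (alpha0 / np.sqrt(k + 1)) * g
    return x

# Hypothetical example: least squares, F(x, (a, b)) = (a.x - b)^2 / 2.
A = np.random.randn(100, 5)
b = A @ np.ones(5)
data = list(zip(A, b))
x_hat = sgd(lambda x, ab: (ab[0] @ x - ab[1]) * ab[0], data, x0=np.zeros(5))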
Improving the Regularizer
Consider the problem
$$\min_{x \in \mathbb{R}^2} 100 x_1^2 + x_2^2$$
Gradient descent converges slowly because the L2 norm $\left\| x - x^{(k)} \right\|_2^2$ in the regularizer does not match the geometry of the problem, where changing $x_1$ has much more effect than changing $x_2$.
Instead, let's use a matrix norm. For a positive definite matrix $B$, define
$$\|x\|_B^2 := x^T B x \ge 0$$
For simplicity, let $B$ be a diagonal matrix with positive entries, so that
$$\|x\|_B^2 = \sum_j b_{jj} x_j^2 \ge 0$$
The update rule becomes
$$x^{(k+1)} = \operatorname*{argmin}_{x \in X}\left\{ \langle \nabla f(x^{(k)}),\, x \rangle + \frac{1}{2\alpha_k}\left\| x - x^{(k)} \right\|_B^2 \right\} = x^{(k)} - \alpha_k B^{-1} \nabla f(x^{(k)}) \quad (\text{if } X = \mathbb{R}^n)$$
If $f(x) = \frac{1}{2} x^T A x$ for some $A$, then by choosing $B = A$ or something similar to $A$, the method converges very quickly. In practice, however, picking such a $B$ ahead of time can be difficult or impossible (e.g. in the online setting), so we need to estimate $B$ as the method progresses.
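As a concrete sketch (not from the notes), the update $x^{(k+1)} = x^{(k)} - \alpha_k B^{-1} \nabla f(x^{(k)})$ in Python on the quadratic above, with $B = A = \operatorname{diag}(200, 2)$:

import numpy as np

# f(x) = 100*x1^2 + x2^2 = (1/2) x^T A x with A = diag(200, 2)
A_diag = np.array([200.0, 2.0])
grad_f = lambda x: A_diag * x

def preconditioned_gd(grad_f, B_diag, x0, alpha=1.0, num_steps=20):
    # Gradient descent with the ||.||_B regularizer (B diagonal): x <- x - alpha * B^{-1} grad
    x = x0.copy()
    for _ in range(num_steps):
        x = x - alpha * grad_f(x) / B_diag
    return x

# With B = A the iterate jumps to the optimum x = 0 in one step;
# with B = I (plain gradient descent) the step size must stay below 0.01 to avoid divergence.
print(preconditioned_gd(grad_f, B_diag=A_diag, x0=np.array([1.0, 1.0])))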
Idea: Use the previously computed gradients $g^{(1)}, \dots, g^{(k)}$ to estimate the geometry:
$$\text{large } \frac{\partial f}{\partial x_j} \iff \text{large } A_{jj} \iff \text{large } B_{jj}$$
Another intuition is that we should care more about rare features. For rare features (small accumulated gradient), we should regularize less (use a small $B_{jj}$), which corresponds to a larger step size for that feature.
AdaGrad is an extension of SGD. In iteration $k$, define
$$G^{(k)} = \operatorname{diag}\left[ \sum_{i=1}^k g^{(i)} \left(g^{(i)}\right)^T \right]^{1/2}$$
i.e., let $G^{(k)}$ be a diagonal matrix with entries
$$G^{(k)}_{jj} = \sqrt{ \sum_{i=1}^k \left( g^{(i)}_j \right)^2 }$$
and use the update rule
$$x^{(k+1)} = \operatorname*{argmin}_{x \in X}\left\{ \langle g^{(k)},\, x \rangle + \frac{1}{2\alpha}\left\| x - x^{(k)} \right\|_{G^{(k)}}^2 \right\} = x^{(k)} - \alpha \left(G^{(k)}\right)^{-1} g^{(k)} \quad (\text{if } X = \mathbb{R}^n)$$
The convergence rate is $O(1/\sqrt{T})$ but with a better constant that depends on the geometry. See the proof in the Note.
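A minimal AdaGrad sketch in Python, reusing the same hypothetical least-squares setup as in the SGD sketch above (the small eps added to the diagonal of $G^{(k)}$ is a common implementation detail to avoid division by zero, not part of the derivation):

import numpy as np

def adagrad(grad_F, data, x0, alpha=0.1, num_steps=1000, eps=1e-8):
    # AdaGrad: x <- x - alpha * G^{-1} g, where G_jj = sqrt(sum of squared past gradients g_j)
    x = x0.copy()
    sq_grad_sum = np.zeros_like(x)            # running sum of (g_j)^2 over iterations
    for _ in range(num_steps):
        i = np.random.randint(len(data))
        g = grad_F(x, data[i])                # stochastic gradient g^(k)
        sq_grad_sum += g ** 2
        G_diag = np.sqrt(sq_grad_sum) + eps   # diagonal entries of G^(k)
        x = x - alpha * g / G_diag
    return x

# Hypothetical example: same least-squares data as in the SGD sketch.
A = np.random.randn(100, 5)
b = A @ np.ones(5)
data = list(zip(A, b))
x_hat = adagrad(lambda x, ab: (ab[0] @ x - ab[1]) * ab[0], data, x0=np.zeros(5))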
Reference
Paper: http://stanford.edu/~jduchi/projects/DuchiHaSi10_colt.pdf
Slides: https://web.stanford.edu/~jduchi/projects/DuchiHaSi12_ismp.pdf