$\underline{\text{Aim}}$
This paper investigates the popular stochastic gradient descent (SGD) algorithm in the stochastic optimization setting. The convergence performance of SGD under different assumptions on the objective function $\mathcal{F}$ (smoothness, strong convexity) is explored, and theoretical results are derived.
$\underline{\text{Background}}$
SGD is one of the most popular first-order methods for solving convex learning problems. The framework used in the paper to analyze SGD and other first-order algorithms is stochastic optimization, where the goal is to optimize an unknown convex function $\mathcal{F}$, given only unbiased estimates of $\mathcal{F}$'s subgradients.
An important special case is when $\mathcal{F}$ is strongly convex. For many problems with such a strongly convex $\mathcal{F}$, SGD with averaging has a well-known convergence guarantee of $\mathcal{O}(\log(T)/T)$. Surprisingly, it was shown by Hazan and Kale (Hazan & Kale, 2011) that for some strongly convex stochastic problems, an optimal $\mathcal{O}(1/T)$ rate can be obtained using a different algorithm. A very similar algorithm was also presented recently by Juditsky and Nesterov (Juditsky & Nesterov, 2010). So, which is it: $\mathcal{O}(\log(T)/T)$ or $\mathcal{O}(1/T)$? If the latter is the case, SGD may face the crisis of being abandoned for its inferior performance.
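For reference (my addition, not quoted from the paper), the standard definition used in such analyses: $\mathcal{F}$ is $\lambda$-strongly convex over $\mathcal{W}$ if for all $\bm{w}, \bm{w}' \in \mathcal{W}$ and any subgradient $g$ of $\mathcal{F}$ at $\bm{w}$,

$$\mathcal{F}(\bm{w}') \;\ge\; \mathcal{F}(\bm{w}) + \langle g, \bm{w}' - \bm{w} \rangle + \frac{\lambda}{2}\|\bm{w}' - \bm{w}\|^2 .$$

The $\mathcal{O}(\log(T)/T)$ guarantees for SGD with averaging are typically obtained with step sizes of the form $\eta_t = 1/(\lambda t)$.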
$\underline{\text{Brief Project Description}}$
The goal of the paper is to minimize a convex function $\mathcal{F}$ over some convex domain $\mathcal{W}$ ($\mathcal{W}$ is assumed to be a subset of some Hilbert space). $\mathcal{F}$ is not known; the only information available is through a stochastic gradient oracle, which, given some $\bm{w}\in\mathcal{W}$, produces a vector $\hat{g}$ whose expectation $\mathbb{E}[\hat{g}]=g$ is a subgradient of $\mathcal{F}$ at $\bm{w}$. It is assumed that $\mathcal{F}$ attains a minimum at some $\bm{w}^*\in\mathcal{W}$, so the analysis in the paper is mainly about providing bounds on $\mathcal{F}(\bm{w}_t)-\mathcal{F}(\bm{w}^*)$.
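The setup above can be sketched concretely. Below is a minimal illustration (my own toy instance, not from the paper): $\mathcal{F}(\bm{w}) = \frac{\lambda}{2}\|\bm{w}\|^2 + \langle \bm{b}, \bm{w}\rangle$ is $\lambda$-strongly convex, $\mathcal{W}$ is a Euclidean ball of radius $R$, and the stochastic gradient oracle returns the true gradient plus Gaussian noise. All constants (`lam`, `R`, `sigma`) are illustrative assumptions.

```python
import numpy as np

# Toy lam-strongly convex instance: F(w) = (lam/2)||w||^2 + <b, w>,
# domain W = {w : ||w|| <= R}. The oracle gives unbiased noisy gradients.
rng = np.random.default_rng(0)
d, lam, R, T, sigma = 5, 1.0, 10.0, 10_000, 1.0
b = rng.normal(size=d)

def F(w):
    return 0.5 * lam * w @ w + b @ w

w_star = -b / lam                       # unconstrained minimizer, inside W here
assert np.linalg.norm(w_star) <= R

def oracle(w):
    """Unbiased estimate of the gradient of F at w."""
    return lam * w + b + sigma * rng.normal(size=d)

def project(w):
    """Euclidean projection onto W."""
    n = np.linalg.norm(w)
    return w if n <= R else w * (R / n)

w = np.zeros(d)
w_avg = np.zeros(d)
for t in range(1, T + 1):
    w = project(w - (1.0 / (lam * t)) * oracle(w))   # step size 1/(lam*t)
    w_avg += (w - w_avg) / t                         # running average of iterates

gap_last = F(w) - F(w_star)
gap_avg = F(w_avg) - F(w_star)
print(f"F(w_T)    - F(w*) = {gap_last:.2e}")
print(f"F(w_avg)  - F(w*) = {gap_avg:.2e}")
```

With the $1/(\lambda t)$ step size shown, the averaged iterate is the one covered by the classical $\mathcal{O}(\log(T)/T)$ guarantee mentioned above.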