Making Gradient Descent Optimal for Strongly Convex Stochastic Optimization

This paper examines the convergence behavior of stochastic gradient descent (SGD) in the stochastic setting. Theoretical results for SGD are derived under strong convexity and smoothness assumptions on the objective function. The paper shows that for certain strongly convex stochastic problems, an optimal $\mathcal{O}(1/T)$ rate, rather than only $\mathcal{O}(\log(T)/T)$, can be achieved by adjusting the algorithm. Experiments show that SGD with $\alpha$-suffix averaging outperforms other iterate-selection strategies in practice.


Jan. 30, 2021


$\underline{\text{Aim}}$

In this paper, the popular stochastic gradient descent (SGD) algorithm is investigated in a stochastic setting. The convergence performance of SGD under different assumptions on the objective function $\mathcal{F}$ (smoothness, strong convexity) is explored, and theoretical results are derived.

$\underline{\text{Background}}$

SGD is one of the most popular first-order methods for solving convex learning problems. The framework used in the paper to analyze SGD and other first-order algorithms is stochastic optimization, where the goal is to optimize an unknown convex function $\mathcal{F}$, given only unbiased estimates of $\mathcal{F}$'s subgradients.
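To make the oracle model concrete, here is a minimal Python sketch of my own (not from the paper): the toy objective $\mathcal{F}(\bm{w}) = \mathbb{E}_x[\tfrac{1}{2}\|\bm{w}-x\|^2]$ with $x \sim \mathcal{N}(\mu, I)$, the distribution, and all names are illustrative assumptions. Each oracle call returns $\bm{w}-x$ for a single fresh sample, so its expectation is the true gradient $\bm{w}-\mu$.

```python
import numpy as np

# Illustrative toy setting (not from the paper):
# F(w) = E_x[ 0.5 * ||w - x||^2 ] with x ~ N(mu, I), so grad F(w) = w - mu.
rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])   # unknown to the learner

def stochastic_gradient_oracle(w):
    """Return g_hat with E[g_hat] = w - mu, the true gradient of F at w."""
    x = rng.normal(loc=mu, scale=1.0)   # one fresh sample
    return w - x

# Sanity check: the empirical mean of many oracle calls approaches w - mu.
w = np.zeros(2)
calls = np.array([stochastic_gradient_oracle(w) for _ in range(20000)])
print(calls.mean(axis=0), "vs true gradient", w - mu)
```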

An important special case is where $\mathcal{F}$ is strongly convex. For such strongly convex $\mathcal{F}$, SGD with averaging has a well-known convergence guarantee of $\mathcal{O}(\log(T)/T)$. Surprisingly, it was shown by Hazan and Kale (Hazan & Kale, 2011) that for some strongly convex stochastic problems, an optimal $\mathcal{O}(1/T)$ rate can be obtained using a different algorithm. A very similar algorithm was also presented recently by Juditsky and Nesterov (Juditsky & Nesterov, 2010). So, is the right rate $\mathcal{O}(\log(T)/T)$ or $\mathcal{O}(1/T)$? If the latter is the case, SGD may face the risk of being abandoned for its inferior performance.
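The following small Python experiment (my own sketch, not the paper's code) illustrates the three natural ways of reporting an iterate on a $\lambda$-strongly convex quadratic with noisy gradients and step size $1/(\lambda t)$: the last iterate, the average of all iterates, and the $\alpha$-suffix average (the average of the last $\alpha T$ iterates) advocated by the paper. The quadratic objective and all constants are illustrative choices; a toy run does not by itself establish the $\mathcal{O}(\log(T)/T)$ versus $\mathcal{O}(1/T)$ rates.

```python
import numpy as np

# Illustrative sketch only: F(w) = 0.5 * lam * ||w - w_star||^2 is
# lam-strongly convex; gradients are corrupted with Gaussian noise.
rng = np.random.default_rng(1)
lam, T, alpha = 1.0, 100_000, 0.5
w_star = np.array([1.0, -2.0])

def F(w):
    return 0.5 * lam * np.sum((w - w_star) ** 2)   # F(w_star) = 0

w = np.zeros(2)
iterates = np.empty((T, 2))
for t in range(1, T + 1):
    g_hat = lam * (w - w_star) + rng.normal(size=2)  # unbiased noisy gradient
    w = w - g_hat / (lam * t)                        # step size 1/(lam * t)
    iterates[t - 1] = w

full_average = iterates.mean(axis=0)                           # average of all iterates
suffix_average = iterates[int((1 - alpha) * T):].mean(axis=0)  # last alpha*T iterates

print("last iterate :", F(iterates[-1]))
print("full average :", F(full_average))
print("alpha-suffix :", F(suffix_average))
```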

$\underline{\text{Brief Project Description}}$

The goal of this paper is to minimize a convex function $\mathcal{F}$ over some convex domain $\mathcal{W}$ ($\mathcal{W}$ is assumed to be a subset of some Hilbert space). $\mathcal{F}$ is not known; the only information available is through a stochastic gradient oracle, which, given some $\bm{w}\in\mathcal{W}$, produces a vector $\hat{g}$ whose expectation $\mathbb{E}[\hat{g}]=g$ is a subgradient of $\mathcal{F}$ at $\bm{w}$. It is assumed that $\mathcal{F}$ attains a minimum at some $\bm{w}^*\in\mathcal{W}$, and so the analysis in the paper is mainly about providing bounds on $\mathcal{F}(\bm{w}_t)-\mathcal{F}(\bm{w}^*)$.
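Putting the pieces together, here is a hedged sketch of projected SGD in this model. The domain $\mathcal{W}$ is taken to be a Euclidean ball purely for illustration (the paper only assumes a convex subset of a Hilbert space), the oracle is a toy one for a strongly convex quadratic, and the step size $1/(\lambda t)$ is a standard choice for $\lambda$-strongly convex $\mathcal{F}$; the quantity printed at the end is the suboptimality $\mathcal{F}(\bm{w}_T)-\mathcal{F}(\bm{w}^*)$ that the paper's bounds control.

```python
import numpy as np

# Hedged sketch of projected SGD over a convex domain W.
# Here W = {w : ||w|| <= R} (an illustrative choice), and
# F(w) = 0.5 * lam * ||w - w_star||^2 with w_star inside W.
rng = np.random.default_rng(2)
R, lam, T = 5.0, 1.0, 10_000
w_star = np.array([1.0, -2.0])

def project_onto_W(w):
    """Euclidean projection onto the ball of radius R."""
    norm = np.linalg.norm(w)
    return w if norm <= R else (R / norm) * w

def oracle(w):
    """Unbiased estimate of a subgradient of F at w."""
    return lam * (w - w_star) + rng.normal(size=w.shape)

w = np.zeros(2)
for t in range(1, T + 1):
    eta_t = 1.0 / (lam * t)                 # standard step size under strong convexity
    w = project_onto_W(w - eta_t * oracle(w))

suboptimality = 0.5 * lam * np.sum((w - w_star) ** 2)   # F(w_T) - F(w*)
print("F(w_T) - F(w*) =", suboptimality)
```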
