$\underline{\text{Aim}}$
This paper investigates the popular stochastic gradient descent (SGD) algorithm in the stochastic optimization setting. The convergence performance of SGD under different assumptions on the objective function $\mathcal{F}$ (smoothness, strong convexity) is explored, and theoretical results are derived.
$\underline{\text{Background}}$
SGD is one of the most popular first-order methods for solving convex learning problems. The framework used in the paper to analyze SGD and other first-order algorithms is stochastic optimization, where the goal is to optimize an unknown convex function $\mathcal{F}$, given only unbiased estimates of $\mathcal{F}$'s subgradients.
An important special case is when $\mathcal{F}$ is strongly convex. For many problems with such a strongly convex $\mathcal{F}$, SGD with averaging has a well-known convergence guarantee of $\mathcal{O}(\log(T)/T)$. Surprisingly, it was shown by Hazan and Kale (Hazan & Kale, 2011) that for some strongly convex stochastic problems, an optimal $\mathcal{O}(1/T)$ rate can be obtained using a different algorithm. A very similar algorithm was also presented recently by Juditsky and Nesterov (Juditsky & Nesterov, 2010). So, which is it: $\mathcal{O}(\log(T)/T)$ or $\mathcal{O}(1/T)$? If the latter is the case, SGD may face the crisis of being abandoned for its inferior performance.
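For reference (my addition, not quoted from the paper), the standard definition used in such analyses: $\mathcal{F}$ is $\lambda$-strongly convex over $\mathcal{W}$ if for all $\bm{w}, \bm{w}' \in \mathcal{W}$ and any subgradient $g$ of $\mathcal{F}$ at $\bm{w}$,

$$\mathcal{F}(\bm{w}') \;\ge\; \mathcal{F}(\bm{w}) + \langle g, \bm{w}' - \bm{w} \rangle + \frac{\lambda}{2}\|\bm{w}' - \bm{w}\|^2 .$$

The $\mathcal{O}(\log(T)/T)$ guarantees for SGD with averaging are typically obtained with step sizes of the form $\eta_t = 1/(\lambda t)$.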
$\underline{\text{Brief Project Description}}$
The goal of the paper is to minimize a convex function $\mathcal{F}$ over some convex domain $\mathcal{W}$ ($\mathcal{W}$ is assumed to be a subset of some Hilbert space). $\mathcal{F}$ is not known; the only information available is through a stochastic gradient oracle, which, given some $\bm{w}\in\mathcal{W}$, produces a vector $\hat{g}$ whose expectation $\mathbb{E}[\hat{g}]=g$ is a subgradient of $\mathcal{F}$ at $\bm{w}$. It is assumed that $\mathcal{F}$ attains a minimum at some $\bm{w}^*\in\mathcal{W}$, so the analysis in the paper is mainly about providing bounds on $\mathcal{F}(\bm{w}_t)-\mathcal{F}(\bm{w}^*)$.
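The setup above can be sketched concretely. Below is a minimal illustration (my own toy instance, not from the paper): $\mathcal{F}(\bm{w}) = \frac{\lambda}{2}\|\bm{w}\|^2 + \langle \bm{b}, \bm{w}\rangle$ is $\lambda$-strongly convex, $\mathcal{W}$ is a Euclidean ball of radius $R$, and the stochastic gradient oracle returns the true gradient plus Gaussian noise. All constants (`lam`, `R`, `sigma`) are illustrative assumptions.

```python
import numpy as np

# Toy lam-strongly convex instance: F(w) = (lam/2)||w||^2 + <b, w>,
# domain W = {w : ||w|| <= R}. The oracle gives unbiased noisy gradients.
rng = np.random.default_rng(0)
d, lam, R, T, sigma = 5, 1.0, 10.0, 10_000, 1.0
b = rng.normal(size=d)

def F(w):
    return 0.5 * lam * w @ w + b @ w

w_star = -b / lam                       # unconstrained minimizer, inside W here
assert np.linalg.norm(w_star) <= R

def oracle(w):
    """Unbiased estimate of the gradient of F at w."""
    return lam * w + b + sigma * rng.normal(size=d)

def project(w):
    """Euclidean projection onto W."""
    n = np.linalg.norm(w)
    return w if n <= R else w * (R / n)

w = np.zeros(d)
w_avg = np.zeros(d)
for t in range(1, T + 1):
    w = project(w - (1.0 / (lam * t)) * oracle(w))   # step size 1/(lam*t)
    w_avg += (w - w_avg) / t                         # running average of iterates

gap_last = F(w) - F(w_star)
gap_avg = F(w_avg) - F(w_star)
print(f"F(w_T)    - F(w*) = {gap_last:.2e}")
print(f"F(w_avg)  - F(w*) = {gap_avg:.2e}")
```

With the $1/(\lambda t)$ step size shown, the averaged iterate is the one covered by the classical $\mathcal{O}(\log(T)/T)$ guarantee mentioned above.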