Brief Introduction and Background
In industrial online-learning practice, stochastic gradient descent (SGD) is a common method for solving optimization problems. Specifically, SGD is the most popular method owing to its satisfactory convergence rate, known as O(log(T)/T): given a convex loss function and a training set of T examples, SGD generates a sequence of T point predictors {w1,…,wT}. However, when the loss function is strongly convex, Hazan & Kale (2011) showed that O(log(T)/T) is not the optimal convergence rate: an O(1/T) rate can be obtained with a more complex algorithm of comparable computational cost. In other words, the O(log(T)/T) analysis may be too loose for the stochastic setting. This paper, however, proves, and supports with experimental results, that SGD is still an optimal algorithm in the stochastic setting, using a direct analysis that avoids an online analysis followed by an online-to-batch conversion. The authors show that for smooth problems SGD already reaches a convergence rate of O(1/T), while for non-smooth problems the O(log(T)/T) rate of SGD with full averaging is in fact tight, and a simple modification of the averaging step recovers O(1/T).
Definition
Under the standard setting of convex stochastic optimization, a convex function F over a convex domain W is unknown to the algorithm; at each step, given a query point w, a stochastic oracle returns a random vector ĝ whose expectation is a subgradient of F at w. The goal of the optimization algorithm is to find a predictor w whose expected loss F(w) is close to optimal.
F is λ-strongly convex if, ∀w, w' ∈ W and for every subgradient g of F at w,
F(w') ≥ F(w) + ⟨g, w' − w⟩ + (λ/2)‖w' − w‖².
The algorithm obtains a sequence of points w1,…,wT and then returns a single point. With Π_W denoting the projection operator onto W, the iterate w_(t+1) is obtained by
w_(t+1) = Π_W(w_t − η_t ĝ_t).
Instead of the standard step size η_t = 1/(λt) used in the stochastic optimization analysis of λ-strongly convex functions, this paper considers the more general step sizes η_t = c/(λt) for a constant c > 0.
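As a sketch, the update above can be written out on a toy λ-strongly convex objective F(w) = (λ/2)‖w‖² + ⟨b, w⟩ over the unit ball; the objective, the noise model, and all function names here are illustrative assumptions, not the paper's experimental setup:

```python
import numpy as np

def project_unit_ball(w):
    """Euclidean projection onto the unit ball W = {w : ||w|| <= 1}."""
    norm = np.linalg.norm(w)
    return w if norm <= 1.0 else w / norm

def sgd(T, lam=1.0, c=1.0, dim=5, seed=0):
    """Projected SGD with step sizes eta_t = c / (lam * t)."""
    rng = np.random.default_rng(seed)
    b = np.full(dim, 0.1)      # fixed linear term of the toy objective
    w = np.zeros(dim)
    iterates = []
    for t in range(1, T + 1):
        # noisy stochastic subgradient of F(w) = (lam/2)||w||^2 + <b, w>
        g_hat = lam * w + b + 0.1 * rng.standard_normal(dim)
        eta_t = c / (lam * t)
        w = project_unit_ball(w - eta_t * g_hat)
        iterates.append(w.copy())
    return iterates

iterates = sgd(T=1000)
# the minimizer of the toy objective is w* = -b/lam; the last iterate
# should end up close to it
print(np.linalg.norm(iterates[-1] + np.full(5, 0.1)))
```

The sequence `iterates` is what the point-selection strategies discussed later (last iterate, full average, suffix average) operate on.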
Smooth functions and non-smooth functions
The author investigates the optimality of the SGD algorithm under two different conditions: smooth functions and non-smooth functions. First, consider optimization problems where the convex function F(·) is both λ-strongly convex and μ-smooth, i.e.
F(w') ≤ F(w) + ⟨∇F(w), w' − w⟩ + (μ/2)‖w' − w‖², ∀w, w' ∈ W.
Picking η_t = c/(λt) for a constant c, the expected squared distance from w_t to w* is on the order of O(1/t), which by smoothness translates into an O(1/T) bound on E[F(w_T)] − F(w*). For non-smooth functions, the more general case, the author shows that the intuition that the O(log(T)/T) rate for SGD with averaging is loose is incorrect: SGD with full averaging has a lower bound of Ω(log(T)/T), which indicates that the O(log(T)/T) convergence rate is tight. The O(1/T) rate can nevertheless be recovered by α-suffix averaging, i.e. averaging only the last αT iterates.
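The point-selection strategies under discussion differ only in how the sequence {w1,…,wT} is post-processed into a single predictor; a minimal sketch (the function name and interface are assumed for illustration):

```python
import numpy as np

def select_point(iterates, strategy="suffix", alpha=0.5):
    """Turn the SGD sequence w_1..w_T into a single predictor.

    - "last":    return w_T
    - "average": mean of all T iterates (can pay an extra log T factor
                 on non-smooth problems, per the paper's lower bound)
    - "suffix":  alpha-suffix averaging -- mean of the last alpha*T
                 iterates, which recovers the O(1/T) rate
    """
    W = np.asarray(iterates)
    T = len(W)
    if strategy == "last":
        return W[-1]
    if strategy == "average":
        return W.mean(axis=0)
    start = int((1 - alpha) * T)
    return W[start:].mean(axis=0)
```

For example, on the sequence [[1], [2], [3], [4]], the three strategies return [4], [2.5], and (with α = 0.5) [3.5] respectively.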
Experiment results
This article conducted three experiments. The first uses a strongly convex, smooth F; the second a non-smooth, strongly convex F; the third differs from the first two in that it uses three real binary classification data sets. The essential algorithms are compared in all three experiments: the first three are SGD variants using the different point-selection strategies mentioned above, and the last is the strongly convex stochastic optimization algorithm Epoch-GD (Hazan & Kale, 2011). The results of all three experiments show that, among these algorithms, SGD with the last-w strategy performs worst, while SGD with α-suffix averaging performs best.
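As a rough sketch of the Epoch-GD baseline described by Hazan & Kale (2011): SGD runs in epochs of doubling length with halving fixed step sizes, and each epoch restarts from the average of the previous epoch's iterates. The oracle, projection, and initial parameters below are placeholder assumptions, not the paper's exact configuration:

```python
import numpy as np

def epoch_gd(T, grad_oracle, project, eta1=0.5, T1=4, dim=5):
    """Epoch-GD sketch: within each epoch run plain SGD with a fixed
    step size; across epochs double the epoch length and halve the
    step size, restarting from the previous epoch's average iterate."""
    w = np.zeros(dim)
    eta, T_k, used = eta1, T1, 0
    while used < T:
        steps = min(T_k, T - used)
        epoch_iterates = []
        for _ in range(steps):
            w = project(w - eta * grad_oracle(w))
            epoch_iterates.append(w.copy())
        w = np.mean(epoch_iterates, axis=0)  # restart point / final answer
        used += steps
        T_k *= 2
        eta /= 2
    return w
```

As a quick sanity check, running it with the exact gradient of F(w) = (1/2)‖w − 1‖² and no projection drives the returned point close to the minimizer, the all-ones vector.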
Discussion
This paper proves the optimality of the SGD algorithm in the stochastic setting relative to the more complex Epoch-GD algorithm with its O(1/T) convergence rate. The simple and most popular method performs efficiently under both the smooth and the non-smooth function conditions, and the paper also reveals that SGD can reach O(1/T) in the non-smooth case with a simple modification of the averaging step. However, the authors also leave some questions for future research: the O(1/T) rate still requires an averaging step in the non-smooth case.
Reference
[1] Hazan, E. and Kale, S. Beyond the regret minimization barrier: An optimal algorithm for stochastic strongly-convex optimization. In COLT, 2011.