RDA (Regularized Dual Averaging) is an algorithm for efficiently solving regularized stochastic learning and regularized online optimization problems. This paper demonstrates RDA's effectiveness for sparse online learning with L1-regularization.
Traditional online algorithms such as SGD have limited ability to exploit problem structure when solving regularized learning problems. Their low accuracy often makes it hard to obtain the desired regularization effects (for example, truly sparse solutions under L1-regularization). To address this, the author introduces an auxiliary strongly convex function whose minimizer coincides with that of the regularization function, and makes great use of this strong convexity so that the iteration converges faster and gives better results.
The algorithm is shown below, followed by some explanation:
Let's take a look at the average subgradient ḡ_t = (1/t) Σ_{τ=1}^{t} g_τ, where g_τ ∈ ∂f_τ(w_τ) is a subgradient of the loss f_τ at w_τ. As samples are added, this average contains not only the new sample t's information but also part of all the previous information, so truncating the loss based on it gives an effect similar to batch processing. The weight updates as

w_{t+1} = argmin_w { ⟨ḡ_t, w⟩ + Ψ(w) + (β_t/t) h(w) },

where Ψ(w) is the regularization function, h(w) is the auxiliary strongly convex function, and {β_t} is a nonnegative, nondecreasing sequence.
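For the sparse-learning setting with Ψ(𝑤) = λ‖𝑤‖₁, h(𝑤) = ½‖𝑤‖₂², and β_t = γ√𝑡, this minimization has a closed-form soft-threshold solution. A minimal sketch in Python (the names `rda_l1_step`, `lam`, and `gamma` are my own):

```python
import numpy as np

def rda_l1_step(g_bar, t, lam, gamma):
    """Closed-form RDA update for Psi(w) = lam*||w||_1, h(w) = 0.5*||w||_2^2,
    and beta_t = gamma*sqrt(t).  Every coordinate where |g_bar| <= lam is set
    exactly to zero, which is what produces sparse iterates."""
    # soft-threshold the average subgradient by lam
    shrunk = np.sign(g_bar) * np.maximum(np.abs(g_bar) - lam, 0.0)
    return -(np.sqrt(t) / gamma) * shrunk
```

Note how a whole coordinate is zeroed whenever its averaged subgradient is small, which is exactly the truncation effect described above.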
Some may ask about the complexity of this optimization problem. Fortunately, it admits a simple (often closed-form) solution for many important learning problems in practice. Compare this form with the SGD method

w_{t+1} = w_t − α_t (g_t + ξ_t), where ξ_t ∈ ∂Ψ(w_t),

and we can see that the two methods should reach the same solution in theory, which is exactly the idea behind dual averaging. Besides, this form is interesting, and below is my thinking on its motivation. The regularization function may have a gentle slope along which convergence is really slow. If we introduce a strongly convex function whose minimizer coincides with that of the regularizer, the subproblem becomes well conditioned, so the solution is easy to find and the iterates converge quickly.
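To make the contrast concrete, here is the corresponding plain subgradient-SGD step for the same L1-regularized objective (a sketch under my own assumptions: step size α_t = 1/(γ√𝑡), and the names `sgd_l1_step`, `lam`, `gamma` are hypothetical):

```python
import numpy as np

def sgd_l1_step(w, g, t, lam, gamma):
    """One subgradient step on f_t(w) + lam*||w||_1 with alpha_t = 1/(gamma*sqrt(t)).
    The l1 subgradient lam*sign(w) only nudges coordinates toward zero,
    so with a real-valued step they almost never land exactly at zero."""
    alpha = 1.0 / (gamma * np.sqrt(t))
    return w - alpha * (g + lam * np.sign(w))
```

Unlike the RDA update, there is no truncation here, which is why SGD iterates tend to stay dense.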
This paper does not include the complete proofs of the regret bounds and convergence; only some simple results are provided: for a regularization function whose convexity parameter is 0, setting the parameters properly gives a regret bound of 𝑂(√𝑡), and for a strongly convex regularization function, 𝑂(ln 𝑡). Besides, for the optimization subproblem mentioned before, the author uses examples to give simple closed-form solutions for some commonly used regularization functions. As a result, this state-of-the-art RDA algorithm is useful for regularized online optimization problems.
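As an illustration of the full procedure, here is a minimal end-to-end L1-RDA loop on synthetic least-squares data. The problem sizes and the λ, γ values are hypothetical choices of mine for illustration, not taken from the paper:

```python
import numpy as np

# Synthetic regression data with only a few relevant features (my own setup).
rng = np.random.default_rng(0)
n, d = 200, 10
w_true = np.zeros(d)
w_true[:3] = [1.0, -2.0, 0.5]           # only 3 coordinates carry signal
X = rng.normal(size=(n, d))
y = X @ w_true + 0.01 * rng.normal(size=n)

lam, gamma = 0.05, 5.0                  # hypothetical lambda and gamma
w = np.zeros(d)
g_sum = np.zeros(d)
for t in range(1, n + 1):
    x_t, y_t = X[t - 1], y[t - 1]
    g_sum += (x_t @ w - y_t) * x_t      # gradient of the squared loss at w_t
    g_bar = g_sum / t                   # running average subgradient
    shrunk = np.sign(g_bar) * np.maximum(np.abs(g_bar) - lam, 0.0)
    w = -(np.sqrt(t) / gamma) * shrunk  # closed-form l1-RDA update

num_zeros = int(np.sum(w == 0.0))       # coordinates truncated exactly to zero
```

Each iteration touches every coordinate once, so the per-iteration cost is linear in the dimension, consistent with the 𝑂(𝑛) complexity claimed below.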
Furthermore, this algorithm also performs well on regularized stochastic learning problems. As that is not the main topic of this course, I only list some of its strengths here: 1. the computational complexity per iteration is 𝑂(𝑛), the same as the SGD method; 2. it converges to the optimal solution of the expected regularized objective at the optimal rate 𝑂(1/√𝑡), and if the regularization function Ψ(𝑤) is strongly convex, we get the better rate 𝑂(ln 𝑡/𝑡).