The paper contains two main parts. Part A covers online convex programming: the author describes a Greedy Projection algorithm and analyzes its performance, proving that its average regret converges to zero. Part B is an application of OCP: the author first formulates the repeated game as an online convex program and then proposes a Generalized Infinitesimal Gradient Ascent (GIGA) algorithm that is quite similar to Greedy Projection. All in all, the paper's main contribution is that Greedy Projection, i.e., the projected gradient descent method, gives a general algorithm for a large variety of problems. Moreover, this simple algorithm achieves state-of-the-art performance.
First, an online convex program is defined as follows: at each step t, we see the cost function only after we have chosen a point from the feasible set. Obviously, due to this uncertainty, online convex optimization cannot find the optimum of the problem, but we can still bound the performance of a given algorithm under certain assumptions. Regret is the difference between the total online cost and the cost of a particular kind of offline algorithm. Here the offline algorithm knows the cost functions in advance and plays the same point from the feasible set at every step, so it reduces to an ordinary convex optimization problem.
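This notion of regret can be made concrete with a toy example. The squared-loss cost stream and the function below are my own illustration, not the paper's: with costs c_t(x) = (x - z_t)^2 on the feasible set [0, 1], regret compares the online player's total cost to the cost of the best single fixed point in hindsight (which, for squared loss, is the mean of the targets).

```python
def regret(choices, targets):
    """Total online cost minus the cost of the best fixed point in hindsight."""
    online_cost = sum((x - z) ** 2 for x, z in zip(choices, targets))
    # For squared loss the unconstrained best fixed point is the mean,
    # which here lies inside the feasible set [0, 1].
    best_fixed = sum(targets) / len(targets)
    offline_cost = sum((best_fixed - z) ** 2 for z in targets)
    return online_cost - offline_cost

targets = [0.0, 1.0, 0.0, 1.0]
naive_choices = [0.5] * 4              # always play the midpoint
print(regret(naive_choices, targets))  # 0.0: the midpoint is the best fixed point here
```

Note that the offline comparator is still a single point; the online player is not compared against a sequence that tracks each z_t.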
To cope with this online convex problem, the paper proposes the Greedy Projection algorithm. It alternates gradient descent and projection onto the feasible set to produce a sequence of plays: at step t, take one gradient step using the current cost function and action, then project the result back onto the feasible set via the argmin (Euclidean) projection. It is proved that the average regret of Greedy Projection approaches zero. Moreover, in Section 2.3, the author proposes another algorithm, Lazy Projection, which performs surprisingly well.
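The update x_{t+1} = P(x_t - eta_t * grad c_t(x_t)) can be sketched in a few lines. This is a minimal sketch under assumptions of my own choosing (feasible set [0, 1], so projection is just clipping; squared-loss costs; step size eta_t = 1/sqrt(t) as in the paper's analysis):

```python
import math

def project(x, lo=0.0, hi=1.0):
    """Euclidean projection onto the interval [lo, hi] is simple clipping."""
    return min(hi, max(lo, x))

def greedy_projection(grads, x0=0.5):
    """Run Greedy Projection given a list of per-step gradient callbacks."""
    x = x0
    plays = []
    for t, grad in enumerate(grads, start=1):
        plays.append(x)                 # play first, only then use the revealed cost
        eta = 1.0 / math.sqrt(t)        # decaying step size eta_t = t^(-1/2)
        x = project(x - eta * grad(x))  # gradient step, then project back
    return plays

# Toy cost stream: c_t(x) = (x - z_t)^2 with targets drifting to 1, grad = 2(x - z_t).
targets = [0.0, 1.0, 1.0, 1.0, 1.0]
grads = [lambda x, z=z: 2.0 * (x - z) for z in targets]
plays = greedy_projection(grads)
print(plays[-1] > plays[1])  # True: later plays drift toward the recent targets
```

The projection step is what keeps the iterates feasible after an aggressive early gradient step; with a general convex set it would be the argmin-distance projection rather than clipping.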
In Part B, to deal with repeated games, the author proposes the Generalized Infinitesimal Gradient Ascent (GIGA) algorithm, which shows an application of online convex optimization. Since utility is to be maximized rather than cost minimized, gradient ascent replaces descent. Regret in a repeated game is defined slightly differently: it is the maximum regret for not always playing a particular action, where both the player's and the environment's behaviors are distributions over actions. Besides, because the expected utility is linear in the player's mixed strategy, the gradient step actually uses the utility vector itself during the update. By setting the learning rate to 1/√t at time step t, GIGA is universally consistent.
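The GIGA update can be sketched as follows. The setup is my own toy example, not the paper's code: the mixed strategy lives on the probability simplex, the gradient of the linear expected utility is just the utility vector u_t, and after each ascent step we project back onto the simplex (here with the standard sort-based Euclidean projection).

```python
import math

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex (sort-based)."""
    u = sorted(v, reverse=True)
    css, theta = 0.0, 0.0
    for i, ui in enumerate(u, start=1):
        css += ui
        t = (css - 1.0) / i
        if ui - t > 0:       # condition holds up to the support size rho
            theta = t
    return [max(x - theta, 0.0) for x in v]

def giga(utilities, x0):
    """Gradient ascent on the mixed strategy, projected back onto the simplex."""
    x = list(x0)
    for t, u in enumerate(utilities, start=1):
        eta = 1.0 / math.sqrt(t)  # learning rate eta_t = t^(-1/2)
        x = project_simplex([xi + eta * ui for xi, ui in zip(x, u)])
    return x

# Rock-paper-scissors vs. an opponent who always plays "paper": the utility
# of (rock, paper, scissors) against paper is (-1, 0, +1).
u_vs_paper = (-1.0, 0.0, 1.0)
x = giga([u_vs_paper] * 50, [1/3, 1/3, 1/3])
print(max(range(3), key=lambda i: x[i]))  # 2: weight concentrates on scissors
```

Against a fixed opponent the strategy quickly concentrates on the best response, which is exactly the no-regret behavior the universal-consistency guarantee formalizes.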
The paper is a useful guideline for work on projected gradient descent methods in the online learning setting. It also suggests, as future work, extending these results to a reinforcement learning framework.