
Projection-free Online Learning


Mar. 29, 2021


$\underline{\text{Aim}}$

In this paper, the Frank-Wolfe technique is used to derive efficient online learning algorithms that eschew projections in favor of much cheaper linear optimization steps. Further advantages of the algorithms are that they are parameter-free in the stochastic case and produce sparse decisions.

$\underline{\text{Background}}$

Projections are usually the computational bottleneck in online learning algorithms, since projecting onto the decision set in the $\ell_2$-norm amounts to solving a convex quadratic program. In contrast, in many settings of practical interest, solving convex quadratic programs is out of the question while linear optimization over the same set can be carried out efficiently. In this paper, efficient online learning algorithms are given that replace the projection step with a linear optimization step in a variety of settings, as summarized in the following theorem:
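To make the gap concrete, here is a small sketch (my own example, not taken from the paper) contrasting the two oracle types on the spectrahedron $\{X \succeq 0,\ \mathrm{tr}(X)=1\}$: linear optimization needs only one extreme eigenvector of the gradient, whereas Euclidean projection needs a full eigendecomposition followed by a projection of the eigenvalues onto the simplex.

```python
# Sketch (illustrative, not from the paper): linear optimization vs. projection
# on the spectrahedron {X PSD, tr(X) = 1}.
import numpy as np
from scipy.sparse.linalg import eigsh


def linear_opt_spectrahedron(G):
    """argmin_{X psd, tr X = 1} <G, X> = v v^T for the smallest eigenvector v of G (symmetric)."""
    w, v = eigsh(G, k=1, which="SA")      # one extreme eigenpair suffices
    v = v[:, 0]
    return np.outer(v, v)


def project_spectrahedron(Y):
    """Euclidean projection of a symmetric Y: full eigendecomposition + simplex projection."""
    w, V = np.linalg.eigh(Y)                             # full spectrum needed
    mu = np.sort(w)[::-1]                                # project eigenvalues onto the simplex
    cum = np.cumsum(mu) - 1.0
    rho = np.nonzero(mu - cum / np.arange(1, len(mu) + 1) > 0)[0][-1]
    theta = cum[rho] / (rho + 1)
    return V @ np.diag(np.maximum(w - theta, 0.0)) @ V.T
```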

Theorem 1.1. There is an algorithm scheme for online convex optimization that performs one linear optimization over the convex domain per iteration, and with appropriate modifications for each setting, obtains the following regret bounds:
[Table of regret bounds for the different settings omitted; see Theorem 1.1 in the paper.]
Furthermore, in each iteration t, the algorithm maintains an explicit, efficiently sampleable distribution over at most t boundary points with expectation equal to the current iterate.

The above theorem entails appealing advantages: computational efficiency, a parameter-free implementation, and an efficient, sparse representation of the iterates. The proposed algorithm explicitly maintains a distribution over the vertices (or, more generally, boundary points) of the decision set, thereby eliminating the need for any further decomposition. In fact, in round t the distribution is supported on at most t boundary points, thus giving a form of sparsity.

$\underline{\text{Brief Project Description}}$

The setting of interest in this paper is Online Convex Optimization (OCO). Iteratively, in each round $t=1,2,\cdots,T$ a learner is required to produce a point $\mathbf{x}_t$ from a convex, compact set $\mathcal{K} \subseteq \mathbb{R}^n$. In response, an adversary produces a convex cost function $f_t: \mathcal{K}\rightarrow\mathbb{R}$, and the learner suffers the cost $f_t(\mathbf{x}_t)$. The goal of the learner is to produce points $\mathbf{x}_t$ so that the regret,
$$\text{Regret} := \sum_{t=1}^{T} f_{t}(\mathbf{x}_{t}) - \min_{\mathbf{x} \in \mathcal{K}} \sum_{t=1}^{T} f_{t}(\mathbf{x})$$

is sublinear in $T$. If the cost functions are stochastic, regret is measured using the expected cost function $f=\mathbf{E}[f_t]$ instead of the actual costs.
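As a minimal illustration of the protocol (the learner/adversary interface below is my own scaffolding, not the paper's), the regret can be evaluated against any fixed comparator point; the definition takes the best such point in hindsight.

```python
# Sketch of the OCO protocol and the regret quantity defined above.
def run_oco(learner, adversary, comparator, T):
    """Play T rounds; return cumulative loss minus the comparator's cumulative loss.

    The true regret uses the best fixed point in K in hindsight; here a fixed
    comparator point is supplied for illustration.
    """
    losses, comparator_losses = [], []
    for t in range(1, T + 1):
        x_t = learner.predict()          # learner commits to a point in K
        f_t = adversary(t)               # adversary reveals a convex cost f_t
        losses.append(f_t(x_t))
        comparator_losses.append(f_t(comparator))
        learner.update(f_t)              # learner observes f_t (or its gradient)
    return sum(losses) - sum(comparator_losses)
```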

After the notions of $\beta$-smoothness and $\sigma$-strong convexity are introduced, smoothed functions are defined:

Let $\mathbb{B}$ and $\mathbb{S}$ denote the unit ball and unit sphere in $\mathbb{R}^n$, respectively. Given $\delta > 0$, let the $\delta$-smoothing of a function $f$ be
$$\hat{f}_{\delta}(\mathbf{x})=\mathbf{E}_{\mathbf{u} \in \mathbb{B}}[f(\mathbf{x}+\delta \mathbf{u})]$$
where $\mathbf{u}$ is chosen uniformly at random from $\mathbb{B}$. We are implicitly assuming that $f$ is defined on all points within distance $\delta$ of $\mathcal{K}$.
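For intuition, the smoothed value can be approximated by Monte Carlo sampling. The sketch below is my own illustration of the definition (sampling uniformly from the $\delta$-ball around $\mathbf{x}$), not code from the paper.

```python
# Sketch: Monte Carlo estimate of the delta-smoothing f_hat_delta(x).
import numpy as np


def smoothed_value(f, x, delta, num_samples=1000, rng=None):
    """Estimate f_hat_delta(x) = E_{u ~ unit ball}[ f(x + delta * u) ]."""
    rng = np.random.default_rng() if rng is None else rng
    d = x.shape[0]
    total = 0.0
    for _ in range(num_samples):
        g = rng.standard_normal(d)
        u = g / np.linalg.norm(g)        # uniform direction on the unit sphere
        r = rng.uniform() ** (1.0 / d)   # radius making r * u uniform in the unit ball
        total += f(x + delta * r * u)
    return total / num_samples
```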

Lemma 2.1 shows that $\hat{f}_{\delta}$ is a good approximation of $f$:

Lemma 2.1 If $f$ is convex and L-Lipschitz, then the function $\hat{f}_{\delta}$ has the following properties:
1. $\hat{f}_{\delta}$ is convex and L-Lipschitz;
2. for any $\mathbf{x} \in \mathcal{K}$, $\nabla \hat{f}_{\delta}(\mathbf{x})=\frac{d}{\delta} \mathbf{E}_{\mathbf{u} \in \mathbb{S}}[f(\mathbf{x}+\delta \mathbf{u}) \mathbf{u}]$, where $d$ denotes the dimension (this expectation over the unit sphere underlies the gradient estimator sketched after this list);
3. for any $\mathbf{x} \in \mathcal{K}$, $\|\nabla \hat{f}_{\delta}(\mathbf{x})\| \leq d L$;
4. $\hat{f}_{\delta}$ is $\frac{dL}{\delta}$-smooth;
5. for any $\mathbf{x} \in \mathcal{K}$, $|f(\mathbf{x})-\hat{f}_{\delta}(\mathbf{x})| \leq \delta L$.
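Property 2 suggests the standard one-point gradient estimator used in bandit-style algorithms: evaluate $f$ at a single randomly perturbed point and rescale. The sketch below is my own illustration under that reading, not code from the paper.

```python
# Sketch: single-sample unbiased estimator of grad f_hat_delta(x),
# based on property 2 of Lemma 2.1.
import numpy as np


def smoothed_gradient_estimate(f, x, delta, rng=None):
    """Return (d / delta) * f(x + delta * u) * u with u uniform on the unit sphere."""
    rng = np.random.default_rng() if rng is None else rng
    d = x.shape[0]
    g = rng.standard_normal(d)
    u = g / np.linalg.norm(g)            # uniform on the unit sphere S
    return (d / delta) * f(x + delta * u) * u
```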

A feature of the proposed algorithms is that they predict with sparse solutions, where sparsity is defined in the following manner:

Definition 2.2 Let $\mathcal{K} \subseteq \mathbb{R}^{n}$ be a convex, compact set and let $\mathbf{x} \in \mathcal{K}$. We say that $\mathbf{x}$ is t-sparse w.r.t. $\mathcal{K}$ if it can be written as a convex combination of t boundary points of $\mathcal{K}$.

Notice that all the algorithms proposed in the paper produce $t$-sparse predictions at iteration $t$ w.r.t. the underlying decision set $\mathcal{K}$.

The Online Frank-Wolfe (OFW) algorithm is as follows:

[Pseudocode of the Online Frank-Wolfe (OFW) algorithm omitted; see the paper.]
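Since the pseudocode figure did not survive, here is a minimal sketch of a Frank-Wolfe-style online update. The surrogate objective $F_t(\mathbf{x}) = \eta\,\langle \sum_{\tau \le t}\nabla f_\tau(\mathbf{x}_\tau), \mathbf{x}\rangle + \|\mathbf{x}-\mathbf{x}_1\|^2$ and the step size $\gamma_t$ below are simplifying assumptions on my part; the paper tunes both per setting (smooth/non-smooth, stochastic/adversarial), so this is an illustration of the scheme rather than the exact algorithm.

```python
# Sketch of an online Frank-Wolfe update: one linear optimization per round,
# no projection.  F_t and gamma_t below are assumptions, not the paper's choices.
import numpy as np


class OnlineFrankWolfe:
    def __init__(self, linear_opt, x1, eta=0.1):
        self.linear_opt = linear_opt       # oracle: v = argmin_{x in K} <g, x>
        self.x = x1.astype(float)          # current iterate; x1 assumed to be a boundary point
        self.x1 = x1.astype(float)
        self.grad_sum = np.zeros_like(self.x1)  # running sum of observed gradients
        self.eta = eta
        self.t = 0
        self.support = [(1.0, self.x1)]    # convex combination over boundary points

    def predict(self):
        return self.x

    def update(self, grad_t):
        """Observe the gradient of f_t at the current iterate and take one FW step."""
        self.t += 1
        self.grad_sum += grad_t
        # gradient at x_t of the surrogate F_t(x) = eta * <grad_sum, x> + ||x - x1||^2
        g = self.eta * self.grad_sum + 2.0 * (self.x - self.x1)
        v = self.linear_opt(g)             # one linear optimization over K
        gamma = min(1.0, 1.0 / np.sqrt(self.t))
        # x_{t+1} = (1 - gamma) x_t + gamma v, keeping the sparse representation
        self.support = [(w * (1 - gamma), p) for (w, p) in self.support] + [(gamma, v)]
        self.x = (1 - gamma) * self.x + gamma * v
```

Note that if $\mathbf{x}_1$ is a boundary point, the maintained convex combination has at most $t$ support points when predicting in round $t$, matching the sparsity property stated earlier.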
The OFW algorithm is compared to the classic OGD algorithm. To evaluate the performance benefits of OFW over OGD, experiments are carried out with a simple test application, viz. online collaborative filtering. The problem is the following. In each round, the learner is required to produce an $m \times n$ matrix $\mathbf{X}$ with trace norm (i.e. sum of singular values) bounded by a parameter $\tau$. This matrix is interpreted as supplying, for each user $i \in [m]$, a rating for each item $j \in [n]$. The adversary then chooses an entry $(i,j)$ and reveals the true rating for it, viz. $y \in \mathbb{R}$. The learner suffers the squared loss $(\mathbf{X}(i, j)-y)^{2}$. The goal is to compete with the set of all $m \times n$ matrices of trace norm bounded by $\tau$.
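Under this setup, a Frank-Wolfe-style learner needs only two ingredients per round, sketched below under my own assumptions about the implementation: the gradient of the squared loss at the revealed entry (a rank-one, single-entry matrix) and linear optimization over the trace-norm ball, which requires only the top singular pair of the gradient rather than the full SVD a projection would need.

```python
# Sketch: loss gradient and linear optimization oracle for online collaborative
# filtering over the trace-norm ball {X : ||X||_* <= tau}.
import numpy as np
from scipy.sparse.linalg import svds


def squared_loss_grad(X, i, j, y):
    """Gradient of (X[i, j] - y)^2: supported on a single entry, hence rank one."""
    G = np.zeros_like(X)
    G[i, j] = 2.0 * (X[i, j] - y)
    return G


def linear_opt_trace_ball(G, tau):
    """argmin_{||X||_* <= tau} <G, X> = -tau * u1 v1^T for the top singular pair of G."""
    u, s, vt = svds(G, k=1)                # one singular triplet, not a full SVD
    return -tau * np.outer(u[:, 0], vt[0, :])
    # For the single-entry gradient above, the top singular pair is simply
    # (e_i, e_j), so in this application the oracle is essentially free.
```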

$\underline{\text{Significance of Paper}}$

In this paper, an efficient algorithmic scheme is given for online convex optimization that performs one linear optimization per iteration rather than one quadratic optimization (projection). The advantages over traditional gradient-descent techniques are speed of implementation, parameter-independence, an explicit sampling scheme for the iterates, sparsity, and a natural lazy implementation. The disadvantage is that the provable regret bounds are not always optimal. The major open problem left is to improve the regret bounds attainable with only one linear optimization per iteration, or to show lower bounds on the number of linear optimization steps necessary to obtain optimal regret.

$\underline{\text{Reference}}$

[1] Hazan, E. and Kale, S. Projection-free Online Learning. arXiv preprint arXiv:1206.4657, 2012.
