
Projection-free Online Learning


Mar. 29, 2021


$\underline{\text{Aim}}$

In this paper, the Frank-Wolfe technique is used to derive efficient online learning algorithms that eschew projections in favor of much cheaper linear optimization steps. Further advantages of the algorithms are that they are parameter-free in the stochastic case and produce sparse decisions.

$\underline{\text{Background}}$

Projections are usually the computational bottleneck in online learning algorithms, since projecting onto the decision set in the $\ell_2$-norm amounts to solving a convex quadratic program. In contrast, in many settings of practical interest, solving convex quadratic programs is out of the question while linear optimization over the same set can be carried out efficiently. In this paper, efficient online learning algorithms are given that replace the projection step with a linear optimization step in a variety of settings, as summarized in the following theorem:
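To make the gap concrete, here is a small sketch (my own example, not taken from the paper) contrasting the two oracle types on the spectrahedron $\{X \succeq 0,\ \mathrm{tr}(X)=1\}$: linear optimization needs only one extreme eigenvector of the gradient, whereas Euclidean projection needs a full eigendecomposition followed by a projection of the eigenvalues onto the simplex.

```python
# Sketch (illustrative, not from the paper): linear optimization vs. projection
# on the spectrahedron {X PSD, tr(X) = 1}.
import numpy as np
from scipy.sparse.linalg import eigsh


def linear_opt_spectrahedron(G):
    """argmin_{X psd, tr X = 1} <G, X> = v v^T for the smallest eigenvector v of G (symmetric)."""
    w, v = eigsh(G, k=1, which="SA")      # one extreme eigenpair suffices
    v = v[:, 0]
    return np.outer(v, v)


def project_spectrahedron(Y):
    """Euclidean projection of a symmetric Y: full eigendecomposition + simplex projection."""
    w, V = np.linalg.eigh(Y)                             # full spectrum needed
    mu = np.sort(w)[::-1]                                # project eigenvalues onto the simplex
    cum = np.cumsum(mu) - 1.0
    rho = np.nonzero(mu - cum / np.arange(1, len(mu) + 1) > 0)[0][-1]
    theta = cum[rho] / (rho + 1)
    return V @ np.diag(np.maximum(w - theta, 0.0)) @ V.T
```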

Theorem 1.1. There is an algorithm scheme for online convex optimization that performs one linear optimization over the convex domain per iteration, and with appropriate modifications for each setting, obtains the following regret bounds:
[Table of regret bounds for the different settings omitted; see Theorem 1.1 in the paper.]
Furthermore, in each iteration t, the algorithm maintains an explicit, efficiently sampleable distribution over at most t boundary points with expectation equal to the current iterate.

The above theorem entails appealing advantages: computational efficiency, a parameter-free implementation, and an efficient, sparse representation of the iterates. The proposed algorithm explicitly maintains a distribution over the vertices (or, more generally, boundary points) of the decision set, thereby eliminating the need for any further decomposition. In fact, in round t the distribution is supported on at most t boundary points, thus giving a form of sparsity.

$\underline{\text{Brief Project Description}}$

The setting of interest in this paper is Online Convex Optimization (OCO). Iteratively, in each round $t=1,2,\cdots,T$ a learner is required to produce a point $\mathbf{x}_t$ from a convex, compact set $\mathcal{K} \subseteq \mathbb{R}^n$. In response, an adversary produces a convex cost function $f_t: \mathcal{K}\rightarrow\mathbb{R}$, and the learner suffers the cost $f_t(\mathbf{x}_t)$. The goal of the learner is to produce points $\mathbf{x}_t$ so that the regret,
$$\text{Regret} := \sum_{t=1}^{T} f_{t}(\mathbf{x}_{t}) - \min_{\mathbf{x} \in \mathcal{K}} \sum_{t=1}^{T} f_{t}(\mathbf{x})$$

is sublinear in $T$. If the cost functions are stochastic, regret is measured using the expected cost function $f=\mathbf{E}[f_t]$ instead of the actual costs.
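As a minimal illustration of the protocol (the learner/adversary interface below is my own scaffolding, not the paper's), the regret can be evaluated against any fixed comparator point; the definition takes the best such point in hindsight.

```python
# Sketch of the OCO protocol and the regret quantity defined above.
def run_oco(learner, adversary, comparator, T):
    """Play T rounds; return cumulative loss minus the comparator's cumulative loss.

    The true regret uses the best fixed point in K in hindsight; here a fixed
    comparator point is supplied for illustration.
    """
    losses, comparator_losses = [], []
    for t in range(1, T + 1):
        x_t = learner.predict()          # learner commits to a point in K
        f_t = adversary(t)               # adversary reveals a convex cost f_t
        losses.append(f_t(x_t))
        comparator_losses.append(f_t(comparator))
        learner.update(f_t)              # learner observes f_t (or its gradient)
    return sum(losses) - sum(comparator_losses)
```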

After the notions of $\beta$-smoothness and $\sigma$-strong convexity are introduced, smoothed functions are defined:

Let $\mathbb{B}$ and $\mathbb{S}$ denote the unit ball and unit sphere in $\mathbb{R}^n$, respectively. Given $\delta > 0$, let the $\delta$-smoothing of a function $f$ be
$$\hat{f}_{\delta}(\mathbf{x})=\mathbf{E}_{\mathbf{u} \in \mathbb{B}}[f(\mathbf{x}+\delta \mathbf{u})]$$
where $\mathbf{u}$ is chosen uniformly at random from $\mathbb{B}$. We are implicitly assuming that $f$ is defined on all points within distance $\delta$ of $\mathcal{K}$.
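For intuition, the smoothed value can be approximated by Monte Carlo sampling. The sketch below is my own illustration of the definition (sampling uniformly from the $\delta$-ball around $\mathbf{x}$), not code from the paper.

```python
# Sketch: Monte Carlo estimate of the delta-smoothing f_hat_delta(x).
import numpy as np


def smoothed_value(f, x, delta, num_samples=1000, rng=None):
    """Estimate f_hat_delta(x) = E_{u ~ unit ball}[ f(x + delta * u) ]."""
    rng = np.random.default_rng() if rng is None else rng
    d = x.shape[0]
    total = 0.0
    for _ in range(num_samples):
        g = rng.standard_normal(d)
        u = g / np.linalg.norm(g)        # uniform direction on the unit sphere
        r = rng.uniform() ** (1.0 / d)   # radius making r * u uniform in the unit ball
        total += f(x + delta * r * u)
    return total / num_samples
```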

Lemma 2.1 shows that $\hat{f}_{\delta}$ is a good approximation of $f$:

Lemma 2.1 If $f$ is convex and L-Lipschitz, then the function $\hat{f}_{\delta}$ has the following properties:
1. $\hat{f}_{\delta}$ is convex and L-Lipschitz;
2. for any $\mathbf{x} \in \mathcal{K}$, $\nabla \hat{f}_{\delta}(\mathbf{x})=\frac{d}{\delta} \mathbf{E}_{\mathbf{u} \in \mathbb{S}}[f(\mathbf{x}+\delta \mathbf{u}) \mathbf{u}]$, where $d$ denotes the dimension (this expectation over the unit sphere underlies the gradient estimator sketched after this list);
3. for any $\mathbf{x} \in \mathcal{K}$, $\|\nabla \hat{f}_{\delta}(\mathbf{x})\| \leq d L$;
4. $\hat{f}_{\delta}$ is $\frac{dL}{\delta}$-smooth;
5. for any $\mathbf{x} \in \mathcal{K}$, $|f(\mathbf{x})-\hat{f}_{\delta}(\mathbf{x})| \leq \delta L$.
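Property 2 suggests the standard one-point gradient estimator used in bandit-style algorithms: evaluate $f$ at a single randomly perturbed point and rescale. The sketch below is my own illustration under that reading, not code from the paper.

```python
# Sketch: single-sample unbiased estimator of grad f_hat_delta(x),
# based on property 2 of Lemma 2.1.
import numpy as np


def smoothed_gradient_estimate(f, x, delta, rng=None):
    """Return (d / delta) * f(x + delta * u) * u with u uniform on the unit sphere."""
    rng = np.random.default_rng() if rng is None else rng
    d = x.shape[0]
    g = rng.standard_normal(d)
    u = g / np.linalg.norm(g)            # uniform on the unit sphere S
    return (d / delta) * f(x + delta * u) * u
```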

A feature of the proposed algorithms is that they predict with sparse solutions, where sparsity is defined in the following manner:

Definition 2.2 Let $\mathcal{K} \subseteq \mathbb{R}^{n}$ be a convex, compact set and let $\mathbf{x} \in \mathcal{K}$. We say that $\mathbf{x}$ is t-sparse w.r.t. $\mathcal{K}$ if it can be written as a convex combination of t boundary points of $\mathcal{K}$.

Notice that all the algorithms proposed in the paper produce $t$-sparse predictions at iteration $t$ w.r.t. the underlying decision set $\mathcal{K}$.

The Online Frank-Wolfe (OFW) algorithm is as follows:

[Pseudocode of the Online Frank-Wolfe (OFW) algorithm omitted; see the paper.]
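Since the pseudocode figure did not survive, here is a minimal sketch of a Frank-Wolfe-style online update. The surrogate objective $F_t(\mathbf{x}) = \eta\,\langle \sum_{\tau \le t}\nabla f_\tau(\mathbf{x}_\tau), \mathbf{x}\rangle + \|\mathbf{x}-\mathbf{x}_1\|^2$ and the step size $\gamma_t$ below are simplifying assumptions on my part; the paper tunes both per setting (smooth/non-smooth, stochastic/adversarial), so this is an illustration of the scheme rather than the exact algorithm.

```python
# Sketch of an online Frank-Wolfe update: one linear optimization per round,
# no projection.  F_t and gamma_t below are assumptions, not the paper's choices.
import numpy as np


class OnlineFrankWolfe:
    def __init__(self, linear_opt, x1, eta=0.1):
        self.linear_opt = linear_opt       # oracle: v = argmin_{x in K} <g, x>
        self.x = x1.astype(float)          # current iterate; x1 assumed to be a boundary point
        self.x1 = x1.astype(float)
        self.grad_sum = np.zeros_like(self.x1)  # running sum of observed gradients
        self.eta = eta
        self.t = 0
        self.support = [(1.0, self.x1)]    # convex combination over boundary points

    def predict(self):
        return self.x

    def update(self, grad_t):
        """Observe the gradient of f_t at the current iterate and take one FW step."""
        self.t += 1
        self.grad_sum += grad_t
        # gradient at x_t of the surrogate F_t(x) = eta * <grad_sum, x> + ||x - x1||^2
        g = self.eta * self.grad_sum + 2.0 * (self.x - self.x1)
        v = self.linear_opt(g)             # one linear optimization over K
        gamma = min(1.0, 1.0 / np.sqrt(self.t))
        # x_{t+1} = (1 - gamma) x_t + gamma v, keeping the sparse representation
        self.support = [(w * (1 - gamma), p) for (w, p) in self.support] + [(gamma, v)]
        self.x = (1 - gamma) * self.x + gamma * v
```

Note that if $\mathbf{x}_1$ is a boundary point, the maintained convex combination has at most $t$ support points when predicting in round $t$, matching the sparsity property stated earlier.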
The OFW algorithm is compared to the classic OGD algorithm. To evaluate the performance benefits of OFW over OGD, experiments are carried out with a simple test application, viz. online collaborative filtering. The problem is the following. In each round, the learner is required to produce an $m \times n$ matrix $\mathbf{X}$ with trace norm (i.e. sum of singular values) bounded by a parameter $\tau$. This matrix is interpreted as supplying, for each user $i \in [m]$, a rating for each item $j \in [n]$. The adversary then chooses an entry $(i,j)$ and reveals the true rating for it, viz. $y \in \mathbb{R}$. The learner suffers the squared loss $(\mathbf{X}(i, j)-y)^{2}$. The goal is to compete with the set of all $m \times n$ matrices of trace norm bounded by $\tau$.
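Under this setup, a Frank-Wolfe-style learner needs only two ingredients per round, sketched below under my own assumptions about the implementation: the gradient of the squared loss at the revealed entry (a rank-one, single-entry matrix) and linear optimization over the trace-norm ball, which requires only the top singular pair of the gradient rather than the full SVD a projection would need.

```python
# Sketch: loss gradient and linear optimization oracle for online collaborative
# filtering over the trace-norm ball {X : ||X||_* <= tau}.
import numpy as np
from scipy.sparse.linalg import svds


def squared_loss_grad(X, i, j, y):
    """Gradient of (X[i, j] - y)^2: supported on a single entry, hence rank one."""
    G = np.zeros_like(X)
    G[i, j] = 2.0 * (X[i, j] - y)
    return G


def linear_opt_trace_ball(G, tau):
    """argmin_{||X||_* <= tau} <G, X> = -tau * u1 v1^T for the top singular pair of G."""
    u, s, vt = svds(G, k=1)                # one singular triplet, not a full SVD
    return -tau * np.outer(u[:, 0], vt[0, :])
    # For the single-entry gradient above, the top singular pair is simply
    # (e_i, e_j), so in this application the oracle is essentially free.
```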

$\underline{\text{Significance of Paper}}$

In this paper, an efficient algorithmic scheme is given for online convex optimization that performs one linear optimization per iteration rather than one quadratic optimization (projection). The advantages over traditional gradient-descent techniques are speed of implementation, parameter-independence, an explicit sampling scheme for the iterates, sparsity, and a natural lazy implementation. The disadvantage is that the provable regret bounds are not always optimal. The major open problem left is to improve the regret bounds attainable with only one linear optimization per iteration, or to show lower bounds on the number of linear optimization steps necessary to obtain optimal regret.

$\underline{\text{Reference}}$

[1] Hazan, E. and Kale, S. Projection-free Online Learning. arXiv preprint arXiv:1206.4657, 2012.
