Online Convex Optimization

OCO

A quick and rough summary — see the book for details!

First-Order Methods for OCO

The goal of this class of algorithms is to minimize regret, defined as follows:
$$
\mathrm{Regret}_{T} = \sum_{t=1}^T f_t(\mathbf{x}_t) - \min_{\mathbf{x}\in\mathcal{K}} \sum_{t=1}^T f_t(\mathbf{x}). \tag{1}
$$
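As a concrete reading of (1), here is a minimal illustrative sketch (not from the references; `player_losses` and `comparator_losses` are hypothetical inputs) of computing regret in hindsight:

```python
import numpy as np

def regret(player_losses, comparator_losses):
    """Regret after T rounds, per definition (1).

    player_losses[t]     = f_t(x_t): loss the online player actually incurred at round t.
    comparator_losses[t] = f_t(x*):  loss of the best fixed decision x* in hindsight.
    """
    return float(np.sum(player_losses) - np.sum(comparator_losses))
```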

Online Gradient Descent (OGD)

OGD is the online analogue of gradient descent; it was first introduced by Zinkevich [1].

The OGD update rule is:
$$
\mathbf{y}_{t+1} = \mathbf{x}_t - \eta_t \nabla f_t(\mathbf{x}_t), \qquad \mathbf{x}_{t+1} = \Pi_{\mathcal{K}}(\mathbf{y}_{t+1}) \tag{2}
$$

With the step-size sequence $\{\eta_t = \frac{D}{G\sqrt{t}},\ t \in [T]\}$, where $D$ bounds the diameter of $\mathcal{K}$ and $G$ bounds the gradient norms, we have
$$
\mathrm{Regret}_T = \sum_{t=1}^T f_t(\mathbf{x}_t) - \min_{\mathbf{x}^\star\in\mathcal{K}} \sum_{t=1}^T f_t(\mathbf{x}^\star) \leq \frac{3}{2} G D \sqrt{T} \tag{3}
$$

Proof sketch. Let $\mathbf{x}^\star \in \arg\min_{\mathbf{x} \in \mathcal{K}} \sum_{t=1}^T f_t(\mathbf{x})$ and define $\nabla_t \triangleq \nabla f_t(\mathbf{x}_t)$. By convexity of $f_t$ (the first-order characterization of convexity),
$$
f_t(\mathbf{x}_t) - f_t(\mathbf{x}^\star) \leq \nabla_t^{\top}(\mathbf{x}_t - \mathbf{x}^\star) \tag{4}
$$
and, from the update rule together with the Pythagorean property of the projection $\Pi_{\mathcal{K}}$,
$$
\|\mathbf{x}_{t+1}-\mathbf{x}^{\star}\|^2 = \|\Pi_{\mathcal{K}}(\mathbf{x}_t-\eta_t \nabla_t)-\mathbf{x}^{\star}\|^2 \leq \|\mathbf{x}_t-\eta_t \nabla_t-\mathbf{x}^{\star}\|^2,
$$
which rearranges to
$$
2 \nabla_t^{\top}(\mathbf{x}_t-\mathbf{x}^{\star}) \leq \frac{\|\mathbf{x}_t-\mathbf{x}^{\star}\|^2-\|\mathbf{x}_{t+1}-\mathbf{x}^{\star}\|^2}{\eta_t}+\eta_t G^2 \tag{5}
$$
Combining (4) and (5), summing over $t = 1, \dots, T$, and bounding the resulting telescoping sums completes the proof.
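For completeness, the summation step can be sketched as follows (with the convention $1/\eta_0 \triangleq 0$ and using $\|\mathbf{x}_t - \mathbf{x}^\star\| \leq D$):

$$
\begin{aligned}
2\sum_{t=1}^T\bigl(f_t(\mathbf{x}_t)-f_t(\mathbf{x}^\star)\bigr)
&\leq 2\sum_{t=1}^T \nabla_t^{\top}(\mathbf{x}_t-\mathbf{x}^\star) \\
&\leq \sum_{t=1}^T \|\mathbf{x}_t-\mathbf{x}^\star\|^2\Bigl(\frac{1}{\eta_t}-\frac{1}{\eta_{t-1}}\Bigr) + G^2\sum_{t=1}^T\eta_t \\
&\leq D^2\,\frac{1}{\eta_T} + G^2\sum_{t=1}^T\eta_t
\;\leq\; DG\sqrt{T} + 2DG\sqrt{T} \;=\; 3DG\sqrt{T},
\end{aligned}
$$

where the last line uses $\eta_t = \frac{D}{G\sqrt{t}}$ and $\sum_{t=1}^T \frac{1}{\sqrt{t}} \leq 2\sqrt{T}$; dividing by 2 gives (3).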

In fact, $O(DG\sqrt{T})$ is the best achievable regret bound: no algorithm can guarantee worst-case regret of $o(DG\sqrt{T})$. See Section 3.2 of [2] for the matching lower bound.
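As a concrete illustration (a minimal sketch, not code from the references), projected OGD with the step sizes above might look as follows; the names `ogd`, `project_ball`, and `grad_fns` are mine, and for simplicity the decision set $\mathcal{K}$ is assumed to be a Euclidean ball of radius $D/2$ so that projection is a simple rescaling:

```python
import numpy as np

def project_ball(y, radius):
    """Euclidean projection onto the ball {x : ||x|| <= radius}."""
    norm = np.linalg.norm(y)
    return y if norm <= radius else y * (radius / norm)

def ogd(grad_fns, dim, D, G):
    """Projected online gradient descent with eta_t = D / (G * sqrt(t)).

    grad_fns[t](x) returns the gradient of the round-t loss f_t at x.
    Returns the iterates x_1, ..., x_T.
    """
    x = np.zeros(dim)            # x_1: any point of K
    iterates = []
    for t, grad in enumerate(grad_fns, start=1):
        iterates.append(x.copy())
        eta = D / (G * np.sqrt(t))
        y = x - eta * grad(x)                 # gradient step on the observed loss
        x = project_ball(y, radius=D / 2.0)   # project back onto K
    return iterates
```

For instance, with square losses $f_t(\mathbf{x}) = \|\mathbf{x} - \mathbf{z}_t\|^2$ one could pass `grad_fns = [lambda x, z=z: 2 * (x - z) for z in data]`.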

Online gradient descent for strongly convex functions

For strongly convex loss functions, setting the step sizes $\eta_t = \frac{1}{\alpha t}$, where $\alpha$ is the strong-convexity parameter, OGD satisfies:
$$
\mathrm{Regret}_T \leq \frac{G^2}{2\alpha}(1+\log T) \tag{6}
$$

Proof sketch: analogous to the general convex case; the differences are the step-size schedule and the use of strong convexity (the first-order characterization of strong convexity):
$$
2\bigl(f_t(\mathbf{x}_t)-f_t(\mathbf{x}^{\star})\bigr) \leq 2 \nabla_t^{\top}(\mathbf{x}_t-\mathbf{x}^{\star})-\alpha\|\mathbf{x}^{\star}-\mathbf{x}_t\|^2 \tag{7}
$$
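A sketch of how the pieces combine (again with the convention $1/\eta_0 \triangleq 0$): plugging the update bound (5) into (7) and summing over $t$,

$$
\begin{aligned}
2\sum_{t=1}^T\bigl(f_t(\mathbf{x}_t)-f_t(\mathbf{x}^\star)\bigr)
&\leq \sum_{t=1}^T \|\mathbf{x}_t-\mathbf{x}^\star\|^2\Bigl(\frac{1}{\eta_t}-\frac{1}{\eta_{t-1}}-\alpha\Bigr) + G^2\sum_{t=1}^T\eta_t \\
&= 0 + \frac{G^2}{\alpha}\sum_{t=1}^T \frac{1}{t} \;\leq\; \frac{G^2}{\alpha}\,(1+\log T),
\end{aligned}
$$

since $\eta_t = \frac{1}{\alpha t}$ makes $\frac{1}{\eta_t}-\frac{1}{\eta_{t-1}}-\alpha = \alpha t - \alpha(t-1) - \alpha = 0$; dividing by 2 gives (6).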

Stochastic Gradient Descent [3][4]

Stochastic optimization is a special case of the online convex optimization setting; the goal is to minimize a convex function over a convex set:
$$
\min_{\mathbf{x} \in \mathcal{K}} f(\mathbf{x}). \tag{8}
$$
Unlike the offline setting, the optimizer does not observe exact gradients; instead, it queries a noisy gradient oracle:
$$
\mathcal{O}(\mathbf{x}) \triangleq \tilde{\nabla}_\mathbf{x} \quad \text{s.t.} \quad \mathbb{E}[\tilde{\nabla}_\mathbf{x}] = \nabla f(\mathbf{x}), \qquad \mathbb{E}\bigl[\|\tilde{\nabla}_{\mathbf{x}}\|^2\bigr] \leq G^2 \tag{9}
$$
That is, querying the oracle at a point $\mathbf{x} \in \mathcal{K}$ returns a noisy gradient whose expectation is the true gradient of $f$ at $\mathbf{x}$ and whose expected squared norm is at most $G^2$. SGD is therefore just OGD run with noisy gradients. Note that here we prove a convergence rate for SGD, whereas the results above were regret bounds.

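As an illustration (a minimal sketch, not code from the references), projected SGD with an averaged output $\bar{\mathbf{x}}_T$ might look as follows; the `noisy_grad` oracle interface and the ball-shaped decision set are assumptions made for the example:

```python
import numpy as np

def sgd(noisy_grad, x0, D, G, T, rng=None):
    """Projected stochastic gradient descent, returning the averaged iterate.

    noisy_grad(x, rng) returns an unbiased noisy gradient of f at x
    with E[||g||^2] <= G**2. The decision set is taken to be the
    Euclidean ball of radius D / 2 (so its diameter is D).
    """
    rng = rng or np.random.default_rng()
    x = np.array(x0, dtype=float)
    running_sum = np.zeros_like(x)
    for t in range(1, T + 1):
        running_sum += x
        eta = D / (G * np.sqrt(t))          # step size eta_t = D / (G sqrt(t))
        y = x - eta * noisy_grad(x, rng)    # step along the noisy gradient
        norm = np.linalg.norm(y)
        x = y if norm <= D / 2 else y * (D / 2) / norm   # project onto the ball
    return running_sum / T                  # x_bar_T = average of x_1 .. x_T

# Example: minimize f(x) = 0.5 * ||x - c||^2 with additive Gaussian gradient noise.
c = np.array([0.3, -0.2])
oracle = lambda x, rng: (x - c) + 0.1 * rng.standard_normal(x.shape)
x_bar = sgd(oracle, x0=np.zeros(2), D=2.0, G=2.0, T=10_000)
```

The returned average $\bar{\mathbf{x}}_T$ is the point to which the guarantee (10) below applies.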

With the step-size sequence $\eta_t = \frac{D}{G\sqrt{t}}$ (as in OGD), we have:
$$
\mathbb{E}[f(\bar{\mathbf{x}}_T)] \leq \min_{\mathbf{x}^{\star}\in\mathcal{K}} f(\mathbf{x}^{\star}) + \frac{3GD}{2\sqrt{T}}. \tag{10}
$$
Note that moving $\min_{\mathbf{x}^{\star}\in\mathcal{K}} f(\mathbf{x}^{\star})$ to the left-hand side and summing over the $T$ rounds recovers a regret-style bound of $O(DG\sqrt{T})$.

Proof sketch: define the linear functions $f_t(\mathbf{x}) \triangleq \tilde{\nabla}_t^{\top}\mathbf{x}$ and run OGD on them; then
$$
\begin{aligned}
\mathbb{E}[f(\bar{\mathbf{x}}_T)]-f(\mathbf{x}^\star)
&\leq \mathbb{E}\Bigl[\frac{1}{T}\sum_t f(\mathbf{x}_t)\Bigr]-f(\mathbf{x}^\star) && \text{convexity of } f \text{ (Jensen's inequality)} \\
&\leq \frac{1}{T}\,\mathbb{E}\Bigl[\sum_t \nabla f(\mathbf{x}_t)^{\top}(\mathbf{x}_t-\mathbf{x}^\star)\Bigr] && \text{convexity again, writing } f(\mathbf{x}^\star)=\tfrac{1}{T}\cdot T\cdot f(\mathbf{x}^\star) \\
&= \frac{1}{T}\,\mathbb{E}\Bigl[\sum_t \tilde{\nabla}_t^{\top}(\mathbf{x}_t-\mathbf{x}^\star)\Bigr] && \text{unbiased noisy gradient estimator} \\
&= \frac{1}{T}\,\mathbb{E}\Bigl[\sum_t f_t(\mathbf{x}_t)-f_t(\mathbf{x}^\star)\Bigr] && \text{definition of } f_t \\
&\leq \frac{\mathrm{Regret}_T}{T} && \text{definition of regret} \\
&\leq \frac{3GD}{2\sqrt{T}} && \text{OGD guarantee (3)}
\end{aligned}
\tag{11}
$$
Jensen's inequality [5]

For a convex function $f$, points $\{x_1, \dots, x_M\}$ in its domain, and weights $\lambda_i \geq 0$ with $\sum_i \lambda_i = 1$, $f$ satisfies:
$$
f\Bigl(\sum_{i=1}^M \lambda_i x_i\Bigr) \leq \sum_{i=1}^M \lambda_i f(x_i) \tag{12}
$$
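For example, with $f(x) = x^2$, $M = 2$, $\lambda_1 = \lambda_2 = \tfrac{1}{2}$, $x_1 = 0$, $x_2 = 2$:

$$
f\Bigl(\tfrac{1}{2}\cdot 0 + \tfrac{1}{2}\cdot 2\Bigr) = 1 \;\leq\; \tfrac{1}{2} f(0) + \tfrac{1}{2} f(2) = 2.
$$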


References

1. M. Zinkevich. Online Convex Programming and Generalized Infinitesimal Gradient Ascent.
2. E. Hazan. Introduction to Online Convex Optimization.
3. S. Bubeck. Convex Optimization: Algorithms and Complexity.
4. G. Lan. An Optimal Method for Stochastic Composite Optimization.
5. Jensen's inequality, Wikipedia.
