OCO
A quick and rough summary; see the book for details!
First-Order Methods for OCO
The goal of this class of algorithms is to minimize regret, which is defined as follows:
$$
\mathrm{Regret}_{T} =\sum_{t=1}^T f_t(\mathbf{x}_t)-\min_{\mathbf{x}\in\mathcal{K}}\sum_{t=1}^T f_t(\mathbf{x}). \tag{1}
$$
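For intuition, here is a small Python toy (not from the source) that evaluates $\mathrm{Regret}_T$ for the illustrative losses $f_t(x) = (x - z_t)^2$ on $\mathcal{K} = [-1, 1]$, where the best fixed decision in hindsight has a closed form; all names here are made up for illustration:

```python
import numpy as np

def regret(xs, zs, lo=-1.0, hi=1.0):
    """Regret_T for the toy losses f_t(x) = (x - z_t)^2 on K = [lo, hi].

    xs: the learner's decisions x_1..x_T; zs: the loss parameters z_1..z_T.
    For these losses the best fixed decision in hindsight is mean(zs) clipped to K.
    """
    xs, zs = np.asarray(xs, dtype=float), np.asarray(zs, dtype=float)
    x_star = np.clip(zs.mean(), lo, hi)            # argmin_x sum_t f_t(x) over K
    learner_loss = np.sum((xs - zs) ** 2)          # sum_t f_t(x_t)
    best_fixed_loss = np.sum((x_star - zs) ** 2)   # min_x sum_t f_t(x)
    return learner_loss - best_fixed_loss

# Example: a learner that always plays 0 against slowly drifting targets.
print(regret(xs=np.zeros(100), zs=np.linspace(-0.5, 0.5, 100)))
```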
Online Gradient Descent (OGD)
The OGD algorithm is the online version of gradient descent (GD); it was first introduced by Zinkevich [1].
The OGD update rule is:
$$
\mathbf{y}_{t+1} = \mathbf{x}_t - \eta_t \nabla f_t(\mathbf{x}_t), \qquad \mathbf{x}_{t+1} = \Pi_{\mathcal{K}}(\mathbf{y}_{t+1}) \tag{2}
$$
Choosing the step-size sequence $\{\eta_t = \frac{D}{G\sqrt{t}},\ t\in[T]\}$, where $D$ is the diameter of $\mathcal{K}$ and $G$ bounds the gradient norms, we have
$$
\mathrm{Regret}_T = \sum_{t=1}^T f_t(\mathbf{x}_t)-\min_{\mathbf{x}^*\in\mathcal{K}}\sum_{t=1}^T f_t(\mathbf{x}^*) \leq \frac{3}{2} G D \sqrt{T} \tag{3}
$$
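A minimal Python sketch of projected OGD with this step-size schedule, assuming for concreteness that $\mathcal{K}$ is the Euclidean ball of radius $D/2$ (diameter $D$); `grad_ft` is a placeholder for the gradient of the loss revealed at round $t$, and the toy usage at the end is purely illustrative:

```python
import numpy as np

def project_ball(y, radius):
    """Euclidean projection onto {x : ||x|| <= radius}."""
    norm = np.linalg.norm(y)
    return y if norm <= radius else y * (radius / norm)

def ogd(grad_ft, x0, T, D, G):
    """Projected OGD with eta_t = D / (G * sqrt(t)); returns the played iterates x_1..x_T."""
    x = np.asarray(x0, dtype=float)
    xs = []
    for t in range(1, T + 1):
        xs.append(x.copy())                     # play x_t, then observe f_t
        eta = D / (G * np.sqrt(t))
        y = x - eta * grad_ft(t, x)             # y_{t+1} = x_t - eta_t * grad f_t(x_t)
        x = project_ball(y, radius=D / 2.0)     # x_{t+1} = projection of y_{t+1} onto K
    return xs

# Toy usage: losses f_t(x) = ||x - z_t||^2 with gradient 2 (x - z_t).
rng = np.random.default_rng(0)
zs = 0.1 * rng.normal(size=(100, 2))
iterates = ogd(lambda t, x: 2 * (x - zs[t - 1]), x0=np.zeros(2), T=100, D=2.0, G=4.0)
```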
Sketch of proof. Let $\mathbf{x}^* \in \arg\min_{\mathbf{x} \in \mathcal{K}} \sum_{t=1}^T f_t(\mathbf{x})$ and define $\nabla_t \triangleq \nabla f_t(\mathbf{x}_t)$. By convexity of $f_t$ (the first-order condition for convexity):
$$
f_t(\mathbf{x}_t) - f_t(\mathbf{x}^*) \leq \nabla_t^{\top}(\mathbf{x}_t - \mathbf{x}^*) \tag{4}
$$
Combining this with the update rule of the algorithm:
$$
\left\|\mathbf{x}_{t+1}-\mathbf{x}^{\star}\right\|^2=\left\|\Pi_{\mathcal{K}}\left(\mathbf{x}_t-\eta_t \nabla_t\right)-\mathbf{x}^{\star}\right\|^2 \leq\left\|\mathbf{x}_t-\eta_t \nabla_t-\mathbf{x}^{\star}\right\|^2
$$

and hence

$$
2 \nabla_t^{\top}\left(\mathbf{x}_t-\mathbf{x}^{\star}\right) \leq \frac{\left\|\mathbf{x}_t-\mathbf{x}^{\star}\right\|^2-\left\|\mathbf{x}_{t+1}-\mathbf{x}^{\star}\right\|^2}{\eta_t}+\eta_t G^2 \tag{5}
$$
Combining (4) and (5) and summing over $t = 1,\dots,T$ completes the proof.
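Concretely, chaining (4) with half of (5) and summing over $t$ (with the convention $1/\eta_0 \triangleq 0$, and using $\|\mathbf{x}_t - \mathbf{x}^*\| \leq D$ and $\eta_t = \frac{D}{G\sqrt{t}}$) yields

$$
\sum_{t=1}^T \big(f_t(\mathbf{x}_t)-f_t(\mathbf{x}^*)\big)
\;\leq\; \frac{1}{2}\sum_{t=1}^T \|\mathbf{x}_t-\mathbf{x}^*\|^2\Big(\frac{1}{\eta_t}-\frac{1}{\eta_{t-1}}\Big)+\frac{G^2}{2}\sum_{t=1}^T \eta_t
\;\leq\; \frac{D^2}{2\eta_T}+\frac{GD}{2}\sum_{t=1}^T \frac{1}{\sqrt{t}}
\;\leq\; \frac{1}{2}GD\sqrt{T}+GD\sqrt{T}
\;=\; \frac{3}{2}GD\sqrt{T}.
$$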
$O(DG\sqrt{T})$ is the best regret bound attainable here: it is tight up to constants, and a matching $\Omega(DG\sqrt{T})$ lower bound is proved in Section 3.2 of [2].
Online gradient descent for strongly convex functions
For strongly convex loss functions, choosing the step-size sequence $\eta_t = \frac{1}{\alpha t}$, where $\alpha$ is the strong-convexity parameter, OGD satisfies:
$$
\mathrm{Regret}_T \leq \frac{G^2}{2 \alpha}(1+\log T) \tag{6}
$$
Sketch of proof: the argument parallels the general convex case; the differences are the step-size schedule and the use of strong convexity (the first-order condition for strong convexity):
$$
2\left(f_t\left(\mathbf{x}_t\right)-f_t\left(\mathbf{x}^{\star}\right)\right) \leq 2 \nabla_t^{\top}\left(\mathbf{x}_t-\mathbf{x}^{\star}\right)-\alpha\left\|\mathbf{x}^{\star}-\mathbf{x}_t\right\|^2 \tag{7}
$$
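With $\eta_t = \frac{1}{\alpha t}$ (and again $1/\eta_0 \triangleq 0$), chaining (7) with the update-rule bound (5) and summing makes the distance terms cancel exactly:

$$
2\sum_{t=1}^T\big(f_t(\mathbf{x}_t)-f_t(\mathbf{x}^\star)\big)
\;\leq\; \sum_{t=1}^T \|\mathbf{x}_t-\mathbf{x}^\star\|^2\Big(\frac{1}{\eta_t}-\frac{1}{\eta_{t-1}}-\alpha\Big)+G^2\sum_{t=1}^T \eta_t
\;=\; 0+\frac{G^2}{\alpha}\sum_{t=1}^T \frac{1}{t}
\;\leq\; \frac{G^2}{\alpha}(1+\log T),
$$

which gives (6) after dividing by 2.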
Stochastic Gradient Descent [3] [4]
Stochastic optimization is a special case of online convex optimization in which the goal is to minimize a convex function over a convex domain:
$$
\min_{\mathbf{x} \in \mathcal{K}} f(\mathbf{x}). \tag{8}
$$
Unlike the offline setting, the optimizer cannot observe exact gradients; instead it receives noisy gradients from an oracle:
$$
\mathcal{O}(\mathbf{x}) \triangleq \tilde{\nabla}_\mathbf{x} \quad \text{s.t.} \quad \mathbb{E}[\tilde{\nabla}_\mathbf{x}] = \nabla f(\mathbf{x}), \qquad \mathbb{E}\big[\| \tilde{\nabla}_{\mathbf{x}}\|^2\big] \leq G^2 \tag{9}
$$
That is, each time we query a point $\mathbf{x}$ in the decision set, we receive a noisy gradient at that point: its expectation equals the gradient of $f$ at $\mathbf{x}$, and its expected squared norm is bounded by $G^2$. The setting is analogous to OGD, except that the available gradients are noisy. Here we prove a convergence rate for SGD, whereas above we proved a regret bound.
Choosing the step-size sequence $\eta_t = \frac{D}{G\sqrt{t}}$ and letting $\bar{\mathbf{x}}_T \triangleq \frac{1}{T}\sum_{t=1}^T \mathbf{x}_t$ denote the average iterate, we have:
$$
\mathbb{E}[f(\bar{\mathbf{x}}_T)]\leq\min_{\mathbf{x}^{\star}\in\mathcal{K}}f(\mathbf{x}^{\star})+\frac{3GD}{2\sqrt{T}}. \tag{10}
$$
Note that moving $\min_{\mathbf{x}^{\star}\in\mathcal{K}} f(\mathbf{x}^{\star})$ to the left-hand side and summing the per-round gap over the $T$ rounds recovers a regret bound of $O(DG\sqrt{T})$.
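A hedged Python sketch of this reduction (OGD applied to noisy gradients, returning the average iterate $\bar{\mathbf{x}}_T$); `noisy_grad` stands in for the oracle $\mathcal{O}(\mathbf{x})$ of (9), and the toy usage is purely illustrative:

```python
import numpy as np

def sgd(noisy_grad, project, x0, T, D, G):
    """SGD as OGD on noisy gradients; returns the averaged iterate x_bar_T."""
    x = np.asarray(x0, dtype=float)
    xs = []
    for t in range(1, T + 1):
        xs.append(x.copy())
        eta = D / (G * np.sqrt(t))               # same step size as plain OGD
        x = project(x - eta * noisy_grad(x))     # stochastic gradient step, then projection
    return np.mean(xs, axis=0)                   # x_bar_T = (1/T) * sum_t x_t

# Toy usage: f(x) = ||x||^2 on the unit ball, oracle = true gradient + Gaussian noise.
rng = np.random.default_rng(0)
oracle = lambda x: 2 * x + 0.1 * rng.normal(size=x.shape)
proj = lambda y: y if np.linalg.norm(y) <= 1.0 else y / np.linalg.norm(y)
x_bar = sgd(oracle, proj, x0=np.ones(5), T=1000, D=2.0, G=3.0)
print(x_bar)
```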
Sketch of proof: first define the linear functions $f_t(\mathbf{x}) \triangleq \tilde{\nabla}_t^{\top} \mathbf{x}$; then
$$
\begin{aligned}
\mathbb{E}[f(\bar{\mathbf{x}}_T)]-f(\mathbf{x}^\star)
&\leq\mathbb{E}\Big[\frac{1}{T}\sum_t f(\mathbf{x}_t)\Big]-f(\mathbf{x}^\star) && \text{convexity of } f \text{ (Jensen's inequality)} \\
&\leq\frac{1}{T}\mathbb{E}\Big[\sum_t\nabla f(\mathbf{x}_t)^{\top}(\mathbf{x}_t-\mathbf{x}^{\star})\Big] && \text{convexity again, writing } f(\mathbf{x}^\star)=\tfrac{1}{T}\textstyle\sum_t f(\mathbf{x}^\star) \\
&=\frac{1}{T}\mathbb{E}\Big[\sum_t\tilde{\nabla}_t^{\top}(\mathbf{x}_t-\mathbf{x}^{\star})\Big] && \text{unbiasedness of the noisy gradients} \\
&=\frac{1}{T}\mathbb{E}\Big[\sum_t f_t(\mathbf{x}_t)-f_t(\mathbf{x}^{\star})\Big] && \text{definition of } f_t(\mathbf{x}) \\
&\leq\frac{\mathrm{Regret}_T}{T} && \text{definition of regret} \\
&\leq\frac{3GD}{2\sqrt{T}} && \text{OGD guarantee (3)}
\end{aligned} \tag{11}
$$
Jensen's inequality [5]
Let $f$ be a convex function and $\{x_1, \dots, x_M\}$ a set of points in its domain. If $\lambda_i \geq 0$ and $\sum_i \lambda_i = 1$, then $f$ satisfies:
$$
f\Big(\sum_{i=1}^M\lambda_i x_i\Big)\le\sum_{i=1}^M\lambda_i f(x_i) \tag{12}
$$
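As a quick sanity check, take $M = 2$, $\lambda_1 = \lambda_2 = \tfrac{1}{2}$ and $f(x) = x^2$: then (12) reduces to

$$
\Big(\frac{x_1+x_2}{2}\Big)^2 \;\leq\; \frac{x_1^2+x_2^2}{2},
$$

which is equivalent to $(x_1 - x_2)^2 \geq 0$. The first step of (11) is exactly (12) with $\lambda_i = \frac{1}{T}$ applied to the points $\mathbf{x}_1, \dots, \mathbf{x}_T$, giving $f(\bar{\mathbf{x}}_T) \leq \frac{1}{T}\sum_t f(\mathbf{x}_t)$.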