Theoretical Analysis of Accelerated Gradient Descent

For the final algorithm in Introduction to Convex Optimization 11, we now analyze how fast $\lambda_k$ tends to 0.

Theorem: If $\gamma_0\geq \mu$, then $\lambda_k\leq \min\left\{\left(1-\sqrt{\frac{\mu}{L}}\right)^k, \frac{4L}{(2\sqrt{L}+k\sqrt{\gamma_0})^2}\right\}$.
Proof: If $\gamma_k\geq \mu$, then $\gamma_{k+1}=La^2_k=(1-a_k)\gamma_k+a_k\mu \geq \mu$. Since the theorem assumes $\gamma_0\geq \mu$, induction gives $\gamma_k\geq\mu$ for all $k$, and therefore $a_k\geq \sqrt{\frac{\mu}{L}}$. By Lemma 2 of Introduction to Convex Optimization 11, $\lambda_k=\prod\limits_{i=0}^{k-1}(1-a_i)$. Substituting $a_i\geq \sqrt{\frac{\mu}{L}}$ gives $\lambda_k\leq \left(1-\sqrt{\frac{\mu}{L}}\right)^k$.
Let $b_k=\frac{1}{\sqrt{\lambda_k}}$. Since $\{\lambda_k\}$ is a decreasing sequence, we obtain

$$
\begin{aligned}
b_{k+1}-b_k&=\frac{\sqrt{\lambda_k}-\sqrt{\lambda_{k+1}}}{\sqrt{\lambda_k\lambda_{k+1}}}\\
&=\frac{\lambda_k-\lambda_{k+1}}{\sqrt{\lambda_k\lambda_{k+1}}(\sqrt{\lambda_k}+\sqrt{\lambda_{k+1}})}\\
&\geq \frac{\lambda_k-\lambda_{k+1}}{2\lambda_k\sqrt{\lambda_{k+1}}}\\
&=\frac{\lambda_k-(1-a_k)\lambda_k}{2\lambda_k\sqrt{\lambda_{k+1}}}\\
&=\frac{a_k}{2\sqrt{\lambda_{k+1}}}\\
&\geq \frac{1}{2}\sqrt{\frac{\gamma_0}{L}}
\end{aligned}
$$

The last inequality uses $La_k^2=\gamma_{k+1}\geq\gamma_0\lambda_{k+1}$, which follows by induction from $\gamma_{k+1}\geq(1-a_k)\gamma_k$ and $\lambda_{k+1}=(1-a_k)\lambda_k$. Since $b_0=\frac{1}{\sqrt{\lambda_0}}=1$, summing these differences gives $b_k\geq 1+\frac{k}{2}\sqrt{\frac{\gamma_0}{L}}$, that is, $\lambda_k=\frac{1}{b_k^2}\leq\frac{4L}{(2\sqrt{L}+k\sqrt{\gamma_0})^2}$.
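The bound on $\lambda_k$ can be checked numerically. The following is a minimal sketch, not part of the original derivation: the helper names `lambda_sequence` and `lambda_bound` and the values of `L`, `mu`, and `gamma0` are illustrative assumptions. It computes $a_k$ as the positive root of $La_k^2=(1-a_k)\gamma_k+a_k\mu$ and compares $\lambda_k$ with the bound from the theorem.

```python
import math

def lambda_sequence(L, mu, gamma0, n):
    """Return [lambda_0, ..., lambda_n] for L*a_k^2 = (1 - a_k)*gamma_k + a_k*mu."""
    lams, gamma, lam = [1.0], gamma0, 1.0
    for _ in range(n):
        # Positive root of L*a^2 + (gamma - mu)*a - gamma = 0.
        a = (-(gamma - mu) + math.sqrt((gamma - mu) ** 2 + 4.0 * L * gamma)) / (2.0 * L)
        gamma = (1.0 - a) * gamma + a * mu  # gamma_{k+1} = L * a_k^2
        lam *= 1.0 - a                      # lambda_{k+1} = (1 - a_k) * lambda_k
        lams.append(lam)
    return lams

def lambda_bound(L, mu, gamma0, k):
    """Right-hand side of the theorem: min of the linear and the 1/k^2 bound."""
    linear = (1.0 - math.sqrt(mu / L)) ** k
    sublinear = 4.0 * L / (2.0 * math.sqrt(L) + k * math.sqrt(gamma0)) ** 2
    return min(linear, sublinear)

# Illustrative (assumed) constants with gamma0 >= mu.
L, mu, gamma0 = 10.0, 0.1, 10.0
lams = lambda_sequence(L, mu, gamma0, 50)
ok = all(lams[k] <= lambda_bound(L, mu, gamma0, k) + 1e-12 for k in range(51))
print(ok)
```

For small $k$ the $1/k^2$ term is the smaller of the two bounds, while for large $k$ and $\mu>0$ the linear term takes over.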
Theorem: If we take $\gamma_0=L$, then the sequence $\{x_k\}^{\infty}_{k=0}$ generated by this scheme satisfies $f(x_k)-f^*\leq L \min \left\{\left(1-\sqrt{\frac{\mu}{L}}\right)^k,\frac{4}{(k+1)^2}\right\}\|x_0-x^*\|^2$. This means that the scheme is optimal for unconstrained minimization of functions from $\mathfrak{F}_{\mu,L}^{1,1}(\mathbb{R}^n)$, $\mu \geq 0$.

Proof: We use $f(x_0)-f^*\leq \frac{L}{2}\|x_0-x^*\|^2$ together with the previous theorem and the estimate $f(x_k)-f^*\leq\lambda_k\left[f(x_0)-f^*+\frac{\gamma_0}{2}\|x_0-x^*\|^2\right]$. With $\gamma_0=L$ the bracket is at most $L\|x_0-x^*\|^2$, and the second bound on $\lambda_k$ becomes $\frac{4L}{(2\sqrt{L}+k\sqrt{L})^2}=\frac{4}{(k+2)^2}\leq\frac{4}{(k+1)^2}$, which gives the inequality above.
Below is a variant of the final algorithm in Introduction to Convex Optimization 11. It differs in the choice of the step size.
- Choose $x_0\in \mathbb{R}^n$ and $\gamma_0 > 0$, and set $v_0=x_0$;
- $k$-th iteration:
2.1 Compute $a_k\in(0,1)$ from the equation $La_k^2=(1-a_k)\gamma_k+a_k\mu$, and set $\gamma_{k+1}=(1-a_k)\gamma_k+a_k\mu$;
2.2 Choose $y_k=\frac{a_k\gamma_kv_k+\gamma_{k+1}x_k}{\gamma_k+a_k\mu}$, and compute $f(y_k)$ and $\nabla f(y_k)$;
2.3 Find $x_{k+1}=y_k-\frac{1}{L}\nabla f(y_k)$;
2.4 Set $v_{k+1}=\frac{(1-a_k)\gamma_kv_k+a_k\mu y_k-a_k\nabla f(y_k)}{\gamma_{k+1}}$.
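Steps 2.1 to 2.4 can be sketched in plain Python. This is only an illustrative sketch under stated assumptions: the test function $f(x)=\frac{1}{2}\sum_i d_i x_i^2$ with $d_i\in[\mu,L]$ (so $x^*=0$ and $f^*=0$), the helper names `f`, `grad`, and `accelerated_scheme`, and all numeric constants are hypothetical choices, not taken from the original post.

```python
import math

def f(d, x):
    # Assumed test function: f(x) = 0.5 * sum(d_i * x_i^2), minimized at x* = 0.
    return 0.5 * sum(di * xi * xi for di, xi in zip(d, x))

def grad(d, x):
    # Gradient of the assumed test function.
    return [di * xi for di, xi in zip(d, x)]

def accelerated_scheme(d, x0, L, mu, gamma0, n_iter):
    x, v, gamma = list(x0), list(x0), gamma0  # v_0 = x_0
    for _ in range(n_iter):
        # Step 2.1: a_k is the root in (0, 1) of L*a^2 = (1 - a)*gamma_k + a*mu.
        a = (-(gamma - mu) + math.sqrt((gamma - mu) ** 2 + 4.0 * L * gamma)) / (2.0 * L)
        gamma_next = (1.0 - a) * gamma + a * mu
        # Step 2.2: y_k combines v_k and x_k.
        y = [(a * gamma * vi + gamma_next * xi) / (gamma + a * mu)
             for vi, xi in zip(v, x)]
        g = grad(d, y)
        # Step 2.3: gradient step from y_k with step size 1/L.
        x = [yi - gi / L for yi, gi in zip(y, g)]
        # Step 2.4: update v_k.
        v = [((1.0 - a) * gamma * vi + a * mu * yi - a * gi) / gamma_next
             for vi, yi, gi in zip(v, y, g)]
        gamma = gamma_next
    return x

d = [0.1, 1.0, 10.0]   # curvatures between mu = 0.1 and L = 10
x0 = [1.0, 1.0, 1.0]
x = accelerated_scheme(d, x0, L=10.0, mu=0.1, gamma0=10.0, n_iter=100)
print(f(d, x))
```

The sketch keeps all three sequences $x_k$, $y_k$, $v_k$ explicit, which mirrors the listing above but stores more state than is strictly needed.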
Using the equations in the algorithm above, we can try to eliminate some variables. First we eliminate $v_k$. Solving step 2.2 for $v_k$ gives $(1-a_k)\gamma_kv_k=\frac{1-a_k}{a_k}[(\gamma_k+a_k\mu)y_k-\gamma_{k+1}x_k]$, and substituting this into step 2.4 yields

$$
\begin{aligned}
v_{k+1}&=\frac{1}{\gamma_{k+1}}\left\{\frac{1-a_k}{a_k}[(\gamma_k+a_k\mu)y_k-\gamma_{k+1}x_k]+a_k\mu y_k-a_k\nabla f(y_k)\right\}\\
&=\frac{1}{\gamma_{k+1}}\left\{\frac{(1-a_k)\gamma_k}{a_k}y_k+\mu y_k\right\}-\frac{1-a_k}{a_k}x_k-\frac{a_k}{\gamma_{k+1}}\nabla f(y_k)\\
&=x_k+\frac{1}{a_k}(y_k-x_k)-\frac{1}{a_kL}\nabla f(y_k)\\
&=x_k+\frac{1}{a_k}\left[(y_k-x_k)-\frac{1}{L}\nabla f(y_k)\right]\\
&=x_k+\frac{1}{a_k}(x_{k+1}-x_k)
\end{aligned}
$$
Therefore, using $\gamma_{k+2}=(1-a_{k+1})\gamma_{k+1}+a_{k+1}\mu$ and $v_{k+1}-x_{k+1}=\frac{1-a_k}{a_k}(x_{k+1}-x_k)$,

$$
\begin{aligned}
y_{k+1}&=\frac{1}{\gamma_{k+1}+a_{k+1}\mu}(a_{k+1}\gamma_{k+1}v_{k+1}+\gamma_{k+2}x_{k+1})\\
&=x_{k+1}+\frac{a_{k+1}\gamma_{k+1}(v_{k+1}-x_{k+1})}{\gamma_{k+1}+a_{k+1}\mu}\\
&=x_{k+1}+\beta_{k}(x_{k+1}-x_{k})
\end{aligned}
$$

where $\beta_{k}=\frac{a_{k+1}\gamma_{k+1}(1-a_k)}{a_k(\gamma_{k+1}+a_{k+1}\mu)}$.
Next we eliminate $\{\gamma_k\}$, using the equation $a^2_{k}L=(1-a_k)\gamma_k+\mu a_k\equiv \gamma_{k+1}$. Therefore,

$$
\begin{aligned}
\beta_k&=\frac{a_{k+1}\gamma_{k+1}(1-a_k)}{a_k(\gamma_{k+1}+a_{k+1}\mu)}\\
&=\frac{a_{k+1}\gamma_{k+1}(1-a_k)}{a_k(\gamma_{k+1}+a^2_{k+1}L-(1-a_{k+1})\gamma_{k+1})}\\
&=\frac{\gamma_{k+1}(1-a_k)}{a_k(\gamma_{k+1}+a_{k+1}L)}\\
&=\frac{a_k(1-a_k)}{a^2_{k}+a_{k+1}}
\end{aligned}
$$
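As a sanity check on this simplification, the expression for $\beta_k$ in terms of $\gamma_{k+1}$ and the closed form in terms of $a_k$ and $a_{k+1}$ can be compared numerically. The sketch below is illustrative only: the helper `next_a` and the values of `L`, `mu`, and the starting `a` are assumptions. It uses $\gamma_{k+1}=La_k^2$ and obtains $a_{k+1}$ from $a_{k+1}^2=(1-a_{k+1})a_k^2+qa_{k+1}$ with $q=\mu/L$.

```python
import math

def next_a(a, q):
    # Root in (0, 1) of a_next^2 = (1 - a_next) * a^2 + q * a_next.
    b = a * a - q
    return (-b + math.sqrt(b * b + 4.0 * a * a)) / 2.0

# Illustrative (assumed) constants.
L, mu = 10.0, 0.1
q = mu / L
a = 0.5
max_gap = 0.0
for _ in range(20):
    a_next = next_a(a, q)
    gamma_next = L * a * a  # gamma_{k+1} = L * a_k^2
    # beta_k written with gamma_{k+1} and mu.
    beta_gamma = a_next * gamma_next * (1.0 - a) / (a * (gamma_next + a_next * mu))
    # beta_k after eliminating gamma_{k+1}.
    beta_closed = a * (1.0 - a) / (a * a + a_next)
    max_gap = max(max_gap, abs(beta_gamma - beta_closed))
    a = a_next
print(max_gap)
```

The two expressions should agree up to floating-point rounding, so `max_gap` is expected to be tiny.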
Therefore, the algorithm above can be rewritten in the following form:
- Choose $x_0\in \mathbb{R}^n$ and $a_0\in (0,1)$. Set $y_0=x_0$ and $q=\frac{\mu}{L}$;
- $k$-th iteration:
2.1 Compute $f(y_k)$ and $\nabla f(y_k)$, and set $x_{k+1}=y_k-\frac{1}{L}\nabla f(y_k)$;
2.2 Compute $a_{k+1}\in (0,1)$ from the equation $a^2_{k+1}=(1-a_{k+1})a^2_{k}+qa_{k+1}$, and set $\beta_k=\frac{a_k(1-a_k)}{a^2_{k}+a_{k+1}}$, $y_{k+1}=x_{k+1}+\beta_k(x_{k+1}-x_k)$.
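The rewritten scheme needs only $x_k$, $y_k$, and $a_k$, which makes it short to code. The following is a minimal sketch under assumptions: the quadratic test function, the helper names, and the choice $a_0=0.5$ are hypothetical and only serve to illustrate the two steps.

```python
import math

def f(d, x):
    # Assumed test function: f(x) = 0.5 * sum(d_i * x_i^2), minimized at x* = 0.
    return 0.5 * sum(di * xi * xi for di, xi in zip(d, x))

def grad(d, x):
    return [di * xi for di, xi in zip(d, x)]

def simplified_scheme(d, x0, L, mu, a0, n_iter):
    q = mu / L
    x, y, a = list(x0), list(x0), a0  # y_0 = x_0
    for _ in range(n_iter):
        # Step 2.1: gradient step from y_k.
        g = grad(d, y)
        x_next = [yi - gi / L for yi, gi in zip(y, g)]
        # Step 2.2: root in (0, 1) of a_next^2 = (1 - a_next)*a^2 + q*a_next.
        b = a * a - q
        a_next = (-b + math.sqrt(b * b + 4.0 * a * a)) / 2.0
        beta = a * (1.0 - a) / (a * a + a_next)
        y = [xn + beta * (xn - xi) for xn, xi in zip(x_next, x)]
        x, a = x_next, a_next
    return x

d = [0.1, 1.0, 10.0]   # curvatures between mu = 0.1 and L = 10
x0 = [1.0, 1.0, 1.0]
x = simplified_scheme(d, x0, L=10.0, mu=0.1, a0=0.5, n_iter=100)
print(f(d, x))
```

Here $a_0=0.5$ satisfies $a_0\geq\sqrt{\mu/L}$ for the assumed constants; other admissible values of $a_0$ change the momentum schedule $\beta_k$ but not the structure of the loop.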
Theorem: If $a_0\geq \sqrt{\frac{\mu}{L}}$ in the scheme above, then $f(x_k)-f^*\leq \min\left\{\left(1-\sqrt{\frac{\mu}{L}}\right)^k, \frac{4L}{(2\sqrt{L}+k\sqrt{\gamma_0})^2}\right\}\cdot\left[f(x_0)-f^*+\frac{\gamma_0}{2}\|x_0-x^*\|^2\right]$, where $\gamma_0=\frac{a_0(a_0L-\mu)}{1-a_0}$.
If we choose $a_0=\sqrt{\frac{\mu}{L}}$, which corresponds to $\gamma_0=\mu$, then in the algorithm $a_k=\sqrt{\frac{\mu}{L}}$ and $\beta_k=\frac{\sqrt{L}-\sqrt{\mu}}{\sqrt{L}+\sqrt{\mu}}$ for all $k$. The iteration then becomes $x_{k+1}=y_k-\frac{1}{L}\nabla f(y_k)$, $y_{k+1}=x_{k+1}+\frac{\sqrt{L}-\sqrt{\mu}}{\sqrt{L}+\sqrt{\mu}}(x_{k+1}-x_{k})$. This is the scheme proposed by Nesterov in 1983 (paper link).
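The constant-momentum case can be checked and run in a few lines. This sketch is illustrative and rests on assumptions: the quadratic test function, the helper names, and the constants are hypothetical. It first confirms that $a=\sqrt{\mu/L}$ is a fixed point of the recursion $a_{k+1}^2=(1-a_{k+1})a_k^2+qa_{k+1}$ and that $\frac{a_k(1-a_k)}{a_k^2+a_{k+1}}$ then equals $\frac{\sqrt{L}-\sqrt{\mu}}{\sqrt{L}+\sqrt{\mu}}$, and then runs the iteration.

```python
import math

def f(d, x):
    # Assumed test function: f(x) = 0.5 * sum(d_i * x_i^2), minimized at x* = 0.
    return 0.5 * sum(di * xi * xi for di, xi in zip(d, x))

def grad(d, x):
    return [di * xi for di, xi in zip(d, x)]

def constant_momentum(d, x0, L, mu, n_iter):
    beta = (math.sqrt(L) - math.sqrt(mu)) / (math.sqrt(L) + math.sqrt(mu))
    x, y = list(x0), list(x0)
    for _ in range(n_iter):
        g = grad(d, y)
        x_next = [yi - gi / L for yi, gi in zip(y, g)]            # gradient step
        y = [xn + beta * (xn - xi) for xn, xi in zip(x_next, x)]  # extrapolation
        x = x_next
    return x

L, mu = 10.0, 0.1
q = mu / L
a = math.sqrt(q)
# One step of the a_k recursion started at a = sqrt(q).
b = a * a - q
a_next = (-b + math.sqrt(b * b + 4.0 * a * a)) / 2.0
beta_from_a = a * (1.0 - a) / (a * a + a_next)
beta_closed = (math.sqrt(L) - math.sqrt(mu)) / (math.sqrt(L) + math.sqrt(mu))

d = [0.1, 1.0, 10.0]
x0 = [1.0, 1.0, 1.0]
x = constant_momentum(d, x0, L, mu, n_iter=100)
print(abs(a_next - a), abs(beta_from_a - beta_closed), f(d, x))
```

Because $\beta_k$ no longer depends on $k$, this variant needs no per-iteration root computation, at the cost of requiring $\mu>0$ to be known.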
There is also the heavy-ball scheme proposed by Polyak in 1964 (paper link): $x_{t+1}=x_t-a\nabla f(x_t)+\beta (x_t-x_{t-1})$, with $a=\frac{4}{(\sqrt{L}+\sqrt{\mu})^2}$ and $\beta=\left(\frac{\sqrt{L}-\sqrt{\mu}}{\sqrt{L}+\sqrt{\mu}}\right)^2$. Written with two sequences, the update rule of the iteration is $x_{t+1}=y_t-\frac{4}{(\sqrt{L}+\sqrt{\mu})^2}\nabla f(x_t)$, $y_{t+1}=x_{t+1}+\left(\frac{\sqrt{L}-\sqrt{\mu}}{\sqrt{L}+\sqrt{\mu}}\right)^2(x_{t+1}-x_t)$. Unlike Nesterov's scheme, the gradient is evaluated at $x_t$ rather than at the extrapolated point $y_t$.
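The heavy-ball recursion $x_{t+1}=x_t-a\nabla f(x_t)+\beta(x_t-x_{t-1})$ can be sketched as follows. This is an illustrative sketch only: it assumes the step size $a=\frac{4}{(\sqrt{L}+\sqrt{\mu})^2}$ (the squared form that appears in the update rule), the initialization $x_{-1}=x_0$, and a hypothetical quadratic test function with helper names `f`, `grad`, and `heavy_ball`.

```python
import math

def f(d, x):
    # Assumed test function: f(x) = 0.5 * sum(d_i * x_i^2), minimized at x* = 0.
    return 0.5 * sum(di * xi * xi for di, xi in zip(d, x))

def grad(d, x):
    return [di * xi for di, xi in zip(d, x)]

def heavy_ball(d, x0, L, mu, n_iter):
    s = math.sqrt(L) + math.sqrt(mu)
    a = 4.0 / s ** 2                                      # step size
    beta = ((math.sqrt(L) - math.sqrt(mu)) / s) ** 2      # momentum weight
    x_prev, x = list(x0), list(x0)                        # assumed start: x_{-1} = x_0
    for _ in range(n_iter):
        g = grad(d, x)  # gradient at the current iterate, not at an extrapolated point
        x_next = [xi - a * gi + beta * (xi - xp)
                  for xi, gi, xp in zip(x, g, x_prev)]
        x_prev, x = x, x_next
    return x

d = [0.1, 1.0, 10.0]   # curvatures between mu = 0.1 and L = 10
x0 = [1.0, 1.0, 1.0]
x = heavy_ball(d, x0, L=10.0, mu=0.1, n_iter=100)
print(f(d, x))
```

The example uses a quadratic on purpose: this parameter choice is the classical one for quadratic objectives, and the sketch makes no claim about other function classes.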
The FISTA scheme proposed by Beck and Teboulle in 2009 (paper link): $x_{t+1}=y_t-\frac{1}{L}\nabla f(y_t)$, $y_{t+1}=x_{t+1}+\frac{\lambda_t-1}{\lambda_{t+1}}(x_{t+1}-x_t)$, where $\lambda_0=1$ and $\lambda_{t+1}=\frac{1+\sqrt{1+4\lambda_t^2}}{2}$ for all $t \geq 0$.
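The FISTA momentum schedule can be sketched for a smooth objective, where the step from $y_t$ is a plain gradient step. The sketch is hedged in several ways: it assumes the initialization $\lambda_0=1$ and the momentum weight $\frac{\lambda_t-1}{\lambda_{t+1}}$ used in Beck and Teboulle's paper, and the quadratic test function and helper names are hypothetical.

```python
import math

def f(d, x):
    # Assumed test function: f(x) = 0.5 * sum(d_i * x_i^2), minimized at x* = 0.
    return 0.5 * sum(di * xi * xi for di, xi in zip(d, x))

def grad(d, x):
    return [di * xi for di, xi in zip(d, x)]

def fista_smooth(d, x0, L, n_iter):
    x, y, lam = list(x0), list(x0), 1.0  # assumed start: y_0 = x_0, lambda_0 = 1
    for _ in range(n_iter):
        g = grad(d, y)
        x_next = [yi - gi / L for yi, gi in zip(y, g)]  # gradient step from y_t
        lam_next = (1.0 + math.sqrt(1.0 + 4.0 * lam * lam)) / 2.0
        w = (lam - 1.0) / lam_next                      # momentum weight, 0 at t = 0
        y = [xn + w * (xn - xi) for xn, xi in zip(x_next, x)]
        x, lam = x_next, lam_next
    return x

d = [0.1, 1.0, 10.0]   # largest curvature equals L = 10
x0 = [1.0, 1.0, 1.0]
x = fista_smooth(d, x0, L=10.0, n_iter=100)
print(f(d, x))
```

Unlike the constant-momentum scheme, this schedule does not use $\mu$ at all, so it applies when only the smoothness constant $L$ is available.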