[Paper Notes] Harnessing Smoothness to Accelerate Distributed Optimization: the Gradient Tracking Method

Preface

Original paper: Harnessing Smoothness to Accelerate Distributed Optimization.

These are my notes on Qu & Li (2018) [1], summarizing the convergence proof of the discrete-time distributed gradient tracking algorithm.

Problem Setup and Algorithm

Problem setup:

$$\min_{x\in\mathbb R^d} f(x)=\frac{1}{n}\sum_{i=1}^n f_i(x)$$

Assumption 1: each $f_i$ is $\alpha$-strongly convex and $\beta$-smooth, i.e.

$$\begin{aligned} f_i(y)-f_i(x)&\geq \nabla f_i(x)^T(y-x)+\frac{\alpha}{2}\|y-x\|^2\\ \|\nabla f_i(x)-\nabla f_i(y)\|&\leq \beta\|x-y\| \end{aligned}$$

Discrete-time distributed optimization algorithm:

$$\begin{aligned} x_i(t+1)&=\sum_{j=1}^n w_{ij}x_j(t)-\eta s_i(t)\\ s_i(t+1)&=\sum_{j=1}^n w_{ij}s_j(t)+\nabla f_i(x_i(t+1))-\nabla f_i(x_i(t)) \end{aligned}\qquad (1)$$

where $x_i(t)\in \mathbb R^{1\times d}$ and $s_i(t)\in \mathbb R^{1\times d}$ are written as row vectors, and the initialization satisfies $s_i(0)=\nabla f_i(x_i(0))$. As will become clear, $s_i(t)$ tracks the average gradient $\frac{1}{n}\sum_{i=1}^n\nabla f_i(x_i(t))$. This algorithm is called distributed gradient tracking (DGT).
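To make update (1) concrete, here is a small NumPy sketch of DGT on toy quadratic local objectives; the objectives, the ring-graph weight matrix, the step size, and the iteration count are all my illustrative choices, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 3  # 5 agents, dimension 3

# Hypothetical local objectives f_i(x) = 0.5*||A_i x - b_i||^2 + 0.5*||x||^2;
# the ridge term guarantees strong convexity, the 0.5 scaling keeps beta modest.
A = [0.5 * rng.normal(size=(d + 2, d)) for _ in range(n)]
b = [rng.normal(size=d + 2) for _ in range(n)]

def grad(i, x):
    """Gradient of f_i at the point x (a length-d vector)."""
    return A[i].T @ (A[i] @ x - b[i]) + x

# Doubly stochastic weights on a ring graph (lazy averaging with two neighbors)
W = np.zeros((n, n))
for i in range(n):
    W[i, i] = 0.5
    W[i, (i - 1) % n] = 0.25
    W[i, (i + 1) % n] = 0.25

eta = 0.01
x = rng.normal(size=(n, d))                      # rows are x_i(0)
s = np.array([grad(i, x[i]) for i in range(n)])  # s_i(0) = grad f_i(x_i(0))

for t in range(5000):
    x_new = W @ x - eta * s                      # consensus step + tracked-gradient step
    s = W @ s + np.array([grad(i, x_new[i]) for i in range(n)]) \
              - np.array([grad(i, x[i]) for i in range(n)])
    x = x_new

# Minimizer of f = (1/n) sum_i f_i, available in closed form for comparison
H = sum(Ai.T @ Ai for Ai in A) + n * np.eye(d)
x_star = np.linalg.solve(H, sum(Ai.T @ bi for Ai, bi in zip(A, b)))
err = np.max(np.abs(x - x_star))
print(err)
```

All rows of `x` agree with each other and with `x_star` after the loop, illustrating both consensus and optimality.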

Convergence Proof

Assume $W$ is a doubly stochastic matrix. Algorithm (1) in matrix form reads

$$\begin{aligned} x(t+1)&=W x(t)-\eta s(t)\\ s(t+1)&=W s(t)+\nabla(t+1)-\nabla (t) \end{aligned}\qquad (2)$$

where $s(0)=\nabla (0)$, $\nabla(t)\in\mathbb R^{n\times d}$ is the matrix obtained by stacking the gradients $\nabla f_i(x_i(t))$ as rows, and likewise $x(t)\in\mathbb R^{n\times d}$ and $s(t)\in\mathbb R^{n\times d}$ stack the rows $x_i(t)$ and $s_i(t)$.

Define the averages $\bar x(t)=(1/n)1_n^T x(t)$, $\bar s(t)=(1/n)1_n^T s(t)$, and $g(t)=(1/n)1_n^T \nabla(t)$.

Lemma 1 (Lemma 7 in [1]): the following identities hold:

  • $\bar s(t+1)=\bar s(t)+g(t+1)-g(t)=g(t+1)$
  • $\bar x(t+1)=\bar x(t)-\eta \bar s(t)=\bar x(t)-\eta g(t)$

Proof: left-multiply both equations in (2) by $(1/n)1_n^T$ and use $1_n^T W=1_n^T$. This gives $\bar s(t+1)=\bar s(t)+g(t+1)-g(t)$; since $s(0)=\nabla(0)$ implies $\bar s(0)=g(0)$, induction yields $\bar s(t)=g(t)$ for all $t$. Similarly, $\bar x(t+1)=\bar x(t)-\eta g(t)$.
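The invariant $\bar s(t)=g(t)$ is easy to check numerically; the complete-graph weights and quadratic local objectives below are toy choices of mine:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 4, 2
W = np.full((n, n), 1.0 / n)  # complete-graph averaging, doubly stochastic

# Hypothetical local objectives f_i(x) = 0.5*||x - c_i||^2, so grad f_i(x) = x - c_i
C = rng.normal(size=(n, d))

def grads(X):
    """Stacked gradient matrix: row i is grad f_i(x_i)."""
    return X - C

x = rng.normal(size=(n, d))
s = grads(x)   # s(0) = gradient matrix at t = 0
eta = 0.1

for _ in range(10):
    x_new = W @ x - eta * s
    s = W @ s + grads(x_new) - grads(x)
    x = x_new
    # Lemma 1: the average of the trackers equals the average gradient g(t)
    assert np.allclose(s.mean(axis=0), grads(x).mean(axis=0))
```

The assertion holds at every iteration because row-stochasticity of $W^T$ conserves the column sums of $s$.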

We now prove convergence. The strategy is to bound three coupled error terms, namely the gradient tracking error, the consensus error, and the optimality error, and finally to establish convergence of the overall error.

Step 1: the gradient tracking error $\|s(k)-1_ng(k)\|$. From (2),

$$s(k)-1_ng(k)=[W s(k-1)-1_ng(k-1)]+[\nabla(k)-\nabla(k-1)]-1_n[g(k)-g(k-1)].\qquad (3)$$

Let $\sigma\in(0,1)$ be the spectral norm of $W-(1/n)1_n1_n^T$. For any $\omega\in \mathbb R^n$,

$$\|W\omega-(1/n)1_n1_n^T \omega\|=\|(W-(1/n)1_n1_n^T)(\omega-(1/n)1_n1_n^T \omega)\|\leq \sigma\|\omega-(1/n)1_n1_n^T \omega\|.\qquad (4)$$

Taking norms in (3) and applying (4) column by column, together with Lemma 1 (which gives $1_ng(k-1)=(1/n)1_n1_n^Ts(k-1)$), we obtain

$$\begin{aligned} \|s(k)-1_ng(k)\|&\leq \|Ws(k-1)-1_ng(k-1)\|+\|[\nabla(k)-\nabla(k-1)]-1_n[g(k)-g(k-1)]\|\\ &\leq \sigma\|s(k-1)-1_ng(k-1)\|+\|[\nabla(k)-\nabla(k-1)]-1_n[g(k)-g(k-1)]\|. \end{aligned}\qquad (5)$$

For the second term in (5), write $\Delta=\nabla(k)-\nabla(k-1)$ and note that $1_n[g(k)-g(k-1)]=(1/n)1_n1_n^T\Delta$ is the orthogonal projection of $\Delta$ onto the consensus subspace, so

$$\|\Delta-(1/n)1_n1_n^T\Delta\|^2=\|\Delta\|^2-\|(1/n)1_n1_n^T\Delta\|^2\leq \|\Delta\|^2.$$

Substituting into (5) and using the smoothness part of Assumption 1 yields

$$\|s(k)-1_ng(k)\|\leq \sigma\|s(k-1)-1_ng(k-1)\|+\beta\|x(k)-x(k-1)\|.\qquad (6)$$
Step 2: in the same way, the consensus error satisfies, by (4),

$$\begin{aligned} \|x(k)-1_n\bar x(k)\|&= \|Wx(k-1)-\eta s(k-1)-1_n\bar x(k-1)+\eta1_n g(k-1)\|\\ &\leq \sigma\|x(k-1)-1_n\bar x(k-1)\|+\eta\|s(k-1)-1_ng(k-1)\| \end{aligned}\qquad (7)$$

Define $f=\frac{1}{n}\sum_{i=1}^n f_i$, and let $h(t)=\nabla f(\bar x(t))\in\mathbb R^{1\times d}$ denote the gradient of $f$ at the average iterate $\bar x(t)$.

Step 3: the optimality error $\|\bar x-x^*\|$. By Lemma 1,

$$\bar x(k)=\bar x(k-1)-\eta h(k-1)-\eta[g(k-1)-h(k-1)].\qquad (8)$$

Lemma 2 (Lemma 3.11 in [2]): if $f:\mathbb R^d\to\mathbb R$ satisfies Assumption 1, then for all $x,y\in\mathbb R^d$,

$$(\nabla f(x)-\nabla f(y))^T(x-y)\geq \frac{\alpha\beta}{\alpha+\beta}\|x-y\|^2+\frac{1}{\alpha+\beta}\|\nabla f(x)-\nabla f(y)\|^2.$$
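Lemma 2 can be sanity-checked numerically on an assumed quadratic $f(x)=\frac{1}{2}x^TQx$ whose Hessian spectrum lies in $[\alpha,\beta]$; in that case the inequality reduces, per eigendirection with eigenvalue $\mu$, to $(\mu-\alpha)(\mu-\beta)\leq 0$.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 4
alpha, beta = 0.5, 3.0

# Hypothetical quadratic f(x) = 0.5 x^T Q x with eigenvalues inside [alpha, beta]
eig = rng.uniform(alpha, beta, size=d)
U, _ = np.linalg.qr(rng.normal(size=(d, d)))   # random orthogonal basis
Q = U @ np.diag(eig) @ U.T

def gf(x):
    return Q @ x   # gradient of f

for _ in range(100):
    x, y = rng.normal(size=d), rng.normal(size=d)
    lhs = (gf(x) - gf(y)) @ (x - y)
    rhs = (alpha * beta / (alpha + beta)) * np.sum((x - y) ** 2) \
        + (1.0 / (alpha + beta)) * np.sum((gf(x) - gf(y)) ** 2)
    assert lhs >= rhs - 1e-9   # Lemma 2 holds for every sampled pair
```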

Lemma 3: for any $x\in\mathbb R^d$, define $x^+=x-\eta \nabla f(x)$ with $0<\eta<\frac{2}{\beta}$, where $f$ satisfies Assumption 1. Then

$$\|x^+-x^*\|\leq \lambda \|x-x^*\|$$

where $\lambda =\max(|1-\eta \alpha|,|1-\eta \beta|)$.

Proof: suppose first that $0<\eta\leq \frac{2}{\alpha+\beta}$. Then $\frac{2}{\eta}-\alpha\geq\beta$ and $|1-\eta\alpha|\geq |1-\eta\beta|$, so $\lambda=1-\eta\alpha$. Since $f$ is $\beta$-smooth, it is also $(\frac{2}{\eta}-\alpha)$-smooth. Applying Lemma 2 with smoothness constant $\frac{2}{\eta}-\alpha$ (note that $\alpha+(\frac{2}{\eta}-\alpha)=\frac{2}{\eta}$ and $\nabla f(x^*)=0$), we get

$$\begin{aligned} \|x-x^*-\eta \nabla f(x)\|^2&=\|x-x^*\|^2+\eta^2\|\nabla f(x)\|^2-2\eta \nabla f(x)^T(x-x^*)\\ &\leq \left(1-2\eta\frac{\alpha(\frac{2}{\eta}-\alpha)}{\alpha+(\frac{2}{\eta}-\alpha)}\right)\|x-x^*\|^2+\left(\eta^2-\frac{2\eta}{\alpha+(\frac{2}{\eta}-\alpha)}\right)\|\nabla f(x)\|^2\\ &=(1-\eta \alpha)^2\|x-x^*\|^2\\ &=\lambda^2\|x-x^*\|^2 \end{aligned}$$

If instead $\frac{2}{\alpha+\beta}<\eta<\frac{2}{\beta}$, then $|1-\eta\beta|\geq|1-\eta\alpha|$ and $\lambda=\eta\beta-1$; the same analysis with the roles of the two constants exchanged gives the result.
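Lemma 3 can likewise be verified on an assumed quadratic with known minimizer: for $f(x)=\frac{1}{2}(x-c)^TQ(x-c)$ the gradient step is the linear map $x^+-c=(I-\eta Q)(x-c)$, whose operator norm is at most $\lambda$.

```python
import numpy as np

rng = np.random.default_rng(3)
d = 3
alpha, beta = 1.0, 10.0

# Hypothetical f(x) = 0.5 (x-c)^T Q (x-c): alpha-strongly convex, beta-smooth, x* = c
eig = np.array([alpha, 4.0, beta])             # spectrum spans [alpha, beta]
U, _ = np.linalg.qr(rng.normal(size=(d, d)))
Q = U @ np.diag(eig) @ U.T
c = rng.normal(size=d)

def gf(x):
    return Q @ (x - c)

for eta in [0.05, 2.0 / (alpha + beta), 0.19]:  # all inside (0, 2/beta)
    lam = max(abs(1 - eta * alpha), abs(1 - eta * beta))
    for _ in range(50):
        x = c + rng.normal(size=d)
        x_plus = x - eta * gf(x)
        # Lemma 3: contraction toward x* = c with factor lam
        assert np.linalg.norm(x_plus - c) <= lam * np.linalg.norm(x - c) + 1e-9
```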

By Lemma 3 and (8), and since $g(k-1)-h(k-1)=\frac{1}{n}\sum_{i=1}^n[\nabla f_i(x_i(k-1))-\nabla f_i(\bar x(k-1))]$ (so that $\beta$-smoothness and Cauchy–Schwarz give $\|g(k-1)-h(k-1)\|\leq(\beta/\sqrt n)\|x(k-1)-1_n\bar x(k-1)\|$),

$$\begin{aligned} \|\bar x(k)-x^*\|&\leq \lambda \|\bar x(k-1)-x^*\|+\eta \|g(k-1)-h(k-1)\|\\ &\leq \lambda \|\bar x(k-1)-x^*\|+(\eta\beta/\sqrt{n}) \|x(k-1)-1_n\bar x(k-1)\| \end{aligned}\qquad(9)$$

where $\lambda =\max(|1-\eta \alpha|,|1-\eta \beta|)$.

Step 4: a bound on $\|x(k)-x(k-1)\|$. By Assumption 1 and $\nabla f(x^*)=0$,

$$\|h(k-1) \|=\|\nabla f(\bar x(k-1)) \|\leq \beta\|\bar x(k-1)-x^* \|.$$

Combining this with the argument used for (9),

$$\begin{aligned} \|s(k-1)\|&\leq \|s(k-1)-1_ng(k-1) \|+\|1_ng(k-1)-1_n h(k-1) \|+\|1_n h(k-1)\|\\ &\leq \|s(k-1)-1_ng(k-1) \|+\beta\|x(k-1)-1_n \bar x(k-1) \|+\beta \sqrt{n}\|\bar x(k-1)-x^*\|. \end{aligned}$$

Therefore, using $\|W-I\|\leq 2$,

$$\begin{aligned} \|x(k)-x(k-1) \|&=\|Wx(k-1)-x(k-1)-\eta s(k-1) \|\\ &=\|(W-I)(x(k-1)-1_n\bar x(k-1))-\eta s(k-1) \|\\ &\leq 2\|x(k-1)-1_n\bar x(k-1)\|+\eta\|s(k-1) \|\\ &\leq \eta\|s(k-1)-1_ng(k-1) \|+(\eta\beta+2)\|x(k-1)-1_n \bar x(k-1) \|+\eta\beta \sqrt{n}\|\bar x(k-1)-x^*\|. \end{aligned}$$

Substituting this bound into (6) gives

$$\begin{aligned} \|s(k)-1_n g(k)\|&\leq (\sigma+\beta \eta)\|s(k-1)-1_n g(k-1) \|\\ &\quad +\beta(\eta \beta+2)\|x(k-1)-1_n \bar x(k-1) \|+\eta\beta^2 \sqrt{n}\|\bar x(k-1)-x^*\| \end{aligned}\qquad (10)$$
Combining (7), (9), and (10), the error vector $z(k)=\big(\|s(k)-1_ng(k)\|,\ \|x(k)-1_n\bar x(k)\|,\ \sqrt{n}\|\bar x(k)-x^*\|\big)^T$ satisfies the entrywise inequality

$$z(k)\leq G(\eta)\,z(k-1),\qquad G(\eta)=\begin{bmatrix}\sigma+\eta\beta & \beta(\eta\beta+2) & \eta\beta^2\\ \eta & \sigma & 0\\ 0 & \eta\beta & \lambda\end{bmatrix}.$$

Since $z(k)$ and $G(\eta)$ are nonnegative, iterating this inequality directly gives

$$z(k)\leq G(\eta)^kz(0).$$

Because $G(\eta)$ is nonnegative and $G(\eta)^2$ is entrywise positive, every entry of $G(\eta)^k$ is $O(\rho(G(\eta))^k)$ [3], so every entry of $z(k)$ converges at rate $\rho(G(\eta))^k$. It remains to choose $\eta$ so that the spectral radius satisfies $\rho(G(\eta))<1$, which the paper shows holds for sufficiently small step sizes.


References

  1. Qu, G., & Li, N. (2018). Harnessing smoothness to accelerate distributed optimization. IEEE Transactions on Control of Network Systems, 5(3), 1245–1260. https://doi.org/10.1109/TCNS.2017.2698261
  2. Bubeck, S. (2015). Convex optimization: Algorithms and complexity. Foundations and Trends in Machine Learning, 8(3–4), 231–357. https://doi.org/10.1561/2200000050
  3. Horn, R. A., & Johnson, C. R. (2012). Matrix Analysis. Cambridge, U.K.: Cambridge University Press.
