Theorem 1
If a smooth function $f$ satisfies the L-Lipschitz condition (that is, its gradient is $L$-Lipschitz continuous), then for any $x,y\in \mathbb{R}^d$ we have:
$$\|\nabla f(x)-\nabla f(y)\|\leq L\|x-y\|\tag{Theorem 1}$$
Here $L$ is a constant.
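Take linear regression as an example. First write down its loss function, denoted $f(w)$ so that it is not confused with the Lipschitz constant $L$:
$$f(w)=\cfrac{1}{n}\|Xw-y\|^2$$
Here $X$ is the training data and $w$ is the parameter vector.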
Now compute:
$$\begin{aligned}\|\nabla f(x)-\nabla f(y)\|&=\|\nabla f(w_1)-\nabla f(w_2)\|\\ &=\cfrac{2}{n}\|X^T(Xw_1-y)-X^T(Xw_2-y)\|\\ &=\cfrac{2}{n}\|X^TX(w_1-w_2)\|\\ &\leq \cfrac{2}{n}\|X^TX\|\cdot\|w_1-w_2\|\end{aligned}$$
Since $X$ is the training data, it is known. So $\cfrac{2}{n}\|X^TX\|$ is a constant, and it plays the role of the constant $L$ in the L-Lipschitz condition.
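The bound for linear regression can be checked numerically. The following is a minimal sketch, not a definitive implementation: the data are synthetic, the helper names (`dot`, `norm`, `grad`, `lipschitz_bound`) are made up for illustration, and the matrix norm is assumed to be the Frobenius norm, which is an upper bound on the spectral norm and therefore gives a valid but possibly loose constant.

```python
import math
import random


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def norm(v):
    """Euclidean norm of a vector."""
    return math.sqrt(dot(v, v))


def grad(X, y, w):
    """Gradient of f(w) = (1/n) * ||Xw - y||^2, which is (2/n) * X^T (Xw - y)."""
    n = len(X)
    residual = [dot(row, w) - target for row, target in zip(X, y)]
    columns = list(zip(*X))  # the columns of X are the rows of X^T
    return [2.0 / n * dot(col, residual) for col in columns]


def lipschitz_bound(X):
    """(2/n) * ||X^T X||_F (assumption: Frobenius norm, an upper bound
    on the spectral norm, so the constant may be loose)."""
    n = len(X)
    columns = list(zip(*X))
    gram = [[dot(ci, cj) for cj in columns] for ci in columns]  # X^T X
    frobenius = math.sqrt(sum(e * e for row in gram for e in row))
    return 2.0 / n * frobenius


random.seed(0)  # synthetic data, purely illustrative
n, d = 20, 3
X = [[random.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
y = [random.gauss(0.0, 1.0) for _ in range(n)]
L = lipschitz_bound(X)

worst_ratio = 0.0
for _ in range(200):
    w1 = [random.gauss(0.0, 1.0) for _ in range(d)]
    w2 = [random.gauss(0.0, 1.0) for _ in range(d)]
    g1, g2 = grad(X, y, w1), grad(X, y, w2)
    lhs = norm([a - b for a, b in zip(g1, g2)])
    rhs = L * norm([a - b for a, b in zip(w1, w2)])
    worst_ratio = max(worst_ratio, lhs / rhs)

# Theorem 1 predicts lhs <= rhs for every pair, i.e. a ratio of at most 1.
print("largest observed ratio:", worst_ratio)
```

A ratio well below 1 would indicate that the Frobenius-based constant is loose; the spectral norm of $X^TX$ would give the tightest constant of this form.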
Theorem 2
If a smooth function $f(x)$ satisfies the L-Lipschitz condition and is convex, then for any $x,y\in \mathbb{R}^d$ we have:
$$f(y)\leq f(x)+\nabla f(x)^T(y-x)+\cfrac{L}{2}\|y-x\|^2\tag{Theorem 2}$$
Here $L$ is a constant.
Proof:
By the fundamental theorem of calculus, a differentiable function $h$ satisfies:
$$h(1)=h(0)+\int_0^1h'(\tau)\,d\tau\tag{1}$$
Define $h$ as $f$ restricted to the line segment from $x$ to $y$, which reduces the multivariate statement to a one-dimensional one:
$$h(\tau)=f(x+\tau(y-x))$$
Then:
$$h(1)=f(y),\quad h(0)=f(x)\tag{2}$$
Substituting (2) into (1) above:
$$f(y)=f(x)+\int_0^1h'(\tau)\,d\tau$$
Differentiating $h$ with the chain rule for composite functions gives $h'(\tau)=\nabla f(x+\tau(y-x))^T(y-x)$, so:
$$f(y)=f(x)+\int_0^1\nabla f(x+\tau(y-x))^T(y-x)\,d\tau$$
Add the term $\nabla f(x)^T(y-x)$ outside the integral and subtract the same term $\nabla f(x)^T(y-x)$ inside the integral:
$$f(y)=f(x)+\nabla f(x)^T(y-x)+\int_0^1(\nabla f(x+\tau(y-x))-\nabla f(x))^T(y-x)\,d\tau$$
Using Theorem 1, bound the integral term:
$$\begin{aligned}f(y)&=f(x)+\nabla f(x)^T(y-x)+\int_0^1(\nabla f(x+\tau(y-x))-\nabla f(x))^T(y-x)\,d\tau\\ &\leq f(x)+\nabla f(x)^T(y-x)+\int_0^1L\|\tau(y-x)\|\,\|y-x\|\,d\tau\end{aligned}$$
The trailing factor $\|y-x\|$ comes from the Cauchy-Schwarz inequality $a^Tb\leq\|a\|\,\|b\|$. Simplifying and integrating gives:
$$f(y)\leq f(x)+\nabla f(x)^T(y-x)+\cfrac{L}{2}\|y-x\|^2$$
Here $\int_0^1\tau\, d\tau=\cfrac{1}{2}$, and $y-x$ does not depend on the integration variable $\tau$, so $\|y-x\|^2$ can be moved outside the integral. This completes the proof.
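The inequality of Theorem 2 can be sanity-checked on a concrete function. The sketch below assumes the one-dimensional softplus function $f(x)=\log(1+e^x)$, chosen only for illustration: it is convex and its second derivative is at most $1/4$, so $L=1/4$ is a valid constant. The grid of test points is arbitrary.

```python
import math


def f(x):
    """Softplus, log(1 + e^x): convex with an L-Lipschitz gradient, L = 1/4."""
    return math.log1p(math.exp(x))


def df(x):
    """Derivative of softplus, the logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-x))


L = 0.25  # upper bound on f''(x) = sigmoid(x) * (1 - sigmoid(x))

points = [i * 0.5 for i in range(-10, 11)]  # grid on [-5, 5]
violations = 0
for x in points:
    for y in points:
        # Right-hand side of Theorem 2: the quadratic upper bound built at x.
        upper = f(x) + df(x) * (y - x) + 0.5 * L * (y - x) ** 2
        if f(y) > upper + 1e-12:  # small tolerance for rounding error
            violations += 1

print("pairs violating the bound:", violations)
```

Geometrically, the quadratic built at any point $x$ lies on or above the function everywhere, which is what makes a gradient step with a suitable step size safe.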
Corollary 1
In Theorem 2, take $f(x_{i+1})$ as $f(y)$ and $f(x_{i})$ as $f(x)$:
$$f(x_{i+1})\leq f(x_{i})+\nabla f(x_{i})^T(x_{i+1}-x_{i})+\cfrac{L}{2}\|x_{i+1}-x_{i}\|^2$$
By the gradient descent update rule:
$$x_{i+1}=x_{i}-\eta_t \nabla f(x_{i})\to x_{i+1}-x_{i}=-\eta_t \nabla f(x_{i})$$
Therefore:
$$f(x_{i+1})\leq f(x_{i})+\nabla f(x_{i})^T(-1)\eta_t \nabla f(x_{i})+\cfrac{L}{2}\eta_t ^2\|\nabla f(x_{i})\|^2$$
$$f(x_{i+1})\leq f(x_{i})-\eta_t \|\nabla f(x_{i})\|^2+\cfrac{L\eta_t ^2}{2}\|\nabla f(x_{i})\|^2$$
$$f(x_{i+1})\leq f(x_{i})-\eta_t (1-\cfrac{L\eta_t }{2})\|\nabla f(x_{i})\|^2\tag{3}$$
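Inequality (3) can be checked for a single gradient step. This is a rough sketch under stated assumptions: the test function is a diagonal quadratic $f(x)=\frac{1}{2}\sum_j a_jx_j^2$ with made-up curvatures, so that $L=\max_j a_j$, and the starting point and step sizes are arbitrary.

```python
A = [1.0, 4.0, 10.0]  # assumed curvatures of the illustrative quadratic
L = max(A)            # Lipschitz constant of its gradient


def f(x):
    """f(x) = 0.5 * sum_j a_j * x_j^2."""
    return 0.5 * sum(a * xi * xi for a, xi in zip(A, x))


def grad(x):
    """Gradient of f: component j is a_j * x_j."""
    return [a * xi for a, xi in zip(A, x)]


def step_satisfies_bound(x, eta):
    """Check f(x_next) <= f(x) - eta * (1 - L*eta/2) * ||grad f(x)||^2."""
    g = grad(x)
    x_next = [xi - eta * gi for xi, gi in zip(x, g)]
    g_sq = sum(gi * gi for gi in g)
    bound = f(x) - eta * (1.0 - L * eta / 2.0) * g_sq
    return f(x_next) <= bound + 1e-9  # tolerance for rounding error


x = [1.0, -2.0, 0.5]
etas = [0.01, 0.05, 0.1, 0.15, 0.25]
all_hold = all(step_satisfies_bound(x, eta) for eta in etas)
print("bound (3) holds for every tested step size:", all_hold)
```

The bound itself holds for any step size, but it only guarantees a decrease in $f$ when $1-\frac{L\eta_t}{2}>0$, that is, when $\eta_t<\frac{2}{L}$.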
Supplementary note: Convergence Analysis of Gradient Descent
Convergence analysis of the number of iterations needed by iterative gradient descent
Theorem
Suppose the function satisfies the L-Lipschitz condition (condition 1) and is convex (condition 2), and let $x^*=\arg\min f(x)$. Then for a step size $\eta_t\leq\cfrac{1}{L}$ ($L$ is a constant), gradient descent satisfies:
$$f(x_k)\leq f(x^*)+\cfrac{\|x_0-x^*\|^2_2}{2\eta_tk}$$
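After $k=\cfrac{L\|x_0-x^*\|^2_2}{\varepsilon}$ iterations, we are guaranteed to reach an $\varepsilon$-approximate optimal value (with $\eta_t=\cfrac{1}{L}$).
$x_k$ is the value of $x$ at the $k$-th iteration.
As $x_k$ gradually approaches $x^*$, the last term on the right-hand side of the inequality shrinks as $k$ grows; the faster it shrinks, the faster the convergence. For example, when comparing two schemes A and B:
Scheme B converges more slowly than scheme A, so scheme A is the better one.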
Now look more closely at $\cfrac{\|x_0-x^*\|^2_2}{2\eta_tk}$. The starting point $x_0$ is fixed, and $x^*$, the optimal solution, is fixed as well, so the numerator is constant. In the denominator, $2\eta_t$ is also constant, so the whole term shrinks as $k$ grows.
Setting $k=\cfrac{L\|x_0-x^*\|^2_2}{\varepsilon}$ ties the iteration count to a target accuracy $\varepsilon$. Substituting it into the term above:
$$\cfrac{\|x_0-x^*\|^2_2}{2\eta_tk}=\cfrac{\|x_0-x^*\|^2_2}{2\eta_t\cdot\cfrac{L\|x_0-x^*\|^2_2}{\varepsilon}}=\cfrac{\varepsilon}{2\eta_tL}$$
Setting the step size to $\eta_t=\cfrac{1}{L} \to L=\cfrac{1}{\eta_t}$ and substituting into the expression above:
$$\cfrac{\varepsilon}{2\eta_tL}=\cfrac{\varepsilon}{2\eta_t\cfrac{1}{\eta_t}}=\cfrac{\varepsilon}{2}$$
In summary, when $k=\cfrac{L\|x_0-x^*\|^2_2}{\varepsilon}$:
$$f(x_k)\leq f(x^*)+\cfrac{\varepsilon}{2}$$
This is also written as:
$$f(x_k)\leq f(x^*)+O(\varepsilon)$$
When $\varepsilon$ is small, the gap between $f(x_k)$ and $f(x^*)$ is small as well.
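The iteration count can be tried out on a small problem. The sketch below is illustrative only: it assumes a diagonal quadratic with made-up curvatures (so $x^*=0$ and $L$ is the largest curvature), an arbitrary starting point and target accuracy, and it rounds $k$ up to an integer.

```python
import math

A = [1.0, 4.0, 10.0]         # assumed curvatures of the illustrative quadratic
L = max(A)                   # Lipschitz constant of the gradient
x_star = [0.0, 0.0, 0.0]     # minimizer of this particular f


def f(x):
    """f(x) = 0.5 * sum_j a_j * x_j^2."""
    return 0.5 * sum(a * xi * xi for a, xi in zip(A, x))


def grad(x):
    """Gradient of f: component j is a_j * x_j."""
    return [a * xi for a, xi in zip(A, x)]


def gradient_descent(x0, eta, steps):
    """Run `steps` iterations of x <- x - eta * grad f(x)."""
    x = list(x0)
    for _ in range(steps):
        x = [xi - eta * gi for xi, gi in zip(x, grad(x))]
    return x


x0 = [1.0, -2.0, 0.5]
eps = 0.1                    # target accuracy (arbitrary choice)
eta = 1.0 / L                # step size from the theorem
dist_sq = sum((a - b) ** 2 for a, b in zip(x0, x_star))
k = math.ceil(L * dist_sq / eps)  # k = L * ||x0 - x*||^2 / eps, rounded up

x_k = gradient_descent(x0, eta, k)
gap = f(x_k) - f(x_star)
print("iterations:", k, "gap:", gap, "guaranteed bound:", eps / 2)
```

On an easy problem like this one the observed gap is typically far smaller than $\varepsilon/2$, since the theorem is a worst-case guarantee over all convex functions with the same $L$.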
The supplementary note assumes $\eta_t\leq\cfrac{1}{L}$, which gives $\cfrac{L\eta_t}{2}\leq\cfrac{1}{2}$. Substituting this condition into (3):
$$\begin{aligned}f(x_{i+1})&\leq f(x_{i})-\eta_t (1-\cfrac{L\eta_t }{2})\|\nabla f(x_{i})\|^2\\ &\leq f(x_{i})-\eta_t (1-\cfrac{L\cfrac{1}{L} }{2})\|\nabla f(x_{i})\|^2=f(x_{i})-\cfrac{\eta_t }{2}\|\nabla f(x_{i})\|^2\end{aligned}$$
Corollary 2
From Corollary 1:
$$f(x_{i+1})\leq f(x_{i})-\cfrac{\eta_t }{2}\|\nabla f(x_{i})\|^2\tag{4}$$
By the first-order convexity condition of a convex function, $f(x^*)\geq f(x_i)+\nabla f(x_i)^T(x^*-x_i)$ (image source, per its watermark: https://zhuanlan.zhihu.com/p/57652786).
We expand the first term on the right-hand side of (4) using the equivalent form $f(x_i)\leq f(x^*)+\nabla f(x_i)^T(x_i-x^*)$:
$$\begin{aligned}f(x_{i+1})&\leq f(x_{i})-\cfrac{\eta_t }{2}\|\nabla f(x_{i})\|^2\\ &\leq f(x^*)+\nabla f(x_i)^T(x_i-x^*)-\cfrac{\eta_t }{2}\|\nabla f(x_{i})\|^2\end{aligned}\tag{5}$$
The next step uses the gradient descent update rule, rearranged in two ways:
$$x_{i+1}=x_i-\eta_t\nabla f(x_{i})\to \nabla f(x_{i})=\cfrac{x_{i}-x_{i+1}}{\eta_t}$$
$$x_{i+1}=x_i-\eta_t\nabla f(x_{i})\to \eta_t\nabla f(x_{i})=x_{i}-x_{i+1}$$
Substituting into (5):
$$\begin{aligned}f(x_{i+1})&\leq f(x_{i})-\cfrac{\eta_t }{2}\|\nabla f(x_{i})\|^2\\ &\leq f(x^*)+\cfrac{(x_{i}-x_{i+1})^T}{\eta_t}(x_i-x^*)-\cfrac{\eta_t }{2}\left\|\cfrac{x_{i}-x_{i+1}}{\eta_t}\right\|^2\\ &=f(x^*)+\cfrac{(x_{i}-x_{i+1})^T}{\eta_t}(x_i-x^*)-\cfrac{1 }{2\eta_t}\|x_{i}-x_{i+1}\|^2\\ &=f(x^*)+\cfrac{2(x_{i}^2-x_ix^*-x_ix_{i+1}+x_{i+1}x^*)}{2\eta_t}-\cfrac{x_i^2-2x_ix_{i+1}+x^2_{i+1} }{2\eta_t}\end{aligned}$$
In the last line, products such as $x_ix^*$ denote inner products. Adding and subtracting $\cfrac{1}{2\eta_t}\|x_i-x^*\|^2$, then regrouping and using $x_{i+1}=x_i-\eta_t\nabla f(x_i)$, gives:
$$\begin{aligned}f(x_{i+1})&\leq f(x^*)+\cfrac{1 }{2\eta_t}\|x_i-x^*\|^2-\cfrac{1 }{2\eta_t}\left (\|x_i-x^*\|^2-2\eta_t\nabla f(x_{i})^T(x_i-x^*)+ \|\eta_t\nabla f(x_{i})\|^2\right )\\ &=f(x^*)+\cfrac{1 }{2\eta_t}\|x_i-x^*\|^2-\cfrac{1 }{2\eta_t}\|x_i-x^*-\eta_t\nabla f(x_{i})\|^2\\ &=f(x^*)+\cfrac{1 }{2\eta_t}\|x_i-x^*\|^2-\cfrac{1 }{2\eta_t}\|x_{i+1}-x^*\|^2\\ &=f(x^*)+\cfrac{1 }{2\eta_t}(\|x_i-x^*\|^2-\|x_{i+1}-x^*\|^2)\end{aligned}$$
This completes Corollary 2. The result is:
$$f(x_{i+1})\leq f(x^*)+\cfrac{1 }{2\eta_t}(\|x_i-x^*\|^2-\|x_{i+1}-x^*\|^2)\tag{Corollary 2}$$
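The per-step inequality of Corollary 2 can be verified along an actual gradient descent run. The sketch below assumes the one-dimensional function $f(x)=\log\cosh(x)$, picked only as an example: it is convex, its second derivative is at most 1 (so $L=1$), and its minimizer is $x^*=0$ with $f(x^*)=0$. The starting point and iteration count are arbitrary.

```python
import math


def f(x):
    """log(cosh(x)): convex, minimized at x = 0 with value 0."""
    return math.log(math.cosh(x))


def df(x):
    """Derivative of log(cosh(x)) is tanh(x); its slope is at most 1."""
    return math.tanh(x)


L = 1.0
eta = 1.0 / L            # step size allowed by the theorem
x_star, f_star = 0.0, 0.0

x = 3.0                  # arbitrary starting point
holds = True
for _ in range(30):
    x_next = x - eta * df(x)
    lhs = f(x_next)
    # Right-hand side of Corollary 2 for this step.
    rhs = f_star + ((x - x_star) ** 2 - (x_next - x_star) ** 2) / (2.0 * eta)
    holds = holds and (lhs <= rhs + 1e-12)  # tolerance for rounding error
    x = x_next

print("Corollary 2 held at every step:", holds)
```

Because the right-hand side is a difference of squared distances to $x^*$, the inequality also implies that each step moves the iterate no farther from $x^*$ than before.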
Corollary 3
Rearranging Corollary 2:
$$f(x_{i+1})-f(x^*)\leq \cfrac{1 }{2\eta_t}(\|x_i-x^*\|^2-\|x_{i+1}-x^*\|^2)$$
Now start from $i=0$:
$$f(x_{1})-f(x^*)\leq \cfrac{1 }{2\eta_t}(\|x_0-x^*\|^2-\|x_{1}-x^*\|^2)$$
$$f(x_{2})-f(x^*)\leq \cfrac{1 }{2\eta_t}(\|x_1-x^*\|^2-\|x_{2}-x^*\|^2)$$
$$f(x_{3})-f(x^*)\leq \cfrac{1 }{2\eta_t}(\|x_2-x^*\|^2-\|x_{3}-x^*\|^2)$$
And so on, up to:
$$f(x_{k})-f(x^*)\leq \cfrac{1 }{2\eta_t}(\|x_{k-1}-x^*\|^2-\|x_{k}-x^*\|^2)$$
Now sum the left-hand sides and the right-hand sides of these inequalities separately. On the right, the intermediate terms cancel in pairs (a telescoping sum), leaving only the first and the last:
$$\sum_{i=1}^kf(x_{i})-kf(x^*)\leq \cfrac{1 }{2\eta_t}(\|x_0-x^*\|^2-\|x_{k}-x^*\|^2)$$
Relax the right-hand side slightly by dropping the subtracted squared term:
$$\sum_{i=1}^kf(x_{i})-kf(x^*)\leq \cfrac{1 }{2\eta_t}\|x_0-x^*\|^2\tag{6}$$
From the conclusion of Corollary 1:
$$f(x_{i+1})\leq f(x_{i})-\cfrac{\eta_t }{2}\|\nabla f(x_{i})\|^2$$
together with the two facts $\cfrac{\eta_t }{2}\geq0$ and $\|\nabla f(x_{i})\|^2\geq0$, it follows that:
$$f(x_{i+1})\leq f(x_{i})$$
Note: the inequality above is not guaranteed for stochastic gradient descent, where a single step can increase $f$.
We can therefore write:
$$f(x_{k})\leq f(x_{k-1})\leq f(x_{k-2})\leq\cdots\leq f(x_{0})$$
Using this, we shrink the left-hand side of (6) by replacing every term of the sum with the smallest one, $f(x_k)$:
$$kf(x_{k})-kf(x^*)\leq \sum_{i=1}^kf(x_{i})-kf(x^*)$$
Rearranging:
$$kf(x_{k})-kf(x^*)\leq \cfrac{1 }{2\eta_t}\|x_0-x^*\|^2$$
$$f(x_{k})-f(x^*)\leq \frac{\|x_0-x^*\|^2}{2\eta_tk}$$
This concludes the proof of the gradient descent convergence result stated in the supplementary note.
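The whole chain of Corollary 3 (the telescoped sum, the monotone decrease of $f(x_k)$, and the final $1/k$ bound) can be checked together on a toy problem. This is only a sketch under assumptions: the objective is a diagonal quadratic with made-up curvatures, so $x^*=0$ and $f(x^*)=0$, and the starting point and the number of iterations are arbitrary.

```python
A = [1.0, 4.0, 10.0]  # assumed curvatures of the illustrative quadratic
L = max(A)
eta = 1.0 / L         # step size from the theorem
f_star = 0.0          # minimum value, attained at x* = 0
TOL = 1e-12           # tolerance for rounding error


def f(x):
    """f(x) = 0.5 * sum_j a_j * x_j^2."""
    return 0.5 * sum(a * xi * xi for a, xi in zip(A, x))


def grad(x):
    """Gradient of f: component j is a_j * x_j."""
    return [a * xi for a, xi in zip(A, x)]


def sq_dist_to_opt(x):
    """||x - x*||^2 with x* = 0."""
    return sum(xi * xi for xi in x)


x = [1.0, -2.0, 0.5]
d0 = sq_dist_to_opt(x)
f_prev = f(x)
running_sum = 0.0     # sum over i = 1..k of (f(x_i) - f(x*))
checks = []
for k in range(1, 51):
    x = [xi - eta * gi for xi, gi in zip(x, grad(x))]
    fk = f(x)
    running_sum += fk - f_star
    # Telescoped sum, before the last squared term is dropped.
    telescoped = running_sum <= (d0 - sq_dist_to_opt(x)) / (2.0 * eta) + TOL
    # Monotone decrease: f(x_k) <= f(x_{k-1}).
    monotone = fk <= f_prev + TOL
    # Final rate: f(x_k) - f(x*) <= ||x_0 - x*||^2 / (2 * eta * k).
    rate = fk - f_star <= d0 / (2.0 * eta * k) + TOL
    checks.append(telescoped and monotone and rate)
    f_prev = fk

all_ok = all(checks)
print("all three inequalities held for k = 1..50:", all_ok)
```

The monotonicity check is the step that relies on full-batch gradients; with noisy gradient estimates the same loop would not be expected to pass it at every iteration.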