凸优化和梯度下降

@(Aaron) [机器学习, 凸优化和梯度下降]

主要内容包括:

  • 深度学习中的优化问题和凸性介绍
  • 介绍梯度下降、随机梯度下降和小批量梯度下降的原理及实现

凸优化

优化与深度学习

  • 优化与估计

  尽管优化方法可以最小化深度学习中的损失函数值,但本质上优化方法达到的目标与深度学习的目标并不相同。

  • 优化方法目标:训练集损失函数值
  • 深度学习目标:测试集损失函数值(泛化性)

%matplotlib inline
import sys
sys.path.append('/home/kesci/input')
import d2lzh1981 as d2l
from mpl_toolkits import mplot3d # 三维画图
import numpy as np

def f(x): return x * np.cos(np.pi * x)
def g(x): return f(x) + 0.2 * np.cos(5 * np.pi * x)

d2l.set_figsize((5, 3))
x = np.arange(0.5, 1.5, 0.01)
fig_f, = d2l.plt.plot(x, f(x),label="train error")
fig_g, = d2l.plt.plot(x, g(x),'--', c='purple', label="test error")
fig_f.axes.annotate('empirical risk', (1.0, -1.2), (0.5, -1.1),arrowprops=dict(arrowstyle='->'))
fig_g.axes.annotate('expected risk', (1.1, -1.05), (0.95, -0.5),arrowprops=dict(arrowstyle='->'))
d2l.plt.xlabel('x')
d2l.plt.ylabel('risk')
d2l.plt.legend(loc="upper right")

在这里插入图片描述

  • 优化在深度学习中的挑战
      1. 局部最小值
      2. 鞍点
      3. 梯度消失

f ( x ) = x cos ⁡ π x f(x) = x\cos \pi x f(x)=xcosπx


def f(x):
    return x * np.cos(np.pi * x)

d2l.set_figsize((4.5, 2.5))
x = np.arange(-1.0, 2.0, 0.1)
fig,  = d2l.plt.plot(x, f(x))
fig.axes.annotate('local minimum', xy=(-0.3, -0.25), xytext=(-0.77, -1.0),
                  arrowprops=dict(arrowstyle='->'))
fig.axes.annotate('global minimum', xy=(1.1, -0.95), xytext=(0.6, 0.8),
                  arrowprops=dict(arrowstyle='->'))
d2l.plt.xlabel('x')
d2l.plt.ylabel('f(x)');

在这里插入图片描述
局 部 最 小 值 局部最小值

  鞍点是对所有自变量一阶偏导数都为0,且Hessian矩阵特征值有正有负的点


x = np.arange(-2.0, 2.0, 0.1)
fig, = d2l.plt.plot(x, x**3)
fig.axes.annotate('saddle point', xy=(0, -0.2), xytext=(-0.52, -5.0),
                  arrowprops=dict(arrowstyle='->'))
d2l.plt.xlabel('x')
d2l.plt.ylabel('f(x)');

在这里插入图片描述

一 维 鞍 点 一维鞍点

A = [ ∂ 2 f ∂ x 1 2 ∂ 2 f ∂ x 1 ∂ x 2 ⋯ ∂ 2 f ∂ x 1 ∂ x n ∂ 2 f ∂ x 2 ∂ x 1 ∂ 2 f ∂ x 2 2 ⋯ ∂ 2 f ∂ x 2 ∂ x n ⋮ ⋮ ⋱ ⋮ ∂ 2 f ∂ x n ∂ x 1 ∂ 2 f ∂ x n ∂ x 2 ⋯ ∂ 2 f ∂ x n 2 ] A=\left[\begin{array}{cccc}{\frac{\partial^{2} f}{\partial x_{1}^{2}}} & {\frac{\partial^{2} f}{\partial x_{1} \partial x_{2}}} & {\cdots} & {\frac{\partial^{2} f}{\partial x_{1} \partial x_{n}}} \\ {\frac{\partial^{2} f}{\partial x_{2} \partial x_{1}}} & {\frac{\partial^{2} f}{\partial x_{2}^{2}}} & {\cdots} & {\frac{\partial^{2} f}{\partial x_{2} \partial x_{n}}} \\ {\vdots} & {\vdots} & {\ddots} & {\vdots} \\ {\frac{\partial^{2} f}{\partial x_{n} \partial x_{1}}} & {\frac{\partial^{2} f}{\partial x_{n} \partial x_{2}}} & {\cdots} & {\frac{\partial^{2} f}{\partial x_{n}^{2}}}\end{array}\right] A=x122fx2x12fxnx12fx1x22fx222fxnx22fx1xn2fx2xn2fxn22f
海 森 矩 阵 海森矩阵


x, y = np.mgrid[-1: 1: 31j, -1: 1: 31j]
z = x**2 - y**2

d2l.set_figsize((6, 4))
ax = d2l.plt.figure().add_subplot(111, projection='3d')
ax.plot_wireframe(x, y, z, **{'rstride': 2, 'cstride': 2})
ax.plot([0], [0], [0], 'ro', markersize=10)
ticks = [-1,  0, 1]
d2l.plt.xticks(ticks)
d2l.plt.yticks(ticks)
ax.set_zticks(ticks)
d2l.plt.xlabel('x')
d2l.plt.ylabel('y');

在这里插入图片描述

二 维 鞍 点 二维鞍点


x = np.arange(-2.0, 5.0, 0.01)
fig, = d2l.plt.plot(x, np.tanh(x))
d2l.plt.xlabel('x')
d2l.plt.ylabel('f(x)')
fig.axes.annotate('vanishing gradient', (4, 1), (2, 0.0) ,arrowprops=dict(arrowstyle='->'))

在这里插入图片描述

梯 度 消 失 梯度消失
  梯度消失是传统神经网络训练中非常致命的一个问题,其本质是由于链式法则的乘法
特性导致的。比如来考虑深度学习之前在神经网络中最流行激活函数之一Sigmoid, 其表
达式和导数如下:

σ ( x ) = 1 1 + e − x σ ′ ( x ) = e − x ( 1 + e − x ) 2 = 1 1 + e − x ( 1 − 1 1 + e − x ) = σ ( x ) ( 1 − σ ( x ) ) \begin{array}{l} {\sigma(x)=\frac{1}{1+\mathrm{e}^{-x}}} \\ {\sigma^{\prime}(x)=\frac{\mathrm{e}^{-x}}{\left(1+\mathrm{e}^{-x}\right)^{2}}=\frac{1}{1+\mathrm{e}^{-x}}\left(1-\frac{1}{1+\mathrm{e}^{-x}}\right)=\sigma(x)(1-\sigma(x))} \end{array} σ(x)=1+ex1σ(x)=(1+ex)2ex=1+ex1(11+ex1)=σ(x)(1σ(x))

  对于Sigmoid, 导数的最大值在输入为0 处,值为0.25 。考虑一个激活函数都是S i gmoid 的多层神经网络,则梯度向后传导时,每经过一个Sigmoid 层就需要乘以一个小于0 . 25 的梯度。而每乘一个梯度, 则梯度的值又变得更小一些。况且在优化的过程中,每个激活层输入都在0 附近的概率非常非常低。也就是说随着层数的加深, 梯度的衰减会非常大, 迅速接近0,这就是梯度消失问题。

凸性 (Convexity)

  • 集合

在这里插入图片描述
在这里插入图片描述在这里插入图片描述

  • 凸函数条件

λ f ( x ) + ( 1 − λ ) f ( x ′ ) ≥ f ( λ x + ( 1 − λ ) x ′ ) \lambda f(x)+(1-\lambda) f\left(x^{\prime}\right) \geq f\left(\lambda x+(1-\lambda) x^{\prime}\right) λf(x)+(1λ)f(x)f(λx+(1λ)x)


def f(x):
    return 0.5 * x**2  # Convex

def g(x):
    return np.cos(np.pi * x)  # Nonconvex

def h(x):
    return np.exp(0.5 * x)  # Convex

x, segment = np.arange(-2, 2, 0.01), np.array([-1.5, 1])
d2l.use_svg_display()
_, axes = d2l.plt.subplots(1, 3, figsize=(9, 3))

for ax, func in zip(axes, [f, g, h]):
    ax.plot(x, func(x))
    ax.plot(segment, func(segment),'--', color="purple")
    # d2l.plt.plot([x, segment], [func(x), func(segment)], axes=ax)

在这里插入图片描述

  • Jensen 不等式

Jensen不等式是关于凸函数性质的不等式,它和凸函数的定义是息息相关的。凸函数的条件式是说凸函数任意两点的割线位于函数图形上方, 这也是Jensen不等式的两点形式。

若对于任意点集 x i {x_{i}} xi,若 λ i ≥ 0 \lambda_{i} \geq 0 λi0 ∑ i λ i = 1 \sum_{i} \lambda_{i}=1 iλi=1,使用数学归纳法,可以证明凸函数 f (x) 满足:

f ( ∑ i = 1 M λ i x i ) ≤ ∑ i = 1 M λ i f ( x i ) f\left(\sum_{i=1}^{M} \lambda_{i} x_{i}\right) \leq \sum_{i=1}^{M} \lambda_{i} f\left(x_{i}\right) f(i=1Mλixi)i=1Mλif(xi)

上式称为 Jensen 不等式,它是凸函数条件式的泛化形式。在概率论中,如果把 λ i \lambda_{i} λi看成取值为 x i x_{i} xi的离散变量 x x x 的概率分布,那么上式就可以写成如下期望式。

∑ i α i f ( x i ) ≥ f ( ∑ i α i x i )  and  E x [ f ( x ) ] ≥ f ( E x [ x ] ) \sum_{i} \alpha_{i} f\left(x_{i}\right) \geq f\left(\sum_{i} \alpha_{i} x_{i}\right) \text { and } E_{x}[f(x)] \geq f\left(E_{x}[x]\right) iαif(xi)f(iαixi) and Ex[f(x)]f(Ex[x])

  • 性质
      1. 无局部极小值
      2. 与凸集的关系
      3. 二阶条件

  无局部极小值:

  证明:假设存在 x ∈ X x \in X xX 是局部最小值,则存在全局最小值 x ′ ∈ X x' \in X xX, 使得 f ( x ) > f ( x ′ ) f(x) > f(x') f(x)>f(x), 则对 λ ∈ ( 0 , 1 ] \lambda \in(0,1] λ(0,1]:

f ( x ) > λ f ( x ) + ( 1 − λ ) f ( x ′ ) ≥ f ( λ x + ( 1 − λ ) x ′ ) f(x)>\lambda f(x)+(1-\lambda) f(x^{\prime}) \geq f(\lambda x+(1-\lambda) x^{\prime}) f(x)>λf(x)+(1λ)f(x)f(λx+(1λ)x)

  与凸集的关系

  结论:对于凸函数 f ( x ) f(x) f(x),定义集合 S b : = { x ∣ x ∈ X  and  f ( x ) ≤ b } S_{b}:=\{x | x \in X \text { and } f(x) \leq b\} Sb:={xxX and f(x)b},则集合 S b S_b Sb 为凸集

  证明:对于点 x , x ′ ∈ S b x,x' \in S_b x,xSb, 有 f ( λ x + ( 1 − λ ) x ′ ) ≤ λ f ( x ) + ( 1 − λ ) f ( x ′ ) ≤ b f\left(\lambda x+(1-\lambda) x^{\prime}\right) \leq \lambda f(x)+(1-\lambda) f\left(x^{\prime}\right) \leq b f(λx+(1λ)x)λf(x)+(1λ)f(x)b, 故 λ x + ( 1 − λ ) x ′ ∈ S b \lambda x+(1-\lambda) x^{\prime} \in S_{b} λx+(1λ)xSb

f ( x , y ) = 0.5 x 2 + cos ⁡ ( 2 π y ) f(x, y)=0.5 x^{2}+\cos (2 \pi y) f(x,y)=0.5x2+cos(2πy)


x, y = np.meshgrid(np.linspace(-1, 1, 101), np.linspace(-1, 1, 101),
                   indexing='ij')

z = x**2 + 0.5 * np.cos(2 * np.pi * y)

# Plot the 3D surface
d2l.set_figsize((6, 4))
ax = d2l.plt.figure().add_subplot(111, projection='3d')
ax.plot_wireframe(x, y, z, **{'rstride': 10, 'cstride': 10})
ax.contour(x, y, z, offset=-1)
ax.set_zlim(-1, 1.5)

# Adjust labels
for func in [d2l.plt.xticks, d2l.plt.yticks, ax.set_zticks]:
    func([-1, 0, 1])

  凸函数与二阶导数
  结论: f ′ ′ ( x ) ≥ 0 ⟺ f ( x ) f^{''}(x) \ge 0 \Longleftrightarrow f(x) f(x)0f(x) 是凸函数

  必要性 ( ⇐ \Leftarrow ):

  对于凸函数:

1 2 f ( x + ϵ ) + 1 2 f ( x − ϵ ) ≥ f ( x + ϵ 2 + x − ϵ 2 ) = f ( x ) \frac{1}{2} f(x+\epsilon)+\frac{1}{2} f(x-\epsilon) \geq f\left(\frac{x+\epsilon}{2}+\frac{x-\epsilon}{2}\right)=f(x) 21f(x+ϵ)+21f(xϵ)f(2x+ϵ+2xϵ)=f(x)

  故:

f ′ ′ ( x ) = lim ⁡ ε → 0 f ( x + ϵ ) − f ( x ) ϵ − f ( x ) − f ( x − ϵ ) ϵ ϵ f^{\prime \prime}(x)=\lim _{\varepsilon \rightarrow 0} \frac{\frac{f(x+\epsilon) - f(x)}{\epsilon}-\frac{f(x) - f(x-\epsilon)}{\epsilon}}{\epsilon} f(x)=ε0limϵϵf(x+ϵ)f(x)ϵf(x)f(xϵ)

f ′ ′ ( x ) = lim ⁡ ε → 0 f ( x + ϵ ) + f ( x − ϵ ) − 2 f ( x ) ϵ 2 ≥ 0 f^{\prime \prime}(x)=\lim _{\varepsilon \rightarrow 0} \frac{f(x+\epsilon)+f(x-\epsilon)-2 f(x)}{\epsilon^{2}} \geq 0 f(x)=ε0limϵ2f(x+ϵ)+f(xϵ)2f(x)0

  充分性 ( ⇒ \Rightarrow ):

a < x < b a < x < b a<x<b f ( x ) f(x) f(x) 上的三个点,由拉格朗日中值定理:

f ( x ) − f ( a ) = ( x − a ) f ′ ( α )  for some  α ∈ [ a , x ]  and  f ( b ) − f ( x ) = ( b − x ) f ′ ( β )  for some  β ∈ [ x , b ] \begin{array}{l}{f(x)-f(a)=(x-a) f^{\prime}(\alpha) \text { for some } \alpha \in[a, x] \text { and }} \\ {f(b)-f(x)=(b-x) f^{\prime}(\beta) \text { for some } \beta \in[x, b]}\end{array} f(x)f(a)=(xa)f(α) for some α[a,x] and f(b)f(x)=(bx)f(β) for some β[x,b]

  根据单调性,有 f ′ ( β ) ≥ f ′ ( α ) f^{\prime}(\beta) \geq f^{\prime}(\alpha) f(β)f(α), 故:

f ( b ) − f ( a ) = f ( b ) − f ( x ) + f ( x ) − f ( a ) = ( b − x ) f ′ ( β ) + ( x − a ) f ′ ( α ) ≥ ( b − a ) f ′ ( α ) \begin{aligned} f(b)-f(a) &=f(b)-f(x)+f(x)-f(a) \\ &=(b-x) f^{\prime}(\beta)+(x-a) f^{\prime}(\alpha) \\ & \geq(b-a) f^{\prime}(\alpha) \end{aligned} f(b)f(a)=f(b)f(x)+f(x)f(a)=(bx)f(β)+(xa)f(α)(ba)f(α)


def f(x):
    return 0.5 * x**2

x = np.arange(-2, 2, 0.01)
axb, ab = np.array([-1.5, -0.5, 1]), np.array([-1.5, 1])

d2l.set_figsize((3.5, 2.5))
fig_x, = d2l.plt.plot(x, f(x))
fig_axb, = d2l.plt.plot(axb, f(axb), '-.',color="purple")
fig_ab, = d2l.plt.plot(ab, f(ab),'g-.')

fig_x.axes.annotate('a', (-1.5, f(-1.5)), (-1.5, 1.5),arrowprops=dict(arrowstyle='->'))
fig_x.axes.annotate('b', (1, f(1)), (1, 1.5),arrowprops=dict(arrowstyle='->'))
fig_x.axes.annotate('x', (-0.5, f(-0.5)), (-1.5, f(-0.5)),arrowprops=dict(arrowstyle='->'))

在这里插入图片描述

梯度下降

Boyd & Vandenberghe, 2004


%matplotlib inline
import numpy as np
import torch
import time
from torch import nn, optim
import math
import sys
sys.path.append('/home/kesci/input')
import d2lzh1981 as d2l

一维梯度下降

证明:沿梯度反方向移动自变量可以减小函数值

  泰勒展开:

f ( x + ϵ ) = f ( x ) + ϵ f ′ ( x ) + O ( ϵ 2 ) f(x+\epsilon)=f(x)+\epsilon f^{\prime}(x)+\mathcal{O}\left(\epsilon^{2}\right) f(x+ϵ)=f(x)+ϵf(x)+O(ϵ2)

  代入沿梯度方向的移动量 η f ′ ( x ) \eta f^{\prime}(x) ηf(x)

f ( x − η f ′ ( x ) ) = f ( x ) − η f ′ 2 ( x ) + O ( η 2 f ′ 2 ( x ) ) f\left(x-\eta f^{\prime}(x)\right)=f(x)-\eta f^{\prime 2}(x)+\mathcal{O}\left(\eta^{2} f^{\prime 2}(x)\right) f(xηf(x))=f(x)ηf2(x)+O(η2f2(x))

f ( x − η f ′ ( x ) ) ≲ f ( x ) f\left(x-\eta f^{\prime}(x)\right) \lesssim f(x) f(xηf(x))f(x)

x ← x − η f ′ ( x ) x \leftarrow x-\eta f^{\prime}(x) xxηf(x)

e.g.

f ( x ) = x 2 f(x) = x^2 f(x)=x2


def f(x):
    return x**2  # Objective function

def gradf(x):
    return 2 * x  # Its derivative

def gd(eta):
    x = 10
    results = [x]
    for i in range(10):
        x -= eta * gradf(x)
        results.append(x)
    print('epoch 10, x:', x)
    return results

res = gd(0.2)

def show_trace(res):
    n = max(abs(min(res)), abs(max(res)))
    f_line = np.arange(-n, n, 0.01)
    d2l.set_figsize((3.5, 2.5))
    d2l.plt.plot(f_line, [f(x) for x in f_line],'-')
    d2l.plt.plot(res, [f(x) for x in res],'-o')
    d2l.plt.xlabel('x')
    d2l.plt.ylabel('f(x)')
    

show_trace(res)

在这里插入图片描述

  • 学习率

show_trace(gd(0.05))

1582184235171


show_trace(gd(1.1))

在这里插入图片描述

  • 局部极小值

e.g.

f ( x ) = x cos ⁡ c x f(x) = x\cos cx f(x)=xcoscx


c = 0.15 * np.pi

def f(x):
    return x * np.cos(c * x)

def gradf(x):
    return np.cos(c * x) - c * x * np.sin(c * x)

show_trace(gd(2))

在这里插入图片描述

多维梯度下降

∇ f ( x ) = [ ∂ f ( x ) ∂ x 1 , ∂ f ( x ) ∂ x 2 , … , ∂ f ( x ) ∂ x d ] ⊤ \nabla f(\mathbf{x})=\left[\frac{\partial f(\mathbf{x})}{\partial x_{1}}, \frac{\partial f(\mathbf{x})}{\partial x_{2}}, \dots, \frac{\partial f(\mathbf{x})}{\partial x_{d}}\right]^{\top} f(x)=[x1f(x),x2f(x),,xdf(x)]

f ( x + ϵ ) = f ( x ) + ϵ ⊤ ∇ f ( x ) + O ( ∥ ϵ ∥ 2 ) f(\mathbf{x}+\epsilon)=f(\mathbf{x})+\epsilon^{\top} \nabla f(\mathbf{x})+\mathcal{O}\left(\|\epsilon\|^{2}\right) f(x+ϵ)=f(x)+ϵf(x)+O(ϵ2)

x ← x − η ∇ f ( x ) \mathbf{x} \leftarrow \mathbf{x}-\eta \nabla f(\mathbf{x}) xxηf(x)


def train_2d(trainer, steps=20):
    x1, x2 = -5, -2
    results = [(x1, x2)]
    for i in range(steps):
        x1, x2 = trainer(x1, x2)
        results.append((x1, x2))
    print('epoch %d, x1 %f, x2 %f' % (i + 1, x1, x2))
    return results

def show_trace_2d(f, results): 
    d2l.plt.plot(*zip(*results), '-o', color='#ff7f0e')
    x1, x2 = np.meshgrid(np.arange(-5.5, 1.0, 0.1), np.arange(-3.0, 1.0, 0.1))
    d2l.plt.contour(x1, x2, f(x1, x2), colors='#1f77b4')
    d2l.plt.xlabel('x1')
    d2l.plt.ylabel('x2')

f ( x ) = x 1 2 + 2 x 2 2 f(x) = x_1^2 + 2x_2^2 f(x)=x12+2x22


eta = 0.1

def f_2d(x1, x2):  # 目标函数
    return x1 ** 2 + 2 * x2 ** 2

def gd_2d(x1, x2):
    return (x1 - eta * 2 * x1, x2 - eta * 4 * x2)

show_trace_2d(f_2d, train_2d(gd_2d))

1582184489443

自适应方法

  • 牛顿法

  在 x + ϵ x + \epsilon x+ϵ 处泰勒展开:

f ( x + ϵ ) = f ( x ) + ϵ ⊤ ∇ f ( x ) + 1 2 ϵ ⊤ ∇ ∇ ⊤ f ( x ) ϵ + O ( ∥ ϵ ∥ 3 ) f(\mathbf{x}+\epsilon)=f(\mathbf{x})+\epsilon^{\top} \nabla f(\mathbf{x})+\frac{1}{2} \epsilon^{\top} \nabla \nabla^{\top} f(\mathbf{x}) \epsilon+\mathcal{O}\left(\|\epsilon\|^{3}\right) f(x+ϵ)=f(x)+ϵf(x)+21ϵf(x)ϵ+O(ϵ3)

  最小值点处满足: ∇ f ( x ) = 0 \nabla f(\mathbf{x})=0 f(x)=0, 即我们希望 ∇ f ( x + ϵ ) = 0 \nabla f(\mathbf{x} + \epsilon)=0 f(x+ϵ)=0, 对上式关于 ϵ \epsilon ϵ 求导,忽略高阶无穷小,有:

∇ f ( x ) + H f ϵ = 0  and hence  ϵ = − H f − 1 ∇ f ( x ) \nabla f(\mathbf{x})+\boldsymbol{H}_{f} \boldsymbol{\epsilon}=0 \text { and hence } \epsilon=-\boldsymbol{H}_{f}^{-1} \nabla f(\mathbf{x}) f(x)+Hfϵ=0 and hence ϵ=Hf1f(x)


c = 0.5

def f(x):
    return np.cosh(c * x)  # Objective

def gradf(x):
    return c * np.sinh(c * x)  # Derivative

def hessf(x):
    return c**2 * np.cosh(c * x)  # Hessian

# Hide learning rate for now
def newton(eta=1):
    x = 10
    results = [x]
    for i in range(10):
        x -= eta * gradf(x) / hessf(x)
        results.append(x)
    print('epoch 10, x:', x)
    return results

show_trace(newton())

在这里插入图片描述


c = 0.15 * np.pi

def f(x):
    return x * np.cos(c * x)

def gradf(x):
    return np.cos(c * x) - c * x * np.sin(c * x)

def hessf(x):
    return - 2 * c * np.sin(c * x) - x * c**2 * np.cos(c * x)

show_trace(newton())

在这里插入图片描述


show_trace(newton(0.5))

在这里插入图片描述

  • 收敛性分析

  只考虑在函数为凸函数, 且最小值点上 f ′ ′ ( x ∗ ) > 0 f''(x^*) > 0 f(x)>0 时的收敛速度:

  令 x k x_k xk 为第 k k k 次迭代后 x x x 的值, e k : = x k − x ∗ e_{k}:=x_{k}-x^{*} ek:=xkx 表示 x k x_k xk 到最小值点 x ∗ x^{*} x 的距离,由 f ′ ( x ∗ ) = 0 f'(x^{*}) = 0 f(x)=0:

0 = f ′ ( x k − e k ) = f ′ ( x k ) − e k f ′ ′ ( x k ) + 1 2 e k 2 f ′ ′ ′ ( ξ k ) for some  ξ k ∈ [ x k − e k , x k ] 0=f^{\prime}\left(x_{k}-e_{k}\right)=f^{\prime}\left(x_{k}\right)-e_{k} f^{\prime \prime}\left(x_{k}\right)+\frac{1}{2} e_{k}^{2} f^{\prime \prime \prime}\left(\xi_{k}\right) \text{for some } \xi_{k} \in\left[x_{k}-e_{k}, x_{k}\right] 0=f(xkek)=f(xk)ekf(xk)+21ek2f(ξk)for some ξk[xkek,xk]

  两边除以 f ′ ′ ( x k ) f''(x_k) f(xk), 有:

e k − f ′ ( x k ) / f ′ ′ ( x k ) = 1 2 e k 2 f ′ ′ ′ ( ξ k ) / f ′ ′ ( x k ) e_{k}-f^{\prime}\left(x_{k}\right) / f^{\prime \prime}\left(x_{k}\right)=\frac{1}{2} e_{k}^{2} f^{\prime \prime \prime}\left(\xi_{k}\right) / f^{\prime \prime}\left(x_{k}\right) ekf(xk)/f(xk)=21ek2f(ξk)/f(xk)

  代入更新方程 x k + 1 = x k − f ′ ( x k ) / f ′ ′ ( x k ) x_{k+1} = x_{k} - f^{\prime}\left(x_{k}\right) / f^{\prime \prime}\left(x_{k}\right) xk+1=xkf(xk)/f(xk), 得到:

x k − x ∗ − f ′ ( x k ) / f ′ ′ ( x k ) = 1 2 e k 2 f ′ ′ ′ ( ξ k ) / f ′ ′ ( x k ) x_k - x^{*} - f^{\prime}\left(x_{k}\right) / f^{\prime \prime}\left(x_{k}\right) =\frac{1}{2} e_{k}^{2} f^{\prime \prime \prime}\left(\xi_{k}\right) / f^{\prime \prime}\left(x_{k}\right) xkxf(xk)/f(xk)=21ek2f(ξk)/f(xk)

x k + 1 − x ∗ = e k + 1 = 1 2 e k 2 f ′ ′ ′ ( ξ k ) / f ′ ′ ( x k ) x_{k+1} - x^{*} = e_{k+1} = \frac{1}{2} e_{k}^{2} f^{\prime \prime \prime}\left(\xi_{k}\right) / f^{\prime \prime}\left(x_{k}\right) xk+1x=ek+1=21ek2f(ξk)/f(xk)

  当 1 2 f ′ ′ ′ ( ξ k ) / f ′ ′ ( x k ) ≤ c \frac{1}{2} f^{\prime \prime \prime}\left(\xi_{k}\right) / f^{\prime \prime}\left(x_{k}\right) \leq c 21f(ξk)/f(xk)c 时,有:

e k + 1 ≤ c e k 2 e_{k+1} \leq c e_{k}^{2} ek+1cek2

随机梯度下降

  • 随机梯度下降参数更新

  对于有 n n n 个样本对训练数据集,设 f i ( x ) f_i(x) fi(x) 是第 i i i 个样本的损失函数, 则目标函数为:

f ( x ) = 1 n ∑ i = 1 n f i ( x ) f(\mathbf{x})=\frac{1}{n} \sum_{i=1}^{n} f_{i}(\mathbf{x}) f(x)=n1i=1nfi(x)

  其梯度为:

∇ f ( x ) = 1 n ∑ i = 1 n ∇ f i ( x ) \nabla f(\mathbf{x})=\frac{1}{n} \sum_{i=1}^{n} \nabla f_{i}(\mathbf{x}) f(x)=n1i=1nfi(x)

  使用该梯度的一次更新的时间复杂度为 O ( n ) \mathcal{O}(n) O(n)

  随机梯度下降更新公式 O ( 1 ) \mathcal{O}(1) O(1):

x ← x − η ∇ f i ( x ) \mathbf{x} \leftarrow \mathbf{x}-\eta \nabla f_{i}(\mathbf{x}) xxηfi(x)

  且有:

E i ∇ f i ( x ) = 1 n ∑ i = 1 n ∇ f i ( x ) = ∇ f ( x ) \mathbb{E}_{i} \nabla f_{i}(\mathbf{x})=\frac{1}{n} \sum_{i=1}^{n} \nabla f_{i}(\mathbf{x})=\nabla f(\mathbf{x}) Eifi(x)=n1i=1nfi(x)=f(x)

e.g.

f ( x 1 , x 2 ) = x 1 2 + 2 x 2 2 f(x_1, x_2) = x_1^2 + 2 x_2^2 f(x1,x2)=x12+2x22


def f(x1, x2):
    return x1 ** 2 + 2 * x2 ** 2  # Objective

def gradf(x1, x2):
    return (2 * x1, 4 * x2)  # Gradient

def sgd(x1, x2):  # Simulate noisy gradient
    global lr  # Learning rate scheduler
    (g1, g2) = gradf(x1, x2)  # Compute gradient
    (g1, g2) = (g1 + np.random.normal(0.1), g2 + np.random.normal(0.1))
    eta_t = eta * lr()  # Learning rate at time t
    return (x1 - eta_t * g1, x2 - eta_t * g2)  # Update variables

eta = 0.1
lr = (lambda: 1)  # Constant learning rate
show_trace_2d(f, train_2d(sgd, steps=50))

在这里插入图片描述

  • 动态学习率

η ( t ) = η i  if  t i ≤ t ≤ t i + 1  piecewise constant  η ( t ) = η 0 ⋅ e − λ t  exponential  η ( t ) = η 0 ⋅ ( β t + 1 ) − α  polynomial  \begin{array}{ll}{\eta(t)=\eta_{i} \text { if } t_{i} \leq t \leq t_{i+1}} & {\text { piecewise constant }} \\ {\eta(t)=\eta_{0} \cdot e^{-\lambda t}} & {\text { exponential }} \\ {\eta(t)=\eta_{0} \cdot(\beta t+1)^{-\alpha}} & {\text { polynomial }}\end{array} η(t)=ηi if titti+1η(t)=η0eλtη(t)=η0(βt+1)α piecewise constant  exponential  polynomial 


def exponential():
    global ctr
    ctr += 1
    return math.exp(-0.1 * ctr)

ctr = 1
lr = exponential  # Set up learning rate
show_trace_2d(f, train_2d(sgd, steps=1000))

在这里插入图片描述


def polynomial():
    global ctr
    ctr += 1
    return (1 + 0.1 * ctr)**(-0.5)

ctr = 1
lr = polynomial  # Set up learning rate
show_trace_2d(f, train_2d(sgd, steps=50))

在这里插入图片描述

小批量随机梯度下降

  • 读取数据
      data

def get_data_ch7():  # 本函数已保存在d2lzh_pytorch包中方便以后使用
    data = np.genfromtxt('/home/kesci/input/airfoil4755/airfoil_self_noise.dat', delimiter='\t')
    data = (data - data.mean(axis=0)) / data.std(axis=0) # 标准化
    return torch.tensor(data[:1500, :-1], dtype=torch.float32), \
           torch.tensor(data[:1500, -1], dtype=torch.float32) # 前1500个样本(每个样本5个特征)

features, labels = get_data_ch7()
features.shape

import pandas as pd
df = pd.read_csv('/home/kesci/input/airfoil4755/airfoil_self_noise.dat', delimiter='\t', header=None)
df.head(10)
  • 从零开始实现

def sgd(params, states, hyperparams):
    for p in params:
        p.data -= hyperparams['lr'] * p.grad.data

# 本函数已保存在d2lzh_pytorch包中方便以后使用
def train_ch7(optimizer_fn, states, hyperparams, features, labels,
              batch_size=10, num_epochs=2):
    # 初始化模型
    net, loss = d2l.linreg, d2l.squared_loss
    
    w = torch.nn.Parameter(torch.tensor(np.random.normal(0, 0.01, size=(features.shape[1], 1)), dtype=torch.float32),
                           requires_grad=True)
    b = torch.nn.Parameter(torch.zeros(1, dtype=torch.float32), requires_grad=True)

    def eval_loss():
        return loss(net(features, w, b), labels).mean().item()

    ls = [eval_loss()]
    data_iter = torch.utils.data.DataLoader(
        torch.utils.data.TensorDataset(features, labels), batch_size, shuffle=True)
    
    for _ in range(num_epochs):
        start = time.time()
        for batch_i, (X, y) in enumerate(data_iter):
            l = loss(net(X, w, b), y).mean()  # 使用平均损失
            
            # 梯度清零
            if w.grad is not None:
                w.grad.data.zero_()
                b.grad.data.zero_()
                
            l.backward()
            optimizer_fn([w, b], states, hyperparams)  # 迭代模型参数
            if (batch_i + 1) * batch_size % 100 == 0:
                ls.append(eval_loss())  # 每100个样本记录下当前训练误差
    # 打印结果和作图
    print('loss: %f, %f sec per epoch' % (ls[-1], time.time() - start))
    d2l.set_figsize()
    d2l.plt.plot(np.linspace(0, num_epochs, len(ls)), ls)
    d2l.plt.xlabel('epoch')
    d2l.plt.ylabel('loss')

def train_sgd(lr, batch_size, num_epochs=2):
    train_ch7(sgd, None, {'lr': lr}, features, labels, batch_size, num_epochs)

  对比


train_sgd(1, 1500, 6)

在这里插入图片描述


train_sgd(0.005, 1)

在这里插入图片描述


train_sgd(0.05, 10)

在这里插入图片描述

  • 简洁实现

# 本函数与原书不同的是这里第一个参数优化器函数而不是优化器的名字
# 例如: optimizer_fn=torch.optim.SGD, optimizer_hyperparams={"lr": 0.05}
def train_pytorch_ch7(optimizer_fn, optimizer_hyperparams, features, labels,
                    batch_size=10, num_epochs=2):
    # 初始化模型
    net = nn.Sequential(
        nn.Linear(features.shape[-1], 1)
    )
    loss = nn.MSELoss()
    optimizer = optimizer_fn(net.parameters(), **optimizer_hyperparams)

    def eval_loss():
        return loss(net(features).view(-1), labels).item() / 2

    ls = [eval_loss()]
    data_iter = torch.utils.data.DataLoader(
        torch.utils.data.TensorDataset(features, labels), batch_size, shuffle=True)

    for _ in range(num_epochs):
        start = time.time()
        for batch_i, (X, y) in enumerate(data_iter):
            # 除以2是为了和train_ch7保持一致, 因为squared_loss中除了2
            l = loss(net(X).view(-1), y) / 2 
            
            optimizer.zero_grad()
            l.backward()
            optimizer.step()
            if (batch_i + 1) * batch_size % 100 == 0:
                ls.append(eval_loss())
    # 打印结果和作图
    print('loss: %f, %f sec per epoch' % (ls[-1], time.time() - start))
    d2l.set_figsize()
    d2l.plt.plot(np.linspace(0, num_epochs, len(ls)), ls)
    d2l.plt.xlabel('epoch')
    d2l.plt.ylabel('loss')

train_pytorch_ch7(optim.SGD, {"lr": 0.05}, features, labels, 10)

在这里插入图片描述

  • 2
    点赞
  • 9
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值