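Excerpted from "ElitesAI · Dive into Deep Learning (PyTorch edition)".
- (Task06: Batch normalization and residual networks; convex optimization; gradient descent (1 day))
- Task09: Object detection basics; image style transfer; image classification case study 1 (1 day)
- Task10: Image classification case study 2; GAN; DCGAN (1 day)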
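Batch Normalization
Standardizing the input (shallow models)
After standardization, every feature has mean 0 and standard deviation 1 across all samples in the dataset.
Standardizing the input data makes the distributions of the individual features similar.
Batch normalization (deep models)
Batch normalization uses the mean and standard deviation of each mini-batch to keep adjusting the intermediate outputs of the network, so that the intermediate outputs of every layer stay numerically more stable.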
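1. Batch normalization for fully connected layers
Position: between the affine transformation and the activation function of the fully connected layer.
Fully connected layer:
$$x = Wu + b, \qquad \text{output} = \phi(x)$$
With batch normalization:
$$\text{output} = \phi(\text{BN}(x))$$
$$y^{(i)} = \text{BN}(x^{(i)})$$
$$\mu_B \leftarrow \frac{1}{m} \sum_{i=1}^{m} x^{(i)},$$
$$\sigma_B^2 \leftarrow \frac{1}{m} \sum_{i=1}^{m} (x^{(i)} - \mu_B)^2,$$
$$\hat{x}^{(i)} \leftarrow \frac{x^{(i)} - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}},$$
Here $\epsilon > 0$ is a small constant that keeps the denominator greater than 0.
$$y^{(i)} \leftarrow \gamma \odot \hat{x}^{(i)} + \beta.$$
This introduces two learnable parameters: the scale parameter $\gamma$ and the shift parameter $\beta$. If $\gamma = \sqrt{\sigma_B^2 + \epsilon}$ and $\beta = \mu_B$, batch normalization has no effect, because the output equals the input.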
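As a rough illustration of the formulas above, the following sketch normalizes a single feature over a tiny mini-batch using only the Python standard library. The helper name `batch_norm_1d` and the sample values are assumptions made for this example; they are not part of the course code.

```python
import math

def batch_norm_1d(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Batch-normalize one feature, given its values over a mini-batch (hypothetical helper)."""
    m = len(batch)
    mu = sum(batch) / m                                   # mini-batch mean
    var = sum((x - mu) ** 2 for x in batch) / m           # biased mini-batch variance
    x_hat = [(x - mu) / math.sqrt(var + eps) for x in batch]
    y = [gamma * xh + beta for xh in x_hat]               # scale and shift
    return y, mu, var

batch = [1.0, 2.0, 3.0, 4.0]                              # illustrative values
y, mu, var = batch_norm_1d(batch)
print(mu, var)                                            # 2.5 1.25

# With gamma = sqrt(var + eps) and beta = mu the layer reproduces its input.
y_identity, _, _ = batch_norm_1d(batch, gamma=math.sqrt(var + 1e-5), beta=mu)
print(y_identity)
```

With the default `gamma=1` and `beta=0`, the output has mean close to 0 and variance close to 1. The second call shows the special case in which batch normalization has no effect.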
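2. Batch normalization for convolutional layers
Position: after the convolution and before the activation function.
If the convolution outputs multiple channels, each channel is batch-normalized separately, and each channel has its own scale and shift parameters. Computation: for a single channel with batch size m and a convolution output of size p×q, batch normalization is applied to all m×p×q elements of that channel jointly, using the same mean and variance.
3. Batch normalization at prediction time
Training: statistics are computed per batch; the mean and variance are computed for each batch.
Prediction: moving averages are used to estimate the sample mean and variance of the whole training dataset.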
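The moving-average estimate used at prediction time can be sketched as below. The helper name `update_moving` and the constant batch mean of 2.0 are assumptions for illustration; the momentum of 0.9 is the value passed in the from-scratch implementation in these notes.

```python
def update_moving(moving, batch_stat, momentum=0.9):
    """Exponential moving average of a batch statistic (hypothetical helper)."""
    return momentum * moving + (1.0 - momentum) * batch_stat

moving_mean = 0.0                       # initialized to 0, as in the notes
for batch_mean in [2.0] * 50:           # pretend every batch has mean 2.0
    moving_mean = update_moving(moving_mean, batch_mean)

print(moving_mean)                      # approaches 2.0 from below
```

Because the estimate starts at 0, it is biased towards 0 during the first few updates and only approaches the true statistic after enough batches have been seen.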
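Implementation from scratch
In [2]:
# GPU compute resources are expected to come online on the 17th; until then this code can only run on the CPU.
# Because the model in this code is large and CPU training is slow,
# we have also uploaded a copy of the code to https://www.kaggle.com/boyuai/boyu-d2l-deepcnn
# To run on a GPU before then, please use Kaggle.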
In [1]:
import time
import torch
from torch import nn, optim
import torch.nn.functional as F
import torchvision
import sys
sys.path.append("/home/kesci/input/")
import d2lzh1981 as d2l
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
def batch_norm(is_training, X, gamma, beta, moving_mean, moving_var, eps, momentum):
    # Check whether we are in training mode or prediction mode
    if not is_training:
        # In prediction mode, use the moving-average mean and variance that were passed in
        X_hat = (X - moving_mean) / torch.sqrt(moving_var + eps)
    else:
        assert len(X.shape) in (2, 4)
        if len(X.shape) == 2:
            # Fully connected layer: compute the mean and variance of each feature over the batch
            mean = X.mean(dim=0)
            var = ((X - mean) ** 2).mean(dim=0)
        else:
            # 2D convolutional layer: compute the mean and variance per channel (axis=1).
            # Keep the shape of X so that broadcasting works afterwards
            mean = X.mean(dim=0, keepdim=True).mean(dim=2, keepdim=True).mean(dim=3, keepdim=True)
            var = ((X - mean) ** 2).mean(dim=0, keepdim=True).mean(dim=2, keepdim=True).mean(dim=3, keepdim=True)
        # In training mode, normalize with the current mean and variance
        X_hat = (X - mean) / torch.sqrt(var + eps)
        # Update the moving averages of the mean and variance
        moving_mean = momentum * moving_mean + (1.0 - momentum) * mean
        moving_var = momentum * moving_var + (1.0 - momentum) * var
    Y = gamma * X_hat + beta  # scale and shift
    return Y, moving_mean, moving_var
In [3]:
class BatchNorm(nn.Module):
    def __init__(self, num_features, num_dims):
        super(BatchNorm, self).__init__()
        if num_dims == 2:
            shape = (1, num_features)  # number of output neurons of the fully connected layer
        else:
            shape = (1, num_features, 1, 1)  # number of channels
        # Scale and shift parameters (they take part in gradients and updates), initialized to 1 and 0 respectively
        self.gamma = nn.Parameter(torch.ones(shape))
        self.beta = nn.Parameter(torch.zeros(shape))
        # Variables that do not take part in gradients or updates, initialized to 0 in CPU memory
        self.moving_mean = torch.zeros(shape)
        self.moving_var = torch.zeros(shape)
    def forward(self, X):
        # If X is not in CPU memory, copy moving_mean and moving_var to the device X lives on
        if self.moving_mean.device != X.device:
            self.moving_mean = self.moving_mean.to(X.device)
            self.moving_var = self.moving_var.to(X.device)
        # Save the updated moving_mean and moving_var. A Module's training attribute defaults to True
        # and is set to False after calling .eval()
        Y, self.moving_mean, self.moving_var = batch_norm(self.training,
            X, self.gamma, self.beta, self.moving_mean,
            self.moving_var, eps=1e-5, momentum=0.9)
        return Y
Application to LeNet
In [4]:
net = nn.Sequential(
nn.Conv2d(1, 6, 5), # in_channels, out_channels, kernel_size
BatchNorm(6, num_dims=4),
nn.Sigmoid(),
nn.MaxPool2d(2, 2), # kernel_size, stride
nn.Conv2d(6, 16, 5),
BatchNorm(16, num_dims=4),
nn.Sigmoid(),
nn.MaxPool2d(2, 2),
d2l.FlattenLayer(),
nn.Linear(16*4*4, 120),
BatchNorm(120, num_dims=2),
nn.Sigmoid(),
nn.Linear(120, 84),
BatchNorm(84, num_dims=2),
nn.Sigmoid(),
nn.Linear(84, 10)
)
print(net)
Sequential(
(0): Conv2d(1, 6, kernel_size=(5, 5), stride=(1, 1))
(1): BatchNorm()
(2): Sigmoid()
(3): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
(4): Conv2d(6, 16, kernel_size=(5, 5), stride=(1, 1))
(5): BatchNorm()
(6): Sigmoid()
(7): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
(8): FlattenLayer()
(9): Linear(in_features=256, out_features=120, bias=True)
(10): BatchNorm()
(11): Sigmoid()
(12): Linear(in_features=120, out_features=84, bias=True)
(13): BatchNorm()
(14): Sigmoid()
(15): Linear(in_features=84, out_features=10, bias=True)
)
In [5]:
# batch_size = 256
# Use a smaller batch size when running on the CPU
batch_size = 16
def load_data_fashion_mnist(batch_size, resize=None, root='/home/kesci/input/FashionMNIST2065'):
    """Download the fashion mnist dataset and then load into memory."""
    trans = []
    if resize:
        trans.append(torchvision.transforms.Resize(size=resize))
    trans.append(torchvision.transforms.ToTensor())
    transform = torchvision.transforms.Compose(trans)
    mnist_train = torchvision.datasets.FashionMNIST(root=root, train=True, download=True, transform=transform)
    mnist_test = torchvision.datasets.FashionMNIST(root=root, train=False, download=True, transform=transform)
    train_iter = torch.utils.data.DataLoader(mnist_train, batch_size=batch_size, shuffle=True, num_workers=2)
    test_iter = torch.utils.data.DataLoader(mnist_test, batch_size=batch_size, shuffle=False, num_workers=2)
    return train_iter, test_iter
train_iter, test_iter = load_data_fashion_mnist(batch_size)
In [10]:
lr, num_epochs = 0.001, 5
optimizer = torch.optim.Adam(net.parameters(), lr=lr)
d2l.train_ch5(net, train_iter, test_iter, batch_size, optimizer, device, num_epochs)
Concise implementation
In [ ]:
net = nn.Sequential(
nn.Conv2d(1, 6, 5), # in_channels, out_channels, kernel_size
nn.BatchNorm2d(6),
nn.Sigmoid(),
nn.MaxPool2d(2, 2), # kernel_size, stride
nn.Conv2d(6, 16, 5),
nn.BatchNorm2d(16),
nn.Sigmoid(),
nn.MaxPool2d(2, 2),
d2l.FlattenLayer(),
nn.Linear(16*4*4, 120),
nn.BatchNorm1d(120),
nn.Sigmoid(),
nn.Linear(120, 84),
nn.BatchNorm1d(84),
nn.Sigmoid(),
nn.Linear(84, 10)
)
optimizer = torch.optim.Adam(net.parameters(), lr=lr)
d2l.train_ch5(net, train_iter, test_iter, batch_size, optimizer, device, num_epochs)
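Residual Networks (ResNet)
A problem in deep learning: once a deep CNN reaches a certain depth, simply adding more layers no longer improves classification performance. Instead, the network converges more slowly and its accuracy gets worse.
Residual Block
Identity mapping:
Left: fit f(x) = x directly
Right: fit the residual f(x) - x = 0 (small deviations from the identity mapping are easier to capture)
In a residual block, the input can propagate forward faster through the cross-layer data path.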
In [6]:
class Residual(nn.Module):  # This class is saved in the d2lzh_pytorch package for later use
    # Configurable: number of output channels, whether to use an extra 1x1 convolution
    # to change the number of channels, and the stride of the convolution.
    def __init__(self, in_channels, out_channels, use_1x1conv=False, stride=1):
        super(Residual, self).__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1, stride=stride)
        self.conv2 = nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1)
        if use_1x1conv:
            self.conv3 = nn.Conv2d(in_channels, out_channels, kernel_size=1, stride=stride)
        else:
            self.conv3 = None
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.bn2 = nn.BatchNorm2d(out_channels)
    def forward(self, X):
        Y = F.relu(self.bn1(self.conv1(X)))
        Y = self.bn2(self.conv2(Y))
        if self.conv3 is not None:
            X = self.conv3(X)
        return F.relu(Y + X)
In [7]:
blk = Residual(3, 3)
X = torch.rand((4, 3, 6, 6))
blk(X).shape # torch.Size([4, 3, 6, 6])
Out[7]:
torch.Size([4, 3, 6, 6])
In [8]:
blk = Residual(3, 6, use_1x1conv=True, stride=2)
blk(X).shape # torch.Size([4, 6, 3, 3])
Out[8]:
torch.Size([4, 6, 3, 3])
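ResNet model
Convolution (64, 7x7, 3)
Batch normalization
Max pooling (3x3, 2)
Residual modules x4 (residual blocks with stride 2 reduce the height and width between modules)
Global average pooling
Fully connected layer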
In [9]:
net = nn.Sequential(
nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3),
nn.BatchNorm2d(64),
nn.ReLU(),
nn.MaxPool2d(kernel_size=3, stride=2, padding=1))
In [10]:
def resnet_block(in_channels, out_channels, num_residuals, first_block=False):
    if first_block:
        assert in_channels == out_channels  # the first module keeps the number of channels equal to its input
    blk = []
    for i in range(num_residuals):
        if i == 0 and not first_block:
            blk.append(Residual(in_channels, out_channels, use_1x1conv=True, stride=2))
        else:
            blk.append(Residual(out_channels, out_channels))
    return nn.Sequential(*blk)
net.add_module("resnet_block1", resnet_block(64, 64, 2, first_block=True))
net.add_module("resnet_block2", resnet_block(64, 128, 2))
net.add_module("resnet_block3", resnet_block(128, 256, 2))
net.add_module("resnet_block4", resnet_block(256, 512, 2))
In [11]:
net.add_module("global_avg_pool", d2l.GlobalAvgPool2d())  # Output of GlobalAvgPool2d: (Batch, 512, 1, 1)
net.add_module("fc", nn.Sequential(d2l.FlattenLayer(), nn.Linear(512, 10)))
In [12]:
X = torch.rand((1, 1, 224, 224))
for name, layer in net.named_children():
    X = layer(X)
    print(name, ' output shape:\t', X.shape)
0 output shape: torch.Size([1, 64, 112, 112])
1 output shape: torch.Size([1, 64, 112, 112])
2 output shape: torch.Size([1, 64, 112, 112])
3 output shape: torch.Size([1, 64, 56, 56])
resnet_block1 output shape: torch.Size([1, 64, 56, 56])
resnet_block2 output shape: torch.Size([1, 128, 28, 28])
resnet_block3 output shape: torch.Size([1, 256, 14, 14])
resnet_block4 output shape: torch.Size([1, 512, 7, 7])
global_avg_pool output shape: torch.Size([1, 512, 1, 1])
fc output shape: torch.Size([1, 10])
In [13]:
lr, num_epochs = 0.001, 5
optimizer = torch.optim.Adam(net.parameters(), lr=lr)
d2l.train_ch5(net, train_iter, test_iter, batch_size, optimizer, device, num_epochs)
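Densely Connected Networks (DenseNet)
Main building blocks:
Dense block: defines how the input and the output are concatenated.
Transition layer: controls the number of channels so that it does not grow too large.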
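To make the channel bookkeeping concrete, the sketch below tracks only the number of channels through alternating dense blocks and transition layers. It uses no PyTorch; the function name `densenet_channels` is an assumption for this example, while the default values (64 input channels, growth rate 32, four convolutions per dense block, halving in each transition layer) are the ones used by the DenseNet model built later in these notes.

```python
def densenet_channels(num_channels=64, growth_rate=32,
                      num_convs_in_dense_blocks=(4, 4, 4, 4)):
    """Return (stage, channels) pairs after each dense block / transition layer."""
    trace = []
    last = len(num_convs_in_dense_blocks) - 1
    for i, num_convs in enumerate(num_convs_in_dense_blocks):
        # Each convolution in a dense block appends growth_rate channels to its input.
        num_channels += num_convs * growth_rate
        trace.append(("dense_%d" % i, num_channels))
        if i != last:
            # A transition layer halves the number of channels with a 1x1 convolution.
            num_channels //= 2
            trace.append(("transition_%d" % i, num_channels))
    return trace

for stage, channels in densenet_channels():
    print(stage, channels)
```

The printed channel counts (192, 96, 224, 112, 240, 120, 248) agree with the layer output shapes listed for the DenseNet model in these notes.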
Dense block
In [13]:
def conv_block(in_channels, out_channels):
    blk = nn.Sequential(nn.BatchNorm2d(in_channels),
                        nn.ReLU(),
                        nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1))
    return blk
class DenseBlock(nn.Module):
    def __init__(self, num_convs, in_channels, out_channels):
        super(DenseBlock, self).__init__()
        net = []
        for i in range(num_convs):
            in_c = in_channels + i * out_channels
            net.append(conv_block(in_c, out_channels))
        self.net = nn.ModuleList(net)
        self.out_channels = in_channels + num_convs * out_channels  # number of output channels
    def forward(self, X):
        for blk in self.net:
            Y = blk(X)
            X = torch.cat((X, Y), dim=1)  # concatenate input and output along the channel dimension
        return X
In [14]:
blk = DenseBlock(2, 3, 10)
X = torch.rand(4, 3, 8, 8)
Y = blk(X)
Y.shape # torch.Size([4, 23, 8, 8])
Out[14]:
torch.Size([4, 23, 8, 8])
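Transition layer
1×1 convolutional layer: reduces the number of channels
Average pooling layer with stride 2: halves the height and width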
In [15]:
def transition_block(in_channels, out_channels):
    blk = nn.Sequential(
        nn.BatchNorm2d(in_channels),
        nn.ReLU(),
        nn.Conv2d(in_channels, out_channels, kernel_size=1),
        nn.AvgPool2d(kernel_size=2, stride=2))
    return blk
blk = transition_block(23, 10)
blk(Y).shape # torch.Size([4, 10, 4, 4])
Out[15]:
torch.Size([4, 10, 4, 4])
DenseNet model
In [16]:
net = nn.Sequential(
nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3),
nn.BatchNorm2d(64),
nn.ReLU(),
nn.MaxPool2d(kernel_size=3, stride=2, padding=1))
In [17]:
num_channels, growth_rate = 64, 32  # num_channels is the current number of channels
num_convs_in_dense_blocks = [4, 4, 4, 4]
for i, num_convs in enumerate(num_convs_in_dense_blocks):
    DB = DenseBlock(num_convs, num_channels, growth_rate)
    net.add_module("DenseBlosk_%d" % i, DB)
    # Number of output channels of the previous dense block
    num_channels = DB.out_channels
    # Between dense blocks, add a transition layer that halves the number of channels
    if i != len(num_convs_in_dense_blocks) - 1:
        net.add_module("transition_block_%d" % i, transition_block(num_channels, num_channels // 2))
        num_channels = num_channels // 2
In [18]:
net.add_module("BN", nn.BatchNorm2d(num_channels))
net.add_module("relu", nn.ReLU())
net.add_module("global_avg_pool", d2l.GlobalAvgPool2d())  # Output of GlobalAvgPool2d: (Batch, num_channels, 1, 1)
net.add_module("fc", nn.Sequential(d2l.FlattenLayer(), nn.Linear(num_channels, 10)))
X = torch.rand((1, 1, 96, 96))
for name, layer in net.named_children():
    X = layer(X)
    print(name, ' output shape:\t', X.shape)
0 output shape: torch.Size([1, 64, 48, 48])
1 output shape: torch.Size([1, 64, 48, 48])
2 output shape: torch.Size([1, 64, 48, 48])
3 output shape: torch.Size([1, 64, 24, 24])
DenseBlosk_0 output shape: torch.Size([1, 192, 24, 24])
transition_block_0 output shape: torch.Size([1, 96, 12, 12])
DenseBlosk_1 output shape: torch.Size([1, 224, 12, 12])
transition_block_1 output shape: torch.Size([1, 112, 6, 6])
DenseBlosk_2 output shape: torch.Size([1, 240, 6, 6])
transition_block_2 output shape: torch.Size([1, 120, 3, 3])
DenseBlosk_3 output shape: torch.Size([1, 248, 3, 3])
BN output shape: torch.Size([1, 248, 3, 3])
relu output shape: torch.Size([1, 248, 3, 3])
global_avg_pool output shape: torch.Size([1, 248, 1, 1])
fc output shape: torch.Size([1, 10])
In [20]:
# batch_size = 256
batch_size = 16
# If an "out of memory" error occurs, reduce batch_size or resize
train_iter, test_iter = load_data_fashion_mnist(batch_size, resize=96)
lr, num_epochs = 0.001, 5
optimizer = torch.optim.Adam(net.parameters(), lr=lr)
d2l.train_ch5(net, train_iter, test_iter, batch_size, optimizer, device, num_epochs)
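Optimization and Deep Learning
Optimization and estimation
Although optimization methods can minimize the value of the loss function in deep learning, the goal of optimization is fundamentally different from the goal of deep learning.
- Goal of optimization: the loss on the training set
- Goal of deep learning: the loss on the test set (generalization)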
In [1]:
%matplotlib inline
import sys
sys.path.append('/home/kesci/input')
import d2lzh1981 as d2l
from mpl_toolkits import mplot3d # 3D plotting
import numpy as np
In [2]:
def f(x): return x * np.cos(np.pi * x)
def g(x): return f(x) + 0.2 * np.cos(5 * np.pi * x)
d2l.set_figsize((5, 3))
x = np.arange(0.5, 1.5, 0.01)
fig_f, = d2l.plt.plot(x, f(x),label="train error")
fig_g, = d2l.plt.plot(x, g(x),'--', c='purple', label="test error")
fig_f.axes.annotate('empirical risk', (1.0, -1.2), (0.5, -1.1),arrowprops=dict(arrowstyle='->'))
fig_g.axes.annotate('expected risk', (1.1, -1.05), (0.95, -0.5),arrowprops=dict(arrowstyle='->'))
d2l.plt.xlabel('x')
d2l.plt.ylabel('risk')
d2l.plt.legend(loc="upper right")
Out[2]:
<matplotlib.legend.Legend at 0x7fc092436080>
Challenges of optimization in deep learning
- Local minima
- Saddle points
- Vanishing gradients
Local minima
$$f(x) = x \cos(\pi x)$$
In [3]:
def f(x):
    return x * np.cos(np.pi * x)
d2l.set_figsize((4.5, 2.5))
x = np.arange(-1.0, 2.0, 0.1)
fig, = d2l.plt.plot(x, f(x))
fig.axes.annotate('local minimum', xy=(-0.3, -0.25), xytext=(-0.77, -1.0),
arrowprops=dict(arrowstyle='->'))
fig.axes.annotate('global minimum', xy=(1.1, -0.95), xytext=(0.6, 0.8),
arrowprops=dict(arrowstyle='->'))
d2l.plt.xlabel('x')
d2l.plt.ylabel('f(x)');
Saddle points
In [4]:
x = np.arange(-2.0, 2.0, 0.1)
fig, = d2l.plt.plot(x, x**3)
fig.axes.annotate('saddle point', xy=(0, -0.2), xytext=(-0.52, -5.0),
arrowprops=dict(arrowstyle='->'))
d2l.plt.xlabel('x')
d2l.plt.ylabel('f(x)');
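$$A = \begin{bmatrix}
\frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n} \\
\frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} & \cdots & \frac{\partial^2 f}{\partial x_2 \partial x_n} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial^2 f}{\partial x_n \partial x_1} & \frac{\partial^2 f}{\partial x_n \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_n^2}
\end{bmatrix}$$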
e.g.
In [5]:
x, y = np.mgrid[-1: 1: 31j, -1: 1: 31j]
z = x**2 - y**2
d2l.set_figsize((6, 4))
ax = d2l.plt.figure().add_subplot(111, projection='3d')
ax.plot_wireframe(x, y, z, **{'rstride': 2, 'cstride': 2})
ax.plot([0], [0], [0], 'ro', markersize=10)
ticks = [-1, 0, 1]
d2l.plt.xticks(ticks)
d2l.plt.yticks(ticks)
ax.set_zticks(ticks)
d2l.plt.xlabel('x')
d2l.plt.ylabel('y');
Vanishing gradients
In [6]:
x = np.arange(-2.0, 5.0, 0.01)
fig, = d2l.plt.plot(x, np.tanh(x))
d2l.plt.xlabel('x')
d2l.plt.ylabel('f(x)')
fig.axes.annotate('vanishing gradient', (4, 1), (2, 0.0) ,arrowprops=dict(arrowstyle='->'))
Out[6]:
Text(2, 0.0, 'vanishing gradient')
Convexity
Basics
Sets
Functions
$$\lambda f(x) + (1 - \lambda) f(x') \ge f(\lambda x + (1 - \lambda) x')$$
In [10]:
def f(x):
    return 0.5 * x**2 # Convex
def g(x):
    return np.cos(np.pi * x) # Nonconvex
def h(x):
    return np.exp(0.5 * x) # Convex
x, segment = np.arange(-2, 2, 0.01), np.array([-1.5, 1])
d2l.use_svg_display()
_, axes = d2l.plt.subplots(1, 3, figsize=(9, 3))
for ax, func in zip(axes, [f, g, h]):
    ax.plot(x, func(x))
    ax.plot(segment, func(segment),'--', color="purple")
    # d2l.plt.plot([x, segment], [func(x), func(segment)], axes=ax)
Jensen's inequality
$$\sum_i \alpha_i f(x_i) \ge f\left(\sum_i \alpha_i x_i\right) \text{ and } E_x[f(x)] \ge f(E_x[x])$$
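Jensen's inequality can be checked numerically on a few points. The sketch below does so for the three functions plotted in the convexity example of these notes (two convex, one nonconvex). The helper `jensen_gap`, the sample points, and the weights are assumptions chosen for illustration; the weights are non-negative and sum to 1.

```python
import math

def jensen_gap(f, xs, alphas):
    """Return sum_i alpha_i f(x_i) - f(sum_i alpha_i x_i); non-negative for convex f."""
    lhs = sum(a * f(x) for a, x in zip(alphas, xs))
    rhs = f(sum(a * x for a, x in zip(alphas, xs)))
    return lhs - rhs

xs = [-1.5, 0.0, 1.0]        # illustrative points
alphas = [0.2, 0.5, 0.3]     # illustrative weights summing to 1

gap_quadratic = jensen_gap(lambda x: 0.5 * x ** 2, xs, alphas)        # convex
gap_exp = jensen_gap(lambda x: math.exp(0.5 * x), xs, alphas)         # convex
gap_cos = jensen_gap(lambda x: math.cos(math.pi * x), [-1.0, 1.0], [0.5, 0.5])  # nonconvex

print(gap_quadratic, gap_exp, gap_cos)
```

The gap is non-negative for both convex functions, while the cosine example yields a negative gap, so the inequality fails for that nonconvex function.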
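Properties
- No local minima (every local minimum is a global minimum)
- Relation to convex sets
- Second-order condition
No local minima
Proof: suppose $x \in X$ is a local minimum that is not a global minimum. Then there exists a global minimum $x' \in X$ such that $f(x) > f(x')$, and for $\lambda \in [0, 1)$:
$$f(x) > \lambda f(x) + (1 - \lambda) f(x') \ge f(\lambda x + (1 - \lambda) x')$$
Letting $\lambda \to 1$ yields points arbitrarily close to $x$ with a strictly smaller function value, which contradicts $x$ being a local minimum.
Relation to convex sets
For a convex function $f(x)$, define the set $S_b := \{x \mid x \in X \text{ and } f(x) \le b\}$. Then $S_b$ is a convex set.
Proof: for points $x, x' \in S_b$ we have $f(\lambda x + (1 - \lambda) x') \le \lambda f(x) + (1 - \lambda) f(x') \le b$, hence $\lambda x + (1 - \lambda) x' \in S_b$.
$$f(x, y) = x^2 + 0.5 \cos(2 \pi y)$$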
In [12]:
x, y = np.meshgrid(np.linspace(-1, 1, 101), np.linspace(-1, 1, 101),
indexing='ij')
z = x**2 + 0.5 * np.cos(2 * np.pi * y)
# Plot the 3D surface
d2l.set_figsize((6, 4))
ax = d2l.plt.figure().add_subplot(111, projection='3d')
ax.plot_wireframe(x, y, z, **{'rstride': 10, 'cstride': 10})
ax.contour(x, y, z, offset=-1)
ax.set_zlim(-1, 1.5)
# Adjust labels
for func in [d2l.plt.xticks, d2l.plt.yticks, ax.set_zticks]:
func([-1, 0, 1])
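Convex functions and second derivatives
$f''(x) \ge 0 \iff f(x)$ is a convex function
Necessity ($\Leftarrow$):
For a convex function:
$$\frac{1}{2} f(x + \epsilon) + \frac{1}{2} f(x - \epsilon) \ge f\left(\frac{x + \epsilon}{2} + \frac{x - \epsilon}{2}\right) = f(x)$$
Hence:
$$f''(x) = \lim_{\epsilon \to 0} \frac{\frac{f(x + \epsilon) - f(x)}{\epsilon} - \frac{f(x) - f(x - \epsilon)}{\epsilon}}{\epsilon}$$
$$f''(x) = \lim_{\epsilon \to 0} \frac{f(x + \epsilon) + f(x - \epsilon) - 2 f(x)}{\epsilon^2} \ge 0$$
Sufficiency ($\Rightarrow$):
Let $a < x < b$ be three points in the domain of $f$. By the Lagrange mean value theorem:
$$f(x) - f(a) = (x - a) f'(\alpha) \text{ for some } \alpha \in [a, x] \text{ and } f(b) - f(x) = (b - x) f'(\beta) \text{ for some } \beta \in [x, b]$$
Since $f'' \ge 0$, the derivative $f'$ is monotonically non-decreasing, so $f'(\beta) \ge f'(\alpha)$, and therefore:
$$f(b) - f(a) = f(b) - f(x) + f(x) - f(a) = (b - x) f'(\beta) + (x - a) f'(\alpha) \ge (b - a) f'(\alpha)$$
Combined with $f(x) - f(a) = (x - a) f'(\alpha)$, this gives $f(x) \le f(a) + \frac{x - a}{b - a} (f(b) - f(a))$: the graph of $f$ lies on or below the chord from $a$ to $b$, which is convexity.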
In [13]:
def f(x):
    return 0.5 * x**2
x = np.arange(-2, 2, 0.01)
axb, ab = np.array([-1.5, -0.5, 1]), np.array([-1.5, 1])
d2l.set_figsize((3.5, 2.5))
fig_x, = d2l.plt.plot(x, f(x))
fig_axb, = d2l.plt.plot(axb, f(axb), '-.',color="purple")
fig_ab, = d2l.plt.plot(ab, f(ab),'g-.')
fig_x.axes.annotate('a', (-1.5, f(-1.5)), (-1.5, 1.5),arrowprops=dict(arrowstyle='->'))
fig_x.axes.annotate('b', (1, f(1)), (1, 1.5),arrowprops=dict(arrowstyle='->'))
fig_x.axes.annotate('x', (-0.5, f(-0.5)), (-1.5, f(-0.5)),arrowprops=dict(arrowstyle='->'))
Out[13]:
Text(-1.5, 0.125, 'x')
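Constraints
$$\underset{x}{\text{minimize}}\ f(x) \text{ subject to } c_i(x) \le 0 \text{ for all } i \in \{1, \ldots, N\}$$
Lagrange multiplier method
$$L(x, \alpha) = f(x) + \sum_i \alpha_i c_i(x) \text{ where } \alpha_i \ge 0$$
Penalty terms
To encourage $c_i(x) \le 0$, add the term $\alpha_i c_i(x)$ to the objective function, as with the $\frac{\lambda}{2} \|w\|^2$ term in the multilayer perceptron chapter.
Projection
$$\text{Proj}_X(x) = \underset{x' \in X}{\text{argmin}} \|x - x'\|_2$$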
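For a concrete instance of the projection, take the constraint set $X$ to be an L2 ball of a given radius around the origin. This choice of $X$ is an assumption made for this sketch, not something fixed by the notes. In that case the closest point in $X$ has a closed form: points inside the ball stay where they are, and points outside are rescaled onto its surface.

```python
import math

def project_onto_ball(x, radius=1.0):
    """Project vector x (a list of floats) onto {x' : ||x'||_2 <= radius}."""
    norm = math.sqrt(sum(v * v for v in x))
    if norm <= radius:
        return list(x)                 # already feasible: the projection is x itself
    scale = radius / norm              # shrink onto the surface of the ball
    return [v * scale for v in x]

inside = project_onto_ball([0.3, 0.4])     # norm 0.5, unchanged
outside = project_onto_ball([3.0, 4.0])    # norm 5, rescaled to norm 1
print(inside, outside)
```

Rescaling a gradient whose norm exceeds a threshold (gradient clipping) is the same operation applied to the gradient vector.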
Gradient Descent
In [1]:
%matplotlib inline
import numpy as np
import torch
import time
from torch import nn, optim
import math
import sys
sys.path.append('/home/kesci/input')
import d2lzh1981 as d2l
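One-dimensional gradient descent
Claim: moving the variable in the direction opposite to the gradient decreases the function value.
Taylor expansion:
$$f(x + \epsilon) = f(x) + \epsilon f'(x) + O(\epsilon^2)$$
Substituting a step against the gradient, $\epsilon = -\eta f'(x)$:
$$f(x - \eta f'(x)) = f(x) - \eta f'^2(x) + O(\eta^2 f'^2(x))$$
For a sufficiently small learning rate $\eta > 0$ the higher-order term is negligible, so:
$$f(x - \eta f'(x)) \lesssim f(x)$$
which gives the update rule:
$$x \leftarrow x - \eta f'(x)$$
e.g.
$$f(x) = x^2$$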
In [2]:
def f(x):
    return x**2 # Objective function
def gradf(x):
    return 2 * x # Its derivative
def gd(eta):
    x = 10
    results = [x]
    for i in range(10):
        x -= eta * gradf(x)
        results.append(x)
    print('epoch 10, x:', x)
    return results
res = gd(0.2)
epoch 10, x: 0.06046617599999997
In [3]:
def show_trace(res):
    n = max(abs(min(res)), abs(max(res)))
    f_line = np.arange(-n, n, 0.01)
    d2l.set_figsize((3.5, 2.5))
    d2l.plt.plot(f_line, [f(x) for x in f_line],'-')
    d2l.plt.plot(res, [f(x) for x in res],'-o')
    d2l.plt.xlabel('x')
    d2l.plt.ylabel('f(x)')
show_trace(res)
Learning rate
In [4]:
show_trace(gd(0.05))
epoch 10, x: 3.4867844009999995
In [5]:
show_trace(gd(1.1))
epoch 10, x: 61.917364224000096
Local minima
e.g.
$$f(x) = x \cos(cx)$$
In [6]:
c = 0.15 * np.pi
def f(x):
    return x * np.cos(c * x)
def gradf(x):
    return np.cos(c * x) - c * x * np.sin(c * x)
show_trace(gd(2))
epoch 10, x: -1.528165927635083
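Multi-dimensional gradient descent
$$\nabla f(\mathbf{x}) = \left[\frac{\partial f(\mathbf{x})}{\partial x_1}, \frac{\partial f(\mathbf{x})}{\partial x_2}, \ldots, \frac{\partial f(\mathbf{x})}{\partial x_d}\right]^\top$$
$$f(\mathbf{x} + \epsilon) = f(\mathbf{x}) + \epsilon^\top \nabla f(\mathbf{x}) + O(\|\epsilon\|^2)$$
$$\mathbf{x} \leftarrow \mathbf{x} - \eta \nabla f(\mathbf{x})$$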
In [7]:
def train_2d(trainer, steps=20):
    x1, x2 = -5, -2
    results = [(x1, x2)]
    for i in range(steps):
        x1, x2 = trainer(x1, x2)
        results.append((x1, x2))
    print('epoch %d, x1 %f, x2 %f' % (i + 1, x1, x2))
    return results
def show_trace_2d(f, results):
    d2l.plt.plot(*zip(*results), '-o', color='#ff7f0e')
    x1, x2 = np.meshgrid(np.arange(-5.5, 1.0, 0.1), np.arange(-3.0, 1.0, 0.1))
    d2l.plt.contour(x1, x2, f(x1, x2), colors='#1f77b4')
    d2l.plt.xlabel('x1')
    d2l.plt.ylabel('x2')
$$f(\mathbf{x}) = x_1^2 + 2 x_2^2$$
In [8]:
eta = 0.1
def f_2d(x1, x2):  # objective function
    return x1 ** 2 + 2 * x2 ** 2
def gd_2d(x1, x2):
    return (x1 - eta * 2 * x1, x2 - eta * 4 * x2)
show_trace_2d(f_2d, train_2d(gd_2d))
epoch 20, x1 -0.057646, x2 -0.000073
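Adaptive methods
Newton's method
Taylor expansion at $x + \epsilon$:
$$f(x + \epsilon) = f(x) + \epsilon^\top \nabla f(x) + \frac{1}{2} \epsilon^\top \nabla \nabla^\top f(x) \epsilon + O(\|\epsilon\|^3)$$
A minimum satisfies $\nabla f(x) = 0$, so we want $\nabla f(x + \epsilon) = 0$. Differentiating the expansion above with respect to $\epsilon$ and dropping the higher-order terms gives, with the Hessian $H_f := \nabla \nabla^\top f(x)$:
$$\nabla f(x) + H_f \epsilon = 0 \text{ and hence } \epsilon = -H_f^{-1} \nabla f(x)$$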
In [9]:
c = 0.5
def f(x):
    return np.cosh(c * x) # Objective
def gradf(x):
    return c * np.sinh(c * x) # Derivative
def hessf(x):
    return c**2 * np.cosh(c * x) # Hessian
# Hide learning rate for now
def newton(eta=1):
    x = 10
    results = [x]
    for i in range(10):
        x -= eta * gradf(x) / hessf(x)
        results.append(x)
    print('epoch 10, x:', x)
    return results
show_trace(newton())
epoch 10, x: 0.0
In [10]:
c = 0.15 * np.pi
def f(x):
    return x * np.cos(c * x)
def gradf(x):
    return np.cos(c * x) - c * x * np.sin(c * x)
def hessf(x):
    return - 2 * c * np.sin(c * x) - x * c**2 * np.cos(c * x)
show_trace(newton())
epoch 10, x: 26.83413291324767
In [11]:
show_trace(newton(0.5))
epoch 10, x: 7.269860168684531
Convergence analysis
We only consider the convergence rate when the function is convex and $f''(x^*) > 0$ at the minimum:
Let $x_k$ be the value of $x$ after the $k$-th iteration, and let $e_k := x_k - x^*$ denote the distance from $x_k$ to the minimum $x^*$. From $f'(x^*) = 0$:
$$0 = f'(x_k - e_k) = f'(x_k) - e_k f''(x_k) + \frac{1}{2} e_k^2 f'''(\xi_k) \text{ for some } \xi_k \in [x_k - e_k, x_k]$$
Dividing both sides by $f''(x_k)$ gives:
$$e_k - f'(x_k) / f''(x_k) = \frac{1}{2} e_k^2 f'''(\xi_k) / f''(x_k)$$
Substituting the update equation $x_{k+1} = x_k - f'(x_k) / f''(x_k)$ yields:
$$x_k - x^* - f'(x_k) / f''(x_k) = \frac{1}{2} e_k^2 f'''(\xi_k) / f''(x_k)$$
$$x_{k+1} - x^* = e_{k+1} = \frac{1}{2} e_k^2 f'''(\xi_k) / f''(x_k)$$
When $\frac{1}{2} f'''(\xi_k) / f''(x_k) \le c$, we have:
$$e_{k+1} \le c e_k^2$$
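The bound can be observed numerically. The sketch below runs Newton's method on $f(x) = \cosh(cx)$ with $c = 0.5$, the objective used in the Newton example of these notes, whose minimum is at $x^* = 0$. The starting point $x_0 = 1$ and the helper name `newton_errors` are assumptions chosen so that the iteration starts inside the fast-convergence regime.

```python
import math

c = 0.5

def newton_errors(x0, steps):
    """Run Newton's method on f(x) = cosh(c x) and record |x_k - x*| with x* = 0."""
    x, errors = x0, [abs(x0)]
    for _ in range(steps):
        grad = c * math.sinh(c * x)          # f'(x)
        hess = c ** 2 * math.cosh(c * x)     # f''(x)
        x -= grad / hess                     # Newton update
        errors.append(abs(x))
    return errors

errors = newton_errors(1.0, 3)
for k, e in enumerate(errors):
    print(k, e)
```

Each error is at most a constant times the square of the previous one, so the number of correct digits at least doubles per step. For this particular function the decay is even faster, because $f'''(x^*) = 0$.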
Preconditioning (gradient descent assisted by the Hessian matrix)
$$x \leftarrow x - \eta\, \text{diag}(H_f)^{-1} \nabla f(x)$$
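As a small sketch of the preconditioned update, consider $f(x_1, x_2) = x_1^2 + 2 x_2^2$, the two-dimensional objective used elsewhere in these notes. Its Hessian is diagonal with entries 2 and 4, so dividing each gradient component by the matching diagonal entry gives both coordinates the same effective step size. The function names and the choice $\eta = 0.5$ are assumptions for illustration.

```python
def grad(x1, x2):
    return 2 * x1, 4 * x2            # gradient of x1^2 + 2 * x2^2

def hess_diag(x1, x2):
    return 2.0, 4.0                  # diagonal of the Hessian (constant here)

def preconditioned_gd(x1, x2, eta=0.5, steps=20):
    for _ in range(steps):
        g1, g2 = grad(x1, x2)
        h1, h2 = hess_diag(x1, x2)
        # Rescale each gradient component by the inverse diagonal Hessian entry.
        x1, x2 = x1 - eta * g1 / h1, x2 - eta * g2 / h2
    return x1, x2

x1, x2 = preconditioned_gd(-5.0, -2.0)
print(x1, x2)
```

With this rescaling every step multiplies both coordinates by the same factor $1 - \eta$, whereas plain gradient descent with one shared learning rate shrinks the two coordinates at different rates.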
Gradient descent with line search (conjugate gradient method)
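Stochastic Gradient Descent
Parameter update in stochastic gradient descent
For a training dataset with $n$ samples, let $f_i(x)$ be the loss function of the $i$-th sample. The objective function is then:
$$f(x) = \frac{1}{n} \sum_{i=1}^{n} f_i(x)$$
Its gradient is:
$$\nabla f(x) = \frac{1}{n} \sum_{i=1}^{n} \nabla f_i(x)$$
A single update using this gradient has time complexity $O(n)$.
The stochastic gradient descent update, which costs $O(1)$ per step, uses one sample index $i$ drawn uniformly at random:
$$x \leftarrow x - \eta \nabla f_i(x)$$
and the stochastic gradient is an unbiased estimate of the full gradient:
$$E_i \nabla f_i(x) = \frac{1}{n} \sum_{i=1}^{n} \nabla f_i(x) = \nabla f(x)$$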
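The unbiasedness property can be checked on a toy problem. In the sketch below the per-sample loss $f_i(x) = \frac{1}{2}(x - a_i)^2$ and the data values are assumptions invented for illustration. Averaging many single-sample gradients, each drawn with a uniformly random index, should approach the full gradient.

```python
import random

data = [1.0, 2.0, 4.0, 9.0]            # hypothetical samples a_i

def grad_i(x, a):
    return x - a                       # gradient of f_i(x) = 0.5 * (x - a)^2

def full_grad(x):
    return sum(grad_i(x, a) for a in data) / len(data)   # O(n) per evaluation

x = 0.5
random.seed(0)
# Each stochastic gradient touches a single sample: O(1) per evaluation.
samples = [grad_i(x, random.choice(data)) for _ in range(20000)]
estimate = sum(samples) / len(samples)

print(full_grad(x), estimate)
```

The individual stochastic gradients are noisy, but their average matches the full gradient up to sampling error.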
e.g.
$$f(x_1, x_2) = x_1^2 + 2 x_2^2$$
In [12]:
def f(x1, x2):
    return x1 ** 2 + 2 * x2 ** 2 # Objective
def gradf(x1, x2):
    return (2 * x1, 4 * x2) # Gradient
def sgd(x1, x2): # Simulate noisy gradient
    global lr # Learning rate scheduler
    (g1, g2) = gradf(x1, x2) # Compute gradient
    # np.random.normal(0.1) draws noise with mean 0.1 and standard deviation 1 (loc=0.1, default scale)
    (g1, g2) = (g1 + np.random.normal(0.1), g2 + np.random.normal(0.1))
    eta_t = eta * lr() # Learning rate at time t
    return (x1 - eta_t * g1, x2 - eta_t * g2) # Update variables
eta = 0.1
lr = (lambda: 1) # Constant learning rate
show_trace_2d(f, train_2d(sgd, steps=50))
epoch 50, x1 -0.027566, x2 0.137605
Dynamic learning rate
$$\eta(t) = \eta_i \text{ if } t_i \le t \le t_{i+1} \quad \text{(piecewise constant)}$$
$$\eta(t) = \eta_0 \cdot e^{-\lambda t} \quad \text{(exponential)}$$
$$\eta(t) = \eta_0 \cdot (\beta t + 1)^{-\alpha} \quad \text{(polynomial)}$$
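The three schedules can be written as plain functions of the step counter $t$. The sketch below is one possible reading of the formulas. The function names, the boundaries, and all constants are assumptions for illustration, and the functions take $t$ explicitly instead of relying on a global counter.

```python
import math

def piecewise_constant_lr(t, boundaries=(10, 20), values=(0.1, 0.01, 0.001)):
    """eta(t) = eta_i on the i-th interval; boundaries and values are illustrative."""
    for boundary, value in zip(boundaries, values):
        if t < boundary:
            return value
    return values[-1]

def exponential_lr(t, eta0=0.1, lam=0.1):
    """eta(t) = eta0 * exp(-lambda * t)"""
    return eta0 * math.exp(-lam * t)

def polynomial_lr(t, eta0=0.1, beta=0.1, alpha=0.5):
    """eta(t) = eta0 * (beta * t + 1) ** (-alpha)"""
    return eta0 * (beta * t + 1) ** (-alpha)

for t in (0, 10, 30):
    print(t, piecewise_constant_lr(t), exponential_lr(t), polynomial_lr(t))
```

The exponential schedule decays fastest and can stop making progress before the optimum is reached; the polynomial schedule decays more slowly.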
In [13]:
def exponential():
    global ctr
    ctr += 1
    return math.exp(-0.1 * ctr)
ctr = 1
lr = exponential # Set up learning rate
show_trace_2d(f, train_2d(sgd, steps=1000))
epoch 1000, x1 -0.677947, x2 -0.089379
In [14]:
def polynomial():
    global ctr
    ctr += 1
    return (1 + 0.1 * ctr)**(-0.5)
ctr = 1
lr =