Background
After AlexNet applied deep learning to the ImageNet image-classification challenge in 2012 and achieved striking state-of-the-art results, many groups followed up with their own attempts and improvements. Two examples of performance gains are worth mentioning first:
- Smaller convolutions: using a smaller kernel and a smaller convolution stride in the first convolutional layer (Zeiler & Fergus, 2013; Sermanet et al., 2014)
- Multi-scale: training and testing over the whole image at several scales (Sermanet et al., 2014; Howard, 2014)
Neither of these works touched on depth. Inspired by them, the VGG authors not only applied both techniques to their own network design and training/testing procedure, but also set out to test how depth affects the results.
The role of small convolutions
Computational cost
Consider conv3x3, conv5x5, conv7x7, conv9x9 and conv11x11 applied to a 224x224x3 RGB image (pad=1, stride=4, output_channel=96). The parameter counts of the convolutional layer and the sizes of the resulting feature maps are as follows:
According to the formula, for an input of size $W \times W$ the output size is $(W - \text{kernel\_size} + 2 \times \text{padding}) / \text{stride} + 1$, which yields the figure above:
- From the figure above, large kernels do not bring a large number of kernel or feature-map parameters. Whether we look at the kernel parameters or the feature-map elements alone, their sum stays around 300k for every kernel size; in other words, kernel size has little effect on the parameter count, which is roughly the same across kernels.
- What does grow is the computation of the convolution. The figure lists the formula for the computation; the final factor of 2 accounts for the multiply-add operations. To keep the comparison consistent, all kernels here use stride 4. The computational scale of conv3x3, conv5x5, conv7x7, conv9x9 and conv11x11 grows in turn, on the order of 16 million, 45 million, 140 million and 200 million operations (see the sketch after this list): at this scale, even though the parameter count grows little, the amount of computation is striking.
To sum up, two conclusions can be drawn:
- At the same stride, the feature-map and convolution parameter counts differ little across kernel sizes;
- The larger the kernel, the larger the computational cost.
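As a minimal sketch of the numbers above (the helper `conv_cost` is hypothetical, not from the original post), the output size, parameter count, feature-map size and multiply-add count can be reproduced for each kernel; the 224x224x3 input, pad=1, stride=4 and 96 output channels follow the setup stated earlier.

# Rough cost estimate for a single conv layer, following the setup in the text:
# 224x224x3 input, pad=1, stride=4, 96 output channels.
def conv_cost(w=224, in_c=3, out_c=96, kernel=3, pad=1, stride=4):
    out_w = (w - kernel + 2 * pad) // stride + 1   # output spatial size
    params = out_c * in_c * kernel * kernel        # kernel parameters (bias ignored)
    feature_map = out_c * out_w * out_w            # feature-map elements
    macs = 2 * params * out_w * out_w              # x2 for multiply-add, as in the text
    return out_w, params, feature_map, macs

for k in (3, 5, 7, 9, 11):
    out_w, params, fmap, macs = conv_cost(kernel=k)
    print(f"conv{k}x{k}: out {out_w}x{out_w}, params {params}, feature map {fmap}, ops {macs:,}")

Running this reproduces the trend described above: the sum of kernel and feature-map parameters stays near 300k for every kernel size, while the operation count climbs from about 16 million for conv3x3 to about 200 million for conv11x11.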
Suppose the network input consists of 8 neurons, and the three networks below correspond to stride=1, pad=0 convolutions: three layers of conv3x3, one layer of conv5x5, and one layer of conv7x7. Since all three networks share the same input of 8, one can see that stacking two 3x3 convolutions gives the same receptive field as a single 5x5 convolution, and stacking three 3x3 convolutions gives the receptive field of a single 7x7 convolution, as shown below.
- Two 3x3 convolutions are equivalent to one 5x5
- Three 3x3 convolutions are equivalent to one 7x7
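A quick way to check this equivalence (a sketch, not from the original post; `out_len` is a made-up helper) is to compute the output length of a 1-D chain of valid convolutions with stride=1 and pad=0: stacked small kernels shrink the 8-element input exactly as a single large kernel would.

# Output length of a 1-D input after a chain of valid (pad=0, stride=1) convolutions.
def out_len(n, kernels):
    for k in kernels:
        n = n - k + 1
    return n

print(out_len(8, [3, 3]))     # 4, same as one 5x5 kernel
print(out_len(8, [5]))        # 4
print(out_len(8, [3, 3, 3]))  # 2, same as one 7x7 kernel
print(out_len(8, [7]))        # 2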
Drawing on material from around the web, the advantages of small kernels over large ones can be summarized as follows:
- More activation functions, richer features, stronger discriminative power. Every convolution is followed by an activation function, so using more (small) convolutions makes the decision function more discriminative;
- Fewer convolutional parameters. For example, with $C$ input channels and $C$ output channels, three stacked conv3x3 layers need $3 \times (C \times 3 \times 3 \times C) = 27C^2$ parameters, while a single conv7x7 layer needs $C \times 7 \times 7 \times C = 49C^2$ (a quick check follows this list);
- Replacing a large kernel with small ones brings a performance gain. The authors replace one conv7x7 with three conv3x3 layers and argue that this further decomposes the features that a single 7x7 kernel would extract.
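As a minimal check of the $27C^2$ versus $49C^2$ claim (a sketch; the channel count C=64 is chosen arbitrarily), the weights of the two options can be counted directly in PyTorch:

import torch.nn as nn

C = 64  # arbitrary channel count for illustration

# Three stacked 3x3 convolutions (bias omitted to match the 27*C^2 formula).
stacked = nn.Sequential(*[nn.Conv2d(C, C, kernel_size=3, padding=1, bias=False) for _ in range(3)])
# A single 7x7 convolution with the same receptive field.
single = nn.Conv2d(C, C, kernel_size=7, padding=3, bias=False)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(stacked), 27 * C * C)  # 110592 110592
print(count(single), 49 * C * C)   # 200704 200704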
The VGG model
The VGG framework:
The different VGG variants:
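For reference (a summary of the configurations in the VGG paper, since the original figures are not reproduced here): every variant uses 3x3 convolutions, five max-pooling stages and three fully connected layers, and they differ only in the number of weight layers.

| Model | Paper config | Conv layers | FC layers | Weight layers |
| --- | --- | --- | --- | --- |
| VGG11 | A | 8 | 3 | 11 |
| VGG13 | B | 10 | 3 | 13 |
| VGG16 | D | 13 | 3 | 16 |
| VGG19 | E | 16 | 3 | 19 |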
Code reproduction
Custom VGG model
# Import modules
import torch
import torch.nn as nn
from torch.utils.tensorboard import SummaryWriter
import torchvision
import torchvision.transforms as transforms
import time
from torch import nn, optim
import torch.nn.functional as F
device = 'cuda' if torch.cuda.is_available() else 'cpu'
transform_train = transforms.Compose([
transforms.RandomCrop(32, padding=4),
transforms.RandomHorizontalFlip(),
transforms.ToTensor(),
transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010)),
])
transform_test = transforms.Compose([
transforms.ToTensor(),
transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010)),
])
trainset = torchvision.datasets.CIFAR10(root='./data_cifar10', train=True, download=False, transform=transform_train)
train_loader = torch.utils.data.DataLoader(trainset, batch_size=64, shuffle=True)
testset = torchvision.datasets.CIFAR10(root='./data_cifar10', train=False, download=False, transform=transform_test)
test_loader = torch.utils.data.DataLoader(testset, batch_size=64, shuffle=False)
cfg = {
'VGG11': [64, 'M', 128, 'M', 256, 256, 'M', 512, 512, 'M', 512, 512, 'M'],
'VGG13': [64, 64, 'M', 128, 128, 'M', 256, 256, 'M', 512, 512, 'M', 512, 512, 'M'],
'VGG16': [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'M', 512, 512, 512, 'M', 512, 512, 512, 'M'],
'VGG19': [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 256, 'M', 512, 512, 512, 512, 'M', 512, 512, 512, 512, 'M'],
}
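# In the cfg lists above, each number is the output channel count of a 3x3 convolution
# and 'M' marks a 2x2 max-pooling layer (see _make_layers below).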
class VGG(nn.Module):
    def __init__(self, vgg_name, num_classes=10):
        super(VGG, self).__init__()
        self.features = self._make_layers(cfg[vgg_name])
        self.classifier = nn.Linear(512, num_classes)

    def forward(self, x):
        out = self.features(x)
        out = out.view(out.size(0), -1)  # flatten the feature map into a vector
        out = self.classifier(out)
        return out

    def _make_layers(self, cfg):
        layers = []
        in_channels = 3
        for x in cfg:
            if x == 'M':
                layers += [nn.MaxPool2d(kernel_size=2, stride=2)]
            else:
                layers += [nn.Conv2d(in_channels, x, kernel_size=3, padding=1),
                           nn.BatchNorm2d(x),
                           nn.ReLU(inplace=True)]
                in_channels = x
        layers += [nn.AvgPool2d(kernel_size=1, stride=1)]
        return nn.Sequential(*layers)
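# Sanity check (an illustrative note, not in the original script): with 32x32 CIFAR-10
# inputs, the five 2x2 max-pools reduce the spatial size to 1x1, so the flattened
# feature vector has 512 elements, matching nn.Linear(512, num_classes) above.
# e.g. VGG('VGG16')(torch.randn(2, 3, 32, 32)).shape -> torch.Size([2, 10])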
## Training
# Instantiate the network
net = VGG(vgg_name='VGG16', num_classes=10).to(device)
# net = torch.load('./vgg16_net.pkl').cuda()
# Optimizer
optimizer = optim.Adam(net.parameters())
# Loss function
criterion = torch.nn.CrossEntropyLoss()
# Training parameters
num_epochs= 30
# TensorBoard logging
writer = SummaryWriter('runs/example_demo3')
# Training loop
def train(net, train_loader, optimizer, criterion, device, num_epochs):
    net.train()
    for epoch in range(num_epochs):
        train_loss = 0
        correct = 0
        total = 0
        for batch_idx, (inputs, targets) in enumerate(train_loader):
            inputs, targets = inputs.to(device), targets.to(device)
            optimizer.zero_grad()
            outputs = net(inputs)
            loss = criterion(outputs, targets)
            loss.backward()
            optimizer.step()
            train_loss += loss.item()
            _, predicted = outputs.max(1)
            total += targets.size(0)
            correct += predicted.eq(targets).sum().item()
        print('epoch %d: train acc %.4f' % (epoch + 1, correct / total))
        for name, param in net.named_parameters():
            writer.add_histogram(name, param.clone().cpu().data.numpy(), epoch + 1)
        writer.add_scalar('train_acc', correct / total, epoch + 1)
        writer.add_scalar('train_loss', train_loss / (batch_idx + 1), epoch + 1)
train(net, train_loader, optimizer, criterion, device, num_epochs)
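The script above defines test_loader but never uses it, and a commented-out line loads './vgg16_net.pkl'. A minimal sketch of the missing pieces, evaluating on the test set and saving the trained weights (the file name simply reuses the path from that commented line), could look like this:

# Evaluate the trained network on the CIFAR-10 test set.
net.eval()
correct, total = 0, 0
with torch.no_grad():
    for inputs, targets in test_loader:
        inputs, targets = inputs.to(device), targets.to(device)
        _, predicted = net(inputs).max(1)
        total += targets.size(0)
        correct += predicted.eq(targets).sum().item()
print('test acc: %.4f' % (correct / total))

# Save the whole model, matching the commented-out torch.load('./vgg16_net.pkl') above.
torch.save(net, './vgg16_net.pkl')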
Training with the torchvision VGG model directly
import torchvision
import torch
from torch import nn, optim
device = 'cuda' if torch.cuda.is_available() else 'cpu'
from torch.utils.tensorboard import SummaryWriter
import torchvision.models as models
from torchvision import datasets, transforms
# Load the datasets
transform_train = transforms.Compose([
transforms.RandomCrop(32, padding=4),
transforms.RandomHorizontalFlip(),
transforms.ToTensor(),
transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010)),
])
transform_test = transforms.Compose([
transforms.ToTensor(),
transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010)),
])
trainset = datasets.CIFAR10(root='./data_cifar10', train=True, download=False, transform=transform_train)
train_loader = torch.utils.data.DataLoader(trainset, batch_size=128, shuffle=True)
testset = datasets.CIFAR10(root='./data_cifar10', train=False, download=False, transform=transform_test)
test_loader = torch.utils.data.DataLoader(testset, batch_size=100, shuffle=False)
# Training parameters
num_epochs= 30
# Model
vgg16 = models.vgg16(pretrained=False, num_classes = 10).to(device)
# Optimizer
optimizer = optim.Adam(vgg16.parameters())
# Loss function
criterion = torch.nn.CrossEntropyLoss()
writer = SummaryWriter('runs/example1')
# Training and evaluation
def evaluate_accuracy(test_loader, net, device, criterion):
    net.eval()  # switch BatchNorm/Dropout to eval mode
    test_loss = 0
    correct_test = 0
    total_test = 0
    with torch.no_grad():
        for batch_idx, (inputs, targets) in enumerate(test_loader):
            inputs, targets = inputs.to(device), targets.to(device)
            outputs = net(inputs)
            loss = criterion(outputs, targets)
            test_loss += loss.item()
            _, predicted = outputs.max(1)
            total_test += targets.size(0)
            correct_test += predicted.eq(targets).sum().item()
    return correct_test / total_test, test_loss / (batch_idx + 1)
def train(net, train_loader, test_loader, optimizer, criterion, device, num_epochs):
    for epoch in range(num_epochs):
        net.train()  # back to training mode after the previous evaluation
        train_loss = 0
        correct = 0
        total = 0
        for batch_idx, (inputs, targets) in enumerate(train_loader):
            inputs, targets = inputs.to(device), targets.to(device)
            optimizer.zero_grad()
            outputs = net(inputs)
            loss = criterion(outputs, targets)
            loss.backward()
            optimizer.step()
            train_loss += loss.item()
            _, predicted = outputs.max(1)
            total += targets.size(0)
            correct += predicted.eq(targets).sum().item()
        # for name, param in net.named_parameters():
        #     writer.add_histogram(name, param.clone().cpu().data.numpy(), epoch + 1)
        test_acc, test_loss = evaluate_accuracy(test_loader, net, device, criterion)
        print('epoch %d: train loss %.4f, test acc %.4f, test loss %.4f'
              % (epoch + 1, train_loss / (batch_idx + 1), test_acc, test_loss))
        writer.add_scalar('train_acc', correct / total, epoch + 1)
        writer.add_scalar('train_loss', train_loss / (batch_idx + 1), epoch + 1)
        writer.add_scalar('test_acc', test_acc, epoch + 1)
        writer.add_scalar('test_loss', test_loss, epoch + 1)
train(vgg16, train_loader, test_loader, optimizer, criterion, device, num_epochs)
Results
TensorBoard plots of the training loss and training accuracy
- Results on the training set
- Results on the test set
Analysis:
- After 30 epochs the accuracy on the training set already reaches 94%, while the test accuracy hovers around 0.88; with more epochs the training accuracy keeps rising, but the test accuracy barely improves.
Improvements
- Tried batch_size = 128 and 256, but neither noticeably improved the test accuracy;
- Tried VGG19: training accuracy becomes very high, but test accuracy still does not go up (one further option is sketched below).
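One direction not tried above is to follow the optimizer settings of the original VGG paper instead of plain Adam: SGD with momentum 0.9, weight decay 5e-4 and a learning rate that is lowered as training progresses. A sketch, where `net` stands for whichever model is being trained and the StepLR schedule is an assumption (the paper drops the rate when validation accuracy stops improving):

from torch import optim

# SGD with momentum and weight decay, following the settings reported in the VGG paper.
optimizer = optim.SGD(net.parameters(), lr=0.01, momentum=0.9, weight_decay=5e-4)
# Decay the learning rate by 10x every 10 epochs (an assumed schedule).
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)
# In the training loop, call scheduler.step() once per epoch after the optimizer updates.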
Summary
- Compared with AlexNet, the small-convolution approach used by VGG does improve model performance;
- With 16-19 layers, VGG generalizes well and can be applied to a broader range of datasets;
- In follow-up work, optimization methods could be applied to the choice of hyperparameters, which might further improve performance;
- On the coding side, unlike the earlier ResNet reproduction built from ResNet blocks, this implementation loops over a predefined list to add convolutional layers one by one, which is a useful trick to take away from the coding process.