Proposed in 2017; CVPR 2017 best paper (oral)
DenseNet: Dense Convolutional Network
DenseNet is mainly compared against ResNet and Inception; it borrows ideas from both, yet it is a genuinely new architecture. The structure is not complicated, but it is very effective.
As is well known, over the last couple of years the main ways to improve convolutional networks have been to go deeper (e.g. ResNet, which addresses vanishing gradients in very deep networks) or wider (e.g. GoogLeNet's Inception modules). DenseNet instead starts from the features themselves: by reusing features to the fullest, it achieves better results with fewer parameters.
DenseNet differs from ResNet in how layers are combined: ResNet sums feature maps across layers element-wise, whereas DenseNet concatenates them along the channel dimension.
Experimentally, DenseNet outperforms ResNet.
Several advantages of DenseNet:
- Alleviates the vanishing-gradient problem
- Strengthens feature propagation
- Encourages feature reuse
- Reduces the number of parameters to some extent
In a traditional convolutional network with L layers there are L connections, but in DenseNet there are L(L+1)/2 connections. Simply put, each layer's input is the concatenation of the outputs of all preceding layers. As the figure below shows: x0 is the input; H1 takes x0; H2 takes x0 and x1 (x1 is H1's output); H3 takes x0, x1, and x2; and so on.
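The L(L+1)/2 count is easy to verify: layer l (1-indexed) receives l inputs (x0 through x_{l-1}), so summing over all L layers gives 1 + 2 + ... + L. A quick sanity check (the function name here is my own, purely for illustration):

```python
# Layer l (1-indexed) in a dense block takes the l earlier feature maps
# (x0, ..., x_{l-1}) as input, so the total number of direct connections
# is the sum 1 + 2 + ... + L.
def dense_connections(L):
    return sum(l for l in range(1, L + 1))

for L in (1, 3, 5, 100):
    assert dense_connections(L) == L * (L + 1) // 2

print(dense_connections(5))  # 15
```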
One advantage of DenseNet is that the network is narrower and has fewer parameters, largely thanks to the dense-block design: each convolutional layer in a dense block produces a small number of output feature maps (fewer than 100), rather than the hundreds or thousands of channels common in other networks. At the same time, this connectivity pattern makes feature and gradient propagation more effective, so the network is easier to train. One sentence from the paper sums it up well: "Each layer has direct access to the gradients from the loss function and the original input signal, leading to an implicit deep supervision." This directly explains why the network works so well. As mentioned above, vanishing gradients become more likely as depth grows, because input and gradient signals must pass through many layers; with dense connections, each layer is effectively wired directly to the input and the loss, which mitigates vanishing gradients and makes deeper networks feasible.
The architecture diagram contains 3 dense blocks. The authors split DenseNet into multiple dense blocks so that all feature maps within a block share the same spatial size; concatenation then raises no size issues.
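A quick way to see why feature maps within a block must share the same spatial size is to try concatenating mismatched arrays; the sketch below uses NumPy zeros as stand-ins for NCHW feature maps:

```python
import numpy as np

# Two "feature maps" in NCHW layout with the same 8x8 spatial size:
# channel-wise concatenation works and the channel counts simply add.
a = np.zeros((4, 3, 8, 8))
b = np.zeros((4, 10, 8, 8))
print(np.concatenate((a, b), axis=1).shape)  # (4, 13, 8, 8)

# With different spatial sizes (8x8 vs 4x4), concatenation along the channel
# axis fails -- hence one dense block per spatial resolution, with transition
# layers doing the downsampling in between.
c = np.zeros((4, 10, 4, 4))
try:
    np.concatenate((a, c), axis=1)
except ValueError as e:
    print("cannot concatenate:", e)
```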
DenseNet's main building blocks are the dense block and the transition layer. The former defines how inputs and outputs are concatenated; the latter controls the number of channels so that it does not grow too large.
Network architecture
A dense block consists of multiple conv_blocks, each with the same number of output channels (the output channel count is identical for every conv_block, while the input channel count grows layer by layer). In the forward pass, each conv_block's input and output are concatenated along the channel dimension.
The outputs of earlier layers are stacked onto the channel dimension of the current layer's input. The number of output channels of each convolution in a dense block is called the growth_rate: with in_channel input channels and n layers, the output has in_channel + n * growth_rate channels.
```python
import torch
from torch import nn


# BN can go either before or after the conv layer; it is usually placed after,
# but here it is placed before (pre-activation, as in the paper)
def conv_block(in_channel, out_channel):
    layer = nn.Sequential(
        nn.BatchNorm2d(in_channel),
        nn.ReLU(True),
        nn.Conv2d(in_channel, out_channel, kernel_size=3, padding=1, bias=False)
    )
    return layer


# A dense block consists of multiple conv_blocks with the same number of output
# channels. In the forward pass, each conv_block's input and output are
# concatenated along the channel dimension.
class dense_block(nn.Module):
    # growth_rate is the output channel count of each conv_block
    def __init__(self, in_channel, growth_rate, num_layers):
        super(dense_block, self).__init__()
        block = []
        channel = in_channel
        for i in range(num_layers):
            block.append(conv_block(in_channel=channel, out_channel=growth_rate))
            channel += growth_rate
        self.net = nn.Sequential(*block)

    def forward(self, x):
        for layer in self.net:
            out = layer(x)
            x = torch.cat((out, x), dim=1)
        return x


blk = dense_block(in_channel=3, growth_rate=10, num_layers=4)
X = torch.rand(4, 3, 8, 8)
Y = blk(X)
print(Y.shape)  # torch.Size([4, 43, 8, 8])
```
In this example we define a dense block with 4 conv_blocks, each with 10 output channels. With a 3-channel input, we get a 3 + 4*10 = 43-channel output. The conv_block's channel count controls how fast the output channels grow relative to the input, hence the name growth rate.
Transition block: since every dense block increases the channel count, stacking too many of them yields an overly complex model. The transition layer controls model complexity: a 1×1 convolutional layer reduces the number of channels, and an average-pooling layer with stride 2 halves the height and width.
```python
# continuing from the previous snippet (conv_block, dense_block, and Y defined above)
def transition_block(in_channel, out_channel):
    trans_layer = nn.Sequential(
        nn.BatchNorm2d(in_channel),
        nn.ReLU(True),
        nn.Conv2d(in_channel, out_channel, 1),
        nn.AvgPool2d(2, 2)
    )
    return trans_layer


blk = transition_block(in_channel=43, out_channel=10)
print(blk(Y).shape)  # torch.Size([4, 10, 4, 4])
```
Applying a transition layer with 10 output channels to the dense block's output from the previous example reduces the channel count to 10 and halves the height and width.
Building the DenseNet model
DenseNet starts with the same single convolutional layer and max-pooling layer as ResNet.
The input size is 96×96×3; the CIFAR10 images are resized to 96.
```python
import os
from datetime import datetime

import torch
import torchvision
from torch import nn, optim


# pre-activation conv block: BN -> ReLU -> 3x3 conv
def conv_block(in_channel, out_channel):
    layer = nn.Sequential(
        nn.BatchNorm2d(in_channel),
        nn.ReLU(True),
        nn.Conv2d(in_channel, out_channel, kernel_size=3, padding=1, bias=False)
    )
    return layer


class dense_block(nn.Module):
    # growth_rate is the output channel count of each conv_block
    def __init__(self, in_channel, growth_rate, num_layers):
        super(dense_block, self).__init__()
        block = []
        channel = in_channel
        for i in range(num_layers):
            block.append(conv_block(in_channel=channel, out_channel=growth_rate))
            channel += growth_rate
        self.net = nn.Sequential(*block)

    def forward(self, x):
        for layer in self.net:
            out = layer(x)
            x = torch.cat((out, x), dim=1)
        return x


# 1x1 conv shrinks the channel count; stride-2 average pooling halves H and W
def transition_block(in_channel, out_channel):
    trans_layer = nn.Sequential(
        nn.BatchNorm2d(in_channel),
        nn.ReLU(True),
        nn.Conv2d(in_channel, out_channel, 1),
        nn.AvgPool2d(2, 2)
    )
    return trans_layer


class DenseNet(nn.Module):
    def __init__(self, in_channel, num_classes=10, growth_rate=32, block_layers=[6, 12, 24, 16]):
        super(DenseNet, self).__init__()
        # stem: the same single conv + max pool as ResNet
        self.block1 = nn.Sequential(
            nn.Conv2d(in_channels=in_channel, out_channels=64, kernel_size=7, stride=2, padding=3),
            nn.BatchNorm2d(64),
            nn.ReLU(True),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1)
        )
        channels = 64
        block = []
        for i, layers in enumerate(block_layers):
            block.append(dense_block(channels, growth_rate, layers))
            channels += layers * growth_rate
            if i != len(block_layers) - 1:
                # transition layer halves the spatial size and the channel count
                block.append(transition_block(channels, channels // 2))
                channels = channels // 2
        self.block2 = nn.Sequential(*block)
        self.block2.add_module('bn', nn.BatchNorm2d(channels))
        self.block2.add_module('relu', nn.ReLU(True))
        self.block2.add_module('avg_pool', nn.AvgPool2d(3))
        self.classifier = nn.Linear(channels, num_classes)

    def forward(self, x):
        x = self.block1(x)
        x = self.block2(x)
        x = x.view(x.shape[0], -1)
        x = self.classifier(x)
        return x


def get_acc(output, label):
    total = output.shape[0]
    # output holds per-class scores; the argmax of each row is the prediction
    _, pred_label = output.max(1)
    num_correct = (pred_label == label).sum().item()
    return num_correct / total


batch_size = 32
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

transform = torchvision.transforms.Compose([
    torchvision.transforms.Resize(size=96),
    torchvision.transforms.ToTensor()
])
train_set = torchvision.datasets.CIFAR10(
    root='dataset/', train=True, download=True, transform=transform
)
# hold-out split: 40000 for training, 10000 for validation
train_set, val_set = torch.utils.data.random_split(train_set, [40000, 10000])
test_set = torchvision.datasets.CIFAR10(
    root='dataset/', train=False, download=True, transform=transform
)

train_loader = torch.utils.data.DataLoader(dataset=train_set, batch_size=batch_size, shuffle=True)
val_loader = torch.utils.data.DataLoader(dataset=val_set, batch_size=batch_size, shuffle=True)
test_loader = torch.utils.data.DataLoader(dataset=test_set, batch_size=batch_size, shuffle=False)

net = DenseNet(in_channel=3, num_classes=10)
lr = 1e-2
optimizer = optim.SGD(net.parameters(), lr=lr)
criterion = nn.CrossEntropyLoss()
net = net.to(device)

prev_time = datetime.now()
valid_data = val_loader
for epoch in range(3):
    train_loss = 0
    train_acc = 0
    net.train()
    for inputs, labels in train_loader:
        inputs = inputs.to(device)
        labels = labels.to(device)
        # forward
        outputs = net(inputs)
        loss = criterion(outputs, labels)
        # backward
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        train_loss += loss.item()
        train_acc += get_acc(outputs, labels)  # averaged over batches below

    # elapsed time
    cur_time = datetime.now()
    h, remainder = divmod((cur_time - prev_time).seconds, 3600)
    m, s = divmod(remainder, 60)
    time_str = 'Time %02d:%02d:%02d (from %02d/%02d/%02d %02d:%02d:%02d to %02d/%02d/%02d %02d:%02d:%02d)' % (
        h, m, s,
        prev_time.year, prev_time.month, prev_time.day,
        prev_time.hour, prev_time.minute, prev_time.second,
        cur_time.year, cur_time.month, cur_time.day,
        cur_time.hour, cur_time.minute, cur_time.second)
    prev_time = cur_time

    # validation
    with torch.no_grad():
        net.eval()
        valid_loss = 0
        valid_acc = 0
        for inputs, labels in valid_data:
            inputs = inputs.to(device)
            labels = labels.to(device)
            outputs = net(inputs)
            loss = criterion(outputs, labels)
            valid_loss += loss.item()
            valid_acc += get_acc(outputs, labels)
    print("Epoch %d. Train Loss: %f, Train Acc: %f, Valid Loss: %f, Valid Acc: %f, " %
          (epoch, train_loss / len(train_loader), train_acc / len(train_loader),
           valid_loss / len(valid_data), valid_acc / len(valid_data)) + time_str)

os.makedirs('checkpoints', exist_ok=True)  # make sure the directory exists before saving
torch.save(net.state_dict(), 'checkpoints/params.pkl')

# test
with torch.no_grad():
    net.eval()
    correct = 0
    total = 0
    for images, labels in test_loader:
        images, labels = images.to(device), labels.to(device)
        output = net(images)
        _, predicted = torch.max(output.data, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()
    print("The accuracy of total {} test images: {}%".format(total, 100 * correct / total))
```
DenseNet comes in 121/169/201/161 variants; what is implemented here is essentially DenseNet-121. The parameterization differs slightly from common implementations: in_channel here is not the usual num_init_features argument; the initial feature count of 64 is hard-coded inside the network.
The first block of a DenseNet is the feature (stem) block, which maps the input image (usually 3 channels) to 64 channels.
The paper proposes three variants: DenseNet, DenseNet-B, and DenseNet-BC. The specific differences are:
DenseNet:
- Dense block module: BN + ReLU + Conv(3×3) + dropout
- Transition layer module: BN + ReLU + Conv(1×1) (number of filters: m) + dropout + Pooling(2×2)
DenseNet-B:
- Dense block module: BN + ReLU + Conv(1×1) (number of filters: 4k) + dropout + BN + ReLU + Conv(3×3) + dropout
- Transition layer module: BN + ReLU + Conv(1×1) (number of filters: m) + dropout + Pooling(2×2)
DenseNet-BC:
- Dense block module: BN + ReLU + Conv(1×1) (number of filters: 4k) + dropout + BN + ReLU + Conv(3×3) + dropout
- Transition layer module: BN + ReLU + Conv(1×1) (number of filters: θm, where 0 < θ < 1; the paper uses θ = 0.5) + dropout + Pooling(2×2)
DenseNet-B adds bottleneck layers to the original DenseNet: a 1×1 convolution inside each dense-block layer first reduces that layer's input feature maps to 4k channels, which greatly cuts computation. This is needed because within a dense block, the number of input feature maps grows with every additional layer.
DenseNet-BC builds on DenseNet-B by adding a compression factor θ in the transition layer; the paper sets θ to 0.5, so the 1×1 convolution halves the number of feature maps coming out of the previous dense block.
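A minimal sketch of these two modifications, matching the conv_block/transition_block style used above (the function names `bottleneck_block` and `compressed_transition` are my own, and dropout is omitted for brevity; this is not the paper's reference code):

```python
import torch
from torch import nn


# DenseNet-B style bottleneck layer: a 1x1 conv first reduces the input to
# 4 * growth_rate channels before the 3x3 conv is applied.
def bottleneck_block(in_channel, growth_rate):
    return nn.Sequential(
        nn.BatchNorm2d(in_channel),
        nn.ReLU(True),
        nn.Conv2d(in_channel, 4 * growth_rate, kernel_size=1, bias=False),
        nn.BatchNorm2d(4 * growth_rate),
        nn.ReLU(True),
        nn.Conv2d(4 * growth_rate, growth_rate, kernel_size=3, padding=1, bias=False),
    )


# DenseNet-BC style transition: the 1x1 conv keeps only theta * m channels
# (theta = 0.5 halves the channel count), then stride-2 pooling halves H and W.
def compressed_transition(in_channel, theta=0.5):
    out_channel = int(in_channel * theta)
    return nn.Sequential(
        nn.BatchNorm2d(in_channel),
        nn.ReLU(True),
        nn.Conv2d(in_channel, out_channel, kernel_size=1),
        nn.AvgPool2d(2, 2),
    )


x = torch.rand(4, 64, 8, 8)
y = bottleneck_block(64, growth_rate=32)(x)
print(y.shape)  # torch.Size([4, 32, 8, 8])
z = compressed_transition(64)(x)
print(z.shape)  # torch.Size([4, 32, 4, 4])
```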