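Introduction
To make full use of pixel-level information, the concept of semantic segmentation was proposed many years ago…FCN can fairly be called a milestone work in semantic segmentation. Let's study this classic paper together and implement it in PyTorch!
Key points of the paper
Main contributions
1. The paper proposes the fully convolutional network (FCN): a network trained end-to-end, pixels-to-pixels, for semantic segmentation, which reached the state of the art at the time.
2. It is the first work to train an FCN end-to-end for pixel-wise prediction.
3. It is also the first to train an FCN from supervised pre-training.
Network architecture design
1. Convolutionalization
Semantic segmentation is essentially image classification, only refined down to every pixel. Based on this idea, the paper replaces the fully connected layers that usually follow a convolutional neural network with convolutional layers, so the features keep their spatial layout, and then classifies each pixel.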
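The equivalence behind convolutionalization can be checked on a toy example. The sketch below is not code from the paper: the layer sizes (8 channels, a 7x7 feature map, 10 classes) are arbitrary assumptions chosen to keep it small. It copies the weights of a fully connected layer into a 7x7 convolution and compares the two.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy classifier head (sizes are illustrative assumptions):
# flatten an 8-channel 7x7 feature map, then apply a fully connected layer.
fc = nn.Linear(8 * 7 * 7, 10)

# The same weights reshaped into a 7x7 convolution with 10 output channels.
conv = nn.Conv2d(8, 10, kernel_size=7)
with torch.no_grad():
    conv.weight.copy_(fc.weight.view(10, 8, 7, 7))
    conv.bias.copy_(fc.bias)

# On a 7x7 input both layers produce the same 10 scores.
x = torch.randn(1, 8, 7, 7)
fc_out = fc(x.flatten(1))        # shape (1, 10)
conv_out = conv(x)               # shape (1, 10, 1, 1)
same = torch.allclose(fc_out, conv_out.flatten(1), atol=1e-4)

# On a larger input the convolution slides over the map and
# returns a grid of scores instead of a single vector.
big = torch.randn(1, 8, 12, 12)
score_map_shape = tuple(conv(big).shape)
```

Here `same` should be `True`, and the 12x12 input yields a 6x6 score map, which is the property FCN relies on to classify many locations in one forward pass.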
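2. Deconvolution
In typical CNN architectures such as AlexNet and VGGNet, pooling layers are used to extract more abstract features, but they also shrink the feature maps; in VGG16, for example, the input is downsampled by a factor of 32 after five poolings. In ResNet, some convolutional layers also take part in the downsampling. What we need is a segmentation map the same size as the original image, so the output of the last layer has to be upsampled back to the input size.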
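The factor of 32 is just five halvings. A minimal sketch of the arithmetic, assuming that only the 2x2, stride-2 pooling layers change the spatial size (the 3x3 convolutions with padding 1 keep it); `size_after_pools` is a hypothetical helper, not part of any library:

```python
import math

def size_after_pools(size, num_pools=5, ceil_mode=True):
    """Side length after `num_pools` 2x2, stride-2 max-pooling layers."""
    for _ in range(num_pools):
        # ceil_mode=True rounds odd sizes up, as nn.MaxPool2d(..., ceil_mode=True) does.
        size = math.ceil(size / 2) if ceil_mode else size // 2
    return size

sizes = {s: size_after_pools(s) for s in (224, 500)}
# sizes == {224: 7, 500: 16}
```

A 224-pixel side shrinks to 7 (exactly 224 / 32), while 500 is not divisible by 32 and ends at 16 because of the rounding.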
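The upsampling method used in the paper: deconvolution (Transposed Convolution)
So what is deconvolution?
First, the concept of upsampling: an operation that maps an image from low resolution to high resolution is called upsampling (Upsample). There are three common methods: bilinear interpolation (bilinear), deconvolution (Transposed Convolution), and unpooling (Unpooling). Here we cover deconvolution, which is what FCN uses.
Intuitively, deconvolution inserts zeros between the elements of the current feature map (the blue blocks in the usual illustration) and then applies an ordinary convolution. From the derivation point of view (see the Alibaba engineer's article "通俗理解反卷积"), a convolution can be written as multiplication by a sparse matrix C, and deconvolution corresponds to multiplication by its transpose C^T.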
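The "insert zeros, then convolve" description can be verified numerically. The sketch below is an illustration under assumed sizes (a single-channel 3x3 map, a 4x4 kernel, stride 2); it compares PyTorch's `F.conv_transpose2d` with a manual zero-insertion followed by a plain convolution with the flipped kernel.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(1, 1, 3, 3)      # small single-channel feature map
w = torch.randn(1, 1, 4, 4)      # 4x4 kernel
stride = 2

# Reference: PyTorch's transposed convolution.
ref = F.conv_transpose2d(x, w, stride=stride)

# Manual step 1: insert (stride - 1) zeros between neighbouring elements.
n = x.shape[-1]
side = (n - 1) * stride + 1
dilated = torch.zeros(1, 1, side, side)
dilated[:, :, ::stride, ::stride] = x

# Manual step 2: pad by (kernel_size - 1) on every side, then run an ordinary
# stride-1 convolution with the kernel flipped in both spatial dimensions.
k = w.shape[-1]
padded = F.pad(dilated, [k - 1] * 4)
manual = F.conv2d(padded, torch.flip(w, dims=[2, 3]))

matches = torch.allclose(ref, manual, atol=1e-5)
out_shape = tuple(ref.shape)     # (3 - 1) * 2 + 4 = 8 per side
```

Both paths give the same 8x8 output, so the transposed convolution is an ordinary convolution over a zero-stuffed input rather than a true inverse of convolution.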
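3. Skip connections (Skip-Connection)
The result obtained after deconvolution is rather coarse: the matrix C^T in deconvolution is sparse, and the final score map sits at only 1/32 of the input resolution, so fine detail cannot be recovered from it alone. To obtain finer results, the paper adopts a skip-connection design.
That is, it combines the outputs of different pooling layers to refine the output; in FCN-8s these are the scores predicted from pool3 and pool4.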
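One fusion step can be sketched as follows. This is a simplified illustration, not the implementation given later in this post: the tensor sizes are assumptions for a 224x224 input without the large padding, and `padding=1` on the transposed convolution is used here so the shapes line up without cropping.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
n_class = 21

# Assumed inputs: coarse class scores at 1/32 resolution and pool4 features at 1/16.
score_fr = torch.randn(1, n_class, 7, 7)
pool4 = torch.randn(1, 512, 14, 14)

# Upsample the coarse scores by 2x: (7 - 1) * 2 - 2 * 1 + 4 = 14.
up2 = nn.ConvTranspose2d(n_class, n_class, kernel_size=4, stride=2,
                         padding=1, bias=False)
# Predict class scores directly from the finer pool4 features with a 1x1 conv.
score_pool4 = nn.Conv2d(512, n_class, kernel_size=1)

# The skip connection is an element-wise sum of the two score maps.
fused = up2(score_fr) + score_pool4(pool4)
fused_shape = tuple(fused.shape)
```

The fused map has 1/16 resolution; repeating the same step with pool3 gives the 1/8-resolution map that FCN-8s finally upsamples by 8x.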
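PyTorch implementation
Following the ideas in the paper, the overall structure of FCN is easy to lay out: take a backbone network, replace its final fully connected layers with convolutional layers, append deconvolution layers at the end, and add skip connections along the forward path. Below we use VGG16 as an example.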
import numpy as np
import torch
import torch.nn as nn


class FCN8s(nn.Module):
    def __init__(self, n_class=21):
        # VOC2012 has 20 object classes, 21 including background:
        # - Person: person
        # - Animal: bird, cat, cow, dog, horse, sheep
        # - Vehicle: aeroplane, bicycle, boat, bus, car, motorbike, train
        # - Indoor: bottle, chair, dining table, potted plant, sofa, tv/monitor
        super(FCN8s, self).__init__()
        # conv1
        # 3 input channels, 64 output channels, 3x3 kernel. The large padding of 100
        # leaves room for the 7x7 fc6 convolution and the crops in forward().
        self.conv1_1 = nn.Conv2d(3, 64, 3, padding=100)
        self.relu1_1 = nn.ReLU(inplace=True)
        # A 3x3 kernel with stride=1 and padding=1 keeps the spatial size
        # (in general, padding = (kernel_size - 1) / 2).
        self.conv1_2 = nn.Conv2d(64, 64, 3, padding=1)
        self.relu1_2 = nn.ReLU(inplace=True)
        self.pool1 = nn.MaxPool2d(2, stride=2, ceil_mode=True)  # 1/2; VGG pooling is 2x2

        # conv2
        self.conv2_1 = nn.Conv2d(64, 128, 3, padding=1)
        self.relu2_1 = nn.ReLU(inplace=True)
        self.conv2_2 = nn.Conv2d(128, 128, 3, padding=1)
        self.relu2_2 = nn.ReLU(inplace=True)
        self.pool2 = nn.MaxPool2d(2, stride=2, ceil_mode=True)  # 1/4

        # conv3
        self.conv3_1 = nn.Conv2d(128, 256, 3, padding=1)
        self.relu3_1 = nn.ReLU(inplace=True)
        self.conv3_2 = nn.Conv2d(256, 256, 3, padding=1)
        self.relu3_2 = nn.ReLU(inplace=True)
        self.conv3_3 = nn.Conv2d(256, 256, 3, padding=1)
        self.relu3_3 = nn.ReLU(inplace=True)
        self.pool3 = nn.MaxPool2d(2, stride=2, ceil_mode=True)  # 1/8

        # conv4
        self.conv4_1 = nn.Conv2d(256, 512, 3, padding=1)
        self.relu4_1 = nn.ReLU(inplace=True)
        self.conv4_2 = nn.Conv2d(512, 512, 3, padding=1)
        self.relu4_2 = nn.ReLU(inplace=True)
        self.conv4_3 = nn.Conv2d(512, 512, 3, padding=1)  # 3x3 conv
        self.relu4_3 = nn.ReLU(inplace=True)
        self.pool4 = nn.MaxPool2d(2, stride=2, ceil_mode=True)  # 1/16

        # conv5
        self.conv5_1 = nn.Conv2d(512, 512, 3, padding=1)
        self.relu5_1 = nn.ReLU(inplace=True)
        self.conv5_2 = nn.Conv2d(512, 512, 3, padding=1)
        self.relu5_2 = nn.ReLU(inplace=True)
        self.conv5_3 = nn.Conv2d(512, 512, 3, padding=1)  # 3x3 conv
        self.relu5_3 = nn.ReLU(inplace=True)
        self.pool5 = nn.MaxPool2d(2, stride=2, ceil_mode=True)  # 1/32

        # fc6: the first fully connected layer of VGG16, rewritten as a 7x7 conv
        self.fc6 = nn.Conv2d(512, 4096, 7)
        self.relu6 = nn.ReLU(inplace=True)
        self.drop6 = nn.Dropout2d()

        # fc7: the second fully connected layer, rewritten as a 1x1 conv
        self.fc7 = nn.Conv2d(4096, 4096, 1)
        self.relu7 = nn.ReLU(inplace=True)
        self.drop7 = nn.Dropout2d()

        # Everything from conv1 to fc7 follows the VGG16 structure.
        # The final FC1000 and softmax layers are dropped and replaced by 1x1 convs
        # that predict n_class scores per location.
        self.score_fr = nn.Conv2d(4096, n_class, 1)
        self.score_pool3 = nn.Conv2d(256, n_class, 1)
        self.score_pool4 = nn.Conv2d(512, n_class, 1)

        # Deconvolution (transposed convolution) layers for upsampling
        self.upscore2 = nn.ConvTranspose2d(
            n_class, n_class, 4, stride=2, bias=False)
        self.upscore8 = nn.ConvTranspose2d(
            n_class, n_class, 16, stride=8, bias=False)
        self.upscore_pool4 = nn.ConvTranspose2d(
            n_class, n_class, 4, stride=2, bias=False)

        self._initialize_weights()
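
    def _initialize_weights(self):
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                # All-zero conv weights only make sense as a placeholder: this assumes
                # pretrained VGG16 weights are copied into conv1_1..fc7 afterwards
                # (that loading step is not shown in this post).
                m.weight.data.zero_()
                if m.bias is not None:
                    m.bias.data.zero_()
            if isinstance(m, nn.ConvTranspose2d):
                # Start each deconvolution as bilinear upsampling.
                assert m.kernel_size[0] == m.kernel_size[1]
                initial_weight = get_upsampling_weight(
                    m.in_channels, m.out_channels, m.kernel_size[0])
                m.weight.data.copy_(initial_weight)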

    def forward(self, x):
        h = x
        h = self.relu1_1(self.conv1_1(h))  # first block
        h = self.relu1_2(self.conv1_2(h))
        h = self.pool1(h)

        h = self.relu2_1(self.conv2_1(h))
        h = self.relu2_2(self.conv2_2(h))
        h = self.pool2(h)

        h = self.relu3_1(self.conv3_1(h))
        h = self.relu3_2(self.conv3_2(h))
        h = self.relu3_3(self.conv3_3(h))
        h = self.pool3(h)
        pool3 = h  # 1/8

        h = self.relu4_1(self.conv4_1(h))
        h = self.relu4_2(self.conv4_2(h))
        h = self.relu4_3(self.conv4_3(h))
        h = self.pool4(h)
        pool4 = h  # 1/16

        h = self.relu5_1(self.conv5_1(h))
        h = self.relu5_2(self.conv5_2(h))
        h = self.relu5_3(self.conv5_3(h))
        h = self.pool5(h)

        h = self.relu6(self.fc6(h))
        h = self.drop6(h)

        h = self.relu7(self.fc7(h))
        h = self.drop7(h)

        h = self.score_fr(h)  # 1x1 conv that replaces the FC1000 classifier
        h = self.upscore2(h)  # 2x upsampling
        upscore2 = h  # 1/16

        # Skip connection 1: score pool4, crop it to match upscore2, then add.
        h = self.score_pool4(pool4)
        h = h[:, :, 5:5 + upscore2.size()[2], 5:5 + upscore2.size()[3]]
        score_pool4c = h  # 1/16

        h = upscore2 + score_pool4c  # 1/16
        h = self.upscore_pool4(h)
        upscore_pool4 = h  # 1/8

        # Skip connection 2: score pool3, crop it to match upscore_pool4, then add.
        h = self.score_pool3(pool3)
        h = h[:, :,
              9:9 + upscore_pool4.size()[2],
              9:9 + upscore_pool4.size()[3]]
        score_pool3c = h  # 1/8

        h = upscore_pool4 + score_pool3c  # 1/8

        # Final 8x upsampling, cropped back to the input height and width.
        h = self.upscore8(h)
        h = h[:, :, 31:31 + x.size()[2], 31:31 + x.size()[3]].contiguous()

        return h  # per-pixel class scores, shape (N, n_class, H, W)


def get_upsampling_weight(in_channels, out_channels, kernel_size):
    """Make a 2D bilinear kernel suitable for upsampling"""
    factor = (kernel_size + 1) // 2
    if kernel_size % 2 == 1:
        center = factor - 1
    else:
        center = factor - 0.5
    og = np.ogrid[:kernel_size, :kernel_size]
    # Separable triangle filter: weights fall off linearly with distance from the center.
    filt = (1 - abs(og[0] - center) / factor) * \
           (1 - abs(og[1] - center) / factor)
    weight = np.zeros((in_channels, out_channels, kernel_size, kernel_size),
                      dtype=np.float64)
    # Only channel i -> channel i gets the filter (this indexing assumes
    # in_channels == out_channels), so each class map is upsampled independently.
    weight[range(in_channels), range(out_channels), :, :] = filt
    return torch.from_numpy(weight).float()
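The crop offsets 5, 9 and 31 in `forward` are easier to follow once the sizes are traced by hand. The sketch below follows one spatial dimension through the layers defined above using the standard output-size formulas; `fcn8s_sizes` and `crops_fit` are hypothetical helpers written for this illustration, and no model is instantiated.

```python
import math

def fcn8s_sizes(h):
    """Trace one spatial dimension of an input of size h through FCN8s."""
    s = h + 2 * 100 - 2          # conv1_1: 3x3 kernel, padding=100
    pools = []
    for _ in range(5):           # pool1..pool5 (ceil_mode=True); 3x3/pad-1 convs keep size
        s = math.ceil(s / 2)
        pools.append(s)
    fc = pools[4] - 6            # fc6: 7x7 kernel, no padding; fc7 and score_fr are 1x1
    up2 = (fc - 1) * 2 + 4       # upscore2: kernel 4, stride 2
    up4 = (up2 - 1) * 2 + 4      # upscore_pool4: kernel 4, stride 2
    up8 = (up4 - 1) * 8 + 16     # upscore8: kernel 16, stride 8
    return {"pool3": pools[2], "pool4": pools[3], "fc": fc,
            "upscore2": up2, "upscore_pool4": up4, "upscore8": up8}

def crops_fit(h):
    """Check that every cropped slice in forward() stays inside its tensor."""
    s = fcn8s_sizes(h)
    return (5 + s["upscore2"] <= s["pool4"]
            and 9 + s["upscore_pool4"] <= s["pool3"]
            and 31 + h <= s["upscore8"])

sizes = fcn8s_sizes(224)
ok = crops_fit(224) and crops_fit(500)
```

For a 224-pixel side this gives pool3 = 53, pool4 = 27, a final upsampled size of 312, and all three crops fit. The padding of 100 is what keeps the feature map large enough for the 7x7 fc6 convolution and these crops.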
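The implementation above is still somewhat verbose; I came across a more concise implementation in this article, which is worth consulting.
Dataset: VOC2012
The VOC2012 dataset has 20 object classes, or 21 including background: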
- Person: person
- Animal: bird, cat, cow, dog, horse, sheep
- Vehicle: aeroplane, bicycle, boat, bus, car, motorbike, train
- Indoor: bottle, chair, dining table, potted plant, sofa, tv/monitor
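Summary
Because the deconvolution used in the paper works by inserting zeros, it can produce grid-like (checkerboard) artifacts. An alternative is ordinary interpolation-based upsampling (e.g. nn.Upsample) followed by a convolution, which achieves a similar effect and gives smoother results.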
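A minimal sketch of that alternative, assuming a 2x upsampling step on a 21-class score map; the 3x3 kernel and the input size are illustrative choices, not values from the paper:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
n_class = 21

# Bilinear interpolation does the resizing; the convolution that follows
# is learnable and keeps the spatial size (3x3 kernel, padding 1).
upsample_conv = nn.Sequential(
    nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
    nn.Conv2d(n_class, n_class, kernel_size=3, padding=1),
)

x = torch.randn(1, n_class, 14, 14)
y = upsample_conv(x)
y_shape = tuple(y.shape)
```

The output is 28x28, the same size a stride-2 deconvolution with suitable padding would produce, so the module could stand in for a layer such as `upscore2` if the cropping logic is adjusted accordingly.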