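FCN and U-Net were both published in 2015. They share the same encoder-decoder idea: encode first, then decode back to a feature map with the same spatial size as the input image, and compute the loss between each point of that feature map and the corresponding pixel of the annotation mask. They differ mainly in how features are fused: FCN adds feature maps element-wise, whereas U-Net stacks (concatenates) the two feature maps along the channel dimension. This post reproduces FCN in TensorFlow and U-Net in PyTorch.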
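To make the two fusion styles concrete, here is a small framework-free sketch. It is only an illustration: a feature map is modelled as a plain list of channels, and the helper names `fuse_add` and `fuse_concat` are made up for this example rather than taken from either paper.

```python
# Framework-free illustration: a feature map is a list of channels,
# and each channel is a 2D list of numbers (H x W).

def fuse_add(a, b):
    """FCN-style fusion: element-wise sum, so both inputs need the same shape."""
    return [
        [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(ch_a, ch_b)]
        for ch_a, ch_b in zip(a, b)
    ]

def fuse_concat(a, b):
    """U-Net-style fusion: stack along the channel axis, only H and W must match."""
    return a + b

deep = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]              # 2 channels, 2x2
shallow = [[[10, 20], [30, 40]], [[50, 60], [70, 80]]]   # 2 channels, 2x2

added = fuse_add(deep, shallow)        # still 2 channels
stacked = fuse_concat(deep, shallow)   # 2 + 2 = 4 channels
print(len(added), len(stacked))        # prints: 2 4
```

Addition keeps the channel count fixed, which is why the FCN decoder first maps every skip feature to the same number of channels. Concatenation grows the channel count, so the following convolution has to accept the summed channels.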
I also came across a collection of semantic segmentation code on GitHub: https://github.com/mrgloom/awesome-semantic-segmentation
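1. FCN
This section focuses on FCN-8s. The FCN paper builds three network variants: FCN-32s, FCN-16s, and FCN-8s. FCN-32s fuses no shallow features and upsamples the deep features directly; FCN-16s fuses one shallow feature map; FCN-8s fuses two shallow feature maps and gives the best segmentation results.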
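The relationship between the three variants can be summarised as a lookup table. This is a sketch under the assumption of a VGG-style encoder in which pool3, pool4, and fc7 sit at strides 8, 16, and 32; the dictionary and function names are hypothetical.

```python
# Stride (downsampling factor) of each encoder feature map relative to the input.
FEATURE_STRIDE = {"pool3": 8, "pool4": 16, "fc7": 32}

# Feature maps used by each variant, deepest first.
VARIANTS = {
    "FCN-32s": ["fc7"],
    "FCN-16s": ["fc7", "pool4"],
    "FCN-8s": ["fc7", "pool4", "pool3"],
}

def final_upsample_factor(variant):
    """The last upsampling step has to undo the stride of the shallowest fused map."""
    return FEATURE_STRIDE[VARIANTS[variant][-1]]

for name, layers in VARIANTS.items():
    print(name, "fused shallow maps:", len(layers) - 1,
          "final upsample: x%d" % final_upsample_factor(name))
```

The numbers in the variant names are exactly these final upsampling factors.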
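The FCN-8s encoder is a fully convolutional VGG-16, in which the fully connected layers at the end of the original VGG are replaced by convolutional layers. The code here is written for TF 1.x using the high-level tf.layers API; only the network definition is shown.
Original Caffe project for FCN: https://github.com/shelhamer/fcn.berkeleyvision.org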
import tensorflow as tf  # TF 1.x


def encode(inputs, is_training=False):
    # Fully convolutional VGG-style encoder. Note: VGG-16 has three convolutions
    # in blocks 3-5; this simplified version uses two per block.
    conv1_1 = tf.layers.conv2d(inputs, 64, 3, padding='SAME', activation=tf.nn.relu)
    conv1_2 = tf.layers.conv2d(conv1_1, 64, 3, padding='SAME', activation=tf.nn.relu)
    pool1 = tf.layers.max_pooling2d(conv1_2, pool_size=[2, 2], strides=2)
    conv2_1 = tf.layers.conv2d(pool1, 128, 3, padding='SAME', activation=tf.nn.relu)
    conv2_2 = tf.layers.conv2d(conv2_1, 128, 3, padding='SAME', activation=tf.nn.relu)
    pool2 = tf.layers.max_pooling2d(conv2_2, pool_size=[2, 2], strides=2)
    conv3_1 = tf.layers.conv2d(pool2, 256, 3, padding='SAME', activation=tf.nn.relu)
    conv3_2 = tf.layers.conv2d(conv3_1, 256, 3, padding='SAME', activation=tf.nn.relu)
    pool3 = tf.layers.max_pooling2d(conv3_2, pool_size=[2, 2], strides=2)
    conv4_1 = tf.layers.conv2d(pool3, 512, 3, padding='SAME', activation=tf.nn.relu)
    conv4_2 = tf.layers.conv2d(conv4_1, 512, 3, padding='SAME', activation=tf.nn.relu)
    pool4 = tf.layers.max_pooling2d(conv4_2, pool_size=[2, 2], strides=2)
    conv5_1 = tf.layers.conv2d(pool4, 512, 3, padding='SAME', activation=tf.nn.relu)
    conv5_2 = tf.layers.conv2d(conv5_1, 512, 3, padding='SAME', activation=tf.nn.relu)
    pool5 = tf.layers.max_pooling2d(conv5_2, pool_size=[2, 2], strides=2)
    # fc6/fc7 are 1x1 convolutions here, so the spatial size of pool5 is kept.
    # (A kernel equal to the pool5 size, 4 for a 128*128 input, with 'valid'
    # padding would instead collapse the map to 1x1.)
    fc6 = tf.layers.conv2d(pool5, 4096, 1, padding='valid', activation=tf.nn.relu)
    # Keep the dropout output; dropout is only active when is_training is True
    # (is_training is an argument added here for that purpose).
    fc6 = tf.layers.dropout(fc6, 0.5, training=is_training)
    fc7 = tf.layers.conv2d(fc6, 4096, 1, padding='valid', activation=tf.nn.relu)
    fc7 = tf.layers.dropout(fc7, 0.5, training=is_training)
    return pool3, pool4, fc7
def decode(vgg_layer3_out, vgg_layer4_out, vgg_layer7_out, num_classes):
    """
    Create the layers for a fully convolutional network. Build skip-layers using the vgg layers.
    :param vgg_layer3_out: TF Tensor for VGG Layer 3 output
    :param vgg_layer4_out: TF Tensor for VGG Layer 4 output
    :param vgg_layer7_out: TF Tensor for VGG Layer 7 output
    :param num_classes: Number of classes to classify
    :return: The Tensor for the last layer of output
    """
    # A 1x1 conv reducing layer 7 to num_classes channels could be applied first,
    # but it seems unnecessary: the transposed conv below already performs that
    # reduction, so layer 7 is used directly.
    layer7_trans = tf.layers.conv2d_transpose(vgg_layer7_out, num_classes, 4, 2,
                                              padding='SAME',
                                              kernel_regularizer=tf.contrib.layers.l2_regularizer(1e-3))
    # This 1x1 conv prepares pool4 for fusion with the upsampled features: both
    # tensors must have the same shape to be added, so the VGG features are
    # reduced to num_classes channels.
    layer4_conv = tf.layers.conv2d(vgg_layer4_out, num_classes, 1,
                                   padding='SAME',
                                   kernel_regularizer=tf.contrib.layers.l2_regularizer(1e-3))
    layer4_out = tf.add(layer7_trans, layer4_conv)
    layer4_trans = tf.layers.conv2d_transpose(layer4_out, num_classes, 4, 2,
                                              padding='SAME',
                                              kernel_regularizer=tf.contrib.layers.l2_regularizer(1e-3))
    layer3_conv = tf.layers.conv2d(vgg_layer3_out, num_classes, 1,
                                   padding='SAME',
                                   kernel_regularizer=tf.contrib.layers.l2_regularizer(1e-3))
    layer3_out = tf.add(layer3_conv, layer4_trans)
    # Final x8 upsampling back to the input resolution.
    last_layer = tf.layers.conv2d_transpose(layer3_out, num_classes, 16, 8,
                                            padding='SAME',
                                            kernel_regularizer=tf.contrib.layers.l2_regularizer(1e-3),
                                            name="last_layer")
    return last_layer
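To check that the skip additions line up, one spatial dimension can be traced through the encoder and decoder by hand. The sketch below does this in plain Python. It assumes five 2x2 stride-2 max-pools that floor odd sizes, and transposed convolutions with 'SAME' padding that multiply the size by their stride. The function name `trace_fcn8s` is hypothetical.

```python
def trace_fcn8s(size):
    """Follow one spatial dimension through the FCN-8s encoder and decoder."""
    sizes = {}
    n = size
    for i in range(1, 6):              # five 2x2 max-pools with stride 2
        n = n // 2
        sizes["pool%d" % i] = n
    sizes["fc7"] = n                   # 1x1 convolutions keep the spatial size

    up1 = sizes["fc7"] * 2             # stride-2 transposed conv, SAME padding
    if up1 != sizes["pool4"]:
        raise ValueError("cannot add upsampled fc7 (%d) to pool4 (%d)" % (up1, sizes["pool4"]))
    up2 = up1 * 2                      # second stride-2 transposed conv
    if up2 != sizes["pool3"]:
        raise ValueError("cannot add upsampled map (%d) to pool3 (%d)" % (up2, sizes["pool3"]))
    sizes["output"] = up2 * 8          # final stride-8 transposed conv
    return sizes

trace = trace_fcn8s(128)
print(trace["pool3"], trace["pool4"], trace["fc7"], trace["output"])  # prints: 16 8 4 128
```

For an input of 100 the same trace gives an output of 96, so under these assumptions the output matches the input only when the input size is a multiple of 32.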
FCN references:
https://blog.csdn.net/m0_37862527/article/details/79843963
https://blog.csdn.net/weixin_40519315/article/details/104412740
https://zhuanlan.zhihu.com/p/62995971?utm_source=wechat_session
https://blog.csdn.net/qq_36269513/article/details/80420363
https://blog.csdn.net/u013303599/article/details/79231503
GitHub:
https://github.com/shelhamer/fcn.berkeleyvision.org
https://github.com/pierluigiferrari/fcn8s_tensorflow
https://github.com/MarvinTeichmann/tensorflow-fcn/blob/master/fcn8_vgg.py
https://github.com/zijundeng/pytorch-semantic-segmentation/blob/master/models/fcn8s.py
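2. U-Net
U-Net fuses features by concatenation, and the network looks like a symmetric U. Note, however, that its input and output sizes differ: the input is 572×572 while the output is 388×388. The reason is that the convolutions use no padding, so every convolution shrinks the feature map. A segmentation network normally needs its output to be the same size as its input so that every input pixel is assigned a class. U-Net's answer is to enlarge the region fed to the network: a larger image patch is used to predict the segmentation of its smaller central region, which amounts to giving the network the context around the target region. This is the overlap-tile strategy; where the required context falls outside the image, it is filled in by mirroring.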
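The 572 to 388 shrinkage can be reproduced with simple arithmetic. The sketch below assumes the layout of the original paper: four downsampling levels, two unpadded 3x3 convolutions per level (each removing 2 pixels), 2x2 max-pooling, and 2x2 up-convolutions that double the size. The function name `unet_valid_output_size` is hypothetical.

```python
def unet_valid_output_size(size, depth=4):
    """Output side length of a U-Net built from unpadded 3x3 convolutions."""
    n = size
    for _ in range(depth):
        n -= 4                 # two unpadded 3x3 convs, each removes 2 pixels
        if n % 2:
            raise ValueError("odd size %d before 2x2 pooling" % n)
        n //= 2                # 2x2 max-pool, stride 2
    n -= 4                     # two convs at the bottom of the U
    for _ in range(depth):
        n *= 2                 # 2x2 up-convolution doubles the size
        n -= 4                 # two unpadded 3x3 convs
    return n                   # the final 1x1 conv keeps the size

out = unet_valid_output_size(572)
border = (572 - out) // 2
print(out, border)             # prints: 388 92
```

So each output pixel block needs a border of 92 extra input pixels on every side, which is the context that the overlap-tile strategy has to supply.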
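The figure below illustrates the overlap-tile strategy. Predicting the segmentation inside the yellow box requires the image data inside the blue box as input, and missing data is extrapolated by mirroring. The white box marks the original input image; the border around it is produced by mirroring the original. The blue box is then fed to the network to predict the segmentation for the yellow box. (The image is segmented region by region because a whole large image would not fit in GPU memory.)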
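A minimal sketch of extracting one input tile with mirrored borders follows. It works on plain nested lists and assumes reflection that does not repeat the edge pixel; the exact mirroring convention is an assumption, and the helper names `reflect_index` and `extract_tile` are hypothetical.

```python
def reflect_index(i, n):
    """Map any integer index onto [0, n) by mirroring at the borders."""
    if n == 1:
        return 0
    period = 2 * (n - 1)
    i = i % period
    return i if i < n else period - i

def extract_tile(image, top, left, out_size, border):
    """Input tile (the 'blue box') for the output region (the 'yellow box')
    whose top-left corner is (top, left); missing pixels are mirrored."""
    h, w = len(image), len(image[0])
    size = out_size + 2 * border
    return [
        [image[reflect_index(top - border + r, h)][reflect_index(left - border + c, w)]
         for c in range(size)]
        for r in range(size)
    ]

# Toy 4x4 image whose pixel value encodes its position: value = 4 * row + col.
image = [[4 * r + c for c in range(4)] for r in range(4)]
tile = extract_tile(image, top=0, left=0, out_size=2, border=1)
print(tile[0])   # prints: [5, 4, 5, 6]
```

With the sizes from the paper, `out_size` would be 388 and `border` would be 92, giving a 572×572 input tile.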
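U-Net's structure is fairly symmetric: the left side downsamples with convolution plus pooling, and the right side upsamples with convolution plus transposed convolution, finally restoring the features to the original image size. The network diagram in the U-Net paper is very clear, and the network can be written directly from it. The reproduction here uses PyTorch and differs from the original paper in two places: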
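(1) The convolutions in this code use padding, whereas the original U-Net paper uses unpadded convolutions, so there each 3×3 convolution reduces height and width by 2 pixels. To build the network exactly as in the paper, the encoder (left-side) feature map must first be brought to the same size as the decoder (right-side) one before the two are concatenated. The paper does this by cropping the encoder feature map; resizing by interpolation is an alternative, for which PyTorch provides torch.nn.functional.interpolate() (the older torch.nn.functional.upsample() is deprecated). Some people argue that repeatedly applying padded convolutions makes the error introduced by padding grow, because the deeper the layer, the more abstract the features and the more easily they are affected by padding.
(2) The original paper does not use batch normalization; this code adds it.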
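For the unpadded variant, one way to match sizes before concatenation is a center crop of the encoder feature map, which is what the "copy and crop" arrows in the paper's diagram denote. Below is a minimal framework-free sketch on a single 2D channel; the function name `center_crop` is hypothetical.

```python
def center_crop(feature, target_h, target_w):
    """Cut the central target_h x target_w window out of a 2D feature map."""
    h, w = len(feature), len(feature[0])
    if target_h > h or target_w > w:
        raise ValueError("target is larger than the feature map")
    top = (h - target_h) // 2
    left = (w - target_w) // 2
    return [row[left:left + target_w] for row in feature[top:top + target_h]]

# Toy 6x6 encoder map (value = 6 * row + col) cropped to a 4x4 decoder size.
encoder_map = [[6 * r + c for c in range(6)] for r in range(6)]
cropped = center_crop(encoder_map, 4, 4)
print(cropped[0])   # prints: [7, 8, 9, 10]
```

In the paper's architecture the deepest skip connection would crop a 64×64 map to 56×56, that is, 4 pixels from each side.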
import torch
import torch.nn as nn


class convBlock(nn.Module):
    """Two 3x3 conv + BatchNorm + ReLU layers; padding=1 keeps the spatial size."""

    def __init__(self, in_channels, out_channels):
        super(convBlock, self).__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_channels),  # the original paper does not use batch normalization
            nn.ReLU(),
            nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(),
            # nn.MaxPool2d(kernel_size=2, stride=2)  # pooling is applied in uNet.forward instead
        )

    def forward(self, x):
        x = self.cnn(x)
        return x
class upSampling(nn.Module):
    """Two 3x3 conv + BatchNorm + ReLU layers followed by a x2 transposed convolution."""

    def __init__(self, in_channels, middle_channels, out_channels):
        super(upSampling, self).__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(in_channels, middle_channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(middle_channels),
            nn.ReLU(),
            nn.Conv2d(middle_channels, middle_channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(middle_channels),
            nn.ReLU(),
            # With kernel_size=2, the stride is also the upsampling factor
            nn.ConvTranspose2d(middle_channels, out_channels, kernel_size=2, stride=2)
        )

    def forward(self, x):
        x = self.cnn(x)
        return x
class uNet(nn.Module):
    def __init__(self, num_classes):
        super(uNet, self).__init__()
        # Encoder: channel count doubles at each level
        self.enCode1 = convBlock(in_channels=3, out_channels=64)
        self.enCode2 = convBlock(in_channels=64, out_channels=128)
        self.enCode3 = convBlock(in_channels=128, out_channels=256)
        self.enCode4 = convBlock(in_channels=256, out_channels=512)
        self.Maxpool = nn.MaxPool2d(kernel_size=2, stride=2)
        # Decoder: in_channels of deCode2-4 is doubled by the skip concatenation
        self.deCode1 = upSampling(in_channels=512, middle_channels=1024, out_channels=512)
        self.deCode2 = upSampling(in_channels=1024, middle_channels=512, out_channels=256)
        self.deCode3 = upSampling(in_channels=512, middle_channels=256, out_channels=128)
        self.deCode4 = upSampling(in_channels=256, middle_channels=128, out_channels=64)
        self.lastLayer = nn.Sequential(
            nn.Conv2d(128, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.Conv2d(64, num_classes, kernel_size=1)  # output channels = number of annotated classes
        )

    def forward(self, x):
        enc1 = self.enCode1(x)
        enc1_pool = self.Maxpool(enc1)
        enc2 = self.enCode2(enc1_pool)
        enc2_pool = self.Maxpool(enc2)
        enc3 = self.enCode3(enc2_pool)
        enc3_pool = self.Maxpool(enc3)
        enc4 = self.enCode4(enc3_pool)
        enc4_pool = self.Maxpool(enc4)
        dec1 = self.deCode1(enc4_pool)
        # Skip connections: concatenate along the channel dimension (dim=1)
        dec2 = self.deCode2(torch.cat((dec1, enc4), dim=1))
        dec3 = self.deCode3(torch.cat((dec2, enc3), dim=1))
        dec4 = self.deCode4(torch.cat((dec3, enc2), dim=1))
        out = self.lastLayer(torch.cat((dec4, enc1), dim=1))
        return out
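One practical consequence of this padded design is that the concatenations only line up when the input height and width survive four poolings without rounding. The plain-Python sketch below traces one spatial dimension through the forward pass to check this. The helper names `unet_sizes_match` and `pad_to_multiple` are hypothetical, and padding the input up to a multiple of 16 is just one possible workaround, not something the code above does.

```python
def unet_sizes_match(size, depth=4):
    """True if every decoder output has the same size as its encoder skip."""
    encoder_sizes = []
    n = size
    for _ in range(depth):
        encoder_sizes.append(n)   # padded convs keep the size (enc1..enc4)
        n //= 2                   # MaxPool2d(2, 2) floors odd sizes
    for skip in reversed(encoder_sizes):
        n *= 2                    # ConvTranspose2d(kernel_size=2, stride=2) doubles the size
        if n != skip:
            return False          # torch.cat would fail on this mismatch
    return True

def pad_to_multiple(size, multiple=16):
    """Smallest size >= `size` that is divisible by `multiple`."""
    return ((size + multiple - 1) // multiple) * multiple

print(unet_sizes_match(128), unet_sizes_match(100), pad_to_multiple(100))
# prints: True False 112
```

For example, a 100×100 input pools to 50, 25, 12, 6; upsampling then yields 24 where the matching encoder map is 25, so the concatenation fails, while 112×112 works.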
U-Net references:
https://zhuanlan.zhihu.com/p/31428783
https://zhuanlan.zhihu.com/p/118540575
https://zhuanlan.zhihu.com/p/87593567
https://blog.csdn.net/l2181265/article/details/87735610
https://www.yuque.com/yahei/hey-yahei/segmentation
GitHub:
https://github.com/zijundeng/pytorch-semantic-segmentation/blob/master/models/u_net.py
https://github.com/LeeJunHyun/Image_Segmentation