【论文阅读】Fully Convolutional Networks for Semantic Segmentation FCN阅读、理解和复现

Fully Convolutional Networks for Semantic Segmentation(FCN)提出了全卷积神经网络,是卷积网络在语义分割领域的革命性之作。

图 1:全卷积网络可以有效地学习对语义分割等每像素任务进行密集预测。

一、论文理解

1.1 概述

作者提出了一种全卷积网络(FCN),并将其应用于语义分割任务中,取得了当前最先进的效果,也在后续的发展中,给研究人员们提供了新的思路。


【背景小知识】为何叫全卷积?在此之前,卷积网络已是众所周知,而在卷积网络之前是全连接网络。通常的“100个输入值,1个输出值”是简单的单层网络(感知机):每个输入值对应一个w和一个b,把“每个输入值经过权值的线性运算”汇总相加,再(可选地)过一个激活函数,得到1个输出,计算量已经不小。如果是“100个输入值,100个输出值”,就要把单个输出的计算重复100次,w和b的个数也翻100倍,这种网络叫全连接网络。它可以多层拼接,如“100个输入->100个输出->(把输出当输入)10个输出”,把上一层的输出作为下一层的输入一直接下去,网上有许多结构图可以参考。

全连接网络的表达能力很强,在神经网络发明的早期就被用于图像分类。以手写数字识别为例,把28*28的灰度图(单通道)展开成784个数值作为输入,最后输出10个值,分别表示0~9的概率,得到识别结果。这就是神经网络在图像分类上的应用:以整张图片为输入,得到一个分类结果。问题来了:1、随着单张图像的分辨率越来越大,展开后作为全连接网络的输入,参数量和计算量会非常大;2、它的输出只能告诉我们整图属于哪个类别,做不了分割任务。

为解决问题1,卷积网络被提出。卷积最早来自信号处理,应用灵感也来源于人工选择图像特征,比如早期的人脸检测算法就是人为设定特征核,再用滑动窗口判断特征相似度。卷积网络通过共享卷积核把图像变换或缩小成新的输出,如“100*100的输入变成50*50的输出”,要训练的只是卷积核,因此相比全连接网络大大减少了参数量和计算量,同时又提取到了图像特征;特征提取完之后,再接一个全连接层输出分类结果,这就是卷积网络(80年代的LeNet就非常成功地应用卷积网络,制霸了手写邮编识别市场)。可见卷积一般用于特征提取,放在图像的前段处理中,最后仍要用全连接层得到想要的分类输出。

但问题2没有被解决:如何让卷积网络做其他图像识别任务,如目标检测,特别是语义分割?目标检测相对简单,可以用混合模型分两个阶段进行:阶段一在全图中推荐出若干候选框,分别对候选框分类找到目标对象,再不断修正候选框的位置(边界框的左上、右下坐标)。语义分割在当时也是类似的混合做法(见下一段),而本文作者的思路是在卷积网络的基础上直接改造出一个端到端的分割网络。

传统的卷积网络要用于语义分割任务,可以通过混合模型:借助推荐(提议)框(包含已经被前景分割的内容)不断找到目标对象,再优化、合并结果。论文中也提到了当时最强的SDS(先用具备传统分割能力的MCG获得前景候选框,然后送入两路网络提取特征:一路是框的特征,一路是前景的特征;再用SVM对特征分类,用NMS去掉重叠框,最后用CNN得到的粗特征修正推荐框中的目标边缘)。这些方法都是多阶段的,混合了不同的算法,不能在一个网络中直接训练并得到结果。因此,作者在卷积网络的基础上,直接去掉最后的分类层,把全连接层替换为卷积层,加入一个1*1的卷积预测层,再加入反卷积层,把预测结果上采样放大到原图大小,得到的就是原图每个像素的分类。如“输入是3*100*100(通道*H*W),输出21*100*100(类别*H*W)”。由于网络先卷积、后反卷积,没有全连接层,因此作者称其为全卷积网络。作者强调,该网络可以接收任意大小的图像输入并输出同尺寸的结果,是端到端(end-to-end)的学习:一个网络、一次训练就得到结果,不分阶段,也不混合复杂的算法(作者把那类方法比作“合奏团”,即 ensemble)。

1.2 论文章节介绍

  1. 引言  不用多说
  2. 相关工作 不用多说
  3. 全卷积网络
    简单介绍了层与层之间尺寸变换的规则,引出它可以接收任意尺寸的输入并输出相应(原图)大小的结果。(傻傻看不懂,公式应该是卷积网络层间转换的标准公式,希望有人详细介绍一下。)
    简单介绍了FCN搭配实值损失函数即可定义任务,可以使用随机梯度下降训练,且正向计算和反向传播都很高效。
    然后用以下小节介绍前人如何将分类网络转为语义分割网络的相关研究,如何将粗糙信息转为原图大小。
    3.1 将分类器(应该是指分类网络中的分类器)转为密集预测
          主要意思是修改分类网络的最后几层,输出原图尺寸分类热力图(空间上的,而不是传统分类网络输出非空间的分类值)。
    3.2 “移位缝合”是滤波器稀疏化
          一种把粗糙输出变密集的技巧,一顿介绍猛如虎,最后却说没使用(we do not use it in our model),目的是介绍一下别人的做法。文中4.2节提到:在有限的实验中,这种方法的改进与成本之比不如层融合。
    3.3 “上采样”是向后跨步卷积
         主要介绍怎么上采样(反卷积),初始化用的是双线性插值。如果要把某个卷积层作为预测层,而该层大小是输入大小的1/f,那么就得上采样f倍。
    3.4 “块训练”是损失采样
          “块训练”解析:指对每一个感兴趣的像素,以它为中心取一个patch(小块),然后输入分类网络,输出该像素的标签。作者认为全图训练与块训练(一种先前的方法)同样有效且更高效,因此不采用块训练。
  4. 分割框架
    介绍说FCN是跳跃架构,端到端学习的,可以输出同原图尺寸像素分类结果。
    4.1 从分类器到密集FCN
          详细介绍了FCN的基本网络结构FCN-32s。
    4.2 组合“内容”和“位置”
          基于基本网络结构FCN-32s得到的结果不精细,提出融合浅层结果的方法。得到FCN-16s和FCN-8s。介绍了其他优化方法,但没什么效果。
    4.3 实验框架
          介绍了优化器,使用的是带动量的SGD,微调,组织更多的训练数据,块采样,类别平衡,密集预测,增强与实现。
  5. 结果
    指标,多个数据集与当前最优算法比较,取得了20%的提升,证明自己的方法是当前最优秀的,算法是有效的。
  6. 结论
    全卷积网络在语义分割中是有效的,表现很好。
    致谢。

1.3 核心内容

  1. 引入一个传统CNN网络,如AlexNet或者VGG16,砍去最后的分类层,并把尾部的全连接层替换为卷积层。
  2. 基础做法:在最后的卷积层(粗糙层,因为经过多次下采样已丢失了原图的空间细节)之后,加入一个1*1的预测卷积层,再加入32倍上采样的反卷积层(初始化为双线性插值,原文3.3节方法),把预测结果还原到原图大小。研究发现,得到的分割结果太粗糙了,如文中图4左一所示。该结果称为FCN-32s。
  3. 使用上一步训练得到的网络参数(预训练模型)进一步改造:把最后那个反卷积层换成一个分支(原来的32倍反卷积层可以直接去掉),将最后的卷积层(conv7)经过1*1卷积预测的结果放大2倍(双线性插值初始化,参数可按原文3.3节方法学习),对齐前面第4个池化层(pool4)(为什么是pool4、为什么放大2倍,对照VGG16每层的缩放倍数即可),pool4也经过一个1*1的卷积预测层(参数初始化为0),把两个预测相加融合,再进行16倍上采样反卷积得到结果。处理方法如文中图3所示,效果如文中图4左二所示。这种做法把第4个池化层的局部细节信息与最后的语义特征信息融合,使结果更精确一些。该结果称为FCN-16s(本列表后附了一段融合流程的示意代码)。
  4. 与第3步类似。作者考虑再往前借用更浅层的细节信息:把上一步FCN-16s融合后的预测结果再放大2倍,对齐pool3,pool3也做一个1*1的卷积预测,两者相加融合后,再8倍上采样反卷积得到结果,又精确了一些。如文中图4左三所示。该结果称为FCN-8s。
  5. 作者再经过一些实验,就不再往前借用池化层信息了,因为他发现继续往前融合得不到更好的结果,于是以FCN-8s作为最优结果,取得了相对20%的提升,当时最牛。作者也在第3章和4.2节提到了一些小的改进和调优方法,实验时做了尝试,没啥提升,最终没有使用。
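下面补一段 FCN-16s 融合流程的示意代码(按后文 2.2 节复现所用的 Paddle 风格写的草稿,score_fr、score_pool4、fuse_16s 这些名字是随手取的,未经训练验证),只为说明“conv7 的 1*1 预测上采样 2 倍,与 pool4 的 1*1 预测(零初始化)相加,再上采样回原图”这一步:

import paddle
import paddle.nn as nn
import paddle.nn.functional as F

num_classes = 21

# conv7(fc7)特征之上的 1x1 预测层;pool4 之上的 1x1 预测层按论文做零初始化
score_fr = nn.Conv2D(4096, num_classes, kernel_size=1)
score_pool4 = nn.Conv2D(512, num_classes, kernel_size=1,
                        weight_attr=paddle.ParamAttr(initializer=nn.initializer.Constant(0.0)),
                        bias_attr=paddle.ParamAttr(initializer=nn.initializer.Constant(0.0)))

def fuse_16s(pool4_feat, conv7_feat, out_hw):
    p = score_fr(conv7_feat)                                          # [N, 21, H/32, W/32]
    p = F.interpolate(p, size=pool4_feat.shape[2:], mode='bilinear')  # 约2倍上采样,对齐pool4
    q = score_pool4(pool4_feat)                                       # [N, 21, H/16, W/16]
    fused = p + q                                                     # 两个预测相加融合
    return F.interpolate(fused, size=out_hw, mode='bilinear')         # 再(约16倍)上采样回原图大小

注意这里用 F.interpolate 的双线性插值代替了论文里“可学习的反卷积上采样”,只是为了把融合顺序写清楚。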

1.4 举例理解

以VGG16为例(论文中竟然没有网络结构图,盗用一张图):

如图所示,到conv7处已经过了5次pool下采样,每次都减半,因此conv7的大小是输入尺寸的1/32,所以要把conv7上采样回输入尺寸就得放大32倍。总之放大倍数根据前面的缩小倍数而定。

注意哈:conv7与output之间还有一个1*1的卷积层,如果分类数量是21类,就是21个1*1的卷积核,输出的空间尺寸与conv7相同,得到的预测结果尺寸是21*7*7(类别*H*W)。

预测结果中,7*7上的每个点代表一个分类。如果把21*7*7直接缩放回原图大小的热力图(21*224*224),看到的就是马赛克;想要结果平滑一点,就要用双线性插值,插值的参数可以事先设定好(在另一个教程中看到有人是提前设置的),也可以在网络中学习。这样经过插值放大的输出热力图看起来舒服多了,但离真实值还是差很多。这个网络作为基础网络可以直接用,叫做FCN-32s。
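顺带记一下“提前设置”的双线性插值参数是怎么来的:其实可以按公式直接构造出双线性核(思路与 caffe 源码 surgery.py 里初始化反卷积权重的方式一致),下面是个小草稿:

import numpy as np

def bilinear_kernel(size):
    # 生成 size x size 的双线性插值核,可用来初始化(或固定)反卷积层的权重
    factor = (size + 1) // 2
    center = factor - 1 if size % 2 == 1 else factor - 0.5
    og = np.ogrid[:size, :size]
    return (1 - abs(og[0] - center) / factor) * (1 - abs(og[1] - center) / factor)

print(bilinear_kernel(4))  # 32倍上采样对应的核大小是64,这里用4便于观察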

因此,作者提出融合浅层的信息:把pool4的结果经过21个1*1卷积核得到预测结果21*14*14,再把上一步conv7的预测结果21*7*7进行2倍上采样(也用双线性插值初始化)得到21*14*14,两者相加,再上采样16倍,效果就好多了,因为融合了pool4的浅层细节。这样的网络叫做FCN-16s。注意哈,作者说是拿FCN-32s训练好的参数到本网络上继续训练,学习率要降为原来的1/100(“作用于pool4的新参数用0初始化”,指的应该就是pool4之后那个1*1预测卷积的参数,论文没展开,得看源码确认)。没有说FCN-16s能不能在没有预训练参数的情况下直接训练,应该可以,可能收敛慢一点,做实验才知道。

在FCN-16s的基础上,还可以用相同的思路把网络改为FCN-8s:拿pool3的结果做一次1*1卷积预测得到21*28*28,再把上一步融合后的预测结果2倍上采样也得到21*28*28,相加融合,再上采样8倍得到结果,表现优异。

作者再想往前融合,发现,结果没啥提升了,就不再试了。

 疑问:
1、从网络结构可以看出,输入是224*224的,或者其他尺寸;一旦训练时把输入定下来,模型参数也就定下来了,那么对于任意大小的原图,如何不缩放就能训练和推理?因为卷积层是共享权重的,每层要训练的只是卷积核参数,与输入尺寸无关:第一层卷积的输出大小跟随原图,pool之后尺寸减半,所以FCN能接收任意尺寸的输入。话说回来,既然如此,其他网络不也应该可以接收任意尺寸?问题出在全连接层:不固定输入的话,最后一个卷积层的大小不确定,展平后的长度会变,全连接层的参数就对不上(而且可能特别长),计算就GG了。
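关于这个疑问,可以用一小段 Paddle 代码直观验证“没有全连接层时,网络对输入尺寸没有限制”(下面的小网络是随手搭的示意,与 FCN 本身无关):

import paddle
import paddle.nn as nn

# 只有卷积和池化的小网络:参数量只取决于卷积核,与输入尺寸无关
net = nn.Sequential(
    nn.Conv2D(3, 8, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2D(kernel_size=2, stride=2),
    nn.Conv2D(8, 21, kernel_size=1),   # 1x1 预测层
)

for hw in [(224, 224), (500, 375)]:    # 不同尺寸的输入都能直接前向
    x = paddle.randn([1, 3, hw[0], hw[1]])
    print(hw, '->', net(x).shape)      # 输出的空间尺寸随输入按比例变化

而换成带全连接层的分类网络,展平后的长度会随输入变化,全连接层的权重个数就对不上了,这正是上面疑问的答案。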

二、复现

2.1 源码阅读

 GitHub源码:点击查看

这边先阅读 voc-fcn32s。

import caffe
from caffe import layers as L, params as P  # 补上源码开头的引用便于对照;conv_relu、max_pool、crop 等辅助函数见仓库源码

def fcn(split): #整体网络
    # 创建数据集,train时使用SBDDSegDataLayer来获取,其他则使用VOCSegDataLayer来获取
    n = caffe.NetSpec()
    pydata_params = dict(split=split, mean=(104.00699, 116.66877, 122.67892),
            seed=1337)
    if split == 'train':
        pydata_params['sbdd_dir'] = '../data/sbdd/dataset'
        pylayer = 'SBDDSegDataLayer'
    else:
        pydata_params['voc_dir'] = '../data/pascal/VOC2011'
        pylayer = 'VOCSegDataLayer'
    n.data, n.label = L.Python(module='voc_layers', layer=pylayer,
            ntop=2, param_str=str(pydata_params))

    # the base net # conv_relu自定义函数,得到一个卷积层和激活层,
    n.conv1_1, n.relu1_1 = conv_relu(n.data, 64, pad=100) # 64的输出通道,100padding。把原始图像放大了
    n.conv1_2, n.relu1_2 = conv_relu(n.relu1_1, 64) # 64的输出通道
    n.pool1 = max_pool(n.relu1_2) #池化,使图像尺寸减半

    # 2个128的输出通道
    n.conv2_1, n.relu2_1 = conv_relu(n.pool1, 128)
    n.conv2_2, n.relu2_2 = conv_relu(n.relu2_1, 128)
    n.pool2 = max_pool(n.relu2_2)#池化,使图像尺寸减半

    # 3个256的输出通道
    n.conv3_1, n.relu3_1 = conv_relu(n.pool2, 256)
    n.conv3_2, n.relu3_2 = conv_relu(n.relu3_1, 256)
    n.conv3_3, n.relu3_3 = conv_relu(n.relu3_2, 256)
    n.pool3 = max_pool(n.relu3_3)#池化,使图像尺寸减半

    # 3个512的输出通道
    n.conv4_1, n.relu4_1 = conv_relu(n.pool3, 512)
    n.conv4_2, n.relu4_2 = conv_relu(n.relu4_1, 512)
    n.conv4_3, n.relu4_3 = conv_relu(n.relu4_2, 512)
    n.pool4 = max_pool(n.relu4_3)#池化,使图像尺寸减半

    # 3个512的输出通道
    n.conv5_1, n.relu5_1 = conv_relu(n.pool4, 512)
    n.conv5_2, n.relu5_2 = conv_relu(n.relu5_1, 512)
    n.conv5_3, n.relu5_3 = conv_relu(n.relu5_2, 512)
    n.pool5 = max_pool(n.relu5_3)#池化,使图像尺寸减半

    # fully conv
    n.fc6, n.relu6 = conv_relu(n.pool5, 4096, ks=7, pad=0) # 通道512->4096 卷积核变为7了,增加感受野的
    n.drop6 = L.Dropout(n.relu6, dropout_ratio=0.5, in_place=True)# 添加一个0.5的随机丢弃
    n.fc7, n.relu7 = conv_relu(n.drop6, 4096, ks=1, pad=0)# 再卷积一次
    n.drop7 = L.Dropout(n.relu7, dropout_ratio=0.5, in_place=True) #再丢弃一次 Dropout是防止过拟合的
    n.score_fr = L.Convolution(n.drop7, num_output=21, kernel_size=1, pad=0,
        param=[dict(lr_mult=1, decay_mult=1), dict(lr_mult=2, decay_mult=0)])# 通道4096->21,21为分类数(20个目标,1个背景),这里做预测了,得到了一个小尺寸的分类预测图
    n.upscore = L.Deconvolution(n.score_fr,
        convolution_param=dict(num_output=21, kernel_size=64, stride=32,
            bias_term=False),
        param=[dict(lr_mult=0)])# 一个反卷积层,卷积核大小64,步长32,不要偏置参数
    n.score = crop(n.upscore, n.data) # 因为第一层对原图像进行了padding100的放大,因此要裁剪一下
    n.loss = L.SoftmaxWithLoss(n.score, n.label,
            loss_param=dict(normalize=False, ignore_label=255)) # 损失函数

    return n.to_proto()

solve.py
训练代码
import sys
import numpy as np
import caffe
import surgery, score  # 仓库内的辅助模块(补上引用便于对照)

weights = '../ilsvrc-nets/vgg16-fcn.caffemodel' # 载入预训练的模型参数

# init
caffe.set_device(int(sys.argv[1]))
caffe.set_mode_gpu()

solver = caffe.SGDSolver('solver.prototxt') # 载入配置参数
solver.net.copy_from(weights) # 设置初始权值

# surgeries #手术  upscore 上采样的层取出来,写入双线性插值的参数
interp_layers = [k for k in solver.net.params.keys() if 'up' in k]
surgery.interp(solver.net, interp_layers)

# scoring
val = np.loadtxt('../data/segvalid11.txt', dtype=str)

for _ in range(25):
    solver.step(4000)
    score.seg_tests(solver, False, val, layer='score')

折腾了几天,caffe早就不维护了,很多包的版本匹配不上,一会儿这个问题一会儿那个问题。最后环境弄好要训练了,结果又报错。网上有说标签不对会引起这个错误:https://blog.csdn.net/calvinpaean/article/details/83654858,也有说要更新包的:https://blog.csdn.net/xqqqiang/article/details/107416133,懒得再试了。16s的代码也很容易懂,就是在末尾拼接pool4的预测结果,过程中注意要裁剪(crop)对齐。

Check failed: status == CUBLAS_STATUS_SUCCESS (13 vs. 0)  CUBLAS_STATUS_EXECUTION_FAILED
*** Check failure stack trace: ***
Aborted

caffe被淘汰是对的,调试起来解决问题太难了。

2.2 代码复现

想用百度paddle来复现一下,时间很紧张,这里就只试了32s

1、数据准备

import os
import random
import paddle
import paddle.io as io
from paddle.vision import transforms as T
import numpy as np
from PIL import Image
import scipy.io


class FcnDataset(io.Dataset):
    """
    数据集定义
    """
    def __init__(self, mode='train'):
        """
        构造函数
        """
        self.image_size = (500,500) # 将读入的图片统一调整到500*500,要不然不能批载入,只能一张一张读
        self.mode = mode.lower()
        self.img_path = './dataset/sbdd/img' # 图片地址
        self.label_path = './dataset/sbdd/cls' # 标注地址
        
        self.img_names = os.listdir(self.img_path)
        self.label_names = os.listdir(self.label_path)
     
        assert(len(self.img_names) == len(self.label_names))
        # 图片集的均值与方差(按0~1尺度统计;若直接用于0~255的像素值需再乘以255)
        # R_mean is 0.457545, G_mean is 0.437410, B_mean is 0.403840
        # R_var is 0.274347, G_var is 0.271358, B_var is 0.284813
        
        if(mode == 'train'):
            val = np.loadtxt('./dataset/sbdd/train.txt', dtype=str)
            trainCount = round(len(val)*0.8)
            self.ids = val[0:trainCount] # 训练集id(前80%)
        elif(mode == 'dev'):
            val = np.loadtxt('./dataset/sbdd/train.txt', dtype=str)
            trainCount = round(len(val)*0.8)
            self.ids = val[trainCount:] # 验证集id(取后20%,避免与训练集重叠)
        elif(mode == 'test'):
            self.ids = np.loadtxt('./dataset/sbdd/val.txt', dtype=str) # 测试集
        
        # 图像转换 
        self.transform = T.Compose([T.Resize(self.image_size),
                                    T.Transpose(),
                                    T.Normalize(mean=127.5, std=127.5) # 此处本来想用计算出来的,后来直接使用网上的
                                    ])
        
    def __getitem__(self, idx):
        img = np.array(Image.open('{}/{}.jpg'.format(self.img_path, self.ids[idx])).convert('RGB'), dtype='float32')
        img = self.transform(img)
        
       
        mat = scipy.io.loadmat('{}/{}.mat'.format(self.label_path, self.ids[idx]))
        label_arr = mat['GTcls'][0]['Segmentation'][0].astype(np.uint8)
        label_im = Image.fromarray(label_arr)
        if label_im.mode not in ('L', 'I;16', 'I'):
            label_im = label_im.convert('L')
        # 标签缩放必须用最近邻插值,双线性插值会把不同的类别ID混在一起
        label_im = T.Compose([T.Resize(self.image_size, interpolation='nearest'),
                              T.Grayscale()
                              ])(label_im)
        label = np.array(label_im, dtype='int64')
        return img, label

    def __len__(self):
        return len(self.ids)

2、 网络结构

import paddle
import paddle.nn as nn
import paddle.nn.functional as F

def conv_relu(nin, nout, ks=3, stride=1, pad=1):
    conv = nn.Conv2D(in_channels=nin, out_channels=nout, kernel_size=ks, stride=stride, padding=pad, weight_attr=paddle.ParamAttr())
    return conv, nn.ReLU()

class FCN32s(nn.Layer):
    def __init__(self, num_classes=21):
        super(FCN32s, self).__init__()
        
        # 1/2x 卷积:输入通道数为3,输出通道数为64
        self.conv1_1, self.relu1_1 = conv_relu(3, 64, pad=100)
        self.conv1_2, self.relu1_2 = conv_relu(64, 64)
        self.pool1 = nn.MaxPool2D(kernel_size=2, stride=2, padding=0)
        
        
        # 1/4x 卷积:输入通道数为64,输出通道数为128
        self.conv2_1, self.relu2_1 = conv_relu(64, 128)
        self.conv2_2, self.relu2_2 = conv_relu(128, 128)
        self.pool2 = nn.MaxPool2D(kernel_size=2, stride=2, padding=0)
        
        # 1/8x 卷积:输入通道数为128,输出通道数为256
        self.conv3_1, self.relu3_1 = conv_relu(128, 256)
        self.conv3_2, self.relu3_2 = conv_relu(256, 256)
        self.conv3_3, self.relu3_3 = conv_relu(256, 256)
        self.pool3 = nn.MaxPool2D(kernel_size=2, stride=2, padding=0)
        
        # 1/16x 卷积:输入通道数为256,输出通道数为512
        self.conv4_1, self.relu4_1 = conv_relu(256, 512)
        self.conv4_2, self.relu4_2 = conv_relu(512, 512)
        self.conv4_3, self.relu4_3 = conv_relu(512, 512)
        self.pool4 = nn.MaxPool2D(kernel_size=2, stride=2, padding=0)
        
        # 1/32x 卷积:输入通道数为512,输出通道数为512
        self.conv5_1, self.relu5_1 = conv_relu(512, 512)
        self.conv5_2, self.relu5_2 = conv_relu(512, 512)
        self.conv5_3, self.relu5_3 = conv_relu(512, 512)
        self.pool5 = nn.MaxPool2D(kernel_size=2, stride=2, padding=0)
        
        # 全卷积
        self.fc6, self.relu6 = conv_relu(512, 4096, ks=7, pad=0) #变小了! 不是上一层的1/2
        # 随机丢弃层
        self.drop6 = nn.Dropout2D(p = 0.5, data_format='NCHW')
        # 全卷积
        self.fc7, self.relu7 = conv_relu(4096, 4096, ks=1, pad=0)
        # 随机丢弃层
        self.drop7 = nn.Dropout2D(p = 0.5, data_format='NCHW')
        
        # 算分层
        self.score = nn.Conv2D(in_channels=4096, out_channels=num_classes, kernel_size=1, stride=1, padding=0, weight_attr=paddle.ParamAttr())
        
        # 32x上采样 本来想用32倍因子的,但是1来因为fc6使原图不是1/32被缩放了,图像变小了。2来反正原图都已经固定了尺寸500*500,这边就直接上采样回去就是了
        self.upscore = nn.UpsamplingBilinear2D(size=(500,500))
        
    def forward(self, x):
        # 1/2x 卷积
        conv1_1 = self.relu1_1(self.conv1_1(x))
        conv1_2 = self.relu1_2(self.conv1_2(conv1_1))
        pool1 = self.pool1(conv1_2)
        
        # 1/4x 卷积
        conv2_1 = self.relu2_1(self.conv2_1(pool1))
        conv2_2 = self.relu2_2(self.conv2_2(conv2_1))
        pool2 = self.pool2(conv2_2)
        
        # 1/8x 卷积
        conv3_1 = self.relu3_1(self.conv3_1(pool2))
        conv3_2 = self.relu3_2(self.conv3_2(conv3_1))
        conv3_3 = self.relu3_3(self.conv3_3(conv3_2))
        pool3 = self.pool3(conv3_3)
        
        # 1/16x 卷积
        conv4_1 = self.relu4_1(self.conv4_1(pool3))
        conv4_2 = self.relu4_2(self.conv4_2(conv4_1))
        conv4_3 = self.relu4_3(self.conv4_3(conv4_2))
        pool4 = self.pool4(conv4_3)
        
        # 1/32x 卷积
        conv5_1 = self.relu5_1(self.conv5_1(pool4))
        conv5_2 = self.relu5_2(self.conv5_2(conv5_1))
        conv5_3 = self.relu5_3(self.conv5_3(conv5_2))
        pool5 = self.pool5(conv5_3)
        
        # 全卷积
        fc6 = self.relu6(self.fc6(pool5)) #变小了!因为卷积核是7,导致尺寸变更小了,不是上一层的1/2
        # 随机丢弃层
        drop6 = self.drop6(fc6)
        # 全卷积
        fc7 = self.relu7(self.fc7(drop6))
        # 随机丢弃层
        drop7 = self.drop7(fc7)
        
        # 算分层
        score = self.score(drop7)
        
        # 32x上采样 因为不能按准确的32倍上采样,这里是直接上采样到500*500
        upscore = self.upscore(score)
        # upscore = F.interpolate(
        #         score,
        #         paddle.shape(x)[2:],
        #         mode='bilinear')
        
        return upscore

3、训练

 

import os
import numpy as np
import paddle
import paddle.optimizer as opt
import paddle.nn.functional as F
from fcn_dataset import FcnDataset
from fcn import FCN32s
from convert_sbdd import make_palette # 这个文件fcn caffe源码里有,稍改一下

from PIL import Image

    
def main():
   
    paddle.seed(100)
    
    # 指定运行设备
    use_gpu = True if paddle.get_device().startswith("gpu") else False
    if use_gpu:
        paddle.set_device('gpu:0')
    
    IMAGE_SIZE = (500, 500)
    # 类别个数
    num_classes = 21
        
    # 定义数据集
    train_dataset = FcnDataset('train')
    dev_dataset = FcnDataset('dev')
    test_dataset = FcnDataset('test')
    
    # 定义网络
    network = FCN32s(num_classes=num_classes)
    model = paddle.Model(network)
    model.summary((-1,3,)+IMAGE_SIZE) # 查看网络结构
    
    # 定义优化器,这里先用 RMSProp 代替(论文用的是带动量的SGD,见本段代码后面的示意)
    optim = paddle.optimizer.RMSProp(learning_rate=0.001, 
                                 rho=0.9, 
                                 momentum=0.0, 
                                 epsilon=1e-07, 
                                 centered=False,
                                 parameters=model.parameters())
    
    model.prepare(optim, paddle.nn.CrossEntropyLoss(axis=1))

    model.fit(
        train_dataset,
        dev_dataset,
        epochs=15,
        batch_size=8,
        verbose=1
    )
    
    model.save("mymodel")
    
if __name__ == '__main__':
    main()
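上面图省事用了 RMSProp;如果想贴近论文 4.3 节“带动量的 SGD + 固定学习率 + 权重衰减”的设置,在 Paddle 里大致可以换成下面这样(学习率、权重衰减按论文 FCN-VGG16 的量级取,是否适合本复现还得实验):

import paddle

# 假设 model 仍是上文 paddle.Model 包装后的对象
optim = paddle.optimizer.Momentum(
    learning_rate=1e-4,                 # 论文中 FCN-VGG16 用固定学习率 10^-4
    momentum=0.9,                       # 论文:momentum 0.9
    weight_decay=5e-4,                  # 论文写作 5^-4,这里按常用的 5e-4 取,仅作示意
    parameters=model.parameters())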

模型参数打印 

----------------------------------------------------------------------------------
     Layer (type)          Input Shape          Output Shape         Param #    
==================================================================================
       Conv2D-1         [[1, 3, 500, 500]]   [1, 64, 698, 698]        1,792     
        ReLU-1         [[1, 64, 698, 698]]   [1, 64, 698, 698]          0       
       Conv2D-2        [[1, 64, 698, 698]]   [1, 64, 698, 698]       36,928     
        ReLU-2         [[1, 64, 698, 698]]   [1, 64, 698, 698]          0       
     MaxPool2D-1       [[1, 64, 698, 698]]   [1, 64, 349, 349]          0       
       Conv2D-3        [[1, 64, 349, 349]]   [1, 128, 349, 349]      73,856     
        ReLU-3         [[1, 128, 349, 349]]  [1, 128, 349, 349]         0       
       Conv2D-4        [[1, 128, 349, 349]]  [1, 128, 349, 349]      147,584    
        ReLU-4         [[1, 128, 349, 349]]  [1, 128, 349, 349]         0       
     MaxPool2D-2       [[1, 128, 349, 349]]  [1, 128, 174, 174]         0       
       Conv2D-5        [[1, 128, 174, 174]]  [1, 256, 174, 174]      295,168    
        ReLU-5         [[1, 256, 174, 174]]  [1, 256, 174, 174]         0       
       Conv2D-6        [[1, 256, 174, 174]]  [1, 256, 174, 174]      590,080    
        ReLU-6         [[1, 256, 174, 174]]  [1, 256, 174, 174]         0       
       Conv2D-7        [[1, 256, 174, 174]]  [1, 256, 174, 174]      590,080    
        ReLU-7         [[1, 256, 174, 174]]  [1, 256, 174, 174]         0       
     MaxPool2D-3       [[1, 256, 174, 174]]   [1, 256, 87, 87]          0       
       Conv2D-8         [[1, 256, 87, 87]]    [1, 512, 87, 87]      1,180,160   
        ReLU-8          [[1, 512, 87, 87]]    [1, 512, 87, 87]          0       
       Conv2D-9         [[1, 512, 87, 87]]    [1, 512, 87, 87]      2,359,808   
        ReLU-9          [[1, 512, 87, 87]]    [1, 512, 87, 87]          0       
      Conv2D-10         [[1, 512, 87, 87]]    [1, 512, 87, 87]      2,359,808   
       ReLU-10          [[1, 512, 87, 87]]    [1, 512, 87, 87]          0       
     MaxPool2D-4        [[1, 512, 87, 87]]    [1, 512, 43, 43]          0       
      Conv2D-11         [[1, 512, 43, 43]]    [1, 512, 43, 43]      2,359,808   
       ReLU-11          [[1, 512, 43, 43]]    [1, 512, 43, 43]          0       
      Conv2D-12         [[1, 512, 43, 43]]    [1, 512, 43, 43]      2,359,808   
       ReLU-12          [[1, 512, 43, 43]]    [1, 512, 43, 43]          0       
      Conv2D-13         [[1, 512, 43, 43]]    [1, 512, 43, 43]      2,359,808   
       ReLU-13          [[1, 512, 43, 43]]    [1, 512, 43, 43]          0       
     MaxPool2D-5        [[1, 512, 43, 43]]    [1, 512, 21, 21]          0       
      Conv2D-14         [[1, 512, 21, 21]]   [1, 4096, 15, 15]     102,764,544  
       ReLU-14         [[1, 4096, 15, 15]]   [1, 4096, 15, 15]          0       
     Dropout2D-1       [[1, 4096, 15, 15]]   [1, 4096, 15, 15]          0       
      Conv2D-15        [[1, 4096, 15, 15]]   [1, 4096, 15, 15]     16,781,312   
       ReLU-15         [[1, 4096, 15, 15]]   [1, 4096, 15, 15]          0       
     Dropout2D-2       [[1, 4096, 15, 15]]   [1, 4096, 15, 15]          0       
      Conv2D-16        [[1, 4096, 15, 15]]    [1, 21, 15, 15]        86,037     
UpsamplingBilinear2D-1  [[1, 21, 15, 15]]    [1, 21, 500, 500]          0       
==================================================================================
Total params: 134,346,581
Trainable params: 134,346,581
Non-trainable params: 0
----------------------------------------------------------------------------------
Input size (MB): 2.86
Forward/backward pass size (MB): 2197.93
Params size (MB): 512.49
Estimated Total Size (MB): 2713.29
---------------------------------------------------------------------------------- 

 15次迭代很快

4、预测

读取数据

import os
import random
import paddle
import paddle.io as io
from paddle.vision import transforms as T
import numpy as np
from PIL import Image
import scipy.io
from convert_sbdd import make_palette

# 只读取指定 filedi 的一张图片
class FcnTempDataset(io.Dataset):
    """
    数据集定义
    """
    def __init__(self, fileid=None):
        """
        构造函数
        """
        self.image_size = (500,500)
        self.img_path = './dataset/sbdd/img'
        self.label_path = './dataset/sbdd/cls'
        self.fileid = fileid
        
        assert(fileid != None)
        
        self.transform = T.Compose([T.Resize(self.image_size),
                                    T.Transpose(),
                                    T.Normalize(mean=127.5, std=127.5)
                                    ])
        
    def __getitem__(self, idx):
        img = np.array(Image.open('{}/{}.jpg'.format(self.img_path, self.fileid)).convert('RGB'), dtype='float32')
        img = self.transform(img)

        mat = scipy.io.loadmat('{}/{}.mat'.format(self.label_path, self.fileid))
        label_arr = mat['GTcls'][0]['Segmentation'][0].astype(np.uint8)
        label_im = Image.fromarray(label_arr)
        if label_im.mode not in ('L', 'I;16', 'I'):
            label_im = label_im.convert('L')
        # 与训练集一致:标签缩放用最近邻插值,避免类别ID被混合
        label_im = T.Compose([T.Resize(self.image_size, interpolation='nearest'),
                              T.Grayscale()
                              ])(label_im)
        label = np.array(label_im, dtype='int64')
        
        return img, label

    def __len__(self):
        return 1
    
    def getSizeAndOutput(self):
        #输出一下原图和标签图看看  返回原图大小,便于后续预测结果缩放回去
        img = Image.open('{}/{}.jpg'.format(self.img_path, self.fileid))
        assert(img != None)
        img.save('./pred/ori.png')
        
        palette = make_palette(256).reshape(-1)
        mat = scipy.io.loadmat('{}/{}.mat'.format(self.label_path, self.fileid))
        label_arr = mat['GTcls'][0]['Segmentation'][0].astype(np.uint8)
        label_im = Image.fromarray(label_arr)
        label_im.putpalette(palette)
        label_im.save('./pred/label.png')
        return img.size

 预测

import os
import numpy as np
import paddle
import paddle.nn.functional as F
from fcn_dataset import FcnTempDataset
from fcn import FCN32s
from convert_sbdd import make_palette

from PIL import Image
    
def pred():
    paddle.seed(100)
    
    # 指定运行设备
    use_gpu = True if paddle.get_device().startswith("gpu") else False
    if use_gpu:
        paddle.set_device('gpu:0')
    
    IMAGE_SIZE = (500, 500)
    network = FCN32s(num_classes=21)
    model = paddle.Model(network)
    model.load("mymodel")
    model.prepare()
    model.summary((-1,3,)+IMAGE_SIZE)
    
    img_id = '2008_000119'
    one_image = FcnTempDataset(img_id)
    img_size = one_image.getSizeAndOutput()
    print("Image id:{} image size: {} * {}".format(img_id, img_size[0], img_size[1]))
    
    predict_results = model.predict(one_image)
    
    palette = make_palette(256).reshape(-1)
    
    data = predict_results[0][0][0].transpose((1, 2, 0))
    mask = np.argmax(data, axis=2)
    mask = mask.astype('uint8')
    mask = np.squeeze(mask)
    label_im = Image.fromarray(mask)
    label_im.putpalette(palette)
    label_im = label_im.resize(img_size)
    label_im.save('./pred/pred.png')
    
if __name__ == '__main__':
    pred()

效果:

15次迭代的32s也就只能这样了。感觉标签ID被弄混了,颜色不对(标签若用双线性插值缩放,确实会把类别ID混在一起,读取标签时应改用最近邻插值)。caffe不方便调试,paddle方便多了,但每个地方都要查过去确实花了许多时间。

三、论文阅读(翻译)

摘要

Convolutional networks are powerful visual models that yield hierarchies of features. We show that convolutional networks by themselves, trained end-to-end, pixels-to-pixels, exceed the state-of-the-art in semantic segmentation. Our key insight is to build “fully convolutional” networks that take input of arbitrary size and produce correspondingly-sized output with efficient inference and learning. We define and detail the space of fully convolutional networks, explain their application to spatially dense prediction tasks, and draw connections to prior models. We adapt contemporary classification networks (AlexNet [20], the VGG net [31], and GoogLeNet [32]) into fully convolutional networks and transfer their learned representations by fine-tuning [3] to the segmentation task. We then define a skip architecture that combines semantic information from a deep, coarse layer with appearance information from a shallow, fine layer to produce accurate and detailed segmentations. Our fully convolutional network achieves state-of-the-art segmentation of PASCAL VOC (20% relative improvement to 62.2% mean IU on 2012), NYUDv2, and SIFT Flow, while inference takes less than one fifth of a second for a typical image.

卷积网络是强大的视觉模型,能够产出层次化的特征。我们展示了:卷积网络本身,经过端到端、像素到像素的训练,就能超过当前最先进的语义分割方法。我们的关键见解是构建“全卷积”网络:它接收任意大小的输入,并通过高效的推理和学习产出相应大小的输出。我们定义并详述了全卷积网络的设计空间,阐述了它们在空间密集预测任务中的应用,并建立了与先前模型的联系。我们将当代的分类网络(AlexNet、VGG、GoogLeNet)改造成全卷积网络,并通过微调把它们学到的表征迁移到分割任务上。随后我们定义了一种跳跃结构,把深的、粗糙层的语义信息与浅的、精细层的外观信息结合起来,产生准确而精细的分割。我们的全卷积网络在 PASCAL VOC(2012年相对提升20%,平均IU达到62.2%)、NYUDv2 以及 SIFT Flow 数据集上达到了最先进水平,同时对一张典型图像的推理时间不到五分之一秒。

1. 引言

Convolutional networks are driving advances in recognition. Convnets are not only improving for whole-image classification [20, 31, 32], but also making progress on local tasks with structured output. These include advances in bounding box object detection [29, 10, 17], part and keypoint prediction [39, 24], and local correspondence [24, 8].

卷积网络正在推动识别技术的进步。卷积网络不仅提升了整图分类,还在具有结构化输出的局部任务上取得了进展,包括边界框目标检测、部件与关键点预测,以及局部对应(correspondence)。

The natural next step in the progression from coarse to fine inference is to make a prediction at every pixel. Prior approaches have used convnets for semantic segmentation [27, 2, 7, 28, 15, 13, 9], in which each pixel is labeled with the class of its enclosing object or region, but with shortcomings that this work addresses.

 从粗糙推理到精确推理的下一步改进自然是每个像素的预测。先前的方法已经使用了卷积网络进行语义分割,其中每个像素都用其封闭对象或区域的类来标记,但它们所遇到的缺点被本项(本论文)的工作解决了。

We show that a fully convolutional network (FCN) trained end-to-end, pixels-to-pixels on semantic segmentation exceeds the state-of-the-art without further machinery. To our knowledge, this is the first work to train FCNs end-to-end (1) for pixelwise prediction and (2) from supervised pre-training. Fully convolutional versions of existing networks predict dense outputs from arbitrary-sized inputs. Both learning and inference are performed whole-image-ata-time by dense feedforward computation and backpropagation. In-network upsampling layers enable pixelwise prediction and learning in nets with subsampled pooling.

我们表明,在不附加额外机制的情况下,经过端到端、像素到像素训练的全卷积网络(FCN)在语义分割上超过了当前最先进的水平。据我们所知,这是第一个把FCN训练成端到端的工作:(1)用于像素级预测;(2)并且基于有监督的预训练。现有网络的全卷积版本可以从任意大小的输入预测密集输出。学习和推理都通过密集的前馈计算和反向传播,在整张图像上一次性完成。网络内的上采样层使得带有下采样池化的网络也能进行像素级的预测和学习。

This method is efficient, both asymptotically and absolutely, and precludes the need for the complications in other works. Patchwise training is common [27, 2, 7, 28, 9], but lacks the efficiency of fully convolutional training. Our approach does not make use of pre- and post-processing complications, including superpixels [7, 15], proposals [15, 13], or post-hoc refinement by random fields or local classifiers [7, 15]. Our model transfers recent success in classification [20, 31, 32] to dense prediction by reinterpreting classification nets as fully convolutional and fine-tuning from their learned representations. In contrast, previous works have applied small convnets without supervised pre-training [7, 28, 27].

这种方法无论在渐近意义上还是绝对意义上都是高效的,并且免去了其他工作中那些复杂的组件。块训练(patchwise training)很常见,但缺乏全卷积训练的效率。我们的方法没有使用复杂的前处理和后处理,包括超像素、候选区域(proposals),以及用随机场或局部分类器进行的事后修正。我们的模型把近来分类任务上的成功迁移到密集预测上:把分类网络重新解释为全卷积网络,并从其学到的表征出发进行微调。相比之下,先前的工作是在没有有监督预训练的情况下使用小型卷积网络。

Semantic segmentation faces an inherent tension between semantics and location: global information resolves what while local information resolves where. Deep feature hierarchies encode location and semantics in a nonlinear local-to-global pyramid. We define a skip architecture to take advantage of this feature spectrum that combines deep, coarse, semantic information and shallow, fine, appearance information in Section 4.2(see Figure3)

语义分割面临着语义和位置之间的内在冲突:全局信息解决“什么”的问题,而局部信息解决“哪里”的问题。深层特征层次在非线性局部到全局金字塔中对位置和语义进行编码。我们在第4.2节中定义了一个跳跃架构,以利用结合了深度、粗糙的语义信息和浅层、精细的外观信息的优势的特征谱。(见图3)

In the next section, we review related work on deep classification nets, FCNs, and recent approaches to semantic segmentation using convnets. The following sections explain FCN design and dense prediction tradeoffs, introduce our architecture with in-network upsampling and multilayer combinations, and describe our experimental framework. Finally, we demonstrate state-of-the-art results on PASCAL VOC 2011-2, NYUDv2, and SIFT Flow.

 在下一节中,我们将回顾深度分类网、FCN以及最近使用卷积网进行语义分割的方法的相关工作。以下部分解释了FCN设计和密集预测权衡,介绍了我们的网络内上采样和多层组合架构,并描述了我们的实验框架。最后,我们展示了PASCAL VOC 2011-2、NYUDv2和SIFT Flow的最新结果。

2. 相关工作

Our approach draws on recent successes of deep nets for image classification [20, 31, 32] and transfer learning [3, 38]. Transfer was first demonstrated on various visual recognition tasks [3, 38], then on detection, and on both instance and semantic segmentation in hybrid proposalclassifier models [10, 15, 13]. We now re-architect and finetune classification nets to direct, dense prediction of semantic segmentation. We chart the space of FCNs and situate prior models, both historical and recent, in this framework.

我们的方法借鉴了深度网络最近在图像分类和迁移学习上的成功。迁移学习最早在各类视觉识别任务上得到验证,随后是检测任务,以及混合“提议-分类器”模型中的实例分割和语义分割。我们现在对分类网络进行重新架构和微调,使其直接进行语义分割的密集预测。我们刻画了FCN的设计空间,并把历史上和近期的先前模型置于这一框架中。

Fully convolutional networks To our knowledge, the idea of extending a convnet to arbitrary-sized inputs first appeared in Matan et al. [26], which extended the classic LeNet [21] to recognize strings of digits. Because their net was limited to one-dimensional input strings, Matan et al. used Viterbi decoding to obtain their outputs. Wolf and Platt [37] expand convnet outputs to 2-dimensional maps of detection scores for the four corners of postal address blocks. Both of these historical works do inference and learning fully convolutionally for detection. Ning et al. [27] define a convnet for coarse multiclass segmentation of C. elegans tissues with fully convolutional inference.

全卷积网络据我们所知,将卷积网扩展到任意大小输入的想法首次出现在Matan等人中,他们将经典的LeNet扩展到识别数字串。由于他们的网络仅限于一维输入字符串,Matan等人使用维特比解码来获得他们的输出。Wolf和Platt将卷积网络输出扩展为邮政地址块四个角的检测分数的二维映射。这两部历史著作都在检测任务上使用了全卷积网络进行了推理和学习。Ning等人定义了一种利用全卷积网络推理对秀丽隐杆线虫组织进行粗多类分割的卷积网络。

Fully convolutional computation has also been exploited in the present era of many-layered nets. Sliding window detection by Sermanet et al. [29], semantic segmentation by Pinheiro and Collobert [28], and image restoration by Eigen et al. [4] do fully convolutional inference. Fully convolutional training is rare, but used effectively by Tompson et al. [35] to learn an end-to-end part detector and spatial model for pose estimation, although they do not exposit on or analyze this method.

全卷积计算也被用于当前的多层网络时代。Sermanet等人的滑动窗口检测,Pinheiro和Collobert等人的语义分割[28],以及Eigen等人的图像恢复[4]都使用全卷积推理。全卷积训练是罕见的,但被Tompson等人有效地使用,用于学习用于姿态估计的端到端部分检测器和空间模型,尽管他们没有公开或分析这种方法。 

Alternatively, He et al. [17] discard the nonconvolutional portion of classification nets to make a feature extractor. They combine proposals and spatial pyramid pooling to yield a localized, fixed-length feature for classification. While fast and effective, this hybrid model cannot be learned end-to-end.

 或者,He等人丢弃分类网络的非卷积部分以制作特征提取器。它们结合了提议和空间金字塔池,为分类提供了一个本地化的固定长度特征。尽管这种混合模型快速有效,但无法端到端地学习。

Dense prediction with convnets Several recent works have applied convnets to dense prediction problems, including semantic segmentation by Ning et al. [27], Farabet et al.[7], and Pinheiro and Collobert [28]; boundary prediction for electron microscopy by Ciresan et al. [2] and for natural images by a hybrid convnet/nearest neighbor model by Ganin and Lempitsky [9]; and image restoration and depth estimation by Eigen et al. [4, 5]. Common elements of these approaches include

• small models restricting capacity and receptive fields;

• patchwise training [27, 2, 7, 28, 9];

• post-processing by superpixel projection, random field regularization, filtering, or local classification [7, 2, 9];

• input shifting and output interlacing for dense output [29,

28, 9];

• multi-scale pyramid processing [7, 28, 9];

• saturating tanh nonlinearities [7, 4, 28]; and

• ensembles [2, 9],

whereas our method does without this machinery. However, we do study patchwise training 3.4 and “shift-and-stitch” dense output 3.2 from the perspective of FCNs. We also discuss in-network upsampling 3.3, of which the fully connected prediction by Eigen et al. [5] is a special case.

卷积网络的密集预测 最近的几项工作把卷积网络应用于密集预测问题,包括 Ning 等人、Farabet 等人以及 Pinheiro 和 Collobert 的语义分割;Ciresan 等人针对电子显微镜图像的边界预测,Ganin 和 Lempitsky 用混合卷积网络/最近邻模型对自然图像的边界预测;以及 Eigen 等人的图像恢复和深度估计。这些方法的共同要素包括

  • 小模型,限制了容量和感受野;
  • 块训练(patchwise training);
  • 通过超像素投影、随机场正则化、滤波或局部分类进行后处理;
  • 输入移位和输出交错以得到密集输出;
  • 多尺度金字塔处理;
  • 饱和的 tanh 非线性;
  • 模型集成(ensembles)。

而我们的方法不需要这些机制。不过,我们确实从 FCN 的角度研究了块训练(3.4节)和“移位缝合”密集输出(3.2节)。我们还讨论了网络内上采样(3.3节),Eigen 等人的全连接预测是它的一个特例。

Unlike these existing methods, we adapt and extend deep classification architectures, using image classification as supervised pre-training, and fine-tune fully convolutionally to learn simply and efficiently from whole image inputs and whole image ground truths.

与这些现有方法不同,我们适应和扩展深度分类架构,使用图像分类作为监督预训练,并进行全卷积微调,以简单有效地从整个图像输入和整个图像真值进行学习。

Hariharan et al. [15] and Gupta et al. [13] likewise adapt deep classification nets to semantic segmentation, but do so in hybrid proposal-classifier models. These approaches fine-tune an R-CNN system [10] by sampling bounding boxes and/or region proposals for detection, semantic segmentation, and instance segmentation. Neither method is learned end-to-end. They achieve state-of-the-art segmentation results on PASCAL VOC and NYUDv2 respectively, so we directly compare our standalone, end-to-end FCN to their semantic segmentation results in Section 5.

哈里哈兰等人和古普塔等人,同样使深度分类网络适应语义分割,但也同样在使用混合提议分类器模型。 这些方法通过采样边界框和/或区域建议来微调R-CNN系统,以进行检测、语义分割和实例分割。 这两种方法都不是端到端学习的。 他们分别在 PASCAL VOC 和 NYUDv2 上实现了最先进的分割结果,因此我们直接将我们的独立端到端 FCN 与第 5 章中的语义分割结果进行比较。 

We fuse features across layers to define a nonlinear localto-global representation that we tune end-to-end. In contemporary work Hariharan et al. [16] also use multiple layers in their hybrid model for semantic segmentation.

我们跨层融合特征来定义我们端到端调整的非线性局部到全局表示。 在当前的工作中,Hariharan 等人还在其混合模型中使用多层进行语义分割。 

3. 全卷积网络

Each layer of data in a convnet is a three-dimensional array of size h × w × d, where h and w are spatial dimensions, and d is the feature or channel dimension. The first layer is the image, with pixel size h × w, and d color channels. Locations in higher layers correspond to the locations in the image they are path-connected to, which are called their receptive fields.

卷积网络中的每一层数据都是一个大小为 h × w × d 的三维数组,其中 h 和 w 是空间维度,d 是特征或通道维度。 第一层是图像,像素大小为 h × w,有 d 个颜色通道。 较高网络层中的位置对应于它们路径连接到的图像中的位置,称为它们的感受野。 

Convnets are built on translation invariance. Their basic components (convolution, pooling, and activation functions) operate on local input regions, and depend only on relative spatial coordinates. Writing xij for the data vector at location (i, j) in a particular layer, and yij for the following layer, these functions compute outputs yij by

where k is called the kernel size, s is the stride or subsampling factor, and fks determines the layer type: a matrix multiplication for convolution or average pooling, a spatial max for max pooling, or an elementwise nonlinearity for an activation function, and so on for other types of layers.

卷积网络建立在平移不变性的基础上。它们的基本组件(卷积、池化和激活函数)作用于局部输入区域,并且只依赖相对的空间坐标。记特定层中位置 (i, j) 处的数据向量为 x_ij,其下一层对应位置的数据向量为 y_ij,这些函数通过以下方式计算输出 y_ij:
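(原文此处为公式图,未随文字带出,按论文补记如下:)

y_{ij} = f_{ks}\left( \{ x_{si+\delta i,\ sj+\delta j} \}_{0 \le \delta i,\, \delta j < k} \right)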

其中 k 称为核大小,s 是步幅或下采样因子,f_ks 决定层的类型:卷积或平均池化对应矩阵乘法,最大池化对应空间最大值,激活函数对应逐元素非线性,其他类型的层以此类推。

This functional form is maintained under composition, with kernel size and stride obeying the transformation rule

While a general deep net computes a general nonlinear function, a net with only layers of this form computes a nonlinear filter, which we call a deep filter or fully convolutional network. An FCN naturally operates on an input of any size, and produces an output of corresponding (possibly resampled) spatial dimensions.

 这种函数形式在复合(层层叠加)下保持不变,核大小和步幅遵循如下转换规则:
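(原文的转换规则公式按论文补记如下:)

f_{ks} \circ g_{k's'} = (f \circ g)_{k' + (k-1)s',\ ss'}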

这个公式一开始没看懂,琢磨之后的理解是:○ 不是张量积,而是函数(层)的复合,f∘g 表示先经过 g 层再经过 f 层;k'、s' 是靠近输入的 g 层的核大小和步幅,k、s 是后面 f 层的。复合之后整体仍是同样的函数形式,等效核大小为 k'+(k−1)s',等效步幅为 s·s'(例如两个 2×2、步幅为2 的池化叠加,等效于核 4、步幅 4)。如有理解错误欢迎指正。

虽然一般的深度网络计算一般的非线性函数,但仅具有这种形式的层的网络计算非线性滤波器,我们将其称为深度滤波器或全卷积网络。FCN自然地对任何大小的输入进行操作,并产生相应的(可能是重新采样的)空间维度的输出。

A real-valued loss function composed with an FCN defines a task. If the loss function is a sum over the spatial dimensions of the final layer, L(x; θ) = ∑ij L` (xij ; θ), its gradient will be a sum over the gradients of each of its spatial components. Thus stochastic gradient descent on L computed on whole images will be the same as stochastic gradient descent on L`, taking all of the final layer receptive fields as a minibatch. 

由实值损失函数与 FCN 复合便定义了一个任务。如果损失函数是对最后一层空间维度求和,即 L(x; θ) = Σ_ij L′(x_ij; θ),那么它的梯度就是各空间分量梯度之和。因此,在整张图像上计算的 L 的随机梯度下降,与把最后一层的所有感受野当作一个 minibatch、在 L′ 上做随机梯度下降是一样的。

When these receptive fields overlap significantly, both feedforward computation and backpropagation are much more efficient when computed layer-by-layer over an entire image instead of independently patch-by-patch.

当这些感受野显著重叠时,在整张图像上逐层计算的前馈和反向传播,要比逐块独立计算高效得多。

We next explain how to convert classification nets into fully convolutional nets that produce coarse output maps. For pixelwise prediction, we need to connect these coarse outputs back to the pixels. Section 3.2 describes a trick, fast scanning [11], introduced for this purpose. We gain insight into this trick by reinterpreting it as an equivalent network modification. As an efficient, effective alternative, we introduce deconvolution layers for upsampling in Section 3.3. In Section 3.4 we consider training by patchwise sampling, and give evidence in Section 4.3 that our whole image training is faster and equally effective.

接下来我们解释如何将分类网络转换为产生粗略输出图的全卷积网络。 对于像素预测,我们需要将这些粗略输出连接回像素级别的。 3.2 节描述了为此目的引入的一个技巧,即快速扫描。 我们通过将其重新解释为等效的网络修改来深入了解这一技巧。 作为一种高效、有效的替代方案,我们在 3.3 节中引入了用于上采样的反卷积层。 在第 3.4 节中,我们考虑通过块级别采样进行训练,并在第 4.3 节中给出证据,证明我们的整个图像训练速度更快且同样有效。

3.1 调整分类器进行密集预测

图 2. 将全连接层转换为卷积层使分类网络能够输出热图。 添加层和空间损失(如图 1 所示)可产生用于端到端密集学习的高效机器。

Typical recognition nets, including LeNet [21], AlexNet [20], and its deeper successors [31, 32], ostensibly take fixed-sized inputs and produce non-spatial outputs. The fully connected layers of these nets have fixed dimensions and throw away spatial coordinates. However, these fully connected layers can also be viewed as convolutions with kernels that cover their entire input regions. Doing so casts them into fully convolutional networks that take input of any size and output classification maps. This transformation is illustrated in Figure 2

典型的识别网络,包括 LeNet、AlexNet 及其更深层次的后继者,基本上都采用固定大小的输入并产生非空间输出。 这些网络的全连接层具有固定的尺寸并抛弃空间坐标。 然而,这些全连接层也可以被视为具有覆盖其整个输入区域的内核的卷积。 这样做会将它们放入全卷积网络中,该网络接受任意大小的输入并输出分类图。 这种转变如图 2 所示。

Furthermore, while the resulting maps are equivalent to the evaluation of the original net on particular input patches, the computation is highly amortized over the overlapping regions of those patches. For example, while AlexNet takes 1.2 ms (on a typical GPU) to infer the classification scores of a 227×227 image, the fully convolutional net takes 22 ms to produce a 10×10 grid of outputs from a 500×500 image, which is more than 5 times faster than the naive approach 

此外,虽然生成的图相当于对特定输入块的原始网络的评估,但计算在这些块的重叠区域上进行了高度摊销。 例如,AlexNet 需要 1.2 毫秒(在典型的 GPU 上)来推断 227×227 图像的分类分数,而全卷积网络只需要 22 毫秒就能从 500×500 图像产生 10×10 的输出网格,这比朴素方法快 5 倍以上。

The spatial output maps of these convolutionalized models make them a natural choice for dense problems like semantic segmentation. With ground truth available at every output cell, both the forward and backward passes are straightforward, and both take advantage of the inherent computational efficiency (and aggressive optimization) of convolution. The corresponding backward times for the AlexNet example are 2.4 ms for a single image and 37 ms for a fully convolutional 10 × 10 output map, resulting in a speedup similar to that of the forward pass.

这些卷积模型的空间输出图使它们成为语义分割等密集问题的自然选择。 由于每个输出单元都有真实数据,前向和后向传递都很简单,并且都利用了卷积固有的计算效率(和积极的优化)。 AlexNet 示例的相应后向时间对于单个图像为 2.4 毫秒,对于完全卷积 10 × 10 输出图为 37 毫秒,从而获得与前向传递类似的加速。 

While our reinterpretation of classification nets as fully convolutional yields output maps for inputs of any size, the output dimensions are typically reduced by subsampling. The classification nets subsample to keep filters small and computational requirements reasonable. This coarsens the output of a fully convolutional version of these nets, reducing it from the size of the input by a factor equal to the pixel stride of the receptive fields of the output units. 

虽然把分类网络重新解释为全卷积网络后,可以为任意大小的输入生成输出图,但输出的尺寸通常会因下采样而变小。分类网络用下采样来保持滤波器较小、计算量合理,这使得这些网络的全卷积版本的输出变得粗糙:其尺寸相对输入缩小的倍数,等于输出单元感受野的像素步幅。

3.2 “移位缝合”是滤波器稀疏化

Dense predictions can be obtained from coarse outputs by stitching together output from shifted versions of the input. If the output is downsampled by a factor of f, shift the input x pixels to the right and y pixels down, once for every (x, y) s.t. 0 ≤ x, y < f. Process each of these f^2 inputs, and interlace the outputs so that the predictions correspond to the pixels at the centers of their receptive fields.

把输入的各个平移版本的输出拼接在一起,就能从粗糙输出得到密集预测。如果输出是按因子 f 下采样得到的,就把输入向右平移 x 个像素、向下平移 y 个像素,对每一组满足 0 ≤ x, y < f 的 (x, y) 各做一次。把这 f^2 个输入分别处理,并把输出交错拼接,使得每个预测都对应其感受野中心的像素。
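用一小段 numpy 把这个“平移-交错”过程演示一下(把“网络层”简化成 stride=2 的抽取,平移用 np.roll 近似代替,仅为示意):

import numpy as np

x = np.arange(16).reshape(4, 4)   # 假想的输入
f = 2                             # 下采样因子

def layer(a):                     # 把网络层简化为 stride=2 的抽取(输出是输入的 1/f)
    return a[::f, ::f]

dense = np.zeros_like(x)
for dy in range(f):
    for dx in range(f):
        shifted = np.roll(np.roll(x, -dy, axis=0), -dx, axis=1)  # 对输入做 (dy, dx) 平移
        dense[dy::f, dx::f] = layer(shifted)                     # 输出交错拼回对应位置
print(np.array_equal(dense, x))   # True:f^2 次前向的输出拼出了密集结果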

Although performing this transformation naively increases the cost by a factor of f^2, there is a well-known trick for efficiently producing identical results [11, 29] known to the wavelet community as the à trous algorithm [25]. Consider a layer (convolution or pooling) with input stride s, and a subsequent convolution layer with filter weights fij (eliding the irrelevant feature dimensions). Setting the lower layer’s input stride to 1 upsamples its output by a factor of s. However, convolving the original filter with the upsampled output does not produce the same result as shift-and-stitch, because the original filter only sees a reduced portion of its (now upsampled) input. To reproduce the trick, rarefy the filter by enlarging it as

 (with i and j zero-based). Reproducing the full net output of the trick involves repeating this filter enlargement layerby-layer until all subsampling is removed. (In practice, this can be done efficiently by processing subsampled versions of the upsampled input.)

 直接这样做会使成本增加 f^2 倍,但有一个众所周知的技巧可以高效地产生相同的结果,小波领域称之为 à trous 算法。考虑一个输入步幅为 s 的层(卷积或池化),其后接一个滤波器权重为 f_ij 的卷积层(省略无关的特征维度)。把下层的输入步幅设为 1,会把它的输出上采样 s 倍。然而,把原始滤波器与上采样后的输出做卷积,得到的结果与移位缝合并不相同,因为原始滤波器只能看到其(已上采样的)输入中的一小部分。要重现该技巧,需要把滤波器放大(稀疏化)为:
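(原文此处的滤波器放大公式按论文补记如下:)

f'_{ij} = \begin{cases} f_{i/s,\ j/s}, & \text{若 } s \text{ 整除 } i \text{ 与 } j \\ 0, & \text{否则} \end{cases}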

 (i 和 j 从零开始)。要重现该技巧在整个网络上的输出,需要逐层重复这种滤波器放大,直到所有的下采样都被消除。(实际中,这可以通过处理上采样输入的下采样版本来高效实现。)

Decreasing subsampling within a net is a tradeoff: the filters see finer information, but have smaller receptive fields and take longer to compute. The shift-and-stitch trick is another kind of tradeoff: the output is denser without decreasing the receptive field sizes of the filters, but the filters are prohibited from accessing information at a finer scale than their original design. 

减少网络内的下采样是一种权衡:滤波器能看到更精细的信息,但感受野更小、计算时间更长。移位缝合技巧是另一种权衡:输出更密集,且不减小滤波器的感受野,但滤波器无法以比其原始设计更精细的尺度获取信息。

Although we have done preliminary experiments with this trick, we do not use it in our model. We find learning through upsampling, as described in the next section, to be more effective and efficient, especially when combined with the skip layer fusion described later on. 

尽管我们已经用这个技巧做了初步实验,但我们并没有在我们的模型中使用它。 我们发现通过上采样进行学习(如下一节所述)更加有效和高效,尤其是与稍后描述的跳层融合相结合时。 

3.3 上采样是反向跨步卷积

Another way to connect coarse outputs to dense pixels is interpolation. For instance, simple bilinear interpolation computes each output yij from the nearest four inputs by a linear map that depends only on the relative positions of the input and output cells.

将粗略输出连接到密集像素的另一种方法是插值。 例如,简单的双线性插值通过线性映射从最近的四个输入计算每个输出 yij,该线性映射仅取决于输入和输出单元的相对位置。

In a sense, upsampling with factor f is convolution with a fractional input stride of 1/f. So long as f is integral, a natural way to upsample is therefore backwards convolution (sometimes called deconvolution) with an output stride of f. Such an operation is trivial to implement, since it simply reverses the forward and backward passes of convolution. Thus upsampling is performed in-network for end-to-end learning by backpropagation from the pixelwise loss.

从某种意义上说,按因子 f 上采样就是输入步幅为 1/f 的分数步幅卷积。只要 f 是整数,上采样的一个自然方式就是输出步幅为 f 的反向卷积(有时称为反卷积)。这种操作实现起来很简单,因为它只是把卷积的前向和反向过程颠倒过来。因此,上采样可以在网络内部进行,通过像素级损失的反向传播实现端到端学习。

Note that the deconvolution filter in such a layer need not be fixed (e.g., to bilinear upsampling), but can be learned. A stack of deconvolution layers and activation functions can even learn a nonlinear upsampling. 

注意哈,此类层中的反卷积滤波器不需要固定(例如,双线性上采样),但可以被学习。 一堆反卷积层和激活函数甚至可以学习非线性上采样。 
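对应到代码上,“反向跨步卷积”就是转置卷积。前文 caffe 源码里 kernel_size=64、stride=32 的 Deconvolution 层,用 Paddle 大致可以写成下面这样(要不要冻结权重、是否初始化成双线性核都可以按需选择,仅为示意):

import paddle.nn as nn

# 32 倍上采样的反卷积(转置卷积)层:输出步幅 f=32,核大小 64,不带偏置
upscore = nn.Conv2DTranspose(in_channels=21, out_channels=21,
                             kernel_size=64, stride=32, bias_attr=False)
# 权重既可以按双线性插值核初始化后固定,也可以像论文说的那样继续学习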

In our experiments, we find that in-network upsampling is fast and effective for learning dense prediction. Our best segmentation architecture uses these layers to learn to upsample for refined prediction in Section 4.2. 

在我们的实验中,我们发现网络内上采样对于学习密集预测来说是快速且有效的。 我们最好的分割架构使用这些层来学习上采样以实现第 4.2 节中的精细预测。 

3.4 块训练就是损失采样

In stochastic optimization, gradient computation is driven by the training distribution. Both patchwise training and fully convolutional training can be made to produce any distribution, although their relative computational efficiency depends on overlap and minibatch size. Whole image fully convolutional training is identical to patchwise training where each batch consists of all the receptive fields of the units below the loss for an image (or collection of images). While this is more efficient than uniform sampling of patches, it reduces the number of possible batches. However, random selection of patches within an image may be recovered simply. Restricting the loss to a randomly sampled subset of its spatial terms (or, equivalently applying a DropConnect mask [36] between the output and the loss) excludes patches from the gradient computation.

在随机优化中,梯度计算由训练数据的分布驱动。块训练和全卷积训练都可以产生任意分布,尽管它们的相对计算效率取决于重叠程度和 minibatch 大小。整图的全卷积训练等同于这样一种块训练:每个 batch 由一张(或一组)图像的损失以下各单元的所有感受野组成。虽然这比对块做均匀采样更高效,但它减少了可能的 batch 数量。不过,图像内块的随机选择很容易恢复:把损失限制在其空间项的一个随机采样子集上(或者等价地,在输出和损失之间应用一个 DropConnect 掩码),就可以把某些块排除在梯度计算之外。

If the kept patches still have significant overlap, fully convolutional computation will still speed up training. If gradients are accumulated over multiple backward passes, batches can include patches from several images. 

如果保留下来的块仍然有显著的重叠,全卷积计算仍然会加速训练。如果梯度在多次反向传播中累积,一个 batch 就可以包含来自多张图像的块。

Sampling in patchwise training can correct class imbalance [27, 7, 2] and mitigate the spatial correlation of dense patches [28, 15]. In fully convolutional training, class balance can also be achieved by weighting the loss, and loss sampling can be used to address spatial correlation. 

块训练中的采样可以纠正类别不平衡并减轻密集块的空间相关性。 在全卷积训练中,还可以通过对损失进行加权来实现类别平衡,并且可以使用损失采样来解决空间相关性。 

We explore training with sampling in Section 4.3, and do not find that it yields faster or better convergence for dense prediction. Whole image training is effective and efficient.

我们在 4.3 节中探索了采样训练,但没有发现它可以为密集预测带来更快或更好的收敛。 全图训练是有效且高效的。 

4. 分割结构

We cast ILSVRC classifiers into FCNs and augment them for dense prediction with in-network upsampling and a pixelwise loss. We train for segmentation by fine-tuning. Next, we add skips between layers to fuse coarse, semantic and local, appearance information. This skip architecture is learned end-to-end to refine the semantics and spatial precision of the output.

我们将 ILSVRC 分类器放入FCN中,并通过网络内上采样和像素损失来增强它们以进行密集预测。 我们通过微调来训练分割。 接下来,我们在层之间添加跳跃以融合粗略的语义和局部外观信息。 这种跳跃架构是端到端学习的,以细化输出的语义和空间精度。 

For this investigation, we train and validate on the PASCAL VOC 2011 segmentation challenge [6]. We train with a per-pixel multinomial logistic loss and validate with the standard metric of mean pixel intersection over union, with the mean taken over all classes, including background. The training ignores pixels that are masked out (as ambiguous or difficult) in the ground truth.

在这项研究中,我们在 PASCAL VOC 2011 分割挑战上训练和验证。我们用逐像素的多项逻辑斯蒂损失进行训练,用标准指标——平均像素交并比(mean IU)进行验证,均值在包括背景在内的所有类别上计算。训练会忽略真值中被掩掉(标注不明确或困难)的像素。

4.1 从分类器到密集 FCN

 We begin by convolutionalizing proven classification architectures as in Section 3. We consider the AlexNet 3 architecture [20] that won ILSVRC12, as well as the VGG nets [31] and the GoogLeNet 4 [32] which did exceptionally well in ILSVRC14. We pick the VGG 16-layer net 5 , which we found to be equivalent to the 19-layer net on this task. For GoogLeNet, we use only the final loss layer, and improve performance by discarding the final average pooling layer. We decapitate each net by discarding the final classifier layer, and convert all fully connected layers to convolutions. We append a 1 × 1 convolution with channel dimension 21 to predict scores for each of the PASCAL classes (including background) at each of the coarse output locations, followed by a deconvolution layer to bilinearly upsample the coarse outputs to pixel-dense outputs as described in Section 3.3. Table 1 compares the preliminary validation results along with the basic characteristics of each net. We report the best results achieved after convergence at a fixed learning rate (at least 175 epochs).

我们首先按第 3 节的做法,把经过验证的分类架构卷积化。我们考虑了在 ILSVRC12 中获胜的 AlexNet 架构,以及在 ILSVRC14 中表现出色的 VGG 网络和 GoogLeNet。我们选择了 VGG 16 层网络,我们发现它在这个任务上与 19 层网络效果相当。对于 GoogLeNet,我们只使用最终的损失层,并通过丢弃最后的平均池化层来提升性能。我们把每个网络“砍头”:丢弃最终的分类器层,并把所有全连接层转换为卷积层。然后附加一个通道数为 21 的 1×1 卷积,在每个粗糙输出位置上预测各 PASCAL 类别(含背景)的分数,其后再接一个反卷积层,按 3.3 节的方式把粗糙输出双线性上采样为像素级密集输出。表 1 比较了初步验证结果以及各网络的基本特征。我们报告的是在固定学习率下收敛后(至少 175 个 epoch)取得的最佳结果。

表1 我们改造并扩展了三个分类卷积网络。我们通过 PASCAL VOC 2011 验证集上的平均交并比以及推理时间(在 NVIDIA Tesla K40c 上对 500×500 输入做 20 次试验取平均)来比较性能。我们还详细列出了改造后用于密集预测的网络结构:参数层数、输出单元的感受野大小以及网络内的最粗步幅。(这些数字是在固定学习率下获得的最佳性能,并非可能的最佳性能。)

Fine-tuning from classification to segmentation gave reasonable predictions for each net. Even the worst model achieved ∼ 75% of state-of-the-art performance. The segmentation-equipped VGG net (FCN-VGG16) already appears to be state-of-the-art at 56.0 mean IU on val, compared to 52.6 on test [15]. Training on extra data raises FCN-VGG16 to 59.4 mean IU and FCN-AlexNet to 48.0 mean IU on a subset of val 7 . Despite similar classification accuracy, our implementation of GoogLeNet did not match the VGG16 segmentation result. 

从分类到分割的微调为每个网络都带来了合理的预测。即使是最差的模型,也达到了此前最先进性能的约 75%。配备分割能力的 VGG 网络(FCN-VGG16)已经达到了最先进水平:在验证集上平均 IU 为 56.0,而此前最好的方法在测试集上为 52.6。在额外数据上训练后,在验证集的一个子集上,FCN-VGG16 的平均 IU 提升到 59.4,FCN-AlexNet 提升到 48.0。尽管分类精度相近,但我们实现的 GoogLeNet 没有达到 VGG16 的分割效果。

4.2 结合分类和定位

图 3. 我们的 DAG 网络学习把粗糙的高层信息与精细的低层信息相结合。池化层和预测层以网格显示,体现相对的空间粗糙程度,中间层则以竖线表示。第一行(FCN-32s):我们的单流网络,如 4.1 节所述,一步把步幅为 32 的预测上采样回像素。第二行(FCN-16s):在步幅 16 处把来自最后一层和 pool4 层的预测组合起来,让网络在保留高层语义信息的同时预测更精细的细节。第三行(FCN-8s):再加入来自 pool3 的步幅为 8 的预测,进一步提升精度。

 We define a new fully convolutional net (FCN) for segmentation that combines layers of the feature hierarchy and refines the spatial precision of the output. See Figure 3.

我们定义了一个新的用于分割的全卷积网络(FCN),它结合了特征层次结构的各层并细化了输出的空间精度。 参见图 3。 

While fully convolutionalized classifiers can be fine-tuned to segmentation as shown in 4.1, and even score highly on the standard metric, their output is dissatisfyingly coarse (see Figure 4). The 32 pixel stride at the final prediction layer limits the scale of detail in the upsampled output.

虽然全卷积分类器可以微调成分割(如 4.1 所示),甚至在标准指标上得分很高,但它们的输出却非常粗糙(见图 4)。 最终预测层的 32 像素步幅限制了上采样输出中的细节尺度。 

图 4.  通过融合来自不同步长的层的信息来细化全卷积网络,从而改善分割细节。 前三幅图像显示了 32、16 和 8 像素步幅网络的输出(见图 3)。 

We address this by adding skips [1] that combine the final prediction layer with lower layers with finer strides. This turns a line topology into a DAG, with edges that skip ahead from lower layers to higher ones (Figure 3). As they see fewer pixels, the finer scale predictions should need fewer layers, so it makes sense to make them from shallower net outputs. Combining fine layers and coarse layers lets the model make local predictions that respect global structure. By analogy to the jet of Koenderick and van Doorn [19], we call our nonlinear feature hierarchy the deep jet. 

我们通过添加跳跃连接来解决这个问题:把最终预测层与更低的、步幅更精细的层结合起来。这会把线形拓扑变成 DAG(有向无环图),其边从较低层跳跃连接到较高层(图 3)。由于更精细尺度的预测看到的像素更少,它们需要的层也应更少,因此从更浅的网络输出来做这些预测是合理的。把精细层和粗糙层结合起来,可以让模型做出尊重全局结构的局部预测。类比 Koenderick 和 van Doorn 的 jet,我们把这种非线性特征层次称为深度射流(deep jet)。

We first divide the output stride in half by predicting from a 16 pixel stride layer. We add a 1 × 1 convolution layer on top of pool4 to produce additional class predictions. We fuse this output with the predictions computed on top of conv7 (convolutionalized fc7) at stride 32 by adding a 2× upsampling layer and summing 6 both predictions (see Figure 3). We initialize the 2× upsampling to bilinear interpolation, but allow the parameters to be learned as described in Section 3.3. Finally, the stride 16 predictions are upsampled back to the image. We call this net FCN-16s. FCN-16s is learned end-to-end, initialized with the parameters of the last, coarser net, which we now call FCN-32s. The new parameters acting on pool4 are zeroinitialized so that the net starts with unmodified predictions. The learning rate is decreased by a factor of 100.

我们首先把输出步幅减半,改从步幅为 16 像素的层进行预测。我们在 pool4 之上添加一个 1×1 卷积层来产生额外的类别预测,然后把它与在 conv7(卷积化的 fc7)之上、步幅为 32 的预测融合:给后者加一个 2× 上采样层,再把两个预测逐元素相加(见图 3)。2× 上采样初始化为双线性插值,但允许按 3.3 节描述的方式学习参数。最后,把步幅为 16 的预测上采样回原图。我们把这个网络称为 FCN-16s。FCN-16s 是端到端学习的,用上一个较粗网络(现在称为 FCN-32s)的参数来初始化。作用于 pool4 的新参数被初始化为零,使网络从未修改的预测开始。学习率降低为原来的 1/100。

Learning this skip net improves performance on the validation set by 3.0 mean IU to 62.4. Figure 4 shows improvement in the fine structure of the output. We compared this fusion with learning only from the pool4 layer, which resulted in poor performance, and simply decreasing the learning rate without adding the skip, which resulted in an insignificant performance improvement without improving the quality of the output. 

学习这个跳跃网络把验证集上的性能提高了 3.0 个平均 IU 点,达到 62.4。图 4 显示了输出精细结构的改进。我们把这种融合与两种做法做了比较:只从 pool4 层学习,结果性能很差;以及只降低学习率而不加跳跃连接,结果性能提升不显著,输出质量也没有改善。

We continue in this fashion by fusing predictions from pool3 with a 2× upsampling of predictions fused from pool4 and conv7, building the net FCN-8s. We obtain a minor additional improvement to 62.7 mean IU, and find a slight improvement in the smoothness and detail of our output. At this point our fusion improvements have met diminishing returns, both with respect to the IU metric which emphasizes large-scale correctness, and also in terms of the improvement visible e.g. in Figure 4, so we do not continue fusing even lower layers. 

我们继续按这种方式,把 pool3 的预测与“pool4 和 conv7 融合后的预测再做 2× 上采样”的结果融合,构建出 FCN-8s 网络。我们得到了一点小的额外提升,平均 IU 达到 62.7,输出的平滑度和细节也略有改善。到这里,我们的融合改进已经收益递减:无论是在强调大尺度正确性的 IU 指标上,还是在图 4 可见的改进上都是如此,因此我们没有继续融合更低的层。

Refinement by other means Decreasing the stride of pooling layers is the most straightforward way to obtain finer predictions. However, doing so is problematic for our VGG16-based net. Setting the pool5 stride to 1 requires our convolutionalized fc6 to have kernel size 14 × 14 to maintain its receptive field size. In addition to their computational cost, we had difficulty learning such large filters. We attempted to re-architect the layers above pool5 with smaller filters, but did not achieve comparable performance; one possible explanation is that the ILSVRC initialization of the upper layers is important.

通过其他方式进行细化 减少池化层的步长是获得更精细预测的最直接方法。 然而,这样做对于我们基于 VGG16 的网络来说是有问题的。 将 pool5 步幅设置为1要求我们的卷积化 fc6 的内核大小为 14 × 14,以维持其感受野大小。 除了计算成本之外,我们还很难学习如此大的过滤器。 我们尝试使用更小的过滤器重新架构 pool5 之上的层,但没有达到可比较的性能; 一种可能的解释是上层的 ILSVRC 初始化很重要。 

Another way to obtain finer predictions is to use the shiftand-stitch trick described in Section 3.2. In limited experiments, we found the cost to improvement ratio from this method to be worse than layer fusion. 

获得更精细预测的另一种方法是使用 3.2 节描述的移位缝合技巧。在有限的实验中,我们发现这种方法的改进与成本之比不如层融合。

4.3 实验框架

Optimization We train by SGD with momentum. We use a minibatch size of 20 images and fixed learning rates of 10^-3, 10^-4, and 5^-5 for FCN-AlexNet, FCN-VGG16, and FCN-GoogLeNet, respectively, chosen by line search. We use momentum 0.9, weight decay of 5^-4 or 2^-4, and doubled learning rate for biases, although we found training to be sensitive to the learning rate alone. We zero-initialize the class scoring layer, as random initialization yielded neither better performance nor faster convergence. Dropout was included where used in the original classifier nets.

优化 我们通过SGD(随机梯度下降优化算法)进行有动量(momentum)的训练。 对于 FCN-AlexNet、FCN-VGG16 和 FCN-GoogLeNet,我们分别使用 20 张图像的小批量大小和 10 −3 、10 −4 和 5 −5 的固定学习率,通过线性搜索选择得到。 我们使用动量 0.9、权重衰减 5 −4 或 2 −4 以及双倍学习率来应对偏差,尽管我们发现训练仅对学习率敏感。 我们对分类层进行零值初始化,因为随机初始化既不会产生更好的性能,也不会产生更快的收敛。 在原始分类器网络中使用的地方包含了Dropout(沿用原来的Dropout概率)。
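
A hedged PyTorch sketch of this setup for an FCN-VGG16 reproduction (the values mirror the paragraph above: base learning rate 10^-4, momentum 0.9, weight decay 5^-4, doubled learning rate for biases; the helper name make_sgd_optimizer and the bias/weight parameter split are my own):

```python
import torch
from torch import nn, optim

def make_sgd_optimizer(model: nn.Module, base_lr: float = 1e-4,
                       momentum: float = 0.9, weight_decay: float = 5 ** -4) -> optim.SGD:
    """SGD with momentum where biases get twice the base learning rate.
    Leaving weight decay off the biases is a common convention, not something
    stated in the paper."""
    biases, weights = [], []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        (biases if name.endswith(".bias") else weights).append(param)
    return optim.SGD(
        [
            {"params": weights, "lr": base_lr, "weight_decay": weight_decay},
            {"params": biases, "lr": 2 * base_lr, "weight_decay": 0.0},
        ],
        lr=base_lr,
        momentum=momentum,
    )
```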

Fine-tuning We fine-tune all layers by backpropagation through the whole net. Fine-tuning the output classifier alone yields only 70% of the full fine-tuning performance, as compared in Table 2. Training from scratch is not feasible considering the time required to learn the base classification nets. (Note that the VGG net is trained in stages, while we initialize from the full 16-layer version.) Fine-tuning takes three days on a single GPU for the coarse FCN-32s version, and about one day each to upgrade to the FCN-16s and FCN-8s versions.

More Training Data The PASCAL VOC 2011 segmentation training set labels 1112 images. Hariharan et al. [14] collected labels for a larger set of 8498 PASCAL training images, which was used to train the previous state-of-the-art system, SDS [15]. This training data improves the FCN-VGG16 validation score by 3.4 points, to 59.4 mean IU.

Patch Sampling As explained in Section 3.4, our full-image training effectively batches each image into a regular grid of large, overlapping patches. By contrast, prior work randomly samples patches over a full dataset [27, 2, 7, 28, 9], potentially resulting in higher-variance batches that may accelerate convergence [22]. We study this tradeoff by spatially sampling the loss in the manner described earlier, making an independent choice to ignore each final-layer cell with some probability 1 − p. To avoid changing the effective batch size, we simultaneously increase the number of images per batch by a factor of 1/p. Note that due to the efficiency of convolution, this form of rejection sampling is still faster than patchwise training for large enough values of p (e.g., at least for p > 0.2 according to the numbers in Section 3.1). Figure 5 shows the effect of this form of sampling on convergence. We find that sampling does not have a significant effect on convergence rate compared to whole-image training, but takes significantly more time due to the larger number of images that need to be considered per batch. We therefore choose unsampled, whole-image training in our other experiments.
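
The loss sampling described above can be approximated with a random mask over the final-layer cells. The sketch below is my own PyTorch rendering (the names sampled_pixel_loss, scores, target, and keep_prob are assumptions), not the authors' Caffe implementation; to keep the effective batch size constant, the caller would also enlarge the batch by a factor of 1/keep_prob.

```python
import torch
import torch.nn.functional as F

def sampled_pixel_loss(scores: torch.Tensor, target: torch.Tensor,
                       keep_prob: float = 1.0) -> torch.Tensor:
    """Per-pixel cross-entropy in which each final-layer cell is independently
    dropped with probability 1 - keep_prob; keep_prob = 1.0 recovers full-image training.

    scores: (N, C, H, W) class scores, target: (N, H, W) integer labels.
    """
    per_pixel = F.cross_entropy(scores, target, reduction="none")   # (N, H, W)
    if keep_prob >= 1.0:
        return per_pixel.mean()
    keep = (torch.rand_like(per_pixel) < keep_prob).float()         # independent Bernoulli mask
    return (per_pixel * keep).sum() / keep.sum().clamp(min=1.0)     # average over the kept cells only
```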

Figure 5. Training on whole images is just as effective as sampling patches, but converges faster in wall-clock time by making more efficient use of the data. The left plot shows the effect of sampling on convergence rate for a fixed expected batch size, while the right plots the same effect against relative wall-clock time.

Class Balancing Fully convolutional training can balance classes by weighting or sampling the loss. Although our labels are mildly unbalanced (about 3/4 are background), we find class balancing unnecessary.
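
If class balancing were needed, the simplest loss-weighting variant looks like the following sketch; the weight vector below, which downweights the dominant background class, is a made-up example rather than a value from the paper:

```python
import torch
import torch.nn as nn

num_classes = 21
class_weights = torch.ones(num_classes)
class_weights[0] = 0.25                                  # hypothetical: downweight the background class
criterion = nn.CrossEntropyLoss(weight=class_weights)    # weighted per-pixel loss for (N, C, H, W) scores
```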

Dense Prediction The scores are upsampled to the input dimensions by deconvolution layers within the net. Final-layer deconvolutional filters are fixed to bilinear interpolation, while intermediate upsampling layers are initialized to bilinear upsampling and then learned.

Augmentation We tried augmenting the training data by randomly mirroring the images and "jittering" them by translating them up to 32 pixels (the coarsest scale of prediction, which presumably corresponds to the 32-pixel stride of FCN-32s) in each direction. This yielded no noticeable improvement.
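
For a reproduction, this augmentation could be sketched as follows (my own approximation in PyTorch: a random horizontal mirror plus a translation of up to 32 pixels realized by padding and cropping; the paper does not specify the exact jittering procedure, and the 255 "ignore" label for exposed borders is my assumption):

```python
import random
import torch
import torch.nn.functional as F

def mirror_and_jitter(image: torch.Tensor, label: torch.Tensor, max_shift: int = 32):
    """Randomly mirror a (C, H, W) image and its (H, W) label map, then translate
    both by up to max_shift pixels in each direction."""
    if random.random() < 0.5:                                # horizontal mirror
        image, label = image.flip(-1), label.flip(-1)
    dx = random.randint(-max_shift, max_shift)
    dy = random.randint(-max_shift, max_shift)
    pad = (max_shift, max_shift, max_shift, max_shift)       # (left, right, top, bottom)
    image = F.pad(image, pad, value=0.0)
    label = F.pad(label.float(), pad, value=255.0).long()    # 255 marks pixels shifted in from outside
    h, w = image.shape[-2] - 2 * max_shift, image.shape[-1] - 2 * max_shift
    top, left = max_shift + dy, max_shift + dx
    return image[..., top:top + h, left:left + w], label[top:top + h, left:left + w]
```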

Implementation All models are trained and tested with Caffe [18] on a single NVIDIA Tesla K40c. Our models and code are publicly available at http://fcn.berkeleyvision.org.

5. Results

Figure 6. Fully convolutional segmentation nets produce state-of-the-art performance on PASCAL. The left column shows the output of our most accurate net, FCN-8s. The second shows the segmentations produced by the previous state-of-the-art system of Hariharan et al. Note the fine structures recovered (first row), the ability to separate closely interacting objects (second row), and the robustness to occluders (third row). The fourth row shows a failure case: the net sees the lifejackets in a boat as people.

We test our FCN on semantic segmentation and scene parsing, exploring PASCAL VOC, NYUDv2, and SIFT Flow. Although these tasks have historically distinguished between objects and regions, we treat both uniformly as pixel prediction. We evaluate our FCN skip architecture on each of these datasets, and then extend it to multi-modal input for NYUDv2 and multi-task prediction for the semantic and geometric labels of SIFT Flow.

Metrics We report four metrics from common semantic segmentation and scene parsing evaluations that are variations on pixel accuracy and region intersection over union (IU). Let $n_{ij}$ be the number of pixels of class $i$ predicted to belong to class $j$, where there are $n_{cl}$ different classes, and let $t_i = \sum_{j} n_{ij}$ be the total number of pixels of class $i$. We compute:

  • pixel accuracy: $\sum_{i} n_{ii} / \sum_{i} t_{i}$
  • mean accuracy: $(1/n_{cl}) \sum_{i} n_{ii} / t_{i}$
  • mean IU: $(1/n_{cl}) \sum_{i} n_{ii} / \left( t_{i} + \sum_{j} n_{ji} - n_{ii} \right)$
  • frequency weighted IU: $\left( \sum_{k} t_{k} \right)^{-1} \sum_{i} t_{i} n_{ii} / \left( t_{i} + \sum_{j} n_{ji} - n_{ii} \right)$
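
All four metrics can be computed from a single confusion matrix. Below is a small NumPy sketch of how a reproduction might do this (the function name segmentation_metrics and the variable hist, an $n_{cl} \times n_{cl}$ confusion matrix with $n_{ij}$ in row $i$, column $j$, are my own):

```python
import numpy as np

def segmentation_metrics(hist: np.ndarray):
    """Compute the four metrics from an (n_cl, n_cl) confusion matrix where
    hist[i, j] counts pixels of class i predicted as class j."""
    t = hist.sum(axis=1).astype(np.float64)         # t_i: total pixels of class i
    n_ii = np.diag(hist).astype(np.float64)
    union = t + hist.sum(axis=0) - n_ii             # t_i + sum_j n_ji - n_ii
    pixel_acc = n_ii.sum() / t.sum()
    mean_acc = np.nanmean(n_ii / t)                 # classes absent from the ground truth are skipped
    iu = n_ii / union
    mean_iu = np.nanmean(iu)
    freq_weighted_iu = np.nansum(t * iu) / t.sum()  # (sum_k t_k)^-1 * sum_i t_i * IU_i
    return pixel_acc, mean_acc, mean_iu, freq_weighted_iu
```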

PASCAL VOC Table 3 gives the performance of our FCN-8s on the test sets of PASCAL VOC 2011 and 2012, and compares it to the previous state-of-the-art, SDS [15], and the well-known R-CNN [10]. We achieve the best results on mean IU by a relative margin of 20%. Inference time is reduced 114× (convnet only, ignoring proposals and refinement) or 286× (overall).

Table 3. Our fully convolutional net gives a 20% relative improvement over the state-of-the-art on the PASCAL VOC 2011 and 2012 test sets, and reduces inference time.

NYUDv2 [30] is an RGB-D dataset collected using the Microsoft Kinect. It has 1449 RGB-D images, with pixelwise labels that have been coalesced into a 40-class semantic segmentation task by Gupta et al. [12]. We report results on the standard split of 795 training images and 654 testing images. (Note: all model selection is performed on PASCAL 2011 val.) Table 4 gives the performance of our model in several variations. First we train our unmodified coarse model (FCN-32s) on RGB images. To add depth information, we train on a model upgraded to take four-channel RGB-D input (early fusion). This provides little benefit, perhaps due to the difficulty of propagating meaningful gradients all the way through the model. Following the success of Gupta et al. [13], we try the three-dimensional HHA encoding of depth, training nets on just this information, as well as a "late fusion" of RGB and HHA in which the predictions from both nets are summed at the final layer and the resulting two-stream net is learned end-to-end. Finally we upgrade this late fusion net to a 16-stride version.
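
The "late fusion" described above (summing the final-layer predictions of an RGB stream and an HHA stream, then training the two-stream net end-to-end) could be sketched like this in PyTorch. The class name LateFusionFCN and the assumption that each stream is itself an FCN returning (N, C, H, W) class scores are my own:

```python
import torch
import torch.nn as nn

class LateFusionFCN(nn.Module):
    """Two-stream net: one FCN for RGB, one for the HHA depth encoding; their
    final per-pixel class scores are summed and the whole net is trained end-to-end."""
    def __init__(self, rgb_fcn: nn.Module, hha_fcn: nn.Module):
        super().__init__()
        self.rgb_fcn = rgb_fcn   # assumed to map (N, 3, H, W) RGB input -> (N, C, H, W) scores
        self.hha_fcn = hha_fcn   # assumed to map (N, 3, H, W) HHA input -> (N, C, H, W) scores

    def forward(self, rgb: torch.Tensor, hha: torch.Tensor) -> torch.Tensor:
        return self.rgb_fcn(rgb) + self.hha_fcn(hha)   # sum predictions at the final layer
```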

Table 4. Results on NYUDv2. RGBD is early fusion of the RGB and depth channels at the input. HHA is the depth embedding as horizontal disparity, height above ground, and the angle of the local surface normal with the inferred gravity direction. RGB-HHA is the jointly trained late-fusion model that sums the RGB and HHA predictions.

SIFT Flow is a dataset of 2,688 images with pixel labels for 33 semantic categories ("bridge", "mountain", "sun"), as well as three geometric categories ("horizontal", "vertical", and "sky"). An FCN can naturally learn a joint representation that simultaneously predicts both types of labels. We learn a two-headed version of FCN-16s with semantic and geometric prediction layers and losses. The learned model performs as well on both tasks as two independently trained models, while learning and inference are essentially as fast as each independent model by itself. The results in Table 5, computed on the standard split into 2,488 training and 200 test images, show state-of-the-art performance on both tasks.
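
A minimal sketch of such a two-headed FCN in PyTorch, assuming a shared trunk that produces an upsampled feature map and two 1 × 1 score heads whose losses are simply added (the names backbone, sem_head, and geo_head, and the equal loss weighting, are my own assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoHeadedFCN(nn.Module):
    """Shared FCN trunk with separate per-pixel heads for semantic (33-way)
    and geometric (3-way) labels, trained with the sum of both losses."""
    def __init__(self, backbone: nn.Module, feat_channels: int,
                 n_semantic: int = 33, n_geometric: int = 3):
        super().__init__()
        self.backbone = backbone   # assumed: (N, 3, H, W) -> (N, feat_channels, H, W)
        self.sem_head = nn.Conv2d(feat_channels, n_semantic, 1)
        self.geo_head = nn.Conv2d(feat_channels, n_geometric, 1)

    def forward(self, x: torch.Tensor):
        feats = self.backbone(x)
        return self.sem_head(feats), self.geo_head(feats)

def joint_loss(sem_scores, geo_scores, sem_target, geo_target):
    # Equal weighting of the two per-pixel cross-entropy losses (an assumption).
    return F.cross_entropy(sem_scores, sem_target) + F.cross_entropy(geo_scores, geo_target)
```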

Table 5. Results on SIFT Flow with class segmentation (center) and geometric segmentation (right). Tighe is a non-parametric transfer method; Tighe 1 uses exemplar SVMs, while 2 is SVM + MRF. Farabet is a multi-scale convnet trained on class-balanced samples (1) or natural-frequency samples (2). Pinheiro is a multi-scale, recurrent convnet, denoted RCNN3 (O3). The metric for geometry is pixel accuracy.

6. Conclusion

Fully convolutional networks are a rich class of models, of which modern classification convnets are a special case. Recognizing this, extending these classification nets to segmentation, and improving the architecture with multi-resolution layer combinations dramatically improves the state-of-the-art, while simultaneously simplifying and speeding up learning and inference.

Acknowledgements This work was supported in part by DARPA's MSEE and SMISC programs, NSF awards IIS-1427425, IIS-1212798, IIS-1116411, the NSF GRFP, Toyota, and the Berkeley Vision and Learning Center. We gratefully acknowledge NVIDIA for their GPU donation. We thank Bharath Hariharan and Saurabh Gupta for their advice and dataset tools. We thank Sergio Guadarrama for reproducing GoogLeNet in Caffe. We thank Jitendra Malik for his helpful comments. Thanks to Wei Liu for pointing out an issue with our SIFT Flow mean IU computation and an error in our frequency weighted mean IU formula.
