ENet Semantic Segmentation Notes


GungnirsPledge


Category: Classic Real-Time Semantic Segmentation Frameworks

Article link: https://blog.csdn.net/GungnirsPledge/article/details/108199795


ENet notes

Why ENet was designed
ENet model structure
Tricks and innovations used in ENet
ENet results on datasets
Some impressions
Links to existing reimplementations

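My current project has real-time constraints, so I took a look at this architecture and wrote some notes. In other words, this is written for a beginner like me...

Paper: https://arxiv.org/pdf/1606.02147.pdf

What follows are my own notes from reading the paper.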

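Why ENet was designed

SegNet is too slow to meet the needs of real-time segmentation.
The paper briefly reviews earlier work. SegNet, for example, is an encoder-decoder architecture with many parameters and a large model, so it is slow.
Some even earlier segmentation algorithms used simple classifiers cascaded with a conditional random field (CRF) as post-processing, but they fail on classes that rarely appear.
There are also approaches that follow a CNN classifier with an RNN. They work, but this architecture is slow as well.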

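ENet model structure

The model is built from two small structures, the initial block and the bottleneck block, shown in the figure below:

[Figure: ENet initial block and bottleneck block]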

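initial block

The initial block takes the input image and splits it into two branches. One branch is a convolution with kernel_size=3, stride=2 and 13 filters; the other is a max-pooling layer with kernel_size=2, stride=2 applied directly to the input. The two results are concatenated along the channel axis, giving 13 + 3 = 16 channels.
Here is the Keras code, since it is the cleanest:

keras: https://github.com/BBuf/Keras-Semantic-Segmentation/blob/master/Models/ENet.py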

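from keras.layers import Conv2D, MaxPooling2D, concatenate

def initial_block(inp, nb_filter=13, nb_row=3, nb_col=3, strides=(2, 2)):
    # Conv branch: 13 filters, 3x3 kernel, stride 2
    conv = Conv2D(nb_filter, (nb_row, nb_col), padding='same', strides=strides)(inp)
    # Pooling branch: 2x2 max pool (the Keras default) on the raw input channels
    max_pool = MaxPooling2D()(inp)
    # Concatenate along the channel axis (channels-last): 13 + 3 = 16
    merged = concatenate([conv, max_pool], axis=3)
    return merged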

Note that the Keras reimplementation does not add a PReLU here while the PyTorch version does, and neither of them adds BN.
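To make the channel arithmetic concrete, here is a small shape-only sketch in plain Python. The helper name `initial_block_shape` and its arguments are my own illustration and do not come from any of the reimplementations. It assumes 'same' padding on the conv branch and an unpadded 2x2, stride-2 max pool, as described above.

```python
import math

def initial_block_shape(height, width, in_channels=3, conv_filters=13):
    """Return (height, width, channels) after the ENet initial block.

    Hypothetical helper for illustration only: it tracks shapes, not values.
    """
    # Conv branch: 3x3, stride 2, 'same' padding -> ceil(size / 2).
    conv_h, conv_w = math.ceil(height / 2), math.ceil(width / 2)
    # Pool branch: 2x2 max pool, stride 2, no padding -> floor(size / 2).
    pool_h, pool_w = height // 2, width // 2
    if (conv_h, conv_w) != (pool_h, pool_w):
        raise ValueError("odd input sizes make the two branches disagree")
    # Concatenation along the channel axis: conv maps + pooled input maps.
    return conv_h, conv_w, conv_filters + in_channels

print(initial_block_shape(512, 512))
```

With the defaults, the channel count is always 13 + 3 = 16 and each spatial dimension is halved.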

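bottleneck block

The bottleneck block follows the figure above. A few points are worth noting:

The first 1x1 convolution in the figure compresses the number of channels; the reimplementations all set it to output channels // 4. (This divisor is probably the most important parameter in the whole network: my guess is that a larger one is faster but less accurate, while a smaller one may be more accurate but slower.) The second 1x1 convolution restores the number of output channels. Unless the table says otherwise, the middle convolution is 3x3.
The bottleneck block comes in five variants:
1. Regular: the middle convolution has stride=1 and the max-pooling branch does nothing.
2. Downsampling: the max-pooling branch performs pooling as usual, and, per the paper, the first 1x1 projection is replaced by a 2x2 convolution with stride=2.
3. Asymmetric: the middle convolution, for example a 5x5, becomes a 1x5 convolution plus a 5x1 convolution.
4. Dilated: the middle convolution becomes a dilated convolution.
5. Upsampling: the convolution becomes a transposed convolution (deconvolution) and the downsampling becomes upsampling.
Each convolution is followed by BN and PReLU.
At the output, the convolutional branch ends with BN and spatial dropout, with p=0.01 before bottleneck2.0 and p=0.1 afterwards.
The two branches have the same shape, so they are merged by element-wise addition, not by channel concatenation.
After max pooling, the pooling branch has fewer channels than the convolutional branch, so it is zero-padded along the channel axis to make the two shapes match. The Keras code does it as follows, which I found a bit confusing myself: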

    # other branch
    if downsample:
        other = MaxPooling2D()(other)
        other = Permute((1, 3, 2))(other)
        pad_feature_maps = output - inp.get_shape().as_list()[3]
        tb_pad = (0, 0)
        lr_pad = (0, pad_feature_maps)
        other = ZeroPadding2D(padding=(tb_pad, lr_pad))(other)
        other = Permute((1, 3, 2))(other)
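The Permute / ZeroPadding2D / Permute dance above only exists because ZeroPadding2D pads spatial axes. What it achieves can be sketched in plain Python on nested lists. The helper names `pad_channels` and `add_branches` are hypothetical and the tiny H x W x C example values are made up for illustration.

```python
def pad_channels(feature_map, out_channels):
    """Zero-pad the channel axis of an H x W x C nested list up to out_channels."""
    padded = []
    for row in feature_map:
        padded_row = []
        for pixel in row:
            missing = out_channels - len(pixel)
            if missing < 0:
                raise ValueError("out_channels must be >= current channels")
            # Append zeros after the existing channels, like lr_pad = (0, n).
            padded_row.append(list(pixel) + [0.0] * missing)
        padded.append(padded_row)
    return padded

def add_branches(main, ext):
    """Element-wise sum of two H x W x C nested lists with identical shapes."""
    return [[[m + e for m, e in zip(main_px, ext_px)]
             for main_px, ext_px in zip(main_row, ext_row)]
            for main_row, ext_row in zip(main, ext)]

# 1 x 2 map: the pooled branch has 2 channels, the conv branch has 4.
main = [[[1.0, 2.0], [3.0, 4.0]]]
ext = [[[0.5, 0.5, 0.5, 0.5], [1.0, 1.0, 1.0, 1.0]]]
out = add_branches(pad_channels(main, 4), ext)
print(out)
```

Only the first channels of the sum contain pooled features; the padded channels carry the convolutional branch alone.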

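Overall network structure

Put simply, after the initial block the network is just bottleneck blocks of various kinds nested one after another, as in the figure below:
[Figure: overall ENet architecture]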

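Loss function and some hyperparameters

The loss uses a custom class weighting scheme. The paper contrasts it with plain inverse class probability weighting: the weight is w_class = 1 / ln(c + p_class) with c = 1.02, so the weights stay bounded as the class probability approaches 0.
This part of the paper does not seem to restate the dropout probability; the values are the ones given with the bottleneck description (p=0.01 before bottleneck2.0, p=0.1 afterwards).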
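As a rough sketch of the class weighting from the paper, w_class = 1 / ln(c + p_class) with c = 1.02, here is a plain Python version. The function name `enet_class_weights` and the pixel counts are made up for illustration; a real pipeline would count label pixels over the whole training set.

```python
import math

def enet_class_weights(pixel_counts, c=1.02):
    """Compute w_class = 1 / ln(c + p_class) from per-class pixel counts."""
    total = sum(pixel_counts)
    if total <= 0:
        raise ValueError("need at least one labelled pixel")
    weights = []
    for count in pixel_counts:
        p_class = count / total          # class frequency in [0, 1]
        weights.append(1.0 / math.log(c + p_class))
    return weights

# Hypothetical counts: one dominant class, one medium, one rare.
weights = enet_class_weights([900, 90, 10])
print([round(w, 2) for w in weights])
```

Rare classes get larger weights, but because of the constant c the weight can never exceed 1 / ln(1.02), roughly 50, even for a class with no pixels at all.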

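Tricks and innovations used in ENet

Loss of edge information during downsampling. There are generally two remedies: add the corresponding encoder feature maps as FCN does, or, as SegNet does, save the indices of the maxima chosen in the encoder's max-pooling layers and use them in the decoder to produce sparse upsampled feature maps. ENet uses the SegNet approach and also limits the amount of downsampling as much as possible.
Early downsampling. The reasoning is that most of the information in an image is sparse and redundant, so it can be compressed into a more efficient representation. The very first part of the network should also not take part in classification directly; it should act as an effective feature extractor that hands a preprocessed input to the rest of the network.
A smaller decoder for speed. Many networks have a symmetric encoder and decoder, whereas ENet has a large encoder and a small decoder. The encoder should do the actual classification work, while the decoder only fine-tunes the already classified result, so it should not carry as many weights.
PReLU instead of ReLU. As I understand the paper, the learned PReLU weights are larger at the start of the network, with a larger variance between them, and shrink as the network gets deeper. The figure in the paper shows values clearly above 0 at the beginning that quickly approach 0, meaning the units behave more and more like a plain ReLU. In the decoder the weights become noticeably larger again, which suggests the decoder is mostly fine-tuning the result. (Honestly, I am not sure how this experiment was done.) My takeaway is that PReLU helps in the early layers, and from a depth of about 15 layers onward ReLU is the better fit.
Downsampling with a strided convolution. Put simply, instead of a convolution followed by pooling, ENet uses a convolution with stride=2 directly. Because the downsampling acts on the input channels rather than on the larger number of channels produced by a convolution, there is less computation and the network gets faster. In addition, a standard ResNet downsampling block applies its first 1x1 projection with stride=(2, 2), which amounts to keeping one pixel out of four and discarding 75% of the input. The authors therefore increase the filter size to 2x2 so that every pixel is taken into account to some degree. Going from a 1x1 to a 2x2 kernel makes that layer 4 times more expensive, but it preserves accuracy.
Asymmetric and dilated convolutions. Asymmetric convolutions make the computation more efficient and reduce the number of parameters: replacing a 3x3 with a 1x3 and a 3x1 gives the same receptive field with fewer weights (3x3 = 9 > 3 + 3 = 6). ENet uses 5x1 and 1x5, whose 5x5 receptive field is larger than 3x3 while costing only 5 + 5 - 3x3 = 1 more weight.
Dilated convolutions need little explanation: they forcibly enlarge the receptive field. Do not stack them back to back, though; insert one every now and then, because placing them consecutively actually works worse.
Spatial dropout. A regularizer against overfitting, since the datasets still have too few images while the model is typically very expressive.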
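The weight counts and receptive fields quoted above are easy to check with a few lines of plain Python. The helper names are my own; the counts are per input/output channel pair with bias ignored, and the dilation rates in the last line are just example values.

```python
def conv_weights(kernel_h, kernel_w):
    """Weights of one kernel per input/output channel pair (bias ignored)."""
    return kernel_h * kernel_w

def asymmetric_weights(n):
    """An n x n conv factorised into an n x 1 conv plus a 1 x n conv."""
    return conv_weights(n, 1) + conv_weights(1, n)

def dilated_extent(kernel_size, dilation):
    """Side length of the input window spanned by a dilated kernel."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)

print(conv_weights(3, 3), asymmetric_weights(3))   # 9 vs 6
print(conv_weights(5, 5), asymmetric_weights(5))   # 25 vs 10
print([dilated_extent(3, d) for d in (1, 2, 4, 8, 16)])
```

The 5x1 plus 1x5 pair covers a 5x5 window with 10 weights, one more than a plain 3x3, and a 3x3 kernel with dilation d spans 2d + 1 pixels per side without adding any weights.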

ENet results on datasets

[Figure: ENet results on the datasets]

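Some impressions

Tricky parts of the code

I think the thing to study in the code is how the experts on GitHub handle the tricks used in this network; looking at those is enough.

What I find most counterintuitive is the padding operation, because it pads along the channel dimension. Going from the initial block to bottleneck1.0, for example, 64 - 16 = 48 channels of all-zero feature maps are padded on, and I do not know whether this affects the network. The same happens later on. In effect, the max-pooling branch's features are combined with only 1/4 of the convolutional branch's channels (16 out of 64). Here are other people's nice reimplementations: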
TF:

 # First get the difference in depth to pad, then pad with zeros only on the last dimension.
 inputs_shape = inputs.get_shape().as_list()
 depth_to_pad = abs(inputs_shape[3] - output_depth)
 paddings = tf.convert_to_tensor([[0, 0], [0, 0], [0, 0], [0, depth_to_pad]])
 net_main = tf.pad(net_main, paddings=paddings, name=scope + '_main_padding') # pad zeros on the last (channel) dimension

pytorch:

# Main branch channel padding
 n, ch_ext, h, w = ext.size()
 ch_main = main.size()[1]
 padding = torch.zeros(n, ch_ext - ch_main, h, w)

 # Before concatenating, check if main is on the CPU or GPU and
 # convert padding accordingly
 if main.is_cuda:
     padding = padding.cuda()

 # Concatenate
 main = torch.cat((main, padding), 1) # torch is straightforward: just concat a zero tensor

 # Add main and extension branches
 out = main + ext   # the paper uses element-wise addition here, which is the annoying part

 return self.out_activation(out), max_indices

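Spatial Dropout. In short, this kind of dropout no longer zeroes individual elements but whole regions at a time; it can also zero along a chosen dimension (in fact you can specify any shape). Here it zeroes along the channel dimension, meaning that the 2D feature maps of some channels become all zeros. That is why PyTorch uses Dropout2d and TF uses noise_shape. The code, for reference: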

    def _spatial_dropout(self, x, p, seed, scope, is_training=True):
        '''
        Performs a 2D spatial dropout that drops layers instead of individual elements in an input feature map.
        Note that p stands for the probability of dropping, but tf.nn.dropout uses probability of keeping.

        ------------------
        Technical Details
        ------------------
        The noise shape must be of shape [batch_size, 1, 1, num_channels], with the height and width set to 1, because
        it will represent either a 1 or 0 for each layer, and these 1 or 0 integers will be broadcasted to the entire
        dimensions of each layer they interact with such that they can decide whether each layer should be entirely
        'dropped'/set to zero or have its activations entirely kept.
        --------------------------

        INPUTS:
        - x(Tensor): a 4D Tensor of the input feature map.
        - p(float): a float representing the probability of dropping a layer
        - seed(int): an integer for random seeding the random_uniform distribution that runs under tf.nn.dropout
        - scope(str): the string name for naming the spatial_dropout
        - is_training(bool): to turn on dropout only when training. Optional.

        OUTPUTS:
        - output(Tensor): a 4D Tensor that is in exactly the same size as the input x,
                          with certain layers having their elements all set to 0 (i.e. dropped).
        '''
        if is_training:
            keep_prob = 1.0 - p
            input_shape = x.get_shape().as_list()
            noise_shape = tf.constant(value=[input_shape[0], 1, 1, input_shape[3]]) 
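            # The random draw is made per channel (input_shape[3]); the 1, 1 entries are shared. Anything seems to work as long as height and width are not filled in.
            # The mask is broadcast over the whole feature map: the keep/drop decision is random only along the channel axis, and a dropped channel is zero everywhere.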
            output = tf.nn.dropout(x, keep_prob, noise_shape, seed=seed, name=scope)

            return output

        return x

The TF noise_shape argument is not easy to understand, so here are some links for future reference:

https://blog.csdn.net/weixin_43896398/article/details/84762943
https://blog.csdn.net/qq_20412595/article/details/82824830
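To see what the noise_shape trick boils down to, here is a framework-free sketch of spatial dropout on one sample. It is my own illustration, not the TF or PyTorch implementation: the function name `spatial_dropout` is hypothetical, the layout is C x H x W nested lists, and kept channels are rescaled by 1 / (1 - p) as in the usual inverted dropout.

```python
import random

def spatial_dropout(feature_map, p, rng):
    """Drop whole channels of a C x H x W nested list with probability p."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    keep_scale = 1.0 / (1.0 - p)
    output = []
    for channel in feature_map:
        # One random draw per channel: this is the role of the 1, 1 entries
        # in noise_shape = [batch, 1, 1, channels].
        keep = rng.random() >= p
        factor = keep_scale if keep else 0.0
        # The same factor is applied to every pixel of the channel.
        output.append([[value * factor for value in row] for row in channel])
    return output

rng = random.Random(0)
x = [[[1.0, 2.0], [3.0, 4.0]] for _ in range(8)]   # 8 channels of 2 x 2
y = spatial_dropout(x, p=0.5, rng=rng)
for channel in y:
    print(channel)
```

Every printed channel is either entirely zero or the input channel scaled by 2; there is never a mix of kept and dropped pixels inside one channel.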

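Max pooling has to keep its indices. This is a neat trick from SegNet; it does not show up much in the PyTorch reimplementation and does not seem to matter much, but it can be done. The idea is simply to record the position of the element that survives in each pooling window, put the value back at that position when upsampling, and fill the other positions with 0. (Other fill values seem possible too, for example ordering the values by rank 1, 2, 3, 4, or drawing them from a Gaussian around the maximum?) Again, here is some code: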
https://github.com/sangeet259/tensorflow_unpooling

TF:

# maxpooling with index
net_main, pooling_indices = tf.nn.max_pool_with_argmax(inputs,
                                                      ksize=[1, 2, 2, 1],
                                                      strides=[1, 2, 2, 1],
                                                      padding='SAME',
                                                      name=scope + '_main_max_pool')

# uppooling with index
# https://github.com/sangeet259/tensorflow_unpooling/blob/master/unpool.py
def unpool_with_with_argmax(pooled, ind, ksize=[1, 2, 2, 1]):
    """
      To unpool the tensor after  max_pool_with_argmax.
      Arguments:
          pooled:    the max pooled output tensor
          ind:       argmax indices , the second output of max_pool_with_argmax
          ksize:     ksize should be the same as what you have used to pool
      Returns:
          unpooled:      the tensor after unpooling
      Some points to keep in mind ::
          1. In tensorflow the indices in argmax are flattened, so that a maximum value at position [b, y, x, c] becomes flattened index ((b * height + y) * width + x) * channels + c
          2. Due to point 1, use broadcasting to appropriately place the values at their right locations ! 
    """
    # Get the shape of the tensor in the form of a list
    input_shape = pooled.get_shape().as_list()
    # Determine the output shape
    output_shape = (input_shape[0], input_shape[1] * ksize[1], input_shape[2] * ksize[2], input_shape[3])
    # Reshape into one giant tensor for better workability
    pooled_ = tf.reshape(pooled, [input_shape[0] * input_shape[1] * input_shape[2] * input_shape[3]])
    # The indices in argmax are flattened, so that a maximum value at position [b, y, x, c] becomes flattened index ((b * height + y) * width + x) * channels + c
    # Create a single unit extended cuboid of length batch_size populating it with continuous natural numbers from zero to batch_size
    batch_range = tf.reshape(tf.range(output_shape[0], dtype=ind.dtype), shape=[input_shape[0], 1, 1, 1])
    b = tf.ones_like(ind) * batch_range
    b_ = tf.reshape(b, [input_shape[0] * input_shape[1] * input_shape[2] * input_shape[3], 1])
    ind_ = tf.reshape(ind, [input_shape[0] * input_shape[1] * input_shape[2] * input_shape[3], 1])
    ind_ = tf.concat([b_, ind_],1)
    ref = tf.Variable(tf.zeros([output_shape[0], output_shape[1] * output_shape[2] * output_shape[3]]))
    # Update the sparse matrix with the pooled values , it is a batch wise operation
    unpooled_ = tf.scatter_nd_update(ref, ind_, pooled_)
    # Reshape the vector to get the final result 
    unpooled = tf.reshape(unpooled_, [output_shape[0], output_shape[1], output_shape[2], output_shape[3]])
    return unpooled


# uppooling with index version2
    def _unpool(self, updates, mask, k_size=[1, 2, 2, 1], output_shape=None, scope=''):
        '''
        Unpooling function based on the implementation by Panaetius at https://github.com/tensorflow/tensorflow/issues/2169

        INPUTS:
        - inputs(Tensor): a 4D tensor of shape [batch_size, height, width, num_channels] that represents the input block to be upsampled
        - mask(Tensor): a 4D tensor that represents the argmax values/pooling indices of the previously max-pooled layer
        - k_size(list): a list of values representing the dimensions of the unpooling filter.
        - output_shape(list): a list of values to indicate what the final output shape should be after unpooling
        - scope(str): the string name to name your scope

        OUTPUTS:
        - ret(Tensor): the returned 4D tensor that has the shape of output_shape.

        '''
        with tf.variable_scope(scope):
            mask = tf.cast(mask, tf.int32)
            input_shape = tf.shape(updates, out_type=tf.int32)
            #  calculation new shape
            if output_shape is None:
                output_shape = (input_shape[0], input_shape[1] * k_size[1], input_shape[2] * k_size[2], input_shape[3])

            # calculation indices for batch, height, width and feature maps
            one_like_mask = tf.ones_like(mask, dtype=tf.int32)
            batch_shape = tf.concat([[input_shape[0]], [1], [1], [1]], 0)
            batch_range = tf.reshape(tf.range(output_shape[0], dtype=tf.int32), shape=batch_shape)
            b = one_like_mask * batch_range
            y = mask // (output_shape[2] * output_shape[3])
            x = (mask // output_shape[3]) % output_shape[
                2]  # mask % (output_shape[2] * output_shape[3]) // output_shape[3]
            feature_range = tf.range(output_shape[3], dtype=tf.int32)
            f = one_like_mask * feature_range

            # transpose indices & reshape update values to one dimension
            updates_size = tf.size(updates)
            indices = tf.transpose(tf.reshape(tf.stack([b, y, x, f]), [4, updates_size]))
            values = tf.reshape(updates, [updates_size])
            ret = tf.scatter_nd(indices, values, output_shape)
            return ret

pytorch:

    self.pool = nn.MaxPool2d(2, stride=2, return_indices=True)
    self.unpool = nn.MaxUnpool2d(2, stride=2)
    x, pool_idx = self.pool(x)
    x = self.unpool(x, pool_idx)
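The pool-with-indices / unpool round trip is easier to follow on a tiny example. Below is a plain Python sketch on a single-channel H x W nested list. The function names are hypothetical, the indices are stored as (row, col) pairs rather than the flattened indices TF uses, and H and W are assumed to be even.

```python
def max_pool_2x2_with_indices(image):
    """2x2, stride-2 max pool that also records where each maximum came from."""
    pooled, indices = [], []
    for r in range(0, len(image), 2):
        pooled_row, index_row = [], []
        for c in range(0, len(image[0]), 2):
            # Collect the four (value, position) pairs of the current window.
            window = [(image[r + dr][c + dc], (r + dr, c + dc))
                      for dr in (0, 1) for dc in (0, 1)]
            value, position = max(window, key=lambda item: item[0])
            pooled_row.append(value)
            index_row.append(position)
        pooled.append(pooled_row)
        indices.append(index_row)
    return pooled, indices

def max_unpool_2x2(pooled, indices, height, width):
    """Scatter pooled values back to their recorded positions; the rest stays 0."""
    output = [[0.0] * width for _ in range(height)]
    for pooled_row, index_row in zip(pooled, indices):
        for value, (r, c) in zip(pooled_row, index_row):
            output[r][c] = value
    return output

image = [[1.0, 5.0, 2.0, 0.0],
         [3.0, 4.0, 8.0, 1.0],
         [0.0, 2.0, 1.0, 1.0],
         [7.0, 6.0, 0.0, 9.0]]
pooled, indices = max_pool_2x2_with_indices(image)
restored = max_unpool_2x2(pooled, indices, 4, 4)
print(pooled)
print(restored)
```

The restored map is sparse: each maximum sits exactly where it was in the input and everything else is 0, which is the sparse upsampled feature map that the decoder then refines.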

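Asymmetric convolutions. Things are easier nowadays: you no longer have to compute the padding yourself, since Conv2D in both PyTorch and TF supports 'same' padding. In other words, you no longer need to worry about how many rings to pad, only about which value to pad with, which is usually 0.
Dilated convolutions also have a ready-made interface, so there is nothing more to say.
PReLU is available directly in PyTorch. With TensorFlow you may be stuck on an ancient version, so here is the code: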

    @slim.add_arg_scope
    def _prelu(self, x, scope, decoder=False):
        '''
        Performs the parametric relu operation. This implementation is based on:
        https://stackoverflow.com/questions/39975676/how-to-implement-prelu-activation-in-tensorflow
        For the decoder portion, prelu becomes just a normal relu
        INPUTS:
        - x(Tensor): a 4D Tensor that undergoes prelu
        - scope(str): the string to name your prelu operation's alpha variable.
        - decoder(bool): if True, prelu becomes a normal relu.
        OUTPUTS:
        - pos + neg / x (Tensor): gives prelu output only during training; otherwise, just return x.
        '''
        # If decoder, then perform relu and just return the output
        if decoder:
            return tf.nn.relu(x, name=scope)

        alpha = tf.get_variable(scope + 'alpha', x.get_shape()[-1],
                                initializer=tf.constant_initializer(0.0),
                                dtype=tf.float32)
        pos = tf.nn.relu(x)
        neg = alpha * (x - abs(x)) * 0.5
        return pos + neg
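The `pos + neg` expression can look odd at first, so here is a quick scalar check in plain Python that it matches the textbook PReLU definition. Both function names are my own and the alpha value is arbitrary.

```python
def prelu_reference(x, alpha):
    """Textbook PReLU: x for positive inputs, alpha * x otherwise."""
    return x if x > 0 else alpha * x

def prelu_relu_form(x, alpha):
    """The formulation from the snippet above: relu(x) + alpha * (x - |x|) / 2."""
    pos = max(x, 0.0)
    # For x < 0, x - |x| equals 2x, so this term is alpha * x; for x >= 0 it is 0.
    neg = alpha * (x - abs(x)) * 0.5
    return pos + neg

for value in (-2.0, -0.5, 0.0, 1.5):
    print(value, prelu_reference(value, 0.25), prelu_relu_form(value, 0.25))
```

With alpha initialised to 0.0, as in the snippet, the unit starts out as a plain ReLU and only departs from it once alpha is learned.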

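Modifying it for fun

I work on the engineering side, but every now and then I get the itch to tweak the architecture and optimize it myself for fun, mainly to test the pad operation.

Actual deployment on the TX2

I modified the source code for deployment to make sure all ops are supported; I will write this up later.

My own experience using it

On the TX2 I deployed with tf-trt (arm_tensorflow_1.10.0), converting the Float32 model to pb and SavedModel files. With 256x512 images, forward inference runs at about 10-12 fps. To get some ops supported I had to change parts of the network, which affects accuracy a little, but it is still acceptable. Maybe I just did not deploy it well (ToT). I will talk about what I see as the network's strengths and weaknesses later.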