Paper Reading Notes (10): CondenseNet: An Efficient DenseNet using Learned Group Convolutions

Key points ahead! The core sentences of the CondenseNet paper are excerpted below so the paper's main thread can be skimmed quickly; corrections to anything rendered inaccurately are welcome.


It combines dense connectivity between layers with a mechanism to remove unused connections. The dense connectivity facilitates feature re-use in the network, whereas the learned group convolutions remove connections between layers for which this feature re-use is superfluous. At test time, our model can be implemented using standard group convolutions, allowing for efficient computation in practice.

The high accuracy of convolutional networks (CNNs) in visual recognition tasks, such as image classification [11, 12], has fueled the desire to deploy these networks on platforms with limited computational resources, e.g., in robotics, self-driving cars, and on mobile devices. Unfortunately, the most accurate deep CNNs, such as the winners of the ImageNet [5] and COCO [31] challenges, were designed for scenarios in which computational resources are abundant. As a result, these models cannot be used to perform real-time inference on low-compute devices.

This problem has fueled the development of computationally efficient CNNs that, e.g., remove redundant connections [8, 10, 27, 29, 32], use low-precision or quantized weights [3, 21, 36], or use more efficient network architectures [4, 12, 16, 19, 22, 47]. These efforts have led to substantial improvements: on ImageNet, to achieve accuracy comparable to VGG [38], residual networks (ResNets; [12]) reduce the amount of computation by a factor of 5×, DenseNets [19] by a factor of 10×, and MobileNets [16] and ShuffleNets [47] by a factor of 25×.

A typical set-up for deep learning on mobile devices is one where CNNs are trained on multi-GPU machines but deployed on devices with limited compute. Therefore, a good network architecture allows for fast parallelization during training, but is compact at test time.

Recent work [3, 20] shows that there is a lot of redundancy in CNNs. The layer-by-layer connectivity pattern forces networks to replicate features from earlier layers throughout the network. The DenseNet architecture [19] alleviates the need for feature replication by directly connecting each layer with all layers before it, which induces feature re-use. Although more efficient, we hypothesize that dense connectivity introduces redundancies when early features are not needed in later layers. We propose a novel method to prune such redundant connections between layers and then introduce a more efficient architecture. In contrast to prior pruning methods, our approach learns a sparsified network automatically during the training process, and produces a regular connectivity pattern that can be implemented efficiently using group convolutions. Specifically, we split the filters of a layer into multiple groups, and gradually remove the connections to less important features per group during training. Importantly, the groups of incoming features are not predefined, but learned. The resulting model, named CondenseNet, can be trained efficiently on GPUs, and has high inference speed on mobile devices.

Weight pruning and quantization. CondenseNets are closely related to approaches that improve the inference efficiency of (convolutional) networks via weight pruning [10, 14, 27, 29, 32] and/or weight quantization [21, 36]. These approaches are effective because deep networks often have a substantial number of redundant weights that can be pruned or quantized without sacrificing (and sometimes even improving) accuracy. For convolutional networks, different pruning techniques may lead to different levels of granularity [34]. Fine-grained pruning, e.g., independent weight pruning [9, 27], generally achieves a high degree of sparsity. However, it requires storing a large number of indices, and relies on special hardware accelerators to be fast in practice. In contrast, coarse-grained pruning methods such as filter-level pruning [14, 29, 32] achieve a lower degree of sparsity, but the resulting networks are much more regular, which facilitates efficient implementations.
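
To make the granularity difference concrete, here is a minimal PyTorch sketch (my own illustration, not code from any of the cited methods): fine-grained pruning zeroes individual low-magnitude weights, while filter-level pruning zeroes whole output filters. The tensor shape, sparsity level, and function names are arbitrary.

```python
import torch

def fine_grained_prune(weight: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Zero the individual weights with the smallest magnitude (irregular sparsity)."""
    k = int(sparsity * weight.numel())
    threshold = weight.abs().flatten().kthvalue(k).values
    return weight * (weight.abs() > threshold)

def filter_level_prune(weight: torch.Tensor, num_filters_to_drop: int) -> torch.Tensor:
    """Zero whole output filters with the smallest L1 norm (regular, coarse sparsity)."""
    norms = weight.abs().sum(dim=(1, 2, 3))            # one L1 norm per output filter
    drop = norms.argsort()[:num_filters_to_drop]       # weakest filters
    pruned = weight.clone()
    pruned[drop] = 0.0
    return pruned

w = torch.randn(32, 16, 3, 3)                          # (out_channels, in_channels, kH, kW)
print(fine_grained_prune(w, 0.5).eq(0).float().mean()) # ~50% of individual weights zeroed
print(filter_level_prune(w, 8).eq(0).float().mean())   # 8 of 32 filters zeroed entirely
```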

CondenseNets also rely on a pruning technique, but differ from prior approaches in two main ways: First, the weight pruning is initiated in the early stages of training, which is substantially more effective and efficient than using L1 regularization throughout. Second, CondenseNets have a higher degree of sparsity than filter-level pruning, yet generate highly efficient group convolutions, reaching a sweet spot between sparsity and regularity.
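
The paper does not give code for its sparsity-inducing regularizer; the sketch below is a hedged approximation of a group-lasso-style penalty on a 1×1 convolution, assuming the kernel has shape (out_channels, in_channels, 1, 1) and the filters are split into `num_groups` groups along the output dimension. The layer sizes and the 1e-5 weight are placeholder choices.

```python
import torch
import torch.nn as nn

def group_lasso_penalty(weight: torch.Tensor, num_groups: int) -> torch.Tensor:
    """Sum of L2 norms over (group, input-feature) columns of a 1x1 conv kernel.

    Pushing whole columns toward zero lets later pruning remove regular blocks of
    connections rather than scattered individual weights.
    """
    out_ch, in_ch = weight.shape[:2]
    w = weight.view(num_groups, out_ch // num_groups, in_ch)
    column_norms = w.pow(2).sum(dim=1).sqrt()   # one norm per (group, input feature)
    return column_norms.sum()

conv1x1 = nn.Conv2d(in_channels=48, out_channels=64, kernel_size=1, bias=False)
x = torch.randn(2, 48, 8, 8)
task_loss = conv1x1(x).pow(2).mean()            # stand-in for the real training loss
loss = task_loss + 1e-5 * group_lasso_penalty(conv1x1.weight, num_groups=4)
loss.backward()
```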

Efficient network architectures. A range of recent studies has explored efficient convolutional networks that can be trained end-to-end [16, 19, 22, 46, 47, 48, 49]. Three prominent examples of networks that are sufficiently efficient to be deployed on mobile devices are MobileNet [16], ShuffleNet [47], and Neural Architecture Search (NAS) networks [49]. All these networks use depth-wise separable convolutions, which greatly reduce computational requirements without significantly reducing accuracy. A practical downside of these networks is that depth-wise separable convolutions are not (yet) efficiently implemented in most deep learning platforms. By contrast, CondenseNet uses the well-supported group convolution operation [25], leading to better computational efficiency in practice.
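
As a rough illustration of why depth-wise separable convolutions are so cheap, and how they compare with an ordinary group convolution, the following sketch counts parameters for both options; the channel counts and group number are arbitrary.

```python
import torch.nn as nn

def count_params(m: nn.Module) -> int:
    return sum(p.numel() for p in m.parameters())

in_ch, out_ch, groups = 128, 128, 4

depthwise_separable = nn.Sequential(
    nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=1, groups=in_ch, bias=False),  # depthwise
    nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False),                          # pointwise
)
group_conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1, groups=groups, bias=False)

print(count_params(depthwise_separable))  # 128*9 + 128*128 = 17536
print(count_params(group_conv))           # 128*(128/4)*9  = 36864
```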

Architecture-agnostic efficient inference has also been explored by several prior studies. For example, knowledge distillation [2, 15] trains small "student" networks to reproduce the output of large "teacher" networks in order to reduce test-time costs. Dynamic inference methods [1, 6, 7, 17] adapt the inference to each specific test example, skipping units or even entire layers to reduce computation. We do not explore such approaches here, but believe they can be used in conjunction with CondenseNets.
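
For reference, a compact sketch of the knowledge-distillation objective mentioned above (not part of CondenseNet): the student is trained to match softened teacher outputs in addition to the usual cross-entropy loss. The temperature and weighting below are placeholder values.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    # Soft targets: KL divergence between temperature-softened distributions.
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                    F.softmax(teacher_logits / T, dim=1),
                    reduction="batchmean") * (T * T)
    # Hard targets: standard cross-entropy on the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

s, t = torch.randn(8, 10), torch.randn(8, 10)
y = torch.randint(0, 10, (8,))
print(distillation_loss(s, t, y))
```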

Densely connected networks (DenseNets; [19]) consist of multiple dense blocks, each of which consists of multiple layers. Each layer produces k features, where k is referred to as the growth rate of the network. The distinguishing property of DenseNets is that the input of each layer is a concatenation of all feature maps generated by all preceding layers within the same dense block. The l-th layer therefore receives lk input channels. Each layer performs a sequence of consecutive transformations, as shown in the left part of Figure 1. The first transformation (BN-ReLU, blue) is a composition of batch normalization [23] and rectified linear units [35]. The first convolutional layer in the sequence reduces the number of channels from lk to 4k, using computationally efficient 1×1 filters. The output is followed by another BN-ReLU transformation and is then reduced to the final k output features through a 3×3 convolution.
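
A minimal PyTorch sketch of this layer, assuming the dense block's input itself has k channels; the class name and the tiny four-layer block are only for illustration.

```python
import torch
import torch.nn as nn

class DenseLayer(nn.Module):
    def __init__(self, in_channels: int, growth_rate: int):
        super().__init__()
        k = growth_rate
        self.layer = nn.Sequential(
            nn.BatchNorm2d(in_channels), nn.ReLU(inplace=True),
            nn.Conv2d(in_channels, 4 * k, kernel_size=1, bias=False),   # reduce to 4k channels
            nn.BatchNorm2d(4 * k), nn.ReLU(inplace=True),
            nn.Conv2d(4 * k, k, kernel_size=3, padding=1, bias=False),  # k new feature maps
        )

    def forward(self, x):
        # Dense connectivity: the input is passed through unchanged and the k new
        # feature maps are appended along the channel dimension.
        return torch.cat([x, self.layer(x)], dim=1)

k = 12
block = nn.Sequential(*[DenseLayer(k * (l + 1), k) for l in range(4)])  # a tiny dense block
out = block(torch.randn(2, k, 32, 32))
print(out.shape)   # torch.Size([2, 60, 32, 32]) -> 12 input + 4*12 new channels
```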

Group convolution is a special case of a sparsely connected convolution, as illustrated in Figure 2. It was first adopted in the AlexNet architecture [25], and has more recently been popularized by its successful application in ResNeXt [43]. Standard convolutional layers (left illustration in Figure 2) generate O output features by applying a convolutional filter (one per output) over all R input features. The resulting cost of R×O is expensive if the number of input features is large. Group convolution (right illustration) reduces this cost by partitioning the input features into G mutually exclusive groups, each producing its own outputs, reducing the computational cost by a factor of G to (R×O)/G.
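
The cost reduction can be checked directly with a 1×1 convolution in PyTorch; the channel counts below are arbitrary.

```python
import torch.nn as nn

R, O, G = 96, 192, 4
standard = nn.Conv2d(R, O, kernel_size=1, bias=False)            # every output sees all R inputs
grouped = nn.Conv2d(R, O, kernel_size=1, groups=G, bias=False)   # each output sees R/G inputs

print(standard.weight.numel())   # R*O   = 18432
print(grouped.weight.numel())    # R*O/G = 4608
```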

Group convolution works well with many deep neural network architectures [43, 46, 47] that are connected in a layer-by-layer fashion. For dense architectures, group convolution can be used in the 3×3 convolutional layer (see Figure 1, left). However, preliminary experiments show that a naïve adaptation of group convolutions in the 1×1 layer leads to drastic reductions in accuracy. We surmise that this is caused by the fact that the inputs to the 1×1 layer are concatenations of feature maps generated by preceding layers. Therefore, they differ in two ways from typical inputs to convolutional layers: 1. they have an intrinsic order; and 2. they are far more diverse. The hard assignment of these features to disjoint groups hinders effective re-use of features in the network. Experiments in which we randomly permute feature maps in each layer before performing the group convolution show that this reduces the negative impact on accuracy; but even with the random permutation, group convolution in the 1×1 convolutional layer makes DenseNets less accurate than, for example, smaller DenseNets with equivalent computational cost.
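
A hedged sketch of the permutation experiment, as I read it: a fixed random permutation is applied to the channels before the 1×1 group convolution, so each group no longer receives a contiguous, ordered slice of the concatenated features. This is not code from the paper; the class name and sizes are my own.

```python
import torch
import torch.nn as nn

class Permuted1x1GroupConv(nn.Module):
    def __init__(self, in_channels: int, out_channels: int, groups: int):
        super().__init__()
        # Fixed random channel permutation, chosen once at construction time.
        self.register_buffer("perm", torch.randperm(in_channels))
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size=1,
                              groups=groups, bias=False)

    def forward(self, x):
        return self.conv(x[:, self.perm])   # shuffle channels, then group convolution

layer = Permuted1x1GroupConv(in_channels=48, out_channels=96, groups=4)
print(layer(torch.randn(2, 48, 16, 16)).shape)   # torch.Size([2, 96, 16, 16])
```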

Huang et al. [19] have shown that it is important to make early features available as inputs to later layers. Although not all prior features are needed at every subsequent layer, it is hard to predict which features should be utilized at what point. To address this problem, we develop an approach that learns the input feature groupings automatically during training. Learning the group structure allows each filter group to select its own set of most relevant inputs. Further, we allow multiple groups to share input features and also allow features to be ignored by all groups. (Even if an input feature is ignored by all groups in a specific layer, it can still be utilized by some groups at different layers.)
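
At test time this learned selection can be realized as an index (gather) operation followed by one standard group convolution. The sketch below illustrates that conversion; the indices are random placeholders standing in for the learned per-group selections, and the class name is my own.

```python
import torch
import torch.nn as nn

class LearnedGroupConv1x1(nn.Module):
    def __init__(self, selected_per_group: list, out_channels: int):
        super().__init__()
        groups = len(selected_per_group)
        # Concatenated channel indices, one block of selected inputs per group;
        # groups may repeat indices (shared inputs) or omit some channels entirely.
        self.register_buffer("index", torch.cat(selected_per_group))
        self.conv = nn.Conv2d(self.index.numel(), out_channels,
                              kernel_size=1, groups=groups, bias=False)

    def forward(self, x):
        return self.conv(x[:, self.index])   # gather learned inputs, then group conv

R, O, G, per_group = 48, 96, 4, 12
selection = [torch.randint(0, R, (per_group,)) for _ in range(G)]   # placeholder indices
layer = LearnedGroupConv1x1(selection, out_channels=O)
print(layer(torch.randn(2, R, 16, 16)).shape)   # torch.Size([2, 96, 16, 16])
```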

We learn group convolutions through a multi-stage process, illustrated in Figures 3 and 4. The first half of the training iterations comprises condensing stages. Here, we repeatedly train the network with sparsity-inducing regularization for a fixed number of iterations and subsequently prune away unimportant filters with low-magnitude weights. The second half of the training consists of the optimization stage, in which we learn the filters after the groupings are fixed. When performing the pruning, we ensure that filters from the same group share the same sparsity pattern. As a result, the sparsified layer can be implemented using a standard group convolution once training is completed (testing stage).
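
A hedged sketch of a single condensing step consistent with this description: within each filter group of the 1×1 convolution, the incoming feature columns with the smallest aggregate L1 magnitude are zeroed, and all filters of the group share the resulting sparsity pattern. The fixed `fraction` argument is a simplification of the paper's staged schedule.

```python
import torch

def condense_step(weight: torch.Tensor, num_groups: int, fraction: float) -> torch.Tensor:
    """weight: (out_channels, in_channels, 1, 1) kernel of the 1x1 convolution."""
    out_ch, in_ch = weight.shape[:2]
    w = weight.view(num_groups, out_ch // num_groups, in_ch)
    importance = w.abs().sum(dim=1)                  # L1 norm per (group, input feature)
    num_drop = int(fraction * in_ch)
    drop = importance.argsort(dim=1)[:, :num_drop]   # least important inputs per group
    mask = torch.ones_like(importance)
    mask.scatter_(1, drop, 0.0)                      # shared pattern within each group
    return (w * mask.unsqueeze(1)).view_as(weight)

w = torch.randn(64, 48, 1, 1)
pruned = condense_step(w, num_groups=4, fraction=0.25)
print(pruned.eq(0).float().mean())                   # ~25% of connections removed
```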

In this paper, we introduced CondenseNet: an efficient convolutional network architecture that encourages feature re-use via dense connectivity and prunes filters associated with superfluous feature re-use via learned group convolutions. To make inference efficient, the pruned network can be converted into a network with regular group convolutions, which are implemented efficiently in most deep learning libraries. Our pruning method is simple to implement, and adds only limited computational costs to the training process.
