
  1. Residual connections
  2. Weighted residual connections
  3. Multi-input weighted residual connections
  4. Cross stage partial connections (CSP)


II. Weighted Residuals for Very Deep Networks


Deep residual networks have recently shown appealing performance on many challenging computer vision tasks. However, the original residual structure still has some defects that make it difficult to converge on very deep networks. In this paper, we introduce a weighted residual network to address the incompatibility between ReLU and element-wise addition and the deep network initialization problem. The weighted residual network is able to learn to combine residuals from different layers effectively and efficiently. The proposed models enjoy a consistent improvement in accuracy and convergence with increasing depths from 100+ layers to 1000+ layers. Besides, the weighted residual networks add little computation and GPU memory burden compared with the original residual networks. The networks are optimized by projected stochastic gradient descent. Experiments on CIFAR-10 show that our algorithm converges faster than the original residual networks and reaches a high accuracy of 95.3% with a 1192-layer model.

1  Introduction


The state-of-the-art models for image classification are built on inception and residual structures [1,2,3]. Many works devoted to residual networks have emerged recently [4,5,6,7]. Very deep convolutional networks [8,9], especially those with residual units, have shown compelling accuracy and good convergence behavior on many challenging computer vision tasks [3,10,11]. Since the vanishing gradient problem is well handled by batch normalization [12] and highway signal propagation [13], networks with 100+ layers are being developed and trained, and even 1000+ layer structures still yield meaningful results when combined with adequate dropout, as shown in [6]. He et al. [4] also introduced the pre-activation structure to allow the highway signal to be propagated directly through very deep networks. However, they appear to rely on features with a larger dimension (4×) and adopted multiple 1 × 1 convolutional layers in place of 3 × 3 convolutional layers to achieve convergence with 1000+ layers.



A typical convolutional unit is composed of one convolutional layer, one batch normalization layer and one ReLU layer, performed sequentially [12]. For a residual unit, a central question is how to combine the residual signal and the highway signal; element-wise addition was proposed in [3]. A natural idea is to perform the addition after ReLU activation. However, this leads to a non-negative output from the residual branch, which limits the representational ability of the residual unit: it can only enhance the highway signal. He et al. first proposed to perform the addition between batch normalization and ReLU. In [4], they further proposed to reverse the order of the three layers, performing batch normalization and ReLU before the convolutional layers. The underlying problem is that ReLU can only generate non-negative values, which is incompatible with element-wise addition in the residual unit.



As training deep networks is a non-convex optimization problem, an appropriate initialization is important both for faster convergence and for reaching a good local minimum. The “xavier” [14] and “msra” [15] initializers are popular for deep networks. However, for networks with depths beyond 100 layers, neither “xavier” nor “msra” works well. The paper [3] proposed to “warm up” the network with a small learning rate and then restore the learning rate to its normal value. However, this hand-crafted strategy is not that useful for very deep networks, where even a very low learning rate (0.00001) is still not enough to guarantee convergence, and restoring the learning rate may undo the initial convergence [2].





Generally speaking, there are two defects embedded in the training of the original residual networks:

– Incompatibility of ReLU and element-wise addition.

– Difficulty for networks to converge with depths beyond 1000 layers using the “msra” initializer.



A third point is that a better way to combine the residuals from different layers is necessary to train very deep networks. In very deep networks, not all layers are equally important, as 1000-layer networks often perform not much better than 100-layer networks. In fact, many layers carry redundant information, and very deep networks tend to over-fit on some tasks.




In this paper, we introduce the weighted residual networks, which learn to combine residuals from different layers effectively and efficiently. All the residual weights are initialized to zero and optimized with a very small learning rate (0.001), which allows the residual signals to be added to the highway signal gradually. With a group of gradually growing residual weights, the 1192-layer residual networks converge even faster than the 100-layer networks. Finally, the distribution of the learned residual weights is symmetric and ranges over [−0.5, 0.5], which implies that the incompatibility of ReLU and element-wise addition can be appropriately handled. The networks are optimized by projected stochastic gradient descent with exactly the same number of training epochs as the original residual networks.




We conduct experiments on CIFAR-10 [16] to verify the practicability of the weighted residual networks. Training with the weighted residual networks converges much faster and reaches a higher performance at negligible additional computation and GPU memory cost compared with the original residual networks. The weighted residual networks with depths beyond 1000 layers still converge faster than shallower networks and enjoy a consistent improvement in accuracy with increasing depths from 100+ layers to 1000+ layers, without resorting to any hand-crafted strategy such as “warm up” [3]. After applying dropout on the residuals, our weighted residual networks reach a very high accuracy (95.3%) on CIFAR-10 using a 1192-layer model with the same number of training epochs as the original residual networks (about 164 epochs, 64k iterations).








The contributions of this paper are fourfold:

– We propose the weighted residual networks, which learn to combine the residuals from each residual unit. The weighted residual networks converge much faster in the training stage and reach a higher accuracy than the original residual networks at little additional computation and GPU memory cost.

– The incompatibility of ReLU and element-wise addition is addressed appropriately by the weighted residuals, and we clear all the obstacles on the information highway so that the highway signal propagates unhindered.

– The residuals are added to the highway signal gradually to make the training process more reliable; even networks with depths beyond 1000 layers converge very fast without the “warm up” strategy.

– We modify the down-sampling step to make the spatial size and feature dimension consistent between the highway signal and the branched residual signal, without resorting to zero-padding or an extra conversion matrix. The weighted residual networks are simple and easy to implement while being surprisingly effective in practice, which makes them particularly useful for complicated residual networks in the research community and in real applications.

2 Related Work



Residual networks have attracted many researchers, and many works on them have appeared [4,5,6,7,17]. In the following paragraphs we review some related works.


The residual networks simplify the highway networks [13] using identity skip connections, which allow information to flow directly and bypass complex layers. A residual network consists of many residual units. There are two information flows in a residual unit: the highway signal goes through the identity skip connection, and the branched residual signal is realized by Conv-BN-ReLU-Conv-BN. The two flows are combined at the end of a residual unit by element-wise addition, and the result then goes through a ReLU layer for activation. This simple structure is quite powerful and achieved a surprising performance on the ImageNet challenge [1] with 150-layer networks [3].



In the original residual networks, the two flows are added up before ReLU activation for the numerical reason that ReLU can only produce non-negative output, which would mean that the branched residual signal can only enhance the highway signal. However, intuitively this is not a natural solution, as the branched residual signal needs to be “activated”.


He et al. [4] proposed to handle the incompatibility between ReLU and element-wise addition by re-arranging these layers to BN-ReLU-Conv-BN-ReLU-Conv and named it the “pre-activation” structure. When applying the “pre-activation” structure, special attention should be paid to the first and the last residual units of the network.


To train “residual” networks, it is natural to fit the “residual” only, which means that when the branched residual signal is absent, the highway signal should still produce meaningful results. Under this condition, the branched residual signal can focus on fitting the “residual” in a residual unit. Huang et al. [6] proposed a dropout residual network, which randomly drops the branched residual signal in each residual unit. Therefore, when the branched residual signal is present in a residual unit, it can focus on fitting the “residual”. As this model can be treated as an ensemble of models with different depths, they named it “stochastic depth networks”.


In convolutional networks, depth and width are both important for high performance in image classification [7,3]. The conv1-conv3-conv1 bottleneck structure, which uses a feature dimension 4× larger than conv3-conv3, reached a higher performance [4]. Zagoruyko et al. [7] used conv3-conv3 with a 10× larger feature dimension and reached the best performance on CIFAR-10 (4.10% error). However, a larger feature dimension costs much more GPU memory and leads to a shallower structure. There is a balance between depth and width.



In this paper, we mainly focus on models with depths beyond 100 layers. We aim to explore how to train a very deep model effectively rather than to tune a more accurate model.

3 Weighted Residual Networks



First we give a brief introduction to the residual networks. The residual networks build the information highway by allowing earlier feature representations to flow unimpeded and directly to the following layers without any modification. A residual unit performs the following computation:

$$x_{i+1} = \mathrm{ReLU}\left(x_i + \Delta L_i(x_i, \theta_i)\right) \tag{1}$$

Here $x_i$ is the input highway signal to the $i$-th residual unit, $\theta_i$ denotes the filter parameters of the residual unit, initialized by “msra”, and $\Delta L_i$ is the residual function, which is realized by a stack of two 3×3 convolutional layers. Typically, one convolutional layer is followed by one batch normalization layer to keep the signal at non-zero variance and one ReLU layer for non-linear activation. The highway should be clean and unhindered. As shown in [4], obstacles on the highway, such as constant scaling and dropout, make the optimization difficult. A typical residual unit is depicted in Figure 1. The original residual networks described above have two defects.

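Figure 1: Schematic of a residual unit. The residual function consists of two 3×3 convolutional layers. Each convolutional layer (Conv) is followed by a batch normalization layer (BN) and a ReLU layer (ReLU). The convolutional weights are initialized by “msra”. The highway signal and the residual signal are combined by element-wise addition.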


Incompatibility of ReLU and element-wise addition. The highway signal and the residual signal produced by the residual function are combined by element-wise addition. However, the element-wise addition is placed between the BN layer and the ReLU layer after the second Conv layer. This is mainly due to the ReLU activation function, which produces non-negative output. The output of a ReLU operation is not compatible with element-wise addition, as it can only enhance the highway signal; this limits the representability of the residual function, which is meant to take values in (−∞, +∞). One can of course resort to designing other activation functions that take values in a larger range or are symmetric around zero.
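The incompatibility can be seen on a toy example. The following is a minimal sketch, not code from the paper: feature maps are stood in for by short Python lists, and the helper names `relu`, `add` and `scale` are hypothetical. It shows that adding a ReLU output can only raise each element of the highway signal, whereas a signed scalar weight on the same output can also lower it.

```python
def relu(values):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, v) for v in values]

def add(a, b):
    """Element-wise addition of two equally long lists."""
    return [x + y for x, y in zip(a, b)]

def scale(weight, values):
    """Multiply every element by a scalar weight."""
    return [weight * v for v in values]

highway = [0.5, -1.0, 2.0]          # toy highway signal
pre_activation = [1.5, -0.7, 0.3]   # toy residual-branch output before ReLU
residual = relu(pre_activation)     # every element is >= 0

# Addition after ReLU: each element of the highway stays the same or grows.
enhanced_only = add(highway, residual)
assert all(out >= h for out, h in zip(enhanced_only, highway))

# A negative scalar weight lets the same non-negative residual weaken the highway.
weakened = add(highway, scale(-0.5, residual))
assert all(out <= h for out, h in zip(weakened, highway))
```

The second case is the situation a learned residual weight in (−1, 1) makes possible.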



Initialization of very deep networks. Very deep networks with depths beyond 1000 layers, even when equipped with residual structure, batch normalization and ReLU, still do not converge in the training stage, as shown in Figure 5. The paper [3] proposed to “warm up” the network training with a small learning rate for several epochs and then restore the normal learning rate in order to facilitate the initial convergence. However, for deeper networks, even a very small learning rate may not work well [2].




In very deep networks, the residuals from all blocks are added together, which makes the training hard to converge. One may want to zero all the residuals at the start of training. However, the weights of the convolutional layers in the residual functions should be initialized by “msra”, which has little probability of producing all-zero weights.

3.1 Weighted Residuals



To address the incompatibility of ReLU and element-wise addition and to get a better initialization for very deep networks, we introduce the weighted residual networks. Formally, in a weighted residual unit, the computation of the signal is

$$x_{i+1} = x_i + \lambda_i \, \Delta L_i(x_i, \theta_i) \tag{2}$$

where $\theta_i$ denotes the filter parameters, initialized by “msra”, and $\lambda_i$ is the scalar weight for the residual, initialized to zero and trained with a very small learning rate. The ReLU activation is removed from the highway, and $\Delta L_i$ is realized by two Conv-BN-ReLUs.
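A weighted residual unit can be sketched in a few lines. This is an illustrative toy under stated assumptions, not the authors' Caffe implementation: signals are plain lists, `toy_residual_fn` is a hypothetical stand-in for the two Conv-BN-ReLU layers (it only shares the property of ending in a ReLU), and the range check reflects the constraint set (−1, 1) used in the optimization section.

```python
def weighted_residual_unit(x, residual_fn, lam):
    """Return x + lam * residual_fn(x), element-wise, with no ReLU on the highway."""
    if not -1.0 < lam < 1.0:
        raise ValueError("residual weight is expected to lie in (-1, 1)")
    return [xi + lam * ri for xi, ri in zip(x, residual_fn(x))]

def toy_residual_fn(x):
    """Hypothetical stand-in for two Conv-BN-ReLU layers: ends in a ReLU, so output >= 0."""
    return [max(0.0, 2.0 * xi - 1.0) for xi in x]

x = [0.2, 1.0, -3.0]

# Zero-initialized weight: the unit is exactly the identity at the start of training,
# even though the residual function itself is not zero.
at_init = weighted_residual_unit(x, toy_residual_fn, lam=0.0)

# Once trained, the weight may be negative, so the residual can weaken the highway.
later = weighted_residual_unit(x, toy_residual_fn, lam=-0.2)
```

With `lam=0.0` the output equals the input, which is how zero-initialized residual weights sidestep the need for all-zero convolution weights.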


For blocks of any depth, the feature representation $x_{i+k}$ in the $(i+k)$-th layer can be expressed as the sum of the input layer representation $x_i$ and a series of weighted residual functions,

$$x_{i+k} = x_i + \sum_{j=0}^{k-1} \lambda_{i+j} \, \Delta L_{i+j}(x_{i+j}, \theta_{i+j}) \tag{3}$$

In the back-propagation stage, the gradient of any layer does not vanish when the filter parameter $\theta_{i+j}$ is arbitrarily small. Note that the pre-activation structure proposed in [4] has a similar property, obtained by changing the order Conv-BN-ReLU to BN-ReLU-Conv.
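As a hedged illustration of why the gradient does not vanish, one can differentiate the summation described above. The scalar loss symbol $\ell$ and the identity matrix $I$ are introduced here for the sketch and do not appear in the original text; the argument follows the usual identity-mapping analysis rather than a derivation given by the authors.

```latex
\begin{aligned}
x_{i+k} &= x_i + \sum_{j=0}^{k-1} \lambda_{i+j}\,\Delta L_{i+j}(x_{i+j}, \theta_{i+j}) \\
\frac{\partial \ell}{\partial x_i}
  &= \frac{\partial \ell}{\partial x_{i+k}}
     \left( I + \frac{\partial}{\partial x_i}
       \sum_{j=0}^{k-1} \lambda_{i+j}\,\Delta L_{i+j}(x_{i+j}, \theta_{i+j}) \right)
\end{aligned}
```

The identity term passes $\partial \ell / \partial x_{i+k}$ back to $x_i$ unchanged, so the gradient reaching layer $i$ keeps this component however small the filter parameters or the residual weights are.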




In Figure 3 we visualize the distribution of the learned residual weights in a 1192-layer model. The residual weight values are distributed symmetrically within roughly (−0.5, 0.5), which means the branched residual signal is equally likely to enhance or weaken the highway signal. The incompatibility between ReLU and element-wise addition is thus appropriately addressed by the learned residual weights.

3.2  Structural Modification



At the beginning of a new block in the original residual networks, the highway signal is down-sampled by a stride-2 convolutional layer, while the branched residual signal also needs to be halved by a stride-2 convolutional layer. When performing the element-wise addition, zero-padding or a conversion matrix is necessary to match the feature dimensions of the two signals. In our networks, as shown in Figure 4, we directly halve the feature size at the beginning of the block, and the following layers are performed as stated in the previous sections.

3.3  Optimization


Given training images and their corresponding ground-truth labels {$I_i$, $y_i$}, the loss function is the sum of the negative log-likelihood and the regularization term, where $\theta$ denotes the network parameters, initialized by “msra”, and $\lambda$ is the weight vector for the residuals, initialized to all zeros. We apply projected SGD to this typical constrained optimization problem. In the $(t+1)$-th iteration, the updated $\lambda_i^{t+1}$ is projected onto the convex set $S$, where $S = (-1, 1)$ and $\Delta\lambda_i^t$ is the gradient of the loss function in Equation 4 with respect to $\lambda_i^t$, which is efficiently computed by back-propagation [18] in deep networks.
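The projection step can be sketched as follows. This is a minimal illustration that assumes the standard projected-SGD form, a plain gradient step followed by clipping back into the constraint set; the exact update in the paper is not reproduced here. Momentum is left out for brevity, the gradients are made-up numbers, and because S = (−1, 1) is an open interval the `margin` parameter is an assumption that keeps the clipped value strictly inside it.

```python
def project(value, bound=1.0, margin=1e-6):
    """Clip a scalar into [-bound + margin, bound - margin], i.e. strictly inside (-1, 1)."""
    low, high = -bound + margin, bound - margin
    return min(max(value, low), high)

def projected_sgd_step(lambdas, grads, lr=0.001):
    """One projected-SGD update: gradient step on each residual weight, then projection."""
    return [project(lam - lr * g) for lam, g in zip(lambdas, grads)]

lambdas = [0.0, 0.9995, -0.3]   # current residual weights (toy values)
grads = [5.0, -2.0, 1.0]        # toy gradients, not taken from a real network
updated = projected_sgd_step(lambdas, grads)

for before, after in zip(lambdas, updated):
    print(f"{before:+.4f} -> {after:+.6f}")
```

The second weight would leave the interval after the raw step (0.9995 + 0.002), so the projection pulls it back to just under 1; the other two are unaffected by the clipping.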

3.4  Implementation Details


Dataset. CIFAR-10 [16] is a dataset of 32×32 color images, consisting of 50k training images and 10k test images in 10 classes. We train our deep models on the training set and evaluate the final trained models on the test set. We follow the same residual architecture as proposed in [3]. Our code is built on the open-source deep learning framework Caffe [19]. We use a weight decay of 0.0001 and a momentum of 0.9 with a batch size of 128. The initial learning rate is 0.1, without “warm up” for any model. The initial learning rate for the residual weights is set to 0.001. The filter parameters are initialized by “msra” [15]. The residual weights are initialized to all zeros. We do not aim to push the state-of-the-art performance on CIFAR-10, so we follow the same training strategy as [3]. All the models are trained for 64k iterations, and the learning rate is divided by 10 at 32k and 48k iterations. We also adopt the simple data augmentation shown in [20]: 4 zero-valued pixels are padded around the training images, and a translated or mirrored 32 × 32 crop is fed into the networks. We do not use a validation set, and the model at the end of training is evaluated on the test set. In the test stage, the original 32 × 32 images are evaluated.
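The training schedule described above can be written down as a small sketch. The function name `learning_rate` and the constants are hypothetical names chosen for illustration; the numbers come from the text. One assumption is made explicit: the residual weights are taken to follow the same divide-by-10 schedule as the filter parameters, which the text does not state.

```python
def learning_rate(iteration, base_lr, milestones=(32_000, 48_000), factor=0.1):
    """Step schedule: scale base_lr by `factor` once per milestone already reached."""
    drops = sum(1 for m in milestones if iteration >= m)
    return base_lr * factor ** drops

# Hyperparameters quoted in the text.
FILTER_LR = 0.1          # initial learning rate for filter parameters
RESIDUAL_LR = 0.001      # initial learning rate for residual weights
TOTAL_ITERATIONS = 64_000
BATCH_SIZE = 128
TRAIN_IMAGES = 50_000

# 64k iterations of 128 images over a 50k-image training set.
epochs = TOTAL_ITERATIONS * BATCH_SIZE / TRAIN_IMAGES

for it in (0, 32_000, 48_000):
    print(it, learning_rate(it, FILTER_LR), learning_rate(it, RESIDUAL_LR))
print(f"epochs: {epochs:.2f}")
```

The epoch count works out to 163.84, consistent with the "about 164 epochs" quoted for 64k iterations.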

The network contains three blocks, and the feature map is halved twice. There are 6n + 4 layers in total, as shown in Table 1. We compare n = {1,3,9,18,48,100,198}, which leads to 10-, 22-, 58-, 112-, 292-, 604- and 1192-layer networks.
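The depth rule is easy to check by hand or in code. In the sketch below, `depth` is a hypothetical helper name. The 6n term presumably counts three blocks of n residual units with two convolutional layers each; how the remaining 4 layers are assigned is given in Table 1, which is not reproduced here.

```python
def depth(n):
    """Total layer count for n residual units per block, using the 6n + 4 rule."""
    return 6 * n + 4

units_per_block = (1, 3, 9, 18, 48, 100, 198)
depths = [depth(n) for n in units_per_block]

for n, d in zip(units_per_block, depths):
    print(f"n = {n:3d} -> {d}-layer network")
```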

4  Experiments


In this section we present and analyze the experimental results on CIFAR-10 to demonstrate the effectiveness of the weighted residual networks.

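Figure 5: Comparison of the original residual networks and the weighted residual networks on CIFAR-10. Thick lines denote the weighted residual networks and dashed lines denote the original residual networks. Top left: training entropy loss of the shallow networks. Top right: the corresponding test accuracy of the shallow networks. Bottom left: training entropy loss of the very deep networks. Bottom right: a zoomed-in version showing more detail.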

4.1 Experimental Results



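Figure 6: Test accuracy on CIFAR-10. The original 1192-layer residual network failed to reach meaningful results in the training stage, so we do not report it.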

Convergence. First we experiment on shallow networks (fewer than 100 layers). As shown in Figure 5(a) and Figure 5(b), the weighted residual networks and the original residual networks have very similar convergence behavior and final accuracy on shallow networks. Then we conduct experiments on very deep networks (more than 100 layers). In Figure 5(c), the weighted residual network shows much better convergence in the training stage. In fact, networks with depths beyond 1000 layers still converge faster than the 112-layer networks in Figure 5(d). In contrast, the original residual network does not converge well, and the 1192-layer network does not converge at all, as we did not apply the “warm up” strategy. However, even when equipped with “warm up”, the original 1192-layer residual network ends up over-fitting and reaches a worse performance than the 112-layer network, as reported in [3].

Accuracy. The overall test accuracy of deep networks on CIFAR-10 is reported in Figure 6. The blue histograms denote the performance of the original networks. The accuracy decreases once the number of layers exceeds 100. However, for the weighted residual networks, denoted by the yellow histograms, the performance improves consistently as the depth increases from 10+ layers to 1000+ layers. Throughout our experiments, the weighted residual networks always converge faster and reach a higher performance when there are more layers.

4.2 Comparison with the State of the Art




In this subsection we compare the weighted residual networks (WResNet) with other recently proposed models. There are mainly two kinds of models: the first focus on enlarging the feature dimension, and we call them wide models; the second focus on depth, and we call them deep models. Note that the 1001-layer Pre-activation [4] is both a deep (1000+ layers) and a wide (4× feature dimension) model. The results are presented in Table 2. All these models, except for Highway [13], share similar structures with ResNet [3], including three feature blocks.


Pre-activation [4] adopted a conv1-conv3-conv1 bottleneck structure and enlarged the feature dimension by 4×. Apparently a 4× wider model enjoys a higher performance but costs more GPU memory. As GPU memory is a limited resource (12GB for one GTX TITAN X), it is important to tune the model width and depth economically to obtain a very accurate model. WideDim [7] and RiR [5] are two other methods that enlarge the feature dimension for higher accuracy. A clear tendency is that a wider feature gives higher performance. WideDim adopted a 10× feature dimension and reached a very high performance (95.8%) on CIFAR-10. Dropout [6] realized stochastic depth networks by applying the dropout operation to the residual signal at exactly the same GPU memory cost. Its only defect is that it needs many more epochs (about 2×) to converge to a good performance.




The weighted residual networks make the training of very deep networks converge faster and reach a good performance while adding little computation and GPU memory burden. As time and GPU resources are limited, we have not tuned the model width (feature dimension) or used more training epochs; our aim is to explore the effectiveness of the weighted residuals in training very deep models. Yet even with a smaller feature dimension, the weighted residual networks still perform much better than the original residual networks and reach a quite meaningful accuracy, as shown in Table 2.


We further apply dropout on the residuals with dropout ratios of {0.2, 0.4, 0.6} for the three blocks, as proposed by [6]. This model is named WResNet-d. With only about half the training epochs of [6], the weighted residual networks with dropout reach a relatively high performance (95.3%).

4.3  Analysis



In this subsection we provide more insight into the weighted residual networks by presenting more detailed results.




The initial learning rate for the residual weights is set to 0.001 for all models, and the residual weights are initialized to zero. We plot the residual weight values in each element-wise addition layer of a 1192-layer model. The plot comprises two parts divided by a visible sharp boundary around the 800th layer, and the later residuals have larger weights. This may imply that the residuals from the later layers are more important than those from earlier layers for the final decisions. We will explore this phenomenon in future work.




We have also plotted the evolution of the distribution of the residual weight values, as shown in Figure 8. At the 8k-th iteration, the distribution is relatively uniform. As training proceeds, the distribution begins to concentrate around two peaks. At the 64k-th iteration, most of the residual weight values are around 0.2 and −0.2 in a symmetric pattern, indicating that the branched residual signals are equally likely to enhance or weaken the highway signals, which verifies our hypothesis. Therefore the learned residual weights can appropriately resolve the incompatibility between ReLU activation and element-wise addition.

5  Conclusion


The original residual networks have two defects: 1) incompatibility between ReLU and element-wise addition; 2) difficulty for networks to converge with depths beyond 1000 layers using the “msra” initializer. In this paper we introduce the weighted residual networks, which make very deep residual networks converge faster and reach a higher performance with little additional computation and GPU memory burden compared with the original residual networks. All the residuals are added to the highway signal gradually through learned, slowly growing weights to ensure convergence. Experiments on CIFAR-10 have demonstrated the effectiveness of the weighted residual networks for very deep models: they enjoy a consistent improvement in accuracy and convergence as the depth increases from 100+ layers to 1000+ layers. The weighted residual networks are simple and easy to implement while being surprisingly effective in practice, which makes them particularly useful for complicated residual networks in the research community and in real applications.

(II) Multi-input weighted residual connections


(III) Cross stage partial connections (CSP)


1. Introduction

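Cross Stage Partial Network (CSPNet) addresses, from the perspective of network architecture design, the problem that earlier approaches require a large amount of computation at inference time.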


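The figure below shows the results of combining CSPNet with different backbones: the computation drops substantially while accuracy stays the same or improves slightly (note: the gain on classification is indeed small).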




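  • Strengthen the learning ability of the CNN, so that it stays accurate while becoming lightweight.
  • Reduce computational bottlenecks.
  • Reduce memory cost.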

2. Implementation


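The Transition Layer in the figure mainly consists of a bottleneck layer (1x1 convolution) and an optional pooling layer. Panel (a) shows the feature fusion of the original DenseNet. Panel (b) shows the feature fusion of CSPDenseNet (transition -> concatenation -> transition). Panel (c) shows the Fusion First strategy (concatenation -> transition). Panel (d) shows the Fusion Last strategy (transition -> concatenation).

Fusion First concatenates the feature maps of the two branches first, so the gradient information can be reused.

Fusion Last first applies the transition to the branch containing the Dense Block and then concatenates; the gradient information is truncated, so it is not reused.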

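Comparison of Fusion First and Fusion Last

The figure above compares Fusion First, Fusion Last and the fusion strategy finally adopted by CSP (CSPPeleeNet in the figure) on the ILSVRC2012 classification dataset. The following conclusions can be drawn:

  • Fusion First helps reduce the computational cost, but accuracy drops significantly.
  • Fusion Last also greatly reduces the computational cost, while top-1 accuracy drops by only 0.1 percentage points.
  • The fusion strategy adopted by CSP, which uses both Fusion First and Fusion Last, reduces the computational cost while improving accuracy.

The figure above shows DenseNet and the CSPDenseNet modification. The change is that CSPNet splits the shallow feature map into two parts: one part passes through the dense module (the Partial Dense Block in the figure), and the other is concatenated directly with the output of the Partial Dense Block.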
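The split-and-concatenate idea can be sketched with plain Python lists. This is a toy illustration, not CSPNet code: each list element stands for one channel, `toy_dense_block` is a hypothetical stand-in for the Partial Dense Block (each layer appends one new channel computed from all earlier ones, mimicking dense connectivity), the 50/50 split ratio is an assumption, and the transition layers are left out.

```python
def split_channels(feature_map, ratio=0.5):
    """Split a list of channels into two parts along the channel axis."""
    cut = int(len(feature_map) * ratio)
    return feature_map[:cut], feature_map[cut:]

def toy_dense_block(channels, num_layers=2):
    """Hypothetical dense block: each layer appends one channel built from all earlier ones."""
    out = list(channels)
    for _ in range(num_layers):
        out.append(sum(out))
    return out

def csp_block(feature_map, partial_block, ratio=0.5):
    """Send one part through the block and concatenate the untouched part with its output."""
    bypass, to_process = split_channels(feature_map, ratio)
    processed = partial_block(to_process)
    return bypass + processed   # list concatenation plays the role of channel concat

features = [1.0, 2.0, 3.0, 4.0]
output = csp_block(features, toy_dense_block)
print(output)
```

Only half of the channels enter the dense block, which is where the reduction in computation comes from, while the bypassed half reaches the output unchanged.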





3. FPN Design


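The first, shown in panel (a), is the most common FPN, used in YOLOv3. (Note: the FPN in YOLOv3 differs from the original FPN; it fuses features by concatenation.)

The second, shown in panel (b), is the GFM proposed in ThunderNet, covered in detail in an earlier post. It fuses several features of different resolutions directly, and the fusion is done by addition.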
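The two fusion modes mentioned above differ in what they do to the channel count. The sketch below is a toy under stated assumptions: each list element stands for one channel, both inputs are assumed to have already been resized to the same spatial resolution, and the function names `fuse_concat` and `fuse_add` are hypothetical.

```python
def fuse_concat(a, b):
    """Concatenate along the channel axis: channel counts add up."""
    return a + b

def fuse_add(a, b):
    """Element-wise sum: both inputs need the same number of channels."""
    if len(a) != len(b):
        raise ValueError("addition needs matching channel counts")
    return [x + y for x, y in zip(a, b)]

shallow = [1.0, 2.0, 3.0]   # toy channels from a higher-resolution level
deep = [10.0, 20.0, 30.0]   # toy channels from a lower-resolution level, already upsampled

concatenated = fuse_concat(shallow, deep)   # 6 channels
summed = fuse_add(shallow, deep)            # still 3 channels
print(len(concatenated), len(summed))
```

Concatenation keeps both inputs intact at the cost of more channels for the next layer, while addition keeps the channel count fixed but requires the inputs to match.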


4. Experiments


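The experiments are based on the MS COCO dataset. PRN is an idea similar to CSP, proposed by the same team and accepted at ICCV.

The figure above is from "Enriching Variety of Layer-wise Learning Information by Gradient Combination", i.e. the PRN network. It also splits the input features into two parts: one part goes through convolution, and the other is fused directly by concatenation.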






The figure below shows SOTA models on the MS COCO dataset:

Comparison of results on MS COCO




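CSPNet and PRN share the same idea: split the feature map into two parts, apply convolution to one part, and concatenate the other part with the result of that convolution.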









PRN: Enriching Variety of Layer-wise Learning Information by Gradient Combination





