Paper Translation: Weighted Residuals for Very Deep Networks

This paper targets two problems in training deep residual networks, the incompatibility between ReLU activation and element-wise addition and the initialization of very deep networks, and proposes weighted residual networks. By learning how to combine residuals from different layers, the model achieves faster convergence and better accuracy. Experiments show that a 1192-layer weighted residual network reaches 95.3% accuracy, outperforming the original residual networks, with only a small increase in computation and GPU memory.

Weighted Residuals for Very Deep Networks

 

Paper: https://ieeexplore.ieee.org/document/7811085

 

Deep residual networks have recently shown appealing performance on many challenging computer vision tasks. However, the original residual structure still has some defects that make it difficult to converge on very deep networks. In this paper, we introduce a weighted residual network to address the incompatibility between ReLU and element-wise addition and the deep network initialization problem. The weighted residual network is able to learn to combine residuals from different layers effectively and efficiently. The proposed models enjoy a consistent improvement in accuracy and convergence as the depth increases from 100+ layers to 1000+ layers. Besides, the weighted residual networks incur only a little more computation and GPU memory burden than the original residual networks. The networks are optimized by projected stochastic gradient descent. Experiments on CIFAR-10 show that our algorithm converges faster than the original residual networks and reaches a high accuracy of 95.3% with a 1192-layer model.

1  Introduction

The state-of-the-art models for image classification are built on inception and residual structures [1,2,3]. Lots of work on residual networks has emerged recently [4,5,6,7]. Very deep convolutional networks [8,9], especially those with residual units, have shown compelling accuracy and nice convergence behaviors on many challenging computer vision tasks [3,10,11]. Since the vanishing gradient problem is well handled by batch normalization [12] and highway signal propagation [13], networks with 100+ layers are being developed and trained; even 1000+ layer structures still yield meaningful results when combined with adequate dropout, as shown in [6]. He et al. [4] also introduced the pre-activation structure to allow the highway signal to be directly propagated through very deep networks. However, they seemed to harness features with a larger dimension (4×) and adopted multiple 1 × 1 convolutional layers to substitute for 3 × 3 convolutional layers in order to achieve convergence with 1000+ layers.

A typical convolutional unit is composed of one convolutional layer, one batch normalization layer and one ReLU layer, all performed sequentially [12]. For a residual unit, a central question is how to combine the residual signal and the highway signal; element-wise addition was proposed in [3]. A natural idea is to perform the addition after the ReLU activation. However, this leads to a non-negative output from the residual branch, which limits the representative ability of the residual unit: it can only enhance the highway signal. He et al. [3] first proposed to perform the addition between batch normalization and ReLU. In [4], they further proposed to reverse the order of the three layers, performing batch normalization and ReLU before the convolutional layers. The underlying problem is that the ReLU activation can only generate non-negative values, which is incompatible with the element-wise addition in the residual unit.
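To make the two orderings concrete, here is a minimal PyTorch sketch (not the authors' code; the channel-preserving 3 × 3 layers and stride-1 setting are my assumptions) contrasting the original design of [3], where the addition sits between batch normalization and the final ReLU, with the pre-activation design of [4], where the highway signal passes through untouched:

```python
import torch
import torch.nn as nn

class OriginalResidualUnit(nn.Module):
    """Ordering from [3]: the element-wise addition is placed between
    batch normalization and the final ReLU."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        residual = self.bn2(self.conv2(self.relu(self.bn1(self.conv1(x)))))
        return self.relu(x + residual)   # addition before the last ReLU

class PreActResidualUnit(nn.Module):
    """Pre-activation ordering from [4]: BN and ReLU come before the
    convolutions, so the highway signal x is added without a trailing ReLU."""
    def __init__(self, channels):
        super().__init__()
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        residual = self.conv2(self.relu(self.bn2(self.conv1(self.relu(self.bn1(x))))))
        return x + residual              # no ReLU after the addition
```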

As solving deep networks is a non-convex optimization problem, an appropriate initialization is important for both faster convergence and a good local minimum. The “xavier” [14] and “msra” [15] schemes are popular for deep network initialization. However, for networks with depths beyond 100 layers, neither “xavier” nor “msra” works well. The paper [3] proposed to “warm up” the network with a small learning rate and then restore the learning rate to its normal value. However, this hand-crafted strategy is not that useful for very deep networks, where even a very low learning rate (0.00001) is still not enough to guarantee convergence, and restoring the learning rate may throw away the initial convergence [2].
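For reference, this is a hedged sketch of how the “msra” [15] and “xavier” [14] schemes are commonly applied in PyTorch; the `fan_out` mode and the constant initialization of BN layers are my own choices, not details taken from the paper:

```python
import torch.nn as nn

def init_weights(model, scheme="msra"):
    """Apply "msra" (Kaiming) or "xavier" (Glorot) initialization to the
    convolutional and fully connected layers of a network."""
    for m in model.modules():
        if isinstance(m, (nn.Conv2d, nn.Linear)):
            if scheme == "msra":
                nn.init.kaiming_normal_(m.weight, mode="fan_out", nonlinearity="relu")
            else:
                nn.init.xavier_normal_(m.weight)
            if m.bias is not None:
                nn.init.zeros_(m.bias)
        elif isinstance(m, nn.BatchNorm2d):
            nn.init.ones_(m.weight)
            nn.init.zeros_(m.bias)
```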

Generally speaking, there are two defects embedded in the training of the original residual networks:

– Incompatibility of ReLU and element-wise addition.

– Difficulty for networks to converge at depths beyond 1000 layers with the “msra” initializer.

A third point is that a better way of combining the residuals from different layers is necessary to train very deep networks. For very deep networks, not all layers are equally important, as 1000-layer networks often perform little better than 100-layer networks. In fact, many layers carry redundant information, and very deep networks tend to over-fit on some tasks.

In this paper, we introduce the weighted residual networks, which learn to combine residuals from different layers effectively and efficiently. All the residual weights are initialized at zero and optimized with a very small learning rate (0.001), which allows all the residual signals to be gradually added to the highway signal. With a group of gradually growing residual weights, the 1192-layer residual networks converge even faster than the 100-layer networks. Finally, the distribution of the learned residual weights follows a symmetric pattern within [-0.5, 0.5], which means the incompatibility of ReLU and element-wise addition is properly handled. The networks are optimized by projected stochastic gradient descent with exactly the same training budget as the original residual networks.
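The English text of the source post is cut off at this point, but the idea above lends itself to a compact sketch. Below is a minimal, hypothetical PyTorch version of a weighted residual unit and the projection step: a scalar residual weight, initialized at zero, scales the residual branch, is trained with its own small learning rate, and is clamped after each update. The [-1, 1] projection bound, the zero weight decay on the residual weights, and the parameter-group setup are my assumptions rather than details confirmed by the excerpt above.

```python
import torch
import torch.nn as nn

class WeightedResidualUnit(nn.Module):
    """Residual unit whose residual branch is scaled by a learnable scalar
    weight, initialized at zero so the unit starts as an identity mapping."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.residual_weight = nn.Parameter(torch.zeros(1))  # starts at 0

    def forward(self, x):
        return x + self.residual_weight * self.body(x)

def make_optimizer(model, base_lr=0.1, weight_lr=0.001):
    """Give the residual weights a much smaller learning rate than the
    ordinary parameters (hypothetical setup for illustration)."""
    weight_params = [m.residual_weight for m in model.modules()
                     if isinstance(m, WeightedResidualUnit)]
    weight_ids = {id(p) for p in weight_params}
    other_params = [p for p in model.parameters() if id(p) not in weight_ids]
    return torch.optim.SGD(
        [{"params": other_params, "lr": base_lr},
         {"params": weight_params, "lr": weight_lr, "weight_decay": 0.0}],
        momentum=0.9, weight_decay=1e-4)

def project_residual_weights(model, bound=1.0):
    """Projection step of projected SGD: clamp the scalar residual weights
    to a box after each optimizer step (bound is an assumed value)."""
    with torch.no_grad():
        for m in model.modules():
            if isinstance(m, WeightedResidualUnit):
                m.residual_weight.clamp_(-bound, bound)
```

In this sketch, calling `project_residual_weights(model)` after every `optimizer.step()` keeps the residual weights inside the feasible box, while the zero initialization lets each residual branch fade in gradually as training proceeds.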
