Res2Net

Abstract—Representing features at multiple scales is of great importance for numerous vision tasks. Recent advances in backbone convolutional neural networks (CNNs) continually demonstrate stronger multi-scale representation ability, leading to consistent performance gains on a wide range of applications. However, most existing methods represent the multi-scale features in a layer-wise manner. In this paper, we propose a novel building block for CNNs, namely Res2Net, by constructing hierarchical residual-like connections within one single residual block. The Res2Net represents multi-scale features at a granular level and increases the range of receptive fields for each network layer. The proposed Res2Net block can be plugged into state-of-the-art backbone CNN models, e.g., ResNet, ResNeXt, and DLA. We evaluate the Res2Net block on all these models and demonstrate consistent performance gains over baseline models on widely used datasets, e.g., CIFAR-100 and ImageNet. Ablation studies and experimental results on representative computer vision tasks, i.e., object detection, class activation mapping, and salient object detection, further verify the superiority of the Res2Net over state-of-the-art baseline methods. The source code and trained models will be made publicly available.
Visual patterns occur at multiple scales in natural scenes, as shown in Fig. 1. First, objects may appear at different sizes in a single image, e.g., the sofa and cup are of different sizes. Second, the essential contextual information of an object may occupy a much larger area than the object itself. For instance, we need to rely on the big table as context to better tell whether the small black blob placed on it is a cup or a pen holder. Third, perceiving information from different scales is essential for understanding parts as well as objects in tasks such as fine-grained classification and semantic segmentation. Thus, it is of critical importance to design good features for multi-scale stimuli in visual cognition tasks, including image classification [22], object detection [33], attention prediction [35], target tracking [50], action recognition [36], semantic segmentation [3], and salient object detection [18].
Unsurprisingly, multi-scale features have been widely used in both conventional feature design [1], [31] and deep learning [6], [18], [29], [39], [44], [52]. Obtaining multi-scale representations in vision tasks requires feature extractors to use a large range of receptive fields to describe objects/parts/context at different scales. Convolutional neural networks (CNNs) naturally learn coarse-to-fine multi-scale features through a stack of convolutional operators. Such inherent multi-scale feature extraction ability of CNNs leads to effective representations for solving numerous vision tasks. How to design a more efficient network architecture is the key to further improving the performance of CNNs.
In the past few years, several backbone networks have made significant advances in numerous vision tasks with state-of-the-art performance. Earlier architectures such as AlexNet and VGGNet stack convolutional operators, making the data-driven learning of multi-scale features feasible. The efficiency of multi-scale ability was subsequently improved by using conv layers with different kernel sizes (e.g., InceptionNets [38]–[40]), residual modules (e.g., ResNet [17]), shortcut connections (e.g., DenseNet [20]), and hierarchical layer aggregation (e.g., DLA [47]). The advances in backbone CNN architectures have demonstrated a trend towards more effective and efficient multi-scale representations.
In this work, we propose a simple yet efficient multi-scale processing approach. Unlike most existing methods that enhance the layer-wise multi-scale representation strength of CNNs, we improve the multi-scale representation ability at a more granular level. To achieve this goal, we replace the 3 × 3 filters of n channels with a set of smaller filter groups, each with w channels (without loss of generality, we use n = s × w). As shown in Fig. 2, these smaller filter groups are connected in a hierarchical residual-like style to increase the number of scales that the output features can represent. Specifically, we divide the input feature maps into several groups. A group of filters first extracts features from a group of input feature maps. Output features of the previous group are then sent to the next group of filters along with another group of input feature maps. This process repeats several times until all input feature maps are processed. Finally, feature maps from all groups are concatenated and sent to another group of 1 × 1 filters to fuse the information altogether. Along any possible path through which input features are transformed into output features, the equivalent receptive field increases whenever it passes a 3 × 3 filter, resulting in many equivalent feature scales due to combination effects.
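To make the hierarchy concrete, below is a minimal PyTorch sketch of such a block. The class name Res2NetBlock, the scale argument, and the omission of batch normalization, the bottleneck 1 × 1 reduction layers, and downsampling are simplifications for illustration, not the authors' reference implementation.

```python
# A minimal sketch of the hierarchical residual-like connections described
# above (assumed simplification, not the paper's reference code).
import torch
import torch.nn as nn

class Res2NetBlock(nn.Module):
    def __init__(self, channels: int, scale: int = 4):
        super().__init__()
        assert channels % scale == 0, "channels must be divisible by scale"
        self.width = channels // scale          # w channels per group (n = s * w)
        # One 3x3 filter group per split; the first split is passed through
        # unchanged, as in Fig. 2 of the paper.
        self.convs = nn.ModuleList(
            nn.Conv2d(self.width, self.width, kernel_size=3, padding=1)
            for _ in range(scale - 1)
        )
        # 1x1 convolution that fuses the concatenated groups.
        self.fuse = nn.Conv2d(channels, channels, kernel_size=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Split the input feature maps into s groups of w channels each.
        splits = torch.split(x, self.width, dim=1)
        outputs = [splits[0]]                   # first group: identity
        prev = None
        for i, conv in enumerate(self.convs):
            # Each filter group receives its own split plus the output of
            # the previous group, so the receptive field grows hierarchically.
            inp = splits[i + 1] if prev is None else splits[i + 1] + prev
            prev = self.relu(conv(inp))
            outputs.append(prev)
        out = self.fuse(torch.cat(outputs, dim=1))
        return self.relu(out + x)               # residual connection

# Usage: y = Res2NetBlock(channels=64, scale=4)(torch.randn(1, 64, 32, 32))
```

With s = 4, the different paths through this block pass through zero, one, two, or three stacked 3 × 3 filters, giving equivalent receptive fields of 1 × 1, 3 × 3, 5 × 5, and 7 × 7 within a single block, which is the granular multi-scale effect described above.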
