Paper notes: "DADNet: Dilated-Attention-Deformable ConvNet for Crowd Counting" (ACM MM 2019)

DADNet: Dilated-Attention-Deformable ConvNet for Crowd Counting

Dan Guo, Kun Li, Zheng-Jun Zha, Meng Wang

 

Abstract:

1. Problem:

Large scale variation in crowd scenes and low-resolution density maps.

2. Proposal:

we propose a novel deep model called Dilated-Attention-Deformable ConvNet (DADNet), which consists of two schemes: multi-scale dilated attention and deformable convolutional DME (Density Map Estimation).


3. The proposed model explores a scale-aware attention fusion with various dilation rates to capture different visual granularities of crowd regions of interest, and utilizes deformable convolutions to generate a high-quality density map. There are two merits as follows: (1) varying dilation rates can effectively identify discriminative regions by enlarging the receptive fields of convolutional kernels upon surrounding region cues, and (2) deformable CNN operations promote the accuracy of object localization in the density map by augmenting the spatial object location sampling with adaptive offsets and scalars. DADNet not only excels at capturing rich spatial context of salient and tiny regions of interest simultaneously, but also keeps a robustness to background noises, such as partially occluded objects.


 

Introduction:

∙ The proposed DADNet generates high-quality density maps by effectively learning visual context cues of multi-scale features, which shows strong adaptability to resist scale variations, object occlusions, and background noises in crowd images.


∙ DADNet consists of a scale-aware attention fusion and a deformable DME. The former utilizes an adaptive 2D attention map mechanism on multi-scale features for exact visual representation, while the latter augments the flexibility of spatial sampling locations of objects with learnable offsets and scalars.


∙ Extensive experiments on three crowd counting benchmark datasets (i.e., ShanghaiTech, UCF CC 50, UCF-QNRF) and one vehicle counting dataset (TRANCOS) achieve the state-of-the-art performance. Ablation studies demonstrate the effectiveness of each module within the proposed model.


Method:

Scale-aware Attention Fusion


To address the problem of crowd density variation, we design a multi-scale dilated convolution attention module as shown in Figure 2. We use different dilation scales to discover visual context cues. The core idea is to enable different visual context cues to perform the spatial referring on non-discriminative areas.


The following equations correspond to the algorithmic steps of the dilated attention scheme in Figure 2.

In this paper, we set the number of varying scales $S = 4$ with corresponding dilation rate $r \in \{1, 3, 6, 9\}$.

We calculate dilated feature maps by

$$F_{r_i} = \mathcal{F}_{r}(F_{vgg}),$$

where $F_{vgg}$ denotes the VGG feature maps and $\mathcal{F}_{r}$ denotes the dilated convolutional operation on $F_{vgg}$.

The corresponding 2D attention map $I_{r_i}$ of feature map $F_{r_i} \in \mathbb{R}^{H \times W \times \#ch}$ is formulated as:

$$I_{r_i} = \mathrm{Sigmoid}\big(\mathcal{F}_{\{1\times 1\}}(F_{r_i}, \Theta_{\mathcal{F}})\big) \in \mathbb{R}^{H \times W}$$

 

where $H \times W$ is the spatial dimension of the feature maps, $\#ch$ is the channel number, $\mathrm{Sigmoid}$ denotes the sigmoid activation function, $\mathcal{F}_{\{1\times 1\}}$ denotes the $1 \times 1$ convolution operation, and $\Theta_{\mathcal{F}}$ denotes the model parameters of $\mathcal{F}_{\{1\times 1\}}$.
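As a concrete illustration, here is a minimal PyTorch sketch of a single scale branch; the framework, the 512-channel VGG feature width, the dummy input size, and the variable names are my assumptions, not details given in the paper:

```python
import torch
import torch.nn as nn

ch = 512   # assumed channel number of the VGG feature maps
r = 3      # one of the dilation rates r in {1, 3, 6, 9}

dilated_conv = nn.Conv2d(ch, ch, kernel_size=3, padding=r, dilation=r)  # plays the role of F_r(.)
attn_conv = nn.Conv2d(ch, 1, kernel_size=1)                             # plays the role of F_{1x1}(., Theta_F)

F_vgg = torch.randn(1, ch, 64, 64)       # dummy VGG feature maps, H = W = 64
F_ri = dilated_conv(F_vgg)               # dilated feature maps; padding = r keeps the H x W size
I_ri = torch.sigmoid(attn_conv(F_ri))    # 2D attention map of shape (1, 1, H, W)
print(I_ri.shape)                        # torch.Size([1, 1, 64, 64])
```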

To make a scale-aware attention fusion, we normalize the attention maps at each scale. At dilation rate $r_i$, the normalized 2D attention map $W_{r_i}$ is defined as follows:

$$W_{r_i} = I_{r_i} \;./\; \sum_{j=1}^{S} I_{r_j},$$

where "./" is the element-wise division operation.

Finally, we employ the scale-aware attention maps $[W_{r_1}, \dots, W_{r_S}]$ to generate the fused feature maps $F_{fusion}$:

$$F_{att_i} = W_{r_i} \odot F_{r_i}, \qquad F_{fusion} = \sum_{i=1}^{S} F_{att_i},$$

where $\odot$ means the element-wise product operation, and $F_{att_i}$ denotes the scale-aware feature map at dilation scale $i$. Note that the feature dimensions of $F_{att_i}$ and $F_{fusion}$ are the same as those of $F_{vgg}$.
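Putting the three steps together (dilated features, per-scale attention maps, normalization, and weighted fusion), a self-contained PyTorch sketch could look as follows. The 512-channel width, the small epsilon for numerical stability, and summing the scale-aware feature maps to obtain F_fusion (consistent with F_fusion having the same dimensions as F_vgg) are my assumptions rather than details confirmed by the paper:

```python
import torch
import torch.nn as nn


class ScaleAwareAttentionFusion(nn.Module):
    """Sketch of the multi-scale dilated attention fusion (S = 4, r in {1, 3, 6, 9})."""

    def __init__(self, channels=512, dilation_rates=(1, 3, 6, 9), eps=1e-8):
        super().__init__()
        self.eps = eps
        # One dilated 3x3 convolution per scale: F_ri = F_r(F_vgg)
        self.dilated_convs = nn.ModuleList(
            [nn.Conv2d(channels, channels, 3, padding=r, dilation=r) for r in dilation_rates]
        )
        # One 1x1 convolution per scale producing the 2D attention map I_ri
        self.attn_convs = nn.ModuleList(
            [nn.Conv2d(channels, 1, 1) for _ in dilation_rates]
        )

    def forward(self, f_vgg):
        feats = [conv(f_vgg) for conv in self.dilated_convs]                    # F_ri
        attns = [torch.sigmoid(a(f)) for a, f in zip(self.attn_convs, feats)]   # I_ri
        # Scale-aware normalization: W_ri = I_ri ./ sum_j I_rj (element-wise division)
        attn_sum = torch.stack(attns, dim=0).sum(dim=0) + self.eps
        weights = [a / attn_sum for a in attns]                                 # W_ri
        # F_att_i = W_ri (broadcast over channels) * F_ri, fused by summation over scales
        return sum(w * f for w, f in zip(weights, feats))                       # F_fusion


# Usage: the fused features keep the same dimensions as the input VGG feature maps.
x = torch.randn(2, 512, 48, 64)
print(ScaleAwareAttentionFusion()(x).shape)  # torch.Size([2, 512, 48, 64])
```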

 

Deformable Density Map Estimation

The DME module consists of a three-layer deformable convolution (def-conv) network, in which the sampling location weight w(p), scalar Δs, and offset Δp on each def-conv layer are all learnable parameters. With the help of these parameters, the convolutional grid can self-adapt to obtain useful location cues from the fused feature map and generate a high-quality density map.

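Below is a hedged PyTorch sketch of such a deformable DME, using torchvision.ops.DeformConv2d (which accepts externally predicted offsets and, in recent torchvision versions, a modulation mask). Mapping the paper's scalar Δs to that modulation mask, the 512→256→128→1 channel widths, and the ReLU activations are my assumptions:

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d


class DeformableBlock(nn.Module):
    """One def-conv layer: auxiliary convs predict offsets (Δp) and scalars (Δs),
    then the deformable convolution with weights w(p) is applied."""

    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        self.offset = nn.Conv2d(in_ch, 2 * k * k, k, padding=k // 2)  # Δp: (x, y) shift per sampling point
        self.scalar = nn.Conv2d(in_ch, k * k, k, padding=k // 2)      # Δs: modulation scalar per sampling point
        self.dconv = DeformConv2d(in_ch, out_ch, k, padding=k // 2)   # learnable weights w(p)

    def forward(self, x):
        offset = self.offset(x)
        mask = torch.sigmoid(self.scalar(x))          # keep the modulation scalars in (0, 1)
        return torch.relu(self.dconv(x, offset, mask))


class DeformableDME(nn.Module):
    """Three-layer deformable DME that maps fused features to a one-channel density map."""

    def __init__(self, in_ch=512):
        super().__init__()
        self.layers = nn.Sequential(
            DeformableBlock(in_ch, 256),
            DeformableBlock(256, 128),
            DeformableBlock(128, 1),
        )

    def forward(self, f_fusion):
        return self.layers(f_fusion)


# Usage on fused features of the same width as the assumed VGG backbone output.
print(DeformableDME()(torch.randn(1, 512, 48, 64)).shape)  # torch.Size([1, 1, 48, 64])
```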

Loss Optimization

 

 

Given an image $I$, the ground-truth density map $Y$ is obtained by the method in [31]: $Y$ is calculated by convolving each annotated pixel with a Gaussian kernel, which is formalized as follows:

$$Y(p) = \sum_{i=1}^{N} \delta(p - p_i) * G_{\sigma}(p),$$

where $p_i$ is the position of the $i$-th annotated head and $N$ is the total number of annotations.
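As a rough illustration of this step, here is a NumPy/SciPy sketch that places a unit impulse at each annotated head position and blurs it with a Gaussian kernel; the fixed bandwidth sigma=4 and the helper name make_density_map are my assumptions (the method in [31] may instead use geometry-adaptive kernels):

```python
import numpy as np
from scipy.ndimage import gaussian_filter


def make_density_map(points, height, width, sigma=4.0):
    """points: iterable of (x, y) head annotations in pixel coordinates.
    Returns an H x W density map whose integral approximates the head count."""
    density = np.zeros((height, width), dtype=np.float32)
    for x, y in points:
        xi, yi = int(round(x)), int(round(y))
        if 0 <= yi < height and 0 <= xi < width:
            density[yi, xi] += 1.0                 # delta(p - p_i)
    return gaussian_filter(density, sigma)         # convolution with G_sigma


# Example: two annotated heads in a 64 x 64 image; the sum stays close to 2.
dmap = make_density_map([(10.2, 20.7), (40.0, 33.3)], 64, 64)
print(dmap.shape, round(float(dmap.sum()), 3))
```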

Experimental Results:

 

 

 

 
