Paper Notes: "Crowd Counting via Multi-layer Regression" (ACM MM 2019)

guoqiangszu

Category: crowd counting paper. Tags: computer vision, deep learning

Article link: https://blog.csdn.net/guoqiangszu/article/details/105282724

Crowd Counting via Multi-layer Regression

Xin Tan, Chun Tao, Tongwei Ren, Jinhui Tang, and Gangshan Wu

Abstract:

1. Problem: as the congestion degree varies, people's appearances may look different.

2. Proposal: a Multi-layer Regression Network (MRNet), which consists of a multi-layer recognition branch and several density regressors.

3. In practice, the recognition branch recognizes the congestion degree of the regions in a crowd image, then disintegrates the image layer by layer into background and several crowd regions, each of which is assigned a different congestion degree. In each layer, the recognized crowd regions with that layer's congestion degree are delivered to a regressor with the corresponding density prior for crowd density estimation. The density maps generated at all layers are then integrated to obtain the final density map.
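The final integration step can be pictured with a small sketch. It assumes, hypothetically, that the per-layer region masks are disjoint and that "integration" means a masked sum of the per-layer density maps; the notes do not spell out the exact operator, and the names `integrate_density_maps` and `crowd_count` are made up for illustration.

```python
def integrate_density_maps(masks, density_maps):
    """Combine per-layer density maps into a single map.

    masks[k][y][x] is 1 where layer k claimed the pixel, else 0
    (assumed disjoint across layers).
    density_maps[k][y][x] is the density predicted by regressor k.
    """
    height, width = len(masks[0]), len(masks[0][0])
    final = [[0.0] * width for _ in range(height)]
    for mask, density in zip(masks, density_maps):
        for y in range(height):
            for x in range(width):
                # Each regressor only contributes inside its own region.
                final[y][x] += mask[y][x] * density[y][x]
    return final


def crowd_count(density_map):
    """The estimated head count is the sum over the density map."""
    return sum(sum(row) for row in density_map)


# Two layers on a 2x2 image: layer 0 owns the top-left pixel,
# layer 1 owns the top-right pixel, the bottom row is background.
masks = [[[1, 0], [0, 0]], [[0, 1], [0, 0]]]
densities = [[[0.5, 9.0], [9.0, 9.0]], [[9.0, 2.0], [9.0, 9.0]]]
final_map = integrate_density_maps(masks, densities)
total = crowd_count(final_map)
```

Predictions that fall outside a regressor's own region (the 9.0 values here) are discarded by the mask, so only the region-specific estimates reach the final count.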

Introduction:

The main contributions of this paper include:

• The first crowd counting method to address the diversity of congestion degrees, which is an essential factor in the accuracy of predicted density maps.

• A novel multi-layer regression network, consisting of a multi-layer recognition branch and multiple density regressors, that generates density maps separately for regions with different congestion degrees.

• An evaluation of MRNet's performance on four typical datasets, and of its superiority over state-of-the-art methods.

Method:

Recognition branch with multi-layer binary classification:

The purpose of the recognition branch is to discover where the crowd gathers and how congested each gathering is.

To tackle these difficulties and eliminate uncertainties, the recognition branch applies a multi-layer disintegration strategy to segment the crowd, in lieu of multi-class classification. Each layer of the recognition branch recognizes the crowd regions with a specific congestion degree and delivers the unidentified region to the next layer for further recognition. For instance, the first layer performs binary classification on the original image: one class stands for background, and the other represents the area where the crowd density is greater than 0.
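The layer-by-layer strategy can be sketched as a cascade of binary decisions. In this illustration the binary maps are given as inputs (in MRNet they would come from the learned recognition layers), and the convention that 1 means "denser than this layer's level, pass it on" is an assumption made here; `disintegrate` is a hypothetical name.

```python
def disintegrate(binary_maps):
    """Turn a cascade of binary maps into one congestion-degree label per pixel.

    binary_maps[k][y][x] == 1 means layer k judges the pixel to be denser
    than its own level and hands it to layer k + 1; 0 means the pixel is
    recognized at layer k and is not examined again.
    Labels: 0 = background, 1 = first congestion degree, 2 = next, ...
    """
    height, width = len(binary_maps[0]), len(binary_maps[0][0])
    labels = [[0] * width for _ in range(height)]
    unresolved = [[True] * width for _ in range(height)]
    for level, decision in enumerate(binary_maps):
        for y in range(height):
            for x in range(width):
                if not unresolved[y][x]:
                    continue  # already recognized by an earlier layer
                if decision[y][x] == 1:
                    labels[y][x] = level + 1  # at least the next degree
                else:
                    unresolved[y][x] = False  # recognized at this layer
    return labels


# Layer 1: background (0) vs. crowd (1). Layer 2: sparse (0) vs. congested (1).
layer1 = [[0, 1, 1]]
layer2 = [[1, 0, 1]]
labels = disintegrate([layer1, layer2])
```

The first pixel is settled as background by layer 1, so layer 2's opinion about it is ignored; each layer only has to solve a two-way problem on what the previous layer left over.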

Our recognition branch has two parts: a frontend and a backend. The frontend uses the first 10 convolutional layers of VGG-16 [25] to extract crowd features, while the backend, which consists of several residual blocks, recognizes and segments the crowd. The residual blocks have 256, 128 and 64 output channels, respectively, and are followed by a 1×1 convolutional layer as the output layer, as shown in Figure 3 of the paper.
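To make the layer layout concrete, here is a small shape-tracing sketch. The channel widths of the first 10 VGG-16 convolutions and the 256/128/64 backend come from the description above. The rest is assumed for illustration: that the three max-pooling layers between those convolutions are kept (as in standard VGG-16), that residual blocks preserve spatial size, and that the 1×1 output layer has a single channel.

```python
# First 10 conv layers of VGG-16 (output channels); "M" marks a 2x2 max-pool.
# Keeping the three pooling layers is an assumption of this sketch.
VGG16_FIRST_10 = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M", 512, 512, 512]
BACKEND_CHANNELS = [256, 128, 64]  # residual blocks, as described in the text
OUTPUT_CHANNELS = 1  # assumed: one score map per binary recognition layer


def trace_shapes(height, width):
    """Return (layer_type, channels, height, width) after every layer."""
    shapes = []
    channels = 3  # RGB input
    for item in VGG16_FIRST_10:
        if item == "M":
            height, width = height // 2, width // 2
            shapes.append(("maxpool", channels, height, width))
        else:
            channels = item  # 3x3 conv with padding 1 keeps the spatial size
            shapes.append(("conv3x3", channels, height, width))
    for out_channels in BACKEND_CHANNELS:
        channels = out_channels  # assumed to keep the spatial size
        shapes.append(("residual_block", channels, height, width))
    shapes.append(("conv1x1", OUTPUT_CHANNELS, height, width))
    return shapes


shapes = trace_shapes(512, 768)
for layer in shapes:
    print(layer)
```

Under these assumptions the recognition output is one eighth of the input resolution in each dimension.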

Density regressor:

To a certain extent, the density regressor shares a similar structure with a layer of the recognition branch: both follow the frontend-backend pattern and use a fine-tuned VGG-16 [25] as the frontend to extract low-level features. The architecture of the regressor is shown in Figure 4 of the paper. The recognition result is used to filter the output of the backbone, which contains the image features, through a pixel-wise product, so that each regressor learns the features of its specific crowd regions.
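The pixel-wise product can be sketched in a few lines. This assumes the recognition result is a binary mask at the same spatial resolution as the feature map and is applied identically to every channel; `mask_features` is a hypothetical helper name.

```python
def mask_features(features, recognition_mask):
    """Zero out features outside the recognized region.

    features[c][y][x] is the backbone output (channels first).
    recognition_mask[y][x] is 1 inside the region for this regressor, else 0.
    """
    return [
        [
            [value * keep for value, keep in zip(feature_row, mask_row)]
            for feature_row, mask_row in zip(channel, recognition_mask)
        ]
        for channel in features
    ]


# Two channels of a 2x2 feature map, masked to the diagonal pixels.
features = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]]
mask = [[1, 0], [0, 1]]
masked = mask_features(features, mask)
```

Because masked-out positions become zero, the regressor that receives `masked` only sees evidence from regions of its own congestion degree.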

Ground Truth Generation:

Recognition map:

Using k nearest neighbors is time-consuming, because the value of every pixel of the recognition map needs to be computed. Here we use a sliding-window scheme to generate the recognition map, which is very easy to implement with a convolutional filter. First, the value of each pixel is set to the number of head annotations that appear in the window around it. Then we set a threshold to convert the head counts into degrees of congestion: values that exceed the threshold are set to one fixed degree of congestion, and the remaining values to the other. In a 2-layer recognition branch, we set the threshold to 3, which only distinguishes sparse from congested crowds. The configurations used to generate recognition maps for the different datasets are listed in Table 2 of the paper.
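The sliding-window scheme can be sketched as follows. The window size of 3 and the extra threshold of 0 for background are illustrative assumptions (the per-dataset settings are in the paper's Table 2, which is not reproduced here); only the threshold of 3 between sparse and congested comes from the text. The function names are hypothetical.

```python
def head_counts(head_points, height, width, window):
    """Count annotated heads inside a window x window box centred on each pixel.

    Equivalent to convolving a dot map of head annotations with an all-ones
    filter of odd size `window` (zero padding at the borders).
    """
    half = window // 2
    counts = [[0] * width for _ in range(height)]
    for head_y, head_x in head_points:
        # A head contributes to every pixel whose window contains it.
        for y in range(max(0, head_y - half), min(height, head_y + half + 1)):
            for x in range(max(0, head_x - half), min(width, head_x + half + 1)):
                counts[y][x] += 1
    return counts


def recognition_map(counts, thresholds):
    """Degree of congestion = number of thresholds the head count exceeds."""
    return [[sum(count > t for t in thresholds) for count in row] for row in counts]


# Four heads clustered in the top-left of a 5x5 image.
heads = [(1, 1), (1, 2), (2, 1), (2, 2)]
counts = head_counts(heads, height=5, width=5, window=3)
# 0 = background, 1 = sparse (1 to 3 heads), 2 = congested (more than 3 heads).
degrees = recognition_map(counts, thresholds=(0, 3))
```

Looping over annotations rather than over pixels keeps the cost proportional to the number of heads times the window area, which is the practical advantage over a per-pixel nearest-neighbor search.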

Training:

We train our method in an end-to-end way. The VGG-16 frontend is fine-tuned from pre-trained VGG weights. Stochastic gradient descent (SGD) is used as the optimization method, with a learning rate of 1e-6 for the density regressor and 5e-3 for the recognition branch. For the recognition branch, we apply cross-entropy loss as the loss function to evaluate the performance of crowd recognition. For density regression, we use the Euclidean distance to measure the difference between the output density map and the ground truth. In the loss formulation, L_E denotes the Euclidean distance (i.e., the L2 loss) and L_C denotes the cross-entropy loss.
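For reference, the two loss terms can be written out in a generic form. This is a sketch in notation introduced here, not the paper's exact formulation: $N$ is the number of training images, $\hat{D}_i$ and $D_i^{GT}$ are the predicted and ground-truth density maps, $P$ is the set of pixels, $y_{i,p}$ and $\hat{y}_{i,p}$ are the ground-truth label and predicted probability of a binary recognition layer, and $\lambda$ is a hypothetical weight. The normalization constants and the way the two terms are combined are assumptions.

```latex
L_E = \frac{1}{2N} \sum_{i=1}^{N} \left\| \hat{D}_i - D_i^{GT} \right\|_2^2

L_C = -\frac{1}{N} \sum_{i=1}^{N} \frac{1}{|P|} \sum_{p \in P}
      \left[ y_{i,p} \log \hat{y}_{i,p} + (1 - y_{i,p}) \log\left(1 - \hat{y}_{i,p}\right) \right]

L = L_E + \lambda L_C
```

With separate learning rates for the regressor and the recognition branch, as described above, the relative scale of the two terms is also influenced by the optimizer settings, not only by $\lambda$.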