CrowdNet: A Deep Convolutional Network for Dense Crowd Counting
Lokesh Boominathan, Srinivas S S Kruthiventi, R. Venkatesh Babu
Abstract:
The paper proposes a novel deep learning framework for estimating crowd density from static images of highly dense crowds.
We use a combination of deep and shallow fully convolutional networks to predict the density map for a given crowd image. This combination effectively captures both the high-level semantic information (face/body detectors) and the low-level features (blob detectors) that are necessary for crowd counting under large scale variations.
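To make the role of the density map concrete: in density-map approaches, the crowd count is usually read off as the sum of the predicted map. These notes do not spell that step out, so the sketch below is an illustration under that assumption; `count_from_density` and `toy_density` are hypothetical names, and the values are made up.

```python
def count_from_density(density_map):
    """Sum a 2D density map (a list of rows) to get the estimated count."""
    return sum(sum(row) for row in density_map)


# Toy 3x3 density map with made-up values; each cell holds a fraction
# of a person, so the cells together integrate to the crowd count.
toy_density = [
    [0.0, 0.25, 0.25],
    [0.0, 0.25, 0.25],
    [0.5, 0.5, 0.0],
]

estimated_count = count_from_density(toy_density)
print(estimated_count)
```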
We perform multi-scale data augmentation. Augmenting the training samples in this manner helps guide the CNN to learn scale-invariant representations.
Proposed method:
Our deep network captures the high-level semantics required for crowd counting using an architectural design similar to the well-known VGG-16 [17] network.
In our model, we aim to recognize the low-level head blob patterns, arising from people far from the camera, using a shallow convolutional network.
Data augmentation:
We primarily perform two types of augmentation. The first type helps tackle the problem of scale variations in crowd images, while the second type improves the CNN's performance in regions where it is highly susceptible to making mistakes, i.e., highly dense crowd regions.
We crop patches from the multi-scale pyramidal representation of each training image. To construct the image pyramid, we consider scales of 0.5 to 1.2 times the original image resolution, incremented in steps of 0.1 (as shown in Fig. 3). We crop 225×225 patches with 50% overlap from this pyramidal representation. With this augmentation, the CNN is trained to recognize people irrespective of their scale.
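The pyramid-and-crop procedure can be sketched as follows. This is a minimal illustration that only computes crop coordinates, not the paper's code: the function names are hypothetical, the stride of 112 pixels is an assumed rounding of 50% of 225, and dropping partial patches at the image border is also an assumption.

```python
def pyramid_scales(start=0.5, stop=1.2, step=0.1):
    """Scales 0.5, 0.6, ..., 1.2, rounded to avoid float drift."""
    n = round((stop - start) / step) + 1
    return [round(start + i * step, 1) for i in range(n)]


def patch_boxes(width, height, patch=225, overlap=0.5):
    """Top-left-anchored crop boxes (left, top, right, bottom).

    Assumption: the stride is patch * (1 - overlap), truncated to an
    integer, and patches that would cross the border are dropped.
    """
    stride = max(1, int(patch * (1 - overlap)))
    boxes = []
    for top in range(0, height - patch + 1, stride):
        for left in range(0, width - patch + 1, stride):
            boxes.append((left, top, left + patch, top + patch))
    return boxes


def pyramid_patch_boxes(width, height):
    """Map each pyramid scale to the crop boxes of the rescaled image."""
    result = {}
    for scale in pyramid_scales():
        scaled_w = int(round(width * scale))
        scaled_h = int(round(height * scale))
        result[scale] = patch_boxes(scaled_w, scaled_h)
    return result


# Hypothetical 1000x500 image: count the patches at each scale.
for scale, boxes in pyramid_patch_boxes(1000, 500).items():
    print(scale, len(boxes))
```

In a real pipeline each box would be applied to the resized image and, with the counts rescaled accordingly, to its ground-truth density map.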
We observed that CNNs find highly dense crowds inherently difficult to handle. To overcome this, we augment the training data by sampling high-density patches more often.
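The notes do not say how the high-density patches are oversampled. One possible scheme, shown here purely as an assumption, is to draw patches with probability proportional to their ground-truth head count; `sample_patches` and the `+ 1.0` smoothing term are hypothetical choices.

```python
import random


def sample_patches(patch_counts, k, seed=0):
    """Draw k patch indices, favoring patches with more annotated heads.

    patch_counts: ground-truth head count of each candidate patch.
    Assumption: the weight is count + 1, so empty patches can still
    be drawn occasionally.
    """
    rng = random.Random(seed)
    weights = [count + 1.0 for count in patch_counts]
    return rng.choices(range(len(patch_counts)), weights=weights, k=k)


# Toy example: two sparse patches and one very dense patch.
picks = sample_patches([0, 0, 1000], k=1000)
print(picks.count(2), "of 1000 draws came from the dense patch")
```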
Training:
5-fold cross-validation
We sample 225×225 patches from each of the 40 training images following the previously described data augmentation method. This procedure yields an average of 50,292 training patches per fold.
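A 5-fold split consistent with the 40 training images per fold can be sketched as below. The total of 50 images is inferred from those two numbers rather than stated in these notes, and the contiguous, unshuffled fold assignment is an assumption; `five_fold_splits` is a hypothetical helper.

```python
def five_fold_splits(num_images=50, folds=5):
    """Return a list of (train_indices, test_indices), one per fold."""
    indices = list(range(num_images))
    fold_size = num_images // folds
    splits = []
    for f in range(folds):
        # Hold out one contiguous block of images for testing.
        test = indices[f * fold_size:(f + 1) * fold_size]
        held_out = set(test)
        train = [i for i in indices if i not in held_out]
        splits.append((train, test))
    return splits


splits = five_fold_splits()
for train, test in splits:
    print(len(train), "train images,", len(test), "test images")
```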
Our network was trained using Stochastic Gradient Descent (SGD) optimization with a learning rate of 1e-7 and a momentum of 0.9.
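For reference, a single SGD-with-momentum update using the stated hyperparameters can be written in a few lines. This is a sketch of the classical (heavy-ball) form; which momentum variant the authors' framework uses is not stated here, so treat the exact update rule as an assumption, along with the name `sgd_momentum_step`.

```python
def sgd_momentum_step(weights, grads, velocity, lr=1e-7, momentum=0.9):
    """One classical momentum update over flat lists of floats.

    velocity <- momentum * velocity - lr * grad
    weights  <- weights + velocity
    """
    new_velocity = [momentum * v - lr * g for v, g in zip(velocity, grads)]
    new_weights = [w + v for w, v in zip(weights, new_velocity)]
    return new_weights, new_velocity


# Toy example with one parameter and a deliberately huge gradient,
# so the effect of the tiny learning rate is visible.
w, v = [1.0], [0.0]
w, v = sgd_momentum_step(w, [1e7], v)
w, v = sgd_momentum_step(w, [1e7], v)
print(w, v)
```

With a learning rate this small, per-step changes are only noticeable when gradients are large.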
Experimental results: