**Goal:**
Leverage abundantly available unlabeled crowd imagery in a learning-to-rank framework to count crowds.
**Why did the learning-to-rank framework work?**
Acquiring labeled data for crowd counting is laborious, so we propose a self-supervised task for crowd counting which exploits crowd images that are not hand-labeled with person counts during training.
Rather than regressing to the absolute number of persons in the image, we train a network which compares images and ranks them according to the number of persons in the images.
The basic idea is that any patch contained within a larger patch must contain an equal or smaller number of persons than the larger patch.
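This containment constraint can be illustrated with a toy example (numpy stands in for a real density map here; the shapes and values are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy "density map": non-negative values whose sum acts as a person count.
density = rng.random((448, 448))

# Three square patches sharing the same center, each contained in the next.
center = (224, 224)
sizes = [112, 224, 448]
counts = []
for s in sizes:
    half = s // 2
    patch = density[center[0] - half:center[0] + half,
                    center[1] - half:center[1] + half]
    counts.append(patch.sum())

# Because density is non-negative, nested patch counts are non-decreasing,
# which is exactly the ranking signal the network is trained on.
assert counts[0] <= counts[1] <= counts[2]
```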
**How to collect a large dataset?**
Keyword query: we collect a crowd scene dataset from Google Images using crowd-related keywords.
Query-by-example image retrieval: for each existing crowd counting dataset, we collect an additional dataset by using its training images as queries to the Google Images visual search engine.
**How to generate ranked datasets?**
Given a collected crowd image, we sample a set of nested square patches (Algorithm 1): an anchor point is chosen within the central region of the image, the largest square patch centered there that still fits inside the image is cropped, and the remaining patches are progressively smaller crops around the same center. Since every smaller patch is contained in the larger ones, the patches are ordered by person count by construction, with no annotation required.
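A minimal sketch of this nested-patch sampling (the parameter names `k`, `scale`, and `anchor_frac` are assumptions made here for illustration, not taken from the paper):

```python
import random

def sample_ranked_patches(img_h, img_w, k=5, scale=0.75, anchor_frac=0.25):
    """Return k square patch boxes (y0, x0, size), each contained in the
    previous one, so their person counts are ordered by containment."""
    # 1) Pick an anchor point inside a centered region covering a fraction
    #    of the image, so even the largest patch stays inside the image.
    cy = int(img_h * (0.5 + anchor_frac * (random.random() - 0.5)))
    cx = int(img_w * (0.5 + anchor_frac * (random.random() - 0.5)))
    # 2) Largest square patch centered at the anchor that fits in the image.
    size = 2 * min(cy, cx, img_h - cy, img_w - cx)
    boxes = []
    # 3) Crop k patches, shrinking by `scale` each time, all centered at
    #    the anchor point; smaller patches are subsets of larger ones.
    for _ in range(k):
        half = size // 2
        boxes.append((cy - half, cx - half, 2 * half))
        size = int(size * scale)
    return boxes
```

Each returned box is contained in the one before it, so sorting by box size directly yields the ranking used for training.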
**How do we learn from ranked datasets?**
This question can be divided into three parts, as follows:
- Crowd density estimation network:
We consider a network trained on available crowd counting datasets with ground-truth annotations as the baseline to which we compare. The baseline network is derived from VGG-16: we remove its two fully connected layers and the max-pooling layer (pool5) to prevent further reduction of spatial resolution, and in their place add a single convolutional layer which directly regresses the crowd density map.
As the counting loss, we use the Euclidean distance between the estimated and ground truth density maps.
Instead of using the whole image as input, we randomly sample square patches of varying size (from 56 to 448 pixels); we verify that this multi-scale sampling is important for good performance.
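The multi-scale patch sampling and the Euclidean counting loss can be sketched as follows (numpy arrays stand in for images and network outputs; the function names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_square_patch(image, min_size=56, max_size=448):
    """Randomly crop a square patch of varying size (56 to 448 px)."""
    h, w = image.shape[:2]
    size = int(rng.integers(min_size, min(max_size, h, w) + 1))
    y = int(rng.integers(0, h - size + 1))
    x = int(rng.integers(0, w - size + 1))
    return image[y:y + size, x:x + size]

def counting_loss(pred_density, gt_density):
    """Euclidean (squared L2) distance between estimated and
    ground-truth density maps."""
    return float(np.sum((pred_density - gt_density) ** 2))

image = rng.random((480, 640))
patch = random_square_patch(image)
assert patch.shape[0] == patch.shape[1]          # square
assert 56 <= patch.shape[0] <= 448               # multi-scale range
assert counting_loss(patch, patch) == 0.0        # perfect prediction
```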
- Crowd ranking network:
This data has no crowd density maps; only ranking information is available via the sampling procedure described in Algorithm 1. We therefore replace the Euclidean loss by an average pooling layer followed by a ranking loss. The average pooling layer converts the density map into an estimate of the number of persons per spatial unit. When the network outputs the correct ranking, there is no backpropagated gradient. When the network's estimates disagree with the correct ranking, the backpropagated gradient causes the network to increase its estimate for the patch with the lower score and to decrease its estimate for the patch with the higher score.
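A minimal sketch of such a pairwise ranking loss on the pooled count estimates (the `margin` parameter is an assumption added here; the behavior matches the description above: zero loss, hence zero gradient, for a correct ranking):

```python
def pairwise_rank_loss(count_small, count_large, margin=0.0):
    """Hinge-style ranking loss: zero when the larger patch's estimated
    count is at least the smaller patch's, positive otherwise."""
    return max(0.0, count_small - count_large + margin)

# Correct ranking: larger patch has the higher estimate -> zero loss,
# so nothing is backpropagated.
assert pairwise_rank_loss(3.0, 5.0) == 0.0
# Violated ranking: the positive loss pushes the smaller patch's
# estimate down and the larger patch's estimate up.
assert pairwise_rank_loss(7.0, 5.0) == 2.0
```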
- Combining counting and ranking data:
We discuss three approaches to combining ground-truth-labeled crowd scenes with data for which only rank information is available:
a) Ranking plus fine-tuning; b) Alternating-task training; c) Multi-task training, with combined loss L = Lc + αLr.
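The multi-task objective in c) can be sketched in one line (the value of α here is arbitrary and purely illustrative; the paper treats it as a hyperparameter weighting the ranking term):

```python
def multitask_loss(counting_loss, ranking_loss, alpha):
    """Multi-task objective L = Lc + alpha * Lr: the counting loss on
    labeled patches plus the weighted ranking loss on unlabeled ones."""
    return counting_loss + alpha * ranking_loss

# Example: Lc = 2.0 on a labeled batch, Lr = 0.5 on an unlabeled batch.
assert multitask_loss(2.0, 0.5, alpha=10.0) == 7.0
```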