**Goal:**
The paper proposes a deep CNN with two related learning objectives: crowd density map estimation and crowd count estimation.
**Contributions:**
- The CNN model is trained for crowd scenes by a switchable learning process with two learning objectives, crowd density maps and crowd counts. The two different but related objectives alternately assist each other to reach better local optima.
- The target scenes require no extra labels in our framework for cross-scene counting. The pre-trained CNN model is fine-tuned for each target scene to overcome the domain gap between different scenes. The fine-tuned model is specifically adapted to the new target scene.
- The framework does not rely on foreground segmentation results, because only appearance information is used. Whether or not the crowd is moving, the crowd texture is captured by the CNN model and yields a reasonable counting result.
- We also introduce a new dataset named WorldExpo’10 for evaluating cross-scene crowd counting methods. To the best of our knowledge, this is the largest dataset for evaluating crowd counting algorithms.
**Architecture:**
The main objective for our crowd CNN model is to learn a mapping F : X → D, where X is the set of low-level features extracted from training images and D is the crowd density map of the image.
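The ground-truth density maps D are generated from head annotations. A minimal sketch that places one normalized 2D Gaussian per head, so the map integrates to the head count (the paper's kernel additionally models the body shape and adapts to perspective; the plain Gaussian and the `sigma` value here are simplifying assumptions):

```python
import math

def density_map(head_positions, height, width, sigma=4.0):
    """Build a crowd density map by placing a normalized 2D Gaussian
    at each annotated head position (row, col); the map sums to the
    number of heads, so counting reduces to integrating the map."""
    D = [[0.0] * width for _ in range(height)]
    r = int(3 * sigma)  # truncate the Gaussian at 3 sigma
    for (cy, cx) in head_positions:
        patch, total = {}, 0.0
        for y in range(max(0, cy - r), min(height, cy + r + 1)):
            for x in range(max(0, cx - r), min(width, cx + r + 1)):
                g = math.exp(-((y - cy) ** 2 + (x - cx) ** 2) / (2 * sigma ** 2))
                patch[(y, x)] = g
                total += g
        for (y, x), g in patch.items():
            D[y][x] += g / total  # normalize: each person contributes 1

    return D

# The estimated count is the integral (sum) of the density map.
dm = density_map([(20, 20), (40, 50)], 64, 64)
count = sum(sum(row) for row in dm)
```

Summing the map recovers the count, which is what lets the same network output serve both learning objectives.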
**Training:**
Training set:
Perspective normalization is necessary to estimate the pedestrian scales. Patches randomly selected from the training images are treated as training samples, and the density maps of corresponding patches are treated as the ground truth for the crowd CNN model.
The input is the image patches cropped from training images. In order to obtain pedestrians at similar scales, the size of each patch at different locations is chosen according to the perspective value of its center pixel.
Here we constrain each patch to cover a 3-meter by 3-meter square in the actual scene as shown in Figure 3. Then the patches are warped to 72 pixels by 72 pixels as the input of the Crowd CNN model.
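Assuming the perspective value at a pixel encodes how many pixels cover one meter at that image row (the helper name below is hypothetical), the crop size for a 3-meter square follows directly:

```python
def patch_size_px(perspective_value, square_meters=3.0):
    """perspective_value = pixels per meter at the patch's center row;
    a 3 m x 3 m ground square then spans this many pixels per side."""
    return int(round(square_meters * perspective_value))

# At a row where one meter spans 24 pixels, the crop is already 72 x 72 px;
# rows closer to the camera yield larger crops, which are then warped down
# to the fixed 72 x 72 network input.
side = patch_size_px(24.0)
```

Because every patch covers the same physical area, pedestrians appear at roughly the same scale regardless of where in the image they were cropped.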
Training target:
The model is trained with two loss functions, one for the density map and one for the crowd count:
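Both objectives are squared-error losses, one over the predicted density map and one over the predicted count. In the notation sketched below (an assumption reconstructed from the description), $\Theta$ are the network parameters, $X_i$ is the $i$-th training patch, $D_i$ its ground-truth density map, $Y_i$ its ground-truth count, and $N$ the number of training patches:

$$
L_D(\Theta) = \frac{1}{2N} \sum_{i=1}^{N} \lVert F_d(X_i; \Theta) - D_i \rVert^2,
\qquad
L_Y(\Theta) = \frac{1}{2N} \sum_{i=1}^{N} \left( F_y(X_i; \Theta) - Y_i \right)^2
$$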
Training process:
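The switchable process optimizes one objective at a time and alternates between the two, since both share the same network parameters. A toy runnable sketch on a shared scalar parameter (the linear model, data, learning rate, and per-epoch switching period are illustrative assumptions, not the paper's network or schedule):

```python
def train_switchable(samples, epochs=20, lr=0.01):
    """Alternate between the density objective and the count objective,
    both of which update the same shared parameter w (toy model: pred = w*x)."""
    w = 0.0
    for epoch in range(epochs):
        objective = "density" if epoch % 2 == 0 else "count"
        for x, d, y in samples:
            pred = w * x
            target = d if objective == "density" else y
            grad = 2 * (pred - target) * x  # d/dw of (w*x - target)^2
            w -= lr * grad
    return w

# Toy data: density target d = 3x; count target y = 3x (the count is the
# integral of the density, so the two objectives are consistent).
samples = [(x, 3 * x, 3 * x) for x in [0.5, 1.0, 1.5, 2.0]]
w = train_switchable(samples)
```

Because the two targets are consistent, alternating updates still drive the shared parameter toward the solution; in the paper, the switch happens when one objective's validation loss plateaus rather than on a fixed schedule.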
**Cross-scene Crowd Counting:**
In order to bridge the distribution gap between the training and test scenes, we design a nonparametric fine-tuning scheme to adapt our pre-trained CNN model to unseen target scenes.
Given a target video from an unseen scene, samples with similar properties are retrieved from the training scenes and added to the training data to fine-tune the crowd CNN model. The retrieval task consists of two steps:
- (a) Candidate scene retrieval: match the perspective map of the test scene against the perspective maps of the training scenes.
- (b) Local patch retrieval: from the candidate scenes, retrieve patches similar to those in the test scene.
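The two retrieval steps can be sketched as nearest-neighbor searches. In this sketch the scene/patch dictionaries, the feature vectors, and the Euclidean metric are illustrative assumptions; the paper matches perspective maps for step (a) and patch-level similarity for step (b):

```python
def retrieve_finetune_set(target_pmap, scenes, target_feats,
                          top_scenes=2, top_patches=3):
    """Step (a): rank training scenes by distance between their perspective
    maps and the target scene's. Step (b): from the candidate scenes,
    retrieve the patches closest to each target patch in feature space."""
    def dist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b)) ** 0.5

    # (a) candidate scene retrieval via perspective-map matching
    candidates = sorted(scenes, key=lambda s: dist(s["pmap"], target_pmap))
    candidates = candidates[:top_scenes]

    # (b) local patch retrieval: nearest patches to each target patch
    pool = [p for s in candidates for p in s["patches"]]
    selected = []
    for tf in target_feats:
        pool.sort(key=lambda p: dist(p["feat"], tf))
        selected.extend(pool[:top_patches])
    return selected

# Tiny example: two scenes have perspective maps close to the target,
# so only their patches enter the fine-tuning pool.
scenes = [
    {"pmap": [1.0, 1.0], "patches": [{"id": "a", "feat": [0.1]}]},
    {"pmap": [5.0, 5.0], "patches": [{"id": "b", "feat": [0.9]}]},
    {"pmap": [1.2, 1.1], "patches": [{"id": "c", "feat": [0.4]}]},
]
selected = retrieve_finetune_set(
    target_pmap=[1.0, 1.0], scenes=scenes,
    target_feats=[[0.1]], top_scenes=2, top_patches=1)
```

The retrieved patches then serve as a scene-specific training set, so the fine-tuned model adapts to the target scene without any labels from it.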