This is a paper on crowd counting published at CVPR 2013.
**Goal:**
We propose to leverage multiple sources of information to compute an estimate of the number of individuals present in an extremely dense crowd in a single image.
**Main idea:**
Our approach relies on multiple sources, such as low-confidence head detections, repetition of texture elements (using SIFT), and frequency-domain analysis, to estimate counts in an image region, along with a confidence associated with observing individuals there.
Second, we employ a global consistency constraint on counts using a Markov Random Field (MRF). This accounts for disparities in counts between local neighborhoods and across scales.
**Framework:**
The proposed framework begins by counting individuals in small patches uniformly sampled over the image. Although the density varies across the image, it does so smoothly, which suggests that the densities in adjacent patches should be similar.
When counting people in a patch, we assume the density within it is uniform, and implicitly that the number of people in each patch is independent of adjacent patches. Once we have estimated the density or count in each patch, we drop the independence assumption and place the counts in a multi-scale Markov Random Field to model the dependence among nearby patches.
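As a minimal illustration of the patch sampling step, grid extraction might look like the following pure-NumPy sketch; the function name `extract_patches`, the patch size, and the stride are illustrative choices, not values from the paper:

```python
import numpy as np

def extract_patches(image, patch_size, stride):
    """Uniformly sample square patches over the image on a regular grid.

    Returns (row, col, patch) tuples, where row/col index the patch grid
    that the later MRF smoothing stage is defined over.
    """
    h, w = image.shape[:2]
    patches = []
    for i, y in enumerate(range(0, h - patch_size + 1, stride)):
        for j, x in enumerate(range(0, w - patch_size + 1, stride)):
            patches.append((i, j, image[y:y + patch_size, x:x + patch_size]))
    return patches

# Overlapping 32x32 patches with stride 16 on a synthetic 64x64 image.
grid = extract_patches(np.zeros((64, 64)), patch_size=32, stride=16)
```

A stride smaller than the patch size gives the overlapping patches that the fusion step below trains on.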
**Counting in Patches:**
We estimate the count for each patch from three different and complementary sources, along with confidences for those counts:
1. HOG-based Head Detections
We use a Deformable Parts Model trained on the INRIA Person dataset, but apply only the filter corresponding to the head, with a much lower detection threshold than usual so that weak detections are retained.
The consistency in scale and confidence of these detections is a measure of how reliable the head detections are in that patch.
2. Fourier Analysis
When the crowd density in a patch is uniform, the periodic occurrence of heads can be captured by the Fourier transform, where it shows up as peaks in the frequency domain.
3. Interest-Point-based Counting
We use interest points not only to estimate counts but also to obtain a confidence of whether the patch represents a crowd at all. Head detection gives false positives in some regions, and Fourier analysis is crowd-blind, so counts from non-crowd patches such as sky, buildings, and trees should be discarded.
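To make the Fourier source concrete, here is a hedged sketch of frequency-domain counting: low-pass the patch in the frequency domain, then count local maxima of the reconstruction, which correspond to periodically repeating heads. The function name, the `keep_frac` cutoff, and the peak-detection details are illustrative assumptions, not the paper's exact procedure:

```python
import numpy as np
from scipy.ndimage import maximum_filter

def fourier_count(patch, keep_frac=0.2):
    """Estimate a count from periodicity: keep only low frequencies of the
    patch, then count local maxima in the reconstruction."""
    f = np.fft.fftshift(np.fft.fft2(patch))
    h, w = patch.shape
    cy, cx = h // 2, w // 2
    ry, rx = int(h * keep_frac), int(w * keep_frac)
    mask = np.zeros_like(f)                     # low-pass mask around DC
    mask[cy - ry:cy + ry + 1, cx - rx:cx + rx + 1] = 1
    recon = np.fft.ifft2(np.fft.ifftshift(f * mask)).real
    # A pixel is a peak if it is the maximum of its 5x5 neighborhood and
    # sits above the mean intensity of the reconstruction.
    peaks = (recon == maximum_filter(recon, size=5)) & (recon > recon.mean())
    return int(peaks.sum())

# Synthetic periodic "crowd texture": a 4x4-cycle grid of bright blobs.
yy, xx = np.mgrid[0:64, 0:64]
patch = np.cos(2 * np.pi * 4 * yy / 64) * np.cos(2 * np.pi * 4 * xx / 64)
n = fourier_count(patch)
```

On this synthetic pattern the count matches the number of repeating bright blobs; on a real patch it is only meaningful when the density is roughly uniform, as noted above.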
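The interest-point source can be sketched as follows, assuming keypoints (SIFT in the paper) have already been detected elsewhere; the linear `slope` and the `min_points` crowd test are hypothetical stand-ins for parameters that would be learned or tuned on training data:

```python
import numpy as np

def interest_point_count(keypoints, slope, min_points=10):
    """Turn interest points in a patch into a count plus a crowd decision.

    keypoints: (N, 2) array of detected point locations. The count is
    assumed to grow linearly with the number of repeated texture elements;
    patches with very few points (sky, plain walls) are flagged non-crowd
    so their counts can be discarded.
    """
    n = len(keypoints)
    is_crowd = n >= min_points
    count = slope * n if is_crowd else 0.0
    return count, is_crowd

# A dense patch yields a count; a sparse one is rejected as non-crowd.
count, ok = interest_point_count(np.zeros((40, 2)), slope=0.25)
```

This is the piece that lets the system discard the false positives of the other two sources on sky, buildings, and trees.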
**Fusion of Three Sources:**
For learning and fusion at the patch level, we densely sample overlapping patches from the training images and, using the annotations, obtain ground-truth counts for the corresponding patches. After computing counts and confidences from the three sources, we scale the individual features and regress them against the annotated counts using support vector regression (SVR).
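A compact sketch of the fusion step under stated assumptions: per-patch features (scaled counts and confidences from the three sources) are regressed against the annotated counts. The paper uses SVR; plain least squares stands in here to keep the example dependency-free:

```python
import numpy as np

def fit_fusion(features, counts):
    """Fit a linear map from per-patch source features to patch counts.

    features: (P, F) matrix of scaled counts/confidences from the three
    sources; counts: (P,) ground-truth counts from the annotations.
    The paper regresses with SVR; ordinary least squares is a stand-in.
    """
    X = np.hstack([features, np.ones((len(features), 1))])  # add bias term
    w, *_ = np.linalg.lstsq(X, counts, rcond=None)
    return w

def predict_fusion(w, features):
    X = np.hstack([features, np.ones((len(features), 1))])
    return X @ w

# Synthetic training data whose counts are an exact linear function of
# two features, so the fit should recover them perfectly.
F = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 3.0]])
y = 2 * F[:, 0] + 3 * F[:, 1] + 1
w = fit_fusion(F, y)
```

In the real pipeline the regressor is trained once and then applied to every patch of a test image to produce the per-patch counts that the MRF stage smooths.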
**Counting in Images:**
Now that we know the number of people in each patch, our goal is to count the whole image. Since density varies smoothly across the image, we impose smoothness among the counts from different patches by placing them in a multi-scale MRF framework with a grid structure.
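A single-scale sketch of the smoothing idea (the paper uses a multi-scale MRF; this quadratic, Jacobi-iteration version only illustrates how a data term and a neighbor-smoothness term trade off on the patch grid):

```python
import numpy as np

def smooth_counts(counts, lam=1.0, iters=200):
    """Smooth a grid of patch counts with a quadratic MRF-style energy:

        sum_i (x_i - counts_i)^2 + lam * sum_{i~j} (x_i - x_j)^2

    over 4-connected neighbors, minimized by iterative Jacobi updates.
    Edge padding duplicates boundary values, a small approximation.
    """
    x = counts.astype(float).copy()
    for _ in range(iters):
        padded = np.pad(x, 1, mode='edge')
        nb_sum = (padded[:-2, 1:-1] + padded[2:, 1:-1] +
                  padded[1:-1, :-2] + padded[1:-1, 2:])
        # Each update is a convex combination of the data and the neighbors.
        x = (counts + lam * nb_sum) / (1.0 + 4.0 * lam)
    return x

# A single outlier patch count gets pulled toward its neighbors.
c = np.ones((5, 5))
c[2, 2] = 10.0
sm = smooth_counts(c)
```

Larger `lam` trusts the neighbors more and the per-patch estimates less; the multi-scale version in the paper applies the same idea across coarser grids as well.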
The result after smoothing is shown in the figure below:
**Dataset:**
Finally, the paper introduces a new dataset, collected from publicly available web images. It consists of 50 images with counts ranging from 94 to 4543, for an average of 1280 individuals per image.
The scenes in these images also belong to a diverse set of events: concerts, protests, stadiums, marathons, and pilgrimages.
The dataset, however, is not given a name in the paper.