There are two main challenges in crowd counting:
large head scale variations caused by camera perspective, and diverse crowd distributions in scenes with heavy background noise.
Faced with these problems, we propose a model named SFANet to solve both.
**
Abstract:
**
The proposed SFANet contains two main components: a VGG backbone convolutional neural network (CNN) as the front-end feature map extractor, and a dual-path multi-scale fusion network as the back-end to generate density maps. The two paths of the fusion network have the same structure: one path generates an attention map by highlighting crowd regions in images, and the other fuses multi-scale features together with the attention map to produce the final high-quality, high-resolution density maps.
**
Contribution:
**
- We design a multi-scale fusion network architecture that fuses feature maps from multiple layers, making the network more robust to head scale variation and background noise while also generating high-resolution density maps.
- We incorporate an attention model into the network by adding a second multi-scale feature fusion path as the attention map path, which makes the proposed method focus on head regions for the density map regression task, thereby improving its robustness to complex backgrounds and diverse crowd distributions.
- We propose a novel multi-task training loss, combining a Euclidean loss and an attention map loss, to make the network converge faster and perform better. The former minimizes the pixel-wise error, while the latter focuses on locating the head regions.
**
Architecture:
**
The architecture is made up of three paths:
- feature map extractor (FME)
- density map path (DMP)
- attention map path (AMP)
FME:
Our network adopts the first 13 layers of VGG16-bn as the front-end feature map extractor (FME) to extract multi-scale feature maps carrying different levels of semantic information and different scales of spatial information.
The low-level, small-scale features capture fine edge patterns, which are essential for regressing the values of congested regions in the density map.
The high-level, large-scale features carry semantic information useful for eliminating background noise, so we use both together.
DMP:
We design a multi-scale feature fusion path as the density map path (DMP) to combine the advantages of features from different levels. Another benefit of the multi-scale fusion structure is that we can obtain a high-resolution density map through upsampling operations.
One remaining issue is that the DMP regresses every density map pixel without explicitly paying more attention to head regions during training and testing; as a result, the output still suffers from high background noise.
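One fusion step of such a path can be sketched as follows: upsample the deeper feature map to the shallower one's resolution, concatenate, and merge with convolutions. The channel sizes and the 1x1+3x3 merge are assumptions for illustration, not the paper's exact configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FuseBlock(nn.Module):
    """One multi-scale fusion step of the density map path (DMP):
    upsample the deeper features, concatenate with the shallower
    ones, and merge with 1x1 + 3x3 convolutions."""
    def __init__(self, deep_ch, shallow_ch, out_ch):
        super().__init__()
        self.merge = nn.Sequential(
            nn.Conv2d(deep_ch + shallow_ch, out_ch, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, deep, shallow):
        # Bilinear upsampling doubles the deep map to the shallow map's size,
        # which is how the fused output keeps a higher resolution.
        deep = F.interpolate(deep, size=shallow.shape[2:],
                             mode="bilinear", align_corners=False)
        return self.merge(torch.cat([deep, shallow], dim=1))
```

Chaining such blocks from the deepest VGG stage back toward the shallower ones yields the high-resolution density features described above.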
AMP:
To further tackle the high background noise issue, we add an attention map path (AMP) with the same structure as the DMP to learn a probability map indicating likely head regions. This attention map is used to suppress non-head regions in the last feature maps of the DMP, which makes the DMP focus on the regression task only in high-probability head regions.
We also introduce a multi-task loss by adding an attention map loss for the AMP, which improves network performance through a more explicit supervision signal.
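The gating mechanism can be sketched as below: a sigmoid turns the AMP output into per-pixel head probabilities, which then multiply the last DMP feature maps element-wise before the final density regression. The 1x1 heads and channel count are assumptions for illustration:

```python
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    """Apply the AMP's attention map to the DMP's last feature maps:
    sigmoid(attention logits) gates the density features element-wise,
    suppressing non-head regions before the final regression layer."""
    def __init__(self, feat_ch):
        super().__init__()
        self.att_head = nn.Conv2d(feat_ch, 1, kernel_size=1)  # AMP: attention logits
        self.den_head = nn.Conv2d(feat_ch, 1, kernel_size=1)  # DMP: density regression

    def forward(self, dmp_feat, amp_feat):
        attention = torch.sigmoid(self.att_head(amp_feat))  # head-region probability
        gated = dmp_feat * attention                        # suppress background
        density = self.den_head(gated)
        return density, attention
```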
**
Loss Function:
**
The Euclidean loss measures the estimation error at the pixel level, and is defined as follows:
The attention map loss is a binary cross-entropy loss, defined as:
α is a weighting factor, set to 0.1 in the experiments.
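The equations themselves are not reproduced in these notes; a hedged sketch of the combined loss, assuming the standard pixel-wise MSE and binary cross-entropy forms the text describes:

```python
import torch
import torch.nn.functional as F

def sfanet_loss(pred_density, gt_density, pred_attention, gt_attention, alpha=0.1):
    """Multi-task loss: Euclidean (pixel-wise MSE) loss on the density
    map plus an alpha-weighted binary cross-entropy loss on the
    attention map. A sketch assuming the standard forms of both terms;
    pred_attention is assumed to already be a probability in (0, 1)."""
    l_den = F.mse_loss(pred_density, gt_density)                   # Euclidean loss
    l_att = F.binary_cross_entropy(pred_attention, gt_attention)   # attention map loss
    return l_den + alpha * l_att
```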
**
Attention map groundtruth:
**
Based on the density map ground truth, we again use a Gaussian kernel to compute the attention map ground truth as follows:
where th is a threshold set to 0.001 in our experiments. With Equations 7 and 8, we obtain a binary attention map ground truth that guides the AMP to focus on the head regions and their surroundings. In the experiments, we set µ = 3 and ρ = 2 for generating the attention map ground truth.
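A minimal sketch of the generation step, assuming it amounts to Gaussian-smoothing the density ground truth and thresholding at th = 0.001 (the kernel parameter `sigma` below is a stand-in for the paper's kernel setting, which these notes do not fully specify):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def attention_groundtruth(density_gt, sigma=3.0, th=0.001):
    """Binary attention map ground truth: re-smooth the density map
    ground truth with a Gaussian kernel, then threshold it so head
    regions and their surroundings become 1 and the background 0.
    th = 0.001 as stated in the text; sigma is an assumed kernel width."""
    smoothed = gaussian_filter(density_gt, sigma)
    return (smoothed > th).astype(np.float32)
```

Thresholding the smoothed map (rather than the raw head annotations) is what makes the mask cover a small neighborhood around each head, not just the annotated point.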