Literature Reading: Crowded Scenes

@[toc]
  
<img src="" width="80%"/>
<center>centered caption</center>
## title
-1- 0118,
### Problem & Background
- xxx
- xxx
### Method
- xxx
- xxx
### Paper Details
#### xxx
### Selected Experiments
#### xxx
### Terminology
- brand-new term
    - xxx ?
- new term
    - xxx ?
### TODO & Related Work

## IterDet: Iterative Scheme for Object Detection in Crowded Environments

- 【1】
- 0118,
- Index Terms: pedestrian detection field, Crowded occlusion Scenes, CrowdHuman datasets, WiderPerson datasets

### Problem & Background

- designed for crowded environments;
- predicted boxes are redundant and highly overlapping: numerous predicted boxes containing almost identical image content;
- In all variants of NMS there is always a trade-off between precision and recall, as one needs to both remove redundant detections of the same object and preserve the hard-to-detect occluded objects;
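A minimal sketch of that trade-off, assuming the standard greedy NMS algorithm (toy boxes, not from the paper): two heavily overlapping pedestrians plus one duplicate detection of the first.

```python
import numpy as np

def iou(box_a, box_b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_thr):
    """Greedy NMS: keep the highest-scoring box, drop overlaps above iou_thr."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        order = np.array([j for j in order[1:] if iou(boxes[i], boxes[j]) < iou_thr])
    return keep

# An occluded second person (box 1) and a duplicate detection (box 2):
boxes = np.array([[0, 0, 10, 20], [1, 0, 11, 20], [2, 0, 12, 20]], dtype=float)
scores = np.array([0.9, 0.85, 0.5])
```

With `iou_thr=0.5`, only box 0 survives, so the occluded neighbor is suppressed together with the duplicate; with `iou_thr=0.9`, all three boxes survive, duplicate included. No single threshold gives both high precision and high recall.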

### Method

- develop an alternative iterative scheme (IterDet) for object detection, where a new subset of objects is detected at each iteration;
- Rather than detecting all objects in the image simultaneously, our scheme produces detection results over iterations. At each iteration, a new subset of objects is detected. Detected boxes from previous iterations are passed to the network at subsequent iterations to avoid repetitive detections (so that the same object is not detected twice);
- Instead of using LSTM memory for storing the information about previously detected objects, we explicitly provide it to the network in the form of object masks; such an approach guarantees that no previously detected bounding boxes are accidentally forgotten.

### Paper Details

How is the history map obtained? What is the iterative scheme $IterDet(D')$?

In the case of a ResNet-like backbone,

  1. Compute the history image $H_{xy}$ according to Eq. (1);
  2. Pass $H_{xy}$ through one convolution layer (with 64 filters of size 3 and stride 2) to obtain the history map $H$;
  3. Pass the original image through the first convolution layer (with 64 filters of size 7 and stride 2), the Batch Normalization layer and the ReLU activation layer of the backbone to obtain feature map $Feat_1$;
  4. Sum $H$ and $Feat_1$, then feed the result into the second layer of the backbone;
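A minimal PyTorch sketch of these four steps. My assumption: Eq. (1) simply counts, per pixel, how many already-detected boxes cover it; the paper's exact form may differ, and the module names here are hypothetical.

```python
import torch
import torch.nn as nn

def history_image(boxes, height, width):
    """Step 1 (assumed reading of Eq. (1)): H[x, y] counts how many
    previously detected boxes cover pixel (x, y); boxes are (x1, y1, x2, y2)."""
    H = torch.zeros(1, 1, height, width)
    for x1, y1, x2, y2 in boxes:
        H[0, 0, int(y1):int(y2), int(x1):int(x2)] += 1.0
    return H

class HistoryAwareStem(nn.Module):
    """ResNet-like stem that sums the history map into the first feature map."""
    def __init__(self):
        super().__init__()
        # step 3: standard ResNet stem for the image
        self.conv1 = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False)
        self.bn1 = nn.BatchNorm2d(64)
        self.relu = nn.ReLU(inplace=True)
        # step 2: 64 filters of size 3, stride 2 for the history image
        self.hist_conv = nn.Conv2d(1, 64, kernel_size=3, stride=2, padding=1)

    def forward(self, image, boxes):
        feat1 = self.relu(self.bn1(self.conv1(image)))                            # step 3
        H = self.hist_conv(history_image(boxes, image.shape[2], image.shape[3]))  # steps 1-2
        return feat1 + H                                                          # step 4

stem = HistoryAwareStem()
out = stem(torch.randn(1, 3, 64, 64), [(8, 8, 32, 48)])
```

Both branches use stride 2, so the history map and $Feat_1$ land at the same spatial resolution and can be summed elementwise.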
<img src="2021118_IterDet_01.jpg" width="80%"/>
<img src="2021118_IterDet_02.jpg" width="80%"/>
Training procedure of $IterDet(D')$
<img src="2021118_IterDet_03.jpg" width="80%"/>
Intuitively, why does IterDet work?
- The detector at each iteration only needs to guarantee high precision; detecting a subset of objects is enough.
    Since the proposed scheme is iterative, at each iteration we need to detect only a subset of objects. Therefore, we are free to miss the more difficult objects at the first iteration, since missed objects can be detected later on. At each iteration, we can set a higher IoU threshold in the NMS step to suppress more predicted boxes and guarantee high precision, without having to guarantee high recall.
- During training, randomly splitting the gt bboxes into two subsets and predicting only $B_{new}$ forces the model to exploit the history and predict only new objects at each iteration of inference.
- During training, sampling different combinations of $B_{old}$ and $B_{new}$ provides an additional source of augmentation;
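A sketch of that training-time split (the helper name is hypothetical; the paper only describes the random split into $B_{old}$ and $B_{new}$):

```python
import random

def split_targets(gt_boxes, rng=random):
    """Randomly split ground-truth boxes into B_old (rendered into the
    history map as 'already detected') and B_new (the only supervised set)."""
    boxes = list(gt_boxes)
    rng.shuffle(boxes)
    k = rng.randint(0, len(boxes))      # any split size, including an empty history
    b_old, b_new = boxes[:k], boxes[k:]
    return b_old, b_new

gt = [(0, 0, 10, 20), (5, 0, 15, 20), (30, 0, 40, 20)]
b_old, b_new = split_targets(gt)
# b_old feeds the history map; the detection loss is computed on b_new only.
```

Because a fresh split is drawn every iteration, the same image yields many different (history, target) pairs, which is exactly the extra augmentation the bullet above refers to.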
Implementation details of Experiments
<img src="2021118_IterDet_04.jpg" width="80%"/>

### Selected Experiments

Experimental results on CrowdHuman dataset
<img src="2021118_IterDet_05.jpg" width="80%"/>

  Compared with the Baseline [17], IterDet under RetinaNet improves AP but lowers Recall; which objects can no longer be recalled by IterDet? Under Faster R-CNN, both AP and Recall improve.
  In my opinion, the AP gain mainly comes from fewer false positives, and the extra iteration recalls more TPs; as Figure 3 shows, in cases when objects significantly overlap, the second iteration indeed helps to recover many occluded objects.

### Terminology

- brand-new term
    - xxx ?
- new term
    - xxx ?

### TODO & Related Work

## PS-RCNN: Detecting Secondary Human Instances in a Crowd via Primary Object Suppression

- 【1】
- 0119,
- Index Terms: pedestrian detection field, Human Body Detection, Crowded occlusion Scenes, R-CNN, CrowdHuman datasets, WiderPerson datasets

### Problem & Background

- when detecting human bodies in highly crowded scenarios with various gestures, detectors may generate more false positives because of the heavy overlaps between human instances. (Thus mMR usually yields extremely high score thresholds when dealing with crowded scenes.)
- In this work, we call those slightly/none occluded human instances "Primary Objects", and those heavily occluded human instances "Secondary Objects", and abbreviate them to P-Objects and S-Objects, respectively. There are two reasons leading to the poor detection performance on S-Objects.
    - First, visual cues for S-Objects are weak compared to P-Objects, which makes them hard to distinguish from backgrounds or other human bodies. Besides, the remarkable difference in visual features between P-Objects and S-Objects poses a great challenge for a single detector in handling both of them.
    - Second, S-Objects usually share large overlaps with other human instances, and thus they are more easily treated as duplicate detections and then suppressed during post-processing (i.e. NMS).

### Method

- introduce a variant of two-stage detectors called PS-RCNN.
    - PS-RCNN first detects slightly/none occluded objects with an R-CNN [1] module (referred to as P-RCNN), and then suppresses the detected instances with human-shaped binary masks so that the features of heavily occluded instances can stand out.
    - After that, PS-RCNN utilizes another R-CNN module specialized in heavily occluded human detection (referred to as S-RCNN) to detect the remaining objects missed by P-RCNN.
    - The final results are the ensemble of the outputs from these two R-CNNs.
- Moreover, we introduce a High Resolution RoI Align (HRRA) module to retain as much of the fine-grained features of the visible parts of heavily occluded humans as possible.

### Paper Details

the framework of PS-RCNN
<img src="2021119_PS-RCNN_01.jpg" width="80%"/>

We train P-RCNN with all the ground-truth instances, while training S-RCNN using only the instances missed by P-RCNN, to make it more complementary to P-RCNN. Formally speaking,

  1. Define $G = \{bbox_1, bbox_2, ..., bbox_n\}$ as all the ground-truth bounding boxes for image $I$. In the training phase, $G$ is used as the set of ground-truth objects for P-RCNN.
  2. The instances detected by P-RCNN are defined as $G_d$, and the missed instances are defined as $G_m$. After $G_d$ is detected, a human-shaped binary mask is placed on the feature maps at the ground-truth location of each detected bbox in $G_d$, as shown in Fig. 2.
  3. After binary masking, the visual cues for $G_d$ should be invisible to S-RCNN. Thus only the missed instances $G_m$ are set as ground-truth objects to train S-RCNN.
    In this way, the two R-CNNs handle two separate sets of human instances, in a divide-and-conquer manner.
  4. The final outputs of PS-RCNN are the union of the results from the two R-CNN modules.
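The inference side of these steps can be sketched as follows. This is a toy illustration: rectangular masks stand in for the paper's human-shaped masks, and trivial callables stand in for the real R-CNN heads.

```python
import numpy as np

def suppress_primaries(features, boxes):
    """Erase detected primary instances from the feature maps.
    (Rectangular masks here; the paper uses human-shaped binary masks,
    optionally refined by an instance-segmentation branch.)"""
    out = features.copy()
    for x1, y1, x2, y2 in boxes:
        out[:, y1:y2, x1:x2] = 0.0
    return out

def ps_rcnn(features, p_rcnn, s_rcnn):
    """PS-RCNN inference as described above."""
    primary = p_rcnn(features)                       # 1) easy objects on raw features
    masked = suppress_primaries(features, primary)   # 2) erase them
    secondary = s_rcnn(masked)                       # 3) occluded objects on masked features
    return primary + secondary                       # 4) union of both detection sets

# Toy stand-in: a "detector" that reports any candidate box whose
# region still contains feature energy.
candidates = [(0, 0, 4, 8), (2, 0, 6, 8)]            # two overlapping humans

def toy_detector(feats):
    return [b for b in candidates if feats[:, b[1]:b[3], b[0]:b[2]].sum() > 1.0]

feats = np.ones((1, 8, 8))
result = ps_rcnn(feats, lambda f: [(0, 0, 4, 8)], toy_detector)
```

After the first box is erased, only the second candidate's region retains energy, so S-RCNN recovers the occluded instance instead of re-detecting the primary one.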
Intuitively, why does PS-RCNN work?
- P-RCNN operates on the original features, while S-RCNN operates on the modified features;
- introduce a primary instance binary mask which effectively erases the primary instance, such that the weak feature of the occluded secondary instance can stand out.
- since each R-CNN module is only responsible for detecting one kind of human instance (slightly/none or heavily occluded instances), the individual tasks of both primary object detection and secondary object detection can be improved.
- To further improve the visibility of S-Objects, we introduce the High Resolution RoI Align (HRRA) module, which only extracts features from the layer with the highest resolution.
Why does S-RCNN only use the feature map at the highest-resolution level?

When extracting RoI features for S-RCNN, we only extract features from the feature maps at the highest resolution level. The introduction of the HRRA module is based on the observation that though the full body of an S-Object is usually large, its informative visible region can be extremely small. Directly assigning S-Objects to low-resolution layers according to their full-body scales would make the informative visible regions become even smaller and harder to identify. Therefore, we propose to perform RoI extraction of S-Objects from feature maps at higher resolution.
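For contrast, a sketch of scale-based level assignment versus HRRA. The scale heuristic is the standard one from the FPN paper, with hypothetical level bounds; the paper itself only specifies that S-Object RoIs always come from the highest-resolution level.

```python
import math

def fpn_level(box, k0=4, canonical=224, k_min=2, k_max=5):
    """Standard FPN heuristic: assign an RoI to pyramid level
    k = floor(k0 + log2(sqrt(w*h) / 224)), clamped to [k_min, k_max]."""
    x1, y1, x2, y2 = box
    k = math.floor(k0 + math.log2(math.sqrt((x2 - x1) * (y2 - y1)) / canonical))
    return max(k_min, min(k_max, k))

def hrra_level(box, k_min=2):
    """HRRA: always pool S-Object RoIs from the highest-resolution
    level, regardless of full-body scale."""
    return k_min

large_s_object = (0, 0, 300, 600)   # large full body, but tiny visible part
```

For this box the scale heuristic picks a coarse level (4), where a small visible region shrinks to a few cells; HRRA instead pools from level 2, the finest map, preserving the visible-part features.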

On the human-shaped binary mask

The quality of the human-shaped binary masks is crucial. An imperfect mask which leaves parts of a primary instance uncovered will lead to plenty of duplicate detection results in the S-RCNN detection stage, bringing about lots of false positives. A hand-crafted human-shaped mask is adequate in most cases. However, it cannot handle complicated cases where a human body performs some uncommon gesture, such as dancing or bending their arms on others' shoulders, etc.

Incorporating instance segmentation to acquire more accurate instance masks can alleviate this issue to some extent. Thus, beyond the hand-crafted binary masks, we propose another enhanced version of PS-RCNN, which incorporates an instance segmentation branch after the P-RCNN module to obtain instance masks for P-Objects. The instance segmentation branch is trained for an extra 9 epochs on COCOPerson and can easily be plugged into the current PS-RCNN. According to our experiments, S-RCNN does not need to be fine-tuned after the introduction of the instance branch.

### Selected Experiments

Experiments on CrowdHuman
<img src="2021119_PS-RCNN_02.jpg" width="80%"/>

As can be seen in Table 2, without HRRA for S-RCNN, the improvement on recall is only 2.05%, because the visual cues of S-Objects are very weak. After adopting HRRA, PS-RCNN brings 1.03% and 3.15% improvements on AP and recall, respectively.

### Terminology

- brand-new term
    - xxx ?
- new term
    - xxx ?

### TODO & Related Work

- Occlusion-aware R-CNN: detecting pedestrians in a crowd, ECCV_2018
- Occluded pedestrian detection through guided attention in CNNs, CVPR_2018
- Mask-guided attention network for occluded pedestrian detection, ICCV_2019