TITLE: Detect to Track and Track to Detect
AUTHOR: Christoph Feichtenhofer, Axel Pinz, Andrew Zisserman
ASSOCIATION: Graz University of Technology, University of Oxford
FROM: arXiv:1710.03958
CONTRIBUTION
- A ConvNet architecture is set up for simultaneous detection and tracking, using a multi-task objective for frame-based object detection and across-frame track regression.
- Correlation features that represent object co-occurrences across time are introduced to aid the ConvNet during tracking.
- Frame-level detections are linked into high-accuracy video-level detections based on across-frame tracklets.
METHOD
For frame-level detections, this work adopts R-FCN as the base framework to detect objects in a single frame. Inter-frame correlation features are extracted from the feature maps of the two frames. A multi-task loss covering localization, classification and displacement is used to train the network. The workflow is shown in the following figure.
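The key innovation of this work is an operation denoted as ROI tracking. The input of this operation is the bounding box regression features of the two frames, stacked with correlation features computed between their feature maps $x_l^t$ and $x_l^{t+\tau}$:
$$x_{corr}^{t,t+\tau}(i,j,p,q) = \left\langle x_l^{t}(i,j),\; x_l^{t+\tau}(i+p,\,j+q) \right\rangle$$
where $-d \le p \le d$ and $-d \le q \le d$ are offsets to compare features in a square neighbourhood around the locations $i, j$, and $d$ is the maximum displacement.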
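The neighbourhood comparison described above can be sketched in a few lines of plain Python. This is only an illustration, not the authors' implementation: the function name `local_correlation`, the `[H][W][C]` nested-list layout, and the zero padding at the borders are all assumptions of this sketch.

```python
def local_correlation(feat_t, feat_t_tau, d):
    """Correlate two feature maps over a (2d+1) x (2d+1) neighbourhood.

    feat_t, feat_t_tau: nested lists of shape [H][W][C] (assumed layout).
    Returns corr[i][j][p + d][q + d], the dot product between the feature
    at (i, j) in the first map and the feature at (i + p, j + q) in the
    second map. Offsets that leave the map contribute 0 (assumed padding).
    """
    height, width = len(feat_t), len(feat_t[0])
    size = 2 * d + 1
    corr = [[[[0.0] * size for _ in range(size)] for _ in range(width)]
            for _ in range(height)]
    for i in range(height):
        for j in range(width):
            for p in range(-d, d + 1):
                for q in range(-d, d + 1):
                    ii, jj = i + p, j + q
                    if 0 <= ii < height and 0 <= jj < width:
                        corr[i][j][p + d][q + d] = sum(
                            a * b
                            for a, b in zip(feat_t[i][j], feat_t_tau[ii][jj])
                        )
    return corr


# Toy 2 x 2 map with 2 channels, correlated with itself.
toy = [[[1.0, 2.0], [0.0, 1.0]],
       [[3.0, 0.0], [1.0, 1.0]]]
toy_corr = local_correlation(toy, toy, d=1)
```

Each spatial location ends up with (2d+1)^2 correlation values, one per offset, so a small `d` keeps the output compact while still covering the local displacements an object can make between nearby frames.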
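The loss function is written as
$$L(\{p_i\},\{b_i\},\{\Delta_i\}) = \frac{1}{N}\sum_{i=1}^{N} L_{cls}(p_{i,c^*}) + \lambda \frac{1}{N_{fg}}\sum_{i=1}^{N} [c_i^* > 0]\, L_{reg}(b_i, b_i^*) + \lambda \frac{1}{N_{tra}}\sum_{i=1}^{N_{tra}} L_{tra}(\Delta_i^{t+\tau}, \Delta_i^{*,t+\tau})$$
where $L_{cls}$ is the classification loss over the $N$ RoIs, $L_{reg}$ is the box regression loss over the $N_{fg}$ foreground RoIs, and $L_{tra}$ is the track (displacement) regression loss over the $N_{tra}$ RoIs that have a ground-truth track correspondence.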
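As a rough illustration of how a multi-task loss with classification, localization and displacement terms can be combined, the sketch below sums three normalized terms. The function names, the argument layout, the use of cross-entropy and smooth L1, and the single weight `lam` are assumptions made for this example and are not taken from the text above.

```python
import math


def smooth_l1(pred, target):
    """Smooth L1 loss, summed over the coordinates of one box or track."""
    total = 0.0
    for p, t in zip(pred, target):
        diff = abs(p - t)
        total += 0.5 * diff * diff if diff < 1.0 else diff - 0.5
    return total


def multitask_loss(probs, labels, boxes, gt_boxes, tracks, gt_tracks, lam=1.0):
    """Combine classification, box regression and track regression terms.

    probs[i][c]  : softmax probability of RoI i for class c
    labels[i]    : ground-truth class of RoI i (0 is background, assumed)
    boxes[i]     : predicted box regression targets of RoI i
    tracks[k]    : predicted across-frame displacement of tracked RoI k
    """
    n = len(probs)
    # Cross-entropy on the ground-truth class, averaged over all RoIs.
    cls = sum(-math.log(probs[i][labels[i]]) for i in range(n)) / n
    # Box regression only for foreground RoIs.
    fg = [i for i in range(n) if labels[i] > 0]
    reg = sum(smooth_l1(boxes[i], gt_boxes[i]) for i in fg) / max(len(fg), 1)
    # Track regression only for RoIs that have a ground-truth track.
    tra = sum(smooth_l1(d, g) for d, g in zip(tracks, gt_tracks))
    tra /= max(len(tracks), 1)
    return cls + lam * reg + lam * tra


demo_loss = multitask_loss(
    probs=[[0.5, 0.5], [0.25, 0.75]],
    labels=[0, 1],
    boxes=[[0.0, 0.0, 0.0, 0.0], [0.5, 0.0, 0.0, 2.0]],
    gt_boxes=[[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]],
    tracks=[[0.0, 0.0, 0.0, 0.0]],
    gt_tracks=[[0.0, 0.0, 0.0, 0.0]],
)
```

Normalizing each term by its own count keeps the regression terms from vanishing when only a few RoIs are foreground or tracked.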
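A class-wise linking score is defined to combine detections and tracks across time
$$s_c(D_{i,c}^{t}, D_{j,c}^{t+\tau}, T^{t,t+\tau}) = p_{i,c}^{t} + p_{j,c}^{t+\tau} + \phi(D_i^{t}, D_j^{t+\tau}, T^{t,t+\tau})$$
where the pairwise term $\phi$ evaluates to 1 if the IoU overlap of the track correspondences $T^{t,t+\tau}$ with the detection boxes $D_i^{t}$, $D_j^{t+\tau}$ is larger than 0.5, and to 0 otherwise. $p_{i,c}^{t}$ and $p_{j,c}^{t+\tau}$ are the softmax probabilities for class $c$. The optimal path across a video can be found by maximizing the scores over the duration of the video.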
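The linking step can be sketched as a pairwise score followed by a dynamic-programming search for the best path. This is a minimal sketch under stated assumptions, not the authors' code: boxes are `(x1, y1, x2, y2)` tuples, a detection is a `(box, prob)` pair for one class, a track is a pair of boxes (one per frame), consecutive frames are linked (a temporal stride of 1), every frame has at least one detection, and the names `iou`, `link_score` and `best_path` are hypothetical. The pairwise term is read here as requiring both ends of a track to overlap their detections by more than 0.5.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def link_score(det_a, det_b, tracks, thresh=0.5):
    """Sum of both class probabilities plus 1 if some track joins the boxes."""
    (box_a, prob_a), (box_b, prob_b) = det_a, det_b
    joined = any(iou(start, box_a) > thresh and iou(end, box_b) > thresh
                 for start, end in tracks)
    return prob_a + prob_b + (1.0 if joined else 0.0)


def best_path(frames, tracks_between):
    """Viterbi-style search for the highest-scoring chain of detections.

    frames[t]         : list of (box, prob) detections in frame t
    tracks_between[t] : list of (box_t, box_t_plus_1) tracks
    Returns (score, indices) with one detection index per frame.
    """
    best = [0.0] * len(frames[0])
    back = []
    for t in range(len(frames) - 1):
        new_best, pointers = [], []
        for det_b in frames[t + 1]:
            cands = [best[i] + link_score(det_a, det_b, tracks_between[t])
                     for i, det_a in enumerate(frames[t])]
            k = max(range(len(cands)), key=cands.__getitem__)
            new_best.append(cands[k])
            pointers.append(k)
        best = new_best
        back.append(pointers)
    j = max(range(len(best)), key=best.__getitem__)
    score, path = best[j], [j]
    for pointers in reversed(back):
        j = pointers[j]
        path.append(j)
    path.reverse()
    return score, path


frames = [
    [((0, 0, 10, 10), 0.6), ((50, 50, 60, 60), 0.9)],
    [((1, 1, 11, 11), 0.7), ((80, 80, 90, 90), 0.8)],
]
tracks = [[((0, 0, 10, 10), (1, 1, 11, 11))]]
score, path = best_path(frames, tracks)
```

In this toy input the track between the two lower-confidence boxes lifts their link above the link between the two higher-confidence but untracked boxes, which is the effect the pairwise term is meant to have.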