Human-object interaction prediction in videos through gaze following

虔诚的码农

Last modified 2023-08-03 14:22:12

Category: Literature Reading Notes. Tags: object detection, computer vision

First published 2023-07-30 11:41:32

Article link: https://blog.csdn.net/weixin_46179086/article/details/131750671

Abstract
Overview of the video-based HOI detection and anticipation framework.
Experiments
Comments

Paper link
Code link

Abstract

Video-based HOI anticipation in the third-person view has rarely been studied. This paper proposes a framework that detects current HOIs and anticipates future HOIs in videos. Since people often fixate on an object before interacting with it, the model fuses gaze features with scene contexts and the visual appearances of human–object pairs through a spatio-temporal transformer. In addition, a set of person-wise multi-label metrics is proposed to evaluate the model on the HOI anticipation task in multi-person scenarios.

Overview of the video-based HOI detection and anticipation framework.

The framework consists of three modules:

Object Module
- The object module detects bounding boxes of humans $\{b^s_{t,i}\}$ and objects $\{b_{t,j}\}$, and recognizes object classes $\{c_{t,j}\}$. An object tracker obtains human and object trajectories ($\{\textbf{H}_i\}$ and $\{\textbf{O}_j\}$) in the video. Then, the human visual features $\{v^s_{t,i}\}$, object visual features $\{v_{t,j}\}$, visual relation features $\{v_{t,\langle i,j\rangle}\}$, and spatial relation features $\{m_{t,\langle i,j\rangle}\}$ are extracted through a feature backbone. In addition, a word embedding model is applied to generate semantic features $\{s_{t,j}\}$ of the object class.
Gaze Module
- The gaze module detects heads $\{b^h_{t,k}\}$ in RGB frames, assigns them to detected humans, and generates a gaze feature map $\{g_{t,i}\}$ for each human using a gaze-following model.
Spatio-temporal Module
- Next, all features in a frame are projected by an input embedding block.
- The human-object pair features are concatenated into a sequence of pair representations $X_t$, which are refined to $X_t^{sp}$ by a spatial encoder.
- The spatial encoder also extracts a global context feature $c_t$ from each frame. Then, the global features $\{c_t\}$ and projected human gaze features $\{g'_{t,i}\}$ are concatenated to build the person-wise sliding windows of context features.
- Several instance-level sliding windows are constructed, each containing only the refined pair representations of one unique human–object pair across time, $\left[x^{sp}_{t-L+1,\langle i,j\rangle},\dots,x^{sp}_{t,\langle i,j\rangle}\right]$.
- A temporal encoder fuses context knowledge into the pair representations by the cross-attention mechanism.
- Finally, the prediction heads estimate the probability distribution $z_{t,\langle i,j\rangle}$ of interactions for each human–object pair based on the last entry $x^{tmp}_{t,\langle i,j\rangle}$ of the temporal encoder output.
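The instance-level sliding windows described in the list above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `pair_windows`, the per-frame dictionary layout, and the choice to skip frames in which a pair is absent are hypothetical and not taken from the paper.

```python
def pair_windows(frames, t, window):
    """Build one sliding window per human-object pair visible at frame t.

    frames: list indexed by time; each entry maps a (human_id, object_id)
            pair to its refined pair representation (any Python object here).
    window: maximum window length L.
    Assumption: frames where the pair is missing are simply skipped.
    """
    start = max(0, t - window + 1)
    windows = {}
    for pair in frames[t]:
        windows[pair] = [
            frames[s][pair] for s in range(start, t + 1) if pair in frames[s]
        ]
    return windows


# Toy example: strings stand in for feature vectors.
frames = [
    {(0, 1): "a0"},
    {(0, 1): "a1", (0, 2): "b1"},
    {(0, 1): "a2", (0, 2): "b2"},
]
windows = pair_windows(frames, t=2, window=2)
```

Each window holds a single pair, so the temporal encoder never mixes representations of different human-object pairs within one sequence.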

Object module

The object module takes a sequence of RGB video frames as input and detects bounding boxes and classes for objects in each frame, including bounding boxes for humans.
An object tracker associates current detections with past ones to obtain trajectories for human and object bounding boxes. This allows analyzing each unique human-object pair.
Visual features are extracted for each box using a ResNet. Additional features are extracted for human-object pairs like visual relation features and spatial relation masks.
Object semantic features are generated from object categories using word embeddings, to reflect different likely interactions depending on object type.

Gaze module

The gaze-following method from Chong et al. (2020) is adopted to generate a gaze heatmap for each human.
A head detector is needed to identify human heads in the scene. Directly getting the head box from the human box can cause mismatches in some cases.
The gaze module first detects all heads in the full RGB frame. These are matched to human boxes using linear assignment.
An intersection over head (IoH) ratio is computed between each human box and head box. If the IoH ratio exceeds a threshold of 0.7, the head is shortlisted as a match for that human.
Finally, the gaze-following model combines the head information and the scene feature map using an attention mechanism. It then encodes the fused features and extracts temporal dependencies with a convolutional Long Short-Term Memory (Conv-LSTM) network to estimate the gaze heatmap.
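The head-to-human matching step can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the `(x1, y1, x2, y2)` box format, and the objective (maximizing the summed IoH over shortlisted pairs) are assumptions. The exhaustive search stands in for a proper linear-assignment solver and is only practical for a handful of boxes.

```python
from itertools import permutations


def ioh(human, head):
    """Intersection over head area; boxes are (x1, y1, x2, y2)."""
    ix1, iy1 = max(human[0], head[0]), max(human[1], head[1])
    ix2, iy2 = min(human[2], head[2]), min(human[3], head[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    head_area = (head[2] - head[0]) * (head[3] - head[1])
    return inter / head_area if head_area > 0 else 0.0


def match_heads(humans, heads, threshold=0.7):
    """Return {human_index: head_index} with each head used at most once.

    Only pairs whose IoH exceeds the threshold are shortlisted. The search is
    brute force; a real pipeline would typically call a Hungarian-style solver
    such as scipy.optimize.linear_sum_assignment instead.
    """
    scores = [[ioh(h, k) for k in heads] for h in humans]
    # Padding with None lets any human remain unmatched.
    slots = list(range(len(heads))) + [None] * len(humans)
    best, best_total = {}, -1.0
    for perm in permutations(slots, len(humans)):
        total, assignment = 0.0, {}
        for i, k in enumerate(perm):
            if k is not None and scores[i][k] > threshold:
                total += scores[i][k]
                assignment[i] = k
        if total > best_total:
            best, best_total = assignment, total
    return best


humans = [(0, 0, 100, 200), (90, 0, 200, 200)]
heads = [(30, 0, 70, 40), (120, 0, 160, 40)]
matches = match_heads(humans, heads)
```

A head that lies mostly outside every human box stays unassigned, which is the case the 0.7 threshold is meant to filter out.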

Spatial and temporal encoders

A spatial encoder exploits human-object relation representations from one frame to understand dependencies between appearances, spatial relations, and semantics. It extracts a global feature vector for each frame to represent contexts between all human-object pairs.
The spatial encoder takes human-object pair relations as input. A learnable global token is prepended to the pair sequence; after the stacked self-attention layers, this token summarizes the dependencies between pairs into a global feature vector, while the pair relation representations are refined.
The refined pair representations are concatenated into sequences for the temporal encoder. Unlike STTran, which processes all pairs jointly, each sequence here contains only one human-object pair.
The human’s gaze feature is concatenated with the global frame feature into a person-wise context feature sequence. This is fed to the temporal encoder along with the pair sequences.
Positional encodings are added to the entries of both the context and pair sequences, since the attention in the temporal encoder is otherwise unaware of their order. Sinusoidal encoding performs better than learned encoding.
The temporal encoder fuses contexts and pairs via cross-attention, capturing how the dependencies evolve over time in order to detect interactions.
Prediction heads generate probability distributions for interaction categories. Outputs are concatenated into the final model output.
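For reference, the sinusoidal positional encoding mentioned above can be written out in plain Python. The sketch follows the standard formulation from the original Transformer (sine on even dimensions, cosine on odd ones, base 10000). Whether the paper uses exactly this variant is an assumption, and the function name is hypothetical.

```python
import math


def sinusoidal_encoding(length, dim):
    """Return a length x dim table of sinusoidal positional encodings."""
    table = []
    for pos in range(length):
        row = []
        for i in range(dim):
            # Dimensions 2k and 2k+1 share the same frequency.
            angle = pos / (10000 ** (2 * (i // 2) / dim))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


# One encoding row per time step of a sliding window of length 4.
pe = sinusoidal_encoding(length=4, dim=6)
```

Each row would be added element-wise to the corresponding entry of a context or pair sequence before the temporal encoder.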

Experiments

Datasets

VidHOI dataset

VidHOI is currently the largest video dataset with complete HOI annotations. It uses keyframe-based annotations, with keyframes sampled at 1 frame per second (FPS). There are 78 object categories and 50 predicate classes.

Loss function

The VidHOI dataset is unbalanced, with a long-tailed interaction distribution. To address the imbalance and avoid over-emphasizing the most frequent classes in the dataset, the class-balanced (CB) Focal loss (Cui et al., 2019) is adopted as follows:
$CB_{focal}(p_i,y_i)=-\frac{1-\beta}{1-\beta^{n_i}}(1-p_{y_i})^\gamma\log{p_{y_i}}$
with $p_{y_{i}}=\left\{\begin{array}{ll} p_{i} & \text { if } y_{i}=1 \\ 1-p_{i} & \text { otherwise } \end{array}\right.$
The term $-(1-p_{y_i})^\gamma\log{p_{y_i}}$ is the Focal loss proposed in Lin et al. (2017), where $p_i$ denotes the estimated probability for the $i$-th class and $y_i \in \left\{0, 1\right\}$ is the ground-truth label. The variable $n_i$ denotes the number of ground-truth samples of the $i$-th class, and $\beta \in [0, 1)$ is a tunable parameter. The loss for one prediction is the mean of the losses over all classes.
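The loss formula can be checked numerically with a small pure-Python sketch. The function name and the default values of `beta` and `gamma` are illustrative assumptions, not the settings reported in the paper. Every class count is assumed to be at least 1, so that the weight is well defined.

```python
import math


def cb_focal_loss(probs, labels, class_counts, beta=0.9999, gamma=2.0):
    """Class-balanced focal loss for one multi-label prediction.

    probs:        estimated probability p_i for each class
    labels:       ground-truth label y_i in {0, 1} for each class
    class_counts: number of ground-truth samples n_i per class (assumed >= 1)
    """
    losses = []
    for p, y, n in zip(probs, labels, class_counts):
        p_y = p if y == 1 else 1.0 - p
        p_y = min(max(p_y, 1e-12), 1.0)  # guard against log(0)
        weight = (1.0 - beta) / (1.0 - beta ** n)
        losses.append(-weight * (1.0 - p_y) ** gamma * math.log(p_y))
    # Mean over classes gives the loss for one prediction.
    return sum(losses) / len(losses)


rare = cb_focal_loss([0.3], [1], [10])
frequent = cb_focal_loss([0.3], [1], [10000])
```

With the same prediction error, the class with fewer samples receives the larger loss, which is the re-weighting effect the class-balanced term is meant to provide. Setting `beta=0` makes the weight 1 and recovers the plain Focal loss.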

Action Genome dataset

Action Genome is another large-scale video dataset, containing 35 object categories and 25 interaction classes. However, only the HOIs of a single person are annotated in each video, even if more people appear. Moreover, the videos were recorded by volunteers performing pre-defined tasks. Thus, models designed on the Action Genome dataset may be less useful in the real world.

Evaluation metrics

A predicted HOI triplet is counted as a true positive if: (1) both the detected human and object bounding boxes overlap with the ground truth with intersection over union (IoU) > 0.5, (2) the predicted object class is correct, and (3) the predicted interaction is correct.
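The three conditions translate directly into a short check. The sketch below is an illustration only: the dictionary keys and the `(x1, y1, x2, y2)` box format are assumptions made for this example.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_true_positive(pred, gt, iou_threshold=0.5):
    """Apply conditions (1) to (3) to one predicted and one ground-truth triplet."""
    return (
        iou(pred["human_box"], gt["human_box"]) > iou_threshold
        and iou(pred["object_box"], gt["object_box"]) > iou_threshold
        and pred["object_class"] == gt["object_class"]
        and pred["interaction"] == gt["interaction"]
    )


gt = {"human_box": (0, 0, 10, 10), "object_box": (20, 0, 30, 10),
      "object_class": "cup", "interaction": "hold"}
good = dict(gt)
wrong_action = dict(gt, interaction="watch")
shifted = dict(gt, human_box=(5, 0, 15, 10))  # IoU with ground truth is 1/3
```

Here `good` satisfies all three conditions, while `wrong_action` fails condition (3) and `shifted` fails condition (1).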

The metric mAP is reported on the VidHOI dataset over three different HOI category sets: (1) Full: all 557 HOI triplet categories, (2) Rare: 315 categories with < 25 instances in the validation set, and (3) Non-rare: 242 categories with ≥ 25 instances in the validation set.

A set of person-wise multi-label top-$k$ metrics is proposed as additional evaluation metrics. For each frame, the detected human–object pairs are first assigned to the ground-truth pairs. Then, the top-$k$ triplets of each human are used to compute the metrics for that human. The final results are averaged over all humans in the dataset, without any frame-wise or video-wise averaging.
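The person-wise averaging can be illustrated with a small sketch. The notes do not list the individual metrics, so recall and precision at $k$ are used here purely as examples. The function names and the data layout are hypothetical, and the assignment of detected pairs to ground-truth pairs is assumed to have been done already.

```python
def person_topk(scored_triplets, gt_triplets, k):
    """Recall@k and precision@k for one human.

    scored_triplets: list of (triplet, confidence) predicted for this human
    gt_triplets:     set of ground-truth triplets for this human
    """
    ranked = sorted(scored_triplets, key=lambda item: item[1], reverse=True)
    top = [triplet for triplet, _ in ranked[:k]]
    hits = sum(1 for triplet in top if triplet in gt_triplets)
    recall = hits / len(gt_triplets) if gt_triplets else 0.0
    precision = hits / len(top) if top else 0.0
    return recall, precision


def average_over_humans(per_human, k):
    """Average directly over humans, with no per-frame or per-video mean."""
    results = [person_topk(scored, gt, k) for scored, gt in per_human]
    n = len(results)
    return (sum(r for r, _ in results) / n, sum(p for _, p in results) / n)


per_human = [
    ([(("hold", "cup"), 0.9), (("watch", "tv"), 0.4), (("sit", "chair"), 0.8)],
     {("hold", "cup"), ("sit", "chair")}),
    ([(("ride", "bike"), 0.7), (("hold", "bike"), 0.6)],
     {("hold", "bike")}),
]
recall_at_1, precision_at_1 = average_over_humans(per_human, k=1)
```

Because every human contributes one entry to the average, a frame with many people counts more than a frame with one person, which matches the description of averaging over humans rather than over frames or videos.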

All models are trained with ground-truth object trajectories. Models in Oracle mode are evaluated with ground-truth object bounding boxes, while models in Detection mode are evaluated with the output of an object detector.

Implementation details

In the object module, the YOLOv5 model (Jocher et al., 2022) is employed as the object detector. The weights are pre-trained on the COCO dataset (Lin et al., 2014) and fine-tuned on the VidHOI dataset.
The pre-trained DeepSORT model (Wojke et al., 2017) is applied as the human tracker, ResNet-101 (He et al., 2016) as the feature backbone, and the GloVe model (Pennington et al., 2014) for word embedding.
In the gaze module, YOLOv5 is also applied to detect heads in RGB frames. This model is pre-trained on the CrowdHuman dataset (Shao et al., 2018). The gaze-following method introduced in Chong et al. (2020), pre-trained on the VideoAttentionTarget dataset (Chong et al., 2020), is adopted to generate gaze features. All weights in the object module and gaze module are frozen during the training of the spatio-temporal transformer.

Result

[Figure: results; image not available]

Comments
