Human-object interaction prediction in videos through gaze following

Paper link
Code link

Abstract

The video-based HOI anticipation task in the third-person view is rarely researched. In this paper, a framework to detect current HOIs and anticipate future HOIs in videos is proposed. Since people often fixate on an object before interacting with it, the model fuses gaze features together with scene contexts and the visual appearances of human-object pairs through a spatio-temporal transformer. In addition, a set of person-wise multi-label metrics is proposed to evaluate the model on the HOI anticipation task in multi-person scenarios.

Overview of the video-based HOI detection and anticipation framework.

The framework consists of three modules:

  • Object Module
    • The object module detects bounding boxes of humans $\{b^s_{t,i}\}$ and objects $\{b_{t,j}\}$, and recognizes object classes $\{c_{t,j}\}$. An object tracker obtains human and object trajectories $\{\mathbf{H}_i\}$ and $\{\mathbf{O}_j\}$ in the video. Then, the human visual features $\{v^s_{t,i}\}$, object visual features $\{v_{t,j}\}$, visual relation features $\{v_{t,\langle i,j\rangle}\}$, and spatial relation features $\{m_{t,\langle i,j\rangle}\}$ are extracted through a feature backbone. In addition, a word embedding model is applied to generate semantic features $\{s_{t,j}\}$ of the object classes.
  • Gaze Module
    • The gaze module detects heads $\{b^h_{t,k}\}$ in RGB frames, assigns them to detected humans, and generates a gaze feature map $\{g_{t,i}\}$ for each human using a gaze-following model.
  • Spatial-temporal Module
    • Next, all features in a frame are projected by an input embedding block.
    • The human-object pair features are concatenated into a sequence of pair representations $X_t$, which is refined to $X_t^{sp}$ by a spatial encoder.
    • The spatial encoder also extracts a global context feature $c_t$ from each frame. Then, the global features $\{c_t\}$ and the projected human gaze features $\{g'_{t,i}\}$ are concatenated to build person-wise sliding windows of context features.
    • Several instance-level sliding windows are constructed, each containing only the refined pair representations of one unique human-object pair across time, $\left[x^{sp}_{t-L+1,\langle i,j\rangle}, \ldots, x^{sp}_{t,\langle i,j\rangle}\right]$ (see the sketch after this list).
    • A temporal encoder fuses context knowledge into the pair representations by the cross-attention mechanism.
    • Finally, the prediction heads estimate the probability distribution $z_{t,\langle i,j\rangle}$ of interactions for each human-object pair based on the last entry $x^{tmp}_{t,\langle i,j\rangle}$ in the temporal encoder output.
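
The following sketch illustrates how such instance-level sliding windows could be assembled: refined pair representations are grouped by human-object identity across frames and the last L entries are stacked. The dictionary layout, window length, and feature dimension are assumptions for illustration, not the authors' implementation.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

import torch

def build_pair_windows(
    refined_pairs: List[Dict[Tuple[int, int], torch.Tensor]],  # per frame: (human_id, obj_id) -> x^sp of shape (d,)
    window_len: int = 4,  # assumed window length L
) -> Dict[Tuple[int, int], torch.Tensor]:
    """Group refined pair representations by human-object identity and
    keep only the most recent `window_len` frames for each pair."""
    history: Dict[Tuple[int, int], List[torch.Tensor]] = defaultdict(list)
    for frame_pairs in refined_pairs:
        for pair_id, feat in frame_pairs.items():
            history[pair_id].append(feat)
    # Stack the last L features of each pair into a (<=L, d) window.
    return {pair_id: torch.stack(feats[-window_len:], dim=0)
            for pair_id, feats in history.items()}
```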

Object module


  • The object module takes a sequence of RGB video frames as input and detects bounding boxes and classes for objects in each frame, including bounding boxes for humans.
  • An object tracker associates current detections with past ones to obtain trajectories for human and object bounding boxes. This allows analyzing each unique human-object pair.
  • Visual features are extracted for each box using a ResNet backbone. Additional features, such as visual relation features and spatial relation masks, are extracted for each human-object pair (a rough sketch follows after this list).
  • Object semantic features are generated from object categories using word embeddings, to reflect different likely interactions depending on object type.
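
As a rough, hypothetical sketch of this step (not the paper's exact design), the snippet below pools ROI-aligned ResNet-101 features for each detected box and builds a simple two-channel spatial mask for one human-object pair; the pooling scheme and the 27x27 mask resolution are illustrative assumptions.

```python
import torch
import torchvision
from torchvision.ops import roi_align

# Truncated ResNet-101: keep everything up to the last conv stage (stride 32).
backbone = torch.nn.Sequential(
    *list(torchvision.models.resnet101(weights="IMAGENET1K_V1").children())[:-2]
).eval()

@torch.no_grad()
def box_visual_features(frame: torch.Tensor, boxes: torch.Tensor) -> torch.Tensor:
    """frame: (1, 3, H, W); boxes: (N, 4) in xyxy pixel coordinates."""
    fmap = backbone(frame)                                        # (1, 2048, H/32, W/32)
    rois = torch.cat([torch.zeros(len(boxes), 1), boxes], dim=1)  # prepend batch index
    feats = roi_align(fmap, rois, output_size=(7, 7), spatial_scale=1.0 / 32)
    return feats.mean(dim=(2, 3))                                 # (N, 2048) pooled features

def spatial_relation_mask(h_box: torch.Tensor, o_box: torch.Tensor, size: int = 27) -> torch.Tensor:
    """Two-channel binary mask encoding the layout of a human-object pair;
    boxes are assumed to be normalized to [0, 1]."""
    mask = torch.zeros(2, size, size)
    for c, box in enumerate((h_box, o_box)):
        x1, y1, x2, y2 = (box * size).long().tolist()
        mask[c, y1:y2 + 1, x1:x2 + 1] = 1.0
    return mask
```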

Gaze module


  • The gaze-following method from Chong et al. (2020) is adopted to generate a gaze heatmap for each human.
  • A head detector is needed to identify human heads in the scene. Directly getting the head box from the human box can cause mismatches in some cases.
  • The gaze module first detects all heads in the full RGB frame. These are matched to human boxes using linear assignment.
  • An intersection over head (IoH) ratio is computed between each human box and head box. If the IoH ratio exceeds a threshold of 0.7, the head is shortlisted as a match for that human (see the sketch after this list).
  • Finally, the gaze-following model combines the head information and the scene feature map using an attention mechanism, then encodes the fused features and extracts temporal dependencies with a convolutional Long Short-Term Memory (Conv-LSTM) network to estimate the gaze heatmap.
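
A minimal sketch of the head-to-human assignment described above, assuming axis-aligned (x1, y1, x2, y2) boxes and using SciPy's Hungarian solver; the function names are hypothetical.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def intersection_over_head(human_box, head_box):
    """IoH = area(human ∩ head) / area(head); boxes are (x1, y1, x2, y2)."""
    ix1, iy1 = max(human_box[0], head_box[0]), max(human_box[1], head_box[1])
    ix2, iy2 = min(human_box[2], head_box[2]), min(human_box[3], head_box[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    head_area = (head_box[2] - head_box[0]) * (head_box[3] - head_box[1])
    return inter / head_area if head_area > 0 else 0.0

def match_heads_to_humans(human_boxes, head_boxes, ioh_thresh=0.7):
    """Return {human_idx: head_idx} via linear assignment on the IoH matrix,
    keeping only matches whose IoH exceeds the threshold."""
    if len(human_boxes) == 0 or len(head_boxes) == 0:
        return {}
    ioh = np.array([[intersection_over_head(hu, he) for he in head_boxes] for hu in human_boxes])
    rows, cols = linear_sum_assignment(-ioh)  # maximize the total IoH
    return {r: c for r, c in zip(rows, cols) if ioh[r, c] > ioh_thresh}
```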

Spatial and temporal encoders


  • A spatial encoder exploits human-object relation representations from one frame to understand dependencies between appearances, spatial relations, and semantics. It extracts a global feature vector for each frame to represent contexts between all human-object pairs.
  • The spatial encoder takes the human-object pair relations as input. A learnable global token is prepended to the sequence; after stacked self-attention layers, it summarizes the dependencies between pairs into a global feature vector, while the pair relation representations are refined.
  • The refined pair representations are concatenated into sequences for the temporal encoder. Unlike STTran which jointly processes all pairs, sequences are formulated so each only contains one human-object pair.
  • The human’s gaze feature is concatenated with the global frame feature into a person-wise context feature sequence. This is fed to the temporal encoder along with the pair sequences.
  • Positional encodings are added to the entries of both the context and pair sequences, since the attention in the temporal encoder is otherwise order-agnostic. Sinusoidal encodings perform better than learned ones.
  • The temporal encoder fuses context and pair sequences via cross-attention to capture the temporal evolution of dependencies and detect interactions over time (a minimal sketch follows after this list).
  • Prediction heads generate probability distributions for interaction categories. Outputs are concatenated into the final model output.
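
The sketch below shows one possible layer of such a temporal encoder, with sinusoidal positional encodings added to both sequences and the pair tokens attending to the context tokens via cross-attention; the hidden size, number of heads, and normalization layout are assumptions, not the paper's exact architecture.

```python
import math

import torch
import torch.nn as nn

def sinusoidal_encoding(length: int, dim: int) -> torch.Tensor:
    """Fixed sinusoidal positional encoding of shape (length, dim); dim is assumed even."""
    pos = torch.arange(length).unsqueeze(1)
    div = torch.exp(torch.arange(0, dim, 2) * (-math.log(10000.0) / dim))
    pe = torch.zeros(length, dim)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe

class TemporalCrossAttentionLayer(nn.Module):
    """One cross-attention layer: pair tokens (queries) attend to the
    person-wise context tokens (global frame feature + gaze feature)."""
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, pair_seq: torch.Tensor, ctx_seq: torch.Tensor) -> torch.Tensor:
        # pair_seq: (B, L, d) sliding window of one human-object pair
        # ctx_seq:  (B, L, d) matching window of person-wise context features
        pe = sinusoidal_encoding(pair_seq.size(1), pair_seq.size(2)).to(pair_seq.device)
        q, kv = pair_seq + pe, ctx_seq + pe
        x = self.norm1(pair_seq + self.attn(q, kv, kv, need_weights=False)[0])
        return self.norm2(x + self.ffn(x))
```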

Experiments

Datasets

VidHOI dataset

Currently the largest video dataset with complete HOI annotations. The VidHOI dataset uses keyframe-based annotations, where the keyframes are sampled at 1 frame per second (FPS). There are 78 object categories and 50 predicate classes.

Loss function

The VidHOI dataset is unbalanced, with a long-tailed interaction distribution. To address the imbalance and avoid over-emphasizing the most frequent classes in the dataset, the class-balanced (CB) focal loss (Cui et al., 2019) is adopted as follows:
$$CB_{focal}(p_i, y_i) = -\frac{1-\beta}{1-\beta^{n_i}}\,(1-p_{y_i})^\gamma \log p_{y_i}$$

with

$$p_{y_i} = \begin{cases} p_i & \text{if } y_i = 1 \\ 1 - p_i & \text{otherwise} \end{cases}$$

The term $-(1-p_{y_i})^\gamma \log p_{y_i}$ is the focal loss proposed in Lin et al. (2017), where $p_i$ denotes the estimated probability for the $i$th class and $y_i \in \{0, 1\}$ is the ground-truth label. The variable $n_i$ denotes the number of ground-truth samples of the $i$th class, and $\beta \in [0, 1)$ is a tunable parameter. The mean of the losses over all classes is taken as the loss for one prediction.
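
A minimal PyTorch sketch of this loss, assuming multi-label sigmoid probabilities and per-class sample counts taken from the training set; the hyperparameter defaults for β and γ are illustrative.

```python
import torch

def cb_focal_loss(probs: torch.Tensor, targets: torch.Tensor,
                  samples_per_class: torch.Tensor,
                  beta: float = 0.9999, gamma: float = 2.0) -> torch.Tensor:
    """Class-balanced focal loss for multi-label interaction prediction.
    probs:   (B, C) estimated probabilities p_i (after a sigmoid)
    targets: (B, C) binary ground-truth labels y_i
    samples_per_class: (C,) number of ground-truth samples n_i per class
    """
    weights = (1.0 - beta) / (1.0 - beta ** samples_per_class)  # (C,) class-balancing term
    p_y = torch.where(targets == 1, probs, 1.0 - probs)         # p_{y_i}
    focal = -((1.0 - p_y) ** gamma) * torch.log(p_y.clamp(min=1e-8))
    return (weights * focal).mean()                             # mean over classes and batch
```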

Action Genome dataset

Another large-scale video dataset containing 35 object categories and 25 interaction classes. However, only the HOIs of a single person are annotated in each video, even when more people appear. Moreover, the videos are generated by volunteers performing pre-defined tasks. Thus, models developed on the Action Genome dataset may be less useful in real-world settings.

Evaluation metrics

A predicted HOI triplet is counted as a true positive if: (1) both the detected human and object bounding boxes overlap with the ground truth with an intersection over union (IoU) > 0.5, (2) the predicted object class is correct, and (3) the predicted interaction is correct.

The metric mAP is reported on the VidHOI dataset over three different HOI category sets: (1) Full: all 557 HOI triplet categories, (2) Rare: 315 categories with < 25 instances in the validation set, and (3) Non-rare: 242 categories with ≥ 25 instances in the validation set.

A set of person-wise multi-label top-$k$ metrics is proposed as additional evaluation metrics. For each frame, the detected human-object pairs are first assigned to the ground-truth pairs. Then, the top-$k$ triplets of each human are used to compute the metrics for that human. The final results are averaged over all humans in the dataset, without frame-wise or video-wise mean computation.
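
As an illustration of the person-wise averaging, the sketch below computes a top-$k$ recall per human and averages it directly over all humans; the exact set of metrics in the paper may differ, and the function and its input layout are hypothetical.

```python
import numpy as np

def person_wise_topk_recall(person_scores, person_gt, k=5):
    """person_scores: list over humans of (num_triplets,) confidence arrays
    person_gt:        list over humans of (num_triplets,) binary ground-truth arrays
    Returns recall@k averaged directly over all humans in the dataset."""
    recalls = []
    for scores, gt in zip(person_scores, person_gt):
        if gt.sum() == 0:            # skip humans without ground-truth interactions
            continue
        topk = np.argsort(-scores)[:k]
        recalls.append(gt[topk].sum() / gt.sum())
    return float(np.mean(recalls)) if recalls else 0.0
```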

All models are trained with ground-truth object trajectories. Models in Oracle mode are evaluated with ground-truth object bounding boxes, while models in Detection mode are evaluated with the bounding boxes produced by the object detector.

Implementation details

  • In the object module, we employ the YOLOv5 model (Jocher et al., 2022) as the object detector. The weights are pre-trained on the COCO dataset (Lin et al., 2014) and fine-tuned on the VidHOI dataset.

  • We apply the pre-trained DeepSORT model (Wojke et al., 2017) as the human tracker, ResNet-101 (He et al., 2016) as the feature backbone, and the GloVe model (Pennington et al., 2014) for word embeddings.

  • In the gaze module, we also apply YOLOv5 to detect heads in RGB frames; this model is pre-trained on the CrowdHuman dataset (Shao et al., 2018). The gaze-following method introduced in Chong et al. (2020), pre-trained on the VideoAttentionTarget dataset (Chong et al., 2020), is adopted to generate gaze features. All weights in the object module and gaze module are frozen during the training of the spatio-temporal transformer.

Results

