MUREN (Relational Context Learning for Human-Object Interaction Detection)
Contributions
- multiplex relation embedding module (MURE), which generates context information from unary, pairwise, and ternary relations within an HOI instance
- attentive fusion module that propagates the requisite context information for context exchange
- three-branch architecture to learn more discriminative features for the sub-tasks (human detection, object detection, and interaction classification)
- MUREN outperforms SOTA methods on HICO-DET and V-COCO
Introduction
- single-branch (a single transformer decoder)
- updates a token set through a single transformer decoder
- detects HOI instances directly with subsequent FFNs
- disadvantages:
- a single transformer decoder is responsible for all sub-tasks (human detection, object detection, and interaction classification)
- limited in adapting to the different sub-tasks simultaneously under multi-task learning (see the sketch after this list)
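A minimal PyTorch sketch of this single-branch design. All sizes (token dimension, query count, class counts) and the FFN head layout are hypothetical placeholders, not the exact heads of any specific method:

```python
# Minimal single-branch sketch: one decoder updates a shared token set,
# and subsequent FFNs read all sub-task outputs directly off those tokens.
# All sizes (d, num_queries, class counts) are hypothetical placeholders.
import torch.nn as nn

class SingleBranchHOI(nn.Module):
    def __init__(self, d=256, num_queries=100, num_obj=80, num_verb=117):
        super().__init__()
        layer = nn.TransformerDecoderLayer(d_model=d, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=6)
        self.queries = nn.Embedding(num_queries, d)   # one token set for every sub-task
        self.human_box = nn.Linear(d, 4)
        self.object_box = nn.Linear(d, 4)
        self.object_cls = nn.Linear(d, num_obj + 1)   # +1 for "no object"
        self.verb_cls = nn.Linear(d, num_verb)

    def forward(self, memory):                        # memory: (B, HW, d) image tokens
        q = self.queries.weight.unsqueeze(0).expand(memory.size(0), -1, -1)
        tok = self.decoder(q, memory)                 # a single decoder serves all sub-tasks
        return (self.human_box(tok), self.object_box(tok),
                self.object_cls(tok), self.verb_cls(tok))
```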
- two-branch (two separate transformer decoders)
- one detects human-object pairs
- the other classifies interaction classes
- disadvantages:
- insufficient context exchange between the branches prevents two-branch methods [15, 38, 40] from learning relational contexts, which play a crucial role in identifying HOI instances
- some methods tackle this issue with additional context exchange, but they are limited to propagating human-object context to the interaction branch (see the sketch after this list)
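A comparable two-branch sketch under the same hypothetical sizes; the one-way reuse of pair tokens as interaction queries illustrates the limited human-object → interaction context propagation noted above:

```python
# Two-branch sketch: one decoder detects human-object pairs, the other
# classifies interactions. The only context exchange is feeding pair tokens
# in as interaction queries, i.e., a one-way flow. Sizes are placeholders.
import torch.nn as nn

class TwoBranchHOI(nn.Module):
    def __init__(self, d=256, num_queries=100, num_verb=117):
        super().__init__()
        make_dec = lambda: nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model=d, nhead=8, batch_first=True),
            num_layers=3)
        self.pair_dec, self.inter_dec = make_dec(), make_dec()
        self.queries = nn.Embedding(num_queries, d)
        self.pair_head = nn.Linear(d, 8)              # human box (4) + object box (4)
        self.verb_head = nn.Linear(d, num_verb)

    def forward(self, memory):                        # memory: (B, HW, d) image tokens
        q = self.queries.weight.unsqueeze(0).expand(memory.size(0), -1, -1)
        pair_tok = self.pair_dec(q, memory)
        inter_tok = self.inter_dec(pair_tok, memory)  # human-object -> interaction only
        return self.pair_head(pair_tok), self.verb_head(inter_tok)
```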
- MUREN
- advantages:
- performs rich context exchange
- three types, built from the unary, pairwise, and ternary relations of the human, object, and interaction tokens (see the toy sketch after this list)
- unary and pairwise relation contexts provide more fine-grained information
- unary context: e.g., "riding" alone helps to infer the pair of a human and an object
- pairwise context: (human, riding) helps to detect the object (bicycle)
- ternary context provides holistic information about the whole HOI instance
- the multiplex relation embedding module constructs the context information (consisting of the three relation contexts)
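A toy sketch of the three relation types, assuming simple addition as the composition operator (the actual module may use a learned fusion):

```python
# Toy illustration of unary / pairwise / ternary relation contexts built from
# one human, object, and interaction token. Addition is an assumed stand-in
# for whatever learned composition the real module uses.
import torch

d = 256
h, o, i = torch.randn(d), torch.randn(d), torch.randn(d)

unary    = [h, o, i]                 # e.g., "riding" alone hints at the human and object
pairwise = [h + o, h + i, o + i]     # e.g., (human, riding) -> likely a bicycle
ternary  = h + o + i                 # holistic view of the whole HOI instance
```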
Summary
MUREN, the current SOTA model for still-image HOI detection, addresses the shortcomings of earlier single-/two-branch decoders by proposing a three-branch architecture (human detection, object detection, interaction classification).
- drawback of single-branch: a single transformer decoder must solve multiple sub-tasks, so multi-task learning performs poorly
- drawback of two-branch: two separate transformer decoders handle human-object pair detection and interaction classification respectively; insufficient context exchange between the two branches prevents learning relational contexts
MUREN model architecture
- A CNN first extracts features; after positional encoding and flattening, they are fed into a transformer encoder to produce image tokens.
- The decoder of each of the three branches extracts task-specific tokens to predict its sub-task.
- MURE takes the task-specific tokens as input and generates the multiplex relation context for relational reasoning.
- The Attentive Fusion Module propagates the multiplex relation context to each sub-task for context exchange.
- The outputs of the last layer of each branch are used to predict the human-object interaction instances (see the end-to-end sketch below).
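An end-to-end sketch of this pipeline with hypothetical sizes; MURE and attentive fusion (sketched in the sections below) would be interleaved between the decoder layers of the three branches:

```python
# Pipeline sketch: CNN features -> flatten -> transformer encoder -> three
# branch decoders producing task-specific tokens. Positional encoding and the
# per-layer MURE/attentive-fusion exchange are omitted; sizes are placeholders.
import torch.nn as nn
import torchvision

class MURENPipelineSketch(nn.Module):
    def __init__(self, d=256, num_queries=100):
        super().__init__()
        backbone = torchvision.models.resnet50(weights=None)
        self.cnn = nn.Sequential(*list(backbone.children())[:-2])  # drop pool/fc
        self.proj = nn.Conv2d(2048, d, kernel_size=1)
        enc_layer = nn.TransformerEncoderLayer(d_model=d, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=6)
        make_dec = lambda: nn.TransformerDecoderLayer(d_model=d, nhead=8, batch_first=True)
        self.human_dec, self.obj_dec, self.inter_dec = make_dec(), make_dec(), make_dec()
        self.queries = nn.Embedding(num_queries, 3 * d)  # one query per branch, split below

    def forward(self, images):                        # images: (B, 3, H, W)
        feat = self.proj(self.cnn(images))            # (B, d, h, w)
        tokens = feat.flatten(2).transpose(1, 2)      # flatten -> (B, hw, d) image tokens
        mem = self.encoder(tokens)
        B = images.size(0)
        qh, qo, qi = self.queries.weight.unsqueeze(0).expand(B, -1, -1).chunk(3, dim=-1)
        t_human = self.human_dec(qh, mem)             # task-specific tokens per branch
        t_obj = self.obj_dec(qo, mem)
        t_inter = self.inter_dec(qi, mem)
        # MURE + attentive fusion would exchange context among the branches here
        return t_human, t_obj, t_inter
```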
MURE architecture
Because the task-specific tokens are generated by independent branches, they lack relational context information. To address this, the Multiplex Relation Embedding Module (MURE) generates the multiplex relation context for relational reasoning, covering unary, pairwise, and ternary relation contexts. Its inputs are the i-th branch's task-specific tokens and the image tokens; its output is passed to the attentive fusion module for context exchange.
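A hedged sketch of a MURE-like module. The composition-by-sum, the cross-attention refinement over image tokens, and the concat-plus-linear merge are all assumptions about the internals; only the inputs and outputs follow the description above:

```python
# MURE-like sketch: build 3 unary + 3 pairwise + 1 ternary relation contexts
# from the branch tokens, refine each against the image tokens with
# cross-attention, and merge them into one multiplex relation context.
import torch
import torch.nn as nn

class MURESketch(nn.Module):
    def __init__(self, d=256):
        super().__init__()
        self.refine = nn.MultiheadAttention(d, num_heads=8, batch_first=True)
        self.merge = nn.Linear(7 * d, d)  # 3 unary + 3 pairwise + 1 ternary

    def forward(self, t_human, t_obj, t_inter, img_tokens):
        # t_*: (B, N, d) task-specific tokens; img_tokens: (B, HW, d)
        contexts = [t_human, t_obj, t_inter,                          # unary
                    t_human + t_obj, t_human + t_inter, t_obj + t_inter,  # pairwise
                    t_human + t_obj + t_inter]                        # ternary
        refined = [self.refine(c, img_tokens, img_tokens)[0] for c in contexts]
        return self.merge(torch.cat(refined, dim=-1))                 # (B, N, d)
```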
Attentive Fusion
Goal: propagate the multiplex relation context to the task-specific tokens.
Since each sub-task needs different context information for relational reasoning, an MLP (conditioned on each task-specific token) transforms the multiplex relation context, so that the propagated context is tailored to each sub-task. Channel attention then selects the requisite context information for each sub-task (see the sketch below; the exact formulas are in the paper).
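A hedged sketch of this fusion step, assuming a squeeze-and-excitation-style sigmoid gate as the channel attention and a residual update of the task tokens (both assumptions beyond what the notes state):

```python
# Attentive-fusion sketch: an MLP conditioned on each task-specific token
# transforms the multiplex relation context, and a sigmoid channel-attention
# gate selects which channels of that context each sub-task absorbs.
import torch
import torch.nn as nn

class AttentiveFusionSketch(nn.Module):
    def __init__(self, d=256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2 * d, d), nn.ReLU(), nn.Linear(d, d))
        self.gate = nn.Sequential(nn.Linear(2 * d, d), nn.Sigmoid())  # channel attention

    def forward(self, task_tok, rel_ctx):
        # task_tok, rel_ctx: (B, N, d); conditioning on the task token lets each
        # sub-task receive a differently transformed context
        joint = torch.cat([task_tok, rel_ctx], dim=-1)
        ctx = self.mlp(joint)                         # task-conditioned transform
        return task_tok + self.gate(joint) * ctx      # gated context exchange
```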
Finally, the human-object interaction instances are predicted from the last-layer tokens of each branch (the exact prediction formulas are in the paper).
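A sketch of plausible prediction heads over the fused last-layer tokens, with hypothetical output spaces (80 COCO object classes, 117 HICO-DET verb classes); the n-th query of each branch jointly forms one <human, object, interaction> triplet:

```python
# Per-branch FFN heads; sizes are placeholders, not the paper's exact heads.
import torch.nn as nn

d, num_obj, num_verb = 256, 80, 117
human_box_head  = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, 4))
object_box_head = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, 4))
object_cls_head = nn.Linear(d, num_obj + 1)   # +1 for "no object"
verb_cls_head   = nn.Linear(d, num_verb)      # multi-label interaction classes
```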