MUREN: Relational Context Learning for Human-Object Interaction Detection

Contributions

  1. A multiplex relation embedding module that generates context information from the unary, pairwise, and ternary relations within an HOI instance.
  2. An attentive fusion module that propagates the requisite context information between branches for context exchange.
  3. A three-branch architecture that learns more discriminative features for the sub-tasks (human detection, object detection, and interaction classification).
  4. MUREN outperforms state-of-the-art methods on the HICO-DET and V-COCO benchmarks.

Introduction

  • single-branch (single transformer decoder)

    • updates a token set through a single transformer decoder
    • detects HOI instances directly with subsequent FFNs
    • disadvantages:
      • a single transformer decoder is responsible for all sub-tasks (human detection, object detection, and interaction classification)
      • limited in adapting to the different sub-tasks simultaneously under multi-task learning
  • two-branch (two separate transformer decoders)

    • one detects human-object pairs
    • the other classifies interaction classes
    • disadvantages:
      • insufficient context exchange between the branches prevents two-branch methods [15,38,40] from learning relational contexts, which play a crucial role in identifying HOI instances
      • some methods tackle this with additional context exchange, but it is limited to propagating human-object context to the interaction branch
  • MUREN

    • advantages:
      • performs rich context exchange
        • three types of relation contexts: the unary, pairwise, and ternary relations of the human, object, and interaction tokens
          • unary and pairwise relation contexts provide more fine-grained information
            • unary context: e.g., the unary context of an object ("bicycle") helps infer the pair of a human and an interaction ("riding")
            • pairwise context: e.g., the pairwise context of a human and "riding" helps detect the object ("bicycle")
          • ternary context provides holistic information about the whole HOI instance
        • the multiplex relation embedding module constructs the context information, which consists of these three relation contexts

Summary

MUREN, the current SOTA model for static-image HOI detection, addresses the shortcomings of previous single-branch and two-branch decoders by proposing a three-branch architecture (human detection, object detection, interaction classification).

  1. Drawback of single-branch: a single transformer decoder has to solve multiple sub-tasks, so it performs poorly at multi-task learning.
  2. Drawback of two-branch: two separate transformer decoders handle human-object pair detection and interaction classification respectively, but insufficient context exchange between the branches prevents learning relational context.


MUREN Model Architecture

  1. A CNN first extracts features, which are given positional encodings, flattened, and passed into a transformer encoder to obtain image tokens.
  2. Three branch decoders extract task-specific tokens to predict their respective sub-tasks.
  3. MURE takes the task-specific tokens as input and generates the multiplex relation context for relational reasoning.
  4. The attentive fusion module propagates the multiplex relation context to each sub-task, performing context exchange.
  5. The outputs of the last layer of each branch are used to predict the HOI instances (see the sketch after this list).
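
To make the pipeline concrete, here is a minimal, hedged PyTorch sketch of the overall flow (backbone → encoder → three branch decoders). It is not the authors' code: the stand-in backbone, token count, and all hyperparameters are illustrative assumptions, and the MURE and attentive fusion steps (sketched in the following sections) are omitted between decoder layers.

```python
# Minimal sketch of the MUREN forward flow, assuming DETR-style components.
# NOT the authors' implementation; names and hyperparameters are assumptions.
import torch
import torch.nn as nn

class MURENPipelineSketch(nn.Module):
    def __init__(self, d=256, num_tokens=100, num_layers=6):
        super().__init__()
        # stand-in for a CNN backbone + projection (a ResNet in practice)
        self.backbone = nn.Conv2d(3, d, kernel_size=16, stride=16)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d, nhead=8, batch_first=True), num_layers)
        tasks = ("human", "object", "interaction")
        # one independent decoder and query set per sub-task (three branches)
        self.decoders = nn.ModuleDict({
            t: nn.TransformerDecoder(
                nn.TransformerDecoderLayer(d, nhead=8, batch_first=True), num_layers)
            for t in tasks})
        self.queries = nn.ParameterDict(
            {t: nn.Parameter(torch.randn(num_tokens, d)) for t in tasks})

    def forward(self, images):                      # images: (B, 3, H, W)
        b = images.size(0)
        # flatten the feature map into image tokens (positional encoding omitted)
        feats = self.backbone(images).flatten(2).transpose(1, 2)
        img_tokens = self.encoder(feats)            # (B, HW/256, d)
        # each branch refines its task-specific tokens against the image tokens;
        # in MUREN, MURE + attentive fusion exchange context between layers
        return {t: dec(self.queries[t].expand(b, -1, -1), img_tokens)
                for t, dec in self.decoders.items()}
```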


MURE Architecture

Because the task-specific tokens are generated by independent branches, they lack relational context information. To address this, the Multiplex Relation Embedding Module (MURE) generates the multiplex relation context for relational reasoning, which comprises unary, pairwise, and ternary relation contexts. Its inputs are the i-th task-specific tokens and the image tokens, and its output is passed to the attentive fusion module for context exchange.
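
Below is a minimal sketch of how the three relation contexts might be computed from one triple of task-specific token sets. Using small MLPs for the unary/pairwise/ternary embeddings and cross-attention over image tokens for the holistic ternary context follows the description above, but the exact operators are my assumptions, not the paper's verbatim design.

```python
# Hedged sketch of a multiplex relation embedding step; operator choices
# (Linear embeddings, cross-attention for the ternary context) are assumptions.
import torch
import torch.nn as nn

class MURESketch(nn.Module):
    def __init__(self, d=256):
        super().__init__()
        self.unary = nn.ModuleList([nn.Linear(d, d) for _ in range(3)])
        self.pairwise = nn.ModuleList([nn.Linear(2 * d, d) for _ in range(3)])
        self.ternary = nn.Linear(3 * d, d)
        # the ternary context attends to image tokens for holistic information
        self.cross_attn = nn.MultiheadAttention(d, num_heads=8, batch_first=True)

    def forward(self, h, o, r, img_tokens):
        # h, o, r: (B, N, d) task-specific tokens; img_tokens: (B, HW, d)
        toks = (h, o, r)
        uno = [f(t) for f, t in zip(self.unary, toks)]            # 3 unary contexts
        pairs = [(h, o), (o, r), (h, r)]
        pair = [f(torch.cat(p, dim=-1)) for f, p in zip(self.pairwise, pairs)]
        ter = self.ternary(torch.cat(toks, dim=-1))
        ter, _ = self.cross_attn(ter, img_tokens, img_tokens)     # holistic context
        # multiplex relation context: all three kinds, concatenated along channels
        return torch.cat(uno + pair + [ter], dim=-1)              # (B, N, 7d)
```

Downstream, the attentive fusion module decides how much of this concatenated context each branch actually consumes.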

Attentive Fusion

Goal: propagate the multiplex relation context to the task-specific tokens.

Since each sub-task needs different context information for relational reasoning, an MLP (one per task-specific token type) transforms the multiplex relation context so that the propagated context is conditioned on each sub-task. Channel attention then selects the context information each sub-task requires. The formulas are as follows:

[Equation image: task-conditioned MLP transform and channel attention]
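
Since the equations themselves were only shown as an image, here is a hedged sketch of the mechanism as described: a per-task MLP transforms the multiplex relation context, and a sigmoid channel attention gates which channels reach that task's tokens. The gating form and the residual addition are my assumptions.

```python
# Hedged sketch of attentive fusion; gating details are assumptions.
import torch
import torch.nn as nn

class AttentiveFusionSketch(nn.Module):
    def __init__(self, ctx_dim, d=256):
        super().__init__()
        # per-task MLP that conditions the shared relation context on this task
        self.transform = nn.Sequential(
            nn.Linear(ctx_dim, d), nn.ReLU(), nn.Linear(d, d))
        # channel attention conditioned on the task-specific token
        self.gate = nn.Linear(2 * d, d)

    def forward(self, task_tokens, relation_ctx):
        # task_tokens: (B, N, d); relation_ctx: (B, N, ctx_dim)
        ctx = self.transform(relation_ctx)
        attn = torch.sigmoid(self.gate(torch.cat([task_tokens, ctx], dim=-1)))
        return task_tokens + attn * ctx   # keep only the channels this task needs
```

One such module would be instantiated per branch, so the human, object, and interaction tokens can each select different channels of the shared relation context.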

The final HOI predictions are computed as:

[Equation image: final HOI instance prediction]
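
A DETR-style reading of this step is that each branch ends in small FFN heads, and the i-th tokens across the three branches jointly form the i-th HOI triplet ⟨human box, object box/class, interaction class⟩. The heads below are a hedged sketch; the class counts correspond to HICO-DET (80 objects, 117 verbs) and the head shapes are assumptions.

```python
import torch.nn as nn

# Hedged sketch of per-branch prediction heads (shapes are assumptions).
def make_prediction_heads(d=256, num_obj_classes=80, num_verb_classes=117):
    return nn.ModuleDict({
        "human_box":  nn.Linear(d, 4),                    # human branch -> box
        "object_box": nn.Linear(d, 4),                    # object branch -> box
        "object_cls": nn.Linear(d, num_obj_classes + 1),  # +1 for "no object"
        "verb_cls":   nn.Linear(d, num_verb_classes),     # interaction branch
    })
```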
