MUREN(Relational Context Learning for Human-Object Interaction Detection)

MUREN is a human-object interaction (HOI) detection model that generates context information with a multiplex relation embedding module and uses an attentive fusion module to enable relational context exchange between sub-tasks. Compared with single- and two-branch decoders, MUREN's three-branch architecture learns more discriminative features for human detection, object detection, and interaction classification, achieving superior performance on the HICO-DET and V-COCO datasets.

Contributions

  1. A multiplex relation embedding module (MURE), which generates context information using unary, pairwise, and ternary relations in an HOI instance.
  2. An attentive fusion module that propagates the requisite context information between branches for context exchange.
  3. A three-branch architecture that learns more discriminative features for the sub-tasks (human detection, object detection, and interaction classification).
  4. MUREN outperforms SOTA methods on HICO-DET and V-COCO.

Introduction

  • single-branch (a single transformer decoder)

    • updates a token set through a single transformer decoder
    • detects HOI instances directly with subsequent FFNs
    • disadvantages:
      • a single transformer decoder is responsible for all sub-tasks (human detection, object detection, and interaction classification)
      • limited in adapting to the different sub-tasks simultaneously under multi-task learning
  • two-branch (two separate transformer decoders)

    • one detects human-object pairs
    • the other classifies interaction classes
    • disadvantages:
      • insufficient context exchange between the branches prevents two-branch methods [15, 38, 40] from learning relational contexts, which play a crucial role in identifying HOI instances
      • some methods tackle this issue with additional context exchange, but they are limited to propagating human-object context to interaction context
  • MUREN

    • advantages:
      • performs rich context exchange
        • three relation types: unary, pairwise, and ternary relations of the human, object, and interaction tokens
          • unary and pairwise relation contexts provide more fine-grained information
            • unary context: e.g., the unary context of riding helps to infer the associated human-object pair
            • pairwise context: e.g., the (human, riding) pair helps to detect the object (bicycle)
            • the multiplex relation embedding module constructs the context information (consisting of the three relation contexts)
          • ternary context provides holistic information about the HOI instance

Summary

MUREN is the current SOTA model for static-image HOI detection. Addressing the shortcomings of previous single- and two-branch decoders, it proposes a three-branch architecture (human detection, object detection, interaction classification).

  1. Drawback of single-branch: a single transformer decoder has to handle all sub-tasks, so multi-task learning performs poorly.
  2. Drawback of two-branch: two separate transformer decoders detect human-object pairs and classify interactions respectively; insufficient context exchange between the two branches prevents them from learning relational context.


MUREN Architecture

  1. A CNN first extracts image features; after positional encoding and flattening, they are fed into a transformer encoder to produce image tokens.
  2. Three decoder branches extract task-specific tokens to predict their respective sub-tasks.
  3. MURE takes the task-specific tokens as input and generates the multiplex relation context for relational reasoning.
  4. The attentive fusion module propagates the multiplex relation context to each sub-task for context exchange.
  5. The last-layer outputs of each branch are used to predict HOI instances, as sketched below.
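
A minimal, illustrative PyTorch sketch of this pipeline. The backbone, sizes, and stock encoder/decoder layers are stand-in assumptions, not the authors' code; the real MUREN applies MURE and attentive fusion inside every decoder layer, which is omitted here for brevity:

```python
import torch
import torch.nn as nn

class MURENSketch(nn.Module):
    def __init__(self, d_model=256, num_queries=64, num_layers=6, nhead=8):
        super().__init__()
        self.backbone = nn.Conv2d(3, d_model, kernel_size=16, stride=16)  # stand-in CNN
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True), num_layers)
        # one learnable token set per branch: human / object / interaction
        self.queries = nn.ParameterDict({
            k: nn.Parameter(torch.randn(num_queries, d_model))
            for k in ("human", "object", "inter")})
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.branches = nn.ModuleDict({
            k: nn.TransformerDecoder(dec_layer, num_layers)
            for k in ("human", "object", "inter")})

    def forward(self, images):                       # images: (B, 3, H, W)
        feats = self.backbone(images)                # (B, d, H/16, W/16)
        tokens = feats.flatten(2).transpose(1, 2)    # flatten to image tokens (B, HW, d)
        tokens = self.encoder(tokens)                # positional encoding omitted for brevity
        B = images.size(0)
        # each branch refines its task-specific tokens against the image tokens;
        # last-layer outputs feed the prediction heads for each sub-task
        return {k: self.branches[k](self.queries[k].expand(B, -1, -1), tokens)
                for k in ("human", "object", "inter")}
```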


MURE Architecture

Because the task-specific tokens are generated by independent branches, they lack relational context information. To address this, the Multiplex Relation Embedding Module (MURE) generates multiplex relation context for relational reasoning, comprising unary, pairwise, and ternary relation contexts. Its inputs are the task-specific tokens from the i-th decoder layer and the image tokens; its output goes to the attentive fusion module for context exchange.
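
A minimal sketch of MURE under stated assumptions: the three relation contexts are formed by projecting the corresponding token subsets (tokens at the same index across branches are assumed to describe the same HOI candidate), and the ternary context is refined by cross-attention over the image tokens. Layer names and the way the contexts are combined are illustrative, not the paper's exact equations:

```python
import torch
import torch.nn as nn

class MURESketch(nn.Module):
    def __init__(self, d_model=256, nhead=8):
        super().__init__()
        # unary: one projection per task token set
        self.unary = nn.ModuleDict({k: nn.Linear(d_model, d_model)
                                    for k in ("human", "object", "inter")})
        # pairwise: one projection per pair of token sets
        self.pair = nn.ModuleDict({k: nn.Linear(2 * d_model, d_model)
                                   for k in ("ho", "hi", "oi")})
        # ternary: holistic projection of all three token sets
        self.ternary = nn.Linear(3 * d_model, d_model)
        self.cross_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)

    def forward(self, h, o, inter, img_tokens):      # each: (B, N, d)
        unary = [self.unary["human"](h), self.unary["object"](o),
                 self.unary["inter"](inter)]
        pair = [self.pair["ho"](torch.cat([h, o], -1)),
                self.pair["hi"](torch.cat([h, inter], -1)),
                self.pair["oi"](torch.cat([o, inter], -1))]
        ter = self.ternary(torch.cat([h, o, inter], -1))
        # ground the holistic ternary context in the image via cross-attention
        ter, _ = self.cross_attn(ter, img_tokens, img_tokens)
        # multiplex relation context: combine unary, pairwise, and ternary contexts
        return torch.stack(unary + pair + [ter]).sum(dim=0)
```

Summing the seven contexts is the simplest combination choice; the paper's attentive fusion (next section) instead selects per-task context with channel attention.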

Attentive Fusion

Purpose: propagate the multiplex relation context to the task-specific tokens.

Since each sub-task needs different context information for relational reasoning, an MLP (conditioned on each task-specific token) transforms the multiplex relation context, propagating context information tailored to each sub-task. Channel attention then selects the context information necessary for each sub-task. The formulas are as follows:

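A rough reconstruction from the description above (the notation is my assumption, not necessarily the paper's exact equations):

$$
\tilde{C}^{t}_{i} = \mathrm{MLP}^{t}\!\left([\,x^{t}_{i};\, C_{i}\,]\right), \qquad
a^{t}_{i} = \sigma\!\left(W^{t}\,\tilde{C}^{t}_{i}\right), \qquad
x^{t}_{i} \leftarrow x^{t}_{i} + a^{t}_{i} \odot \tilde{C}^{t}_{i}
$$

where $x^{t}_{i}$ is the $i$-th task-specific token of sub-task $t \in \{\text{human}, \text{object}, \text{interaction}\}$, $C_{i}$ is the multiplex relation context, $[\cdot\,;\cdot]$ denotes concatenation, and $\sigma$ produces the channel-attention weights that select the context each sub-task needs.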

The formula for the final HOI prediction:

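A plausible reconstruction, assuming DETR-style FFN heads on the final-layer tokens of each branch (head names are illustrative):

$$
\hat{b}^{h}_{i} = \mathrm{FFN}_{\text{box}}\!\left(x^{h}_{i}\right), \qquad
\hat{b}^{o}_{i},\, \hat{c}^{o}_{i} = \mathrm{FFN}_{\text{obj}}\!\left(x^{o}_{i}\right), \qquad
\hat{c}^{\text{act}}_{i} = \sigma\!\left(\mathrm{FFN}_{\text{act}}\!\left(x^{\text{inter}}_{i}\right)\right)
$$

so the $i$-th HOI prediction is the triplet of human box $\hat{b}^{h}_{i}$, object box and class $(\hat{b}^{o}_{i}, \hat{c}^{o}_{i})$, and multi-label interaction scores $\hat{c}^{\text{act}}_{i}$.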
