SLViT: Scale-Wise Language-Guided Vision Transformer for Referring Image Segmentation

1. BaseInfo

| Title | SLViT: Scale-Wise Language-Guided Vision Transformer for Referring Image Segmentation |
| --- | --- |
| Address | https://www.ijcai.org/proceedings/2023/0144.pdf |
| Journal/Time | IJCAI 2023 |
| Author | Zhejiang University |
| Code | https://github.com/NaturalKnight/SLViT |
| Read | 2024/08/09 |
| Tags | #RIS #Seg |

2. Creative Q->A

  1. Visual feature extraction and cross-modal fusion are considered separately, so vision-language alignment is insufficient. -> Language-Guided Multi-Scale Fusion Attention (LMFA)
  2. The two modules are applied sequentially, so information interaction between them is insufficient. -> Uncertain Region Cross-Scale Enhancement module (URCE)
    Together these form SLViT.

3. Concrete

3.1. Model

[Figure: overall SLViT architecture]

Similar to LAVT, cross-modal fusion is performed inside the encoder.

3.1.1. Input

Image + text

3.1.2. Backbone

ViT + BERT
Feature maps of stages 2, 3, and 4 are obtained by downsampling: a convolution with stride 2 and kernel size 3 × 3, followed by a batch normalization layer.
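
A minimal sketch of this stage-transition downsampling, assuming PyTorch; channel widths are placeholders rather than the paper's exact configuration:

```python
import torch.nn as nn

# Downsampling between stages as described above: a 3 x 3 convolution
# with stride 2 followed by a batch normalization layer.
def make_downsample(in_ch: int, out_ch: int) -> nn.Sequential:
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1),
        nn.BatchNorm2d(out_ch),
    )
```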

  • Language-Guided Multi-Scale Fusion Attention (LMFA); a code sketch follows this list
    • 5 × 5 convolution for initial local features
    • Multi-scale convolutions with 1 × k_i kernels
    • Cross attention
    • Gating mechanism
    • 1 × 1 convolution after fusing the two branches
      [Figure: LMFA structure]
  • Uncertain Region Cross-Scale Enhancement (URCE)
    [Figure: URCE structure]
    Multi-head self-attention
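
The list above gives LMFA's ingredients. A minimal PyTorch sketch of how they might fit together follows; the wiring (where the gate sits, how the branches are fused) and all layer shapes are assumptions, not the authors' implementation. Kernel sizes follow the k1 = 7, k2 = 11, k3 = 21 setting listed under Training (3.2).

```python
import torch
import torch.nn as nn

class LMFA(nn.Module):
    """Hypothetical sketch of Language-Guided Multi-Scale Fusion Attention."""
    def __init__(self, dim, lang_dim=768, kernel_sizes=(7, 11, 21)):
        super().__init__()
        # 5 x 5 depthwise conv for initial local features
        self.local = nn.Conv2d(dim, dim, 5, padding=2, groups=dim)
        # multi-scale strip convolutions with 1 x k_i (and k_i x 1) kernels
        self.scales = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(dim, dim, (1, k), padding=(0, k // 2), groups=dim),
                nn.Conv2d(dim, dim, (k, 1), padding=(k // 2, 0), groups=dim),
            ) for k in kernel_sizes
        ])
        # cross attention: visual queries attend to language keys/values
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(lang_dim, 2 * dim)
        # gate controlling how much language-attended signal is injected
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())
        # 1 x 1 conv after fusing the two branches
        self.fuse = nn.Conv2d(dim, dim, 1)

    def forward(self, x, lang):
        # x: (B, C, H, W) visual features, lang: (B, L, lang_dim) BERT tokens
        b, c, h, w = x.shape
        local = self.local(x)
        conv_branch = local + sum(s(local) for s in self.scales)
        tokens = x.flatten(2).transpose(1, 2)                  # (B, HW, C)
        q = self.q(tokens)
        k, v = self.kv(lang).chunk(2, dim=-1)
        attn = torch.softmax(q @ k.transpose(1, 2) / c ** 0.5, dim=-1)
        lang_branch = attn @ v                                 # (B, HW, C)
        lang_branch = self.gate(lang_branch) * lang_branch
        lang_branch = lang_branch.transpose(1, 2).reshape(b, c, h, w)
        return self.fuse(conv_branch + lang_branch)

# usage: LMFA(dim=96)(torch.randn(1, 96, 30, 30), torch.randn(1, 20, 768))
```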

3.1.3. Neck

3.1.4. Decoder

Hamburger head [Geng et al., 2021]: a Hamburger function, a 1 × 1 convolution, and an upsampling function.
The backbone is initialized with ImageNet-22K pre-trained weights from SegNeXt [Guo et al., 2022].
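
A sketch of this decoder path, assuming PyTorch; the Hamburger module is left abstract here (using the official implementation from [Geng et al., 2021] is assumed), and the two-class output is an assumption:

```python
import torch.nn as nn
import torch.nn.functional as F

class DecoderHead(nn.Module):
    """Hamburger function -> 1 x 1 convolution -> upsampling."""
    def __init__(self, hamburger: nn.Module, dim: int, num_classes: int = 2):
        super().__init__()
        self.ham = hamburger  # matrix-decomposition attention [Geng et al., 2021]
        self.cls = nn.Conv2d(dim, num_classes, kernel_size=1)

    def forward(self, x, out_size=(480, 480)):
        x = self.cls(self.ham(x))
        # bilinear upsampling back to the 480 x 480 input resolution
        return F.interpolate(x, size=out_size, mode='bilinear', align_corners=False)
```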

3.1.5. Loss

Cross-entropy (CE) loss.
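
In shapes, assuming a two-class (background vs. referred object) per-pixel formulation:

```python
import torch
import torch.nn as nn

# Per-pixel cross-entropy; the two-class layout is an assumption based
# on the binary nature of referring segmentation.
criterion = nn.CrossEntropyLoss()
logits = torch.randn(2, 2, 480, 480)          # (B, classes, H, W)
target = torch.randint(0, 2, (2, 480, 480))   # (B, H, W), long labels
loss = criterion(logits, target)
```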

3.2. Training

BERT: 12 layers, hidden dimension 768.
In the convolutional branches of LMFA, kernel sizes k1 = 7, k2 = 11, k3 = 21 are used. The remaining weights in the model are randomly initialized.
AdamW optimizer with weight decay 0.01.
The learning rate is initialized to 3e-5 and scheduled by polynomial learning rate decay with a power of 0.9.
60 epochs, batch size 16.
Image size 480 × 480.
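
These hyperparameters translate directly into an optimizer and scheduler; a minimal sketch, where `model` and `steps_per_epoch` are placeholders:

```python
import torch

model = torch.nn.Conv2d(3, 16, 3)   # stand-in for SLViT
steps_per_epoch = 1000              # assumed; depends on dataset and batch size

# AdamW with lr 3e-5 and weight decay 0.01, polynomial decay (power 0.9)
# over 60 epochs, applied per step.
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5, weight_decay=0.01)
total_steps = 60 * steps_per_epoch
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lambda step: (1.0 - step / total_steps) ** 0.9
)
```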

3.2.1. Resource

3.2.2 Dataset

| Name | Images | Referred Objects | Referring Expressions | Task | Note |
| --- | --- | --- | --- | --- | --- |
| RefCOCO | 19,994 | 50,000 | 142,209 | Referring Expression Segmentation | |
| RefCOCO+ | 19,992 | 49,856 | 141,564 | | |
| G-Ref | 26,711 | 54,822 | 104,560 | | Longer expressions and fewer objects than the first two |

3.3. Eval

Overall intersection-over-union (oIoU),
mean intersection-over-union (mIoU),
and precision at the 0.5, 0.7, and 0.9 IoU threshold values; a sketch of these metrics is given below.
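
A minimal sketch of the three metrics on binary masks, assuming NumPy arrays:

```python
import numpy as np

# oIoU accumulates intersection/union over the whole set; mIoU averages
# per-sample IoU; Prec@X is the fraction of samples with IoU above X.
def evaluate(preds, gts, thresholds=(0.5, 0.7, 0.9)):
    inter_sum, union_sum, ious = 0, 0, []
    for p, g in zip(preds, gts):
        inter = np.logical_and(p, g).sum()
        union = np.logical_or(p, g).sum()
        inter_sum += inter
        union_sum += union
        ious.append(inter / max(union, 1))
    ious = np.array(ious)
    metrics = {'oIoU': inter_sum / max(union_sum, 1), 'mIoU': ious.mean()}
    metrics.update({f'Prec@{t}': float((ious > t).mean()) for t in thresholds})
    return metrics
```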
[Figure: quantitative comparison results]

3.4. Ablation

[Figure: ablation study results]

4. Reference

[Geng et al., 2021] Zhengyang Geng, Meng-Hao Guo, Hongxu Chen, Xia Li, Ke Wei, and Zhouchen Lin. Is attention better than matrix decomposition? arXiv preprint arXiv:2109.04553, 2021.
[Guo et al., 2022] Meng-Hao Guo, Cheng-Ze Lu, Qibin Hou, Zhengning Liu, Ming-Ming Cheng, and Shi-Min Hu. Segnext: Rethinking convolutional attention design for semantic segmentation. arXiv preprint arXiv:2209.08575, 2022.
