1. BaseInfo
Title | SLViT: Scale-Wise Language-Guided Vision Transformer for Referring Image Segmentation |
Address | https://www.ijcai.org/proceedings/2023/0144.pdf |
Journal/Time | IJCAI 2023 |
Author | Zhejiang University |
Code | https://github.com/NaturalKnight/SLViT |
Read | 2024/08/09 |
Table | #RIS #Seg |
2. Creative Q->A
- Visual feature extraction and cross-modal fusion are handled separately, so vision-language alignment is insufficient. -> Language-Guided Multi-Scale Fusion Attention (LMFA)
- The two modules are applied sequentially, so information interaction between them is insufficient. -> Uncertain Region Cross-Scale Enhancement (URCE) module
3. Concrete
3.1. Model
Similar to LAVT, fusion is performed inside the encoder.
3.1.1. Input
Image + text
3.1.2. Backbone
ViT + BERT
Stage 2, 3, and 4 feature maps are obtained by downsampling: a 3 × 3 convolution with stride 2, followed by a batch normalization layer.
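A minimal sketch of such a stage-transition block (channel counts and padding are my assumptions; the note only fixes the 3 × 3 kernel, stride 2, and BN):

```python
import torch
import torch.nn as nn

class Downsample(nn.Module):
    """Halve spatial resolution between encoder stages:
    3x3 conv with stride 2, followed by batch normalization."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1)
        self.bn = nn.BatchNorm2d(out_ch)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.bn(self.conv(x))

x1 = torch.randn(2, 96, 120, 120)  # hypothetical stage-1 map for a 480x480 input
x2 = Downsample(96, 192)(x1)       # -> (2, 192, 60, 60), i.e. stage 2
```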
- Language-Guided Multi-Scale Fusion Attention
- Conv 5 × 5 extracts initial local features.
- Multi-scale convolutions 1 × k_i: three parallel convolution branches with different kernel sizes capture local features under different receptive fields, providing a spatial inductive bias for modeling rich local visual information; the kernel sizes are 7, 11, and 21 (see the sketch after this list).
- Cross Attention
- Gating
- Conv 1 × 1 after fusing the two branches
The GatedCrossModalAttention module in the code is the same as PWAM in LAVT.
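A minimal PyTorch sketch of the LMFA pieces listed above. Module names and the 1×k + k×1 depthwise decomposition (SegNeXt-style) are my assumptions; only the kernel sizes 5/7/11/21, the gating, and the 1 × 1 fusion come from the note:

```python
import torch
import torch.nn as nn

class MultiScaleConv(nn.Module):
    """5x5 conv for initial local features, then three depthwise
    branches (kernel sizes 7/11/21, decomposed as 1xk + kx1) that
    capture different receptive fields; a 1x1 conv fuses the result."""
    def __init__(self, dim: int, ks=(7, 11, 21)):
        super().__init__()
        self.local = nn.Conv2d(dim, dim, 5, padding=2, groups=dim)
        self.branches = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(dim, dim, (1, k), padding=(0, k // 2), groups=dim),
                nn.Conv2d(dim, dim, (k, 1), padding=(k // 2, 0), groups=dim),
            )
            for k in ks
        )
        self.fuse = nn.Conv2d(dim, dim, 1)  # 1x1 conv after fusing branches

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        u = self.local(x)
        return self.fuse(u + sum(b(u) for b in self.branches))

class GatedCrossModalAttention(nn.Module):
    """Visual tokens attend to language tokens; a learned gate scales
    the attended language signal before it is added back (PWAM-like)."""
    def __init__(self, dim: int, lang_dim: int = 768, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, kdim=lang_dim,
                                          vdim=lang_dim, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Tanh())

    def forward(self, vis: torch.Tensor, lang: torch.Tensor) -> torch.Tensor:
        # vis: (B, HW, C) flattened visual tokens; lang: (B, L, 768)
        attended, _ = self.attn(vis, lang, lang)
        return vis + self.gate(attended) * attended
```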
- Uncertain Region Cross-Scale Enhancement
Multi-head self-attention
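The note only records that URCE uses multi-head self-attention. Below is a heavily hedged sketch of one plausible reading, joint MHSA over tokens concatenated from two scales so they exchange information directly; everything beyond the MHSA itself is my assumption:

```python
import torch
import torch.nn as nn

class CrossScaleMHSA(nn.Module):
    """Joint multi-head self-attention over tokens from two scales,
    so fine and coarse features can enhance each other."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, fine: torch.Tensor, coarse: torch.Tensor):
        # fine: (B, N1, C), coarse: (B, N2, C) flattened feature maps
        tokens = torch.cat([fine, coarse], dim=1)
        out, _ = self.attn(tokens, tokens, tokens)
        n1 = fine.shape[1]
        return out[:, :n1], out[:, n1:]  # split back per scale
```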
3.1.3. Neck
3.1.4. Decoder
Hamburger [Geng et al., 2021]: a Hamburger function, a 1 × 1 convolution, and an upsampling function.
ImageNet-22K pretrained weights are adopted from SegNeXt [Guo et al., 2022].
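A toy sketch of the Hamburger idea from [Geng et al., 2021]: model global context by factorizing the flattened feature map with a few NMF multiplicative-update steps, then reconstructing. This is a simplification (the real module uses one-step gradients and other details omitted here), not the authors' implementation; the head structure below just mirrors the three components listed above:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleHam(nn.Module):
    """Toy Hamburger: factorize the flattened map V ~ W @ H with a few
    NMF multiplicative updates, reconstruct, and add back residually."""
    def __init__(self, dim: int, rank: int = 64, steps: int = 6):
        super().__init__()
        self.rank, self.steps = rank, steps
        self.lower = nn.Conv2d(dim, dim, 1)  # "lower bread"
        self.upper = nn.Conv2d(dim, dim, 1)  # "upper bread"

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        v = F.relu(self.lower(x)).view(b, c, h * w)  # non-negative V
        # random non-negative init for bases W and coefficients H
        bases = torch.rand(b, c, self.rank, device=x.device, dtype=x.dtype)
        coef = torch.rand(b, self.rank, h * w, device=x.device, dtype=x.dtype)
        for _ in range(self.steps):  # multiplicative NMF updates
            coef = coef * (bases.transpose(1, 2) @ v) \
                 / (bases.transpose(1, 2) @ bases @ coef + 1e-6)
            bases = bases * (v @ coef.transpose(1, 2)) \
                  / (bases @ (coef @ coef.transpose(1, 2)) + 1e-6)
        recon = (bases @ coef).view(b, c, h, w)
        return x + self.upper(recon)  # residual around the "ham"

class SegHead(nn.Module):
    """Hamburger function, 1x1 classifier, then upsample to input size."""
    def __init__(self, dim: int, n_cls: int = 2):
        super().__init__()
        self.ham = SimpleHam(dim)
        self.cls = nn.Conv2d(dim, n_cls, 1)

    def forward(self, x: torch.Tensor, out_size) -> torch.Tensor:
        return F.interpolate(self.cls(self.ham(x)), size=out_size,
                             mode="bilinear", align_corners=False)
```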
3.1.5. Loss
Cross-entropy (CE).
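The note just says CE; assuming the common per-pixel two-class (background/target) formulation used in LAVT-style RIS training, the loss is simply:

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()
logits = torch.randn(4, 2, 480, 480)          # (B, 2, H, W) decoder output
target = torch.randint(0, 2, (4, 480, 480))   # binary ground-truth mask
loss = criterion(logits, target)              # per-pixel cross-entropy
```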
3.2. Training
BERT: 12 layers, hidden dimension 768.
In the convolutional branches of LMFA, kernel sizes k1 = 7, k2 = 11, k3 = 21 are used. The remaining weights in the model are randomly initialized.
AdamW optimizer with weight decay 0.01.
The learning rate is initialized to 3e-5 and scheduled by polynomial learning-rate decay with a power of 0.9.
Epochs: 60, batch size: 16.
Image size: 480 × 480.
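A sketch of this optimization recipe (whether the decay is stepped per iteration or per epoch is my assumption; `model` and `iters_per_epoch` are placeholders):

```python
import torch.nn as nn
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

model = nn.Linear(4, 2)   # placeholder for SLViT
iters_per_epoch = 1000    # placeholder; depends on dataset and batch size

epochs, base_lr, power = 60, 3e-5, 0.9
optimizer = AdamW(model.parameters(), lr=base_lr, weight_decay=0.01)
total_iters = epochs * iters_per_epoch
# polynomial decay: lr_t = base_lr * (1 - t / total_iters) ** 0.9
scheduler = LambdaLR(optimizer, lambda t: (1 - t / total_iters) ** power)
```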
3.2.1. Resource
3.2.2. Dataset
Name | Images | References | Referring Expressions | Task | Note |
---|---|---|---|---|---|
RefCOCO | 19,994 | 50,000 | 142,209 | Referring Expression Segmentation | |
RefCOCO+ | 19,992 | 49,856 | 141,564 | Referring Expression Segmentation | |
G-Ref | 26,711 | 54,822 | 104,560 | Referring Expression Segmentation | Longer expressions than the previous two; fewer objects |
3.3. Eval
overall intersection-over-union (oIoU),
mean intersection-over-union (mIoU),
precision at the 0.5, 0.7, and 0.9 threshold values.
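A sketch of how these three metrics are typically computed (toy masks; Prec@X counts the fraction of samples whose IoU exceeds the threshold):

```python
import torch

def iou_terms(pred: torch.Tensor, gt: torch.Tensor):
    """pred, gt: (N, H, W) boolean masks -> per-sample (I, U) as floats."""
    inter = (pred & gt).flatten(1).sum(1).float()
    union = (pred | gt).flatten(1).sum(1).float()
    return inter, union

# toy predictions / ground truth for 8 samples
pred = torch.rand(8, 480, 480) > 0.5
gt = torch.rand(8, 480, 480) > 0.5

inter, union = iou_terms(pred, gt)
oiou = inter.sum() / union.sum()             # oIoU: pool I and U over the set
sample_iou = inter / union.clamp(min=1.0)    # guard against empty unions
miou = sample_iou.mean()                     # mIoU: average per-sample IoU
prec = {t: (sample_iou > t).float().mean() for t in (0.5, 0.7, 0.9)}  # P@X
```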
3.4. Ablation
4. Reference
[Geng et al., 2021] Zhengyang Geng, Meng-Hao Guo, Hongxu Chen, Xia Li, Ke Wei, and Zhouchen Lin. Is attention better than matrix decomposition? arXiv preprint arXiv:2109.04553, 2021.
[Guo et al., 2022] Meng-Hao Guo, Cheng-Ze Lu, Qibin Hou, Zhengning Liu, Ming-Ming Cheng, and Shi-Min Hu. Segnext: Rethinking convolutional attention design for semantic segmentation. arXiv preprint arXiv:2209.08575, 2022.
5. Think
Fig. 2 of the paper uses visualizations to show this work's strength on fine-grained regions very well, e.g., the separation of the spoon handle.