Open-Vocabulary Panoptic Segmentation with Text-to-Image Diffusion Models

承钺

Last modified: 2023-03-09 19:02:04

Categories: diffusion model, paper study notes. Tags: computer vision

First published: 2023-03-09 18:19:10

Article link: https://blog.csdn.net/qq_41436098/article/details/129425280

Paper link:
Open-Vocabulary Panoptic Segmentation with Text-to-Image Diffusion Models
Code:
Code

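Abstract

Text-to-image diffusion models generate high-quality images conditioned on text embeddings, which suggests that their internal representations are strongly correlated with high-level semantic concepts. CLIP, in turn, gives images accurate open-set predictions (that is, zero-shot classification ability). Combining the two representation spaces therefore enables efficient Open-Vocabulary Panoptic Segmentation. The authors verify experimentally that the approach is feasible and report state-of-the-art (SOTA) performance.

Main contributions

The first use of a diffusion model for the open-vocabulary (open-set) segmentation task.
An efficient open-vocabulary learning framework that combines a text-to-image diffusion model with CLIP.
A clear performance improvement over other methods.

Method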

Training

(Figure: training pipeline diagram)

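The overall training pipeline is simple and efficient. The diagram is shown above; the rough flow is:

A pretrained text-to-image diffusion model replaces the conventional CNN/Transformer backbone for extracting image features. Because this model needs a text embedding for the image, the authors design an Implicit Captioner to produce one for the input image: a pretrained CLIP model extracts image features, and a learnable MLP maps them into the text space to obtain the corresponding text embedding, which is fed to the diffusion model.
Mask2Former is used to generate $N = 100$ proposal masks, and a mask embedding is then obtained for each proposal by pooling.
Each proposal embedding has a category label, so a cross-entropy classification loss can be optimized. (Some details seem to be left open here, though: the proposals differ from the ground-truth masks, so how are category labels assigned to them?)
The authors also use a grounding loss based on image-caption pairs. For the COCO dataset, one caption of the input image is chosen at random and its nouns are extracted to form the image's set of candidate categories (category words)
$\mathbf{C}_{\text{word}}=\{w_{k}\}^{K_{\text{word}}}_{k=1}$. The similarity between the proposal embeddings and this category set is then computed (in essence, the authors convert the image's caption into entity category labels for metric learning):
$g\left(x^{(m)}, s^{(m)}\right)=\frac{1}{K} \sum_{k=1}^{K} \sum_{i=1}^{N} \mathbf{p}\left(z_{i}, \mathbf{C}_{\text {word }}\right)_{k} \cdot\left\langle z_{i}, \mathcal{T}\left(w_{k}\right)\right\rangle$
Here $z_{i}$ is the $i$-th proposal embedding, $\mathcal{T}(w_{k})$ is the text embedding of word $w_{k}$, and $K$ is the number of candidate words, i.e. $K_{\text{word}}$ above. Within a batch of input images, each image has exactly one positive caption, its own; the captions of the other images can be treated approximately as negatives, so a CLIP-style contrastive loss can be used:
$\begin{aligned} \mathcal{L}_{\mathrm{G}}= & -\frac{1}{B} \sum_{m=1}^{B} \log \frac{\exp \left(g\left(x^{(m)}, s^{(m)}\right) / \tau\right)}{\sum_{n=1}^{B} \exp \left(g\left(x^{(m)}, s^{(n)}\right) / \tau\right)} \\ & -\frac{1}{B} \sum_{m=1}^{B} \log \frac{\exp \left(g\left(x^{(m)}, s^{(m)}\right) / \tau\right)}{\sum_{n=1}^{B} \exp \left(g\left(x^{(n)}, s^{(m)}\right) / \tau\right)}. \end{aligned}$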
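The similarity $g$ and the grounding loss $\mathcal{L}_{\mathrm{G}}$ can be sketched in a few lines of plain Python. This is only a minimal sketch, not the authors' implementation. It assumes that $\mathbf{p}(z_i, \mathbf{C}_{\text{word}})$ is a softmax over the candidate words of the temperature-scaled dot products (these notes do not define it), reuses a single `tau` for that softmax and for the contrastive loss, and uses nested lists in place of tensors. The function names `grounding_similarity` and `grounding_loss` are made up for illustration.

```python
import math


def dot(u, v):
    """Inner product <u, v> of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def log_softmax(scores):
    """Numerically stable log-softmax over a list of scores."""
    m = max(scores)
    lse = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - lse for s in scores]


def softmax(scores):
    return [math.exp(v) for v in log_softmax(scores)]


def grounding_similarity(mask_embs, word_embs, tau=1.0):
    """g(x, s) for one image and one caption.

    mask_embs: list of N proposal embeddings z_i
    word_embs: list of K word embeddings T(w_k) from the caption's nouns
    ASSUMPTION: p(z_i, C_word) = softmax_k(<z_i, T(w_k)> / tau).
    """
    K = len(word_embs)
    total = 0.0
    for z in mask_embs:
        sims = [dot(z, t) for t in word_embs]
        probs = softmax([s / tau for s in sims])
        total += sum(p * s for p, s in zip(probs, sims))
    return total / K


def grounding_loss(batch_mask_embs, batch_word_embs, tau=1.0):
    """Symmetric contrastive loss L_G over a batch of B image-caption pairs."""
    B = len(batch_mask_embs)
    # logits[m][n] = g(x^(m), s^(n)) / tau
    logits = [
        [grounding_similarity(batch_mask_embs[m], batch_word_embs[n], tau) / tau
         for n in range(B)]
        for m in range(B)
    ]
    loss = 0.0
    for m in range(B):
        row = logits[m]                          # image m vs. every caption
        col = [logits[n][m] for n in range(B)]   # caption m vs. every image
        loss -= log_softmax(row)[m]
        loss -= log_softmax(col)[m]
    return loss / B


# Toy batch: two images with two proposal embeddings each, one noun per caption.
images = [[[1.0, 0.0], [0.9, 0.1]], [[0.0, 1.0], [0.1, 0.9]]]
captions = [[[1.0, 0.0]], [[0.0, 1.0]]]
loss = grounding_loss(images, captions, tau=1.0)
```

In this toy batch each image's proposal embeddings align with its own caption word, so the loss is lower than it would be with the two captions swapped, which is the behaviour the contrastive objective rewards.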

Inference

The inference pipeline is shown below:

(Figure: inference pipeline diagram)

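As in training, the diffusion model cannot extract features without a text embedding as input, so the trained implicit captioner is used to obtain the text embedding. The extracted features are then passed to the mask generator to obtain the proposals.
The authors find that although the internal representation of the diffusion model already classifies proposal masks reasonably well, combining it with a discriminative model such as CLIP gives better classification accuracy. The final prediction therefore fuses the classification result from CLIP features with the one from diffusion features: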
$\mathbf{p}_{\text {final }}\left(z_{i}, \mathbf{C}_{\text {test }}\right) \propto \mathbf{p}\left(z_{i}, \mathbf{C}_{\text {test }}\right)^{\lambda} \mathbf{p}\left(z_{i}^{\prime}, \mathbf{C}_{\text {test }}\right)^{(1-\lambda)}.$
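The fusion rule is a weighted geometric mean of two class distributions, renormalized over the test categories. A minimal sketch follows, assuming that $z_i$ is the diffusion-based proposal embedding and $z_i^{\prime}$ its CLIP-based counterpart (my reading of the formula), and that both distributions are already computed over the same category list. The function name `geometric_ensemble` and the default `lam=0.5` are illustrative choices, not values taken from the paper.

```python
def geometric_ensemble(p_diffusion, p_clip, lam=0.5):
    """Fuse two class distributions: p_final ∝ p_diffusion^lam * p_clip^(1 - lam).

    p_diffusion: class probabilities from the diffusion-based embedding z_i
    p_clip:      class probabilities from the CLIP-based embedding z_i'
    lam:         balance factor in [0, 1]; 0.5 is an arbitrary default (ASSUMPTION)
    """
    if len(p_diffusion) != len(p_clip):
        raise ValueError("both distributions must cover the same categories")
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    fused = [(a ** lam) * (b ** (1.0 - lam)) for a, b in zip(p_diffusion, p_clip)]
    total = sum(fused)
    if total == 0.0:
        raise ValueError("the two distributions have no category in common")
    # Renormalize so the fused scores sum to 1 over the test categories.
    return [f / total for f in fused]


# Toy example with three test categories.
p_final = geometric_ensemble([0.7, 0.2, 0.1], [0.5, 0.4, 0.1], lam=0.5)
```

Setting `lam=1.0` recovers the diffusion-only prediction and `lam=0.0` the CLIP-only prediction, so $\lambda$ directly controls how much each model contributes.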