Prompt-Free Diffusion Study Notes

kangxi11122344

Last modified: 2023-06-06 20:08:17

Tags: study notes, deep learning

First published: 2023-06-06 18:55:55

Article link: https://blog.csdn.net/kangxi11122344/article/details/131013881

Prompt-Free Diffusion: Taking “Text” out of Text-to-Image Diffusion Models

- Prompt-Free Diffusion (introducing SeeCoder)
  - Prompt-Free Diffusion
  - Semantic Context Encoder (SeeCoder)

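motivation: text prompt engineering. Getting a high-quality image requires spending time crafting a high-quality prompt.
contribution: the paper proposes removing the "text" from a pretrained T2I model, which gives prompt-free diffusion.
The core component is the Semantic Context Encoder (SeeCoder).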

Prompt-Free Diffusion (introducing SeeCoder)

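A pixel-based image of arbitrary resolution is automatically converted into meaningful visual embeddings, which can represent low-level information (textures, effects) as well as high-level information (objects, semantics).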

Prompt-Free Diffusion

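The newly proposed SeeCoder replaces CLIP's text encoder.
In a typical T2I model, the text prompt is first tokenized and then encoded by CLIP into N-by-C context embeddings (N is the number of embeddings, C is their dimension).
SeeCoder takes only an image as input, captures its visual cues, and converts them into compatible N-by-C embeddings that represent textures, objects, backgrounds, and so on.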
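
The compatibility point above can be illustrated with a minimal NumPy sketch: a cross-attention layer only needs context embeddings whose channel size C matches its key/value projections, so N may differ between a text encoder and an image encoder. All sizes, weights, and the single-head layout here are illustrative assumptions, not values from the paper, and the random arrays merely stand in for real CLIP or SeeCoder outputs.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(latents, context, w_q, w_k, w_v):
    """Single-head cross-attention: latent tokens attend to context embeddings."""
    q = latents @ w_q                          # (L, D)
    k = context @ w_k                          # (N, D)
    v = context @ w_v                          # (N, D)
    scores = q @ k.T / np.sqrt(q.shape[-1])    # (L, N)
    return softmax(scores, axis=-1) @ v        # (L, D)

rng = np.random.default_rng(0)
L, C_LATENT, C_CTX, D = 64, 320, 768, 64       # illustrative sizes (assumption)
w_q = rng.normal(size=(C_LATENT, D)) * 0.02
w_k = rng.normal(size=(C_CTX, D)) * 0.02
w_v = rng.normal(size=(C_CTX, D)) * 0.02

latents = rng.normal(size=(L, C_LATENT))
text_context = rng.normal(size=(77, C_CTX))    # stand-in for text-encoder output
image_context = rng.normal(size=(100, C_CTX))  # stand-in for SeeCoder output, arbitrary N

# The same frozen projections accept either context, because only C must match.
out_text = cross_attention(latents, text_context, w_q, w_k, w_v)
out_image = cross_attention(latents, image_context, w_q, w_k, w_v)
```

Both calls return an array of the same shape, which is why the diffuser itself does not need to change when the encoder is swapped.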

Semantic Context Encoder (SeeCoder)

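SeeCoder has three parts: Backbone Encoder, Decoder, and Query Transformer.
Backbone Encoder: uses SWIN-L, because it converts images of arbitrary resolution into a feature pyramid and so captures visual cues at different scales better.
Decoder (a transformer-based network with several convolutions): passes the features extracted by the encoder through 6 multi-head self-attention modules (with linear projections and LayerNorms); the resulting 2D outputs are summed with the lateral-linked input features (what exactly is this?).
Query Transformer
Condenses the multi-level visual features into a single 1D visual embedding.
Consists of a mix of several cross-attention and self-attention layers.
cross-attention: the local queries serve as Q and the visual features as K and V; purpose: transfer the visual features into the local queries.
self-attention: the concatenation of the global queries and local queries serves as Q, K, and V; purpose: distill the local queries into the global queries.
The global queries and local queries are concatenated and passed to the diffuser to generate content.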
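
The Query Transformer steps above can be sketched in NumPy as follows. This is a rough illustration, not the paper's implementation: the single-head attention, the residual connections, the use of one cross-attention plus one self-attention layer per feature level, the query counts, and the helper names `attention`, `init_weights`, and `query_transformer` are all assumptions, and LayerNorms and feed-forward layers are left out.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q_in, kv_in, w_q, w_k, w_v):
    """Single-head attention; q_in supplies Q, kv_in supplies K and V."""
    q, k, v = q_in @ w_q, kv_in @ w_k, kv_in @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def init_weights(rng, c):
    # Hypothetical helper: random Q/K/V projections of size (c, c)
    return [rng.normal(size=(c, c)) / np.sqrt(c) for _ in range(3)]

def query_transformer(levels, global_q, local_q, cross_w, self_w):
    """levels: list of flattened feature maps, each of shape (H_i * W_i, C)."""
    n_global = global_q.shape[0]
    for feats, cw, sw in zip(levels, cross_w, self_w):
        # Cross-attention: local queries (Q) read from visual features (K, V)
        local_q = local_q + attention(local_q, feats, *cw)
        # Self-attention: concatenated global + local queries act as Q, K, and V,
        # so the global queries can absorb information from the local ones
        tokens = np.concatenate([global_q, local_q], axis=0)
        tokens = tokens + attention(tokens, tokens, *sw)
        global_q, local_q = tokens[:n_global], tokens[n_global:]
    # Concatenated queries form the N-by-C context handed to the diffuser
    return np.concatenate([global_q, local_q], axis=0)

rng = np.random.default_rng(0)
C = 32                                           # illustrative channel size
levels = [rng.normal(size=(h * w, C)) for h, w in [(16, 16), (8, 8), (4, 4)]]
global_q = rng.normal(size=(4, C))               # illustrative query counts
local_q = rng.normal(size=(12, C))
cross_w = [init_weights(rng, C) for _ in levels]
self_w = [init_weights(rng, C) for _ in levels]

context = query_transformer(levels, global_q, local_q, cross_w, self_w)
```

Here `context` has N = 4 + 12 rows and C columns, independent of the spatial size of the input feature maps, which is what lets images of arbitrary resolution map to a fixed-size embedding.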

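Training: only SeeCoder's decoder and query transformer are trained, using regular training with the variational lower-bound loss and the required gradients. All other weights (the VAE, the diffuser, and SeeCoder's backbone encoder) stay frozen.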
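
These notes do not write the loss out. For latent diffusion models the variational lower bound is commonly trained in its simplified noise-prediction form; assuming that also applies here, the objective can be sketched as below. The symbols are not from the original notes: $x$ is the reference image, $z_t$ the noised VAE latent at step $t$, $\epsilon_\theta$ the frozen diffuser, and $\phi$ the trainable weights of SeeCoder's decoder and query transformer.

$$
\mathcal{L}(\phi) = \mathbb{E}_{z_0,\ \epsilon \sim \mathcal{N}(0, I),\ t}\left[ \left\lVert \epsilon - \epsilon_\theta\big(z_t,\ t,\ \mathrm{SeeCoder}_\phi(x)\big) \right\rVert_2^2 \right]
$$

Under this reading, gradients pass through the frozen diffuser and update only $\phi$; $\theta$, the VAE, and the backbone encoder are left unchanged.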
