InstantStyle: A Technical Summary

Latest recommended article published on 2025-04-02 15:18:58

莫叶何竹

Category: diffusion model. Tags: instant style, stable diffusion

Article link: https://blog.csdn.net/weixin_40779727/article/details/139332615

paper	https://arxiv.org/abs/2404.02733
code	https://github.com/InstantStyle/InstantStyle
org	InstantX
personal blog	http://myhz0606.com/article/instantStyle

Prerequisite reading: IP-Adapter

Motivation

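InstantStyle addresses tuning-free style transfer from a reference image. Its core architecture follows IP-Adapter, but IP-Adapter has two pain points when used for reference-image style transfer:

Content leakage: content from the reference image (its objects, for example) is carried into the result along with the style.
Introducing the image condition weakens the text condition.

Adjusting the image weight can alleviate this, but the weight has to be tuned by hand and success is not guaranteed. InstantStyle is designed around these two pain points.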

Method

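The core architecture of InstantStyle is still IP-Adapter. It makes two main optimizations.

Optimization 1: a simple but effective method that disentangles the object and the style of the reference image to alleviate content leakage.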

Let the reference image be $R$ and the text describing the object in the reference image be $P_{ref}$. The image embedding that the original IP-Adapter computes for the reference image is

$\mathrm{LN}(\mathrm{Proj}(\mathrm{CLIPEncoder}_{\mathrm{img}}(R)))$

while the image embedding that InstantStyle computes is:

$\mathrm{LN}(\mathrm{Proj}(\mathrm{CLIPEncoder}_{\mathrm{img}}(R) - \mathrm{CLIPEncoder}_{\mathrm{text}}(P_{ref})))$

Here $\mathrm{LN}$ is layer normalization and $\mathrm{Proj}$ is a fully connected (linear) projection layer.

Put simply, the object feature from the CLIP text encoder is subtracted from the CLIP image feature before projection.
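The subtraction above can be sketched in a few lines. This is only a toy illustration, not the repository's implementation: the random vectors stand in for CLIP image and text features (which must live in the same embedding space for the subtraction to make sense), the dimensions are arbitrary, the projection is bias-free, the layer norm has no learned affine parameters, and the names `layer_norm`, `linear`, and `style_embedding` are hypothetical. Plain Python lists are used to keep the sketch dependency-free.

```python
import math
import random


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned affine)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def linear(x, weight):
    """Bias-free fully connected layer; weight has shape [out_dim][in_dim]."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def style_embedding(image_feat, content_text_feat, weight):
    """Compute LN(Proj(image_feat - content_text_feat))."""
    diff = [i - t for i, t in zip(image_feat, content_text_feat)]
    return layer_norm(linear(diff, weight))


random.seed(0)
in_dim, out_dim = 8, 4  # toy sizes, chosen only for illustration

# Stand-ins for CLIPEncoder_img(R) and CLIPEncoder_text(P_ref).
image_feat = [random.gauss(0, 1) for _ in range(in_dim)]
content_text_feat = [random.gauss(0, 1) for _ in range(in_dim)]
weight = [[random.gauss(0, 1) for _ in range(in_dim)] for _ in range(out_dim)]

emb = style_embedding(image_feat, content_text_feat, weight)
print(len(emb))
```

One consequence visible in the sketch: if the text feature equals the image feature, the difference is zero and nothing from the reference image reaches the projection, so the quality of the object description $P_{ref}$ directly controls how much content is removed.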

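Optimization 2: reduce the number of layers whose cross-attention is replaced with decoupled cross-attention, which improves IP-Adapter's prompt following.

By analyzing the different attention layers in the diffusion model (DM), the authors found that up_blocks.0.attentions.1 and down_blocks.2.attentions.1 capture style and spatial layout, respectively, most strongly. To reduce the impact of IP-Adapter's image condition on prompt following, cross-attention is replaced with decoupled cross-attention only in these two layers.
(In my view, the experiments for this part are not thorough enough.)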

Code location: https://github.com/InstantStyle/InstantStyle/blob/main/ip_adapter/ip_adapter.py#L165
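The layer selection can be sketched as plain name matching. This is a minimal sketch under assumptions, not a copy of the linked code: the function `plan_processors` is hypothetical, the layer names are illustrative strings written in the diffusers naming pattern (`attn1` for self-attention, `attn2` for cross-attention), and the returned labels are just descriptions rather than real attention processor objects.

```python
STYLE_BLOCK = "up_blocks.0.attentions.1"
LAYOUT_BLOCK = "down_blocks.2.attentions.1"


def plan_processors(layer_names, target_blocks):
    """Decide, per attention layer, whether the image condition is injected."""
    plan = {}
    for name in layer_names:
        if name.endswith("attn1.processor"):
            # Self-attention never sees the text or image condition.
            plan[name] = "self-attention"
        elif any(block in name for block in target_blocks):
            # Cross-attention inside a target block: text branch plus image branch.
            plan[name] = "decoupled cross-attention"
        else:
            # Every other cross-attention layer keeps the text condition only.
            plan[name] = "text-only cross-attention"
    return plan


# Illustrative names only; a real UNet exposes many more layers.
layer_names = [
    "down_blocks.1.attentions.0.transformer_blocks.0.attn2.processor",
    "down_blocks.2.attentions.1.transformer_blocks.0.attn1.processor",
    "down_blocks.2.attentions.1.transformer_blocks.0.attn2.processor",
    "mid_block.attentions.0.transformer_blocks.0.attn2.processor",
    "up_blocks.0.attentions.1.transformer_blocks.0.attn2.processor",
    "up_blocks.0.attentions.2.transformer_blocks.0.attn2.processor",
]

plan = plan_processors(layer_names, [STYLE_BLOCK, LAYOUT_BLOCK])
for name, kind in plan.items():
    print(f"{kind:28s} {name}")
```

Passing only `STYLE_BLOCK` as the target list would restrict the image condition to the style layer, while passing every block name would correspond to the original IP-Adapter behavior of replacing all cross-attention layers.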
