diffusion model (九) EmuEdit技术小结

Paper: https://emu-edit.metademolab.com/assets/emu_edit.pdf

Project web: https://emu-edit.metademolab.com/

Code: have not opensource

背景

在发布Emu后,近日,META又发布了两个非常惊艳的工作:EmuEdit、EmuVideo。文本将对EmuEdit相关技术进行总结。

在这里插入图片描述

1 核心思想

作者将intruction-base image editing任务建模为生成任务,并用diffusion model进行求解。核心创新点有两个

  • 详细定义了instruction-based image edit处理的任务,并设计了一个高效高质量的数据构建方法。
  • 为提升模型对instruction的理解能力,引入learnable task embedding,能较好的解决上述问题。并且提出task inversion的训练方法,只需少量数据就能有效将模型扩展到新的task(类似textual inversion的思想)。

2 方法

2.1 方法建模

前面提到,作者将一系列的intruction-base image editing任务建模为生成任务,并用diffusion model来求解。具体来看intruction-base image editing任务做的是这么一件事:给定一张参考图片和一段表述文本,输出符合上述两个条件的图片。从上述描述可知:intruction-base image editing的训练数据应当至少是一个三元组 D = { ( c T i , c I i , x i ) ∣ i = 1 , ⋯ N } \mathcal{D} = \{(c_T^{i}, c_I^{i}, x^{i})|i = 1, \cdots N\} D={(cTi,cIi,xi)i=1,N},其中

c I c_I cI: 参考图片(condition of image)

c T c_T cT: 参考文本(condition of text)

x x x: 目标图片

这样,基于diffusion model的优化目标可建模为:
min ⁡ θ E y , ϵ , t [ ∥ ϵ − ϵ θ ( z t , t , E ( c I ) , c T ) ∥ 2 2 ] 2 (1) \min _ { \theta } \mathbb{E} _ { y , \epsilon , t } [ \Vert \epsilon - \epsilon _ { \theta } ( z _ { t } , t , E ( c _ { I } ) , c _ { T } ) \Vert _ { 2 } ^ { 2 } ] ^ { 2 } \tag{1} θminEy,ϵ,t[ϵϵθ(zt,t,E(cI),cT)22]2(1)
和经典的classifier-free有所区别的是,此处多了一个参考图片的condition E ( c I ) E ( c _ { I } ) E(cI)。条件融入的方法上,

  • 作者参考Instructpix2pix将image condition融入到输入层(在通道维度进行concat)。

  • 参考classifier-free将text condition的融入在cross-attention。

通过实验,作者发现用上述方法训练的模型对task的理解不够准确如下图所示。为此,作者引入learnable task embedding来增强模型对task的理解。此时的优化目标建模为:
min ⁡ θ , v 1 , … , v k E y ^ , ϵ , t [ ∥ ϵ − ϵ θ ( z t , t , E ( c I ) , c T , v i ) ∥ 2 2 ] (2) \min _ { \theta , v _ { 1 } , \dots , v _ { k } } \mathbb{E} _ { \hat { y } , \epsilon , t } [ \Vert \epsilon - \epsilon _ { \theta } ( z _ { t } , t , E ( c _ { I } ) , c _ { T } , v _ { i } ) \Vert _{2}^{2} ] \tag{2} θ,v1,,vkminEy^,ϵ,t[ϵϵθ(zt,t,E(cI),cT,vi)22](2)
为了求解上述目标方程,构造的训练数据集的每一个元素应当是一个四元组 D n e w = { ( c T i , c I i , x i , v j i ) ∣ i = 1 , ⋯ N } \mathcal{D}_{new} = \{(c_T^{i}, c_I^{i}, x^{i}, v^{i}_{j})|i = 1, \cdots N\} Dnew={(cTi,cIi,xi,vji)i=1,N} v j v_j vj为这条数据所属的task类别。并且此时的diffusion model的噪声预测模型多了一个task embedding v i v_i vi的条件。作者的融入方式是将其与time-step的embedding进行相加,共同融入到cross-attention中。这样设计还保留了可扩展性:当有一个新的task时,可以将优化目标转化为
min ⁡ v n e w E y , ϵ , t [ ∥ ϵ − ϵ θ ( z t , t , E ( c I ) , c T , v n e w ) ∥ 2 2 (3) \min _ { v _ { \mathrm{n e w} } } \mathbb{E} _ { y , \epsilon , t } [ \Vert \epsilon - \epsilon _ { \theta } ( z _ { t } , t , E ( c _ { I } ) , c _ { T } , v _ { n e w } ) \Vert _ { 2 } ^ { 2 } \tag{3} vnewminEy,ϵ,t[ϵϵθ(zt,t,E(cI),cT,vnew)22(3)
此时训练的参数仅为新增的task embedding,其它参数都freeze。作者将其称之为task inversion(类似textual inversion)。

在用户层的推理阶段,用户无需输入task index,作者基于Flan-T5-XL训练了一个task index预测模型,来根据用户输入的instruction预测出相应的task index。
在这里插入图片描述

从实现原理上,上述方法不难想到。论文取得的卓越的效果取决于训练的数据集。下面来看作者是如何用一种高效的方法构建高质量的数据集。

2.2 数据工程

前文提到,训练一个image-edit diffusion model训练数据至少是一个三元组 D = { ( c T i , c I i , x i ) ∣ i = 1 , ⋯ N } \mathcal{D} = \{(c_T^{i}, c_I^{i}, x^{i})|i = 1, \cdots N\} D={(cTi,cIi,xi)i=1,N} (其中 c I c_I cI: 参考图片(condition of image) c T c_T cT: 参考文本(condition of text) x x x: 目标图片)。手动构建数据集的成本非常大,开源数据规模又不够大,一些规模大的合成数据多样性和质量又不高,因此需要探寻如何用cheap的方法来构建一个高质量、大规模、高多样的image-edit数据集。为了结合task inversion,新构建的数据集应当是一个四元组 D n e w = { ( c T i , c I i , x i , v j i ) ∣ i = 1 , ⋯ N } \mathcal{D}_{new} = \{(c_T^{i}, c_I^{i}, x^{i}, v^{i}_{j})|i = 1, \cdots N\} Dnew={(cTi,cIi,xi,vji)i=1,N}, v j v_j vj为这条数据所属的task类别。

2.2.1 image-edit任务类别定义

作者将image-edit分为了三大类,分别是Region-based Editing、Free-From Editing、Vision tasks,每个大类中有若干小类。下图展示了每一个image-edit任务所做的事

在这里插入图片描述

2.2.2 指令集生成

任务定义:已知image caption和编辑任务,输出满足编辑任务新的caption

  • 输入:image caption + edit任务

  • 输出:edit instruction, edit instruction应当包含:1)edit指令;2)edit的目标(edited object);3)新的image caption;4)原始目标(original object)(7.2节提到有这个字段,但在7.1中的示例没有,实际上应当要有这个字段,否则后续的mask提取无法进行)

举个例子(对于add的image-edit任务)

输入:{“image_caption”: “Beautiful cat with mojito sitting in a cafe on the street”, “task”: “Add”}

输出:{“edit”: “include a hat”, “edited object”: “hat”, “output”: “Beautiful cat wearing a hat with mojito sitting in a cafe on the street”, “original object”: “cat”}

作者用context learning的方法来实现上述任务的目标。作者构建的prompt方案如下:(作者所用的LLM是微调了的70B LLama2,我用chatgpt尝试了一下,也能实现类似的效果)

import random 
import torch 
from random import choice, shuffle

few_shot_examples = [
    """
    [INST]User: "Beautiful cat with mojito sitting in a cafe on the street"[/INST] 
    Assistant: {"edit": "include a hat", "edited object": "hat", "output": "Beautiful cat wearing a hat with mojito sitting in a cafe on the street"}""",
    """
    [INST]User: "robot playing chess at home."[/INST] 
    Assistant: {"edit": "add a cheerful smiling face.", "edited object": "robot", "output": "robot playing chess at home with a cheerful smiling face."} """,
    """
    [INST]User: "A cute creature sits at the beach."[/INST] 
    Assistant: {"edit": "set a dog besides the creature", "edited object": "dog", "output": "A cute creature and a dog sit at the beach."} """,
    """
    [INST]User: "Superhero on the street in sunny day working on his tablet."[/INST] 
    Assistant: {"edit": "put a vintage tie on the superhero.", "edited object": "tie", "output": "Superhero with a vintage tie on the street in sunny day working on his tablet."} """,
    """
    [INST]User: "Picture clouds, birds, the wind, foliage, rainbow, hill, art, pair, guy"[/INST] 
    Assistant: {"edit": "together with a dog on the left", "edited object": "dog", "output": "Picture clouds, birds, the wind, foliage, rainbow, hill, art, pair, guy, dog on the left"} """,
    """
    [INST]User: "horse on a red Boat Near Mountains During Golden Hour"[/INST] 
    Assistant: {"edit": "give the horse sunglassess", "edited object": "sunglassess", "output": "horse with sunglassess on a red Boat Near Mountains During Golden Hour"} """,
    """
    [INST]User: "An animal family on studio background."[/INST] 
    Assistant: {"edit": "make them hold a teddy bear.", "edited object": "teddy bear", "output": "An animal family holding a teddy bear on studio background."} """,
    """
    [INST]User: "Baked Salmon With Bell Peppers"[/INST] 
    Assistant: {"edit": "insert kale pesto to the dish", "edited object": "kale pesto", "output": "Baked Salmon With Kale Pesto And Bell Peppers"} """,
    """
    [INST]User: "An airplaine is flying in the sky in rainy day."[/INST] 
    Assistant: {"edit": "add flowers in the windows", "edited object": "flowers", "output": "An airplaine with flowers in the windows is flying in the sky in rainy day."} """,
    """
    [INST]User: "photo of mountains and trees"[/INST] 
    Assistant: {"edit": "position a castle between the trees", "edited object": "castle", "output": "photo of mountains, trees and castle between the trees"} """,
    """
    [INST]User: "Little bunny in the park"[/INST] 
    Assistant: {"edit": "Make the bunny play with a kite.", "edited object": "kite", "output": "Little bunny playing with a kite in the park"} """,
    """
    [INST]User: "Attic Bedroom With Large Ceilings"[/INST] 
    Assistant: {"edit": "decorate the room with beautiful chandeliers", "edited object": "chandeliers", "output": "Attic Bedroom With Beautiful Chandeliers on Large Ceilings"} """,
    """
    [INST]User: "Wedding rings and yellow flower on a red background"[/INST]  
    Assistant: {"edit": "place a bird by the yellow flower", "edited object": "bird", "output": "Wedding rings, a bird, and yellow flower on a red background"} """,
    """
    [INST]User: "Tree Near the lake in the morning"[/INST] 
    Assistant: {"edit": "Give it autumn leaves on top", "edited object": "leaves", "output": "Tree with autumn leaves on top Near the lake in the morning"} """,
    """
    [INST]User: "robot and alien sitting on hanging bridge at daytime"[/INST]  
    Assistant: {"edit": "make them hold three books.", "edited object": "threebooks", "output": "robot and alien holding three books while sitting on hanging bridge at daytime"} """,
    """
    [INST]User: "Skogafoss waterfall in the south of Iceland"[/INST]  
    Assistant: {"edit": "Set a colorful rainbow in the backhground!", "edited object": "rainboe", "output": "Skogafoss waterfall with a colorful rainbow in the south of Iceland"} """,
    """
    [INST]User: "Polar Bear with rubber gloves pushing shopping carts"[/INST]  
    Assistant: {"edit": "Make it wear a coat", "edited object": "coat", "output": "Polar Bear with a coat pushing shopping carts"}
    """
    ]


def get_content_instruction(new_prompt): 
    optional_verbs = choice(["include", "place", "position", "set", "incorporate", "alongside", " give", "put", "insert", "together with", "with", "make", "integrate", "have", "append", " make", "add", "include"]) 
     # system message # 
    system_message = f"""
        <<SYS>>You are an assistant that only speaks JSON. Do not write normal text. The assistant answer is 
            JSON with the following string fields: 'edit', 'edited object','output'. 
            Here is the latest conversation between Assistant and User.
        <</SYS>>
        """
     # introduction message 
    intro_message = f"""
    [INST]User: Hi, My job to take a given caption ('input') and to output the following: an instruction for {optional_verbs} an object to the image ('edit'), the object to {optional_verbs} ('edited object'), and the caption with the object ('output'). Please help me do it.
        I will give you the 'input', and you will help. When you reply, use the following format: {"edit": '<instruction>', 'edited object': '<object>', 'output': '<caption>'}
    [/INST]
    Assistant: Sure, I'd be happy to help! Please provide the actual input caption you'd like me to read and I'll assist you with writing an instruction to {optional_verbs} an object to the image, writing the added object and writing the caption with the object."""  
    random.seed(torch.randint(1 << 32, ()).item())
    shuffle(few_shot_examples)
    few_shot_examples = few_shot_examples[:int(len(few_shot_examples) * 0.6)] 
    prompt = system_message + intro_message + "".join(few_shot_examples)  # add the test prompt 
    prompt = prompt + f"[INST]User: {new_prompt}[/INST]"
    return prompt
2.2.3 图片对的生成

通过上面的步骤我们拿到了4元组 ( c T , c I , x , v j ) (c_T, c_I, x, v_{j}) (cT,cI,x,vj),中的 c T , c I , v j c_T, c_I, v_{j} cT,cI,vj,其中 c T c_T cT还有很多附加信息:

如:编辑的对象,新的image caption,如:

{“edit”: “include a hat”, “edited object”: “hat”, “output”: “Beautiful cat wearing a hat with mojito sitting in a cafe on the street”, “original object”: “cat”}

此处需要进行的是根据上面的条件,得到对应的图片pair ( x x x)。

任务目标:根据输入图片、instruction信息生成对应的图片pair ( x x x)并且除了编辑的区域, x x x c I c_I cI的差异应当尽可能的小。
max ⁡ S I M ( I n s t r u c t i o n , c I E d i t ) min ⁡ D i s t ( x , c I n o t E d i t ) (4) \begin{aligned} &\max \mathrm{SIM}(\mathrm{Instruction}, c_I^{\mathrm{Edit}}) \\ &\min \mathrm{Dist}(x, c_I^{\mathrm{not Edit}}) \end{aligned} \tag{4} maxSIM(Instruction,cIEdit)minDist(x,cInotEdit)(4)

为了解决上述的任务目标,作者提出一种mask-based attention control的方法(相当于DiffEdit和P2P的结合)。具体分为以下几个步骤:

已知条件:

  • C a p o r i \mathrm{Cap_{ori}} Capori: image caption 。 example:Beautiful cat with mojito sitting in a cafe on the street
  • I m g o r i : \mathrm{Img_{ori}}: Imgori:image caption用DM生成的图片
  • C a p e d i t \mathrm{Cap_{edit}} Capedit:编辑后的image caption。Beautiful cat wearing a hat with mojito sitting in a cafe on the street
  • O b j o r i \mathrm{Obj_{ori}} Objoriimage caption的原始目标(original object)。cat
  • O b j e d i t \mathrm{Obj_{edit}} Objedit编辑目标(edited object):hat

STEP1: 提取mask。将 O b j o r i \mathrm{Obj_{ori}} Objori I m g o r i \mathrm{Img_{ori}} Imgori送入到SAM+DINO模型中,得到3类mask

  1. 精确的mask,有sam+dino的生成
  2. 将1中的mask进行膨胀,在进行高斯模糊,作为新的mask
  3. 取第一步mask的bounding box作为新的mask

SETP2: 通过mask-based attention control进行图片生成。具体为:先用P2P的cross-attention control的方法将common token的对应的attention map进行注入,随后用diffedit的根据mask融合方法进行融合。

STEP3: 图片Filter。通过上述步骤得到3个目标图片,留存最好的一个。filter的规则为

  1. 用CLIP filtering metrics,留存最相关的一个
  2. 留存edit image与input image在深度图上的L1 距离最小的图片。

每一类Edit方法的详细的数据构造细节见论文7.2.3

最后得到的各类训练数据比例如下:

在这里插入图片描述

3 结果

EmuEdit在多个测试集取得了SOTA。并且作者公开了一个新的基于EmuEdit的benchmark:https://huggingface.co/datasets/facebook/emu_edit_test_set

在这里插入图片描述

一些惊艳结果:

在这里插入图片描述

  • 23
    点赞
  • 14
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值