[AIGC] Face ID Customization Methods Based on Text-to-Image Generation

Dolly_DL

Last modified: 2024-07-26 15:33:46


First published: 2024-07-26 15:30:21

Article link: https://blog.csdn.net/qq_37706246/article/details/140699702


This post covers three papers: IP-Adapter (FaceID), InstantID, and PuLID.

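ID customization: as a special category of customized text-to-image (T2I) generation, identity (ID) customization allows users to adapt pre-trained T2I diffusion models to align with their personalized ID.
Tuning-based: each ID requires 10+ minutes of fine-tuning, which makes personalization and generation costly (e.g., DreamBooth, LoRA).
Tuning-free: a single encoder (e.g., the CLIP image encoder) extracts the ID features, which are then injected into the base diffusion model, for example by fusing them through the cross-attention layers. This is more efficient because no per-ID fine-tuning is needed.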


IP-Adapter(FaceID)
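Image prompt: global image -> face image
Encoder: image encoder -> face encoder
Decoupled cross-attention: the cross-attention layer is split into an image-feature branch and a text-feature branch
1) In the original SD, K and V in cross-attention come from the text features, and Q comes from the output of the preceding U-Net block; see Eq. (3).
2) Image features are injected in the same way the original SD injects text features into the U-Net; see Eq. (4).
3) Instead of first concatenating the two feature sets and feeding them into a single cross-attention, the outputs of the two decoupled attentions are added directly, which is more efficient; see Eq. (5).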
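
The three points above can be sketched in a few lines. This is a minimal, dependency-free illustration, not the IP-Adapter implementation: the function names (`attention`, `decoupled_cross_attention`), the `scale` weight on the image branch, and the toy inputs are assumptions made for this example, and K/V are assumed to be already projected from the text and image features. A real implementation would use PyTorch tensors with learned projection layers.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    """Swap rows and columns of a matrix."""
    return [list(col) for col in zip(*m)]


def softmax_rows(m):
    """Numerically stable softmax applied to each row."""
    out = []
    for row in m:
        mx = max(row)
        exps = [math.exp(v - mx) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def attention(q, k, v):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = len(q[0])
    scores = matmul(q, transpose(k))
    scores = [[s / math.sqrt(d) for s in row] for row in scores]
    return matmul(softmax_rows(scores), v)


def decoupled_cross_attention(q, k_txt, v_txt, k_img, v_img, scale=1.0):
    """Run text and image cross-attention separately, then add the outputs.

    `scale` (hypothetical name) weights the image branch; scale=0 falls back
    to text-only cross-attention.
    """
    z_txt = attention(q, k_txt, v_txt)  # text branch (K, V from text features)
    z_img = attention(q, k_img, v_img)  # image branch (K, V from image/face features)
    return [[a + scale * b for a, b in zip(r1, r2)] for r1, r2 in zip(z_txt, z_img)]


# Toy example: 2 query tokens, 3 text tokens, 1 image token, feature size 2.
q = [[1.0, 0.0], [0.0, 1.0]]
k_txt = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v_txt = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
k_img = [[0.5, 0.5]]
v_img = [[10.0, 20.0]]
z = decoupled_cross_attention(q, k_txt, v_txt, k_img, v_img)
```

With a single image token, the image branch returns `v_img[0]` for every query, so `z` is simply the text-only attention output shifted by that vector. Because the query Q is shared, adding a second branch only adds one extra set of K/V for the image features.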


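InstantID
Method
The method has three main parts; parts 1 and 2 are the same as in IP-Adapter FaceID:
1) ID embedding extraction
2) A lightweight decoupled adapter
3) IdentityNet: encodes detailed features from the reference face image as an additional weak spatial control
Note: only the SDXL version is supported.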
IdentityNet
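1) Directly concatenating the image and text prompts before attention hinders fine-grained control by the model.
2) Adding the decoupled attention outputs tends to weaken the control exerted by the text tokens.
3) IdentityNet follows the ControlNet design: the control image is the keypoint map of the reference face, and the text input is replaced by the face embedding, which strengthens control over ID-related information.
4) Drawback: the landmark control signal from the reference face limits the diversity of the generated faces (pose, expression, etc.).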
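
To make point 3 concrete, the sketch below builds the two inputs such a ControlNet-style branch would receive: a sparse keypoint map as the control image, and the face embedding in place of the text context. Everything here is hypothetical and for illustration only: the function names, the dictionary keys, the five example keypoints, and the tiny 8x8 canvas are assumptions, not InstantID's actual code or API.

```python
def draw_keypoint_map(height, width, keypoints, radius=1):
    """Rasterize integer (x, y) keypoints onto a blank canvas of 0s.

    Each keypoint becomes a filled square of side 2 * radius + 1,
    clipped at the canvas border.
    """
    canvas = [[0] * width for _ in range(height)]
    for x, y in keypoints:
        for yy in range(max(0, y - radius), min(height, y + radius + 1)):
            for xx in range(max(0, x - radius), min(width, x + radius + 1)):
                canvas[yy][xx] = 1
    return canvas


def build_identitynet_inputs(face_embedding, keypoints, height, width):
    """Assemble the conditioning inputs (hypothetical layout).

    - "control_image": keypoint map of the reference face (spatial control)
    - "context": face embedding used where a text embedding would normally go
    """
    return {
        "control_image": draw_keypoint_map(height, width, keypoints, radius=0),
        "context": list(face_embedding),
    }


# Hypothetical keypoints: two eyes, nose, two mouth corners.
example_keypoints = [(2, 2), (6, 2), (4, 4), (3, 6), (5, 6)]
inputs = build_identitynet_inputs([0.1, 0.2, 0.3], example_keypoints, 8, 8)
```

The keypoint map carries only coarse position information, which is why it acts as a weak spatial control. It also shows where the drawback in point 4 comes from: the generated face is tied to the landmark layout of the reference.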
Results from the paper

PuLID
Motivation
1) Inserting ID information disrupts the behavior of the original model
a) An ideal ID insertion should alter only ID-related aspects, such as face, hairstyle, and skin color, while image elements not directly associated with the specific identity, such as background, lighting, composition, and style, should be consistent with the behavior of the original model.
b) After the ID insertion, the model should still retain the original T2I model's ability to follow prompts.
2) Lack of ID fidelity
Method
Build contrastive paths with and without the ID embedding, and introduce alignment losses that guide the model to insert ID control information without disturbing the expressiveness of the original model.

A contrastive-alignment-based, non-concat ID embedding method: it designs a semantic alignment loss and a layout alignment loss.
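
The two alignment losses can be sketched as follows. This is a rough, dependency-free illustration and should be treated as an assumption rather than PuLID's exact formulation: the text features query the U-Net features from each path, the semantic loss compares those text-conditioned responses, and the layout loss compares the U-Net features directly. All function names and toy values are hypothetical.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    """Swap rows and columns of a matrix."""
    return [list(col) for col in zip(*m)]


def softmax_rows(m):
    """Numerically stable softmax applied to each row."""
    out = []
    for row in m:
        mx = max(row)
        exps = [math.exp(v - mx) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def text_response(k_text, q_unet):
    """Let each text token attend over the U-Net features: softmax(K Q^T / sqrt(d)) Q."""
    d = len(q_unet[0])
    scores = matmul(k_text, transpose(q_unet))
    scores = [[s / math.sqrt(d) for s in row] for row in scores]
    return matmul(softmax_rows(scores), q_unet)


def l2_distance(a, b):
    """L2 norm of the element-wise difference of two same-shaped matrices."""
    return math.sqrt(sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)))


def alignment_losses(k_text, q_no_id, q_with_id):
    """Return (semantic_loss, layout_loss) between the two contrastive paths."""
    semantic = l2_distance(text_response(k_text, q_with_id),
                           text_response(k_text, q_no_id))
    layout = l2_distance(q_with_id, q_no_id)
    return semantic, layout


# Toy example: the ID path has the same features, but at shuffled positions.
k_text = [[1.0, 0.5]]
q_no_id = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
q_with_id = [[0.0, 1.0], [1.0, 1.0], [1.0, 0.0]]
semantic, layout = alignment_losses(k_text, q_no_id, q_with_id)
```

In this toy case the semantic loss is (numerically) zero because the text tokens see the same set of features, while the layout loss is non-zero because the features moved. That is the intuition for using both terms: one constrains how the image responds to the prompt, the other constrains where things are.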

ID loss optimization: compute the face ID similarity between the reference image and the image generated after the ID embedding is introduced.
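
One common way to turn such a similarity into a loss is 1 minus the cosine similarity of the two face embeddings. The sketch below assumes that form; the function names and the example vectors are hypothetical, and in practice the embeddings would come from a face recognition model applied to the generated and reference images.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two non-zero embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def id_loss(generated_embedding, reference_embedding):
    """Assumed form of the ID loss: 0 when the embeddings point the same way."""
    return 1.0 - cosine_similarity(generated_embedding, reference_embedding)


# Toy example: identical direction gives ~0, orthogonal embeddings give 1.
same = id_loss([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
different = id_loss([1.0, 0.0], [0.0, 1.0])
```

Minimizing this value pushes the generated face toward the reference identity in embedding space, independently of pixel-level differences such as pose or lighting.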
The multi-stage training scheme used in the paper is also worth borrowing.