Migrating the StableDiffusion Model to PCIe and Aligning Its Precision

SOPHGO Developer Community

Published 2022-11-01 13:40:52

Article link: https://blog.csdn.net/lily_19861986/article/details/127631496


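1. Introduction
2. Model overview
2.1 The diffusion process
3. Model details
4. Migration details: working around unsupported operators
4.1 Obtaining the original model
4.2 Migrating the CLIP TextEncoder model
4.3 Migrating the VAE model
4.4 Migrating the Conditional U-Net model: working around the unsupported DictConstruct, broadcast_to, and einsum operators
5. Building the pipeline and aligning precision
5.1 Precision alignment
5.2 Problem analysis
References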

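1. Introduction

StableDiffusion is an image generation model based on diffusion models that markedly improves image generation quality. This article describes how to migrate StableDiffusion to the BM1684 and align its precision with the original model.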

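2. Model overview

StableDiffusion is a latent text-to-image diffusion model capable of generating photorealistic images from any text input.

StableDiffusion consists of three parts: a diffusion model, a Transformer-based text encoder, and a VAE that acts as the image generator. The diffusion model is a generative model based on the diffusion process and produces high-quality latents. The Transformer is an attention-based encoder that encodes text into a vector representation. Combining the Transformer with the diffusion model lets StableDiffusion generate photorealistic images from any text input.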

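2.1 The diffusion process

Diffusion is a process of gradually generating meaningful information from noise, as illustrated below:

[Figure: the diffusion process]

During training, the image diffuses gradually from X0 to XT, where X0 is the real image and XT is random noise. At inference time, the model generates an image starting from noise.

Diffusion process (X0 -> XT): noise is added to the image step by step; this stepwise process can be viewed as a parameterized Markov process.
Reverse diffusion process (X0 <- XT): starting from noise, the noise is removed step by step to recover an image.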
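The forward (noising) direction can be sketched in a few lines of NumPy. This is a minimal illustration of the standard DDPM closed form X_t = sqrt(a_bar_t) * X_0 + sqrt(1 - a_bar_t) * eps, not code from this project. The linear beta schedule, its endpoints, and the helper names `make_alpha_bar` and `q_sample` are assumptions made for the example.

```python
import numpy as np


def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for an assumed linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps, dtype=np.float64)
    return np.cumprod(1.0 - betas)


def q_sample(x0, t, alpha_bar, rng):
    """Jump from X_0 directly to X_t using the closed form of the Markov chain."""
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return xt, eps


rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar()
x0 = rng.uniform(-1.0, 1.0, size=(4, 64, 64))  # stand-in for a clean latent

x_early, _ = q_sample(x0, 10, alpha_bar, rng)   # still close to x0
x_late, _ = q_sample(x0, 999, alpha_bar, rng)   # almost pure Gaussian noise

# Fraction of the original signal that survives at each step.
print(np.sqrt(alpha_bar[10]), np.sqrt(alpha_bar[999]))
```

At the last step almost none of the original signal remains, which is why inference can start from pure Gaussian noise.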

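Once training is complete, images can be generated by sampling random Gaussian noise. Diffusion models are in fact closely related to AE and VAE: a rough line of development is AE, VAE, VQVAE, Diffusion, while diffusion models themselves progressed through DDPM, GLIDE, DALLE2, and Stable Diffusion.

A diffusion model can be viewed as a hierarchical VAE. This perhaps suggests that even with a minimal encoder, a fixed latent dimension, and Markovian transitions, a model can still learn powerful capabilities when extended to infinitely many layers.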

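3. Model details

It consists of a VAE, a U-Net, and a CLIP text encoder.

VAE. During training, the VAE encoder produces the latent representation (latents) of an image for the forward diffusion process, i.e. adding noise. During inference, the VAE decoder turns the latents denoised by the reverse diffusion process into an image.
U-Net. Both the encoder and the decoder are built from ResNet blocks. The encoder compresses the image representation and the decoder restores it, ideally with less noise. More specifically, the U-Net output predicts the noise residual, which can be used to compute the predicted denoised image representation. In addition, cross-attention layers are added to the encoder and decoder parts of the U-Net (between the ResNet blocks) to condition the output.
CLIP. The text encoder converts the text into an input for the U-Net. As in Imagen, Stable Diffusion does not train the text encoder during training; it simply uses CLIP's already-trained CLIPTextModel.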

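4. Migration details: working around unsupported operators

4.1 Obtaining the original model

We use the pretrained model provided by the authors.

Install the required libraries: pip install diffusers==0.2.4 transformers scipy ftfy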

from diffusers import StableDiffusionPipeline

# get your token at https://huggingface.co/settings/tokens
pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4", use_auth_token=YOUR_TOKEN)

prompt = "a photograph of an astronaut riding a horse"  # input text
image = pipe(prompt)["sample"][0]  # the generated image

Running this script downloads the pretrained model from Hugging Face.

The original repository provides several versions.

The downloaded model is stored under .cache.

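4.2 Migrating the CLIP TextEncoder model

The CLIP TextEncoder can be migrated directly; the provided ONNX version works.
Note that the TextEncoder input is a sequence of text tokens, which are integers starting from 0, so the data type must be declared accordingly when describing the model inputs (the --descs argument below).

The conversion script is: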

python3 -m bmneto --model=./text_encoder.onnx \
                  --outdir="./" \
                  --target="BM1684" \
                  --shapes="1,77" \
                  --opt=1 \
                  --cmp=false \
                  --net_name="text_encoder" \
                  --descs="[0,int64,0,49409]"

If cmp=false is not set, the conversion fails with an error.
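To make the input contract concrete, the sketch below builds an int64 token array of shape (1, 77) whose values stay inside the range given to --descs. It is only an illustration: the helper `pad_token_ids`, the example ids, and the start/end token ids (49406 and 49407, also used here as padding) are assumptions about the CLIP tokenizer rather than values taken from this article.

```python
import numpy as np

MAX_LENGTH = 77        # sequence length from --shapes="1,77"
VALUE_UPPER = 49409    # upper bound from --descs="[0,int64,0,49409]"
BOS_ID = 49406         # assumed CLIP start-of-text id
EOS_ID = 49407         # assumed CLIP end-of-text id, also used as padding here


def pad_token_ids(ids, max_length=MAX_LENGTH):
    """Wrap raw ids with start/end tokens, truncate, and pad to max_length."""
    ids = [BOS_ID] + list(ids)[: max_length - 2] + [EOS_ID]
    ids = ids + [EOS_ID] * (max_length - len(ids))
    # The bmodel input was declared as int64, so the array must be int64 too.
    return np.asarray(ids, dtype=np.int64).reshape(1, max_length)


tokens = pad_token_ids([320, 1125, 539])  # arbitrary example ids
print(tokens.shape, tokens.dtype)
```

In the real pipeline the tokenizer produces these ids; the point is only that the array handed to the bmodel matches the declared shape, dtype, and value range.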

4.3 Migrating the VAE model

The VAE decoder ONNX model cannot be converted with bmneto, so this time we go through PyTorch instead:

import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("/mnt/sdb/wangyang.zuo/.cache/huggingface/diffusers/CompVis--stable-diffusion-v1-4.main.7c3034b58f838791fc1c581d435c452ea80af274")

def fn(input_tensor):  # wrapper function to be traced
    with torch.no_grad():
        return pipe.vae.decode(input_tensor)
    
jitmodel = torch.jit.trace(fn, torch.rand(1,4,64,64))
jitmodel.save("vae_decoder.pt")

Conversion command:

python3 -m bmnetp   --model=./vae_decoder.pt \
                    --outdir="./" \
                    --target="BM1684" \
                    --shapes="1,4,64,64" \
                    --net_name="vae_decoder" \
                    --opt=2 \
                    --cmp=false

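4.4 Migrating the Conditional U-Net model: working around the unsupported DictConstruct, broadcast_to, and einsum operators

This model is large: the resulting JIT model is 3.4G. Note first that the model takes multiple inputs, in the order shown in the source code:

[Screenshot: input order in the source code]

The JIT model is obtained by tracing with constructed inputs, which are as follows: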

timestep 1
latent_model_input.shape
(2, 4, 64, 64)
text_embeddings.shape
(2, 77, 768)

Debugging shows that the initial value of timestep is 999, so timestep is torch.tensor(999).

The conversion script is:

import bmnetp
## compile fp32 model
bmnetp.compile(
  model = "./unet/unet_jit_remove_pickle_error.pt",        ## Necessary
  outdir = "./compilation5",                 ## Necessary
  target = "BM1684",              ## Necessary
  shapes = [[2,4,64,64], [2], [2,77,768]],  ## Necessary
  net_name = "unet2",              ## Necessary
  opt = 0,                        ## optional, if not set, default equal to 1
  dyn = False,                    ## optional, if not set, default equal to False
  cmp = False,                     ## optional, if not set, default equal to True
  enable_profile = False,           ## optional, if not set, default equal to False
)

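The reported error is:
[Screenshot: conversion error]

Fixing DictConstruct

Tracing with JIT then runs into several problems: non-torch data types cannot be traced, and the output is a dict.

The source code shows that the Conditional U-Net returns a dict; we change it to return a tensor and then trace again.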
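The idea of ending the traced graph in a plain tensor can also be sketched without editing the library source, by wrapping the network. The code below is an alternative illustration, not the change made in this article: `DictOutputNet` is a hypothetical toy stand-in for the real U-Net, `TensorOutputWrapper` is a made-up helper name, and the timestep is passed with shape (2,) as in the compile script.

```python
import torch
from torch import nn


class DictOutputNet(nn.Module):
    """Hypothetical stand-in for a network whose forward returns a dict."""

    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(4, 4, kernel_size=3, padding=1)

    def forward(self, sample, timestep, encoder_hidden_states):
        # Toy conditioning: one scalar per batch element from text and timestep.
        cond = encoder_hidden_states.mean(dim=(1, 2)) + 1e-3 * timestep.to(sample.dtype)
        return {"sample": self.conv(sample) + cond.view(-1, 1, 1, 1)}


class TensorOutputWrapper(nn.Module):
    """Unwraps the dict so that the traced graph returns a plain tensor."""

    def __init__(self, net):
        super().__init__()
        self.net = net

    def forward(self, sample, timestep, encoder_hidden_states):
        return self.net(sample, timestep, encoder_hidden_states)["sample"]


sample = torch.rand(2, 4, 64, 64)
timestep = torch.tensor([999, 999])
text_embeddings = torch.rand(2, 77, 768)

wrapper = TensorOutputWrapper(DictOutputNet()).eval()
with torch.no_grad():
    traced = torch.jit.trace(wrapper, (sample, timestep, text_embeddings))
    out = traced(sample, timestep, text_embeddings)

print(type(out), tuple(out.shape))
```

Whether the dict is removed in the source or in a wrapper, the traced model ends up with a single tensor output.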

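Fixing the unsupported aten::broadcast_to

The source code shows where it is used:
[Screenshot: broadcast_to in the source code]

Reading further, broadcast_to appears in only this one place, in the first part of the network, and depends only on the timestep: it expands the timestep to the batch dimension of the input. The operator can therefore be pulled out of the model and turned into a preprocessing step, so the source code is changed to:
[Screenshot: modified source code]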
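As a small check of why this move is safe, the NumPy sketch below compares expanding a scalar timestep with broadcast_to against the np.tile call used later as preprocessing in the pipeline. The variable names are illustrative only.

```python
import numpy as np

t = np.array(999)   # scalar timestep, as seen while debugging
batch_size = 2      # batch dimension of latent_model_input

# What the operator did inside the network: expand t to the batch dimension.
via_broadcast = np.broadcast_to(t, (batch_size,))

# The same expansion done outside the model, as a preprocessing step.
via_tile = np.tile(t, batch_size)

print(via_broadcast, via_tile)
```

Both arrays have shape (2,) and hold the same values, so the model can take the already expanded timestep as an ordinary input.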

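With these two changes the JIT model can be generated, but conversion still fails, this time with an einsum error.
[Screenshot: einsum error]

This is because our SDK does not support some einsum operators.

Since einsum is a compute node rather than a node that stores data, it can be replaced with other operators such as bmm and transpose without affecting how the pretrained weights are loaded. Note that changing nodes that carry data, such as conv, may break loading of the pretrained model and must be done with care.

The source code is modified as follows:
[Screenshot: modified source code]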
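The equivalence behind this rewrite can be checked numerically. The sketch below uses NumPy and two einsum patterns of the kind typically found in attention layers; the exact expressions in the modified source are not reproduced here, so the patterns and tensor sizes are assumptions. In PyTorch the corresponding calls would be torch.bmm and Tensor.transpose.

```python
import numpy as np

rng = np.random.default_rng(0)
q = rng.standard_normal((2, 5, 8))  # (batch, query positions, channels)
k = rng.standard_normal((2, 7, 8))  # (batch, key positions, channels)
v = rng.standard_normal((2, 7, 8))  # (batch, key positions, channels)

# Attention scores: einsum versus batched matmul with a transposed operand.
scores_einsum = np.einsum("bid,bjd->bij", q, k)
scores_matmul = np.matmul(q, np.transpose(k, (0, 2, 1)))

# Weighted sum of values: einsum versus plain batched matmul.
out_einsum = np.einsum("bij,bjd->bid", scores_einsum, v)
out_matmul = np.matmul(scores_matmul, v)

print(np.allclose(scores_einsum, scores_matmul), np.allclose(out_einsum, out_matmul))
```

Because only the computation is rewritten and no parameters are involved, the pretrained weights load exactly as before.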

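After reloading the model, the JIT model can be generated and converted.

Conversion still fails at this point: ASSERT info: alloc 1073741824 failed

The following environment variable needs to be set: export CMODEL_GLOBAL_MEM_SIZE=8589934592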
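For orientation, the two numbers above are round binary sizes, which a couple of lines of Python confirm. The variable names are illustrative.

```python
GIB = 1024 ** 3

failed_alloc = 1073741824       # size from the ASSERT message
cmodel_mem_size = 8589934592    # value given to CMODEL_GLOBAL_MEM_SIZE

print(failed_alloc / GIB)       # 1.0
print(cmodel_mem_size / GIB)    # 8.0
```

So the failing request was a single 1 GiB allocation, and the environment variable is set to 8 GiB.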

At this point all three models have been converted to bmodel, which completes the model conversion.

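5. Building the pipeline and aligning precision

The pipeline is based on stable_diffusion.openvino, whose model wrapper lives in a single file, engine.py. With that repository, we only need to replace engine.py and make minor changes to the other files.

The modified engine.py is: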

import sophon.sail as sail 
import numpy as np 


class EngineOV:
    
    def __init__(self, model_path="", device_id=0):
        # Keep the arguments so that __str__ can report them.
        self.model_path = model_path
        self.device_id = device_id
        self.model = sail.Engine(model_path, device_id, sail.IOMode.SYSIO)
        self.graph_name = self.model.get_graph_names()[0]
        self.input_name = self.model.get_input_names(self.graph_name)[0]
        self.input_shape = self.model.get_input_shape(self.graph_name, self.input_name)
        self.output_name = self.model.get_output_names(self.graph_name)[0]
        self.output_shape = self.model.get_output_shape(self.graph_name, self.output_name)
        print("input_name={}, input_shape={}, output_name={}, output_shape={}".format(self.input_name, self.input_shape, self.output_name, self.output_shape))

    def __str__(self):
        return "EngineOV: model_path={}, device_id={}".format(self.model_path, self.device_id)

    def __call__(self, args):
        # args maps input names to arrays; only the first output is returned.
        output = self.model.process(self.graph_name, args)
        return output[self.output_name]

When the pipeline is loaded, modify the __init__ function:

        # text features
        self.text_encoder = EngineOV("./text_encoder/text_encoder.bmodel")
        # diffusion
        self.unet = EngineOV("./unet/compilation.bmodel",device_id=1)
        self.latent_shape = (4,64,64 )
        # decoder
        self.vae_decoder = EngineOV("./vae_decoder/vae_decoder.bmodel")

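Because we modified the conditional U-Net, its inputs must be adjusted as well, with the timestep expansion done as preprocessing. The code is as follows: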

        batch_size = latent_model_input.shape[0]
        newt = np.tile(t, (batch_size))

        noise_pred = self.unet({
            "input.1": latent_model_input,
            "timesteps.1": newt,
            "input0.1": text_embeddings
        })

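5.1 Precision alignment

At first, the output of our pipeline was:
[Image: initial pipeline output]

which does not match the original result.

We therefore need to compare the output of each model. For convenience, (1) we modify only the model file, not the pipeline file, and compare the pipeline outputs; (2) we save the inputs and outputs of every model to files and compare them. The comparison computes the L1 distance and the cosine similarity of the outputs.

In addition, because the input of the diffusion model is random, we need to fix the inputs before comparing the outputs.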
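The two metrics can be sketched as follows. This is a minimal version under stated assumptions: the L1 distance is taken as the mean absolute difference (the article does not say whether a mean or a sum was used), and the function names and the synthetic arrays are illustrative. With real data, the arrays would come from np.load on the saved .npy files.

```python
import numpy as np


def l1_distance(a, b):
    """Mean absolute difference between two arrays of the same shape."""
    a = np.asarray(a, dtype=np.float64).ravel()
    b = np.asarray(b, dtype=np.float64).ravel()
    return float(np.mean(np.abs(a - b)))


def cosine_similarity(a, b):
    """Cosine of the angle between the two flattened arrays."""
    a = np.asarray(a, dtype=np.float64).ravel()
    b = np.asarray(b, dtype=np.float64).ravel()
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# A fixed seed keeps the "random" input identical between the two runs.
rng = np.random.default_rng(0)
reference = rng.standard_normal((2, 4, 64, 64)).astype(np.float32)
noise = rng.standard_normal(reference.shape).astype(np.float32)
candidate = reference + 1e-3 * noise  # stand-in for the output on the device

print(l1_distance(reference, candidate), cosine_similarity(reference, candidate))
```

Applying both functions to every saved pair of outputs gives the per-model numbers compared in this section.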



    # text encoder information record 
    tokens = torch.tensor(tokens).unsqueeze(0)
    np.save("tokens.npy", tokens.numpy())
    # text_embedding use npu engine to inference 
    text_embeddings = self.text_encoder(tokens)[0].detach().cpu().numpy()
    np.save('text_embeddings.npy', text_embeddings)
    # text_encoder onnx output shape is (1, 77, 768)
    # do classifier free guidance
    if guidance_scale > 1.0:
        tokens_uncond = self.tokenizer(
            "",
            padding="max_length",
            max_length=self.tokenizer.model_max_length,
            truncation=True
        ).input_ids
        np.save("tokens_uncond.npy", tokens_uncond)
        uncond_embeddings = self.text_encoder(torch.tensor(tokens_uncond).unsqueeze(0))
        uncond_embeddings = uncond_embeddings[0].detach().numpy()
        np.save("uncond_embeddings.npy", uncond_embeddings)
        text_embeddings = np.concatenate((uncond_embeddings, text_embeddings), axis=0)


    # conditional unet information record  
    latent_model_input = np.load("./bins/latent_model_input-{}.npy".format(i))
    newt = np.load("./bins/newt-{}.npy".format(i))
    text_embeddings = np.load("./bins/text_embeddings-{}.npy".format(i))
    
    noise_pred = self.unet({
        "input.1": latent_model_input,
        "timesteps.1": newt,
        "input0.1": text_embeddings
    })

    np.save("./predict/noise_pred-{}.npy".format(i), noise_pred)


    # vae decoder information record 
    np.save('latents.npy', latents)
    
    image = self.model.model.vae.decode(torch.tensor(latents,dtype=torch.float32).unsqueeze(0)).detach().cpu().numpy()
    np.save("image.npy", image)

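The comparison shows:

text encoder: with identical inputs, the output similarity is 99.9999% and the L1 distance is 1e-7;
conditional unet: with identical inputs, the output similarity is 99.7% and the L1 distance is 0.01;
vae decoder: with identical inputs, the output similarity is 99.9999% and the L1 distance is 1e-7;

At first glance the differences are small, but the conditional unet runs in a loop 32 times, so we tracked its inputs and outputs at every iteration and found:

[Image: per-iteration comparison]

With errors accumulating across iterations, both the L1 distance and the cosine similarity become very poor.

5.2 Problem analysis

With identical inputs for a single step, the conditional unet differs by only 0.01, so the model itself is probably not the main problem; what needs to be avoided is the error accumulating over the loop. Looking further at the source code, the timestep input of the conditional unet goes through an embedding, which indicates that timestep should be an integer rather than a floating-point number. When we converted the model, however, the timestep input was a floating-point number, so it has to be changed to an integer.
The conversion script becomes: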

import bmnetp
## compile fp32 model
bmnetp.compile(
  model = "./unet/unet_jit_remove_pickle_error.pt",        ## Necessary
  outdir = "./compilation5",                 ## Necessary
  target = "BM1684X",              ## Necessary
  shapes = [[2,4,64,64], [2], [2,77,768]],  ## Necessary
  net_name = "unet2",              ## Necessary
  opt = 0,                        
  dyn = False,                    
  cmp = False,                     ## optional, if not set, default equal to True
  enable_profile = False,           ## optional, if not set, default equal to False
  desc="[1,int64,0,10000000]",  # newly added argument
)

and the preprocessing becomes:

        batch_size = latent_model_input.shape[0]
        newt = np.tile(t, (batch_size)).astype(np.int64)

        noise_pred = self.unet({
            "input.1": latent_model_input,
            "timesteps.1": newt,
            "input0.1": text_embeddings
        })
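A dtype mismatch like this one is easy to miss, so a small guard before calling the engine can help. The sketch below is a hypothetical helper, not part of the original pipeline: `check_unet_inputs` and `EXPECTED_INPUTS` are made-up names, the input names and shapes are taken from the code above, and float32 for the two tensor inputs is an assumption based on the model being compiled as fp32.

```python
import numpy as np

# Expected shape and dtype per input; float32 is an assumption (fp32 bmodel).
EXPECTED_INPUTS = {
    "input.1": ((2, 4, 64, 64), np.float32),
    "timesteps.1": ((2,), np.int64),
    "input0.1": ((2, 77, 768), np.float32),
}


def check_unet_inputs(inputs, expected=EXPECTED_INPUTS):
    """Return a list of mismatches; the list is empty when everything matches."""
    problems = []
    for name, (shape, dtype) in expected.items():
        if name not in inputs:
            problems.append("{}: missing".format(name))
            continue
        arr = np.asarray(inputs[name])
        if arr.shape != shape:
            problems.append("{}: shape {}, expected {}".format(name, arr.shape, shape))
        if arr.dtype != np.dtype(dtype):
            problems.append("{}: dtype {}, expected {}".format(name, arr.dtype, np.dtype(dtype)))
    return problems


latents = np.zeros((2, 4, 64, 64), dtype=np.float32)
embeddings = np.zeros((2, 77, 768), dtype=np.float32)
t = 999.0  # a floating-point timestep, as in the first attempt

bad = {"input.1": latents, "timesteps.1": np.tile(t, 2), "input0.1": embeddings}
good = dict(bad, **{"timesteps.1": np.tile(t, 2).astype(np.int64)})

print(check_unet_inputs(bad))
print(check_unet_inputs(good))
```

The first call reports the floating-point timestep and the second returns an empty list once the timestep has been cast to int64.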

After these changes, we reconverted the model, redeployed, and retested; the precision is now aligned.

References

[1] High-Resolution Image Synthesis With Latent Diffusion Models

@InProceedings{Rombach_2022_CVPR,
    author    = {Rombach, Robin and Blattmann, Andreas and Lorenz, Dominik and Esser, Patrick and Ommer, Bj\"orn},
    title     = {High-Resolution Image Synthesis With Latent Diffusion Models},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2022},
    pages     = {10684-10695}
}

[2] https://blog.csdn.net/qq_39388410/article/details/126576756