Qualcomm AI Stack Stable Diffusion Demo Guide (5)
1.3 Qualcomm AI Engine Direct Model Execution
In this notebook, you will learn how to execute the optimized and prepared models on Windows on Snapdragon devices: both how to run each model individually and how to run them as part of a Stable Diffusion pipeline that generates an image from a given user prompt.
1.3.1 Qualcomm AI Engine Direct Model Execution on Windows on Snapdragon
Stable Diffusion model execution with Qualcomm AI Engine Direct on Windows on Snapdragon
This guide walks through how to execute the Stable Diffusion models on Windows on Snapdragon devices using the Qualcomm AI Engine Direct SDK.
Note that throughout the rest of this document, the term Qualcomm Neural Network (QNN) is used interchangeably with Qualcomm AI Engine Direct SDK.
Prerequisites
- QNN SDK
- QNN context binaries generated for the three models that make up Stable Diffusion: the text encoder, the U-Net, and the variational autoencoder (VAE) decoder
Platform requirements
- Surface Pro 9
- OS version: 22621.169
- Windows Feature Experience Pack: 1000.22632.1000.0
- Windows on Snapdragon: SC8280X
- qcadsprpc file version: must be 1.0.3530.9800 or later
To find the qcadsprpc version, follow these steps:
1. Update the Windows OS and Windows drivers to the latest versions.
2. Open the System Information menu and select Software Environment > System Drivers.
3. Locate the driver named "qcadsprpc" and note its file path.
4. Open Windows Explorer and navigate to the file path noted in step 3.
5. Hover over the qcadsprpc8280.sys file to view the file version number (alternatively, right-click the file > select Properties > select Details and read the file version).
6. Confirm that the file version is 1.0.3530.9800 or later.
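The comparison asked for in the last step is easy to get wrong when done on the raw strings (e.g. "1.0.9" > "1.0.10" lexically). A minimal Python sketch of a numeric check; the helper name is ours, not part of any SDK, and the version string itself is still read manually from the file's Properties > Details tab:

```python
def version_at_least(version, minimum="1.0.3530.9800"):
    """Return True if a dotted version string meets the required minimum."""
    # Compare component-wise as integers, not lexically as strings
    to_tuple = lambda v: tuple(int(part) for part in v.split("."))
    return to_tuple(version) >= to_tuple(minimum)

print(version_at_least("1.0.3530.9800"))  # True
print(version_at_least("1.0.3529.100"))   # False
```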
Workflow
- (Optional) Set up a Jupyter notebook. The code below has been verified in a Jupyter notebook on Windows, so we recommend following these instructions to set one up.
- Run the models in the Stable Diffusion pipeline. Given a user prompt, execute the models as a Stable Diffusion pipeline on the QNN HTP on the WoS platform to generate an image.
Setting up a Jupyter notebook
1. Download the Visual Studio Build Tools (2022): https://visualstudio.microsoft.com/visual-cpp-build-tools/
2. During installation, make sure to select "Desktop development with C++" in the Visual Studio Installer.
3. pip install notebook
4. jupyter notebook
Prepare all binaries and libraries for execution
# Copy the required libraries and binaries to a common folder
import shutil
import os

execution_ws = os.getcwd()
SDK_dir = os.path.join(execution_ws, "qnn_assets", "<Insert path to unzipped QNN SDK here>")
lib_dir = os.path.join(SDK_dir, "lib", "aarch64-windows-msvc")
binary = os.path.join(SDK_dir, "bin", "aarch64-windows-msvc", "qnn-net-run.exe")
skel = os.path.join(SDK_dir, "lib", "hexagon-v68", "unsigned", "libQnnHtpV68Skel.so")
des_dir = os.path.join(execution_ws, "qnn_assets", "QNN_binaries")
os.makedirs(des_dir, exist_ok=True)

# Copy necessary libraries to a common location
libs = ["QnnHtp.dll", "QnnHtpNetRunExtensions.dll", "QnnHtpPrepare.dll", "QnnHtpV68Stub.dll"]
for lib in libs:
    shutil.copy(os.path.join(lib_dir, lib), des_dir)

# Copy binary
shutil.copy(binary, des_dir)

# Copy skel
shutil.copy(skel, des_dir)
Executing the models in the Stable Diffusion pipeline with Qualcomm AI Engine Direct
This section continues with the Qualcomm AI Engine Direct SDK and contains the components needed for the end-to-end Stable Diffusion pipeline. It demonstrates how to run the Stable Diffusion models on Windows on Snapdragon devices using Qualcomm AI Engine Direct software and hardware.
User input
import numpy as np
# Any user defined prompt
user_prompt = "decorated modern country house interior, 8 k, light reflections"
# User defined seed value
user_seed = np.int64(1.36477711e+14)
# User defined step value, any integer value in {20, 50}
user_step = 20
# User define text guidance, any float value in [5.0, 15.0]
user_text_guidance = 7.5
# Error checking for user_seed
assert isinstance(user_seed, np.int64), "user_seed should be of type int64"

# Error checking for user_step
assert isinstance(user_step, int), "user_step should be of type int"
assert user_step in (20, 50), "user_step should be either 20 or 50"

# Error checking for user_text_guidance
assert isinstance(user_text_guidance, float), "user_text_guidance should be of type float"
assert 5.0 <= user_text_guidance <= 15.0, "user_text_guidance should be a float in [5.0, 15.0]"
Embedding function
Estimated execution time: 4 minutes
import torch
from diffusers import UNet2DConditionModel
from diffusers.models.embeddings import get_timestep_embedding
# pre-load time embedding
time_embeddings = UNet2DConditionModel.from_pretrained('runwayml/stable-diffusion-v1-5',
subfolder='unet', cache_dir='./cache/diffusers').time_embedding
def get_time_embedding(timestep):
    timestep = torch.tensor([timestep])
    t_emb = get_timestep_embedding(timestep, 320, True, 0)
    emb = time_embeddings(t_emb).detach().numpy()
    return emb
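For intuition, the sinusoidal part of the call above (`get_timestep_embedding(timestep, 320, True, 0)`, i.e. 320 dimensions, `flip_sin_to_cos=True`, `downscale_freq_shift=0`) can be sketched in plain NumPy. This is only an illustration of the math; the demo itself uses diffusers' implementation followed by the pre-trained `time_embedding` MLP:

```python
import numpy as np

def sinusoidal_time_embedding(timestep, dim=320, max_period=10000.0):
    # Geometric frequency ladder from 1 down to ~1/max_period
    half = dim // 2
    freqs = np.exp(-np.log(max_period) * np.arange(half) / half)
    args = float(timestep) * freqs
    # flip_sin_to_cos=True places the cosine terms first
    return np.concatenate([np.cos(args), np.sin(args)])

emb = sinusoidal_time_embedding(999)  # shape (320,), values in [-1, 1]
```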
Tokenizer
import numpy as np
from tokenizers import Tokenizer
# Define Tokenizer output max length (must be 77)
tokenizer_max_length = 77
# Initializing the Tokenizer
tokenizer = Tokenizer.from_pretrained("openai/clip-vit-base-patch32")
# Setting max length to tokenizer_max_length
tokenizer.enable_truncation(tokenizer_max_length)
tokenizer.enable_padding(pad_id=49407, length=tokenizer_max_length)
def run_tokenizer(prompt):
    # Run Tokenizer encoding
    token_ids = tokenizer.encode(prompt).ids
    # Convert tokens list to np.array
    token_ids = np.array(token_ids, dtype=np.float32)
    return token_ids
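The effect of `enable_truncation`/`enable_padding` above is that every prompt maps to exactly 77 ids, padded with 49407 (CLIP's end-of-text id, reused here as the pad id). A self-contained sketch of that behavior, with a hypothetical helper name; the demo itself relies on the `tokenizers` library:

```python
import numpy as np

def pad_or_truncate(token_ids, length=77, pad_id=49407):
    # Clip to the fixed sequence length, then pad the remainder
    token_ids = list(token_ids)[:length]
    token_ids += [pad_id] * (length - len(token_ids))
    return np.array(token_ids, dtype=np.float32)

# 49406/49407 are CLIP's start-/end-of-text ids; the middle ids are arbitrary
ids = pad_or_truncate([49406, 1234, 5678, 49407])
```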
Scheduler
import numpy as np
import torch
from diffusers import DPMSolverMultistepScheduler
# Initializing the Scheduler
scheduler = DPMSolverMultistepScheduler(num_train_timesteps=1000, beta_start=0.00085,
beta_end=0.012, beta_schedule="scaled_linear")
# Setting up user provided time steps for Scheduler
scheduler.set_timesteps(user_step)
def run_scheduler(noise_pred_uncond, noise_pred_text, latent_in, timestep):
    # Convert all inputs from NHWC to NCHW
    noise_pred_uncond = np.transpose(noise_pred_uncond, (0, 3, 1, 2)).copy()
    noise_pred_text = np.transpose(noise_pred_text, (0, 3, 1, 2)).copy()
    latent_in = np.transpose(latent_in, (0, 3, 1, 2)).copy()

    # Convert all inputs to torch tensors
    noise_pred_uncond = torch.from_numpy(noise_pred_uncond)
    noise_pred_text = torch.from_numpy(noise_pred_text)
    latent_in = torch.from_numpy(latent_in)

    # Merge noise_pred_uncond and noise_pred_text based on user_text_guidance
    noise_pred = noise_pred_uncond + user_text_guidance * (noise_pred_text - noise_pred_uncond)

    # Run Scheduler step
    latent_out = scheduler.step(noise_pred, timestep, latent_in).prev_sample.numpy()

    # Convert latent_out from NCHW to NHWC
    latent_out = np.transpose(latent_out, (0, 2, 3, 1)).copy()
    return latent_out

# Function to get timesteps
def get_timestep(step):
    return np.int32(scheduler.timesteps.numpy()[step])
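The merge step inside `run_scheduler` is standard classifier-free guidance: the final noise estimate starts from the unconditional prediction and moves toward the text-conditioned one, scaled by the guidance value. Isolated as a minimal NumPy sketch (the function name is ours):

```python
import numpy as np

def cfg_merge(noise_uncond, noise_text, guidance):
    # guidance = 1.0 reproduces the text-conditioned prediction exactly;
    # larger values push further in the text-conditioned direction
    return noise_uncond + guidance * (noise_text - noise_uncond)

merged = cfg_merge(np.zeros(4), np.ones(4), 7.5)
```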
Inference with Qualcomm AI Engine Direct
import numpy as np
import os
import shutil
# Define QNN binaries path
QNN_binaries_path = 'qnn_assets\\QNN_binaries'
# Define generic qnn-net-run block
def run_qnn_net_run(model_context, input_data_list):
    # Define tmp directory path for intermediate artifacts
    tmp_dirpath = os.path.abspath('tmp')
    os.makedirs(tmp_dirpath, exist_ok=True)

    # Dump each input from input_data_list to a raw file and
    # prepare input_list_filepath for qnn-net-run
    input_list_text = ''
    for index, input_data in enumerate(input_data_list):
        # Create and dump each input into a raw file
        raw_file_path = f'{tmp_dirpath}/input_{index}.raw'
        input_data.tofile(raw_file_path)
        # Keep appending raw_file_path to input_list_text for the input_list file
        input_list_text += raw_file_path + ' '

    # Create input_list_filepath and write the prepared input_list_text into it
    input_list_filepath = f'{tmp_dirpath}/input_list.txt'
    with open(input_list_filepath, 'w') as f:
        f.write(input_list_text)

    # Execute qnn-net-run on the shell
    !{QNN_binaries_path}\qnn-net-run.exe --retrieve_context {model_context} --backend {QNN_binaries_path}/QnnHtp.dll \
        --input_list {input_list_filepath} --output_dir {tmp_dirpath} > {tmp_dirpath}/log.txt

    # Read the output data generated by qnn-net-run
    output_data = np.fromfile(f'{tmp_dirpath}/Result_0/output_1.raw', dtype=np.float32)

    # Delete all intermediate artifacts
    shutil.rmtree(tmp_dirpath)

    return output_data
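The data exchange that `run_qnn_net_run` relies on is a headerless dump of little-endian float32 values: `tofile()` writes only the flat buffer, so shape information is lost on disk and must be restored with a `reshape` after `fromfile()`. That is why the per-model wrappers below each reshape their output. A self-contained round-trip sketch:

```python
import numpy as np
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dirpath:
    raw_file_path = os.path.join(tmp_dirpath, "input_0.raw")
    # An NHWC tensor, as the pipeline's models expect
    tensor = np.arange(12, dtype=np.float32).reshape((1, 2, 2, 3))
    tensor.tofile(raw_file_path)  # flat float32 bytes, no shape header
    # The reader must know dtype and shape out of band
    restored = np.fromfile(raw_file_path, dtype=np.float32).reshape((1, 2, 2, 3))
```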
# Define models context path
models_context_path = 'qnn_assets\\stable_diffusion_models'
# qnn-net-run for text encoder
def run_text_encoder(input_data):
    output_data = run_qnn_net_run(f'{models_context_path}\\text_encoder\\text_encoder.serialized.bin', [input_data])
    # Output of the text encoder should be of shape (1, 77, 768)
    output_data = output_data.reshape((1, 77, 768))
    return output_data

# qnn-net-run for U-Net
def run_unet(input_data_1, input_data_2, input_data_3):
    output_data = run_qnn_net_run(f'{models_context_path}\\unet\\unet.serialized.bin', [input_data_1, input_data_2, input_data_3])
    # Output of the U-Net should be of shape (1, 64, 64, 4)
    output_data = output_data.reshape((1, 64, 64, 4))
    return output_data

# qnn-net-run for VAE
def run_vae(input_data):
    output_data = run_qnn_net_run(f'{models_context_path}\\vae_decoder\\vae_decoder.serialized.bin', [input_data])
    # Convert floating-point output into an 8-bit RGB image
    output_data = np.clip(output_data * 255.0, 0.0, 255.0).astype(np.uint8)
    # Output of the VAE should be of shape (512, 512, 3)
    output_data = output_data.reshape((512, 512, 3))
    return output_data
Run the Stable Diffusion pipeline
Estimated execution time: 3 minutes
# Run Tokenizer
uncond_tokens = run_tokenizer("")
cond_tokens = run_tokenizer(user_prompt)
# Run Text Encoder on Tokens
uncond_text_embedding = run_text_encoder(uncond_tokens)
user_text_embedding = run_text_encoder(cond_tokens)
# Initialize the latent input with random initial latent
random_init_latent = torch.randn((1, 4, 64, 64), generator=torch.manual_seed(user_seed)).numpy()
latent_in = random_init_latent.transpose((0, 2, 3, 1)).copy()
# Run the loop for user_step times
for step in range(user_step):
    print(f'Step {step} Running...')

    # Get timestep from step
    timestep = get_timestep(step)

    # Run the U-Net on the unconditional embeddings
    unconditional_noise_pred = run_unet(latent_in, get_time_embedding(timestep), uncond_text_embedding)

    # Run the U-Net on the user text embeddings
    conditional_noise_pred = run_unet(latent_in, get_time_embedding(timestep), user_text_embedding)

    # Run the scheduler
    latent_in = run_scheduler(unconditional_noise_pred, conditional_noise_pred, latent_in, timestep)
# Run VAE
output_image = run_vae(latent_in)
from PIL import Image
from IPython.display import display
# Display the generated output
display(Image.fromarray(output_image, mode="RGB"))
Example results