Task02: Text-to-Image Technical Paths, Principles, and Hands-On Stable Diffusion

  1. AIGC Fundamentals: Stable Diffusion

     1.1 What is AIGC? AIGC (AI-Generated Content) is a production paradigm in which content is generated automatically by AI.
    

    [figure]

     1.2 The Evolution of AIGC Technology
    

    [figure]

     	- Models based on Generative Adversarial Networks (GANs)
    

    [figure]

     	- Autoregressive models
    

    [figure]

     	- Diffusion models (see the forward-process sketch after this list)
    

[figure]

			- LDM (Latent Diffusion Model) schematic:

[figure]

		- Transformer-based diffusion models

[figure]
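For intuition about the diffusion family above, here is a minimal sketch of the DDPM forward (noising) process that these models share. This is an illustrative sketch of the standard formulation, not code from this course's materials:

import torch

# Forward process: q(x_t | x_0) = N(sqrt(a_bar_t) * x_0, (1 - a_bar_t) * I)
def q_sample(x0, t, alphas_cumprod):
    noise = torch.randn_like(x0)
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)  # cumulative product of (1 - beta)
    return a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise

betas = torch.linspace(1e-4, 0.02, 1000)            # linear noise schedule
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)
x0 = torch.randn(2, 3, 64, 64)                      # stand-in for a batch of images
xt = q_sample(x0, torch.tensor([10, 500]), alphas_cumprod)  # lightly vs. heavily noised

A generator is then trained to reverse this process step by step, predicting the noise so it can be removed; LDMs run the same process in a VAE's latent space instead of pixel space.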

	1.3 Using AIGC Models and Optimizing Generation Quality

[figure]
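As a concrete example of using such a model, here is a minimal text-to-image call with the diffusers library. The model id is one public Stable Diffusion 1.5 checkpoint (an assumption; any local checkpoint works); the negative prompt and guidance scale are the two most common levers for improving output quality.

import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

image = pipe(
    "a photo of an astronaut riding a horse on mars",
    negative_prompt="lowres, blurry, bad anatomy",  # steer away from common failure modes
    guidance_scale=7.5,                             # higher = follow the prompt more closely
    num_inference_steps=30,
).images[0]
image.save("astronaut.png")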

	1.4 More Small Applications Built on Stable Diffusion

1. FaceChain
2. InstantID
3. AnyText
4. ReplaceAnything
5. OutfitAnyone

	1.5 The Evolution of Video Generation Technology

[figures]

	1.6 Running diffusers and Models Locally with ModelScope

[figure]
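A minimal sketch of the local workflow: download a checkpoint through ModelScope, then load it with diffusers entirely from disk. The model id here is an assumption; substitute whichever Stable Diffusion checkpoint you actually use.

import torch
from modelscope import snapshot_download
from diffusers import StableDiffusionPipeline

# Download once into the local cache, then load offline.
model_dir = snapshot_download('AI-ModelScope/stable-diffusion-v1-5')  # assumed model id
pipe = StableDiffusionPipeline.from_pretrained(model_dir, torch_dtype=torch.float16).to('cuda')
image = pipe("a watercolor painting of West Lake at dawn").images[0]
image.save("west_lake.png")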

  2. Transformers: Technical Analysis and Practice (LLM); Image Generation with Transformer-Based Diffusion Models and Practice

    2.1 Self-Attention
    2.1.1 Attention
    [figure] From: https://arxiv.org/pdf/1703.03906.pdf
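The `SelfAttention` layer imported below comes from a course-provided `selfattention.py` that is not reproduced here. For reference, a minimal multi-head implementation matching the interface used below (the layer returns the attention weights along with the output; this is a sketch inferred from its usage, not the course's exact code):

import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention(nn.Module):
    """Minimal multi-head self-attention; interface inferred from its usage below."""

    def __init__(self, config):
        super().__init__()
        self.num_heads, self.head_dim = config.num_heads, config.head_dim
        inner = config.num_heads * config.head_dim
        self.qkv = nn.Linear(config.hidden_dim, inner * 3)
        self.proj = nn.Linear(inner, config.hidden_dim)
        self.dropout = nn.Dropout(config.dropout)

    def forward(self, x):
        B, T, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # (B, T, H*D) -> (B, H, T, D)
        q, k, v = (m.reshape(B, T, self.num_heads, self.head_dim).transpose(1, 2)
                   for m in (q, k, v))
        scores = q @ k.transpose(-2, -1) / self.head_dim ** 0.5  # scaled dot-product
        attn = self.dropout(F.softmax(scores, dim=-1))           # (B, H, T, T)
        out = (attn @ v).transpose(1, 2).reshape(B, T, -1)
        return attn, self.proj(out)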

from dataclasses import dataclass

import torch
import torch.nn as nn
import torch.nn.functional as F

from selfattention import SelfAttention


class Model(nn.Module):

    def __init__(self, config):
        super().__init__()
        self.config = config
        self.emb = nn.Embedding(config.vocab_size, config.hidden_dim)
        self.attn = SelfAttention(config)
        self.fc = nn.Linear(config.hidden_dim, config.num_labels)

    def forward(self, x):
        batch_size, seq_len = x.shape
        h = self.emb(x)                                   # (B, T, hidden_dim)
        attn_score, h = self.attn(h)
        # Mean-pool over the sequence dimension, then classify.
        h = F.avg_pool1d(h.permute(0, 2, 1), seq_len, 1)  # (B, hidden_dim, 1)
        h = h.squeeze(-1)
        logits = self.fc(h)
        return attn_score, logits


@dataclass
class Config:

    vocab_size: int = 5000
    hidden_dim: int = 512
    num_heads: int = 16
    head_dim: int = 32
    dropout: float = 0.1

    num_labels: int = 2

    max_seq_len: int = 512

    num_epochs: int = 10


# Smoke test: push random token ids through the untrained model.
config = Config(5000, 512, 16, 32, 0.1, 2)
model = Model(config)
x = torch.randint(0, 5000, (3, 30))
x.shape
attn, logits = model(x)
attn.shape, logits.shape
import pandas as pd
from sklearn.model_selection import train_test_split

file_path = "./data/ChnSentiCorp_htl_all.csv"
df = pd.read_csv(file_path)  # hotel-review sentiment dataset: label 1 = positive, 0 = negative
df = df.dropna()             # drop rows with missing review text
df.label.value_counts()
# Downsample the positive class to roughly balance the two labels.
df = pd.concat([df[df.label==1].sample(2500), df[df.label==0]])
df.shape
df.label.value_counts()
from tokenizer import Tokenizer
tokenizer = Tokenizer(config.vocab_size, config.max_seq_len)
tokenizer.build_vocab(df.review)
tokenizer(["你好", "你好呀"])
def collate_batch(batch):
    label_list, text_list = [], []
    for v in batch:
        _label = v["label"]
        _text = v["text"]
        label_list.append(_label)
        text_list.append(_text)
    inputs = tokenizer(text_list)
    labels = torch.LongTensor(label_list)
    return inputs, labels
from dataset import Dataset
ds = Dataset()
ds.build(df, "review", "label")
len(ds), ds[0]
train_ds, test_ds = train_test_split(ds, test_size=0.2)
train_ds, valid_ds = train_test_split(train_ds, test_size=0.1)
len(train_ds), len(valid_ds), len(test_ds)
from torch.utils.data import DataLoader
BATCH_SIZE = 8
train_dl = DataLoader(train_ds, batch_size=BATCH_SIZE, collate_fn=collate_batch)
valid_dl = DataLoader(valid_ds, batch_size=BATCH_SIZE, collate_fn=collate_batch)
test_dl = DataLoader(test_ds, batch_size=BATCH_SIZE, collate_fn=collate_batch)
len(train_dl), len(valid_dl), len(test_dl)
for v in train_dl: break  # grab one batch to inspect shapes and dtypes
v[0].shape, v[1].shape, v[0].dtype, v[1].dtype
from trainer import train, test
NUM_EPOCHS = 10
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# A smaller, single-head model (hidden_dim=64) for quick training.
config = Config(5000, 64, 1, 64, 0.1, 2)
model = Model(config)
model.to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-3)
train(model, optimizer, train_dl, valid_dl, config)

test(model, test_dl)
from inference import infer, plot_attention
import numpy as np
# Keep sampling until we get a short review (< 20 chars) so the attention map stays readable.
sample = np.random.choice(test_ds)
while len(sample["text"]) > 20:
    sample = np.random.choice(test_ds)

print(sample)

inp = sample["text"]
inputs = tokenizer(inp)
attn, prob = infer(model, inputs.to(device))
attn_prob = attn[0, 0, :, :].cpu().numpy()
tokens = tokenizer.tokenize(inp)
tokens, prob
plot_attention(attn_prob, tokens, tokens)

2.2 LLaMA
2.2.1 LLM building blocks
- Tokenize
- Decoding (see the greedy-decoding sketch below)
- Transformer Block
[figure]
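As a tiny illustration of the Decoding bullet above, a model-agnostic greedy decoding loop. This is a sketch; `step_fn` is a hypothetical callable that maps a batch of token ids to next-token logits:

import torch

@torch.no_grad()
def greedy_decode(step_fn, prompt_ids, max_new_tokens=32, eos_id=None):
    """Repeatedly append the argmax token until EOS or the length budget runs out."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = step_fn(torch.tensor([ids]))  # (1, vocab_size) logits for the next token
        next_id = int(logits.argmax(dim=-1))
        ids.append(next_id)
        if eos_id is not None and next_id == eos_id:
            break
    return ids

Sampling-based decoding (temperature, top-k, top-p) replaces the argmax with a draw from the softmax distribution.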

2.3 Review Questions and Exercises
	2.3.1 Attention
	1. How do you understand Attention?
	2. What is the difference between multiplicative (dot-product) Attention and additive Attention?
	3. Why does Self-Attention use Dot-Product Attention?
	4. What is the role of the scaling factor in Self-Attention? Does it have to be `sqrt(d_k)`?
	5. In Multi-Head Self-Attention, are more heads always better? Why?
	6. With `hidden_dim` fixed in Multi-Head Self-Attention, how do you think increasing `head_dim` (which forces a smaller `num_heads`) versus decreasing `head_dim` would affect the results?
	7. Why do we usually apply Dropout to the Attention weights? Where else is Dropout typically applied? How does Dropout behave at inference time? How do you understand Dropout?
	8. When initializing the q/k/v projections in Self-Attention, how should the bias be set, and why?
	9. What other Attention variants do you know? What optimizations and improvements do they make over the vanilla implementation?
	10. What do you see as the weaknesses and limitations of Attention?
	11. How do you understand the "Deep" in Deep Learning? The code currently has a single Attention layer; would stacking a few more improve the results?
	12. In Deep Learning, what do depth and width each contribute, and how should you weigh them when designing a model architecture?
	2.3.2 LLM
	1. How do you understand tokenization? Which tokenization schemes do you know, and how do they differ?
	2. What properties should an ideal Tokenizer have?
	3. Tokenizers define special tokens such as begin- and end-of-sequence markers; what are they for? Why can't a model simply learn start and end markers on its own?
	4. Why are LLMs Decoder-Only?
	5. What does RMSNorm do, and how does it differ from LayerNorm? Why not use LayerNorm?
	6. Where do residual connections appear in an LLM, and why are they used?
	7. How do Pre-Normalization and Post-Normalization affect a model? Why do today's LLMs all use Pre-Normalization?
	8. Why does the FFN first expand and then project back down? What is the role of each step?
	9. Why do LLMs need positional encoding? Which positional-encoding schemes do you know?
	10. Why has RoPE stood out among the many positional encodings? What are its main improvements?
	11. If you were designing a positional-encoding scheme, which factors would you consider?
	12. Port some of the designs from the LLM part (e.g., RMSNorm, sketched below) into the Self-Attention model from earlier and see whether they improve the results.
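As a starting point for exercise 12, a minimal RMSNorm in the LLaMA style (a sketch, not the course's reference solution): it normalizes by the root mean square of the features, with a learned scale and no mean subtraction or bias.

import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        # Scale by the reciprocal root-mean-square over the last dimension.
        rms_inv = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return self.weight * (x * rms_inv)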

3. Transformer-Based Diffusion: Technical Analysis and Practice

3.1 Background on Transformers + Diffusion
	3.1.1 Background of Transformer diffusion models
	3.1.2 What is ViT? The Vision Transformer (ViT) is, essentially, a Transformer applied to images.
	ViT architecture:

[figure]

Paper: https://arxiv.org/abs/2010.11929
Official repo (in JAX): https://github.com/google-research/vision_transformer
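To make "a Transformer applied to images" concrete, here is the token arithmetic for the standard ViT-Base/16 configuration (values from the paper):

# ViT-Base/16 tokenization arithmetic (standard published settings)
image_size, patch_size, channels = 224, 16, 3
num_patches = (image_size // patch_size) ** 2   # 14 * 14 = 196 patch tokens
patch_dim = patch_size * patch_size * channels  # 768 values per flattened patch
seq_len = num_patches + 1                       # +1 for the [CLS] token
print(num_patches, patch_dim, seq_len)          # 196 768 197

Each flattened patch is linearly projected to the hidden dimension, a learned position embedding is added, and the resulting sequence goes through a plain Transformer encoder.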

	3.1.3 ViT inside Large Language Models (Qwen-VL as an example)

[figure]

	3.1.4 ViViT: a Video ViT

[figure]

	3.1.5 Latte: a Latent Diffusion Transformer for video generation

[figure]

3.2 Hands-On Code

Patching best practices

# Image preprocessing

import tensorflow as tf

def read_image(image_file="/mnt/workspace/image_1.png", scale=True, image_dim=336):
    # Load as RGB, optionally resize to image_dim, then crop to a square.
    image = tf.keras.utils.load_img(
        image_file, grayscale=False, color_mode='rgb', target_size=None,
        interpolation='nearest'
    )
    image_arr_orig = tf.keras.preprocessing.image.img_to_array(image)
    if scale:
        image_arr_orig = tf.image.resize(
            image_arr_orig, [image_dim, image_dim],
            method=tf.image.ResizeMethod.BILINEAR, preserve_aspect_ratio=False
        )
    image_arr = tf.image.crop_to_bounding_box(
        image_arr_orig, 0, 0, image_dim, image_dim
    )
    return image_arr

# Patching
def create_patches(image):
    im = tf.expand_dims(image, axis=0)
    patches = tf.image.extract_patches(
        images=im,
        sizes=[1, 32, 32, 1],
        strides=[1, 32, 32, 1],
        rates=[1, 1, 1, 1],
        padding="VALID"
    )
    patch_dims = patches.shape[-1]
    patches = tf.reshape(patches, [1, -1, patch_dims])

    return patches

image_arr = read_image()
patches = create_patches(image_arr)

# Drawing
import numpy as np
import matplotlib.pyplot as plt

def render_image_and_patches(image, patches):
    plt.figure(figsize=(16, 16))
    plt.suptitle("Cropped Image", size=48)
    plt.imshow(tf.cast(image, tf.uint8))
    plt.axis("off")
    n = int(np.sqrt(patches.shape[1]))
    plt.figure(figsize=(16, 16))
    plt.suptitle("Image Patches", size=24)
    for i, patch in enumerate(patches[0]):
        ax = plt.subplot(n, n, i + 1)
        patch_img = tf.reshape(patch, (32, 32, 3))
        ax.imshow(patch_img.numpy().astype("uint8"))
        ax.axis("off")

def render_flat(patches):
    # Render the first 101 patches in a single row.
    plt.figure(figsize=(32, 2))
    plt.suptitle("Flattened Image Patches", size=24)
    for i, patch in enumerate(patches[0]):
        ax = plt.subplot(1, 101, i + 1)
        patch_img = tf.reshape(patch, (32, 32, 3))
        ax.imshow(patch_img.numpy().astype("uint8"))
        ax.axis("off")
        if i == 100:
            break


render_image_and_patches(image_arr, patches)
render_flat(patches)

[figures]
ViT best practices

# Load the model
from transformers import ViTForImageClassification
import torch
from modelscope import snapshot_download

model_dir = snapshot_download('AI-ModelScope/vit-base-patch16-224')

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

model = ViTForImageClassification.from_pretrained(model_dir)
model.to(device)
# Load a sample image
from PIL import Image
import requests

url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image = Image.open(requests.get(url, stream=True).raw)
image
# Standard image preprocessing
from transformers import ViTImageProcessor

processor = ViTImageProcessor.from_pretrained(model_dir)
inputs = processor(images=image, return_tensors="pt").to(device)
pixel_values = inputs.pixel_values
print(pixel_values.shape)
with torch.no_grad():
    outputs = model(pixel_values)
logits = outputs.logits
logits.shape
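This checkpoint is an ImageNet-1k classifier, so `logits` has shape `[1, 1000]`. To read off the predicted label, `transformers` models carry an `id2label` mapping in their config:

predicted_class = logits.argmax(-1).item()
print(model.config.id2label[predicted_class])  # typically "Egyptian cat" for this COCO image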

[figure]
U-ViT best practices

!git clone https://github.com/baofff/U-ViT
!pip install einops
!pip install --upgrade pip
import os
os.chdir('/mnt/workspace/U-ViT')
os.environ['PYTHONPATH'] = '/env/python:/mnt/workspace/U-ViT'  # point at the repo we just cloned

import torch
from dpm_solver_pp import NoiseScheduleVP, DPM_Solver
import libs.autoencoder
from libs.uvit import UViT
import einops
from torchvision.utils import save_image
from PIL import Image
from modelscope.hub.file_download import model_file_download
image_size = "256" #@param [256, 512]
image_size = int(image_size)

if image_size == 256:
    model_file_download(model_id='thu-ml/imagenet256_uvit_huge',file_path='imagenet256_uvit_huge.pth', cache_dir='/mnt/workspace')
    !mv /mnt/workspace/thu-ml/imagenet256_uvit_huge/imagenet256_uvit_huge.pth /mnt/workspace/U-ViT
else:
    model_file_download(model_id='thu-ml/imagenet512_uvit_huge',file_path='imagenet512_uvit_huge.pth', cache_dir='/mnt/workspace')
    !mv /mnt/workspace/thu-ml/imagenet512_uvit_huge/imagenet512_uvit_huge.pth /mnt/workspace/U-ViT
 
z_size = image_size // 8
patch_size = 2 if image_size == 256 else 4
device = 'cuda' if torch.cuda.is_available() else 'cpu'

nnet = UViT(img_size=z_size,
       patch_size=patch_size,
       in_chans=4,
       embed_dim=1152,
       depth=28,
       num_heads=16,
       num_classes=1001,
       conv=False)

nnet.to(device)
nnet.load_state_dict(torch.load(f'imagenet{image_size}_uvit_huge.pth', map_location='cpu'))
nnet.eval()
model_file_download(model_id='AI-ModelScope/autoencoder_kl_ema',file_path='autoencoder_kl_ema.pth', cache_dir='/mnt/workspace')
!mv /mnt/workspace/AI-ModelScope/autoencoder_kl_ema/autoencoder_kl_ema.pth /mnt/workspace/U-ViT
autoencoder = libs.autoencoder.get_model('autoencoder_kl_ema.pth')
autoencoder.to(device)
seed = 4321 #@param {type:"number"}
steps = 25 #@param {type:"slider", min:0, max:1000, step:1}
cfg_scale = 3 #@param {type:"slider", min:0, max:10, step:0.1}
class_labels = 207, 360, 387, 974, 88, 979, 417, 279 #@param {type:"raw"}
samples_per_row = 4 #@param {type:"number"}
torch.manual_seed(seed)

def stable_diffusion_beta_schedule(linear_start=0.00085, linear_end=0.0120, n_timestep=1000):
    _betas = (
        torch.linspace(linear_start ** 0.5, linear_end ** 0.5, n_timestep, dtype=torch.float64) ** 2
    )
    return _betas.numpy()


_betas = stable_diffusion_beta_schedule()  # set the noise schedule
noise_schedule = NoiseScheduleVP(schedule='discrete', betas=torch.tensor(_betas, device=device).float())


y = torch.tensor(class_labels, device=device)
y = einops.repeat(y, 'B -> (B N)', N=samples_per_row)

def model_fn(x, t_continuous):
    t = t_continuous * len(_betas)
    _cond = nnet(x, t, y=y)
    # Class id 1000 is the null/unconditional label for this 1001-class checkpoint.
    _uncond = nnet(x, t, y=torch.tensor([1000] * x.size(0), device=device))
    # Classifier-free guidance; equals the usual formulation with guidance weight cfg_scale + 1.
    return _cond + cfg_scale * (_cond - _uncond)


z_init = torch.randn(len(y), 4, z_size, z_size, device=device)
dpm_solver = DPM_Solver(model_fn, noise_schedule, predict_x0=True, thresholding=False)

with torch.no_grad():
  with torch.cuda.amp.autocast():  # inference with mixed precision
    z = dpm_solver.sample(z_init, steps=steps, eps=1. / len(_betas), T=1.)
    samples = autoencoder.decode(z)
samples = 0.5 * (samples + 1.)
samples.clamp_(0., 1.)
save_image(samples, "sample.png", nrow=samples_per_row * 2, padding=0)
samples = Image.open("sample.png")
display(samples)

[figure]

ViViT best practices

!pip install ipywidgets
!pip install -qq medmnist
import os
import io
import imageio
import medmnist
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# setting seed for reproducibility
SEED = 42
os.environ["TF_CUDNN_DETERMINISTIC"] = "1"
keras.utils.set_random_seed(SEED)
# DATA
DATASET_NAME = "organmnist3d"
BATCH_SIZE = 32
AUTO = tf.data.AUTOTUNE
INPUT_SHAPE = (28, 28, 28, 1)
NUM_CLASSES = 11

# OPTIMIZER
LEARNING_RATE = 1e-4
WEIGHT_DECAY = 1e-5

# TRAINING
EPOCHS = 60

# TUBELET EMBEDDING
PATCH_SIZE = (8, 8, 8)
NUM_PATCHES = (INPUT_SHAPE[0] // PATCH_SIZE[0]) ** 2

# ViViT ARCHITECTURE
LAYER_NORM_EPS = 1e-6
PROJECTION_DIM = 128
NUM_HEADS = 8
NUM_LAYERS = 8
!wget https://modelscope.oss-cn-beijing.aliyuncs.com/resource/organmnist3d.npz
def download_and_prepare_dataset(data_info: dict):
    """
    Utility function to download the dataset and return train/valid/test
    videos and labels.
    Arguments:
        data_info (dict): Dataset metadata
    """
    data_path = "/mnt/workspace/organmnist3d.npz"

    with np.load(data_path) as data:
        # Get videos
        train_videos = data["train_images"]
        valid_videos = data["val_images"]
        test_videos = data["test_images"]

        # Get labels
        train_labels = data["train_labels"].flatten()
        valid_labels = data["val_labels"].flatten()
        test_labels = data["test_labels"].flatten()

    return (
        (train_videos, train_labels),
        (valid_videos, valid_labels),
        (test_videos, test_labels),
    )


# Get the metadata of the dataset
info = medmnist.INFO[DATASET_NAME]

# Get the dataset
prepared_dataset = download_and_prepare_dataset(info)
(train_videos, train_labels) = prepared_dataset[0]
(valid_videos, valid_labels) = prepared_dataset[1]
(test_videos, test_labels) = prepared_dataset[2]
@tf.function
def preprocess(frames: tf.Tensor, label: tf.Tensor):
    """Preprocess the frames tensors and parse the labels"""
    # Preprocess images
    frames = tf.image.convert_image_dtype(
        frames[
            ..., tf.newaxis
        ],  # The new axis is to help for further processing with Conv3D layers
        tf.float32,
    )

    # Parse label
    label = tf.cast(label, tf.float32)
    return frames, label


def prepare_dataloader(
    videos: np.ndarray,
    labels: np.ndarray,
    loader_type: str = "train",
    batch_size: int = BATCH_SIZE,
):
    """Utility function to prepare dataloader"""
    dataset = tf.data.Dataset.from_tensor_slices((videos, labels))

    if loader_type == "train":
        dataset = dataset.shuffle(BATCH_SIZE * 2)

    dataloader = (
        dataset.map(preprocess, num_parallel_calls=tf.data.AUTOTUNE)
        .batch(batch_size)
        .prefetch(tf.data.AUTOTUNE)
    )

    return dataloader


trainloader = prepare_dataloader(train_videos, train_labels, "train")
validloader = prepare_dataloader(valid_videos, valid_labels, "valid")
testloader = prepare_dataloader(test_videos, test_labels, "test")
class TubeletEmbedding(layers.Layer):
    def __init__(self, embed_dim, patch_size, **kwargs):
        super().__init__(**kwargs)
        self.projection = layers.Conv3D(
            filters=embed_dim,
            kernel_size=patch_size,
            strides=patch_size,
            padding="VALID",
        )
        self.flatten = layers.Reshape(target_shape=(-1, embed_dim))

    def call(self, videos):
        projected_patches = self.projection(videos)
        flattened_patches = self.flatten(projected_patches)
        return flattened_patches

class PositionalEncoder(layers.Layer):
    def __init__(self, embed_dim, **kwargs):
        super().__init__(**kwargs)
        self.embed_dim = embed_dim

    def build(self, input_shape):
        _, num_tokens, _ = input_shape
        self.position_embedding = layers.Embedding(
            input_dim=num_tokens, output_dim=self.embed_dim
        )
        self.positions = tf.range(start=0, limit=num_tokens, delta=1)

    def call(self, encoded_tokens):
        # Encode the positions and add them to the encoded tokens
        encoded_positions = self.position_embedding(self.positions)
        encoded_tokens = encoded_tokens + encoded_positions
        return encoded_tokens
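The notes stop at the embedding layers; the public Keras ViViT tutorial this code follows next assembles them into a classifier. A condensed sketch along those lines, using the hyperparameters defined above (a reference sketch, not necessarily the course's exact code):

def create_vivit_classifier(
    tubelet_embedder,
    positional_encoder,
    input_shape=INPUT_SHAPE,
    transformer_layers=NUM_LAYERS,
    num_heads=NUM_HEADS,
    embed_dim=PROJECTION_DIM,
    layer_norm_eps=LAYER_NORM_EPS,
    num_classes=NUM_CLASSES,
):
    inputs = layers.Input(shape=input_shape)
    # Tubelet embedding + learned positional encoding.
    patches = tubelet_embedder(inputs)
    encoded_patches = positional_encoder(patches)

    # Pre-norm Transformer encoder blocks with residual connections.
    for _ in range(transformer_layers):
        x1 = layers.LayerNormalization(epsilon=layer_norm_eps)(encoded_patches)
        attention_output = layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=embed_dim // num_heads, dropout=0.1
        )(x1, x1)
        x2 = layers.Add()([attention_output, encoded_patches])
        x3 = layers.LayerNormalization(epsilon=layer_norm_eps)(x2)
        x3 = keras.Sequential([
            layers.Dense(units=embed_dim * 4, activation=tf.nn.gelu),
            layers.Dense(units=embed_dim, activation=tf.nn.gelu),
        ])(x3)
        encoded_patches = layers.Add()([x3, x2])

    # Pool over tokens, then classify.
    representation = layers.LayerNormalization(epsilon=layer_norm_eps)(encoded_patches)
    representation = layers.GlobalAvgPool1D()(representation)
    outputs = layers.Dense(units=num_classes, activation="softmax")(representation)
    return keras.Model(inputs=inputs, outputs=outputs)

model = create_vivit_classifier(
    TubeletEmbedding(embed_dim=PROJECTION_DIM, patch_size=PATCH_SIZE),
    PositionalEncoder(embed_dim=PROJECTION_DIM),
)
model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=LEARNING_RATE),
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)
model.fit(trainloader, epochs=EPOCHS, validation_data=validloader)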

Latte best practices

!git clone https://github.com/maxin-cn/Latte.git
%cd Latte
!pip install timm
!pip install einops
!pip install omegaconf
!pip install diffusers==0.24.0
%cd models
!git lfs install
!git clone https://www.modelscope.cn/AI-ModelScope/Latte.git
%cd ..
# `!export` only affects a throwaway subshell; use %env so the variables persist.
%env CUDA_VISIBLE_DEVICES=0
%env PYTHONPATH=.
!python sample/sample_t2v.py --config configs/t2v/t2v_sample.yaml
  4. Text-to-Speech (TTS): Technical Analysis and Practice
    4.1 Model training tutorial

5. Links for Further Study
