1.1 What is AIGC? AIGC stands for AI-Generated Content: a mode of production in which AI is used to generate content automatically.
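1.2 Development of AIGC technology
- Models based on generative adversarial networks (GAN)
- Autoregressive models
- Diffusion models
- LDM schematic:
- Transformers-based diffusion models
1.3 Using AIGC models and improving AIGC generation quality
1.4 More small applications based on Stable Diffusion
1. facechain
2. InstantID
3. anytext
4. replaceanything
5. outfitanyone
1.5 Development of video generation technology
1.6 ModelScope: local diffusers + model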
- Transformers technical analysis + hands-on practice (LLM); image generation with various Transformers diffusion models + hands-on practice
2.1 Self-Attention
2.1.1 Attention
From: https://arxiv.org/pdf/1703.03906.pdf
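The model code that follows imports `SelfAttention` from a local `selfattention` module that is not shown here. As a rough illustration of the core computation only, the sketch below implements scaled dot-product (multiplicative) attention on plain Python lists. It is not that module: the function names are made up for this example, and a real implementation would presumably use PyTorch tensors with learned query/key/value projections, multiple heads and dropout.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(q, k, v):
    """q: [n_q][d], k: [n_k][d], v: [n_k][d_v], all as nested lists.

    Returns (weights, output); each row of weights sums to 1 over the keys.
    """
    d = len(q[0])
    scale = 1.0 / math.sqrt(d)
    weights = []
    for qi in q:
        # Dot product of one query with every key, scaled by 1/sqrt(d).
        scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k]
        weights.append(softmax(scores))
    d_v = len(v[0])
    # Each output row is the attention-weighted average of the value rows.
    output = [
        [sum(w[j] * v[j][c] for j in range(len(v))) for c in range(d_v)]
        for w in weights
    ]
    return weights, output


# Self-attention: queries, keys and values all come from the same sequence.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attn_weights, attn_out = scaled_dot_product_attention(x, x, x)
```

Each row of `attn_weights` is a probability distribution over the three positions, which is the kind of matrix an attention heat map visualizes.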
from dataclasses import dataclass

import torch
import torch.nn as nn
import torch.nn.functional as F
from selfattention import SelfAttention

class Model(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.config = config
        self.emb = nn.Embedding(config.vocab_size, config.hidden_dim)
        self.attn = SelfAttention(config)
        self.fc = nn.Linear(config.hidden_dim, config.num_labels)

    def forward(self, x):
        batch_size, seq_len = x.shape
        h = self.emb(x)
        attn_score, h = self.attn(h)
        h = F.avg_pool1d(h.permute(0, 2, 1), seq_len, 1)
        h = h.squeeze(-1)
        logits = self.fc(h)
        return attn_score, logits

@dataclass
class Config:
    vocab_size: int = 5000
    hidden_dim: int = 512
    num_heads: int = 16
    head_dim: int = 32
    dropout: float = 0.1
    num_labels: int = 2
    max_seq_len: int = 512
    num_epochs: int = 10

config = Config(5000, 512, 16, 32, 0.1, 2)
model = Model(config)
x = torch.randint(0, 5000, (3, 30))
x.shape
attn, logits = model(x)
attn.shape, logits.shape
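import pandas as pd
from sklearn.model_selection import train_test_split
file_path = "./data/ChnSentiCorp_htl_all.csv"
# Load the CSV before using `df` (this line was missing)
df = pd.read_csv(file_path)
df.label.value_counts()
# Downsample the positive class to 2500 rows and keep all negative rows
df = pd.concat([df[df.label==1].sample(2500), df[df.label==0]])
df.shape
df.label.value_counts()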
from tokenizer import Tokenizer
tokenizer = Tokenizer(config.vocab_size, config.max_seq_len)
tokenizer.build_vocab(df.review)
tokenizer(["你好", "你好呀"])
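The `tokenizer` module used above is local to the course and is not shown. To make the three calls concrete (`build_vocab`, calling the tokenizer on a batch, and `tokenize` later on), here is a minimal character-level sketch with the same call shape. Everything in it is hypothetical, including the class name `CharTokenizer` and the `<pad>`/`<unk>` conventions. It returns plain lists, whereas the course's `Tokenizer` presumably returns tensors, since its output is later moved to a device.

```python
from collections import Counter


class CharTokenizer:
    """Hypothetical character-level tokenizer; not the course's `Tokenizer`."""

    PAD, UNK = "<pad>", "<unk>"

    def __init__(self, vocab_size, max_seq_len):
        self.vocab_size = vocab_size
        self.max_seq_len = max_seq_len
        self.token_to_id = {self.PAD: 0, self.UNK: 1}

    def build_vocab(self, texts):
        # Keep the most frequent characters, leaving room for PAD and UNK.
        counts = Counter(ch for text in texts for ch in text)
        for ch, _ in counts.most_common(self.vocab_size - 2):
            self.token_to_id[ch] = len(self.token_to_id)

    def tokenize(self, text):
        # One token per character, truncated to max_seq_len.
        return list(text)[: self.max_seq_len]

    def __call__(self, texts):
        if isinstance(texts, str):
            texts = [texts]
        unk = self.token_to_id[self.UNK]
        ids = [[self.token_to_id.get(t, unk) for t in self.tokenize(s)] for s in texts]
        # Pad every sequence to the longest one in the batch.
        width = max((len(seq) for seq in ids), default=0)
        return [seq + [0] * (width - len(seq)) for seq in ids]


demo = CharTokenizer(vocab_size=10, max_seq_len=8)
demo.build_vocab(["你好", "你好呀"])
batch = demo(["你好", "你好呀"])
```

The shorter text is right-padded with the PAD id so that both rows have the same length, which is what lets a batch be stacked into a single tensor.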
def collate_batch(batch):
    label_list, text_list = [], []
    for v in batch:
        _label = v["label"]
        _text = v["text"]
        label_list.append(_label)
        text_list.append(_text)
    inputs = tokenizer(text_list)
    labels = torch.LongTensor(label_list)
    return inputs, labels
from dataset import Dataset
ds = Dataset()
ds.build(df, "review", "label")
len(ds), ds[0]
train_ds, test_ds = train_test_split(ds, test_size=0.2)
train_ds, valid_ds = train_test_split(train_ds, test_size=0.1)
len(train_ds), len(valid_ds), len(test_ds)
from torch.utils.data import DataLoader
BATCH_SIZE = 8
train_dl = DataLoader(train_ds, batch_size=BATCH_SIZE, collate_fn=collate_batch)
valid_dl = DataLoader(valid_ds, batch_size=BATCH_SIZE, collate_fn=collate_batch)
test_dl = DataLoader(test_ds, batch_size=BATCH_SIZE, collate_fn=collate_batch)
len(train_dl), len(valid_dl), len(test_dl)
for v in train_dl: break
v[0].shape, v[1].shape, v[0].dtype, v[1].dtype
from trainer import train, test
NUM_EPOCHS = 10
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
config = Config(5000, 64, 1, 64, 0.1, 2)
model = Model(config)
model.to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-3)
train(model, optimizer, train_dl, valid_dl, config)
test(model, test_dl)
from inference import infer, plot_attention
import numpy as np
sample = np.random.choice(test_ds)
while len(sample["text"]) > 20:
    sample = np.random.choice(test_ds)
print(sample)
inp = sample["text"]
inputs = tokenizer(inp)
attn, prob = infer(model, inputs.to(device))
attn_prob = attn[0, 0, :, :].cpu().numpy()
tokens = tokenizer.tokenize(inp)
tokens, prob
plot_attention(attn_prob, tokens, tokens)
2.2 LLaMA
2.2.1 LLM
- Tokenize
- Decoding
- Transformer Block
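2.3 Reflection and exercises
2.3.1 Attention
1. How do you understand Attention?
2. How do multiplicative Attention and additive Attention differ?
3. Why does Self-Attention use Dot-Product Attention?
4. What does the scaling factor in Self-Attention do? Does it have to be `sqrt(d_k)`?
5. In Multi-Head Self-Attention, is a larger number of heads always better? Why or why not?
6. In Multi-Head Self-Attention with `hidden_dim` fixed, how do you think increasing `head_dim` (which requires reducing `num_heads`) or decreasing `head_dim` would affect the results?
7. Why do we usually apply Dropout to the Attention weights? Where else is Dropout typically needed? How is Dropout handled at inference time? How do you understand Dropout?
8. When initializing the q, k and v projections in Self-Attention, how should the bias be set, and why?
9. What other Attention variants do you know? What optimizations and improvements do they make over the vanilla implementation?
10. What do you see as the drawbacks and limitations of Attention?
11. How do you understand the "Deep" in Deep Learning? The current code has only one Attention layer; would stacking several improve the results?
12. What roles do Deep and Wide play in Deep Learning, and how should they be weighed when designing a model architecture?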
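One way to start exploring question 4 is to measure how large raw dot products become as the dimension grows. The sketch below is a small experiment, not a full answer. It assumes query and key entries are independent standard-normal values, which is an idealization, and the function name and sample count are chosen only for this example.

```python
import math
import random


def dot_product_variance(d_k, n_samples=4000, scaled=False, seed=0):
    """Empirical variance of q.k for q, k with i.i.d. standard-normal entries."""
    rng = random.Random(seed)
    values = []
    for _ in range(n_samples):
        dot = sum(rng.gauss(0, 1) * rng.gauss(0, 1) for _ in range(d_k))
        # Optionally apply the 1/sqrt(d_k) factor used in scaled attention.
        values.append(dot / math.sqrt(d_k) if scaled else dot)
    mean = sum(values) / n_samples
    return sum((v - mean) ** 2 for v in values) / n_samples


raw_var = dot_product_variance(64)
scaled_var = dot_product_variance(64, scaled=True)
print(f"unscaled variance: {raw_var:.1f}, scaled variance: {scaled_var:.2f}")
```

Under this assumption the unscaled variance grows roughly in proportion to `d_k`, while dividing by `sqrt(d_k)` keeps it near 1, so the softmax inputs stay in a range where the output is not dominated by a single position.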
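2.3.2 LLM
1. How do you understand tokenization? How many tokenization methods do you know, and how do they differ?
2. What properties should an ideal Tokenizer model have?
3. Tokenizers include some special tokens, such as start and end markers. What do you think they are for? Why can't the model simply learn the start and end markers on its own?
4. Why are LLMs all Decoder-Only?
5. What does RMSNorm do, and how does it differ from LayerNorm? Why not use LayerNorm?
6. Where do residual connections appear in an LLM? Why use them?
7. How do PreNormalization and PostNormalization affect the model? Why do current LLMs all use PreNormalization?
8. Why does the FFN first expand and then shrink the dimension, and what does each step do?
9. Why do LLMs need positional encoding? How many positional encoding schemes do you know?
10. Why has RoPE stood out among the many positional encodings? What are its main improvements?
11. If you were to design a positional encoding scheme, which factors would you consider?
12. Add some of the designs from the LLM part (such as RMSNorm) to the model from the Self-Attention part and see whether they improve the results.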
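As a starting point for questions 5 and 12, the sketch below contrasts the two normalizations on a single hidden vector using plain Python lists. It follows the usual definitions: LayerNorm subtracts the mean and divides by the standard deviation, RMSNorm only divides by the root mean square. The learned gain and bias of LayerNorm are omitted, and the `eps` value and function names are choices made for this example; a version for the model above would operate on tensors along the last dimension.

```python
import math


def layer_norm(x, eps=1e-6):
    """Subtract the mean, then divide by the standard deviation (gain and bias omitted)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def rms_norm(x, gain=None, eps=1e-6):
    """Divide by the root mean square only; no mean subtraction and no bias."""
    gain = gain if gain is not None else [1.0] * len(x)
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for g, v in zip(gain, x)]


hidden = [1.0, 2.0, 3.0, 4.0]
ln_out = layer_norm(hidden)
rms_out = rms_norm(hidden)
```

`ln_out` is centered at zero, while `rms_out` keeps the sign and relative offset of the input and only rescales it, which saves the mean computation.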
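3. Transformers-based diffusion: technical analysis + hands-on practice
3.1 Background on Transformers + diffusion
3.1.1 Background of Transformers diffusion
3.1.2 What is ViT: the Vision Transformer (ViT) model is essentially a Transformer applied to images.
ViT architecture:
Paper: https://arxiv.org/abs/2010.11929
Official repo (in JAX): https://github.com/google-research/vision_transformer
3.1.3 Using ViT in large language models (Qwen-VL as an example)
3.1.4 ViViT: ViT for video
3.1.5 Latte: Latent Diffusion Transformer for video generation
3.2 Hands-on code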
# Image preprocessing
import tensorflow as tf

def read_image(image_file="/mnt/workspace/image_1.png", scale=True, image_dim=336):
    # `grayscale=False` dropped: it was the default and is not accepted by newer Keras versions
    image = tf.keras.utils.load_img(
        image_file, color_mode='rgb', target_size=None,
        interpolation='nearest'
    )
    image_arr_orig = tf.keras.preprocessing.image.img_to_array(image)
    if scale:
        image_arr_orig = tf.image.resize(
            image_arr_orig, [image_dim, image_dim],
            method=tf.image.ResizeMethod.BILINEAR, preserve_aspect_ratio=False
        )
    image_arr = tf.image.crop_to_bounding_box(
        image_arr_orig, 0, 0, image_dim, image_dim
    )
    return image_arr
# Patching
def create_patches(image):
    im = tf.expand_dims(image, axis=0)
    patches = tf.image.extract_patches(
        images=im,
        sizes=[1, 32, 32, 1],
        strides=[1, 32, 32, 1],
        rates=[1, 1, 1, 1],
        padding="VALID"
    )
    patch_dims = patches.shape[-1]
    patches = tf.reshape(patches, [1, -1, patch_dims])
    return patches

image_arr = read_image()
patches = create_patches(image_arr)
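The shapes produced by the patching step can be worked out by hand. The helper below is a small illustrative function (its name and return tuple are not part of the original code) that mirrors the effect of `padding="VALID"` with a stride equal to the patch size: only whole patches are kept.

```python
def patch_grid(image_dim, patch_size, channels=3):
    """Token count and per-token length for non-overlapping square patches.

    Mirrors padding="VALID": pixels that do not fill a whole patch are dropped.
    """
    per_side = image_dim // patch_size
    num_patches = per_side * per_side
    patch_dim = patch_size * patch_size * channels
    covered = per_side * patch_size  # pixels per side actually used
    return num_patches, patch_dim, covered


# Settings used above: a 336 x 336 RGB image cut into 32 x 32 patches.
num_patches, patch_dim, covered = patch_grid(336, 32)
print(num_patches, patch_dim, covered)
```

With these settings the image yields a 10 x 10 grid of 100 patches, each flattened to 32 * 32 * 3 = 3072 values, and only the top-left 320 x 320 pixels are covered because 336 is not a multiple of 32.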
# Drawing
import numpy as np
import matplotlib.pyplot as plt

def render_image_and_patches(image, patches):
    plt.figure(figsize=(16, 16))
    plt.suptitle("Cropped Image", size=48)
    plt.imshow(tf.cast(image, tf.uint8))
    plt.axis("off")
    n = int(np.sqrt(patches.shape[1]))
    plt.figure(figsize=(16, 16))
    plt.suptitle("Image Patches", size=24)
    for i, patch in enumerate(patches[0]):
        ax = plt.subplot(n, n, i+1)
        patch_img = tf.reshape(patch, (32, 32, 3))
        ax.imshow(patch_img.numpy().astype("uint8"))
        ax.axis("off")

def render_flat(patches):
    plt.figure(figsize=(32, 2))
    plt.suptitle("Flattened Image Patches", size=24)
    n = int(np.sqrt(patches.shape[1]))
    for i, patch in enumerate(patches[0]):
        ax = plt.subplot(1, 101, i+1)
        patch_img = tf.reshape(patch, (32, 32, 3))
        ax.imshow(patch_img.numpy().astype("uint8"))
        ax.axis("off")
        if i == 100:
            break

render_image_and_patches(image_arr, patches)
render_flat(patches)
ViT best practice
# Load the model
from transformers import ViTForImageClassification
import torch
from modelscope import snapshot_download
model_dir = snapshot_download('AI-ModelScope/vit-base-patch16-224')
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = ViTForImageClassification.from_pretrained(model_dir)
model.to(device)
# Load an image
from PIL import Image
import requests
url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image = Image.open(requests.get(url, stream=True).raw)
image
# Standard image preprocessing
from transformers import ViTImageProcessor
processor = ViTImageProcessor.from_pretrained(model_dir)
inputs = processor(images=image, return_tensors="pt").to(device)
pixel_values = inputs.pixel_values
print(pixel_values.shape)
# Run inference without tracking gradients
with torch.no_grad():
    outputs = model(pixel_values)
    logits = outputs.logits
logits.shape
!git clone https://github.com/baofff/U-ViT
!pip install einops
!pip install --upgrade pip
import os
os.chdir('/mnt/workspace/U-ViT')
os.environ['PYTHONPATH'] = '/env/python:/content/U-ViT'
import torch
from dpm_solver_pp import NoiseScheduleVP, DPM_Solver
import libs.autoencoder
from libs.uvit import UViT
import einops
from torchvision.utils import save_image
from PIL import Image
from modelscope.hub.file_download import model_file_download
image_size = "256" #@param [256, 512]
image_size = int(image_size)
if image_size == 256:
    model_file_download(model_id='thu-ml/imagenet256_uvit_huge',file_path='imagenet256_uvit_huge.pth', cache_dir='/mnt/workspace')
    !mv /mnt/workspace/thu-ml/imagenet256_uvit_huge/imagenet256_uvit_huge.pth /mnt/workspace/U-ViT
else:
    model_file_download(model_id='thu-ml/imagenet512_uvit_huge',file_path='imagenet512_uvit_huge.pth', cache_dir='/mnt/workspace')
    !mv /mnt/workspace/thu-ml/imagenet512_uvit_huge/imagenet512_uvit_huge.pth /mnt/workspace/U-ViT
z_size = image_size // 8
patch_size = 2 if image_size == 256 else 4
device = 'cuda' if torch.cuda.is_available() else 'cpu'
nnet = UViT(img_size=z_size,
            patch_size=patch_size,
            in_chans=4,
            embed_dim=1152,
            depth=28,
            num_heads=16,
            num_classes=1001,
            conv=False)
nnet.to(device)
nnet.load_state_dict(torch.load(f'imagenet{image_size}_uvit_huge.pth', map_location='cpu'))
nnet.eval()
model_file_download(model_id='AI-ModelScope/autoencoder_kl_ema',file_path='autoencoder_kl_ema.pth', cache_dir='/mnt/workspace')
!mv /mnt/workspace/AI-ModelScope/autoencoder_kl_ema/autoencoder_kl_ema.pth /mnt/workspace/U-ViT
autoencoder = libs.autoencoder.get_model('autoencoder_kl_ema.pth')
autoencoder.to(device)
seed = 4321 #@param {type:"number"}
steps = 25 #@param {type:"slider", min:0, max:1000, step:1}
cfg_scale = 3 #@param {type:"slider", min:0, max:10, step:0.1}
class_labels = 207, 360, 387, 974, 88, 979, 417, 279 #@param {type:"raw"}
samples_per_row = 4 #@param {type:"number"}
torch.manual_seed(seed)
def stable_diffusion_beta_schedule(linear_start=0.00085, linear_end=0.0120, n_timestep=1000):
    _betas = (
        torch.linspace(linear_start ** 0.5, linear_end ** 0.5, n_timestep, dtype=torch.float64) ** 2
    )
    return _betas.numpy()

_betas = stable_diffusion_beta_schedule()  # set the noise schedule
noise_schedule = NoiseScheduleVP(schedule='discrete', betas=torch.tensor(_betas, device=device).float())
y = torch.tensor(class_labels, device=device)
y = einops.repeat(y, 'B -> (B N)', N=samples_per_row)

def model_fn(x, t_continuous):
    t = t_continuous * len(_betas)
    _cond = nnet(x, t, y=y)
    _uncond = nnet(x, t, y=torch.tensor([1000] * x.size(0), device=device))
    return _cond + cfg_scale * (_cond - _uncond)  # classifier free guidance

z_init = torch.randn(len(y), 4, z_size, z_size, device=device)
dpm_solver = DPM_Solver(model_fn, noise_schedule, predict_x0=True, thresholding=False)
with torch.no_grad():
    with torch.cuda.amp.autocast():  # inference with mixed precision
        z = dpm_solver.sample(z_init, steps=steps, eps=1. / len(_betas), T=1.)
        samples = autoencoder.decode(z)
samples = 0.5 * (samples + 1.)
samples.clamp_(0., 1.)
save_image(samples, "sample.png", nrow=samples_per_row * 2, padding=0)
samples = Image.open("sample.png")
display(samples)
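The `model_fn` above combines two network outputs per step: one conditioned on the requested class labels and one conditioned on label 1000, which the code passes for the unconditional branch. The toy sketch below applies the same combination rule to plain lists so the effect of `cfg_scale` is easy to see. The numbers are made up for illustration and `guided_prediction` is a hypothetical helper, not part of U-ViT.

```python
def guided_prediction(cond, uncond, cfg_scale):
    """Combine conditional and unconditional predictions element-wise.

    Follows the form used in `model_fn`: cond + cfg_scale * (cond - uncond).
    """
    return [c + cfg_scale * (c - u) for c, u in zip(cond, uncond)]


cond = [0.5, -1.0, 2.0]    # stand-in for the class-conditional output
uncond = [0.0, -0.5, 1.0]  # stand-in for the unconditional output

no_guidance = guided_prediction(cond, uncond, cfg_scale=0.0)
guided = guided_prediction(cond, uncond, cfg_scale=3.0)
```

With `cfg_scale=0.0` the result is just the conditional prediction; larger values push the result further along the direction from the unconditional to the conditional output.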
!pip install ipywidgets
!pip install -qq medmnist
import os
import io
import imageio
import medmnist
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
# setting seed for reproducibility
SEED = 42
os.environ["TF_CUDNN_DETERMINISTIC"] = "1"
keras.utils.set_random_seed(SEED)
# DATA
DATASET_NAME = "organmnist3d"
BATCH_SIZE = 32
AUTO = tf.data.AUTOTUNE
INPUT_SHAPE = (28, 28, 28, 1)
NUM_CLASSES = 11
# OPTIMIZER
LEARNING_RATE = 1e-4
WEIGHT_DECAY = 1e-5
# TRAINING
EPOCHS = 60
# TUBELET EMBEDDING
PATCH_SIZE = (8, 8, 8)
NUM_PATCHES = (INPUT_SHAPE[0] // PATCH_SIZE[0]) ** 2
# ViViT ARCHITECTURE
LAYER_NORM_EPS = 1e-6
PROJECTION_DIM = 128
NUM_HEADS = 8
NUM_LAYERS = 8
!wget https://modelscope.oss-cn-beijing.aliyuncs.com/resource/organmnist3d.npz
def download_and_prepare_dataset(data_info: dict):
    """
    Utility function to download the dataset and return train/valid/test
    videos and labels.
    Arguments:
        data_info (dict): Dataset metadata
    """
    data_path = "/mnt/workspace/organmnist3d.npz"
    with np.load(data_path) as data:
        # Get videos
        train_videos = data["train_images"]
        valid_videos = data["val_images"]
        test_videos = data["test_images"]
        # Get labels
        train_labels = data["train_labels"].flatten()
        valid_labels = data["val_labels"].flatten()
        test_labels = data["test_labels"].flatten()
    return (
        (train_videos, train_labels),
        (valid_videos, valid_labels),
        (test_videos, test_labels),
    )
# Get the metadata of the dataset
info = medmnist.INFO[DATASET_NAME]
# Get the dataset
prepared_dataset = download_and_prepare_dataset(info)
(train_videos, train_labels) = prepared_dataset[0]
(valid_videos, valid_labels) = prepared_dataset[1]
(test_videos, test_labels) = prepared_dataset[2]
@tf.function
def preprocess(frames: tf.Tensor, label: tf.Tensor):
    """Preprocess the frames tensors and parse the labels"""
    # Preprocess images
    frames = tf.image.convert_image_dtype(
        frames[
            ..., tf.newaxis
        ],  # The new axis is to help for further processing with Conv3D layers
        tf.float32,
    )
    # Parse label
    label = tf.cast(label, tf.float32)
    return frames, label

def prepare_dataloader(
    videos: np.ndarray,
    labels: np.ndarray,
    loader_type: str = "train",
    batch_size: int = BATCH_SIZE,
):
    """Utility function to prepare dataloader"""
    dataset = tf.data.Dataset.from_tensor_slices((videos, labels))
    if loader_type == "train":
        dataset = dataset.shuffle(BATCH_SIZE * 2)
    dataloader = (
        dataset.map(preprocess, num_parallel_calls=tf.data.AUTOTUNE)
        .batch(batch_size)
        .prefetch(tf.data.AUTOTUNE)
    )
    return dataloader
trainloader = prepare_dataloader(train_videos, train_labels, "train")
validloader = prepare_dataloader(valid_videos, valid_labels, "valid")
testloader = prepare_dataloader(test_videos, test_labels, "test")
class TubeletEmbedding(layers.Layer):
    def __init__(self, embed_dim, patch_size, **kwargs):
        super().__init__(**kwargs)
        self.projection = layers.Conv3D(
            filters=embed_dim,
            kernel_size=patch_size,
            strides=patch_size,
            padding="VALID",
        )
        self.flatten = layers.Reshape(target_shape=(-1, embed_dim))

    def call(self, videos):
        projected_patches = self.projection(videos)
        flattened_patches = self.flatten(projected_patches)
        return flattened_patches
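The number of tokens that `TubeletEmbedding` produces follows from the Conv3D settings: with the kernel size equal to the stride and `"VALID"` padding, each spatial axis contributes the integer quotient of its length by the patch length. The helper below is a small illustrative function (its name is not part of the original code) that does this arithmetic for the `INPUT_SHAPE` and `PATCH_SIZE` values configured earlier.

```python
def tubelet_tokens(input_shape, patch_size):
    """Number of tokens from a Conv3D with kernel == stride and "VALID" padding.

    input_shape: (depth, height, width, channels); patch_size: (pd, ph, pw).
    """
    per_axis = [dim // p for dim, p in zip(input_shape[:3], patch_size)]
    total = 1
    for n in per_axis:
        total *= n
    return per_axis, total


# Values taken from the configuration above.
per_axis, num_tokens = tubelet_tokens((28, 28, 28, 1), (8, 8, 8))
print(per_axis, num_tokens)
```

For a 28 x 28 x 28 volume and 8 x 8 x 8 tubelets this gives 3 tubelets per axis and 27 tokens in total, each projected to `embed_dim` channels. Note that this differs from the `NUM_PATCHES` constant, which evaluates to (28 // 8) ** 2 = 9 and is not used by the layers shown here.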
class PositionalEncoder(layers.Layer):
    def __init__(self, embed_dim, **kwargs):
        super().__init__(**kwargs)
        self.embed_dim = embed_dim

    def build(self, input_shape):
        _, num_tokens, _ = input_shape
        self.position_embedding = layers.Embedding(
            input_dim=num_tokens, output_dim=self.embed_dim
        )
        self.positions = tf.range(start=0, limit=num_tokens, delta=1)

    def call(self, encoded_tokens):
        # Encode the positions and add it to the encoded tokens
        encoded_positions = self.position_embedding(self.positions)
        encoded_tokens = encoded_tokens + encoded_positions
        return encoded_tokens
!git clone https://github.com/maxin-cn/Latte.git
%cd Latte
!pip install timm
!pip install einops
!pip install omegaconf
!pip install diffusers==0.24.0
%cd models
!git lfs install
!git clone https://www.modelscope.cn/AI-ModelScope/Latte.git
%env CUDA_VISIBLE_DEVICES=0
%env PYTHONPATH=../
!python sample/sample_t2v.py --config configs/t2v/t2v_sample.yaml
- Sound generation: TTS technical analysis and hands-on practice
4.1 Model training tutorial
5. Links to detailed study materials