Lesson 10. Image Style Transfer and GANs

tzc_fly

Column: 白景屹's PyTorch Notebook. Tag: deep learning

Article link: https://blog.csdn.net/qq_40943760/article/details/111781856

Neural Style Transfer

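How Neural Style Transfer Works

Image style transfer combines the content of one image with the style of another to produce a new image whose content is close to the first image and whose style is close to the second: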
fig1

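An early paper, A Neural Algorithm of Artistic Style, implemented image style transfer. It uses VGG as a feature extractor and recombines the extracted features to represent the content and the style of an image separately. Let the two input images be content and style and the output image be target; the goal is to make the content of target close to content and its style close to style.
The feature extractor is a VGG19 trained on ImageNet, with the following network structure: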

VGG(
  (features): Sequential(
    (0): Conv2d(3, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (1): ReLU(inplace)
    (2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (3): ReLU(inplace)
    (4): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (5): Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (6): ReLU(inplace)
    (7): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (8): ReLU(inplace)
    (9): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (10): Conv2d(128, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (11): ReLU(inplace)
    (12): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (13): ReLU(inplace)
    (14): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (15): ReLU(inplace)
    (16): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (17): ReLU(inplace)
    (18): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (19): Conv2d(256, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (20): ReLU(inplace)
    (21): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (22): ReLU(inplace)
    (23): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (24): ReLU(inplace)
    (25): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (26): ReLU(inplace)
    (27): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (28): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (29): ReLU(inplace)
    (30): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (31): ReLU(inplace)
    (32): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (33): ReLU(inplace)
    (34): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (35): ReLU(inplace)
    (36): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  )
  (avgpool): AdaptiveAvgPool2d(output_size=(7, 7))
  (classifier): Sequential(
    (0): Linear(in_features=25088, out_features=4096, bias=True)
    (1): ReLU(inplace)
    (2): Dropout(p=0.5)
    (3): Linear(in_features=4096, out_features=4096, bias=True)
    (4): ReLU(inplace)
    (5): Dropout(p=0.5)
    (6): Linear(in_features=4096, out_features=1000, bias=True)
  )
)

Following the paper, the output tensors of layers 0, 5, 10, 19, and 28 of vgg19.features are used, giving five levels of features. The inputs are two images, content of shape $(c, h, w)$ and style of shape $(c, h, w)$, plus the implicit input image target of shape $(c, h, w)$, which is usually initialized as a copy of content. All three images have the same size. Each is passed through VGG, and the output tensors of layers 0, 5, 10, 19, and 28 are collected:

$f_{content}$: vgg(content), the content features, five levels in total;
$f_{style}$: vgg(style), the style features, five levels in total;
$f_{target}$: vgg(target), the target features, five levels in total;

The content loss is computed from these five levels of features:
$contentloss=\sum_{l=1}^{5}mean[(f_{content,l}-f_{target,l})^{2}]$
The style loss first needs a way to represent style. The paper uses the Gram matrix. For example, take the layer-$l$ feature $f_{style,l}$ from $f_{style}$, of shape $(1, channel, height, width)$, and reshape it to $(channel,height\times width)$. The style of that layer, its Gram matrix, is then:
$gram_{style,l}=f_{style,l}\cdot f_{style,l}^{T}$
Clearly $gram_{style,l}$ has shape $(channel, channel)$. With style expressed through Gram matrices, the style loss is:
$styleloss=\sum_{l=1}^{5}\frac{mean[(gram_{style,l}-gram_{target,l})^{2}]}{channel_{l}\cdot height_{l}\cdot width_{l}}$
The overall loss is the sum of the content loss and the style loss, with a weight added to balance their magnitudes:
$loss = contentloss + 100 \cdot styleloss$
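The loss for a single layer can be worked through by hand on a tiny feature map. The sketch below is plain Python rather than PyTorch, and the feature values and helper names (`gram`, `mean_sq_diff`) are made up for illustration; it only mirrors the formulas above for one layer with channel=2 and height×width=4.

```python
def gram(feature):
    """Gram matrix F @ F^T for a feature stored as `channel` rows of length height*width."""
    return [[sum(a * b for a, b in zip(row_i, row_j)) for row_j in feature]
            for row_i in feature]

def mean_sq_diff(a, b):
    """Mean of the squared element-wise differences of two equally shaped 2-D lists."""
    diffs = [(x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)]
    return sum(diffs) / len(diffs)

# Hypothetical single-layer features, already reshaped to (channel, height*width)
f_content = [[1.0, 2.0, 3.0, 4.0], [0.0, 1.0, 0.0, 1.0]]
f_style = [[1.0, 0.0, 1.0, 0.0], [2.0, 2.0, 2.0, 2.0]]
f_target = [row[:] for row in f_content]  # target starts as a copy of content

channel, hw = len(f_target), len(f_target[0])
content_term = mean_sq_diff(f_content, f_target)
style_term = mean_sq_diff(gram(f_style), gram(f_target)) / (channel * hw)
loss = content_term + 100 * style_term

print(content_term, style_term, loss)
```

Because target starts as a copy of content, the content term is zero at the first step and the whole loss comes from the style term, which is what drives the first updates.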

Gram matrix
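The inner product of two vectors reflects how similar they are, and a Gram matrix is made up of the pairwise inner products of a set of vectors, so it captures the relationships among those vectors:
fig2
A layer's feature is reshaped to $(channel,height\times width)$, which is the tensor on the left of the figure above. As mentioned in Lesson 7, a convolutional network is a local feature extractor, the number of channels of an output tensor equals the number of filters in that CNN layer, and each channel of the output represents one kind of feature. The Gram matrix therefore reflects the relationships among the features. From the perspective of style, it is precisely style that ties these features together.
The Gram matrix thus expresses the style information hidden in the features.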
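One property that helps explain why the Gram matrix describes style rather than content: it sums over spatial positions, so rearranging the positions (identically for every channel) leaves it unchanged. A minimal plain-Python check, with made-up feature values and a fixed, arbitrarily chosen permutation:

```python
def gram(feature):
    """Gram matrix F @ F^T for a feature stored as `channel` rows of length height*width."""
    return [[sum(a * b for a, b in zip(row_i, row_j)) for row_j in feature]
            for row_i in feature]

# Hypothetical feature with channel=2 and height*width=4
feature = [[1.0, 2.0, 3.0, 4.0], [0.0, 1.0, 0.0, 1.0]]

# Move every spatial position to a new place, the same way in each channel
positions = [2, 0, 3, 1]
shuffled = [[row[p] for p in positions] for row in feature]

layout_changed = shuffled != feature
same_gram = gram(shuffled) == gram(feature)
print(layout_changed, same_gram)
```

The spatial layout, which carries the content, changes, while the channel-to-channel statistics in the Gram matrix stay the same.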

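Preparation

First import the required packages and modules. from __future__ import division enables true division (in Python 2, / on integers is floor division, and importing division turns it into true division). In Python 3 it can be omitted; see the author's Python notes for details: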

from __future__ import division
from torchvision import models
from torchvision import transforms
from PIL import Image
import argparse
import torch
import torchvision
import torch.nn as nn

import numpy as np
import matplotlib.pyplot as plt

device=torch.device("cuda" if torch.cuda.is_available() else "cpu")

Define the image-loading function load_image:

# Load an image
def load_image(image_path,transform=None,max_size=None,shape=None):
    # Read the image and convert it to 3 channels
    image=Image.open(image_path).convert('RGB')
    if max_size:
        scale=max_size/max(image.size)
        size=np.array(image.size)*scale
        # ndarray.astype(T): copy the ndarray and cast it to type T
        # Image.LANCZOS: high-quality resampling filter; the old alias
        # Image.ANTIALIAS was removed in Pillow 10
        image=image.resize(tuple(size.astype(int).tolist()),Image.LANCZOS)

    if shape:
        # Image.LANCZOS: an interpolation method
        image=image.resize(shape,Image.LANCZOS)

    if transform:
        # Apply the transform
        image=transform(image)
        # Add the batch dimension
        image=image.unsqueeze(dim=0)

    return image.to(device)

transform=transforms.Compose([
    transforms.ToTensor(),
    # VGG was pretrained on ImageNet, so normalize with the ImageNet statistics;
    # this keeps the inputs in the distribution the network was trained on
    transforms.Normalize(mean=[0.485,0.456,0.406],
                         std=[0.229,0.224,0.225])
])


content=load_image("content.png",
                   transform=transform,
                   max_size=400)
print(content.size())

# Note how the shape argument of image.resize maps to the tensor size:
# PIL expects (width, height), while the tensor size is (n,c,h,w)
style=load_image("style.png",
                 transform=transform,
                 shape=[content.size(3),content.size(2)])
print(style.size())

"""
torch.Size([1, 3, 301, 400])
torch.Size([1, 3, 301, 400])
"""

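The relationship between max_size, the PIL size and the printed tensor size can be checked with a little arithmetic. The sketch below is plain Python; `scaled_size` is a hypothetical helper that mirrors the scaling in load_image, and the 800x602 source size is an assumption chosen for illustration, since the real size of content.png is not given.

```python
def scaled_size(width, height, max_size):
    """Scale (width, height) by max_size / longer side, truncating like astype(int)."""
    scale = max_size / max(width, height)
    return int(width * scale), int(height * scale)

# Hypothetical source image of 800x602 pixels
new_w, new_h = scaled_size(800, 602, 400)

# PIL sizes are (width, height); the batched tensor is (N, C, H, W)
tensor_shape = (1, 3, new_h, new_w)
print((new_w, new_h), tensor_shape)
```

This is why style is resized with shape=[content.size(3),content.size(2)]: index 3 of the tensor size is the width and index 2 is the height.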
In addition, define a tensor visualization function imshow:

unloader=transforms.ToPILImage()

def imshow(tensor:"(N,C,H,W)",title=None):
    image=tensor.cpu().clone()
    # Remove the batch dimension
    image=image.squeeze(dim=0)
    image=unloader(image)
    
    plt.figure()
    plt.imshow(image)
    if title is not None:
        plt.title(title)
    plt.show()

imshow(content,title="content")
imshow(style,title="style")

fig3

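Defining the Model and Loading Pretrained Parameters

In the Neural Style Transfer task the network does not need to be trained; it is only a feature extractor, so it is enough to load the ImageNet-pretrained vgg19 parameters. The model is defined as follows, and all it does is collect the outputs of the five specified layers: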

class VGGnet(nn.Module):
    def __init__(self,model_state_path=None):
        super().__init__()
        
        vgg19=models.vgg19(pretrained=False)
        # Models and their parameter files are listed in the torchvision source:
        # https://github.com/pytorch/vision/tree/master/torchvision/models
        if model_state_path:
            vgg19.load_state_dict(torch.load(model_state_path))
            
        # Following the paper, layers 0, 5, 10, 19, 28 of vgg19.features are used
        self.select=['0','5','10','19','28']
        self.vgg=vgg19.features
        
    def forward(self,x):
        features=[]
        
        # _modules returns an ordered dict of the submodules
        for name,layer in self.vgg._modules.items():
            x=layer(x)
            if name in self.select:
                features.append(x)
            
        return features

The pretrained torchvision models and their parameter files can be found in torchvision/models:

model_urls = {
    'vgg11': 'https://download.pytorch.org/models/vgg11-bbd30ac9.pth',
    'vgg13': 'https://download.pytorch.org/models/vgg13-c768596a.pth',
    'vgg16': 'https://download.pytorch.org/models/vgg16-397923af.pth',
    'vgg19': 'https://download.pytorch.org/models/vgg19-dcbb9e9d.pth',
    'vgg11_bn': 'https://download.pytorch.org/models/vgg11_bn-6002323d.pth',
    'vgg13_bn': 'https://download.pytorch.org/models/vgg13_bn-abd245e5.pth',
    'vgg16_bn': 'https://download.pytorch.org/models/vgg16_bn-6c64b313.pth',
    'vgg19_bn': 'https://download.pytorch.org/models/vgg19_bn-c79401a0.pth',
}

Instantiate the model and load the pretrained parameters:

model=VGGnet(
    model_state_path="./vgg19-dcbb9e9d.pth"
    ).to(device)

Extract features from the content image as follows:

model.eval()
features=model.forward(content)

Print the five collected levels of features to inspect them:

for feat in features:
    print(feat.size())

"""
torch.Size([1, 64, 301, 400])
torch.Size([1, 128, 150, 200])
torch.Size([1, 256, 75, 100])
torch.Size([1, 512, 37, 50])
torch.Size([1, 512, 18, 25])
"""

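The printed shapes follow directly from the architecture listed earlier: each 3x3 convolution with padding 1 and stride 1 keeps the spatial size, and exactly one MaxPool2d(kernel_size=2, stride=2) with ceil_mode=False sits between consecutive selected layers, halving height and width with floor division. A plain-Python sketch (`vgg19_selected_shapes` is a hypothetical helper, not part of torchvision):

```python
def vgg19_selected_shapes(height, width):
    """Output shapes (N, C, H, W) of vgg19.features layers 0, 5, 10, 19, 28 for batch size 1."""
    channels = [64, 128, 256, 512, 512]
    shapes = []
    for c in channels:
        shapes.append((1, c, height, width))
        # One 2x2 max pool with stride 2 comes before the next selected layer
        height, width = height // 2, width // 2
    return shapes

shapes = vgg19_selected_shapes(301, 400)
for s in shapes:
    print(s)
```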
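Optimizing target and Visualizing the Result

target is the image after style transfer. Note that the paper's approach differs from ordinary model training: the ImageNet-trained VGG serves only as a feature extractor, and what is actually optimized is the target tensor, so target must be set to track gradients: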

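# content does not require grad, so content.clone() is a new leaf tensor;
# requires_grad_(True) then makes it a tensor the optimizer can update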
target=content.clone().requires_grad_(True)

Use Adam as the optimizer, with target as the tensor to update:

optimizer=torch.optim.Adam([target],
                           lr=0.003,
                           betas=[0.5,0.999])
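The idea of updating the input while the network stays fixed can be seen in a one-dimensional toy problem. The sketch below is an illustration only: it uses plain gradient descent instead of Adam, and `feature` is a made-up stand-in for the frozen VGG, not anything from the original code.

```python
def feature(x):
    """Stand-in for a frozen feature extractor: a fixed function with nothing to train."""
    return 3.0 * x

target_feature = 6.0  # hypothetical feature value the input should reproduce
x = 0.0               # plays the role of the image being optimized
lr = 0.05

for _ in range(100):
    # loss = (feature(x) - target_feature)^2
    # d loss / d x = 2 * (feature(x) - target_feature) * 3
    grad = 2.0 * (feature(x) - target_feature) * 3.0
    x -= lr * grad

print(x, feature(x))
```

Only x changes during the loop; the function producing the features is never modified, which is the same division of roles as between target and VGG.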

Optimize target:

# Optimize target
num_step=2000
for step in range(num_step):
    """
    A forward pass on each image returns a list of features
    torch.Size([1, 64, 301, 400])
    torch.Size([1, 128, 150, 200])
    torch.Size([1, 256, 75, 100])
    torch.Size([1, 512, 37, 50])
    torch.Size([1, 512, 18, 25])
    """
    target_features=model.forward(target)
    content_features=model.forward(content)
    style_features=model.forward(style)
    
    # loss = content loss + style loss
    style_loss=0
    content_loss=0
    
    for f1,f2,f3 in zip(target_features,content_features,style_features):
        # f1, f2, f3 of each layer all have shape (n,c,h,w), with n=1
        content_loss+=torch.mean((f1-f2)**2) # a number
             
        # batch_size is 1 in this experiment, so n is 1
        n,c,h,w = f1.size()
        f1 = f1.view(c, h*w)
        f3 = f3.view(c, h*w)
        
        # Compute the Gram matrices
        # torch.mm is strict 2-D matrix multiplication
        f1 = torch.mm(f1, f1.t()) # (c,c)
        f3 = torch.mm(f3, f3.t()) # (c,c)

        # Style loss from the Gram matrices of target and style
        style_loss+=torch.mean((f1-f3)**2)/(c*h*w)
        
    # Weight the style loss to balance the magnitudes
    loss=content_loss+style_loss*100.
    
    # Zero the gradients
    optimizer.zero_grad()
    
    # Backpropagate to compute the gradients
    loss.backward()
    
    # Update the target tensor
    optimizer.step()
    
    if step % 10 == 0:
        print("Step [{}/{}], Content Loss: {:.4f}, Style Loss: {:.4f}".format(step,
                                                                              num_step,
                                                                              content_loss.item(),
                                                                              style_loss.item()))

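After optimization, target is still in the normalized distribution (note that content and style were normalized with the same parameters), so it must be denormalized. For how the denormalization parameters are computed, see the section on normalization and denormalization in Lesson 9; the resulting parameters are mean [-2.12, -2.04, -1.80] and standard deviation [4.37, 4.46, 4.44].
Visualize target before and after denormalization: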

# Undo Normalize
denorm=transforms.Normalize([-2.12, -2.04, -1.80],
                            [4.37, 4.46, 4.44])

# Detach from the graph and remove the batch dimension
img=target.detach().clone().squeeze()
# Visualize target before denormalization
imshow(img)

# ToPILImage expects tensor values between 0 and 1
img=denorm(img).clamp_(0,1)

imshow(img,title="target")
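The denormalization parameters can be derived from the ImageNet statistics used earlier. Normalize computes y = (x - mean) / std, so x = y * std + mean, which is again a Normalize with mean' = -mean/std and std' = 1/std. A plain-Python check follows; the pixel values in `x` are made up for the round trip.

```python
mean = [0.485, 0.456, 0.406]
std = [0.229, 0.224, 0.225]

# Parameters of the inverse transform, rounded to two decimals as in the text
denorm_mean = [round(-m / s, 2) for m, s in zip(mean, std)]
denorm_std = [round(1 / s, 2) for s in std]
print(denorm_mean, denorm_std)

# Round trip on one hypothetical pixel value per channel, with unrounded parameters
x = [0.2, 0.5, 0.9]
y = [(xi - m) / s for xi, m, s in zip(x, mean, std)]
x_back = [(yi - (-m / s)) / (1 / s) for yi, m, s in zip(y, mean, std)]
print(x_back)
```

Since the parameters are rounded to two decimals, the recovered image is only approximately the original, which is one more reason the result is clamped to the 0 to 1 range before display.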

fig4

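Generative Adversarial Networks (GAN)

How GANs Work

GAN stands for Generative Adversarial Network. A GAN consists of two parts, a Generator and a Discriminator:

Generator: maps points of a latent space to generated data and tries to make the distribution of the generated data close to that of the real data;
Discriminator: a classifier that distinguishes real data from the fake data produced by the Generator;

Training a GAN does not require labeling the dataset beforehand, so it is a form of unsupervised learning: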
fig5

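When the loss is computed with binary cross entropy $BCELoss(x, y)$, the loss for a sample $i$ is:
$J_{i}=-[y_{i}\log x_{i}+(1-y_{i})\log(1-x_{i})]$
where $y_{i}$ is the true label of the sample (0 or 1) and $x_{i}$ is the probability output by the network;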
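A quick numeric check of the formula, in plain Python with made-up probabilities (`bce` is a hypothetical helper for a single sample; nn.BCELoss additionally averages over the batch by default):

```python
import math

def bce(x, y):
    """Binary cross entropy for one sample: x is the predicted probability, y the 0/1 label."""
    return -(y * math.log(x) + (1 - y) * math.log(1 - x))

confident_right = bce(0.9, 1)  # equals -log(0.9)
confident_wrong = bce(0.9, 0)  # equals -log(0.1)
print(confident_right, confident_wrong)
```

A confident correct prediction gives a small loss and a confident wrong one a much larger loss, which is the signal both networks are trained on.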

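A GAN is trained as follows, assuming binary cross entropy is used for the loss:

First, sample a random latent vector and use the generator $G$ to produce fake data $fakeimages$, with labels $fakelabels$ set to 0 to mark them as fake. Sample real data $images$, with labels $reallabels$ set to 1 to mark them as real. The discriminator $D$ (a binary classifier) runs a forward pass on both:
$fakeoutputs = D(fakeimages)$
$realoutputs = D(images)$
The loss is:
$loss_{d}=BCELoss(fakeoutputs,fakelabels)+BCELoss(realoutputs,reallabels)$
Compute gradients from $loss_{d}$ and update the discriminator $D$;
After the discriminator has been trained, pass the same generated data $fakeimages$ to the discriminator again for a forward pass:
$goutputs = D(fakeimages)$
Unlike when training the discriminator, the goal now is for the generator to learn to produce data realistic enough to fool the discriminator, so the loss is:
$loss_{g}=BCELoss(goutputs,reallabels)$
Compute gradients from $loss_{g}$ and update the generator $G$;
Repeat the steps above;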
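The two losses differ only in which labels the fake outputs are compared against. The plain-Python sketch below uses made-up discriminator outputs and, as a simplification, reuses the same fake outputs for both losses (in real training the discriminator has already been updated before the second forward pass); `bce_mean` is a hypothetical helper.

```python
import math

def bce_mean(outputs, labels):
    """Mean binary cross entropy over a batch of probabilities and 0/1 labels."""
    losses = [-(y * math.log(x) + (1 - y) * math.log(1 - x))
              for x, y in zip(outputs, labels)]
    return sum(losses) / len(losses)

fake_outputs = [0.2, 0.1]  # hypothetical D(fake_images): D mostly spots the fakes
real_outputs = [0.9, 0.8]  # hypothetical D(images)
fake_labels = [0, 0]
real_labels = [1, 1]

loss_d = bce_mean(fake_outputs, fake_labels) + bce_mean(real_outputs, real_labels)
loss_g = bce_mean(fake_outputs, real_labels)  # same outputs, labels flipped to "real"
print(loss_d, loss_g)
```

With these numbers the discriminator is doing well, so its loss is small, and the generator's loss is large because its samples are rarely taken for real.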

Generating MNIST with a GAN

Preparation

Import the required packages and modules:

import torch
import torch.nn as nn

import torchvision
from torchvision import transforms

import matplotlib.pyplot as plt
import numpy as np

device=torch.device("cuda" if torch.cuda.is_available() else "cpu")

Load the MNIST dataset:

# Load MNIST
batch_size=32
transform=transforms.Compose([
    transforms.ToTensor(),
])

# mnistdata[i] holds two items: the image itself and its class label
mnistdata=torchvision.datasets.MNIST("./DataSet",train=True,transform=transform)

Wrap the dataset in a DataLoader:

dataloader=torch.utils.data.DataLoader(dataset=mnistdata,
                                      batch_size=batch_size,
                                      shuffle=True)

Fetch one batch from the dataloader and visualize one image from it:

batch=next(iter(dataloader)) # batch[0] holds the images, batch[1] holds the labels

batch_images=batch[0] # [batch_size,1,28,28]
print(batch_images.size())
# torch.Size([32, 1, 28, 28])

image=batch_images[6]

unloader=transforms.ToPILImage()
image=unloader(image)    
plt.figure()
# imshow accepts an array or a PIL image; a tensor is (c,h,w), a PIL image as an array is (h,w,c)
# When X in imshow(X) has shape (M,N), it is drawn as a colored heat map
plt.imshow(image)    
plt.show()

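Note that plt.imshow() accepts an array or a PIL image; when X in imshow(X) has shape $(M, N)$, it is drawn as a colored heat map:
fig6

Shape difference between a tensor and a PIL image
A tensor is $(c, h, w)$, while a PIL image viewed as an array is $(h, w, c)$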

Model Definition

For simple data such as MNIST, a simple fully connected network is enough. The discriminator is defined as follows:

image_size=28*28
hidden_size=256

D=nn.Sequential(
    nn.Linear(image_size,hidden_size),
    nn.LeakyReLU(0.2),
    nn.Linear(hidden_size,hidden_size),
    nn.LeakyReLU(0.2),
    # Binary classification
    nn.Linear(hidden_size,1),
    nn.Sigmoid()
).to(device)

Note that the discriminator uses LeakyReLU as its nonlinearity:
$LeakyReLU(x)=max(0,x)+slope\times min(0,x)$
In PyTorch the module is:

torch.nn.LeakyReLU(negative_slope=0.01, inplace=False)

With inplace=True, LeakyReLU(X) modifies the tensor X itself, that is, it is an in-place operation;
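The formula can be checked on a few values. This plain-Python `leaky_relu` is a hypothetical scalar helper that mirrors the formula above, evaluated with the slope of 0.2 used in the discriminator:

```python
def leaky_relu(x, negative_slope=0.01):
    """Scalar LeakyReLU: positive inputs pass through, negative inputs are scaled by the slope."""
    return max(0.0, x) + negative_slope * min(0.0, x)

values = [leaky_relu(v, negative_slope=0.2) for v in (-2.0, 0.0, 3.0)]
print(values)
```

Unlike ReLU, negative inputs still produce a small nonzero output, so they also receive a nonzero gradient.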

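For the generator, note that real samples take values between 0 and 1 after ToTensor, so the generated samples also need to be mapped into 0 to 1 with a sigmoid:

# Generator: starts from the latent space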
latent_size=64

G=nn.Sequential(
    nn.Linear(latent_size,hidden_size),
    nn.ReLU(),
    nn.Linear(hidden_size,hidden_size),
    nn.ReLU(),
    nn.Linear(hidden_size,image_size),
    # Real samples lie in [0,1] after ToTensor, so the generator output is also mapped to [0,1] with a sigmoid
    nn.Sigmoid()
).to(device)
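To get a sense of how small these two networks are, the parameter counts can be worked out from the layer sizes. The sketch is plain arithmetic; `linear_params` is a hypothetical helper that counts the weights plus biases of one nn.Linear layer.

```python
def linear_params(in_features, out_features):
    """Number of parameters of a fully connected layer: weight matrix plus bias vector."""
    return in_features * out_features + out_features

image_size, hidden_size, latent_size = 28 * 28, 256, 64

d_params = (linear_params(image_size, hidden_size)
            + linear_params(hidden_size, hidden_size)
            + linear_params(hidden_size, 1))
g_params = (linear_params(latent_size, hidden_size)
            + linear_params(hidden_size, hidden_size)
            + linear_params(hidden_size, image_size))
print(d_params, g_params)
```

Both networks have well under a million parameters, and the activation layers add none.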

Training and Visualization

Use binary cross entropy for the loss and Adam as the optimizer:

# Binary Cross Entropy Loss
loss_fn=nn.BCELoss()

d_optimizer=torch.optim.Adam(D.parameters(),lr=0.0002)
g_optimizer=torch.optim.Adam(G.parameters(),lr=0.0002)

Define a visualization function for the data produced by the generator. With latent vector $z$, the generated data is $G(z)$:

# Visualize G(z)
def gzimshow(Gen:"Generator model",title=None):
    latent_space = torch.randn(1, latent_size).to(device)
    image = Gen(latent_space).view(1, 28, 28).cpu()
    image=unloader(image)
    
    plt.figure()
    plt.imshow(image)
    if title is not None:
        plt.title(title)
    plt.show()

Train the discriminator and the generator:

# Training
total_steps=len(dataloader)
num_epochs=100

for epoch in range(num_epochs):
    for i,( images, _ ) in enumerate(dataloader):
        images=images.view(batch_size,image_size).to(device)
        
        real_labels=torch.ones(batch_size,1).to(device)
        fake_labels=torch.zeros(batch_size,1).to(device)
        
        real_outputs=D(images)
        
        # Loss on the real images
        d_loss_real=loss_fn(real_outputs,real_labels)
        
        # Generate fake images
        # latent space
        z=torch.randn(batch_size,latent_size).to(device)
        fake_images=G(z)
        
        # Detach G's output from the graph so that training D does not involve G's gradients
        fake_outputs=D(fake_images.detach())
        
        d_loss_fake=loss_fn(fake_outputs,fake_labels)
        
        d_loss=d_loss_real+d_loss_fake
        
        # Update D
        d_optimizer.zero_grad()
        d_loss.backward()
        d_optimizer.step()
        
        # Update G
        # Do not detach here, because G's gradients must be tracked
        g_outputs=D(fake_images)
        g_loss=loss_fn(g_outputs,real_labels)
        
        # D is still part of the graph, so its gradients are zeroed as well
        d_optimizer.zero_grad()
        g_optimizer.zero_grad()
        g_loss.backward()
        g_optimizer.step()
        
        # Generally, a smaller G loss means G is fooling D more often
        if i%200==0:
            print("Epoch[{}/{}],Step[{}/{}],D_loss:{},G_loss:{},D(x):{},D(G(z)):{}".format(
                epoch,num_epochs,i,total_steps,
                d_loss.item(),
                g_loss.item(),
                real_outputs.mean().item(),
                fake_outputs.mean().item()
            ))    
            
        if epoch%20==0 and i%total_steps==0:
            gzimshow(G,"Epoch[{}/{}],Step[{}/{}]".format(epoch,num_epochs,i,total_steps))