Week 10 Report: Hands-On Deep Learning (Part 5)

KeepThinking！

Tags: deep learning, artificial intelligence

Article link: https://blog.csdn.net/qq_52684249/article/details/141755256

Abstract

This week I finished the PyTorch fundamentals of deep learning by following Li Mu's course. I studied the basic ideas behind region-based convolutional neural networks (R-CNN) and the successive refinements that improve their detection accuracy and efficiency; learned what semantic segmentation is and how to prepare the VOCdevkit dataset used to train it; and implemented style transfer between two images by extracting features with convolutional layers. The sections below cover each topic in turn, with complete code.

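1. Region-based convolutional neural networks

Region-based convolutional neural networks (R-CNNs) are a family of neural network models that apply deep models to object detection.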

1.1 R-CNN

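R-CNN first runs selective search on the input image to pick many high-quality region proposals. These proposals are usually selected at multiple scales and have different shapes and sizes. Each proposal is labeled with a class and a ground-truth bounding box.

Next, a pretrained convolutional neural network is chosen and truncated before its output layer. Each region proposal is resized to the input size the network expects, and a forward pass produces the features extracted for that proposal. The features of each proposal, together with its labeled class, form one example for training multiple support vector machines for object classification, where each support vector machine decides whether an example belongs to one particular class. The features of each proposal, together with its labeled bounding box, form one example for training a linear regression model that predicts the ground-truth bounding box.

The R-CNN model is shown in the figure below: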

1.2 Fast R-CNN

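Although R-CNN extracts image features effectively with a pretrained convolutional neural network, it is slow: we may select thousands of region proposals from a single image, so detecting objects in that image requires thousands of forward passes through the network.

Fast R-CNN therefore feeds the whole image, rather than each region proposal, into the feature-extracting convolutional neural network. This network usually also takes part in training, that is, its parameters are updated. Suppose selective search generates n region proposals. These proposals of different shapes mark regions of interest (RoIs) of different shapes on the network's output. Features of the same shape must then be extracted from these RoIs so that they can be concatenated.

Fast R-CNN introduces the RoI pooling layer, which takes the network's output and the region proposals as input and outputs the concatenated features extracted for each proposal. A fully connected layer then transforms the output shape to n×d, where the hyperparameter d depends on the model design. For class prediction, the fully connected output is further transformed to shape n×q (q is the number of classes) and passed through softmax regression. For bounding-box prediction, it is transformed to shape n×4.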
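To make the idea of "same-shape features from RoIs of different shapes" concrete, here is a minimal sketch of RoI max pooling in plain Python. It is an illustration only, not the course's implementation: the function name `roi_max_pool`, the single-channel list-of-lists feature map, and the inclusive `(x1, y1, x2, y2)` RoI format in feature-map coordinates are all assumptions made here. In practice a library operator would be used instead.

```python
import math

def roi_max_pool(feature_map, roi, output_size):
    """Max-pool one RoI of a 2-D feature map into a fixed output_size grid.

    feature_map: list of rows (single channel, hypothetical toy input).
    roi: (x1, y1, x2, y2), inclusive, in feature-map coordinates.
    output_size: (out_h, out_w) of the pooled result.
    """
    x1, y1, x2, y2 = roi
    out_h, out_w = output_size
    roi_h, roi_w = y2 - y1 + 1, x2 - x1 + 1
    pooled = []
    for i in range(out_h):
        # Rows covered by the i-th bin (bins may overlap when sizes do not divide).
        r0 = y1 + math.floor(i * roi_h / out_h)
        r1 = y1 + math.ceil((i + 1) * roi_h / out_h)
        row = []
        for j in range(out_w):
            c0 = x1 + math.floor(j * roi_w / out_w)
            c1 = x1 + math.ceil((j + 1) * roi_w / out_w)
            row.append(max(feature_map[r][c]
                           for r in range(r0, r1)
                           for c in range(c0, c1)))
        pooled.append(row)
    return pooled

# 4x4 feature map holding 0..15, and two RoIs of different shapes.
feature_map = [[4 * r + c for c in range(4)] for r in range(4)]
rois = [(0, 0, 2, 2), (0, 1, 3, 3)]
pooled = [roi_max_pool(feature_map, roi, (2, 2)) for roi in rois]
print(pooled)
```

Both RoIs, a 3×3 one and a 3×4 one, come out as 2×2 grids, which is what allows the per-proposal features to be stacked into a single n×d input for the fully connected layer.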

The Fast R-CNN model structure is shown in the figure below:

1.3 Faster R-CNN

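Fast R-CNN usually needs selective search to generate a large number of region proposals to obtain reasonably accurate detections. Faster R-CNN replaces selective search with a region proposal network, which reduces the number of proposals while preserving detection accuracy. Compared with Fast R-CNN, the only change is that proposals come from the region proposal network instead of selective search; everything else stays the same.

Faster R-CNN uses a 3×3 convolutional layer with padding 1 to transform the output of the convolutional neural network, and the number of output channels is denoted c. Every unit of the feature map extracted from the image thus gets a new feature of length c. Centered on each unit of the feature map, multiple anchor boxes of different sizes and aspect ratios are generated and labeled. The length-c feature at the center unit of each anchor box is used to predict that anchor box's binary class (object or background) and its bounding box. Non-maximum suppression then removes similar results from the predicted bounding boxes whose predicted class is object. The predicted bounding boxes that remain are the region proposals required by the RoI pooling layer.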
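The non-maximum suppression step mentioned above can be sketched as follows. This is a minimal plain-Python version for illustration, not the code used by Faster R-CNN itself: the names `iou` and `nms`, the `(x1, y1, x2, y2)` box format, the threshold of 0.5, and the sample boxes and scores are all assumptions made here, and boxes are assumed to have positive area.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_threshold):
    """Return indices of the boxes kept, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)          # most confident remaining box
        keep.append(best)
        # Drop every remaining box that overlaps it too much.
        order = [i for i in order
                 if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep

# Three near-duplicate predictions around one object, plus one separate box.
boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140), (11, 9, 49, 51)]
scores = [0.9, 0.8, 0.7, 0.6]
kept = nms(boxes, scores, iou_threshold=0.5)
print(kept)
```

Only the highest-scoring box of the overlapping cluster and the separate box survive, so far fewer proposals are passed on to the RoI pooling layer.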

The Faster R-CNN model structure is shown in the figure below:

1.4 Mask R-CNN

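If the training data also labels the pixel-level position of every object in the image, Mask R-CNN can exploit these detailed labels to further improve detection accuracy.

Mask R-CNN modifies Faster R-CNN. It replaces the RoI pooling layer with an RoI align layer, which uses bilinear interpolation to preserve the spatial information of the feature map and is therefore better suited to pixel-level prediction. The RoI align layer outputs feature maps of the same shape for all RoIs. These are used both to predict each RoI's class and bounding box and, through an additional fully convolutional network, to predict the object's pixel-level position.

The Mask R-CNN model structure is shown in the figure below: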

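Bilinear interpolation: a common interpolation method in two-dimensional space, built from linear interpolation along two directions. Suppose we have a two-dimensional grid and want to estimate the value $P$ at a point $(x, y)$ that does not lie on the grid. This value can be approximated from the values at the four surrounding grid points: $Q_{11}$ at $(x_{1},y_{1})$, $Q_{12}$ at $(x_{1},y_{2})$, $Q_{21}$ at $(x_{2},y_{1})$, and $Q_{22}$ at $(x_{2},y_{2})$.

The formula is: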

$P=\frac{(x_{2}-x)(y_{2}-y)}{(x_{2}-x_{1})(y_{2}-y_{1})}Q_{11}+\frac{(x-x_{1})(y_{2}-y)}{(x_{2}-x_{1})(y_{2}-y_{1})}Q_{21}+\frac{(x_{2}-x)(y-y_{1})}{(x_{2}-x_{1})(y_{2}-y_{1})}Q_{12}+\frac{(x-x_{1})(y-y_{1})}{(x_{2}-x_{1})(y_{2}-y_{1})}Q_{22}$
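The formula above translates directly into a few lines of Python. The sketch below is only a numeric check of the formula; the function name `bilinear_interpolate` and the sample corner values are made up here for illustration.

```python
def bilinear_interpolate(x, y, x1, y1, x2, y2, q11, q21, q12, q22):
    """Estimate the value at (x, y) from the four surrounding grid points.

    q11 is the value at (x1, y1), q21 at (x2, y1),
    q12 at (x1, y2) and q22 at (x2, y2), as in the formula above.
    """
    area = (x2 - x1) * (y2 - y1)
    return ((x2 - x) * (y2 - y) * q11
            + (x - x1) * (y2 - y) * q21
            + (x2 - x) * (y - y1) * q12
            + (x - x1) * (y - y1) * q22) / area

# Unit cell with hypothetical corner values 10, 20, 30, 40.
value = bilinear_interpolate(0.25, 0.75, 0, 0, 1, 1,
                             q11=10, q21=20, q12=30, q22=40)
corner = bilinear_interpolate(0, 0, 0, 0, 1, 1,
                              q11=10, q21=20, q12=30, q22=40)
print(value, corner)
```

Each corner is weighted by the area of the rectangle opposite to it, so the estimate equals the corner value exactly when (x, y) coincides with a grid point.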

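2. Semantic segmentation and the dataset

Semantic segmentation studies how to divide an image into regions that belong to different semantic classes. These semantic regions are labeled and predicted at the pixel level. Compared with object detection, the pixel-level borders labeled in semantic segmentation are clearly more fine-grained. Semantic segmentation is illustrated in the figure below:

Image segmentation divides an image into several constituent regions. Methods for this problem usually exploit the correlation between pixels in the image. They need no pixel-level label information during training, so at prediction time there is no guarantee that the resulting regions carry the semantics we want. Image segmentation might split a dog into two regions: one covering the mostly black mouth and eyes, and another covering the mostly yellow rest of the body.

Instance segmentation studies how to identify the pixel-level region of each object instance in an image. Unlike semantic segmentation, it must distinguish not only semantics but also different object instances. If an image contains two dogs, instance segmentation has to determine which of the two dogs each pixel belongs to.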
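The difference between the two kinds of labels can be seen on a toy label map. The sketch below is purely illustrative: the 2×6 "image", the instance ids, and the mapping dictionary are invented here, and 12 is used as the class index for a dog because that is its position in the VOC_CLASSES list that appears in the code of this post.

```python
DOG = 12  # index of 'dog' in VOC_CLASSES; 0 is background

# Instance labels: 0 = background, 1 = first dog, 2 = second dog.
instance_map = [
    [0, 1, 1, 0, 2, 2],
    [0, 1, 1, 0, 2, 2],
]

# Semantic labels only record the class of each pixel.
instance_class = {0: 0, 1: DOG, 2: DOG}
semantic_map = [[instance_class[i] for i in row] for row in instance_map]

semantic_labels = sorted({v for row in semantic_map for v in row})
instance_labels = sorted({v for row in instance_map for v in row})
print(semantic_map)
print(semantic_labels, instance_labels)
```

The semantic map contains only two distinct values (background and dog), so the two dogs cannot be told apart from it, whereas the instance map keeps a separate id for each dog.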

The code below loads, preprocesses, and displays Pascal VOC2012, an important dataset for semantic segmentation.

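# Semantic segmentation studies how to divide an image into regions of different semantic classes.
# Regions are labeled and predicted per pixel, so the labels are finer-grained than object-detection boxes.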
import time
import torch
import torch.nn.functional as F
import torchvision
import numpy as np
from PIL import Image
from tqdm import tqdm
import sys
sys.path.append("..")
import d2lzh_pytorch as d2l

def read_voc_images(root="./data/VOCdevkit/VOC2012", is_train=True, max_num=None):
    txt_fname = '%s/ImageSets/Segmentation/%s' % (
        root, 'train.txt' if is_train else 'val.txt')
    with open(txt_fname, 'r') as f:
        images = f.read().split()
    if max_num is not None:
        images = images[:min(max_num, len(images))]
    features, labels = [None] * len(images), [None] * len(images)
    for i, fname in tqdm(enumerate(images)):
        features[i] = Image.open('%s/JPEGImages/%s.jpg' % (root, fname)).convert("RGB")
        labels[i] = Image.open('%s/SegmentationClass/%s.png' % (root, fname)).convert("RGB")
    return features, labels # PIL image

voc_dir = "./data/VOCdevkit/VOC2012"
train_features, train_labels = read_voc_images(voc_dir, max_num=100)

n = 5
imgs = train_features[0:n] + train_labels[0:n]
d2l.show_images(imgs, 2, n)

# RGB color used in the label images for each class, in the same order as VOC_CLASSES
VOC_COLORMAP = [[0, 0, 0], [128, 0, 0], [0, 128, 0], [128, 128, 0],
                [0, 0, 128], [128, 0, 128], [0, 128, 128], [128, 128, 128],
                [64, 0, 0], [192, 0, 0], [64, 128, 0], [192, 128, 0],
                [64, 0, 128], [192, 0, 128], [64, 128, 128], [192, 128, 128],
                [0, 64, 0], [128, 64, 0], [0, 192, 0], [128, 192, 0],
                [0, 64, 128]]
# These definitions are also saved in d2lzh_pytorch for later use
VOC_CLASSES = ['background', 'aeroplane', 'bicycle', 'bird', 'boat',
               'bottle', 'bus', 'car', 'cat', 'chair', 'cow',
               'diningtable', 'dog', 'horse', 'motorbike', 'person',
               'potted plant', 'sheep', 'sofa', 'train', 'tv/monitor']

colormap2label = torch.zeros(256 ** 3, dtype=torch.uint8)
for i, colormap in enumerate(VOC_COLORMAP):
    colormap2label[(colormap[0] * 256 + colormap[1]) * 256 + colormap[2]] = i

def voc_label_indices(colormap, colormap2label):
    """
    convert colormap (PIL image) to colormap2label (uint8 tensor).
    """
    colormap = np.array(colormap.convert("RGB")).astype('int32')
    idx = ((colormap[:, :, 0] * 256 + colormap[:, :, 1]) * 256
           + colormap[:, :, 2])
    return colormap2label[idx]

y = voc_label_indices(train_labels[0], colormap2label)
print(y[105:115, 130:140], VOC_CLASSES[1])

# Data preprocessing: crop images to a fixed size
def voc_rand_crop(feature, label, height, width):
    """
    Random crop feature (PIL image) and label (PIL image).
    """
    i, j, h, w = torchvision.transforms.RandomCrop.get_params(
            feature, output_size=(height, width))

    feature = torchvision.transforms.functional.crop(feature, i, j, h, w)
    label = torchvision.transforms.functional.crop(label, i, j, h, w)

    return feature, label

imgs = []
for _ in range(n):
    imgs += voc_rand_crop(train_features[0], train_labels[0], 200, 300)
d2l.show_images(imgs[::2] + imgs[1::2], 2, n)

# Custom semantic segmentation dataset class
class VOCSegDataset(torch.utils.data.Dataset):
    def __init__(self, is_train, crop_size, voc_dir, colormap2label, max_num=None):
        """
        crop_size: (h, w)
        """
        self.rgb_mean = np.array([0.485, 0.456, 0.406])
        self.rgb_std = np.array([0.229, 0.224, 0.225])
        self.tsf = torchvision.transforms.Compose([
            torchvision.transforms.ToTensor(),
            torchvision.transforms.Normalize(mean=self.rgb_mean,
                                             std=self.rgb_std)
        ])

        self.crop_size = crop_size # (h, w)
        features, labels = read_voc_images(root=voc_dir,
                                           is_train=is_train,
                                           max_num=max_num)
        self.features = self.filter(features) # PIL image
        self.labels = self.filter(labels)     # PIL image
        self.colormap2label = colormap2label
        print('read ' + str(len(self.features)) + ' valid examples')

    def filter(self, imgs):
        return [img for img in imgs if (
            img.size[1] >= self.crop_size[0] and
            img.size[0] >= self.crop_size[1])]

    def __getitem__(self, idx):
        feature, label = voc_rand_crop(self.features[idx], self.labels[idx],
                                       *self.crop_size)

        return (self.tsf(feature), # float32 tensor
                voc_label_indices(label, self.colormap2label)) # uint8 tensor

    def __len__(self):
        return len(self.features)

crop_size = (320, 480)  # output shape (height, width) of the random crop: 320×480
max_num = 100
# Load the dataset
voc_train = VOCSegDataset(True, crop_size, voc_dir, colormap2label, max_num)
voc_test = VOCSegDataset(False, crop_size, voc_dir, colormap2label, max_num)

batch_size = 64
num_workers = 0 if sys.platform.startswith('win32') else 4
train_iter = torch.utils.data.DataLoader(voc_train, batch_size, shuffle=True, drop_last=True, num_workers=num_workers)
test_iter = torch.utils.data.DataLoader(voc_test, batch_size, drop_last=True, num_workers=num_workers)

for X, Y in train_iter:
    print(X.dtype, X.shape)
    print(Y.dtype, Y.shape)
    break

The output of the code is as follows:

Random crops produced by the preprocessing step:
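The color-to-class lookup performed by `colormap2label` and `voc_label_indices` in the code above can be checked by hand on a tiny example. The sketch below swaps the dense 256³-entry tensor for a plain dictionary so that it runs without PyTorch; the names `rgb_to_key`, `lookup`, and `label_image`, and the three-color subset of the palette, are choices made here for illustration.

```python
# First three entries of VOC_COLORMAP: background, aeroplane, bicycle.
colormap_subset = [(0, 0, 0), (128, 0, 0), (0, 128, 0)]

def rgb_to_key(r, g, b):
    """Pack an RGB triple into one integer, as done for colormap2label."""
    return (r * 256 + g) * 256 + b

# Map each packed color to its class index.
lookup = {rgb_to_key(*rgb): idx for idx, rgb in enumerate(colormap_subset)}

# A hypothetical 2x2 label image given as RGB triples.
label_image = [
    [(0, 0, 0), (128, 0, 0)],
    [(128, 0, 0), (0, 128, 0)],
]
indices = [[lookup[rgb_to_key(*px)] for px in row] for row in label_image]
print(rgb_to_key(128, 0, 0), indices)
```

Packing the three channels into a single integer is what lets the original code convert a whole label image with one indexing operation instead of comparing every pixel against every palette color.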

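3. Style transfer

This section covers style transfer based on convolutional neural networks. First, we initialize the synthesized image, for example as the content image. The synthesized image is the only variable updated during style transfer, that is, it is the model parameter being iterated. Then we choose a pretrained convolutional neural network to extract image features; its parameters are not updated during training.

A deep convolutional neural network extracts image features hierarchically through multiple layers. We can choose the outputs of some of these layers as content features or style features. For example, suppose the chosen pretrained network has 3 convolutional layers: the second layer outputs the content features, while the outputs of the first and third layers serve as style features. Next, we compute the style transfer loss through forward propagation and iterate the model parameters through backpropagation, that is, we keep updating the synthesized image.

The loss function commonly used for style transfer has 3 parts: the content loss makes the synthesized image close to the content image in content features; the style loss makes the synthesized image close to the style image in style features; and the total variation loss helps reduce noise in the synthesized image.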
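Written out, the three-part loss is a weighted sum. The symbols below ($\lambda_c$, $\lambda_s$, $\lambda_{tv}$, $F^{l}$, $G$, $\mathcal{C}$, $\mathcal{S}$) are notation chosen here for illustration rather than taken from the course; the definitions follow the code in this post, where the three weights are set to 1, $10^{3}$, and 10.

$L(X)=\lambda_{c}\sum_{l\in\mathcal{C}}\mathrm{MSE}\big(F^{l}(X),F^{l}(X_{c})\big)+\lambda_{s}\sum_{l\in\mathcal{S}}\mathrm{MSE}\big(G(F^{l}(X)),G(F^{l}(X_{s}))\big)+\lambda_{tv}L_{tv}(X)$

$G(F)=\frac{FF^{\top}}{c\cdot hw}$

$L_{tv}(X)=\frac{1}{2}\Big(\mathrm{mean}\,|x_{i+1,j}-x_{i,j}|+\mathrm{mean}\,|x_{i,j+1}-x_{i,j}|\Big)$

Here $X$ is the synthesized image, $X_{c}$ and $X_{s}$ are the content and style images, $F^{l}$ is the output of layer $l$ of the pretrained network (reshaped to $c\times hw$ inside $G$), $\mathcal{C}$ and $\mathcal{S}$ are the chosen content and style layers, and each mean runs over all channels and pixel positions.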

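Finally, when training ends, we output the model parameters of style transfer, which are the final synthesized image.

The structure of the CNN-based style transfer method is shown in the figure below:

Solid arrows: forward propagation; dashed arrows: backpropagation.

The code below uses the following image as the content image:

and the following image as the style image: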

import time
import torch
import torch.nn.functional as F
import torchvision
import numpy as np
from PIL import Image
import sys
sys.path.append("..")
import d2lzh_pytorch as d2l
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

d2l.set_figsize()
content_img = Image.open('./data/image/rainier.png').convert('RGB')
d2l.plt.imshow(content_img)
d2l.plt.show()

d2l.set_figsize()
style_img = Image.open('./data/image/autumn_oak.png').convert('RGB')
d2l.plt.imshow(style_img)
d2l.plt.show()

rgb_mean = np.array([0.485, 0.456, 0.406])
rgb_std = np.array([0.229, 0.224, 0.225])

def preprocess(PIL_img, image_shape):
    process = torchvision.transforms.Compose([
        torchvision.transforms.Resize(image_shape),  # resize the input image
        torchvision.transforms.ToTensor(),
        torchvision.transforms.Normalize(mean=rgb_mean, std=rgb_std)])  # standardize each of the three RGB channels

    return process(PIL_img).unsqueeze(dim=0)  # (batch_size, 3, H, W)

def postprocess(img_tensor):
    inv_normalize = torchvision.transforms.Normalize(  # undo the standardization to recover the original pixel values
        mean=-rgb_mean / rgb_std,
        std=1/rgb_std)
    to_PIL_image = torchvision.transforms.ToPILImage()
    return to_PIL_image(inv_normalize(img_tensor[0].cpu()).clamp(0, 1))

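# Use a VGG-19 model pretrained on ImageNet to extract image features
# (newer torchvision versions prefer weights=torchvision.models.VGG19_Weights.DEFAULT over pretrained=True)
pretrained_net = torchvision.models.vgg19(pretrained=True, progress=True)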
print(pretrained_net)

style_layers, content_layers = [0, 5, 10, 19, 28], [25]

net_list = []
for i in range(max(content_layers + style_layers) + 1):
    net_list.append(pretrained_net.features[i])
net = torch.nn.Sequential(*net_list)  # new network that keeps only the VGG layers up to the last one we need

def extract_features(X, content_layers, style_layers):
    contents = []
    styles = []
    for i in range(len(net)):
        X = net[i](X)  # apply the layers one at a time
        if i in style_layers:
            styles.append(X)
        if i in content_layers:
            contents.append(X)
    return contents, styles

def get_contents(image_shape, device):  # extract content features
    content_X = preprocess(content_img, image_shape).to(device)
    contents_Y, _ = extract_features(content_X, content_layers, style_layers)
    return content_X, contents_Y

def get_styles(image_shape, device):  # extract style features
    style_X = preprocess(style_img, image_shape).to(device)
    _, styles_Y = extract_features(style_X, content_layers, style_layers)
    return style_X, styles_Y

def content_loss(Y_hat, Y):  # content loss measured by the squared error
    return F.mse_loss(Y_hat, Y)

def gram(X):  # Gram matrix of the channel features (assumes batch size 1)
    num_channels, n = X.shape[1], X.shape[2] * X.shape[3]
    X = X.view(num_channels, n)
    return torch.matmul(X, X.t()) / (num_channels * n)

def style_loss(Y_hat, gram_Y):  # style loss: squared error between Gram matrices
    return F.mse_loss(gram(Y_hat), gram_Y)

# The learned synthesized image contains a lot of high-frequency noise, i.e. particularly bright or dark pixels
def tv_loss(Y_hat):  # total variation denoising
    return 0.5 * (F.l1_loss(Y_hat[:, :, 1:, :], Y_hat[:, :, :-1, :]) +
                  F.l1_loss(Y_hat[:, :, :, 1:], Y_hat[:, :, :, :-1]))

# Total loss: the weighted sum of the content loss, style loss, and total variation loss
content_weight, style_weight, tv_weight = 1, 1e3, 10
def compute_loss(X, contents_Y_hat, styles_Y_hat, contents_Y, styles_Y_gram):
    # Compute the content, style, and total variation losses separately
    contents_l = [content_loss(Y_hat, Y) * content_weight for Y_hat, Y in zip(
        contents_Y_hat, contents_Y)]
    styles_l = [style_loss(Y_hat, Y) * style_weight for Y_hat, Y in zip(
        styles_Y_hat, styles_Y_gram)]
    tv_l = tv_loss(X) * tv_weight
    # Sum all the losses
    l = sum(styles_l) + sum(contents_l) + tv_l
    return contents_l, styles_l, tv_l, l

# Initialize the synthesized image
class GeneratedImage(torch.nn.Module):
    def __init__(self, img_shape):
        super(GeneratedImage, self).__init__()
        self.weight = torch.nn.Parameter(torch.rand(*img_shape))

    def forward(self):
        return self.weight

def get_inits(X, device, lr, styles_Y):
    gen_img = GeneratedImage(X.shape).to(device)
    gen_img.weight.data = X.data
    optimizer = torch.optim.Adam(gen_img.parameters(), lr=lr)
    styles_Y_gram = [gram(Y) for Y in styles_Y]
    return gen_img(), styles_Y_gram, optimizer

# Model training
def train(X, contents_Y, styles_Y, device, lr, max_epochs, lr_decay_epoch):
    print("training on ", device)
    X, styles_Y_gram, optimizer = get_inits(X, device, lr, styles_Y)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, lr_decay_epoch, gamma=0.1)
    for i in range(max_epochs):
        start = time.time()

        contents_Y_hat, styles_Y_hat = extract_features(
                X, content_layers, style_layers)
        contents_l, styles_l, tv_l, l = compute_loss(
                X, contents_Y_hat, styles_Y_hat, contents_Y, styles_Y_gram)

        optimizer.zero_grad()
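        # contents_Y and styles_Y_gram were computed without detaching,
        # so their graph has to survive across iterations
        l.backward(retain_graph=True)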
        optimizer.step()
        scheduler.step()

        if i % 50 == 0 and i != 0:
            print('epoch %3d, content loss %.2f, style loss %.2f, '
                  'TV loss %.2f, %.2f sec'
                  % (i, sum(contents_l).item(), sum(styles_l).item(), tv_l.item(),
                     time.time() - start))
    return X.detach()

image_shape = (150, 225)  # (height, width)
net = net.to(device)
content_X, contents_Y = get_contents(image_shape, device)
style_X, styles_Y = get_styles(image_shape, device)
output = train(content_X, contents_Y, styles_Y, device, 0.01, 500, 200)
d2l.plt.imshow(postprocess(output))
d2l.plt.show()
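As a hand-checkable illustration of the Gram matrix used by `gram` in the style loss above, the following sketch recomputes it in plain Python. Using lists instead of tensors is a simplification made here, and the function name `gram_matrix` and the 2-channel, 2-position feature values are invented for the example.

```python
def gram_matrix(features):
    """Gram matrix of channel features.

    features: list of c rows, each holding the n = h * w activations
    of one channel (the layer output after flattening height and width).
    """
    c, n = len(features), len(features[0])
    return [[sum(a * b for a, b in zip(features[i], features[j])) / (c * n)
             for j in range(c)]
            for i in range(c)]

features = [[1.0, 2.0], [3.0, 4.0]]
shuffled = [[2.0, 1.0], [4.0, 3.0]]  # same activations, spatial positions swapped

g = gram_matrix(features)
g_shuffled = gram_matrix(shuffled)
print(g)
print(g == g_shuffled)
```

Swapping the spatial positions leaves the Gram matrix unchanged: it records how strongly channels co-activate, not where they activate. This is one way to see why it serves as a style feature while the raw layer output is used for content.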