Experiment Content
- Network Visualization (PyTorch)
- Explore methods for visualizing the features of a model pretrained on ImageNet.
- Explore various applications of image gradients, including saliency maps, fooling images, and class visualization.
Experiment Principles
In this notebook we will explore the use of image gradients for generating new images. When training a model, we define a loss function which measures our current unhappiness with the model’s performance; we then use backpropagation to compute the gradient of the loss with respect to the model parameters, and perform gradient descent on the model parameters to minimize the loss.
Here we will do something slightly different. We will start from a convolutional neural network model which has been pretrained to perform image classification on the ImageNet dataset. We will use this model to define a loss function which quantifies our current unhappiness with our image, then use backpropagation to compute the gradient of this loss with respect to the pixels of the image. We will then keep the model fixed, and perform gradient descent on the image to synthesize a new image which minimizes the loss.
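The loop below sketches this idea in PyTorch (a minimal illustration; model, loss_fn, num_steps, and lr are hypothetical placeholders, not names from this assignment): the model weights stay frozen while gradient descent runs on the pixels.
import torch

def descend_on_image(img, model, loss_fn, num_steps=100, lr=0.1):
    for param in model.parameters():
        param.requires_grad = False   # keep the model fixed
    img = img.clone().requires_grad_()
    for _ in range(num_steps):
        loss = loss_fn(model(img))    # unhappiness with the current image
        loss.backward()               # gradient w.r.t. the pixels
        with torch.no_grad():
            img -= lr * img.grad      # gradient descent on the image
        img.grad.zero_()
    return img.detach()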
In this notebook we will explore three techniques for image generation:
- Saliency Maps: Saliency maps are a quick way to tell which part of the image influenced the classification decision made by the network.
- Fooling Images: We can perturb an input image so that it appears the same to humans, but will be misclassified by the pretrained network.
- Class Visualization: We can synthesize an image to maximize the classification score of a particular class; this can give us some sense of what the network is looking for when it classifies images of that class.
This notebook uses PyTorch; we have provided another notebook which explores the same concepts in TensorFlow. You only need to complete one of these two notebooks.
Experiment Steps
Import libraries and initialize
import os
os.environ['KMP_DUPLICATE_LIB_OK'] = 'True'
import torch
import torchvision
import torchvision.transforms as T
import random
import numpy as np
from scipy.ndimage import gaussian_filter1d  # scipy.ndimage.filters is deprecated in newer SciPy
import matplotlib.pyplot as plt
from sducs2019.image_utils import SQUEEZENET_MEAN, SQUEEZENET_STD
from PIL import Image
%matplotlib inline
plt.rcParams['figure.figsize'] = (10.0, 8.0) # set default size of plots
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'
Helper Functions
Our pretrained model was trained on images that had been preprocessed by subtracting the per-color mean and dividing by the per-color standard deviation. We define a few helper functions for performing and undoing this preprocessing. You don’t need to do anything in this cell.
- Preprocessing function
def preprocess(img, size=224):
    transform = T.Compose([
        T.Resize(size),
        T.ToTensor(),
        T.Normalize(mean=SQUEEZENET_MEAN.tolist(),
                    std=SQUEEZENET_STD.tolist()),
        T.Lambda(lambda x: x[None]),
    ])
    return transform(img)
- Deprocessing function (undoes the preprocessing)
def deprocess(img, should_rescale=True):
    transform = T.Compose([
        T.Lambda(lambda x: x[0]),
        T.Normalize(mean=[0, 0, 0], std=(1.0 / SQUEEZENET_STD).tolist()),
        T.Normalize(mean=(-SQUEEZENET_MEAN).tolist(), std=[1, 1, 1]),
        T.Lambda(rescale) if should_rescale else T.Lambda(lambda x: x),
        T.ToPILImage(),
    ])
    return transform(img)
- Rescale pixel values into the range [0, 1]
def rescale(x):
    low, high = x.min(), x.max()
    x_rescaled = (x - low) / (high - low)
    return x_rescaled
- Gaussian-blur an image (see the check after this block)
def blur_image(X, sigma=1):
    X_np = X.cpu().clone().numpy()
    X_np = gaussian_filter1d(X_np, sigma, axis=2)
    X_np = gaussian_filter1d(X_np, sigma, axis=3)
    X.copy_(torch.Tensor(X_np).type_as(X))
    return X
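blur_image blurs with a 1-D Gaussian along the H axis and then the W axis; because the Gaussian kernel is separable, this is equivalent to a full 2-D Gaussian blur. A quick check (a hypothetical snippet, not part of the assignment):
import numpy as np
from scipy.ndimage import gaussian_filter, gaussian_filter1d

x = np.random.rand(1, 3, 32, 32)
sep = gaussian_filter1d(gaussian_filter1d(x, 1.0, axis=2), 1.0, axis=3)
full = gaussian_filter(x, sigma=(0, 0, 1.0, 1.0))  # blur only the H and W axes
print(np.allclose(sep, full))  # True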
Pretrained Model
For all of our image generation experiments, we will start with a convolutional neural network which was pretrained to perform image classification on ImageNet. We can use any model here, but for the purposes of this assignment we will use SqueezeNet [1], which achieves accuracies comparable to AlexNet but with a significantly reduced parameter count and computational complexity.
Using SqueezeNet rather than AlexNet or VGG or ResNet means that we can easily perform all image generation experiments on CPU.
# Download and load the pretrained SqueezeNet model.
model = torchvision.models.squeezenet1_1(pretrained=True)
# We don't want to train the model, so tell PyTorch not to compute gradients
# with respect to model parameters.
for param in model.parameters():
param.requires_grad = False
# you may see warning regarding initialization deprecated, that's fine, please continue to next steps
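As an optional check of the parameter-count claim above (a hypothetical snippet, not part of the assignment), torchvision lets us compare model sizes directly:
# Optional: compare parameter counts; SqueezeNet 1.1 is roughly 50x smaller than AlexNet.
def count_params(m):
    return sum(p.numel() for p in m.parameters())

print('SqueezeNet 1.1: %.2fM parameters' % (count_params(model) / 1e6))
print('AlexNet:        %.2fM parameters'
      % (count_params(torchvision.models.alexnet()) / 1e6))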
Load some ImageNet images
We have provided a few example images from the validation set of the ImageNet ILSVRC 2012 Classification dataset.
Since they come from the validation set, our pretrained model did not see these images during training.
Run the following cell to visualize some of these images, along with their ground-truth labels.
from sducs2019.data_utils import load_imagenet_val
X, y, class_names = load_imagenet_val(num=5)
plt.figure(figsize=(12, 6))
for i in range(5):
plt.subplot(1, 5, i + 1)
plt.imshow(X[i])
plt.title(class_names[y[i]])
plt.axis('off')
plt.gcf().tight_layout()
Saliency Maps
Using this pretrained model, we will compute class saliency maps as described in Section 3.1 of [2].
A saliency map tells us the degree to which each pixel in the image affects the classification score for that image. To compute it, we compute the gradient of the unnormalized score corresponding to the correct class (which is a scalar) with respect to the pixels of the image. If the image has shape (3, H, W), then this gradient will also have shape (3, H, W); for each pixel in the image, this gradient tells us the amount by which the classification score will change if the pixel changes by a small amount. To compute the saliency map, we take the absolute value of this gradient, then take the maximum value over the 3 input channels; the final saliency map thus has shape (H, W) and all entries are nonnegative.
Hint: PyTorch gather method
Recall in Assignment 1 you needed to select one element from each row of a matrix; if s is a numpy array of shape (N, C) and y is a numpy array of shape (N,) containing integers 0 <= y[i] < C, then s[np.arange(N), y] is a numpy array of shape (N,) which selects one element from each row of s using the indices in y.
In PyTorch you can perform the same operation using the gather() method. If s is a PyTorch Tensor of shape (N, C) and y is a PyTorch Tensor of shape (N,) containing longs in the range 0 <= y[i] < C, then
s.gather(1, y.view(-1, 1)).squeeze()
will be a PyTorch Tensor of shape (N,) containing one entry from each row of s, selected according to the indices in y.
Run the following cell to see an example. You can also read the documentation for the gather method and the squeeze method.
In short, gather selects elements along the specified axis according to the indices in index and reassembles them into a new tensor; the output has the same size as index. Note that index must be a LongTensor.
- Test the gather function
# Example of using gather to select one entry from each row in PyTorch
def gather_example():
    N, C = 4, 5
    s = torch.randn(N, C)
    y = torch.LongTensor([1, 2, 1, 3])
    print(s)
    print(y)
    print(s.gather(1, y.view(-1, 1)).squeeze())

gather_example()
- Compute the gradient of the correct-class score with respect to the input images
def compute_saliency_maps(X, y, model):
    """
    Compute a class saliency map using the model for images X and labels y.

    Input:
    - X: Input images; Tensor of shape (N, 3, H, W)
    - y: Labels for X; LongTensor of shape (N,)
    - model: A pretrained CNN that will be used to compute the saliency map.

    Returns:
    - saliency: A Tensor of shape (N, H, W) giving the saliency maps for the
      input images.
    """
    # Make sure the model is in "test" mode
    model.eval()

    # Make input tensor require gradient
    X.requires_grad_()

    saliency = None
    ##############################################################################
    # TODO: Implement this function. Perform a forward and backward pass through #
    # the model to compute the gradient of the correct class score with respect  #
    # to each input image. You first want to compute the loss over the correct   #
    # scores (we'll combine losses across a batch by summing), and then compute  #
    # the gradients with a backward pass.                                        #
    ##############################################################################
    # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    s = model(X)  # torch.Size([5, 1000]): 5 input images, 1000 class scores each
    # Select each image's correct-class score for backprop: torch.Size([5])
    correct_class_scores = s.gather(1, y.view(-1, 1)).squeeze()
    # Backprop from the correct-class scores to get, for every pixel, the
    # gradient of its image's correct-class score
    correct_class_scores.backward(torch.ones_like(correct_class_scores))
    # Take the absolute value of the gradient, then the max over the 3 channels
    saliency, _ = torch.max(torch.abs(X.grad), dim=1)
    # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    ##############################################################################
    #                               END OF YOUR CODE                             #
    ##############################################################################
    return saliency
- Key code walkthrough
- First run the model forward. The output has shape [5, 1000]: 5 is the number of input images, and each image receives 1000 class scores because the model was trained on ImageNet's 1000 classes.
s = model(X)  # torch.Size([5, 1000])
- Select the correct-class score of each image to backpropagate from, giving torch.Size([5]):
correct_class_scores = s.gather(1, y.view(-1, 1)).squeeze()
- Backpropagate from the correct-class scores to obtain, for every pixel, the gradient of its image's correct-class score. Note that we must pass an argument to backward(): here it is a vector of ones of length 5, i.e. [1, 1, 1, 1, 1].
correct_class_scores.backward(torch.ones_like(correct_class_scores))
- On the meaning of backward()'s argument: after passing through the network, every value in out is a linear or nonlinear combination of the input values, so every entry of out can be differentiated with respect to every entry of the input a. The argument [k1, k2, k3, ..., kn] of backward() specifies the weights of these derivatives, so the gradient accumulated in a.grad is
$$\texttt{a.grad} = k_1 \frac{\partial\,\mathrm{out}_1}{\partial a} + k_2 \frac{\partial\,\mathrm{out}_2}{\partial a} + \cdots + k_n \frac{\partial\,\mathrm{out}_n}{\partial a}$$
- In other words, k_i can be understood as the weight of the i-th output component's derivative (a short standalone check follows this list). In this experiment n = 5, so we pass [1, 1, 1, 1, 1]. Finally we take the absolute value of the gradient, then the maximum over the 3 input channels:
saliency, _ = torch.max(torch.abs(X.grad), dim=1)
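Below is a minimal standalone check of how backward()'s gradient argument weights the per-output derivatives (a hypothetical example, not part of the assignment code):
import torch

a = torch.tensor([2.0, 3.0], requires_grad=True)
out = a * a                         # out_i = a_i^2, so d(out_i)/d(a_i) = 2 * a_i
out.backward(torch.ones_like(out))  # weights k = [1, 1]
print(a.grad)                       # tensor([4., 6.]), i.e. d(out_1 + out_2)/da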
- Show the saliency maps
def show_saliency_maps(X, y):
    # Convert X and y from numpy arrays to Torch Tensors
    X_tensor = torch.cat([preprocess(Image.fromarray(x)) for x in X], dim=0)
    y_tensor = torch.LongTensor(y)

    # Compute saliency maps for images in X
    saliency = compute_saliency_maps(X_tensor, y_tensor, model)

    # Convert the saliency map from Torch Tensor to numpy array and show images
    # and saliency maps together.
    saliency = saliency.numpy()
    N = X.shape[0]
    for i in range(N):
        plt.subplot(2, N, i + 1)
        plt.imshow(X[i])
        plt.axis('off')
        plt.title(class_names[y[i]])
        plt.subplot(2, N, N + i + 1)
        plt.imshow(saliency[i], cmap=plt.cm.hot)
        plt.axis('off')
        plt.gcf().set_size_inches(12, 5)
    plt.show()

show_saliency_maps(X, y)
INLINE QUESTION
A friend of yours suggests that in order to find an image that maximizes the correct score, we can perform gradient ascent on the input image, but instead of the gradient we can actually use the saliency map in each step to update the image. Is this assertion true? Why or why not?
Your Answer:
No. The image has shape (3, H, W), with three channels, so any update to the image must use a gradient of the same shape. The saliency map, however, takes the maximum over the three channels, collapsing that dimension and yielding an (H, W) array. The shapes therefore do not match, and the saliency map cannot be used directly to update the original image (a quick shape check follows).
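A quick shape check illustrates the mismatch (a hypothetical snippet; the sizes match this assignment's preprocessed images):
import torch

X_grad = torch.randn(5, 3, 224, 224)          # gradient w.r.t. the input images
saliency, _ = torch.max(X_grad.abs(), dim=1)  # channel dimension is collapsed
print(X_grad.shape)    # torch.Size([5, 3, 224, 224])
print(saliency.shape)  # torch.Size([5, 224, 224]) -- cannot update a (5, 3, 224, 224) image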
Fooling Images
We can also use image gradients to generate “fooling images” as discussed in [3]. Given an image and a target class, we can perform gradient ascent over the image to maximize the target class, stopping when the network classifies the image as the target class. Implement the following function to generate fooling images.
Computing a fooling image works exactly like updating learnable parameters: backpropagate from the target-class score to obtain the gradient dX with respect to the image, take a gradient-ascent step on the image, and repeat until the model is fooled into predicting the class we want.
def make_fooling_image(X, target_y, model):
"""
Generate a fooling image that is close to X, but that the model classifies
as target_y.
Inputs:
- X: Input image; Tensor of shape (1, 3, 224, 224)
- target_y: An integer in the range [0, 1000)
- model: A pretrained CNN
Returns:
- X_fooling: An image that is close to X, but that is classified as target_y
by the model.
"""
# Initialize our fooling image to the input image, and make it require gradient
X_fooling = X.clone()
X_fooling = X_fooling.requires_grad_()
learning_rate = 1
##############################################################################
# TODO: Generate a fooling image X_fooling that the model will classify as #
# the class target_y. You should perform gradient ascent on the score of the #
# target class, stopping when the model is fooled. #
# When computing an update step, first normalize the gradient: #
# dX = learning_rate * g / ||g||_2 #
# #
# You should write a training loop. #
# #
# HINT: For most examples, you should be able to generate a fooling image #
# in fewer than 100 iterations of gradient ascent. #
# You can print your progress over iterations to check your algorithm. #
##############################################################################
# *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    for epoch in range(100):
        scores = model(X_fooling)  # torch.Size([1, 1000]), not torch.Size([1000])
        # max returns both the maximum value and its index
        _, predictions = scores.max(1)
        if predictions == target_y:
            break
        target_scores = scores[:, target_y]
        # Backpropagate to get the gradient of the target score w.r.t. the image
        target_scores.backward()
        g = X_fooling.grad.data
        # Normalized gradient ascent step
        dX = learning_rate * g / g.norm()
        # Update X_fooling and clear the gradient for the next iteration
        X_fooling.data += dX
        X_fooling.grad.zero_()
# *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
##############################################################################
# END OF YOUR CODE #
##############################################################################
return X_fooling
- Run the following cell to generate a fooling image. You should ideally see at first glance no major difference between the original and fooling images, and the network should now make an incorrect prediction on the fooling one. However you should see a bit of random noise if you look at the 10x magnified difference between the original and fooling images. Feel free to change the idx variable to explore other images.
idx = 0
target_y = 6

X_tensor = torch.cat([preprocess(Image.fromarray(x)) for x in X], dim=0)
X_fooling = make_fooling_image(X_tensor[idx:idx+1], target_y, model)

scores = model(X_fooling)
assert target_y == scores.data.max(1)[1][0].item(), 'The model is not fooled!'
- After generating a fooling image, run the following cell to visualize the original image, the fooling image, as well as the difference between them.
X_fooling_np = deprocess(X_fooling.clone())
X_fooling_np = np.asarray(X_fooling_np).astype(np.uint8)

plt.subplot(1, 4, 1)
plt.imshow(X[idx])
plt.title(class_names[y[idx]])
plt.axis('off')

plt.subplot(1, 4, 2)
plt.imshow(X_fooling_np)
plt.title(class_names[target_y])
plt.axis('off')

plt.subplot(1, 4, 3)
X_pre = preprocess(Image.fromarray(X[idx]))
diff = np.asarray(deprocess(X_fooling - X_pre, should_rescale=False))
plt.imshow(diff)
plt.title('Difference')
plt.axis('off')

plt.subplot(1, 4, 4)
diff = np.asarray(deprocess(10 * (X_fooling - X_pre), should_rescale=False))
plt.imshow(diff)
plt.title('Magnified difference (10x)')
plt.axis('off')

plt.gcf().set_size_inches(12, 5)
plt.show()
- As the result shows, perturbing the input image by gradient ascent fools the classifier with relative ease.
Class visualization
By starting with a random noise image and performing gradient ascent on a target class, we can generate an image that the network will recognize as the target class. This idea was first presented in [2]; [3] extended this idea by suggesting several regularization techniques that can improve the quality of the generated image.
Concretely, let $I$ be an image and let $y$ be a target class. Let $s_y(I)$ be the score that a convolutional network assigns to the image $I$ for class $y$; note that these are raw unnormalized scores, not class probabilities. We wish to generate an image $I^*$ that achieves a high score for the class $y$ by solving the problem
$$I^* = \arg\max_I \left( s_y(I) - R(I) \right)$$
where $R$ is a (possibly implicit) regularizer (note the sign of $R(I)$ in the argmax: we want to minimize this regularization term). We can solve this optimization problem using gradient ascent, computing gradients with respect to the generated image. We will use (explicit) L2 regularization of the form
$$R(I) = \lambda \|I\|_2^2$$
and implicit regularization as suggested by [3] by periodically blurring the generated image. We can solve this problem using gradient ascent on the generated image.
In the cell below, complete the implementation of the create_class_visualization function.
Here we generate, from a random noise image, a picture that the network will classify as a chosen target class. This is more "from scratch" than the previous two parts: the image is synthesized directly from random noise, and regularization makes the generated picture smoother (both explicit L2 regularization and implicit jitter/blur). We optimize the objective above by gradient ascent, where $s_y$ is the score of class $y$ and $R(I)$ is the regularization term; this experiment uses an L2 regularizer (a short sanity check of its gradient follows below).
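As a quick check (a hypothetical standalone snippet, not assignment code), autograd's gradient of $R(I) = \lambda \|I\|_2^2$ should be $2 \lambda I$, which is what the ascent step implicitly subtracts from the score gradient:
import torch

l2_reg = 1e-3
I = torch.randn(1, 3, 8, 8, requires_grad=True)
R = l2_reg * torch.sum(I * I)   # R(I) = lambda * ||I||_2^2
R.backward()
print(torch.allclose(I.grad, 2 * l2_reg * I.detach()))  # True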
- Randomly jitter the image horizontally and vertically
def jitter(X, ox, oy):
    """
    Helper function to randomly jitter an image.

    Inputs
    - X: PyTorch Tensor of shape (N, C, H, W)
    - ox, oy: Integers giving number of pixels to jitter along W and H axes

    Returns: A new PyTorch Tensor of shape (N, C, H, W)
    """
    if ox != 0:
        left = X[:, :, :, :-ox]
        right = X[:, :, :, -ox:]
        X = torch.cat([right, left], dim=3)
    if oy != 0:
        top = X[:, :, :-oy]
        bottom = X[:, :, -oy:]
        X = torch.cat([bottom, top], dim=2)
    return X
- Synthesize the class image by gradient ascent
def create_class_visualization(target_y, model, dtype, **kwargs):
    """
    Generate an image to maximize the score of target_y under a pretrained model.

    Inputs:
    - target_y: Integer in the range [0, 1000) giving the index of the class
    - model: A pretrained CNN that will be used to generate the image
    - dtype: Torch datatype to use for computations

    Keyword arguments:
    - l2_reg: Strength of L2 regularization on the image
    - learning_rate: How big of a step to take
    - num_iterations: How many iterations to use
    - blur_every: How often to blur the image as an implicit regularizer
    - max_jitter: How much to jitter the image as an implicit regularizer
    - show_every: How often to show the intermediate result
    """
    model.type(dtype)
    l2_reg = kwargs.pop('l2_reg', 1e-3)
    learning_rate = kwargs.pop('learning_rate', 25)
    num_iterations = kwargs.pop('num_iterations', 100)
    blur_every = kwargs.pop('blur_every', 10)
    max_jitter = kwargs.pop('max_jitter', 16)
    show_every = kwargs.pop('show_every', 25)

    # Randomly initialize the image as a PyTorch Tensor, and make it require gradient.
    img = torch.randn(1, 3, 224, 224).mul_(1.0).type(dtype).requires_grad_()

    for t in range(num_iterations):
        # Randomly jitter the image a bit; this gives slightly nicer results
        ox, oy = random.randint(0, max_jitter), random.randint(0, max_jitter)
        img.data.copy_(jitter(img.data, ox, oy))

        ########################################################################
        # TODO: Use the model to compute the gradient of the score for the     #
        # class target_y with respect to the pixels of the image, and make a   #
        # gradient step on the image using the learning rate. Don't forget the #
        # L2 regularization term!                                              #
        # Be very careful about the signs of elements in your code.            #
        ########################################################################
        # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
        scores = model(img)
        # Score of the target class; scores has shape (1, 1000)
        score = scores[0][target_y]
        loss = score - l2_reg * torch.sum(img * img)
        loss.backward()
        img.data += learning_rate * img.grad.data / img.grad.data.norm()
        img.grad.data.zero_()
        # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
        ########################################################################
        #                           END OF YOUR CODE                           #
        ########################################################################

        # Undo the random jitter
        img.data.copy_(jitter(img.data, -ox, -oy))

        # As regularizer, clamp and periodically blur the image
        for c in range(3):
            lo = float(-SQUEEZENET_MEAN[c] / SQUEEZENET_STD[c])
            hi = float((1.0 - SQUEEZENET_MEAN[c]) / SQUEEZENET_STD[c])
            img.data[:, c].clamp_(min=lo, max=hi)
        if t % blur_every == 0:
            blur_image(img.data, sigma=0.5)

        # Periodically show the image
        if t == 0 or (t + 1) % show_every == 0 or t == num_iterations - 1:
            plt.imshow(deprocess(img.data.clone().cpu()))
            class_name = class_names[target_y]
            plt.title('%s\nIteration %d / %d' % (class_name, t + 1, num_iterations))
            plt.gcf().set_size_inches(4, 4)
            plt.axis('off')
            plt.show()

    return deprocess(img.data.cpu())
- Compute the score of class target_y; scores has shape (1, 1000):
score = scores[0][target_y]
- Add the L2 regularization term and backpropagate:
loss = score - l2_reg * torch.sum(img * img)
loss.backward()
- Update img by normalized gradient ascent, then clear the gradient:
img.data += learning_rate * img.grad.data / img.grad.data.norm()
img.grad.data.zero_()
- Once you have completed the implementation in the cell above, run the following cell to generate an image of a Tarantula:
dtype = torch.FloatTensor
# dtype = torch.cuda.FloatTensor  # Uncomment this to use GPU
model.type(dtype)

target_y = 76  # Tarantula
# target_y = 78  # Tick
# target_y = 187  # Yorkshire Terrier
# target_y = 683  # Oboe
# target_y = 366  # Gorilla
# target_y = 604  # Hourglass
out = create_class_visualization(target_y, model, dtype)
- Try out your class visualization on other classes! You should also feel free to play with various hyperparameters to try and improve the quality of the generated image, but this is not required.
# target_y = 78  # Tick
# target_y = 187  # Yorkshire Terrier
# target_y = 683  # Oboe
# target_y = 366  # Gorilla
# target_y = 604  # Hourglass
target_y = np.random.randint(1000)
print(class_names[target_y])
X = create_class_visualization(target_y, model, dtype)
Appendix
- Reference: CS231n-assignment3-Network_Visualization (PyTorch)
- With thanks to my friend 张*浩