[Classic Paper Reading Series] ImageNet Classification with Deep Convolutional Neural Networks

mjx792

Categories: Classic Paper Reading Series, Computer Vision; Tags: Paper Reading

Article link: https://blog.csdn.net/mjx792/article/details/133799823

title: 【Classical Paper】ImageNet Classification with Deep Convolutional Neural Networks
date: 2023-10-12 11:29:47
tags: [Classical Paper, Classification]
katex: true

Title: ImageNet Classification with Deep Convolutional Neural Networks

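Paper: pdf
PyTorch AlexNet inference example
PyTorch AlexNet architecture code
Understanding the AlexNet network in depth
Simplified blog post
Full-text translation
Mu Li's Bilibili video walkthrough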

Dataset

ImageNet Large-Scale Visual Recognition Challenge (ILSVRC)
- 1.2 million training images
- 50,000 validation images
- 150,000 testing images
- Variable resolution images
Pre-process: down-sample to 256 $\times$ 256
Python test
- For a rectangular image, rescale it so that the shorter side has length 256, then crop out the central 256×256 patch from the resulting image
⭐️ Augmentation
- Train Data:
  
  We do this by extracting random 224 × 224 patches (and their horizontal reflections) from the 256×256 images and training our network on these extracted patches
  
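  Randomly crop $224\times224$ patches and flip them horizontally.
  Why does this enlarge the training set by a factor of 2048? Following Mu Li's video and my own understanding: a 224-pixel crop has 32 starting offsets along each of H and W (strictly 33, since $256-224+1=33$), which gives $32 \times 32=1024$ crops; adding the horizontally flipped copies gives $1024 \times 2=2048$.
- Test Data: 10-crop, see Supplementary Notes
  
  At test time, the network makes a prediction by extracting five 224 × 224 patches (the four corner patches and the center patch) as well as their horizontal reflections (hence ten patches in all), and averaging the predictions made by the network’s softmax layer on the ten patches.
- Train Data: PCA on the set of RGB pixel values 😕
  
  To each training image we add multiples of the found principal components, with magnitudes proportional to the corresponding eigenvalues times a random variable drawn from a Gaussian with mean zero and standard deviation 0.1. Therefore, to each RGB image pixel $I_{xy}=[I_{xy}^R,I_{xy}^G,I_{xy}^B]^\mathrm{T}$ we add the following quantity:
  $[P_1,P_2,P_3][\alpha_1\lambda_1,\alpha_2\lambda_2,\alpha_3\lambda_3]^\mathrm{T}$
  where $P_i$ and $\lambda_i$ are the $i$-th eigenvector and eigenvalue of the 3×3 covariance matrix of RGB pixel values, and $\alpha_i$ is the random variable mentioned above.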
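
The PCA color augmentation described above can be sketched in NumPy. This is only a minimal illustration, not the authors' code: the function names `rgb_pca` and `pca_color_augment` are made up for this note, random numbers stand in for real training pixels, and drawing each $\alpha_i$ once per image is my reading of the paper.

```python
import numpy as np


def rgb_pca(pixels):
    """Eigen-decompose the 3x3 covariance of an (N, 3) array of RGB values."""
    cov = np.cov(pixels, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)  # covariance is symmetric, so eigh applies
    return eigvecs, eigvals                 # columns of eigvecs are P_1, P_2, P_3


def pca_color_augment(image, eigvecs, eigvals, rng, sigma=0.1):
    """Add the same PCA-based RGB offset to every pixel of one (H, W, 3) image."""
    # One alpha per principal component, drawn once for the whole image (assumption).
    alpha = rng.normal(loc=0.0, scale=sigma, size=3)
    # [P_1, P_2, P_3] [alpha_1*lambda_1, alpha_2*lambda_2, alpha_3*lambda_3]^T
    offset = eigvecs @ (alpha * eigvals)
    return image + offset                   # the 3-vector broadcasts over H and W


rng = np.random.default_rng(0)
train_pixels = rng.random((10_000, 3))      # stand-in for real training-set pixels
P, lam = rgb_pca(train_pixels)
img = rng.random((224, 224, 3))             # stand-in for one training crop
aug = pca_color_augment(img, P, lam, rng)
```

Every pixel receives the same three-component shift, so the augmentation changes the overall color and intensity of the image rather than its content.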

Architecture

AlexNet

AlexNet Architecture

Eight learned layers: five convolutional and three fully-connected.

The output of the last fully-connected layer is fed to a 1000-way softmax which produces a distribution over the 1000 class labels.

Why ReLU

The standard way to model a neuron’s output $f$ as a function of its input $x$ is with $f(x) = \tanh(x)$ or $f(x) = (1 + e^{-x})^{-1}$. In terms of training time with gradient descent, these saturating nonlinearities are much slower than the non-saturating nonlinearity $f(x) = \max(0, x)$.
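
One way to see why saturation slows training is to compare the derivatives of the three functions for growing inputs. The sketch below is my own illustration (the sample inputs are arbitrary): the tanh and sigmoid gradients shrink toward zero, while the ReLU gradient stays at 1 for any positive input.

```python
import numpy as np


def tanh_grad(x):
    # d/dx tanh(x) = 1 - tanh(x)^2
    return 1.0 - np.tanh(x) ** 2


def sigmoid_grad(x):
    # d/dx sigmoid(x) = sigmoid(x) * (1 - sigmoid(x))
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s)


def relu_grad(x):
    # d/dx max(0, x) is 1 for positive inputs and 0 otherwise
    return (x > 0).astype(float)


x = np.array([0.5, 2.0, 5.0, 10.0])
for name, grad_fn in [("tanh", tanh_grad), ("sigmoid", sigmoid_grad), ("relu", relu_grad)]:
    print(name, grad_fn(x))
```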

Parallelization

The parallelization scheme that we employ essentially puts half of the kernels (or neurons) on each GPU.

The GPUs communicate only in certain layers. This means that, for example, the kernels of layer 3 take input from all kernel maps in layer 2. However, kernels in layer 4 take input only from those kernel maps in layer 3 which reside on the same GPU.
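
The "same GPU only" connectivity can be mimicked on one device by splitting the channels into two halves. The sketch below is a simplification under my own assumptions: it uses a 1×1 channel mixing in NumPy instead of a real convolution, and the names `mix_channels` and `two_gpu_layer` are hypothetical. It shows that changing the maps held by "GPU 1" leaves the outputs of "GPU 2" untouched.

```python
import numpy as np


def mix_channels(x, w):
    """1x1 'convolution': x has shape (C_in, H, W), w has shape (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", w, x)


def two_gpu_layer(x, w_gpu1, w_gpu2):
    """Each half of the kernels sees only its own half of the input maps."""
    half = x.shape[0] // 2
    out1 = mix_channels(x[:half], w_gpu1)   # kernels living on "GPU 1"
    out2 = mix_channels(x[half:], w_gpu2)   # kernels living on "GPU 2"
    return np.concatenate([out1, out2], axis=0)


rng = np.random.default_rng(0)
x = rng.standard_normal((8, 5, 5))          # 8 input maps, 4 per "GPU"
w1 = rng.standard_normal((6, 4))
w2 = rng.standard_normal((6, 4))
y = two_gpu_layer(x, w1, w2)

x_changed = x.copy()
x_changed[:4] += 1.0                        # perturb only the maps on "GPU 1"
y_changed = two_gpu_layer(x_changed, w1, w2)
```

In PyTorch the same restriction is usually expressed with the `groups=2` argument of `nn.Conv2d`; the torchvision model reproduced later in this post does not use it.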

😕 Local Response Normalization

$b_{x,y}^i=\frac{a_{x,y}^i}{\left(k+\alpha \sum\limits_{j=\max(0,i-n/2)}^{\min(N-1,i+n/2)}(a_{x,y}^j)^2\right)^\beta}$

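Local response normalization explained in detail

In a neural network the activation function applies a nonlinear mapping to each neuron's output. Traditional activations such as tanh and sigmoid have bounded ranges, whereas the output of ReLU is unbounded above, so the ReLU outputs are normalized. This is Local Response Normalization.

[Figure: feature maps produced by adjacent kernels, illustrating local response normalization]

In the figure, each rectangle is the feature map produced by one convolutional kernel. All pixels have already passed through ReLU, and we now normalize each pixel locally. Suppose the green arrow points to the map of the $i$-th kernel and the four blue arrows point to the maps of its neighboring kernels, and suppose the green pixel in the middle of the rectangle is at position $(x, y)$. The data needed for the normalization are then the pixel values at position $(x, y)$ in the neighboring kernels' maps, that is, the $a^j_{x,y}$ in the formula above. Square these neighboring values and sum them, multiply by the coefficient $\alpha$, add the constant $k$, and raise the result to the power $\beta$: that is the denominator. The numerator is the pixel value at $(x, y)$ in the $i$-th kernel's map. Seen this way, the formula is not that complicated.
The key question is how to choose $k$, $n$, $\alpha$, and $\beta$. The paper says they were determined on a validation set, with the final values:
$k = 2$ , $n = 5$ , $\alpha=10^{-4}$ , $\beta = 0.75$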
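
The normalization can be written out directly in NumPy. This is a minimal sketch with my own function name and a plain Python loop for readability; it sums over neighboring kernel maps $j$ at the same spatial position and uses the hyperparameters quoted above as defaults. Note that library implementations may scale $\alpha$ differently, so their outputs need not match this one exactly.

```python
import numpy as np


def local_response_norm(a, k=2.0, n=5, alpha=1e-4, beta=0.75):
    """Normalize activations a of shape (N, H, W) across the N kernel maps."""
    num_maps = a.shape[0]
    b = np.empty_like(a)
    for i in range(num_maps):
        lo = max(0, i - n // 2)                 # first neighboring map
        hi = min(num_maps - 1, i + n // 2)      # last neighboring map
        # Sum of squares over maps lo..hi at every spatial position (x, y).
        denom = (k + alpha * np.sum(a[lo:hi + 1] ** 2, axis=0)) ** beta
        b[i] = a[i] / denom
    return b


rng = np.random.default_rng(0)
acts = np.maximum(rng.standard_normal((8, 4, 4)), 0.0)  # ReLU outputs, 8 kernel maps
normed = local_response_norm(acts)
```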

Overlapping Pooling
- Stride $s = 2$ , window size $z = 3$
- Compared with $s = 2, z = 2$ , this reduces the top-1 and top-5 error rates by 0.4% and 0.3%, respectively
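
A small sketch makes the difference between the two settings concrete. The helper names are my own, and the input widths 55 and 7 are only examples: with $z=3, s=2$ neighboring windows share one pixel, with $z=2, s=2$ they do not, and both settings yield the same output size for a 55-wide input.

```python
def pool_output_size(size, z, s):
    """Output width of a pooling layer with window z and stride s (no padding)."""
    return (size - z) // s + 1


def pool_windows(size, z, s):
    """Half-open (start, stop) ranges covered by each pooling window along one axis."""
    return [(start, start + z) for start in range(0, size - z + 1, s)]


overlapping = pool_windows(7, z=3, s=2)       # windows share their border pixel
non_overlapping = pool_windows(7, z=2, s=2)   # windows tile without sharing pixels
same_size = pool_output_size(55, 3, 2) == pool_output_size(55, 2, 2)
```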

Reducing Overfitting

Data Augmentation: see Dataset
Dropout

This technique reduces complex co-adaptations of neurons, since a neuron cannot rely on the presence of particular other neurons. It is, therefore, forced to learn more robust features that are useful in conjunction with many different random subsets of the other neurons.

At test time, we use all the neurons but multiply their outputs by 0.5, which is a reasonable approximation to taking the geometric mean of the predictive distributions produced by the exponentially many dropout networks.
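
The train/test asymmetry can be sketched as follows. This is an illustration under my own assumptions (function names are hypothetical, and all-ones activations stand in for real hidden units): units are zeroed with probability 0.5 during training, and at test time every unit is kept but scaled by 0.5 so that the expected activation matches.

```python
import numpy as np


def dropout_train(h, rng, p_drop=0.5):
    """Zero each hidden unit independently with probability p_drop."""
    mask = rng.random(h.shape) >= p_drop
    return h * mask


def dropout_test(h, p_drop=0.5):
    """Keep every unit but scale its output by the keep probability."""
    return h * (1.0 - p_drop)


rng = np.random.default_rng(0)
h = np.ones(100_000)                        # stand-in hidden activations
train_mean = dropout_train(h, rng).mean()   # close to 0.5 on average
test_mean = dropout_test(h).mean()          # 0.5 by construction
```

Modern frameworks usually apply the scaling during training instead ("inverted dropout"), which leaves the test-time forward pass unchanged.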

Training

Optimizer: SGD with momentum 0.9 and weight decay 0.0005 (the weight decay value matters)
Epochs: about 90
Training time: five to six days on two NVIDIA GTX 580 3GB GPUs.
Batch size: 128
We initialized the weights in each layer from a zero-mean Gaussian distribution with standard deviation 0.01.
We initialized the neuron biases in the second, fourth, and fifth convolutional layers, as well as in the fully-connected hidden layers, with the constant 1.
This initialization accelerates the early stages of learning by providing the ReLUs with positive inputs.
We initialized the neuron biases in the remaining layers with the constant 0.
The learning rate was initialized at 0.01.
The heuristic which we followed was to divide the learning rate by 10 when the validation error rate stopped improving with the current learning rate.
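
The optimizer settings above can be turned into a short sketch. The update is written in the form I understand the paper to use, with weight decay scaled by the learning rate inside the velocity term. The plateau rule `maybe_decay_lr` and its `patience` parameter are hypothetical: the authors adjusted the learning rate by hand, so this is only one way to automate that heuristic.

```python
def sgd_momentum_step(w, v, grad, lr, momentum=0.9, weight_decay=0.0005):
    """One update of a single weight with momentum and weight decay."""
    v_next = momentum * v - weight_decay * lr * w - lr * grad
    w_next = w + v_next
    return w_next, v_next


def maybe_decay_lr(lr, val_errors, patience=3):
    """Divide lr by 10 if the last `patience` epochs did not beat the earlier best."""
    if len(val_errors) <= patience:
        return lr
    best_before = min(val_errors[:-patience])
    best_recent = min(val_errors[-patience:])
    return lr / 10 if best_recent >= best_before else lr


w, v = sgd_momentum_step(w=1.0, v=0.0, grad=2.0, lr=0.01)
lr_after_plateau = maybe_decay_lr(0.01, [0.5, 0.4, 0.4, 0.4, 0.4])
lr_while_improving = maybe_decay_lr(0.01, [0.5, 0.4, 0.3, 0.2, 0.1])
```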

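Discussion

Table 1 summarizes the results on ILSVRC-2010. The network achieves top-1 and top-5 test set error rates of 37.5% and 17.0%.

Validation and test error rates are used interchangeably below because, in the authors' experience, they differ by no more than 0.1% (see Table 2).

The CNN described in the paper achieves a top-5 error rate of 18.2%.
Averaging the predictions of five similar CNNs gives an error rate of 16.4%.
Training one CNN with an extra sixth convolutional layer over the last pooling layer to classify the entire ImageNet Fall 2011 release (15M images, 22K categories), and then fine-tuning it on ILSVRC-2012, gives an error rate of 16.6%.
Averaging the predictions of two CNNs pre-trained on the entire Fall 2011 release with the five CNNs above gives an error rate of 15.3%.
The second-best contest entry achieved an error rate of 26.2% by averaging the predictions of several classifiers trained on Fisher Vectors (FVs) computed from different types of densely sampled features.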

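Figure 3 shows the convolutional kernels learned by the network's two data-connected layers. The network has learned a variety of frequency- and orientation-selective kernels, as well as various colored blobs. The kernels on GPU 1 are largely color-agnostic, while the kernels on GPU 2 are largely color-specific.

Figure 4, left: the network is evaluated qualitatively by computing its top-5 predictions on eight test images. Even off-center objects, such as the mite in the top-left, are recognized by the network. Most of the top-5 labels appear reasonable; for example, only other types of cat are considered plausible labels for the leopard.
Figure 4, right: if two images produce feature activation vectors in the last 4096-dimensional hidden layer with a small Euclidean separation, the higher levels of the network consider them similar. The figure shows five images from the test set and, for each, the six training-set images most similar to it by this measure. At the pixel level, the retrieved training images are generally not close in L2 to the query images in the first column; for example, the retrieved dogs and elephants appear in a variety of poses.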
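
The retrieval in Figure 4 (right) amounts to a nearest-neighbor search in feature space. The sketch below uses random vectors in place of real 4096-dimensional activations, and the function name is my own; with a real network the features would come from the last hidden layer.

```python
import numpy as np


def nearest_training_images(query_feat, train_feats, top_k=6):
    """Indices and distances of the top_k training features closest in Euclidean distance."""
    dists = np.linalg.norm(train_feats - query_feat, axis=1)
    order = np.argsort(dists)[:top_k]
    return order, dists[order]


rng = np.random.default_rng(0)
train_feats = rng.standard_normal((1000, 4096))   # stand-in for training-set features
# A query that is a slightly perturbed copy of training item 42.
query_feat = train_feats[42] + 0.01 * rng.standard_normal(4096)
idx, dist = nearest_training_images(query_feat, train_feats)
```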

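Supplementary Notes

Multi-crop

😆 Multi-crop is used mainly at test time

1-crop, 10-crop

As the names suggest, 1-crop and 10-crop mean cropping once and ten times. For example, suppose the input image is 256×256 and the network expects 224×224 inputs.

1-crop takes a single 224×224 crop from the center of the 256×256 image and feeds it to the network. 10-crop first takes the 224×224 center crop, then a crop anchored at the top-left corner (224 pixels across and 224 pixels down from the corner), and likewise one crop each at the top-right, bottom-left, and bottom-right corners. That gives five 224×224 images; repeating the procedure on the mirrored image gives ten in total.

k-crop is a form of data augmentation. 10-crop is typically used when the dataset is small, and it is applied to the whole dataset, which is in turn split into training, validation, and test sets in some proportion. The heavy computational cost is a general problem, which is why deep learning problems are also compute problems.

10-crop on test images

There is a technique called 10-crop. The basic idea: crop the central region of the image and run it through the classifier; then do the same for the top-left region, the top-right (shown in green), the bottom-left (shown in yellow), and the bottom-right (shown in orange); then repeat all of this on the mirrored image. In other words, take the center crop and the four corner crops together with their mirrored versions, run these ten images through the classifier, and average the results. Multi-crop does not use much extra memory, but it does slow down inference.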
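
The crop positions and the final averaging can be written down explicitly. This is a sketch with my own helper names; boxes use the (left, upper, right, lower) convention of PIL's `Image.crop`, and random numbers stand in for the network's softmax outputs.

```python
import numpy as np


def five_crop_boxes(size=256, crop=224):
    """(left, upper, right, lower) boxes for the four corner crops and the center crop."""
    margin = size - crop
    c = margin // 2
    return {
        "top_left": (0, 0, crop, crop),
        "top_right": (margin, 0, size, crop),
        "bottom_left": (0, margin, crop, size),
        "bottom_right": (margin, margin, size, size),
        "center": (c, c, c + crop, c + crop),
    }


def average_crop_predictions(probs):
    """Average per-crop class probabilities of shape (ncrops, num_classes)."""
    return probs.mean(axis=0)


boxes = five_crop_boxes()
num_crops = 2 * len(boxes)                  # each box is also taken from the mirrored image

rng = np.random.default_rng(0)
scores = rng.random((num_crops, 1000))      # stand-in for per-crop network outputs
probs = scores / scores.sum(axis=1, keepdims=True)
final = average_crop_predictions(probs)
```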

Summary: using multi-crop

# @file name  : test.py
# @brief      : using multi-crop (TenCrop) at inference time
# @author     : liupc
# @date       : 2021/7/17
 
import torch
from PIL import Image
import torchvision.models as models
import torchvision.transforms as transforms
 
norm_mean = [0.485, 0.456, 0.406]  # per-channel RGB mean; values lie in [0, 1] because pixels are divided by 255
norm_std = [0.229, 0.224, 0.225]   # per-channel RGB standard deviation
normalizes = transforms.Normalize(norm_mean, norm_std)
inference_transform = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.TenCrop(224, vertical_flip=False),
    transforms.Lambda(lambda crops: torch.stack([normalizes(transforms.ToTensor()(crop)) for crop in crops])),
])
 
 
path_img = "./data/images/tiger cat.jpg"
img_rgb = Image.open(path_img).convert('RGB')
img_tensor = inference_transform(img_rgb)       # tensor of shape (ncrops, C, H, W)
print(img_tensor.size())                        # [10, 3, 224, 224]
 
img_tensor.unsqueeze_(0)                        # add a batch dimension, mimicking one batch from a DataLoader
print(img_tensor.size())                        # [1, 10, 3, 224, 224]
 
model = models.alexnet(pretrained = False)      # randomly initialized weights, so the predictions are not meaningful
model.eval()
 
b, ncrops, c, h, w = img_tensor.size()
with torch.no_grad():                           # inference only, so no gradients are needed
    outputs = model(img_tensor.view(-1, c, h, w))  # the model expects 4-D BCHW input, so b and ncrops are merged into one batch
print(outputs.size())                          # [10, 1000]: one row of class scores per crop
outputs = outputs.view(b, ncrops, -1).mean(1)  # split back into (b, ncrops, 1000) and average over the crops, giving b predictions

Why images are randomly cropped when training deep learning models

On single crop / multiple crops

Code resources

AlexNet implemented in PyTorch (with training loop)

PyTorch AlexNet inference example

import torch
model = torch.hub.load('pytorch/vision:v0.10.0', 'alexnet', pretrained=True)
model.eval()

# Download an example image from the pytorch website
import urllib.request
url, filename = ("https://github.com/pytorch/hub/raw/master/images/dog.jpg", "dog.jpg")
urllib.request.urlretrieve(url, filename)

# sample execution (requires torchvision)
from PIL import Image
from torchvision import transforms
input_image = Image.open(filename)
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
input_tensor = preprocess(input_image)
input_batch = input_tensor.unsqueeze(0) # create a mini-batch as expected by the model

# move the input and model to GPU for speed if available
if torch.cuda.is_available():
    input_batch = input_batch.to('cuda')
    model.to('cuda')

with torch.no_grad():
    output = model(input_batch)
# Tensor of shape 1000, with confidence scores over ImageNet's 1000 classes
print(output[0])
# The output has unnormalized scores. To get probabilities, you can run a softmax on it.
probabilities = torch.nn.functional.softmax(output[0], dim=0)
print(probabilities)

PyTorch AlexNet architecture code

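import torch
import torch.nn as nn


class AlexNet(nn.Module):
    def __init__(self, num_classes: int = 1000, dropout: float = 0.5) -> None:
        super().__init__()
        # torchvision's source also calls its internal _log_api_usage_once(self) here; it is omitted so the class runs standalone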
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=11, stride=4, padding=2),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(64, 192, kernel_size=5, padding=2),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(192, 384, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(384, 256, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
        )
        self.avgpool = nn.AdaptiveAvgPool2d((6, 6))
        self.classifier = nn.Sequential(
            nn.Dropout(p=dropout),
            nn.Linear(256 * 6 * 6, 4096),
            nn.ReLU(inplace=True),
            nn.Dropout(p=dropout),
            nn.Linear(4096, 4096),
            nn.ReLU(inplace=True),
            nn.Linear(4096, num_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.features(x)
        x = self.avgpool(x)
        x = torch.flatten(x, 1)
        x = self.classifier(x)
        return x
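
To check where the `256 * 6 * 6` in the first linear layer comes from, the spatial size can be traced through the `features` stack by hand. The sketch below is my own helper code, assuming a 224×224 input and the kernel, stride, and padding values listed in the class above.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output width of a convolution or pooling layer along one axis."""
    return (size + 2 * padding - kernel) // stride + 1


def alexnet_feature_size(size=224):
    """Trace the spatial size through the features stack of the model above."""
    size = conv_out(size, kernel=11, stride=4, padding=2)  # conv1: 224 -> 55
    size = conv_out(size, kernel=3, stride=2)              # pool1: 55 -> 27
    size = conv_out(size, kernel=5, padding=2)             # conv2: 27 -> 27
    size = conv_out(size, kernel=3, stride=2)              # pool2: 27 -> 13
    for _ in range(3):                                     # conv3 to conv5 keep 13
        size = conv_out(size, kernel=3, padding=1)
    size = conv_out(size, kernel=3, stride=2)              # pool3: 13 -> 6
    return size


side = alexnet_feature_size()
flat_features = 256 * side * side   # channels x height x width fed to the classifier
```

For a 224×224 input the adaptive average pooling layer therefore receives a 6×6 map already and leaves its size unchanged.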
