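Paper close reading (Mu Li): "ResNet, which holds up half of computer vision" [Paper Close Reading], bilibili
Dive into Deep Learning (Mu Li, PDF): zh-v2.d2l.ai/d2l-zh-pytorch.pdf
Dive into Deep Learning (Mu Li, video): "29 Residual Networks (ResNet)" [Dive into Deep Learning v2], bilibili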
He, K. et al. (2016) 'Deep Residual Learning for Image Recognition', 2016 IEEE Conference on Computer Vision and Pattern Recognition, doi: 10.1109/CVPR.2016.90
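Contents
1. ResNet
1.1. JSON
(1)JSON: JavaScript Object Notation, a syntax for storing and exchanging text data; it is a text format that is fully independent of any programming language
(2)Basic format
"""Example 1"""
{
"key":"value", # a unique key and its value (the # notes are annotations only; JSON itself has no comments)
"key2":"value2"
}
"""Example 2: writing it on one line is equivalent"""
{"key3":"value3","key4":"value4"}
"""Example 3: the value can be of any JSON type"""
{"key5":123.456}
"""Example 4: the value can itself be a JSON object"""
{"key6":{"name":"Sherlily","gender":"female"}}
"""Example 5: arrays"""
{"key7":[1,2,3,4],
"key8":["a","b","c"],
"key9":[{},{},{}]} # each {} holds an object like the one in Example 4
"""Example 6: null (empty value)"""
{"key":null}
To read a value in JavaScript, dot notation is enough, e.g. key6.name, and array elements are indexed as key7[0]. Other programming languages may provide different ways of accessing values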
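As a small illustration of value access outside JavaScript, here is a minimal sketch using Python's standard json module. The document text is made up for the demo and simply mirrors the examples above; in Python, objects become dicts, so keys are looked up with square brackets rather than dots.

```python
import json

# A document shaped like the examples above (valid JSON has no comments).
text = '''
{
  "key5": 123.456,
  "key6": {"name": "Sherlily", "gender": "female"},
  "key7": [1, 2, 3, 4],
  "key": null
}
'''

data = json.loads(text)          # JSON object -> Python dict

name = data["key6"]["name"]      # nested object: chained key lookups
first = data["key7"][0]          # array: zero-based index
empty = data["key"]              # JSON null -> Python None

print(name, first, empty)
```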
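1.2. The re library
(1)re is the regular-expression module in the Python standard library
(2)For details of its functions and features, see: "A detailed guide to re, Python's standard regular-expression library" - Zhihu (zhihu.com)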
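A minimal sketch of how re might be used in a project like this one, for instance to pull a class label out of image file names. The file names and the naming scheme are assumptions invented for the illustration, not something taken from the actual dataset.

```python
import re

# Hypothetical file names; the "<label>_<index>.<ext>" scheme is an assumption.
filenames = ["cat_001.jpg", "dog_042.png", "notes.txt"]

# Group 1: leading lowercase letters (label); group 2: digits (index).
pattern = re.compile(r"^([a-z]+)_(\d+)\.(?:jpg|png)$")

parsed = []
for fname in filenames:
    match = pattern.match(fname)
    if match:  # match is None when the name does not fit the pattern
        label, index = match.group(1), int(match.group(2))
        parsed.append((label, index))

print(parsed)
```

Compiling the pattern once with re.compile avoids re-parsing it on every loop iteration.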
1.3. The tqdm library
(1)tqdm is a third-party Python library that can be imported to display a dynamic progress bar
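A minimal sketch of the typical usage: wrap any iterable in tqdm(...) and loop over it as usual. The sleep call is only a stand-in for real work such as one training batch, and the fallback branch is an addition so the snippet still runs where tqdm is not installed.

```python
import time

try:
    from tqdm import tqdm
except ImportError:
    # Fallback so the sketch still runs without tqdm installed (no bar shown).
    def tqdm(iterable, **kwargs):
        return iterable

total = 0
for step in tqdm(range(100), desc="demo loop"):
    time.sleep(0.001)   # stand-in for real work, e.g. one training batch
    total += step

print(total)
```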
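1.4. Overall implementation steps
(1)Splitting the data:
①Put the images of each class in their own folder
②Set the ratio between the validation set and the whole dataset, e.g. split_rate = 0.2
③Write a loop that first copies out a random sample of images for the val set (val is the smaller one, so it goes first), then puts the remaining images in train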
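The three steps above can be sketched roughly as follows. This is one possible version, not the code actually used in these notes: the function name split_dataset, the train/val folder layout, and the seed parameter are all assumptions made for the illustration.

```python
import os
import random
import shutil


def split_dataset(src_dir, dst_dir, split_rate=0.2, seed=0):
    """Copy images from src_dir/<class>/ into dst_dir/{train,val}/<class>/."""
    rng = random.Random(seed)
    counts = {}
    for cls in sorted(os.listdir(src_dir)):
        cls_dir = os.path.join(src_dir, cls)
        if not os.path.isdir(cls_dir):
            continue
        images = sorted(os.listdir(cls_dir))
        # Pick the validation images first, at random; the rest go to train.
        val_images = set(rng.sample(images, int(len(images) * split_rate)))
        for subset in ("train", "val"):
            os.makedirs(os.path.join(dst_dir, subset, cls), exist_ok=True)
        for img in images:
            subset = "val" if img in val_images else "train"
            shutil.copy(os.path.join(cls_dir, img),
                        os.path.join(dst_dir, subset, cls, img))
        counts[cls] = (len(images) - len(val_images), len(val_images))
    return counts
```

Calling split_dataset("data/raw", "data/split") (hypothetical paths) would return a dict mapping each class to its (train, val) counts. Files are copied rather than moved, so the original folders stay intact.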
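(2)Building the model (sigh, the details still feel hard to grasp and hard to organize... life is hard)
①Import the libraries
②Define the plain block class, the residual block class, and the ResNet class
(3)Training
①Training failed as soon as it started, with OMP: Error #15: Initializing libiomp5md.dll, but found libiomp5md.dll already initialized. As in Jupyter Notebook, the workaround is to add the following right below the imports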
import os
os.environ['KMP_DUPLICATE_LIB_OK']='True'
②Write the training function
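1.2. ResNet ideas
(1)Build nested function classes, so that a new, more complex model always contains the older, simpler model. This keeps the model from drifting away from the optimum (a good solution may well be one of intermediate complexity)
(2)With residual blocks, the newly computed output f(x) of each block is added to the earlier output x to form the block's output (x need not be the original input; it can be the output of a preceding layer)
⭐In Dive into Deep Learning, the residual block fits f(x)-x, and x is added back at the end to give f(x). This is not set in stone: the coefficient on x can be changed to suit the situation
(3)Residual blocks make very deep networks much easier to train (even a thousand layers is possible) (I have read that they help with vanishing gradients, but have not looked into the details yet)
(4)ResNet uses up to 152 layers, which is very deep: 8 times deeper than VGG nets, yet with lower complexity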
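One common way to see why the shortcut helps gradients flow, sketched under the simplifying assumption that a block computes y = F(x) + x with no activation after the addition (L is the loss, I the identity matrix):

```latex
y = F(x) + x
\quad\Longrightarrow\quad
\frac{\partial L}{\partial x}
= \frac{\partial L}{\partial y}\left(\frac{\partial F}{\partial x} + I\right)
= \frac{\partial L}{\partial y}\,\frac{\partial F}{\partial x}
+ \frac{\partial L}{\partial y}
```

The second term passes the upstream gradient through unchanged, so even when the gradient through F is small, the signal reaching earlier layers does not vanish on that account alone.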
1.3. ResNet code implementation
(1)ResNet itself (the residual block)
import torch
from torch import nn
from torch.nn import functional as F
from d2l import torch as d2l
class Residual(nn.Module):
    def __init__(self, input_channels, num_channels,
                 use_1x1conv=False, strides=1):
        super().__init__()
        # Two 3x3 convolutions form the residual function f(x)
        self.conv1 = nn.Conv2d(input_channels, num_channels,
                               kernel_size=3, padding=1, stride=strides)
        self.conv2 = nn.Conv2d(num_channels, num_channels,
                               kernel_size=3, padding=1)
        # Optional 1x1 convolution so the shortcut matches the shape of f(x)
        if use_1x1conv:
            self.conv3 = nn.Conv2d(input_channels, num_channels,
                                   kernel_size=1, stride=strides)
        else:
            self.conv3 = None
        self.bn1 = nn.BatchNorm2d(num_channels)
        self.bn2 = nn.BatchNorm2d(num_channels)

    def forward(self, X):
        Y = F.relu(self.bn1(self.conv1(X)))
        Y = self.bn2(self.conv2(Y))
        if self.conv3:
            X = self.conv3(X)
        Y += X  # add the shortcut before the final ReLU
        return F.relu(Y)
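A quick shape check for the block, as a sketch. The class is restated in compact form only so that the snippet runs on its own, and the input size (batch of 4, 3 channels, 6×6) is an arbitrary choice for the demo. It shows the two cases: an identity shortcut that keeps the shape, and a 1×1-convolution shortcut that doubles the channels while halving height and width.

```python
import torch
from torch import nn
from torch.nn import functional as F


class Residual(nn.Module):
    """Same block as above, restated so this snippet runs on its own."""

    def __init__(self, input_channels, num_channels,
                 use_1x1conv=False, strides=1):
        super().__init__()
        self.conv1 = nn.Conv2d(input_channels, num_channels,
                               kernel_size=3, padding=1, stride=strides)
        self.conv2 = nn.Conv2d(num_channels, num_channels,
                               kernel_size=3, padding=1)
        self.conv3 = (nn.Conv2d(input_channels, num_channels,
                                kernel_size=1, stride=strides)
                      if use_1x1conv else None)
        self.bn1 = nn.BatchNorm2d(num_channels)
        self.bn2 = nn.BatchNorm2d(num_channels)

    def forward(self, X):
        Y = F.relu(self.bn1(self.conv1(X)))
        Y = self.bn2(self.conv2(Y))
        if self.conv3 is not None:
            X = self.conv3(X)
        return F.relu(Y + X)


X = torch.rand(4, 3, 6, 6)  # (batch, channels, height, width)

same = Residual(3, 3)                               # identity shortcut
down = Residual(3, 6, use_1x1conv=True, strides=2)  # 1x1-conv shortcut

same_shape = tuple(same(X).shape)
down_shape = tuple(down(X).shape)
print(same_shape, down_shape)
```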
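1.4. Drawbacks and limitations of ResNet
(1)ResNet looks very deep, but the number of layers that actually do the work is not that large; most layers serve to keep the model from degrading and the error from growing too large. Moreover, residual connections cannot fully solve vanishing or exploding gradients or network degradation; they only mitigate them (source: "ResNet: the most detailed explanation ever" - Zhihu (zhihu.com))
(2)Training and inference need a lot of computing resources, especially when the network is deep
(3)In some cases ResNet may overfit and needs regularization or similar techniques (source for 2 and 3: "Pros and cons of ResNet", haozhian's blog - CSDN blog)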
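2. Studying the original ResNet paper
2.1. Abstract
(1)Presents the difficulty of training very deep neural networks and introduces a residual learning framework; their net is very deep (up to 152 layers, 8 times deeper than VGG nets) yet has lower complexity
(2)Stresses the importance of depth and reports their results: 1st place in the ILSVRC 2015 classification task, and a 28% relative improvement on the COCO object detection dataset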
2.2. Introduction
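(1)Introduces deep CNNs and how they integrate low-, mid-, and high-level features
(2)Deep networks used to be hampered by vanishing/exploding gradients. This has largely been addressed by normalized initialization and intermediate normalization layers, which let networks with tens of layers start converging under SGD with backpropagation
(3)However, simply stacking more layers exposes a degradation problem: accuracy saturates and then degrades rapidly, and the deeper network has higher training error, so overfitting is not the cause
(4)They consider a deeper model built by construction: the added layers are identity mappings and the other layers are copied from the shallower model. Such a model should have training error no higher than its shallower counterpart, yet current solvers fail to find solutions that good
(5)⭐Propose a residual network and mention identity mapping (an identity mapping sends every element to itself; I picture it as any set of points on y=x? I don't quite get its significance yet)
(6)The identity "x" is carried by a shortcut connection that skips one or more layers
(7)List the advantages of residual networks
(8)They explore models with over 100 and even over 1000 layers to demonstrate how well residual networks work
(9)Report their competition wins: 1st place in ILSVRC 2015 classification, plus 1st places in the ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation tasks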
2.3. Related Work
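(1)Encoding residual vectors is more effective than encoding the original vectors
(2)Notes that the Multigrid method and hierarchical basis preconditioning are used for solving partial differential equations (I don't know the Multigrid method or hierarchical basis preconditioning mentioned here)
(3)Shortcut connections have been studied for a long time; for example, connecting intermediate layers directly to auxiliary classifiers has been used to address vanishing/exploding gradients
(4)Mention highway networks and gating functions (what are these?)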
2.4. Deep Residual Learning
(1)Residual Learning
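①My first guess was that x is subtracted after passing through the layers. As I now read it: if the desired underlying mapping is H(x), the stacked layers are asked to fit the residual F(x) := H(x) - x, so the original mapping is recast as F(x) + x
②If an identity mapping is optimal, the solver can simply drive the weights of the stacked layers toward zero to approach it
③Zero mapping vs. identity mapping: identity mappings are unlikely to be exactly optimal in practice, but if the optimal function is closer to an identity mapping than to a zero mapping, it is easier to learn the perturbation relative to identity than to learn the function from scratch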
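A toy sketch of the zero-mapping vs. identity-mapping point, in plain Python with no deep learning library. The two-layer form F(x) = W2·relu(W1·x) and the omission of the activation after the addition are simplifying assumptions for the demo. With all weights at zero, the residual block returns its input unchanged, while the same layers without a shortcut return zeros.

```python
def linear(weights, x):
    """Matrix-vector product for a list-of-rows weight matrix."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def relu(x):
    return [max(0.0, v) for v in x]


def residual_block(w1, w2, x):
    """y = F(x) + x with F(x) = W2 * relu(W1 * x); no activation after the sum."""
    fx = linear(w2, relu(linear(w1, x)))
    return [f + v for f, v in zip(fx, x)]


def plain_block(w1, w2, x):
    """Same two layers without the shortcut: y = F(x)."""
    return linear(w2, relu(linear(w1, x)))


x = [1.0, -2.0, 3.0]
zeros = [[0.0] * 3 for _ in range(3)]

print(residual_block(zeros, zeros, x))  # shortcut passes x through unchanged
print(plain_block(zeros, zeros, x))     # zero weights give a zero mapping
```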
(2)Identity Mapping by Shortcuts
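①Introduce the formula of the building block
②Identity shortcut connections are quite convenient: they add neither extra parameters nor computational complexity (apart from the negligible element-wise addition)
③⭐The dimensions of x and F must match; if they do not, multiply x by a linear projection W_s so that x matches F
④The form of the residual function F is flexible (e.g. two or three layers); with a single layer the block resembles a plain linear layer, for which they observed no advantage
⑤The notation above is written for fully connected layers for simplicity, but it also applies to convolutional layers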
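For reference, the formulas these notes refer to, written in the paper's notation (x and y are the block's input and output, W_i the weights of the stacked layers, W_s the projection on the shortcut, σ the ReLU, biases omitted):

```latex
y = \mathcal{F}(x, \{W_i\}) + x
\qquad \text{(dimensions of } x \text{ and } \mathcal{F} \text{ match)}

y = \mathcal{F}(x, \{W_i\}) + W_s x
\qquad \text{(dimensions differ; } W_s \text{ is a linear projection)}

\mathcal{F} = W_2\,\sigma(W_1 x)
\qquad \text{(two-layer example)}
```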
(3)Network Architectures
①Plain Network: "if the feature map size is halved, the number of filters is doubled so as to preserve the time complexity per layer". Moreover, their model has fewer filters and lower complexity than VGG nets
②Residual Network: when the dimension of x is smaller than that of f(x), either pad x with extra zero entries, or use the W above to project x to the right dimension
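The "halve the map, double the filters" rule can be checked with a little arithmetic. This sketch counts multiply-accumulates for a single 3×3 convolution; the helper name conv_macs is made up for the illustration, and the layer sizes (56×56 with 64 channels, then 28×28 with 128 channels) follow two consecutive stages of the paper's 34-layer network on 224×224 input.

```python
def conv_macs(height, width, in_channels, out_channels, kernel=3):
    """Multiply-accumulates of one conv layer producing a height x width map."""
    return height * width * in_channels * out_channels * kernel * kernel


stage_a = conv_macs(56, 56, 64, 64)
stage_b = conv_macs(28, 28, 128, 128)   # map halved, filters doubled

print(stage_a, stage_b)
```

Halving height and width divides the cost by 4, while doubling both input and output channels multiplies it by 4, so the per-layer cost stays the same.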
2.5. Experiments
(1)ImageNet Classification
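①Plain Networks: they evaluate 18-layer and 34-layer plain networks and find that the 34-layer net has higher error than the 18-layer net. However, this is unlikely to be caused by vanishing gradients; they conjecture that deep plain nets may have exponentially low convergence rates
②Residual Networks: with residual learning, adding layers no longer increases the error rate (the 34-layer ResNet is better than the 18-layer one). In addition, ResNet converges much faster at the early stage
③Identity vs. Projection Shortcuts: projection shortcuts are slightly better than zero-padding, but the differences are small, so projections are not essential for addressing the degradation problem
④Deeper Bottleneck Architectures: for f(x), they use a stack of 3 layers (1×1, 3×3, and 1×1 convolutions; the 1×1 layers reduce and then restore the dimensions, leaving the 3×3 layer as a bottleneck with smaller input/output dimensions)
⑤50-layer ResNet: each 2-layer block of the 34-layer net is replaced with the 3-layer bottleneck block
⑥101-layer and 152-layer ResNet: built with more 3-layer blocks; even the 152-layer ResNet still has lower complexity than VGG-16/19
⑦Comparisons with State-of-the-art Methods: the 152-layer model is the best single model, and an ensemble reaches 3.57% top-5 error on the test set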
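A rough sketch of the bottleneck design from note ④, written in PyTorch. It is not the paper's or torchvision's exact implementation: the class name Bottleneck, the attribute names, and the restriction to an identity shortcut with stride 1 are all simplifications chosen for the illustration. The channel sizes (256 reduced to 64, then restored to 256) follow the example in the paper's bottleneck figure.

```python
import torch
from torch import nn
from torch.nn import functional as F


class Bottleneck(nn.Module):
    """1x1 reduce -> 3x3 -> 1x1 restore, plus an identity shortcut."""

    def __init__(self, channels, mid_channels):
        super().__init__()
        self.reduce = nn.Conv2d(channels, mid_channels, kernel_size=1)
        self.conv = nn.Conv2d(mid_channels, mid_channels,
                              kernel_size=3, padding=1)
        self.restore = nn.Conv2d(mid_channels, channels, kernel_size=1)
        self.bn1 = nn.BatchNorm2d(mid_channels)
        self.bn2 = nn.BatchNorm2d(mid_channels)
        self.bn3 = nn.BatchNorm2d(channels)

    def forward(self, X):
        Y = F.relu(self.bn1(self.reduce(X)))   # shrink the channel dimension
        Y = F.relu(self.bn2(self.conv(Y)))     # 3x3 conv on the narrow tensor
        Y = self.bn3(self.restore(Y))          # expand back to input width
        return F.relu(Y + X)                   # identity shortcut


block = Bottleneck(channels=256, mid_channels=64)
X = torch.rand(2, 256, 8, 8)
out_shape = tuple(block(X).shape)
print(out_shape)
```

Because the expensive 3×3 convolution only sees 64 channels, the block stays cheap even though its input and output are 256 channels wide.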
(2)CIFAR-10 and Analysis
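①They intentionally use simple architectures on CIFAR-10 (a small dataset: 50k training and 10k test images in 10 classes), since the focus is the behavior of extremely deep networks rather than state-of-the-art results
②The inputs are 32×32 images with the per-pixel mean subtracted
③They use a weight decay of 0.0001 and a momentum of 0.9, with no dropout. Starting from a learning rate of 0.1, they divide it by 10 at 32k and 48k iterations and stop training at 64k iterations
④For plain nets, the deeper the network is, the higher its error rate
⑤With ResNets, however, the optimization difficulty is overcome and the error rate decreases as depth increases
⑥For the 110-layer net, they found that a learning rate of 0.1 is slightly too large to start converging, so they warm up with 0.01 until the training error is below 80%, then go back to 0.1
⑦Analysis of Layer Responses: ResNets generally have smaller responses than their plain counterparts, which suggests the residual functions are generally closer to zero than non-residual functions
⑧Exploring Over 1000 layers: the 1202-layer network trains without optimization difficulty but tests worse than the 110-layer network, which they attribute to overfitting. Stronger regularization may improve the results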
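The learning-rate rules in the notes above can be sketched as two small functions. This is an illustration only: the function names cifar_lr and warmup_lr and the warmed_up flag are invented here, and the paper does not publish code for this.

```python
def cifar_lr(iteration, base_lr=0.1, milestones=(32_000, 48_000), factor=0.1):
    """Step schedule: multiply the rate by `factor` at each milestone passed."""
    lr = base_lr
    for milestone in milestones:
        if iteration >= milestone:
            lr *= factor
    return lr


def warmup_lr(iteration, train_error, warmed_up):
    """110-layer variant: 0.01 until training error < 80%, then the schedule.

    `warmed_up` is a flag the caller keeps, so the rate never drops back
    to 0.01 once the warm-up phase has ended.
    """
    if not warmed_up and train_error >= 0.8:
        return 0.01, False
    return cifar_lr(iteration), True


for it in (0, 32_000, 48_000):
    print(it, cifar_lr(it))
```

In a training loop, the caller would pass the current training error to warmup_lr at each step and feed the returned flag back in on the next call.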
(3)Object Detection on PASCAL and MS COCO
①The method shows good generalization performance on these detection tasks
②Describes how they won 1st place in the ILSVRC & COCO 2015 competitions