A Survey of Basic Computer Vision Tasks
Preface: I did not encounter deep learning as an undergraduate and only know some traditional image-processing methods. Having just started studying computer vision, I want to write up a summary of what I learn; if there are problems, please point them out. For ease of organization, parts of this article are copied from related blog posts online; if this infringes your rights, please contact me and I will remove them.
Computer vision has four basic tasks: classification, object detection, semantic segmentation, and instance segmentation.
a) Classification
Image classification asks, given an image, which categories it contains; for example, in Figure 1(a) the image is found to contain a bottle, a cup, and a cube.
Classification networks to be covered (ILSVRC winners and runners-up over the years): LeNet, AlexNet (2012 winner), VGG (2014 runner-up), GoogLeNet (2014 winner), ResNet (2015 winner), DenseNet.
1. LeNet-5: proposed in 1998 by LeCun, the founding father of convolutional neural networks, to solve the visual task of handwritten digit recognition. Since then, the most basic CNN architecture has been fixed: convolutional layers, pooling layers, and fully connected layers. Architecture: conv1 (6) -> pool1 -> conv2 (16) -> pool2 -> fc3 (120) -> fc4 (84) -> fc5 (10) -> softmax. The 5 in the name indicates its 5 conv/fc layers.
Innovation: defined the basic components of the CNN; the ancestor of CNNs.
LeNet PyTorch implementation, applied to CIFAR-10:
import torch.nn as nn
import torch.nn.functional as func

class LeNet(nn.Module):
    def __init__(self):
        super(LeNet, self).__init__()
        self.conv1 = nn.Conv2d(3, 6, kernel_size=5)
        self.conv2 = nn.Conv2d(6, 16, kernel_size=5)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)
<span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">):</span>
<span class="n">x</span> <span class="o">=</span> <span class="n">func</span><span class="o">.</span><span class="n">relu</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">conv1</span><span class="p">(</span><span class="n">x</span><span class="p">))</span>
<span class="n">x</span> <span class="o">=</span> <span class="n">func</span><span class="o">.</span><span class="n">max_pool2d</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="mi">2</span><span class="p">)</span>
<span class="n">x</span> <span class="o">=</span> <span class="n">func</span><span class="o">.</span><span class="n">relu</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">conv2</span><span class="p">(</span><span class="n">x</span><span class="p">))</span>
<span class="n">x</span> <span class="o">=</span> <span class="n">func</span><span class="o">.</span><span class="n">max_pool2d</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="mi">2</span><span class="p">)</span>
<span class="n">x</span> <span class="o">=</span> <span class="n">x</span><span class="o">.</span><span class="n">view</span><span class="p">(</span><span class="n">x</span><span class="o">.</span><span class="n">size</span><span class="p">(</span><span class="mi">0</span><span class="p">),</span> <span class="o">-</span><span class="mi">1</span><span class="p">)</span>
<span class="n">x</span> <span class="o">=</span> <span class="n">func</span><span class="o">.</span><span class="n">relu</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">fc1</span><span class="p">(</span><span class="n">x</span><span class="p">))</span>
<span class="n">x</span> <span class="o">=</span> <span class="n">func</span><span class="o">.</span><span class="n">relu</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">fc2</span><span class="p">(</span><span class="n">x</span><span class="p">))</span>
<span class="n">x</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">fc3</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
<span class="k">return</span> <span class="n">x</span></code></pre></div><p data-pid="LWXkiMVx">2.AlexNet:在ILSVRC 2012夺得冠军,并领先第二名10.9个百分点。网络结构:conv1 (96) -> pool1 -> conv2 (256) -> pool2 -> conv3 (384) -> conv4 (384) -> conv5 (256) -> pool5 -> fc6 (4096) -> fc7 (4096) -> fc8 (1000) -> softmax</p><figure data-size="normal"><noscript><img src="https://pic1.zhimg.com/v2-9d6827da91e857836f6ffdef99cb9a54_b.jpg" data-caption="" data-size="normal" data-rawwidth="1074" data-rawheight="517" class="origin_image zh-lightbox-thumb" width="1074" data-original="https://pic1.zhimg.com/v2-9d6827da91e857836f6ffdef99cb9a54_r.jpg"/></noscript><img src="https://pic1.zhimg.com/80/v2-9d6827da91e857836f6ffdef99cb9a54_720w.jpg" data-caption="" data-size="normal" data-rawwidth="1074" data-rawheight="517" class="origin_image zh-lightbox-thumb lazy" width="1074" data-original="https://pic1.zhimg.com/v2-9d6827da91e857836f6ffdef99cb9a54_r.jpg" data-actualsrc="https://pic1.zhimg.com/v2-9d6827da91e857836f6ffdef99cb9a54_b.jpg" data-lazy-status="ok"></figure><p data-pid="ErXlk4Nw">创新点:(1)<b>更深</b>的网络。(2)使用了<b>ReLU</b>激活函数,使之有更好的梯度特性、训练更快。(3)大量使用<b>数据增广</b>技术,防止过拟合。(4)让人们意识到利用<b>GPU加速</b>训练。(5)使用了<b>随机失活(dropout)</b>,该方法通过让全连接层的神经元(该模型在前两个全连接层引入Dropout)以一定的概率失去活性(比如0.5)失活的神经元不再参与前向和反向传播,相当于约有一半的神经元不再起作用。在测试的时候,让所有神经元的输出乘0.5。Dropout的引用,有效缓解了模型的过拟合。</p><p data-pid="bTPegVpk">AlexNet torch 实现,应用于cifar-10</p><div class="highlight"><pre><code class="language-python"><span class="n">NUM_CLASSES</span> <span class="o">=</span> <span class="mi">10</span>
AlexNet PyTorch implementation, applied to CIFAR-10:

import torch.nn as nn

NUM_CLASSES = 10

class AlexNet(nn.Module):
    def __init__(self, num_classes=NUM_CLASSES):
        super(AlexNet, self).__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2),
            nn.Conv2d(64, 192, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2),
            nn.Conv2d(192, 384, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(384, 256, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2),
        )
        self.classifier = nn.Sequential(
            nn.Dropout(),
            nn.Linear(256 * 2 * 2, 4096),
            nn.ReLU(inplace=True),
            nn.Dropout(),
            nn.Linear(4096, 4096),
            nn.ReLU(inplace=True),
            nn.Linear(4096, num_classes),
        )
<span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">):</span>
<span class="n">x</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">features</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
<span class="n">x</span> <span class="o">=</span> <span class="n">x</span><span class="o">.</span><span class="n">view</span><span class="p">(</span><span class="n">x</span><span class="o">.</span><span class="n">size</span><span class="p">(</span><span class="mi">0</span><span class="p">),</span> <span class="mi">256</span> <span class="o">*</span> <span class="mi">2</span> <span class="o">*</span> <span class="mi">2</span><span class="p">)</span>
<span class="n">x</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">classifier</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
<span class="k">return</span> <span class="n">x</span></code></pre></div><p data-pid="lZJur77-">3.VGGNet:ILSVRC 2014的分类亚军定位冠军网络。VGG可以看成是加深版本的AlexNet. 都是conv layer + FC layer。VGG-16的基本架构为:conv1^2 (64) -> pool1 -> conv2^2 (128) -> pool2 -> conv3^3 (256) -> pool3 -> conv4^3 (512) -> pool4 -> conv5^3 (512) -> pool5 -> fc6 (4096) -> fc7 (4096) -> fc8 (1000) -> softmax。 ^3代表重复3次。</p><figure data-size="normal"><noscript><img src="https://pic2.zhimg.com/v2-4d50e6d65421049e8268bc91ca855b19_b.jpg" data-caption="" data-size="normal" data-rawwidth="577" data-rawheight="577" class="origin_image zh-lightbox-thumb" width="577" data-original="https://pic2.zhimg.com/v2-4d50e6d65421049e8268bc91ca855b19_r.jpg"/></noscript><img src="https://pic2.zhimg.com/80/v2-4d50e6d65421049e8268bc91ca855b19_720w.jpg" data-caption="" data-size="normal" data-rawwidth="577" data-rawheight="577" class="origin_image zh-lightbox-thumb lazy" width="577" data-original="https://pic2.zhimg.com/v2-4d50e6d65421049e8268bc91ca855b19_r.jpg" data-actualsrc="https://pic2.zhimg.com/v2-4d50e6d65421049e8268bc91ca855b19_b.jpg" data-lazy-status="ok"></figure><p data-pid="nTW9p9C1">创新点:(1)<b>结构简单</b>,只有3×3卷积和2×2汇合两种配置,并且<b>重复堆叠</b>相同的模块组合。</p><p data-pid="1w24wz9u">3*3卷积核的优点:多个3×3的卷基层比一个大尺寸filter卷基层有更多的非线性,使得判决函数更加具有判决性。多个3×3的卷积层比一个大尺寸的filter有更少的参数。 1*1卷积核的优点:在不影响输入输出维数的情况下,对输入进行线性形变,然后通过Relu进行非线性处理,增加网络的非线性表达能力。</p><p data-pid="WOMV-GdB">(2). 合适的网络<b>初始化</b>和使用批量归一(batch normalization)层对训练深层网络很重要。在原论文中无法直接训练深层VGG网络,因此先训练浅层网络,并使用浅层网络对深层网络进行初始化。在BN出现之后,伴随其他技术,后续提出的深层网络可以直接得以训练。</p><p data-pid="QuzxCaKd">VGG16网络结构:</p><figure data-size="normal"><noscript><img src="https://pic1.zhimg.com/v2-9f21cdf20f8937fdb4e84fa91687f50c_b.jpg" data-caption="" data-size="normal" data-rawwidth="470" data-rawheight="276" class="origin_image zh-lightbox-thumb" width="470" data-original="https://pic1.zhimg.com/v2-9f21cdf20f8937fdb4e84fa91687f50c_r.jpg"/></noscript><img src="https://pic1.zhimg.com/80/v2-9f21cdf20f8937fdb4e84fa91687f50c_720w.jpg" data-caption="" data-size="normal" data-rawwidth="470" data-rawheight="276" class="origin_image zh-lightbox-thumb lazy" width="470" data-original="https://pic1.zhimg.com/v2-9f21cdf20f8937fdb4e84fa91687f50c_r.jpg" data-actualsrc="https://pic1.zhimg.com/v2-9f21cdf20f8937fdb4e84fa91687f50c_b.jpg" data-lazy-status="ok"></figure><p data-pid="_yTfHN5v">VGG torch 实现,应用于cifar-10</p><div class="highlight"><pre><code class="language-python"><span class="n">cfg</span> <span class="o">=</span> <span class="p">{</span>
<span class="s1">'VGG11'</span><span class="p">:</span> <span class="p">[</span><span class="mi">64</span><span class="p">,</span> <span class="s1">'M'</span><span class="p">,</span> <span class="mi">128</span><span class="p">,</span> <span class="s1">'M'</span><span class="p">,</span> <span class="mi">256</span><span class="p">,</span> <span class="mi">256</span><span class="p">,</span> <span class="s1">'M'</span><span class="p">,</span> <span class="mi">512</span><span class="p">,</span> <span class="mi">512</span><span class="p">,</span> <span class="s1">'M'</span><span class="p">,</span> <span class="mi">512</span><span class="p">,</span> <span class="mi">512</span><span class="p">,</span> <span class="s1">'M'</span><span class="p">],</span>
<span class="s1">'VGG13'</span><span class="p">:</span> <span class="p">[</span><span class="mi">64</span><span class="p">,</span> <span class="mi">64</span><span class="p">,</span> <span class="s1">'M'</span><span class="p">,</span> <span class="mi">128</span><span class="p">,</span> <span class="mi">128</span><span class="p">,</span> <span class="s1">'M'</span><span class="p">,</span> <span class="mi">256</span><span class="p">,</span> <span class="mi">256</span><span class="p">,</span> <span class="s1">'M'</span><span class="p">,</span> <span class="mi">512</span><span class="p">,</span> <span class="mi">512</span><span class="p">,</span> <span class="s1">'M'</span><span class="p">,</span> <span class="mi">512</span><span class="p">,</span> <span class="mi">512</span><span class="p">,</span> <span class="s1">'M'</span><span class="p">],</span>
<span class="s1">'VGG16'</span><span class="p">:</span> <span class="p">[</span><span class="mi">64</span><span class="p">,</span> <span class="mi">64</span><span class="p">,</span> <span class="s1">'M'</span><span class="p">,</span> <span class="mi">128</span><span class="p">,</span> <span class="mi">128</span><span class="p">,</span> <span class="s1">'M'</span><span class="p">,</span> <span class="mi">256</span><span class="p">,</span> <span class="mi">256</span><span class="p">,</span> <span class="mi">256</span><span class="p">,</span> <span class="s1">'M'</span><span class="p">,</span> <span class="mi">512</span><span class="p">,</span> <span class="mi">512</span><span class="p">,</span> <span class="mi">512</span><span class="p">,</span> <span class="s1">'M'</span><span class="p">,</span> <span class="mi">512</span><span class="p">,</span> <span class="mi">512</span><span class="p">,</span> <span class="mi">512</span><span class="p">,</span> <span class="s1">'M'</span><span class="p">],</span>
<span class="s1">'VGG19'</span><span class="p">:</span> <span class="p">[</span><span class="mi">64</span><span class="p">,</span> <span class="mi">64</span><span class="p">,</span> <span class="s1">'M'</span><span class="p">,</span> <span class="mi">128</span><span class="p">,</span> <span class="mi">128</span><span class="p">,</span> <span class="s1">'M'</span><span class="p">,</span> <span class="mi">256</span><span class="p">,</span> <span class="mi">256</span><span class="p">,</span> <span class="mi">256</span><span class="p">,</span> <span class="mi">256</span><span class="p">,</span> <span class="s1">'M'</span><span class="p">,</span> <span class="mi">512</span><span class="p">,</span> <span class="mi">512</span><span class="p">,</span> <span class="mi">512</span><span class="p">,</span> <span class="mi">512</span><span class="p">,</span> <span class="s1">'M'</span><span class="p">,</span> <span class="mi">512</span><span class="p">,</span> <span class="mi">512</span><span class="p">,</span> <span class="mi">512</span><span class="p">,</span> <span class="mi">512</span><span class="p">,</span> <span class="s1">'M'</span><span class="p">],</span>
}
class VGG(nn.Module):
    def __init__(self, vgg_name):
        super(VGG, self).__init__()
        self.features = self._make_layers(cfg[vgg_name])
        self.classifier = nn.Linear(512, 10)

    def forward(self, x):
        out = self.features(x)
        out = out.view(out.size(0), -1)
        out = self.classifier(out)
        return out

    def _make_layers(self, cfg):
        layers = []
        in_channels = 3
        for x in cfg:
            if x == 'M':
                layers += [nn.MaxPool2d(kernel_size=2, stride=2)]
            else:
                layers += [nn.Conv2d(in_channels, x, kernel_size=3, padding=1),
                           nn.BatchNorm2d(x),
                           nn.ReLU(inplace=True)]
                in_channels = x
        layers += [nn.AvgPool2d(kernel_size=1, stride=1)]
        return nn.Sequential(*layers)

def VGG11():
    return VGG('VGG11')

def VGG13():
    return VGG('VGG13')

def VGG16():
    return VGG('VGG16')

def VGG19():
    return VGG('VGG19')
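A quick shape check of my own: the five max-pool stages reduce a 32×32 CIFAR-10 image to 1×1, leaving a 512-dimensional feature vector for the linear classifier:

import torch

net = VGG16()
out = net(torch.randn(2, 3, 32, 32))
print(out.shape)  # torch.Size([2, 10])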
4. GoogLeNet: winner of ILSVRC 2014. GoogLeNet tries to answer the question of what convolution size, or whether a pooling layer, should be chosen when designing a network. It proposes the Inception module, which applies 1×1, 3×3 and 5×5 convolutions and 3×3 pooling in parallel and keeps all of the results. Basic architecture: conv1 (64) -> pool1 -> conv2^2 (64, 192) -> pool2 -> inc3 (256, 480) -> pool3 -> inc4^5 (512, 512, 512, 528, 832) -> pool4 -> inc5^2 (832, 1024) -> pool5 -> fc (1000). Motivation: increase network depth and width while reducing parameters by turning full connections into sparse ones. Is there a way to preserve the sparsity of the network structure while still exploiting the high computational efficiency of dense matrices?
Innovations: (1) the Inception structure, which processes the input in multiple branches and concatenates the results; (2) 1×1 convolutions for dimensionality reduction to cut the computational cost. GoogLeNet also replaces the fully connected layers (where earlier networks concentrated almost all of their parameters) with simple global average pooling, which greatly reduces the parameter count.
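How much the 1×1 reduction saves is easy to quantify. A sketch of my own, using the channel sizes of the 5×5 branch of the first Inception module below (192 input channels, reduced to 16, expanded to 32):

import torch.nn as nn

direct = nn.Conv2d(192, 32, kernel_size=5, padding=2)
reduced = nn.Sequential(
    nn.Conv2d(192, 16, kernel_size=1),            # 1x1 reduction
    nn.Conv2d(16, 32, kernel_size=5, padding=2),  # 5x5 on the cheap 16-channel input
)

def count(m):
    return sum(p.numel() for p in m.parameters())

print(count(direct))   # 153632
print(count(reduced))  # 15920, roughly a 10x saving in weights (and multiplies)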
Inception V2/V3/V4 have not been studied yet and are not covered here.
GoogLeNet PyTorch implementation, applied to CIFAR-10:
import torch
import torch.nn as nn

class Inception(nn.Module):
    def __init__(self, in_planes, kernel_1_x, kernel_3_in, kernel_3_x, kernel_5_in, kernel_5_x, pool_planes):
        super(Inception, self).__init__()
        # 1x1 conv branch
        self.b1 = nn.Sequential(
            nn.Conv2d(in_planes, kernel_1_x, kernel_size=1),
            nn.BatchNorm2d(kernel_1_x),
            nn.ReLU(True),
        )
        # 1x1 conv -> 3x3 conv branch
        self.b2 = nn.Sequential(
            nn.Conv2d(in_planes, kernel_3_in, kernel_size=1),
            nn.BatchNorm2d(kernel_3_in),
            nn.ReLU(True),
            nn.Conv2d(kernel_3_in, kernel_3_x, kernel_size=3, padding=1),
            nn.BatchNorm2d(kernel_3_x),
            nn.ReLU(True),
        )
        # 1x1 conv -> 5x5 conv branch (here the 5x5 convolution is realized
        # as two stacked 3x3 convolutions, which cover the same receptive
        # field with fewer parameters)
        self.b3 = nn.Sequential(
            nn.Conv2d(in_planes, kernel_5_in, kernel_size=1),
            nn.BatchNorm2d(kernel_5_in),
            nn.ReLU(True),
            nn.Conv2d(kernel_5_in, kernel_5_x, kernel_size=3, padding=1),
            nn.BatchNorm2d(kernel_5_x),
            nn.ReLU(True),
            nn.Conv2d(kernel_5_x, kernel_5_x, kernel_size=3, padding=1),
            nn.BatchNorm2d(kernel_5_x),
            nn.ReLU(True),
        )
        # 3x3 pool -> 1x1 conv branch
        self.b4 = nn.Sequential(
            nn.MaxPool2d(3, stride=1, padding=1),
            nn.Conv2d(in_planes, pool_planes, kernel_size=1),
            nn.BatchNorm2d(pool_planes),
            nn.ReLU(True),
        )
<span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">):</span>
<span class="n">y1</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">b1</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
<span class="n">y2</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">b2</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
<span class="n">y3</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">b3</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
<span class="n">y4</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">b4</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
<span class="k">return</span> <span class="n">torch</span><span class="o">.</span><span class="n">cat</span><span class="p">([</span><span class="n">y1</span><span class="p">,</span><span class="n">y2</span><span class="p">,</span><span class="n">y3</span><span class="p">,</span><span class="n">y4</span><span class="p">],</span> <span class="mi">1</span><span class="p">)</span>
class GoogLeNet(nn.Module):
    def __init__(self):
        super(GoogLeNet, self).__init__()
        self.pre_layers = nn.Sequential(
            nn.Conv2d(3, 192, kernel_size=3, padding=1),
            nn.BatchNorm2d(192),
            nn.ReLU(True),
        )
<span class="bp">self</span><span class="o">.</span><span class="n">a3</span> <span class="o">=</span> <span class="n">Inception</span><span class="p">(</span><span class="mi">192</span><span class="p">,</span> <span class="mi">64</span><span class="p">,</span> <span class="mi">96</span><span class="p">,</span> <span class="mi">128</span><span class="p">,</span> <span class="mi">16</span><span class="p">,</span> <span class="mi">32</span><span class="p">,</span> <span class="mi">32</span><span class="p">)</span>
<span class="bp">self</span><span class="o">.</span><span class="n">b3</span> <span class="o">=</span> <span class="n">Inception</span><span class="p">(</span><span class="mi">256</span><span class="p">,</span> <span class="mi">128</span><span class="p">,</span> <span class="mi">128</span><span class="p">,</span> <span class="mi">192</span><span class="p">,</span> <span class="mi">32</span><span class="p">,</span> <span class="mi">96</span><span class="p">,</span> <span class="mi">64</span><span class="p">)</span>
<span class="bp">self</span><span class="o">.</span><span class="n">max_pool</span> <span class="o">=</span> <span class="n">nn</span><span class="o">.</span><span class="n">MaxPool2d</span><span class="p">(</span><span class="mi">3</span><span class="p">,</span> <span class="n">stride</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span> <span class="n">padding</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
<span class="bp">self</span><span class="o">.</span><span class="n">a4</span> <span class="o">=</span> <span class="n">Inception</span><span class="p">(</span><span class="mi">480</span><span class="p">,</span> <span class="mi">192</span><span class="p">,</span> <span class="mi">96</span><span class="p">,</span> <span class="mi">208</span><span class="p">,</span> <span class="mi">16</span><span class="p">,</span> <span class="mi">48</span><span class="p">,</span> <span class="mi">64</span><span class="p">)</span>
<span class="bp">self</span><span class="o">.</span><span class="n">b4</span> <span class="o">=</span> <span class="n">Inception</span><span class="p">(</span><span class="mi">512</span><span class="p">,</span> <span class="mi">160</span><span class="p">,</span> <span class="mi">112</span><span class="p">,</span> <span class="mi">224</span><span class="p">,</span> <span class="mi">24</span><span class="p">,</span> <span class="mi">64</span><span class="p">,</span> <span class="mi">64</span><span class="p">)</span>
<span class="bp">self</span><span class="o">.</span><span class="n">c4</span> <span class="o">=</span> <span class="n">Inception</span><span class="p">(</span><span class="mi">512</span><span class="p">,</span> <span class="mi">128</span><span class="p">,</span> <span class="mi">128</span><span class="p">,</span> <span class="mi">256</span><span class="p">,</span> <span class="mi">24</span><span class="p">,</span> <span class="mi">64</span><span class="p">,</span> <span class="mi">64</span><span class="p">)</span>
<span class="bp">self</span><span class="o">.</span><span class="n">d4</span> <span class="o">=</span> <span class="n">Inception</span><span class="p">(</span><span class="mi">512</span><span class="p">,</span> <span class="mi">112</span><span class="p">,</span> <span class="mi">144</span><span class="p">,</span> <span class="mi">288</span><span class="p">,</span> <span class="mi">32</span><span class="p">,</span> <span class="mi">64</span><span class="p">,</span> <span class="mi">64</span><span class="p">)</span>
<span class="bp">self</span><span class="o">.</span><span class="n">e4</span> <span class="o">=</span> <span class="n">Inception</span><span class="p">(</span><span class="mi">528</span><span class="p">,</span> <span class="mi">256</span><span class="p">,</span> <span class="mi">160</span><span class="p">,</span> <span class="mi">320</span><span class="p">,</span> <span class="mi">32</span><span class="p">,</span> <span class="mi">128</span><span class="p">,</span> <span class="mi">128</span><span class="p">)</span>
<span class="bp">self</span><span class="o">.</span><span class="n">a5</span> <span class="o">=</span> <span class="n">Inception</span><span class="p">(</span><span class="mi">832</span><span class="p">,</span> <span class="mi">256</span><span class="p">,</span> <span class="mi">160</span><span class="p">,</span> <span class="mi">320</span><span class="p">,</span> <span class="mi">32</span><span class="p">,</span> <span class="mi">128</span><span class="p">,</span> <span class="mi">128</span><span class="p">)</span>
<span class="bp">self</span><span class="o">.</span><span class="n">b5</span> <span class="o">=</span> <span class="n">Inception</span><span class="p">(</span><span class="mi">832</span><span class="p">,</span> <span class="mi">384</span><span class="p">,</span> <span class="mi">192</span><span class="p">,</span> <span class="mi">384</span><span class="p">,</span> <span class="mi">48</span><span class="p">,</span> <span class="mi">128</span><span class="p">,</span> <span class="mi">128</span><span class="p">)</span>
<span class="bp">self</span><span class="o">.</span><span class="n">avgpool</span> <span class="o">=</span> <span class="n">nn</span><span class="o">.</span><span class="n">AvgPool2d</span><span class="p">(</span><span class="mi">8</span><span class="p">,</span> <span class="n">stride</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
<span class="bp">self</span><span class="o">.</span><span class="n">linear</span> <span class="o">=</span> <span class="n">nn</span><span class="o">.</span><span class="n">Linear</span><span class="p">(</span><span class="mi">1024</span><span class="p">,</span> <span class="mi">10</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">):</span>
<span class="n">x</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">pre_layers</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
<span class="n">x</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">a3</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
<span class="n">x</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">b3</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
<span class="n">x</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">max_pool</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
<span class="n">x</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">a4</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
<span class="n">x</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">b4</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
<span class="n">x</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">c4</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
<span class="n">x</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">d4</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
<span class="n">x</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">e4</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
<span class="n">x</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">max_pool</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
<span class="n">x</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">a5</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
<span class="n">x</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">b5</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
<span class="n">x</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">avgpool</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
<span class="n">x</span> <span class="o">=</span> <span class="n">x</span><span class="o">.</span><span class="n">view</span><span class="p">(</span><span class="n">x</span><span class="o">.</span><span class="n">size</span><span class="p">(</span><span class="mi">0</span><span class="p">),</span> <span class="o">-</span><span class="mi">1</span><span class="p">)</span>
<span class="n">x</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">linear</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
<span class="k">return</span> <span class="n">x</span></code></pre></div><p data-pid="sCdB5Fco">5.ResNet:2015年何恺明推出的ResNet在ISLVRC和COCO上横扫所有选手,获得冠军。ResNet旨在解决网络加深后训练难度增大的现象。其提出了residual模块,包含两个3×3卷积和一个短路连接(左图)。</p><figure data-size="normal"><noscript><img src="https://pic4.zhimg.com/v2-76877f7268a99f06eaeb7313245f2f03_b.jpg" data-caption="" data-size="normal" data-rawwidth="435" data-rawheight="221" class="origin_image zh-lightbox-thumb" width="435" data-original="https://pic4.zhimg.com/v2-76877f7268a99f06eaeb7313245f2f03_r.jpg"/></noscript><img src="https://pic4.zhimg.com/80/v2-76877f7268a99f06eaeb7313245f2f03_720w.jpg" data-caption="" data-size="normal" data-rawwidth="435" data-rawheight="221" class="origin_image zh-lightbox-thumb lazy" width="435" data-original="https://pic4.zhimg.com/v2-76877f7268a99f06eaeb7313245f2f03_r.jpg" data-actualsrc="https://pic4.zhimg.com/v2-76877f7268a99f06eaeb7313245f2f03_b.jpg" data-lazy-status="ok"></figure><p data-pid="-uRzdlQS">创新点:(1). 使用<b>短路连接</b>,使训练深层网络更容易,并且<b>重复堆叠</b>相同的模块组合。(2). ResNet大量使用了<b>批量归一层</b>。(3). 对于很深的网络(超过50层),ResNet使用了更高效的<b>瓶颈(bottleneck)</b>结构(右图)。(这里的1×1的卷积能够起降维或升维的作用,从而令3×3的卷积可以在相对较低维度的输入上进行,以达到提高计算效率的目的。)</p><p data-pid="VIu3oawy">从前面可以看到,随着网络深度增加,网络的准确度应该同步增加,当然要注意过拟合问题。但是网络深度增加的一个问题在于这些增加的层是参数更新的信号,因为梯度是从后向前传播的,增加网络深度后,比较靠前的层梯度会很小。这意味着这些层基本上学习停滞了,这就是<b><i>梯度消失</i></b>问题。<br>深度网络的第二个问题在于训练,当网络更深时意味着参数空间更大,优化问题变得更难,因此简单地去增加网络深度反而出现更高的训练误差,深层网络虽然收敛了,但网络却开始退化了,即增加网络层数却导致更大的误差。这就是烦人的<b><i>退化</i></b>问题。</p><p data-pid="_9DlBa33"><i>详解resblock:</i>数据经过了两条路线,一条是常规路线,另一条则是捷径(shortcut),直接实现单位映射的直接连接的路线,这有点类似与电路中的“短路”。通过实验,这种带有shortcut的结构确实可以很好地应对退化问题。我们把网络中的一个模块的输入和输出关系看作是y=H(x),那么直接通过梯度方法求H(x)就会遇到上面提到的退化问题,如果使用了这种带shortcut的结构,那么可变参数部分的优化目标就不再是H(x),若用F(x)来代表需要优化的部分的话,则H(x)=F(x)+x,也就是F(x)=H(x)-x。<br>因为在单位映射的假设中y=x就相当于观测值,所以F(x)就对应着残差,因而叫残差网络。为啥要这样做?因为作者认为学习残差F(X)比直接学习H(X)简单!设想下,现在根据我们只需要去学习输入和输出的差值就可以了,绝对量变为相对量(H(x)-x 就是输出相对于输入变化了多少),<b>优化起来简单很多。</b><br>考虑到x的维度与F(X)维度可能不匹配情况,需进行维度匹配。这里论文中采用两种方法解决这一问题(其实是三种,但通过实验发现第三种方法会使performance急剧下降,故不采用):1)zero_padding:对恒等层进行0填充的方式将维度补充完整。这种方法不会增加额外的参数;2)projection:在恒等层采用1x1的卷积核来增加维度。这种方法会增加额外的参数</p><p data-pid="k4bWLYaz">整体网络:</p><figure data-size="normal"><noscript><img src="https://pic1.zhimg.com/v2-fa1527afcbb2f7f8c1c0611f72894754_b.jpg" data-caption="" data-size="normal" data-rawwidth="357" data-rawheight="863" class="content_image" width="357"/></noscript><img src="https://pic1.zhimg.com/80/v2-fa1527afcbb2f7f8c1c0611f72894754_720w.jpg" data-caption="" data-size="normal" data-rawwidth="357" data-rawheight="863" class="content_image lazy" width="357" data-actualsrc="https://pic1.zhimg.com/v2-fa1527afcbb2f7f8c1c0611f72894754_b.jpg" data-lazy-status="ok"></figure><p data-pid="9E-188VV">ResNet torch 实现,应用于cifar-10</p><div class="highlight"><pre><code class="language-python"><span class="k">def</span> <span class="nf">conv3x3</span><span class="p">(</span><span class="n">in_planes</span><span class="p">,</span> <span class="n">out_planes</span><span class="p">,</span> <span class="n">stride</span><span class="o">=</span><span class="mi">1</span><span class="p">):</span>
<span class="c1"># 3x3 convolution with padding</span>
<span class="k">return</span> <span class="n">nn</span><span class="o">.</span><span class="n">Conv2d</span><span class="p">(</span><span class="n">in_planes</span><span class="p">,</span> <span class="n">out_planes</span><span class="p">,</span> <span class="n">kernel_size</span><span class="o">=</span><span class="mi">3</span><span class="p">,</span> <span class="n">stride</span><span class="o">=</span><span class="n">stride</span><span class="p">,</span> <span class="n">padding</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">bias</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
class BasicBlock(nn.Module):
    expansion = 1

    def __init__(self, inplanes, planes, stride=1, downsample=None):
        super(BasicBlock, self).__init__()
        self.conv1 = conv3x3(inplanes, planes, stride)
        self.bn1 = nn.BatchNorm2d(planes)
        self.relu = nn.ReLU(inplace=True)
        self.conv2 = conv3x3(planes, planes)
        self.bn2 = nn.BatchNorm2d(planes)
        self.downsample = downsample
        self.stride = stride

    def forward(self, x):
        residual = x
        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)
        out = self.conv2(out)
        out = self.bn2(out)
        if self.downsample is not None:
            # the projection must be applied to the original input, not to the
            # transformed features, so the shortcut stays a (projected) identity
            residual = self.downsample(x)
        out += residual
        out = self.relu(out)
        return out
class Bottleneck(nn.Module):
    expansion = 4

    def __init__(self, inplanes, planes, stride=1, downsample=None):
        super(Bottleneck, self).__init__()
        self.conv1 = nn.Conv2d(inplanes, planes, kernel_size=1, bias=False)
        self.bn1 = nn.BatchNorm2d(planes)
        self.conv2 = nn.Conv2d(planes, planes, kernel_size=3, stride=stride, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(planes)
        self.conv3 = nn.Conv2d(planes, planes * 4, kernel_size=1, bias=False)
        self.bn3 = nn.BatchNorm2d(planes * 4)
        self.relu = nn.ReLU(inplace=True)
        self.downsample = downsample
        self.stride = stride

    def forward(self, x):
        residual = x
        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)
        out = self.conv2(out)
        out = self.bn2(out)
        out = self.relu(out)
        out = self.conv3(out)
        out = self.bn3(out)
        if self.downsample is not None:
            residual = self.downsample(x)  # project the original input
        out += residual
        out = self.relu(out)
        return out
class ResNet(nn.Module):
    def __init__(self, block, layers, num_classes=10):
        self.inplanes = 64
        super(ResNet, self).__init__()
        self.conv1 = nn.Conv2d(3, 64, kernel_size=3, stride=1, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(64)
        self.relu = nn.ReLU(inplace=True)
        self.layer1 = self._make_layer(block, 64, layers[0])
        self.layer2 = self._make_layer(block, 128, layers[1], stride=2)
        self.layer3 = self._make_layer(block, 256, layers[2], stride=2)
        self.layer4 = self._make_layer(block, 512, layers[3], stride=2)
        self.avgpool = nn.AvgPool2d(kernel_size=4)
        self.fc = nn.Linear(512 * block.expansion, num_classes)
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                n = m.kernel_size[0] * m.kernel_size[1] * m.out_channels
                m.weight.data.normal_(0, math.sqrt(2. / n))
            elif isinstance(m, nn.BatchNorm2d):
                m.weight.data.fill_(1)
                m.bias.data.zero_()

    def _make_layer(self, block, planes, blocks, stride=1):
        downsample = None
        if stride != 1 or self.inplanes != planes * block.expansion:
            downsample = nn.Sequential(
                nn.Conv2d(self.inplanes, planes * block.expansion, kernel_size=1, stride=stride, bias=False),
                nn.BatchNorm2d(planes * block.expansion),
            )
        layers = []
        layers.append(block(self.inplanes, planes, stride, downsample))
        self.inplanes = planes * block.expansion
        for i in range(1, blocks):
            layers.append(block(self.inplanes, planes))
        return nn.Sequential(*layers)

    def forward(self, x):
        x = self.conv1(x)
        x = self.bn1(x)
        x = self.relu(x)
        x = self.layer1(x)
        x = self.layer2(x)
        x = self.layer3(x)
        x = self.layer4(x)
        x = self.avgpool(x)
        x = x.view(x.size(0), -1)
        x = self.fc(x)
        return x
def resnet18(**kwargs):
    return ResNet(BasicBlock, [2, 2, 2, 2], **kwargs)

def resnet34(**kwargs):
    return ResNet(BasicBlock, [3, 4, 6, 3], **kwargs)

def resnet50(**kwargs):
    return ResNet(Bottleneck, [3, 4, 6, 3], **kwargs)

def resnet101(**kwargs):
    return ResNet(Bottleneck, [3, 4, 23, 3], **kwargs)

def resnet152(**kwargs):
    return ResNet(Bottleneck, [3, 8, 36, 3], **kwargs)
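A usage sketch of my own: conv1 keeps stride 1, the three strided stages reduce a 32×32 CIFAR-10 input to 4×4, and the 4×4 average pool collapses this before the classifier:

import torch

model = resnet18(num_classes=10)
out = model(torch.randn(4, 3, 32, 32))
print(out.shape)  # torch.Size([4, 10])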
6. DenseNet: best paper of CVPR 2017. Unlike the residual module, any two layers in a dense module have a shortcut connection between them. That is, each layer's input, via concatenation, contains the results of all preceding layers, i.e., features from all levels, low to high.
Innovations: dense connectivity alleviates the vanishing-gradient problem, strengthens feature propagation, encourages feature reuse, and greatly reduces the number of parameters (each convolutional layer has very few filters; the parameter count is only about half of ResNet's).
DenseNet lets the input of layer l directly influence all subsequent layers; its output is x_l = H_l([x_0, x_1, ..., x_{l-1}]), where [x_0, x_1, ..., x_{l-1}] is the concatenation of the previous feature maps along the channel dimension. Since each layer already receives the outputs of all preceding layers, it only needs to produce a small number of feature maps itself, which is why DenseNet needs far fewer parameters than other models. Note that dense connectivity only exists within a dense block; there are no dense connections between different dense blocks.
Of course, this performance gain comes at the cost of heavy GPU memory usage.
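Before the code, a minimal sketch of how dense connectivity grows the channel count (my own illustration; the random tensors stand in for real layer outputs): starting from c0 channels, after l layers with growth rate k the concatenated input has c0 + l*k channels:

import torch

growth_rate = 12
features = [torch.randn(1, 24, 32, 32)]      # c0 = 2 * growth_rate, as in the code below
for l in range(3):
    x = torch.cat(features, dim=1)           # input to layer l: all previous feature maps
    y = torch.randn(1, growth_rate, 32, 32)  # stand-in for H_l(x)
    features.append(y)
print(torch.cat(features, dim=1).shape)      # torch.Size([1, 60, 32, 32]): 24 + 3*12 channels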
DenseNet PyTorch implementation, applied to CIFAR-10:
import math
import torch
import torch.nn as nn
import torch.nn.functional as func

class Bottleneck(nn.Module):
    def __init__(self, in_planes, growth_rate):
        super(Bottleneck, self).__init__()
        self.bn1 = nn.BatchNorm2d(in_planes)
        self.conv1 = nn.Conv2d(in_planes, 4 * growth_rate, kernel_size=1, bias=False)
        self.bn2 = nn.BatchNorm2d(4 * growth_rate)
        self.conv2 = nn.Conv2d(4 * growth_rate, growth_rate, kernel_size=3, padding=1, bias=False)

    def forward(self, x):
        y = self.conv1(func.relu(self.bn1(x)))
        y = self.conv2(func.relu(self.bn2(y)))
        x = torch.cat([y, x], 1)
        return x
class Transition(nn.Module):
    def __init__(self, in_planes, out_planes):
        super(Transition, self).__init__()
        self.bn = nn.BatchNorm2d(in_planes)
        self.conv = nn.Conv2d(in_planes, out_planes, kernel_size=1, bias=False)

    def forward(self, x):
        x = self.conv(func.relu(self.bn(x)))
        x = func.avg_pool2d(x, 2)
        return x
class DenseNet(nn.Module):
    def __init__(self, block, num_block, growth_rate=12, reduction=0.5, num_classes=10):
        super(DenseNet, self).__init__()
        self.growth_rate = growth_rate
<span class="n">num_planes</span> <span class="o">=</span> <span class="mi">2</span> <span class="o">*</span> <span class="n">growth_rate</span>
<span class="bp">self</span><span class="o">.</span><span class="n">conv1</span> <span class="o">=</span> <span class="n">nn</span><span class="o">.</span><span class="n">Conv2d</span><span class="p">(</span><span class="mi">3</span><span class="p">,</span> <span class="n">num_planes</span><span class="p">,</span> <span class="n">kernel_size</span><span class="o">=</span><span class="mi">3</span><span class="p">,</span> <span class="n">padding</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">bias</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
<span class="bp">self</span><span class="o">.</span><span class="n">dense1</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">_make_dense_layers</span><span class="p">(</span><span class="n">block</span><span class="p">,</span> <span class="n">num_planes</span><span class="p">,</span> <span class="n">num_block</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span>
<span class="n">num_planes</span> <span class="o">+=</span> <span class="n">num_block</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">*</span> <span class="n">growth_rate</span>
<span class="n">out_planes</span> <span class="o">=</span> <span class="nb">int</span><span class="p">(</span><span class="n">math</span><span class="o">.</span><span class="n">floor</span><span class="p">(</span><span class="n">num_planes</span> <span class="o">*</span> <span class="n">reduction</span><span class="p">))</span>
<span class="bp">self</span><span class="o">.</span><span class="n">trans1</span> <span class="o">=</span> <span class="n">Transition</span><span class="p">(</span><span class="n">num_planes</span><span class="p">,</span> <span class="n">out_planes</span><span class="p">)</span>
<span class="n">num_planes</span> <span class="o">=</span> <span class="n">out_planes</span>
<span class="bp">self</span><span class="o">.</span><span class="n">dense2</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">_make_dense_layers</span><span class="p">(</span><span class="n">block</span><span class="p">,</span> <span class="n">num_planes</span><span class="p">,</span> <span class="n">num_block</span><span class="p">[</span><span class="mi">1</span><span class="p">])</span>
<span class="n">num_planes</span> <span class="o">+=</span> <span class="n">num_block</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">*</span> <span class="n">growth_rate</span>
<span class="n">out_planes</span> <span class="o">=</span> <span class="nb">int</span><span class="p">(</span><span class="n">math</span><span class="o">.</span><span class="n">floor</span><span class="p">(</span><span class="n">num_planes</span> <span class="o">*</span> <span class="n">reduction</span><span class="p">))</span>
<span class="bp">self</span><span class="o">.</span><span class="n">trans2</span> <span class="o">=</span> <span class="n">Transition</span><span class="p">(</span><span class="n">num_planes</span><span class="p">,</span> <span class="n">out_planes</span><span class="p">)</span>
<span class="n">num_planes</span> <span class="o">=</span> <span class="n">out_planes</span>
<span class="bp">self</span><span class="o">.</span><span class="n">dense3</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">_make_dense_layers</span><span class="p">(</span><span class="n">block</span><span class="p">,</span> <span class="n">num_planes</span><span class="p">,</span> <span class="n">num_block</span><span class="p">[</span><span class="mi">2</span><span class="p">])</span>
<span class="n">num_planes</span> <span class="o">+=</span> <span class="n">num_block</span><span class="p">[</span><span class="mi">2</span><span class="p">]</span> <span class="o">*</span> <span class="n">growth_rate</span>
<span class="n">out_planes</span> <span class="o">=</span> <span class="nb">int</span><span class="p">(</span><span class="n">math</span><span class="o">.</span><span class="n">floor</span><span class="p">(</span><span class="n">num_planes</span> <span class="o">*</span> <span class="n">reduction</span><span class="p">))</span>
<span class="bp">self</span><span class="o">.</span><span class="n">trans3</span> <span class="o">=</span> <span class="n">Transition</span><span class="p">(</span><span class="n">num_planes</span><span class="p">,</span> <span class="n">out_planes</span><span class="p">)</span>
<span class="n">num_planes</span> <span class="o">=</span> <span class="n">out_planes</span>
<span class="bp">self</span><span class="o">.</span><span class="n">dense4</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">_make_dense_layers</span><span class="p">(</span><span class="n">block</span><span class="p">,</span> <span class="n">num_planes</span><span class="p">,</span> <span class="n">num_block</span><span class="p">[</span><span class="mi">3</span><span class="p">])</span>
<span class="n">num_planes</span> <span class="o">+=</span> <span class="n">num_block</span><span class="p">[</span><span class="mi">3</span><span class="p">]</span> <span class="o">*</span> <span class="n">growth_rate</span>
<span class="bp">self</span><span class="o">.</span><span class="n">bn</span> <span class="o">=</span> <span class="n">nn</span><span class="o">.</span><span class="n">BatchNorm2d</span><span class="p">(</span><span class="n">num_planes</span><span class="p">)</span>
<span class="bp">self</span><span class="o">.</span><span class="n">linear</span> <span class="o">=</span> <span class="n">nn</span><span class="o">.</span><span class="n">Linear</span><span class="p">(</span><span class="n">num_planes</span><span class="p">,</span> <span class="n">num_classes</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">_make_dense_layers</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">block</span><span class="p">,</span> <span class="n">in_planes</span><span class="p">,</span> <span class="n">num_block</span><span class="p">):</span>
<span class="n">layers</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">num_block</span><span class="p">):</span>
<span class="n">layers</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">block</span><span class="p">(</span><span class="n">in_planes</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">growth_rate</span><span class="p">))</span>
<span class="n">in_planes</span> <span class="o">+=</span> <span class="bp">self</span><span class="o">.</span><span class="n">growth_rate</span>
<span class="k">return</span> <span class="n">nn</span><span class="o">.</span><span class="n">Sequential</span><span class="p">(</span><span class="o">*</span><span class="n">layers</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">):</span>
<span class="n">x</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">conv1</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
<span class="n">x</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">trans1</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">dense1</span><span class="p">(</span><span class="n">x</span><span class="p">))</span>
<span class="n">x</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">trans2</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">dense2</span><span class="p">(</span><span class="n">x</span><span class="p">))</span>
<span class="n">x</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">trans3</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">dense3</span><span class="p">(</span><span class="n">x</span><span class="p">))</span>
<span class="n">x</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">dense4</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
<span class="n">x</span> <span class="o">=</span> <span class="n">func</span><span class="o">.</span><span class="n">avg_pool2d</span><span class="p">(</span><span class="n">func</span><span class="o">.</span><span class="n">relu</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">bn</span><span class="p">(</span><span class="n">x</span><span class="p">)),</span> <span class="mi">4</span><span class="p">)</span>
<span class="n">x</span> <span class="o">=</span> <span class="n">x</span><span class="o">.</span><span class="n">view</span><span class="p">(</span><span class="n">x</span><span class="o">.</span><span class="n">size</span><span class="p">(</span><span class="mi">0</span><span class="p">),</span> <span class="o">-</span><span class="mi">1</span><span class="p">)</span>
<span class="n">x</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">linear</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
<span class="k">return</span> <span class="n">x</span>
def DenseNet121():
    return DenseNet(Bottleneck, [6, 12, 24, 16], growth_rate=32)

def DenseNet169():
    return DenseNet(Bottleneck, [6, 12, 32, 32], growth_rate=32)

def DenseNet201():
    return DenseNet(Bottleneck, [6, 12, 48, 32], growth_rate=32)

def DenseNet161():
    return DenseNet(Bottleneck, [6, 12, 36, 24], growth_rate=48)

def densenet_cifar():
    return DenseNet(Bottleneck, [6, 12, 24, 16], growth_rate=12)
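A quick forward-pass check of the cifar variant (a minimal sketch reusing the classes defined above):
# cifar-10 images are 3x32x32; the three stride-2 transitions shrink the
# 32x32 feature map to 4x4, which the final avg_pool2d(x, 4) reduces to 1x1.
net = densenet_cifar()
dummy = torch.randn(2, 3, 32, 32)   # a batch of two cifar-10-sized images
out = net(dummy)
print(out.shape)                    # torch.Size([2, 10])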
b) Object Detection
Simply put: what objects does the image contain, and where are they? (Mark each one with a rectangular bounding box.)
Common evaluation metrics for object detection:
mAP (mean average precision): a predicted bounding box is counted as correct when its intersection over union with a ground-truth box exceeds some threshold (usually 0.5). For each class, we plot its precision-recall curve; average precision (AP) is the area under that curve. Averaging the APs over all classes yields the mAP, whose value lies in [0, 100%]. The counts below are defined per class (a computation sketch follows the list):
- True Positive (TP): the number of detection boxes with IoU > IoU_threshold that match a ground-truth (GT) box; only the first match per GT counts
- False Positive (FP): the number of detection boxes with IoU ≤ IoU_threshold, plus any duplicate detections of an already-matched GT
- False Negative (FN): the number of GT boxes that are not detected
- True Negative (TN): not used in the mAP metric
- Precision: TP / (TP + FP)
- Recall: TP / (TP + FN)
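To make the AP computation concrete, here is a minimal sketch for a single class (not the exact PASCAL VOC 11-point or COCO 101-point protocol; the function name average_precision is ours, and it assumes detections have already been marked TP/FP by the IoU rule above):
import numpy as np

def average_precision(scores, is_tp, num_gt):
    # scores: confidence of each detection; is_tp: 1 if TP else 0 (FP);
    # num_gt: total number of GT boxes of this class
    order = np.argsort(-np.asarray(scores))       # rank by confidence
    tp = np.asarray(is_tp, dtype=float)[order]
    cum_tp = np.cumsum(tp)
    cum_fp = np.cumsum(1.0 - tp)
    precision = cum_tp / (cum_tp + cum_fp)
    recall = cum_tp / num_gt
    ap, prev_r = 0.0, 0.0
    for p, r in zip(precision, recall):           # step-wise integration
        ap += p * (r - prev_r)
        prev_r = r
    return ap

# three detections, the two most confident are TPs, two GT boxes in total
print(average_precision([0.9, 0.8, 0.3], [1, 1, 0], num_gt=2))  # -> 1.0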
Intersection over union (IoU) measures how much two boxes overlap (for detection; for segmentation, how much the predicted region overlaps the gold standard). The formula is IoU = |A ∩ B| / |A ∪ B|, i.e., the area of the intersection divided by the area of the union.
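A minimal IoU sketch for two axis-aligned boxes in (x1, y1, x2, y2) format:
def box_iou(a, b):
    # intersection rectangle (empty if the boxes do not overlap)
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

print(box_iou((0, 0, 2, 2), (1, 1, 3, 3)))  # 1 / 7 ≈ 0.143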
Recent object detection networks:
R-CNN: combines region proposals with a CNN for object localization.
Region proposals: proposal-generation algorithms typically merge similar pixels based on the image's color, texture, area, position, and so on, eventually yielding a set of candidate rectangular regions. These algorithms, such as selective search or EdgeBoxes, usually take only a few seconds of CPU time, and a typical number of proposals is around 2k. Compared with sliding a window over every region of the image, proposal-based methods are very efficient. On the other hand, these proposal generators have mediocre precision but usually high recall, so objects in the image are unlikely to be missed (a usage sketch follows).
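For illustration, selective search ships with OpenCV's contrib package (a minimal sketch; it assumes opencv-contrib-python is installed and that a local file test.jpg exists):
import cv2

img = cv2.imread("test.jpg")                     # hypothetical local image
ss = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
ss.setBaseImage(img)
ss.switchToSelectiveSearchFast()                 # fast mode: speed over recall
rects = ss.process()                             # array of (x, y, w, h) boxes
print(len(rects), "proposals; first:", rects[0])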
Steps:
- Region proposal: use the selective search algorithm to draw roughly 2k candidate boxes and feed them to the CNN
- Feature extraction: fine-tune a model pre-trained on ImageNet
- Region classification: train an SVM classifier from scratch to classify the feature vectors produced by the CNN
- Bounding-box regression: use linear regression to refine the box coordinates
Fast R-CNN: R-CNN's drawback is that it requires many forward passes through the network, which makes it inefficient; predicting one image takes 47 seconds. Fast R-CNN is likewise based on region proposals, but, inspired by SPPNet, the convolutional feature extraction is shared across proposals: the whole image is first convolved to extract features, and the proposals generated on the original image are then sampled on those convolutional features. This is the RoI pooling layer (see the sketch below).
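RoI pooling itself is available in torchvision (a minimal sketch; box coordinates are given in original-image scale, and spatial_scale maps them onto the downsampled feature map):
import torch
from torchvision.ops import roi_pool

feat = torch.randn(1, 256, 50, 50)               # conv features of one image
# one proposal: (batch_index, x1, y1, x2, y2) in original-image coordinates
boxes = torch.tensor([[0., 40., 40., 360., 280.]])
# features are 1/16 of the input size -> spatial_scale=1/16; pool to fixed 7x7
pooled = roi_pool(feat, boxes, output_size=(7, 7), spatial_scale=1.0 / 16)
print(pooled.shape)                              # torch.Size([1, 256, 7, 7])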
Faster R-CNN: at test time, Fast R-CNN needs only 0.2 seconds per image for the forward pass, but the bottleneck is the 2 seconds spent generating proposals. Faster R-CNN drops the existing unsupervised proposal algorithms and instead uses a region proposal network to produce proposals from the conv5 features, integrating proposal generation into the network for end-to-end training. Faster R-CNN's test time is 0.2 seconds, close to real time. Later work found that using fewer proposals can speed things up further with little loss in performance.
The region proposal network (RPN) applies two convolutions (3×3 and 1×1) on top of the convolutional features and outputs two branches: one classifies whether each anchor box contains an object (foreground vs. background); the other regresses, for each anchor, the four coordinates of a proposal (x, y, w, h). The RPN essentially continues the sliding-window approach to localization, except that it slides over the convolutional features rather than the original image. Because the feature map is spatially small but has a large receptive field, even a 3×3 window corresponds to a large region of the original image. Faster R-CNN uses 3 scales (128×128, 256×256, 512×512) and 3 aspect ratios (1:1, 1:2, 2:1), i.e., 9 anchor boxes per position; these anchors can exceed the conv5 receptive field. A 1000×600 image yields about 20k anchors.
Why anchor boxes? Anchors are bounding boxes with predefined shapes and sizes. The reasons for using them include: (1) candidate regions vary in size and aspect ratio, and regressing coordinates directly is harder to train than correcting anchor coordinates; (2) the conv5 receptive field is large and may well contain more than one object, so multiple anchors allow predicting several objects within one receptive field; (3) anchors can also be seen as a way to inject prior knowledge into the network: we can choose a set of anchors based on the shapes and sizes that boxes typically take in the data. Anchors are independent of one another; different anchors target different objects, e.g., a tall, thin anchor for a person and a short, wide one for a vehicle. A generation sketch follows this paragraph.
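A minimal sketch of generating those 9 anchors per feature-map position (make_anchors is our own helper; it assumes a feature stride of 16, giving roughly 37×62×9 ≈ 20k anchors for a 1000×600 image):
import numpy as np

def make_anchors(feat_h, feat_w, stride=16,
                 scales=(128, 256, 512), ratios=(1.0, 0.5, 2.0)):
    # 9 base anchors centered at the origin: area ~ scale^2, aspect = ratio
    base = []
    for s in scales:
        for r in ratios:
            w, h = s * np.sqrt(r), s / np.sqrt(r)
            base.append([-w / 2, -h / 2, w / 2, h / 2])
    base = np.array(base)                                  # (9, 4)
    # shift the base anchors to every feature-map cell center
    xs = (np.arange(feat_w) + 0.5) * stride
    ys = (np.arange(feat_h) + 0.5) * stride
    cx, cy = np.meshgrid(xs, ys)
    shifts = np.stack([cx, cy, cx, cy], axis=-1).reshape(-1, 1, 4)
    return (shifts + base).reshape(-1, 4)                  # (x1, y1, x2, y2)

anchors = make_anchors(600 // 16, 1000 // 16)
print(anchors.shape)   # (20646, 4), i.e. about 20k anchors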
- Feature extraction: a fixed-size image is passed through the convolutional layers to produce feature maps
- Proposal generation: the Region Proposal Network (RPN) generates region proposals; a softmax classifies each anchor as foreground or background, and bounding-box regression then corrects the anchors to obtain accurate proposals
- RoI pooling: takes the feature maps and the proposals as input and extracts proposal feature maps
- Classification: the proposal feature maps produced by RoI pooling are fed to a softmax classifier and a bounding-box regressor to obtain the object class and the final, refined box position (an off-the-shelf end-to-end model is sketched after this list)
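The whole pipeline comes pre-assembled in torchvision (a minimal inference sketch; weights are COCO-pretrained, and depending on the torchvision version the flag may be spelled weights=... instead of pretrained=True):
import torch
import torchvision

# Faster R-CNN with a ResNet-50 FPN backbone
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
model.eval()

img = torch.rand(3, 600, 1000)        # dummy image with values in [0, 1]
with torch.no_grad():
    pred = model([img])[0]            # dict with 'boxes', 'labels', 'scores'
print(pred["boxes"].shape, pred["scores"][:5])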
c) Semantic Segmentation
Semantic segmentation requires labeling every pixel in the image rather than merely enclosing objects in rectangular boxes; however, different instances of the same object class need not be separated. As in Figure 1(c), we segment out the bottle, cup, and cube, without distinguishing cube 1, cube 2, and so on.
FCN/UNet/SegNet
d) Instance Segmentation
As in Figure 1(d), instance segmentation is essentially the combination of object detection and semantic segmentation. Compared with detection's bounding boxes, it is accurate to the object's contour; compared with semantic segmentation, it must label the different individuals of the same class in an image (cube 1, cube 2, cube 3, ...).
Mask R-CNN performs object detection with an FPN and adds an extra branch for semantic segmentation (the added segmentation branch does not share parameters with the detection branches), so Mask R-CNN has three output branches (classification, box regression, and segmentation). Its other improvements are: (1) RoI Pooling is improved into RoI Align, which uses bilinear interpolation so that aligning proposals with the convolutional features no longer loses information to quantization; (2) during segmentation, Mask R-CNN decouples class prediction from mask output: each class's mask is handled independently with a sigmoid and a logistic (binary cross-entropy) loss, which works better than the classic approach of letting all classes compete through a softmax (a loss sketch follows).
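A minimal sketch of that decoupled mask loss (the shapes are assumptions: the mask head outputs one m×m mask per class per RoI, and only the ground-truth class's channel is penalized):
import torch
import torch.nn.functional as F

def mask_loss(mask_logits, gt_masks, gt_classes):
    # mask_logits: (N, C, m, m) per-class logits for N RoIs
    # gt_masks:    (N, m, m) binary targets; gt_classes: (N,) GT class per RoI
    n = mask_logits.size(0)
    # pick only each RoI's ground-truth class channel, so classes
    # do not compete the way they would under a softmax
    picked = mask_logits[torch.arange(n), gt_classes]      # (N, m, m)
    return F.binary_cross_entropy_with_logits(picked, gt_masks)

logits = torch.randn(4, 80, 28, 28)                 # 4 RoIs, 80 classes
targets = torch.randint(0, 2, (4, 28, 28)).float()
classes = torch.tensor([3, 17, 3, 55])
print(mask_loss(logits, targets, classes))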
- Mask R-CNN loss function: L = L_cls + L_box + L_mask
- Basic structure: Mask R-CNN uses the same two-stage procedure as Faster R-CNN: an RPN first generates region proposals, and then each RoI is classified, localized, and given a binary mask. This differs from other networks of the time that first found the mask and then classified it.
- RoIAlign's output coordinates are computed by interpolation and are no longer quantized; the value at each grid point is likewise obtained by interpolation rather than taking a max over a quantized bin (see the comparison sketch below).
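torchvision exposes both operators, which makes the quantization difference easy to inspect (a minimal sketch; aligned=True turns on RoIAlign's half-pixel offset correction):
import torch
from torchvision.ops import roi_align, roi_pool

feat = torch.arange(64, dtype=torch.float32).reshape(1, 1, 8, 8)
boxes = torch.tensor([[0., 1.3, 1.3, 5.7, 5.7]])   # fractional coordinates

# roi_pool snaps the box to integer bins and takes a max per bin
p = roi_pool(feat, boxes, output_size=(2, 2))
# roi_align keeps fractional coordinates and bilinearly interpolates samples
a = roi_align(feat, boxes, output_size=(2, 2), sampling_ratio=2, aligned=True)
print(p.squeeze())
print(a.squeeze())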