1. SPPNet
He K, Zhang X, Ren S, et al. Spatial pyramid pooling in deep convolutional networks for visual recognition[J]. IEEE transactions on pattern analysis and machine intelligence, 2015, 37(9): 1904-1916.
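Existing convolutional neural networks generally require input images of a fixed size, for example the commonly used $224 \times 224$. Images that do not match this size must first be preprocessed, for example by cropping or warping. This manual processing can hurt the network's prediction accuracy. To address this and let the model accept images of arbitrary size, the paper proposes spatial pyramid pooling (SPP).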
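Why does a CNN need a fixed input size? The constraint comes from the linear classifier at the end of the model: the classifier operates on the flattened feature map produced by the convolutional layers, so the spatial shape and channel count of that final feature map must be known in advance. Global pooling is one way around this, because a globally pooled feature map always has shape $C \times 1 \times 1$. However, global pooling loses some precision: it is equivalent to a max pool or average pool whose kernel is as large as the feature map itself, so all spatial detail is collapsed into one value per channel.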
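The difference between flattening and global pooling can be checked directly. The sketch below is illustrative only: the single convolution `conv` and the helper names `flattened_size` and `global_pooled_size` are assumptions made for this example, not part of the paper's model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# A single stride-2 convolution standing in for the convolutional backbone.
conv = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=3, stride=2, padding=1)


def flattened_size(height, width):
    """Length of the flattened feature map for one input of the given size."""
    x = torch.randn(1, 3, height, width)
    return torch.flatten(conv(x), start_dim=1).shape[1]


def global_pooled_size(height, width):
    """Length of the feature vector after global max pooling."""
    x = torch.randn(1, 3, height, width)
    pooled = F.adaptive_max_pool2d(conv(x), output_size=1)  # [1, 8, 1, 1]
    return torch.flatten(pooled, start_dim=1).shape[1]


# Flattening gives a length that depends on the input size ...
print(flattened_size(224, 224), flattened_size(180, 180))
# ... while global pooling always gives one value per channel.
print(global_pooled_size(224, 224), global_pooled_size(180, 180))
```

A linear classifier built for the first flattened length cannot accept the second, which is the fixed-size constraint described above.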
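To retain more of the feature map's detail during pooling, SPPNet inserts multi-scale pooling before the classifier, then flattens and concatenates the pooled results from all scales, which yields a feature vector of fixed size.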
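Suppose the last convolutional layer outputs a feature map of size $C \times H \times W$. If we pool it into $4 \times 4$, $2 \times 2$, and $1 \times 1$ grids (16, 4, and 1 bins respectively), we obtain a feature vector of length $(16 + 4 + 1) \times C$.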
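As a quick check of this arithmetic, the helper below computes the length of the SPP feature vector from the channel count and the pyramid levels. The function name `spp_output_dim` is a hypothetical name chosen for this sketch.

```python
def spp_output_dim(channels, levels):
    """Length of the SPP feature vector: each level contributes level * level bins per channel."""
    return channels * sum(level * level for level in levels)


# Levels 4, 2, 1 give 16 + 4 + 1 = 21 bins per channel.
print(spp_output_dim(512, [4, 2, 1]))     # 10752
# Levels 6, 3, 2, 1 give 36 + 9 + 4 + 1 = 50 bins per channel.
print(spp_output_dim(512, [6, 3, 2, 1]))  # 25600
```

The result depends only on $C$ and the chosen levels, not on $H$ or $W$, which is why the classifier that follows can have a fixed input size.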
2. Code implementation
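import math

import torch
import torch.nn as nn


class SPPNet(nn.Module):
    def __init__(self, in_channels, levels=None):
        super(SPPNet, self).__init__()
        # in_channels is not used by the pooling itself (the layer has no parameters).
        if levels is None:
            self.levels = [6, 3, 2, 1]
        else:
            self.levels = levels

    def forward(self, x):
        # x: [batch_size, C, H, W]
        H, W = x.shape[2], x.shape[3]
        ret = []
        for level in self.levels:
            # Kernel (= stride) sized so that `level` windows cover each axis.
            h_kernel = int(math.ceil(H / level))
            w_kernel = int(math.ceil(W / level))
            # Symmetric padding for the shortfall when H or W is not divisible by level.
            # Note: max pooling requires padding <= kernel_size / 2, so some
            # combinations of (H, W) and level are rejected at runtime.
            h_pad = int(math.ceil((h_kernel * level - H) / 2))
            w_pad = int(math.ceil((w_kernel * level - W) / 2))
            maxpool = nn.MaxPool2d(kernel_size=(h_kernel, w_kernel),
                                   stride=(h_kernel, w_kernel),
                                   padding=(h_pad, w_pad))
            # [batch_size, C, level, level] -> [batch_size, C, level * level]
            ret.append(torch.flatten(maxpool(x), start_dim=2))
        # Concatenate all levels per channel, then flatten to
        # [batch_size, C * sum(level ** 2)].
        return torch.flatten(torch.cat(ret, dim=-1), start_dim=1)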
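The kernel and padding arithmetic above does not work for every feature-map size. For a $13 \times 13$ map and level 6, for example, it yields a kernel of 3 and a padding of 3, and PyTorch's max pooling rejects padding larger than half the kernel size. One possible alternative, sketched below under that assumption, is to let `torch.nn.functional.adaptive_max_pool2d` choose the windows. The class name `AdaptiveSPP` is hypothetical, and because its window boundaries differ from the formula above, its outputs are not numerically identical to those of `SPPNet`.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AdaptiveSPP(nn.Module):
    """Hypothetical variant of the SPP layer built on adaptive max pooling."""

    def __init__(self, levels=(6, 3, 2, 1)):
        super().__init__()
        self.levels = tuple(levels)

    def forward(self, x):
        # x: [batch_size, C, H, W]
        pooled = []
        for level in self.levels:
            # adaptive_max_pool2d picks the window boundaries itself and
            # always returns a [batch_size, C, level, level] tensor.
            out = F.adaptive_max_pool2d(x, output_size=level)
            pooled.append(torch.flatten(out, start_dim=2))
        # [batch_size, C, sum(level ** 2)] -> [batch_size, C * sum(level ** 2)]
        return torch.flatten(torch.cat(pooled, dim=-1), start_dim=1)


spp = AdaptiveSPP()
out_a = spp(torch.randn(2, 512, 13, 13))
out_b = spp(torch.randn(2, 512, 28, 20))
print(out_a.shape, out_b.shape)  # both have 512 * 50 = 25600 features per sample
```

The concatenation order (all levels grouped per channel) is kept the same as in `SPPNet`, so the layer can be swapped in without changing the classifier's input size.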
We embed this module in a custom convolutional model:
import torch.nn as nn


class ConvNet(nn.Module):
    def __init__(self, num_classes=10, levels=None):
        super(ConvNet, self).__init__()
        if levels is None:
            levels = [6, 3, 2, 1]
        # Number of SPP bins per channel, as a plain int so it can be used as a layer size.
        classifier_in = sum(level ** 2 for level in levels)

        # Three stride-2 convolutions downsample the single-channel input by 8x.
        self.conv1 = nn.Conv2d(in_channels=1, out_channels=64, kernel_size=3,
                               stride=2, padding=1)
        self.bn1 = nn.BatchNorm2d(num_features=64)
        self.relu1 = nn.ReLU(inplace=True)
        self.conv2 = nn.Conv2d(in_channels=64, out_channels=128, kernel_size=3,
                               stride=2, padding=1)
        self.bn2 = nn.BatchNorm2d(num_features=128)
        self.relu2 = nn.ReLU(inplace=True)
        self.conv3 = nn.Conv2d(in_channels=128, out_channels=256, kernel_size=3,
                               stride=2, padding=1)
        self.bn3 = nn.BatchNorm2d(num_features=256)
        self.relu3 = nn.ReLU(inplace=True)
        # The last convolution keeps the spatial size and widens to 512 channels.
        self.conv4 = nn.Conv2d(in_channels=256, out_channels=512, kernel_size=3,
                               stride=1, padding=1)
        self.bn4 = nn.BatchNorm2d(num_features=512)
        self.relu4 = nn.ReLU(inplace=True)

        # SPP turns the [B, 512, H, W] feature map into a [B, 512 * classifier_in] vector.
        self.spp = SPPNet(in_channels=512, levels=levels)
        self.relu5 = nn.ReLU(inplace=True)
        self.classifier = nn.Linear(in_features=classifier_in * 512,
                                    out_features=num_classes)
        self._init_params()

    def _init_params(self):
        # Kaiming-normal initialization for all convolution weights.
        for name, module in self.named_modules():
            if isinstance(module, nn.Conv2d):
                nn.init.kaiming_normal_(module.weight)

    def forward(self, x):
        x = self.relu1(self.bn1(self.conv1(x)))
        x = self.relu2(self.bn2(self.conv2(x)))
        x = self.relu3(self.bn3(self.conv3(x)))
        x = self.relu4(self.bn4(self.conv4(x)))
        # Max pooling of ReLU outputs is already non-negative, so relu5 leaves values unchanged.
        x = self.relu5(self.spp(x))
        x = self.classifier(x)
        return x
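To illustrate the end-to-end effect, the sketch below feeds inputs of two different resolutions through one small model and checks that both produce logits of the same shape. It is a deliberately reduced stand-in rather than the `ConvNet` above: the class name `TinySPPClassifier`, its single convolution, and the use of adaptive max pooling for the pyramid are all assumptions made to keep the example short and self-contained.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinySPPClassifier(nn.Module):
    """Hypothetical one-convolution model with an SPP stage before the classifier."""

    def __init__(self, num_classes=10, levels=(2, 1)):
        super().__init__()
        self.levels = tuple(levels)
        self.conv = nn.Conv2d(in_channels=1, out_channels=16, kernel_size=3,
                              stride=2, padding=1)
        bins = sum(level * level for level in self.levels)
        self.classifier = nn.Linear(16 * bins, num_classes)

    def forward(self, x):
        x = F.relu(self.conv(x))
        # Pool to each pyramid level and flatten to [batch_size, 16 * level * level].
        pooled = [torch.flatten(F.adaptive_max_pool2d(x, level), start_dim=1)
                  for level in self.levels]
        return self.classifier(torch.cat(pooled, dim=1))


model = TinySPPClassifier().eval()
with torch.no_grad():
    small = model(torch.randn(4, 1, 28, 28))
    large = model(torch.randn(4, 1, 64, 48))
print(small.shape, large.shape)  # both are [4, 10]
```

Inputs within one batch still need a common size to be stacked into a tensor; the flexibility shown here is across batches.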