The defining feature of RetinaNet is its use of Focal Loss. The model is about 80 MB in size.
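For reference, the focal loss on sigmoid outputs is FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t). Below is a minimal sketch of the binary form (alpha=0.25 and gamma=2 are the paper's defaults; reducing with a plain mean over anchors is a simplification, the real implementation normalizes by the number of positive anchors):
import torch

def focal_loss(pred, target, alpha=0.25, gamma=2.0):
    # pred: sigmoid scores in (0, 1); target: 0/1 labels with the same shape
    pred = pred.clamp(min=1e-4, max=1.0 - 1e-4)
    p_t = torch.where(target == 1, pred, 1.0 - pred)
    alpha_t = torch.where(target == 1, torch.full_like(pred, alpha), torch.full_like(pred, 1.0 - alpha))
    # (1 - p_t) ** gamma down-weights easy, well-classified examples
    return -(alpha_t * (1.0 - p_t) ** gamma * torch.log(p_t)).mean()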
Data
Loading
The data I use is still CrowdHuman, organized into a VOC-style layout: the image paths go into a json file and the annotations go into a separate annotation file. The model does not define a background class, so if it only detects the pedestrian class the class index is 0; with two classes the indices are 0 and 1.
annotations = np.zeros((0, 5))
for box in boxes:
    annotation = np.zeros((1, 5))
    annotation[0, :4] = box  # x1 y1 x2 y2
    annotation[0, 4] = 0
    annotations = np.append(annotations, annotation, axis=0)
Each image has one annotations array. The number of rows equals the number of objects, the first four columns are (xmin, ymin, xmax, ymax), and the fifth column is the class index. There is no background class, so the first class starts at index 0.
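For example, an image containing two pedestrians (class index 0) would end up with an array like the following (the coordinates are made up for illustration):
annotations = np.array([[ 10.,  20., 110., 220., 0.],    # person 1: xmin, ymin, xmax, ymax, class 0
                        [200.,  40., 260., 200., 0.]])   # person 2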
Preprocessing
Training-time processing
- When reading an image, divide every pixel value by 255:
img = img.astype(np.float32)/255.0
- Normalize each image with the channel mean and std:
(image.astype(np.float32)-self.mean)/self.std
- Resize and padding
Fix the shorter side of the image to 608 and scale the other side by the same factor; then pad the height and width so that each becomes a multiple of 32.
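A minimal sketch of the resize step (cv2-based; the target of 608 for the shorter side follows the description above, the rest is an assumption). The padding applied afterwards is shown right below it:
import cv2
import numpy as np

def resize_shorter_side(image, target=608):
    rows, cols, _ = image.shape
    scale = target / min(rows, cols)                                        # shorter side becomes 608
    image = cv2.resize(image, (int(round(cols * scale)), int(round(rows * scale))))
    return image.astype(np.float32), scale                                   # scale is also needed to map boxes back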
pad_h = 32 - rows % 32
pad_w = 32 - cols % 32
# make the shape of new_image an integral multiple of 32
new_image = np.zeros((rows + pad_h, cols + pad_w, cns)).astype(np.float32)
new_image[:rows, :cols, :] = image.astype(np.float32)
The width and height must be multiples of 32 during training because the backbone contains five downsampling stages, the outputs of the last three stages are fed into the FPN, and the FPN upsamples for feature fusion with
nn.Upsample(scale_factor=2, mode='nearest')
With this kind of upsampling the output size is always exactly twice the input (and therefore even), so the FPN feature map it is added to must have exactly that size as well; otherwise the element-wise addition at corresponding positions is impossible.
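A quick way to see why a multiple of 32 matters, assuming each of the five downsampling stages rounds up (which is what a stride-2 conv with padding=1 and the stride-2 max-pool both do):
import math

def feature_sizes(side, stages=5):
    sizes = []
    for _ in range(stages):
        side = math.ceil(side / 2)   # each stride-2 stage roughly halves the side, rounding up
        sizes.append(side)
    return sizes

print(feature_sizes(600))  # [300, 150, 75, 38, 19] -> upsampling the 38 gives 76, but C3 is 75: mismatch
print(feature_sizes(608))  # [304, 152, 76, 38, 19] -> C3/C4/C5 are exact doubles, the additions line up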
- collater
When assembling each training batch, the images should have similar aspect ratios, so the images are sorted by the aspect ratio they have after the padding in the previous step.
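A minimal sketch of that grouping (dataset, batch_size, and the image_aspect_ratio helper are assumed names; any per-image aspect-ratio lookup works):
order = sorted(range(len(dataset)), key=lambda i: dataset.image_aspect_ratio(i))
batches = [order[i:i + batch_size] for i in range(0, len(order), batch_size)]  # consecutive batches share similar aspect ratios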
Once a batch has been selected, the images are padded again, this time to the largest height and the largest width found in that batch:
max_width = np.array(widths).max()
max_height = np.array(heights).max()
padded_imgs = torch.zeros(batch_size, max_width, max_height, 3)
for i in range(batch_size):
    img = imgs[i]
    padded_imgs[i, :int(img.shape[0]), :int(img.shape[1]), :] = img
Finally the dimension order is adjusted to channel-first (NCHW):
padded_imgs = padded_imgs.permute(0, 3, 1, 2)
Test-time processing
At test time the original image can be fed in directly, without resizing:
import numpy as np
import torch

def preprocess(img):
    img = img.astype(np.float32) / 255.0
    mean = np.array([[[0.485, 0.456, 0.406]]])
    std = np.array([[[0.229, 0.224, 0.225]]])
    img = (img - mean) / std
    rows, cols, cns = img.shape
    pad_h = 32 - rows % 32
    pad_w = 32 - cols % 32
    # Pad both sides up to a multiple of 32 so that the feature map produced by every
    # stride-2 stage of the backbone has an even size. The FPN upsamples with
    # nn.Upsample(scale_factor=2), whose output is always exactly twice its input,
    # so the maps can only be added element-wise if the sizes stay in exact 2x ratios.
    # (The RetinaFace project skips this padding: it upsamples with F.interpolate to the
    # exact target size, and its stride-2 convs nn.Conv2d(in, out, kernel_size=3,
    # stride=2, padding=1) give the same output size for odd or even inputs,
    # e.g. both 10 and 9 map to 5.)
    new_image = np.zeros((rows + pad_h, cols + pad_w, cns)).astype(np.float32)
    new_image[:rows, :cols, :] = img.astype(np.float32)
    input = torch.from_numpy(new_image).unsqueeze(0).permute(0, 3, 1, 2)
    return input
The important step at test time is this padding: making both sides multiples of 32 guarantees that both height and width stay evenly divisible by 2 through the following five downsampling stages.
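Putting it together, a minimal inference sketch (the model variable, the image path, and the 0.5 score threshold are all assumptions; the three return values follow the nms step shown at the end of this note):
import cv2
import torch

img = cv2.imread('demo.jpg')                          # hypothetical test image
img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)            # convert to RGB if training used RGB images
with torch.no_grad():
    scores, labels, boxes = model(preprocess(img).float().cuda())
keep = scores > 0.5                                   # confidence threshold, tune as needed
print(boxes[keep], labels[keep])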
Model
backbone
The backbone is an 18-layer residual network (ResNet-18); its basic building block is:
def conv3x3(in_planes, out_planes, stride=1):
    """3x3 convolution with padding"""
    return nn.Conv2d(in_planes, out_planes, kernel_size=3, stride=stride, padding=1, bias=False)

class BasicBlock(nn.Module):
    expansion = 1

    def __init__(self, inplanes, planes, stride=1, downsample=None):
        super(BasicBlock, self).__init__()
        self.conv1 = conv3x3(inplanes, planes, stride)
        self.bn1 = nn.BatchNorm2d(planes)
        self.relu = nn.ReLU(inplace=True)
        self.conv2 = conv3x3(planes, planes)
        self.bn2 = nn.BatchNorm2d(planes)
        self.downsample = downsample
        self.stride = stride

    def forward(self, x):
        residual = x
        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)
        out = self.conv2(out)
        out = self.bn2(out)
        if self.downsample is not None:
            residual = self.downsample(x)
        out += residual
        out = self.relu(out)
        return out
BasicBlock contains two conv layers. When the first conv layer has stride=2 it also performs the downsampling, and in that case downsample is not None; when its stride is 1, no downsample branch is needed. Throughout the backbone, downsampling is essentially implemented by giving an operation a stride of 2, for example:
self.maxpool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)
or
nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False)
nn.Conv2d(in_planes, out_planes, kernel_size=3, stride=2,padding=1, bias=False)
The downsample branch is:
nn.Conv2d(inplanes, planes ,kernel_size=1, stride=2, bias=False)
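For context, downsample is typically constructed inside a _make_layer method of the ResNet module, in the standard torchvision style. This is a sketch rather than this project's exact code (torchvision also adds a BatchNorm after the 1x1 conv, which the line above omits):
def _make_layer(self, block, planes, blocks, stride=1):
    downsample = None
    if stride != 1 or self.inplanes != planes * block.expansion:
        # 1x1 conv (stride 2 when the stage downsamples) so the identity branch
        # matches the main branch in spatial size and channel count
        downsample = nn.Conv2d(self.inplanes, planes * block.expansion, kernel_size=1, stride=stride, bias=False)
    layers = [block(self.inplanes, planes, stride, downsample)]
    self.inplanes = planes * block.expansion
    for _ in range(1, blocks):
        layers.append(block(self.inplanes, planes))
    return nn.Sequential(*layers)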
FPN
The FPN has five levels; each level is half the size of the previous one.
class PyramidFeatures(nn.Module):
    def __init__(self, C3_size, C4_size, C5_size, feature_size=256):
        super(PyramidFeatures, self).__init__()

        # upsample C5 to get P5 from the FPN paper
        self.P5_1 = nn.Conv2d(C5_size, feature_size, kernel_size=1, stride=1, padding=0)
        self.P5_upsampled = nn.Upsample(scale_factor=2, mode='nearest')
        self.P5_2 = nn.Conv2d(feature_size, feature_size, kernel_size=3, stride=1, padding=1)

        # add P5 elementwise to C4
        self.P4_1 = nn.Conv2d(C4_size, feature_size, kernel_size=1, stride=1, padding=0)
        self.P4_upsampled = nn.Upsample(scale_factor=2, mode='nearest')
        self.P4_2 = nn.Conv2d(feature_size, feature_size, kernel_size=3, stride=1, padding=1)

        # add P4 elementwise to C3
        self.P3_1 = nn.Conv2d(C3_size, feature_size, kernel_size=1, stride=1, padding=0)
        self.P3_2 = nn.Conv2d(feature_size, feature_size, kernel_size=3, stride=1, padding=1)

        # "P6 is obtained via a 3x3 stride-2 conv on C5"
        self.P6 = nn.Conv2d(C5_size, feature_size, kernel_size=3, stride=2, padding=1)

        # "P7 is computed by applying ReLU followed by a 3x3 stride-2 conv on P6"
        self.P7_1 = nn.ReLU()
        self.P7_2 = nn.Conv2d(feature_size, feature_size, kernel_size=3, stride=2, padding=1)

    def forward(self, inputs):
        C3, C4, C5 = inputs

        P5_x = self.P5_1(C5)
        P5_upsampled_x = self.P5_upsampled(P5_x)
        P5_x = self.P5_2(P5_x)

        P4_x = self.P4_1(C4)
        P4_x = P5_upsampled_x + P4_x
        P4_upsampled_x = self.P4_upsampled(P4_x)
        P4_x = self.P4_2(P4_x)

        P3_x = self.P3_1(C3)
        P3_x = P3_x + P4_upsampled_x
        P3_x = self.P3_2(P3_x)

        P6_x = self.P6(C5)

        P7_x = self.P7_1(P6_x)
        P7_x = self.P7_2(P7_x)

        return [P3_x, P4_x, P5_x, P6_x, P7_x]
Its upsampling is done with
nn.Upsample(scale_factor=2, mode='nearest')
so the output is always exactly twice the input size; the lower-level feature map it is added to must therefore have that same size, which is why the padding during data preprocessing was needed in the first place. If, however, the upsampling is replaced with
F.interpolate(output3, size=[output2.size(2), output2.size(3)], mode="nearest")
then that kind of padding is no longer necessary.
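A quick comparison of the two upsampling choices on an odd-sized feature map (the shapes are illustrative):
import torch
import torch.nn as nn
import torch.nn.functional as F

c3 = torch.randn(1, 256, 75, 100)                                             # no padding was done, so one side is odd
p4 = torch.randn(1, 256, 38, 50)                                              # one stride-2 stage later (75 rounds up to 38)
fixed = nn.Upsample(scale_factor=2, mode='nearest')(p4)                       # 76 x 100: cannot be added to 75 x 100
exact = F.interpolate(p4, size=(c3.size(2), c3.size(3)), mode='nearest')      # 75 x 100: matches c3 exactly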
head
Since the classification does not include a background class, the classification head passes its outputs through a sigmoid; an anchor is assigned to a class whenever that class's score exceeds a chosen threshold.
class ClassificationModelReduce(nn.Module):
    def __init__(self, num_features_in, num_anchors=9, num_classes=80, prior=0.01, feature_size=256):
        super(ClassificationModelReduce, self).__init__()
        self.num_classes = num_classes
        self.num_anchors = num_anchors
        self.output = nn.Conv2d(num_features_in, num_anchors * num_classes, kernel_size=3, padding=1)
        self.output_act = nn.Sigmoid()

    def forward(self, x):
        out = self.output(x)
        out = self.output_act(out)
        # out is B x C x W x H, with C = n_anchors * n_classes
        out1 = out.permute(0, 2, 3, 1)
        batch_size, width, height, channels = out1.shape
        out2 = out1.view(batch_size, width, height, self.num_anchors, self.num_classes)
        return out2.contiguous().view(x.shape[0], -1, self.num_classes)
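A quick shape check for the single-class (pedestrian) setting described above:
head = ClassificationModelReduce(num_features_in=256, num_classes=1)
p3 = torch.randn(2, 256, 76, 80)   # e.g. the P3 feature map of a padded 608 x 640 input
print(head(p3).shape)              # torch.Size([2, 54720, 1]): 76 * 80 * 9 anchors, one sigmoid score each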
anchor
class Anchors(nn.Module):
    def __init__(self, pyramid_levels=None, strides=None, sizes=None, ratios=None, scales=None):
        super(Anchors, self).__init__()
        if pyramid_levels is None:
            self.pyramid_levels = [3, 4, 5, 6, 7]
        if strides is None:
            self.strides = [2 ** x for x in self.pyramid_levels]
        if sizes is None:
            self.sizes = [2 ** (x + 2) for x in self.pyramid_levels]
        if ratios is None:
            self.ratios = np.array([0.5, 1, 2])
        if scales is None:
            self.scales = np.array([2 ** 0, 2 ** (1.0 / 3.0), 2 ** (2.0 / 3.0)])

    def forward(self, image):
        image_shape = image.shape[2:]
        image_shape = np.array(image_shape)
        image_shapes = [(image_shape + 2 ** x - 1) // (2 ** x) for x in self.pyramid_levels]

        # compute anchors over all pyramid levels
        all_anchors = np.zeros((0, 4)).astype(np.float32)
        for idx, p in enumerate(self.pyramid_levels):
            anchors = generate_anchors(base_size=self.sizes[idx], ratios=self.ratios, scales=self.scales)
            shifted_anchors = shift(image_shapes[idx], self.strides[idx], anchors)
            all_anchors = np.append(all_anchors, shifted_anchors, axis=0)

        all_anchors = np.expand_dims(all_anchors, axis=0)
        return torch.from_numpy(all_anchors.astype(np.float32)).cuda()
def generate_anchors(base_size=16, ratios=None, scales=None):
    if ratios is None:
        ratios = np.array([0.5, 1, 2])
    if scales is None:
        scales = np.array([2 ** 0, 2 ** (1.0 / 3.0), 2 ** (2.0 / 3.0)])
    num_anchors = len(ratios) * len(scales)

    # initialize output anchors
    anchors = np.zeros((num_anchors, 4))

    # scale base_size (np.tile repeats the scales once per ratio)
    anchors[:, 2:] = base_size * np.tile(scales, (2, len(ratios))).T

    # compute areas of anchors
    areas = anchors[:, 2] * anchors[:, 3]

    # correct for ratios
    anchors[:, 2] = np.sqrt(areas / np.repeat(ratios, len(scales)))  # w
    anchors[:, 3] = anchors[:, 2] * np.repeat(ratios, len(scales))   # h

    # transform from (x_ctr, y_ctr, w, h) -> (x1, y1, x2, y2)
    anchors[:, 0::2] -= np.tile(anchors[:, 2] * 0.5, (2, 1)).T
    anchors[:, 1::2] -= np.tile(anchors[:, 3] * 0.5, (2, 1)).T

    return anchors
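A quick check of what this produces at the P3 level (base_size = 2 ** (3 + 2) = 32); the printed values are rounded:
a = generate_anchors(base_size=32)
print(a.shape)   # (9, 4): 3 ratios x 3 scales, as (x1, y1, x2, y2) centered on the origin
print(a[0])      # approx. [-22.6, -11.3, 22.6, 11.3]: ratio 0.5 (h/w), scale 2 ** 0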
def shift(shape, stride, anchors):
    shift_x = (np.arange(0, shape[1]) + 0.5) * stride
    shift_y = (np.arange(0, shape[0]) + 0.5) * stride
    shift_x, shift_y = np.meshgrid(shift_x, shift_y)
    shifts = np.vstack((
        shift_x.ravel(), shift_y.ravel(),
        shift_x.ravel(), shift_y.ravel()
    )).transpose()

    # add A anchors (1, A, 4) to
    # cell K shifts (K, 1, 4) to get
    # shift anchors (K, A, 4)
    # reshape to (K*A, 4) shifted anchors
    A = anchors.shape[0]
    K = shifts.shape[0]
    all_anchors = (anchors.reshape((1, A, 4)) + shifts.reshape((1, K, 4)).transpose((1, 0, 2)))
    all_anchors = all_anchors.reshape((K * A, 4))

    return all_anchors
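Putting generate_anchors and shift together through Anchors.forward, a quick count for a padded 608 x 640 input (this assumes a GPU is available, since forward calls .cuda()):
anchor_gen = Anchors()
all_anchors = anchor_gen(torch.zeros(1, 3, 608, 640))
print(all_anchors.shape)   # torch.Size([1, 72945, 4]): 9 anchors per cell over P3..P7 (76x80 + 38x40 + 19x20 + 10x10 + 5x5 cells)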
nms
from torchvision.ops import nms
# classification.shape = (batch, num_anchors, num_classes); regression.shape = (batch, num_anchors, 4)
transformed_anchors = self.regressBoxes(anchors, regression)
transformed_anchors = self.clipBoxes(transformed_anchors, img_batch)
scores = torch.max(classification, dim=2, keepdim=True)[0]
scores_over_thresh = (scores > 0.05)[0, :, 0]
classification = classification[:, scores_over_thresh, :]
transformed_anchors = transformed_anchors[:, scores_over_thresh, :]
scores = scores[:, scores_over_thresh, :]
anchors_nms_idx = nms(transformed_anchors[0,:,:], scores[0,:,0], 0.5)
nms_scores, nms_class = classification[0, anchors_nms_idx, :].max(dim=1)
return [nms_scores, nms_class, transformed_anchors[0, anchors_nms_idx, :]]
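For reference, regressBoxes at the top of this snippet decodes the predicted offsets relative to the anchors. Below is a minimal sketch of the standard decoding; the mean/std normalization constants are the common defaults and are an assumption here:
def decode_boxes(anchors, deltas, mean=(0.0, 0.0, 0.0, 0.0), std=(0.1, 0.1, 0.2, 0.2)):
    # anchors, deltas: (batch, num_anchors, 4); anchors are (x1, y1, x2, y2)
    widths = anchors[:, :, 2] - anchors[:, :, 0]
    heights = anchors[:, :, 3] - anchors[:, :, 1]
    ctr_x = anchors[:, :, 0] + 0.5 * widths
    ctr_y = anchors[:, :, 1] + 0.5 * heights
    dx = deltas[:, :, 0] * std[0] + mean[0]
    dy = deltas[:, :, 1] * std[1] + mean[1]
    dw = deltas[:, :, 2] * std[2] + mean[2]
    dh = deltas[:, :, 3] * std[3] + mean[3]
    pred_ctr_x = ctr_x + dx * widths
    pred_ctr_y = ctr_y + dy * heights
    pred_w = torch.exp(dw) * widths
    pred_h = torch.exp(dh) * heights
    # back to (x1, y1, x2, y2)
    return torch.stack([pred_ctr_x - 0.5 * pred_w, pred_ctr_y - 0.5 * pred_h,
                        pred_ctr_x + 0.5 * pred_w, pred_ctr_y + 0.5 * pred_h], dim=2)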