The defining feature of RetinaNet is its use of Focal Loss. The model is about 80 MB in size.
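For reference, the focal loss on sigmoid outputs is FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t). Below is a minimal sketch of the binary form (alpha=0.25 and gamma=2 are the paper's defaults; reducing with a plain mean over anchors is a simplification, the real implementation normalizes by the number of positive anchors):
import torch

def focal_loss(pred, target, alpha=0.25, gamma=2.0):
    # pred: sigmoid scores in (0, 1); target: 0/1 labels with the same shape
    pred = pred.clamp(min=1e-4, max=1.0 - 1e-4)
    p_t = torch.where(target == 1, pred, 1.0 - pred)
    alpha_t = torch.where(target == 1, torch.full_like(pred, alpha), torch.full_like(pred, 1.0 - alpha))
    # (1 - p_t) ** gamma down-weights easy, well-classified examples
    return -(alpha_t * (1.0 - p_t) ** gamma * torch.log(p_t)).mean()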
Data
Loading
The data I use is still CrowdHuman, organized into a VOC-style layout: the image paths go into a json file and the annotations go into a separate annotation file. The model does not define a background class, so if it only detects the pedestrian class the class index is 0; with two classes the indices are 0 and 1.
annotations = np.zeros((0, 5))
for box in boxes:
    annotation = np.zeros((1, 5))
    annotation[0, :4] = box  # x1 y1 x2 y2
    annotation[0, 4] = 0
    annotations = np.append(annotations, annotation, axis=0)
Each image has one annotations array. The number of rows equals the number of objects, the first four columns are (xmin, ymin, xmax, ymax), and the fifth column is the class index. There is no background class, so the first class starts at index 0.
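For example, an image containing two pedestrians (class index 0) would end up with an array like the following (the coordinates are made up for illustration):
annotations = np.array([[ 10.,  20., 110., 220., 0.],    # person 1: xmin, ymin, xmax, ymax, class 0
                        [200.,  40., 260., 200., 0.]])   # person 2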
Preprocessing
Training-time processing
- When reading an image, divide every pixel value by 255:
img = img.astype(np.float32)/255.0
- Normalize each image with the channel mean and std:
(image.astype(np.float32)-self.mean)/self.std
- Resize and padding
Fix the shorter side of the image to 608 and scale the other side by the same factor; then pad the height and width so that each becomes a multiple of 32.
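A minimal sketch of the resize step (cv2-based; the target of 608 for the shorter side follows the description above, the rest is an assumption). The padding applied afterwards is shown right below it:
import cv2
import numpy as np

def resize_shorter_side(image, target=608):
    rows, cols, _ = image.shape
    scale = target / min(rows, cols)                                        # shorter side becomes 608
    image = cv2.resize(image, (int(round(cols * scale)), int(round(rows * scale))))
    return image.astype(np.float32), scale                                   # scale is also needed to map boxes back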
pad_h = 32 - rows % 32
pad_w = 32 - cols % 32
# make the shape of new_image an integral multiple of 32
new_image = np.zeros((rows + pad_h, cols + pad_w, cns)).astype(np.float32)
new_image[:rows, :cols, :] = image.astype(np.float32)
The width and height must be multiples of 32 during training because the backbone contains five downsampling stages, the outputs of the last three stages are fed into the FPN, and the FPN upsamples for feature fusion with
nn.Upsample(scale_factor=2, mode='nearest')
With this kind of upsampling the output size is always exactly twice the input (and therefore even), so the FPN feature map it is added to must have exactly that size as well; otherwise the element-wise addition at corresponding positions is impossible.
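A quick way to see why a multiple of 32 matters, assuming each of the five downsampling stages rounds up (which is what a stride-2 conv with padding=1 and the stride-2 max-pool both do):
import math

def feature_sizes(side, stages=5):
    sizes = []
    for _ in range(stages):
        side = math.ceil(side / 2)   # each stride-2 stage roughly halves the side, rounding up
        sizes.append(side)
    return sizes

print(feature_sizes(600))  # [300, 150, 75, 38, 19] -> upsampling the 38 gives 76, but C3 is 75: mismatch
print(feature_sizes(608))  # [304, 152, 76, 38, 19] -> C3/C4/C5 are exact doubles, the additions line up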
- collater
When assembling each training batch, the images should have similar aspect ratios, so the images are sorted by the aspect ratio they have after the padding in the previous step.
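A minimal sketch of that grouping (dataset, batch_size, and the image_aspect_ratio helper are assumed names; any per-image aspect-ratio lookup works):
order = sorted(range(len(dataset)), key=lambda i: dataset.image_aspect_ratio(i))
batches = [order[i:i + batch_size] for i in range(0, len(order), batch_size)]  # consecutive batches share similar aspect ratios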
Once a batch has been selected, the images are padded again, this time to the largest height and the largest width found in that batch:
max_width = np.array(widths).max()
max_height = np.array(heights).max()
padded_imgs = torch.zeros(batch_size, max_width, max_height, 3)
for i in range(batch_size):
    img = imgs[i]
    padded_imgs[i, :int(img.shape[0]), :int(img.shape[1]), :] = img
Finally the dimension order is adjusted to channel-first (NCHW):
padded_imgs = padded_imgs.permute(0, 3, 1, 2)
Test-time processing
At test time the original image can be fed in directly, without resizing:
import numpy as np
import torch

def preprocess(img):
    img = img.astype(np.float32) / 255.0
    mean = np.array([[[0.485, 0.456, 0.406]]])
    std = np.array([[[0.229, 0.224, 0.225]]])
    img = (img - mean) / std
    rows, cols, cns = img.shape
    pad_h = 32 - rows % 32
    pad_w = 32 - cols % 32
    # Pad both sides up to a multiple of 32 so that the feature map produced by every
    # stride-2 stage of the backbone has an even size. The FPN upsamples with
    # nn.Upsample(scale_factor=2), whose output is always exactly twice its input,
    # so the maps can only be added element-wise if the sizes stay in exact 2x ratios.
    # (The RetinaFace project skips this padding: it upsamples with F.interpolate to the
    # exact target size, and its stride-2 convs nn.Conv2d(in, out, kernel_size=3,
    # stride=2, padding=1) give the same output size for odd or even inputs,
    # e.g. both 10 and 9 map to 5.)
    new_image = np.zeros((rows + pad_h, cols + pad_w, cns)).astype(np.float32)
    new_image[:rows, :cols, :] = img.astype(np.float32)
    input = torch.from_numpy(new_image).unsqueeze(0).permute(0, 3, 1, 2)
    return input
The important step at test time is this padding: making both sides multiples of 32 guarantees that both height and width stay evenly divisible by 2 through the following five downsampling stages.
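Putting it together, a minimal inference sketch (the model variable, the image path, and the 0.5 score threshold are all assumptions; the three return values follow the nms step shown at the end of this note):
import cv2
import torch

img = cv2.imread('demo.jpg')                          # hypothetical test image
img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)            # convert to RGB if training used RGB images
with torch.no_grad():
    scores, labels, boxes = model(preprocess(img).float().cuda())
keep = scores > 0.5                                   # confidence threshold, tune as needed
print(boxes[keep], labels[keep])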
Model
backbone
The backbone is an 18-layer residual network (ResNet-18); its basic building block is:
def conv3x3(in_planes, out_planes, stride=1):
    """3x3 convolution with padding"""
    return nn.Conv2d(in_planes, out_planes, kernel_size=3, stride=stride, padding=1, bias=False)

class BasicBlock(nn.Module):
    expansion = 1

    def __init__(self, inplanes, planes, stride=1, downsample=None):
        super(BasicBlock, self).__init__()
        self.conv1 = conv3x3(inplanes, planes, stride)
        self.bn1 = nn.BatchNorm2d(planes)
        self.relu = nn.ReLU(inplace=True)
        self.conv2 = conv3x3(planes, planes)
        self.bn2 = nn.BatchNorm2d(planes)
        self.downsample = downsample
        self.stride = stride

    def forward(self, x):
        residual = x
        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)
        out = self.conv2(out)
        out = self.bn2(out)
        if self.downsample is not None:
            residual = self.downsample(x)
        out += residual
        out = self.relu(out)
        return out
BasicBlock contains two conv layers. When the first conv layer has stride=2 it also performs the downsampling, and in that case downsample is not None; when its stride is 1, no downsample branch is needed. Throughout the backbone, downsampling is essentially implemented by giving an operation a stride of 2, for example:
self.maxpool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)
or
nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False)
nn.Conv2d(in_planes, out_planes, kernel_size=3, stride=2,padding=1, bias=False)
The downsample branch is:
nn.Conv2d(inplanes, planes ,kernel_size=1, stride=2, bias=False)
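For context, downsample is typically constructed inside a _make_layer method of the ResNet module, in the standard torchvision style. This is a sketch rather than this project's exact code (torchvision also adds a BatchNorm after the 1x1 conv, which the line above omits):
def _make_layer(self, block, planes, blocks, stride=1):
    downsample = None
    if stride != 1 or self.inplanes != planes * block.expansion:
        # 1x1 conv (stride 2 when the stage downsamples) so the identity branch
        # matches the main branch in spatial size and channel count
        downsample = nn.Conv2d(self.inplanes, planes * block.expansion, kernel_size=1, stride=stride, bias=False)
    layers = [block(self.inplanes, planes, stride, downsample)]
    self.inplanes = planes * block.expansion
    for _ in range(1, blocks):
        layers.append(block(self.inplanes, planes))
    return nn.Sequential(*layers)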
FPN
The FPN has five levels; each level is half the size of the previous one.
class PyramidFeatures(nn.Module):
    def __init__(self, C3_size, C4_size, C5_size, feature_size=256):
        super(PyramidFeatures, self).__init__()

        # upsample C5 to get P5 from the FPN paper
        self.P5_1 = nn.Conv2d(C5_size, feature_size, kernel_size=1, stride=1, padding=0)
        self.P5_upsampled = nn.Upsample(scale_factor=2, mode='nearest')
        self.P5_2 = nn.Conv2d(feature_size, feature_size, kernel_size=3, stride=1, padding=1)

        # add P5 elementwise to C4
        self.P4_1 = nn.Conv2d(C4_size, feature_size, kernel_size=1, stride=1, padding=0)
        self.P4_upsampled = nn.Upsample(scale_factor=2, mode='nearest')
        self.P4_2 = nn.Conv2d(feature_size, feature_size, kernel_size=3, stride=1, padding=1)

        # add P4 elementwise to C3
        self.P3_1 = nn.Conv2d(C3_size, feature_size, kernel_size=1, stride=1, padding=0)
        self.P3_2 = nn.Conv2d(feature_size, feature_size, kernel_size=3, stride=1, padding=1)

        # "P6 is obtained via a 3x3 stride-2 conv on C5"
        self.P6 = nn.Conv2d(C5_size, feature_size, kernel_size=3, stride=2, padding=1)

        # "P7 is computed by applying ReLU followed by a 3x3 stride-2 conv on P6"
        self.P7_1 = nn.ReLU()
        self.P7_2 = nn.Conv2d(feature_size, feature_size, kernel_size=3, stride=2, padding=1)

    def forward(self, inputs):
        C3, C4, C5 = inputs

        P5_x = self.P5_1(C5)
        P5_upsampled_x = self.P5_upsampled(P5_x)
        P5_x = self.P5_2(P5_x)

        P4_x = self.P4_1(C4)
        P4_x = P5_upsampled_x + P4_x
        P4_upsampled_x = self.P4_upsampled(P4_x)
        P4_x = self.P4_2(P4_x)

        P3_x = self.P3_1(C3)
        P3_x = P3_x + P4_upsampled_x
        P3_x = self.P3_2(P3_x)

        P6_x = self.P6(C5)

        P7_x = self.P7_1(P6_x)
        P7_x = self.P7_2(P7_x)

        return [P3_x, P4_x, P5_x, P6_x, P7_x]
Its upsampling is done with
nn.Upsample(scale_factor=2, mode='nearest')
so the output is always exactly twice the input size; the lower-level feature map it is added to must therefore have that same size, which is why the padding during data preprocessing was needed in the first place. If, however, the upsampling is replaced with
F.interpolate(output3, size=[output2.size(2), output2.size(3)], mode="nearest")
then that kind of padding is no longer necessary.
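A quick comparison of the two upsampling choices on an odd-sized feature map (the shapes are illustrative):
import torch
import torch.nn as nn
import torch.nn.functional as F

c3 = torch.randn(1, 256, 75, 100)                                             # no padding was done, so one side is odd
p4 = torch.randn(1, 256, 38, 50)                                              # one stride-2 stage later (75 rounds up to 38)
fixed = nn.Upsample(scale_factor=2, mode='nearest')(p4)                       # 76 x 100: cannot be added to 75 x 100
exact = F.interpolate(p4, size=(c3.size(2), c3.size(3)), mode='nearest')      # 75 x 100: matches c3 exactly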
head
Since the classification does not include a background class, the classification head passes its outputs through a sigmoid; an anchor is assigned to a class whenever that class's score exceeds a chosen threshold.
class ClassificationModelReduce(nn.Module):
    def __init__(self, num_features_in, num_anchors=9, num_classes=80, prior=0.01, feature_size=256):
        super(ClassificationModelReduce, self).__init__()
        self.num_classes = num_classes
        self.num_anchors = num_anchors
        self.output = nn.Conv2d(num_features_in, num_anchors * num_classes, kernel_size=3, padding=1)
        self.output_act = nn.Sigmoid()

    def forward(self, x):
        out = self.output(x)
        out = self.output_act(out)
        # out is B x C x W x H, with C = n_anchors * n_classes
        out1 = out.permute(0, 2, 3, 1)
        batch_size, width, height, channels = out1.shape
        out2 = out1.view(batch_size, width, height, self.num_anchors, self.num_classes)
        return out2.contiguous().view(x.shape[0], -1, self.num_classes)
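A quick shape check for the single-class (pedestrian) setting described above:
head = ClassificationModelReduce(num_features_in=256, num_classes=1)
p3 = torch.randn(2, 256, 76, 80)   # e.g. the P3 feature map of a padded 608 x 640 input
print(head(p3).shape)              # torch.Size([2, 54720, 1]): 76 * 80 * 9 anchors, one sigmoid score each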
anchor
class Anchors(nn.Module):
    def __init__(self, pyramid_levels=None, strides=None, sizes=None, ratios=None, scales=None):
        super(Anchors, self).__init__()
        if pyramid_levels is None:
            self.pyramid_levels = [3, 4, 5, 6, 7]
        if strides is None:
            self.strides = [2 ** x for x in self.pyramid_levels]
        if sizes is None:
            self.sizes = [2 ** (x + 2) for x in self.pyramid_levels]
        if ratios is None:
            self.ratios = np.array([0.5, 1, 2])
        if scales is None:
            self.scales = np.array([2 ** 0, 2 ** (1.0 / 3.0), 2 ** (2.0 / 3.0)])

    def forward(self, image):
        image_shape = image.shape[2:]
        image_shape = np.array(image_shape)
        image_shapes = [(image_shape + 2 ** x - 1) // (2 ** x) for x in self.pyramid_levels]

        # compute anchors over all pyramid levels
        all_anchors = np.zeros((0, 4)).astype(np.float32)
        for idx, p in enumerate(self.pyramid_levels):
            anchors = generate_anchors(base_size=self.sizes[idx], ratios=self.ratios, scales=self.scales)
            shifted_anchors = shift(image_shapes[idx], self.strides[idx], anchors)
            all_anchors = np.append(all_anchors, shifted_anchors, axis=0)

        all_anchors = np.expand_dims(all_anchors, axis=0)
        return torch.from_numpy(all_anchors.astype(np.float32)).cuda()
def generate_anchors(base_size=16, ratios=None, scales=None):
    if ratios is None:
        ratios = np.array([0.5, 1, 2])
    if scales is None:
        scales = np.array([2 ** 0, 2 ** (1.0 / 3.0), 2 ** (2.0 / 3.0)])
    num_anchors = len(ratios) * len(scales)

    # initialize output anchors
    anchors = np.zeros((num_anchors, 4))

    # scale base_size (np.tile repeats the scales once per ratio)
    anchors[:, 2:] = base_size * np.tile(scales, (2, len(ratios))).T

    # compute areas of anchors
    areas = anchors[:, 2] * anchors[:, 3]

    # correct for ratios
    anchors[:, 2] = np.sqrt(areas / np.repeat(ratios, len(scales)))  # w
    anchors[:, 3] = anchors[:, 2] * np.repeat(ratios, len(scales))   # h

    # transform from (x_ctr, y_ctr, w, h) -> (x1, y1, x2, y2)
    anchors[:, 0::2] -= np.tile(anchors[:, 2] * 0.5, (2, 1)).T
    anchors[:, 1::2] -= np.tile(anchors[:, 3] * 0.5, (2, 1)).T

    return anchors
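A quick check of what this produces at the P3 level (base_size = 2 ** (3 + 2) = 32); the printed values are rounded:
a = generate_anchors(base_size=32)
print(a.shape)   # (9, 4): 3 ratios x 3 scales, as (x1, y1, x2, y2) centered on the origin
print(a[0])      # approx. [-22.6, -11.3, 22.6, 11.3]: ratio 0.5 (h/w), scale 2 ** 0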
def shift(shape, stride, anchors):
    shift_x = (np.arange(0, shape[1]) + 0.5) * stride
    shift_y = (np.arange(0, shape[0]) + 0.5) * stride
    shift_x, shift_y = np.meshgrid(shift_x, shift_y)
    shifts = np.vstack((
        shift_x.ravel(), shift_y.ravel(),
        shift_x.ravel(), shift_y.ravel()
    )).transpose()

    # add A anchors (1, A, 4) to
    # cell K shifts (K, 1, 4) to get
    # shift anchors (K, A, 4)
    # reshape to (K*A, 4) shifted anchors
    A = anchors.shape[0]
    K = shifts.shape[0]
    all_anchors = (anchors.reshape((1, A, 4)) + shifts.reshape((1, K, 4)).transpose((1, 0, 2)))
    all_anchors = all_anchors.reshape((K * A, 4))

    return all_anchors
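Putting generate_anchors and shift together through Anchors.forward, a quick count for a padded 608 x 640 input (this assumes a GPU is available, since forward calls .cuda()):
anchor_gen = Anchors()
all_anchors = anchor_gen(torch.zeros(1, 3, 608, 640))
print(all_anchors.shape)   # torch.Size([1, 72945, 4]): 9 anchors per cell over P3..P7 (76x80 + 38x40 + 19x20 + 10x10 + 5x5 cells)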
nms
from torchvision.ops import nms
# classification.shape = (batch, num_anchors, num_classes); regression.shape = (batch, num_anchors, 4)
transformed_anchors = self.regressBoxes(anchors, regression)
transformed_anchors = self.clipBoxes(transformed_anchors, img_batch)
scores = torch.max(classification, dim=2, keepdim=True)[0]
scores_over_thresh = (scores > 0.05)[0, :, 0]
classification = classification[:, scores_over_thresh, :]
transformed_anchors = transformed_anchors[:, scores_over_thresh, :]
scores = scores[:, scores_over_thresh, :]
anchors_nms_idx = nms(transformed_anchors[0,:,:], scores[0,:,0], 0.5)
nms_scores, nms_class = classification[0, anchors_nms_idx, :].max(dim=1)
return [nms_scores, nms_class, transformed_anchors[0, anchors_nms_idx, :]]
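For reference, regressBoxes at the top of this snippet decodes the predicted offsets relative to the anchors. Below is a minimal sketch of the standard decoding; the mean/std normalization constants are the common defaults and are an assumption here:
def decode_boxes(anchors, deltas, mean=(0.0, 0.0, 0.0, 0.0), std=(0.1, 0.1, 0.2, 0.2)):
    # anchors, deltas: (batch, num_anchors, 4); anchors are (x1, y1, x2, y2)
    widths = anchors[:, :, 2] - anchors[:, :, 0]
    heights = anchors[:, :, 3] - anchors[:, :, 1]
    ctr_x = anchors[:, :, 0] + 0.5 * widths
    ctr_y = anchors[:, :, 1] + 0.5 * heights
    dx = deltas[:, :, 0] * std[0] + mean[0]
    dy = deltas[:, :, 1] * std[1] + mean[1]
    dw = deltas[:, :, 2] * std[2] + mean[2]
    dh = deltas[:, :, 3] * std[3] + mean[3]
    pred_ctr_x = ctr_x + dx * widths
    pred_ctr_y = ctr_y + dy * heights
    pred_w = torch.exp(dw) * widths
    pred_h = torch.exp(dh) * heights
    # back to (x1, y1, x2, y2)
    return torch.stack([pred_ctr_x - 0.5 * pred_w, pred_ctr_y - 0.5 * pred_h,
                        pred_ctr_x + 0.5 * pred_w, pred_ctr_y + 0.5 * pred_h], dim=2)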