FOTS (1): Base Network


wyl2077


Article link: https://blog.csdn.net/dbdxwyl/article/details/109278867


ResNet

Paper: https://arxiv.org/pdf/1512.03385.pdf

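ResNet (deep residual network), proposed in 2015, addresses the degradation problem of deep networks: as a plain network gets deeper, its accuracy saturates and then drops, and the deeper model ends up with a higher training error than a shallower one (see the figure below). The paper notes that this is not caused by overfitting, and that vanishing/exploding gradients are already largely handled by normalized initialization and batch normalization.

[Figure: deeper plain networks show higher training error]

The paper proposes the following building block: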

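A shortcut connection carrying the input x (the right-hand branch in the figure) is added to the original stack of layers, so the output becomes $H(x) = F(x) + x$. If some layers are unnecessary, they only need to learn $F(x) = 0$; then $H(x) = x$, and the deep network behaves like a shallower one.

Why build identity mappings out of residuals? When depth increases, a deeper model should be no worse than a shallower one, because the extra layers could in principle do nothing (an identity mapping). Yet "doing nothing" is exactly what a stack of nonlinear layers finds hard to learn, since every nonlinear layer can lose some information. In other words, fitting $F(x) = x$ directly is hard, while pushing the residual toward $F(x) = 0$ is much easier.

One catch: if the input and output have different channel counts, x and F(x) cannot be added. In that case x is passed through a convolution (a 1×1 projection) so that its channel count matches that of F(x).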
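A minimal PyTorch sketch of the block described above, assuming the common two-3×3-conv form with batch normalization; the class name `BasicResidualBlock` is hypothetical. When the channel count or stride changes, the shortcut becomes a 1×1 convolution (projection shortcut); otherwise x is added unchanged. The last lines force F(x) = 0 by zeroing the second BN layer's scale and shift, so the block just passes x through (apart from the final ReLU).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class BasicResidualBlock(nn.Module):
    """Hypothetical sketch: two 3x3 convs plus a shortcut, H(x) = F(x) + x."""

    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, 3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)
        if stride != 1 or in_channels != out_channels:
            # projection shortcut: a 1x1 conv makes x match F(x) in channels and size
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_channels),
            )
        else:
            # identity shortcut: x is added unchanged
            self.shortcut = nn.Identity()

    def forward(self, x):
        residual = F.relu(self.bn1(self.conv1(x)))
        residual = self.bn2(self.conv2(residual))    # this is F(x)
        return F.relu(residual + self.shortcut(x))   # H(x) = F(x) + x, then ReLU


block = BasicResidualBlock(64, 128, stride=2)
out = block(torch.randn(2, 64, 32, 32))
print(out.shape)  # torch.Size([2, 128, 16, 16])

# Force F(x) = 0: the block then reduces to the identity (up to the final ReLU)
identity_block = BasicResidualBlock(8, 8)
nn.init.zeros_(identity_block.bn2.weight)
nn.init.zeros_(identity_block.bn2.bias)
x = torch.randn(1, 8, 5, 5)
print(torch.allclose(identity_block(x), F.relu(x)))  # True
```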
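ResNet blocks come in two forms:

[Figure: left, two 3×3 convolutions; right, a 1×1 / 3×3 / 1×1 bottleneck]

The ResNet-50 used in FOTS uses the right-hand (bottleneck) form: a 1×1 convolution first reduces the channel count, the 3×3 convolution runs on the reduced channels, and a second 1×1 convolution restores them, which sharply reduces the parameters of the 3×3 layer.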
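A quick parameter count illustrates the saving. The sketch assumes a 256-channel input with a bottleneck width of 64 (the widths of ResNet-50's first stage) and compares it with two plain 3×3 convolutions kept at 256 channels; BN, ReLU and biases are left out, so the numbers cover convolution weights only.

```python
import torch.nn as nn


def n_params(module):
    # total number of learnable parameters in a module
    return sum(p.numel() for p in module.parameters())


# Two plain 3x3 convs working directly on 256 channels
plain = nn.Sequential(
    nn.Conv2d(256, 256, 3, padding=1, bias=False),
    nn.Conv2d(256, 256, 3, padding=1, bias=False),
)

# Bottleneck: 1x1 reduces to 64, 3x3 runs at 64, 1x1 expands back to 256
bottleneck = nn.Sequential(
    nn.Conv2d(256, 64, 1, bias=False),
    nn.Conv2d(64, 64, 3, padding=1, bias=False),
    nn.Conv2d(64, 256, 1, bias=False),
)

print(n_params(plain))       # 1179648 = 2 * 3*3*256*256
print(n_params(bottleneck))  # 69632 = 256*64 + 3*3*64*64 + 64*256
```

Under these assumptions the bottleneck needs roughly 17 times fewer weights.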
ResNet architecture:

[Figure: ResNet architectures]

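Code:

```python
import torch.nn as nn
import pretrainedmodels  # pip package "pretrainedmodels"


class FOTSModel(nn.Module):  # hypothetical enclosing class; only the backbone part is shown
    def __init__(self):
        super().__init__()
        # Take the four stages of ResNet-50 as the encoder and keep their outputs for feature fusion
        bbNet = pretrainedmodels.__dict__['resnet50'](pretrained='imagenet')
        self.backbone = bbNet

    def __foward_backbone(self, input):
        conv2 = None   # layer1 output: 256 channels, 1/4 of the input size
        conv3 = None   # layer2 output: 512 channels, 1/8
        conv4 = None   # layer3 output: 1024 channels, 1/16
        output = None  # layer4 output: 2048 channels, 1/32
        # children run in order: conv1, bn1, relu, maxpool, layer1 ... layer4, then the head
        for name, layer in self.backbone.named_children():
            input = layer(input)
            if name == 'layer1':
                conv2 = input
            elif name == 'layer2':
                conv3 = input
            elif name == 'layer3':
                conv4 = input
            elif name == 'layer4':
                output = input
                break  # stop before the pooling / classification head

        return output, conv4, conv3, conv2
```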

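U-Net

Network architecture:

[Figure: U-Net architecture]

The left half is the encoder: convolutions plus downsampling (pooling). The feature map before each downsampling step is copied and cropped, then sent to the right half. The right half is the decoder: convolutions plus upsampling (transposed convolution); after each upsampling step, the result is concatenated with the feature map sent over from the left.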

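```python
import torch.nn.functional as F

# Excerpt from the model's forward(); the layers are defined in __init__:
#   self.__foward_backbone: ResNet-50, returns output, conv4, conv3, conv2
#   self.mergeLayers0..3: concat + conv (mergeLayers0 only sees the deepest feature)
#   self.__unpool: upsampling
#   self.mergeLayers4, self.bn5: final conv + BN
f = self.__foward_backbone(input)  # f[0]..f[3] are at 1/32, 1/16, 1/8, 1/4 of the input size
g = [None] * 4  # upsampled features
h = [None] * 4  # merged features
# deepest stage
h[0] = self.mergeLayers0(f[0])
g[0] = self.__unpool(h[0])

# i = 2
h[1] = self.mergeLayers1(g[0], f[1])
g[1] = self.__unpool(h[1])

# i = 3
h[2] = self.mergeLayers2(g[1], f[2])
g[2] = self.__unpool(h[2])

# i = 4: no further upsampling, so the output stays at 1/4 of the input size
h[3] = self.mergeLayers3(g[2], f[3])
#g[3] = self.__unpool(h[3])

# final stage
final = self.mergeLayers4(h[3])
final = self.bn5(final)
final = F.relu(final)
```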
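The post does not show the merge layers themselves. Below is a minimal runnable sketch of the decoding path, assuming each merge layer follows EAST's feature-merging branch (concatenate, then a 1×1 conv and a 3×3 conv, each with BN and ReLU), that `__unpool` is 2× bilinear upsampling, and that `mergeLayers0` is an identity. The channel widths 128/64/32 are taken from EAST and are an assumption for FOTS; `MergeLayer` and `unpool` are hypothetical names, and random tensors stand in for the ResNet-50 outputs.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MergeLayer(nn.Module):
    """Hypothetical concat + conv block (EAST-style feature merging)."""

    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size=1)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(out_channels)

    def forward(self, upsampled, skip):
        x = torch.cat([upsampled, skip], dim=1)  # concat along the channel axis
        x = F.relu(self.bn1(self.conv1(x)))      # 1x1 conv mixes and reduces channels
        return F.relu(self.bn2(self.conv2(x)))   # 3x3 conv fuses local context


def unpool(x):
    # assumed implementation of __unpool: 2x bilinear upsampling
    return F.interpolate(x, scale_factor=2, mode='bilinear', align_corners=True)


# Dummy ResNet-50 stage outputs for a 256x256 input (1/32, 1/16, 1/8, 1/4 resolution)
f = [torch.randn(1, 2048, 8, 8), torch.randn(1, 1024, 16, 16),
     torch.randn(1, 512, 32, 32), torch.randn(1, 256, 64, 64)]
merge1 = MergeLayer(2048 + 1024, 128)
merge2 = MergeLayer(128 + 512, 64)
merge3 = MergeLayer(64 + 256, 32)

g0 = unpool(f[0])      # h[0] = f[0] here, upsampled to 16x16
h1 = merge1(g0, f[1])  # 128 channels at 16x16
g1 = unpool(h1)
h2 = merge2(g1, f[2])  # 64 channels at 32x32
g2 = unpool(h2)
h3 = merge3(g2, f[3])  # 32 channels at 64x64
print(h3.shape)        # torch.Size([1, 32, 64, 64])
```

Each step doubles the spatial size and halves the channel width, so the fused map ends at 1/4 of the input resolution, matching the commented-out final `__unpool` in the code above.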

Background

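Upsampling: whereas downsampling shrinks an image, upsampling enlarges it. Common upsampling methods include transposed convolution (deconvolution), bilinear interpolation and unpooling. Transposed convolution comes first.
Transposed Convolution (deconvolution)
A transposed convolution is also a convolution and takes kernel_size, stride and padding; the difference is that the input feature map is preprocessed first. The process has two steps:
1. Insert zeros into the input: along the height and width, insert (stride − 1) zeros between every two adjacent pixels. The new height is $H' = H + (\text{stride} - 1)(H - 1)$ and the new width is $W' = W + (\text{stride} - 1)(W - 1)$.
2. Convolve the zero-inserted map with stride 1, padding it with (kernel_size − 1 − padding) zeros on each side. The output height is $H_{out} = (H - 1)\cdot\text{stride} - 2\cdot\text{padding} + \text{kernel\_size}$ (plus output_padding in PyTorch), and likewise for the width.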
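To check the two steps, the sketch below rebuilds a stride-2 transposed convolution from zero insertion plus an ordinary convolution and compares it with `nn.ConvTranspose2d`. The helper name is hypothetical, and it assumes dilation = 1, groups = 1, output_padding = 0, no bias, and padding ≤ kernel_size − 1. One detail the two steps leave implicit: the kernel has to be flipped spatially and its input/output channel axes swapped.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def transposed_conv_by_zero_insertion(x, weight, stride, padding):
    """x: (N, C_in, H, W); weight: (C_in, C_out, k, k), the layout nn.ConvTranspose2d uses."""
    n, c, h, w = x.shape
    k = weight.shape[-1]
    # Step 1: insert (stride - 1) zeros between adjacent pixels -> H' = H + (stride - 1)(H - 1)
    dilated = x.new_zeros(n, c, h + (stride - 1) * (h - 1), w + (stride - 1) * (w - 1))
    dilated[:, :, ::stride, ::stride] = x
    # Step 2: stride-1 convolution, (k - 1 - padding) zeros per side, flipped kernel
    kernel = torch.flip(weight, dims=[2, 3]).transpose(0, 1).contiguous()  # -> (C_out, C_in, k, k)
    return F.conv2d(dilated, kernel, padding=k - 1 - padding)


torch.manual_seed(0)
x = torch.randn(1, 3, 5, 5)
deconv = nn.ConvTranspose2d(3, 2, kernel_size=4, stride=2, padding=1, bias=False)
with torch.no_grad():
    ref = deconv(x)
    out = transposed_conv_by_zero_insertion(x, deconv.weight, stride=2, padding=1)

print(ref.shape)                            # torch.Size([1, 2, 10, 10]): (5-1)*2 - 2*1 + 4 = 10
print(torch.allclose(ref, out, atol=1e-5))  # True
```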

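```python
# Transposed convolution
nn.ConvTranspose2d(in_channels, out_channels, kernel_size, stride=1, padding=0, output_padding=0, groups=1, bias=True, dilation=1)
'''
in_channels (int): number of channels in the input
out_channels (int): number of channels produced by the convolution
kernel_size (int or tuple): size of the convolution kernel
stride (int or tuple, optional): stride of the convolution, i.e. roughly the factor by which the input is enlarged
padding (int or tuple, optional): dilation * (kernel_size - 1) - padding zeros are added to both sides of the input, so each unit of padding shrinks the output height and width by 2
output_padding (int or tuple, optional): extra size added to one side of each output dimension, so height and width each grow by output_padding
groups (int, optional): number of blocked connections from input channels to output channels
bias (bool, optional): if bias=True, a learnable bias is added
dilation (int or tuple, optional): spacing between kernel elements
'''
```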
```python
# Up/down-sampling function
torch.nn.functional.interpolate(input, size=None, scale_factor=None, mode='nearest', align_corners=None)
'''
Parameters:
    - input (Tensor): input tensor
    - size (int or Tuple[int] or Tuple[int, int] or Tuple[int, int, int]): output spatial size
    - scale_factor (float or Tuple[float]): multiplier for the spatial size (give either size or scale_factor, not both)
    - mode (str): upsampling algorithm: nearest, linear, bilinear, bicubic, trilinear, area. Default: nearest
    - align_corners (bool, optional): if True, the corner pixels of the input and output are aligned, so the values at those corner pixels are preserved. Only has an effect when mode is linear, bilinear, bicubic or trilinear. Default: False
'''
```
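A small check of what `align_corners` changes, on a 2×2 input upsampled 2× with bilinear interpolation. With `align_corners=True` the input corners map exactly onto the output corners and the samples in between are evenly spaced; with `False` the pixels are treated as areas, so the first row comes out as 0, 0.25, 0.75, 1 instead.

```python
import torch
import torch.nn.functional as F

x = torch.tensor([[[[0.0, 1.0],
                    [2.0, 3.0]]]])  # shape (1, 1, 2, 2)

up_false = F.interpolate(x, scale_factor=2, mode='bilinear', align_corners=False)
up_true = F.interpolate(x, scale_factor=2, mode='bilinear', align_corners=True)

print(up_false.shape)            # torch.Size([1, 1, 4, 4])
print(up_false[0, 0, 0])         # first row ≈ [0.0, 0.25, 0.75, 1.0]
print(up_true[0, 0, 0])          # first row ≈ [0.0, 0.3333, 0.6667, 1.0]
```

A 2× bilinear `F.interpolate` call like this is a common way to implement the `__unpool` step in the decoder above, but the post does not show its actual implementation.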

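Generating bboxes

The bboxes in FOTS use the RBOX representation from EAST, i.e. a rotated rectangle. It has five parameters per pixel: the four distances $d_i$ from the pixel to the top, right, bottom and left edges of the rectangle, and the rotation angle θ.
The function that recovers the four vertices of a box from these five RBOX parameters is: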

```python
import numpy as np


def restore_rectangle_rbox(origin, geometry):
    '''
    :param origin: N*2 pixel positions (x, y) in the input image
    :param geometry: N*5 [d1, d2, d3, d4, theta]; distances to the top, right, bottom, left edges, then the angle
    :return: N*4*2 box vertices
    '''
    d = geometry[:, :4]
    angle = geometry[:, 4]
    # for angle >= 0
    origin_0 = origin[angle >= 0]
    d_0 = d[angle >= 0]
    angle_0 = angle[angle >= 0]
    if origin_0.shape[0] > 0:
        # unrotated box with its bottom-left corner at (0, 0):
        # (0, -H), (W, -H), (W, 0), (0, 0), then the pixel itself at (left, -bottom)
        p = np.array([np.zeros(d_0.shape[0]), -d_0[:, 0] - d_0[:, 2],
                      d_0[:, 1] + d_0[:, 3], -d_0[:, 0] - d_0[:, 2],
                      d_0[:, 1] + d_0[:, 3], np.zeros(d_0.shape[0]),
                      np.zeros(d_0.shape[0]), np.zeros(d_0.shape[0]),
                      d_0[:, 3], -d_0[:, 2]])
        p = p.transpose((1, 0)).reshape((-1, 5, 2))  # N*5*2

        # rotation: x' = cos*x + sin*y, y' = -sin*x + cos*y
        rotate_matrix_x = np.array([np.cos(angle_0), np.sin(angle_0)]).transpose((1, 0))
        rotate_matrix_x = np.repeat(rotate_matrix_x, 5, axis=1).reshape(-1, 2, 5).transpose((0, 2, 1))  # N*5*2

        rotate_matrix_y = np.array([-np.sin(angle_0), np.cos(angle_0)]).transpose((1, 0))
        rotate_matrix_y = np.repeat(rotate_matrix_y, 5, axis=1).reshape(-1, 2, 5).transpose((0, 2, 1))

        p_rotate_x = np.sum(rotate_matrix_x * p, axis=2)[:, :, np.newaxis]  # N*5*1
        p_rotate_y = np.sum(rotate_matrix_y * p, axis=2)[:, :, np.newaxis]  # N*5*1

        p_rotate = np.concatenate([p_rotate_x, p_rotate_y], axis=2)  # N*5*2

        # shift so that the rotated pixel point lands on its true position
        p3_in_origin = origin_0 - p_rotate[:, 4, :]
        new_p0 = p_rotate[:, 0, :] + p3_in_origin  # N*2
        new_p1 = p_rotate[:, 1, :] + p3_in_origin
        new_p2 = p_rotate[:, 2, :] + p3_in_origin
        new_p3 = p_rotate[:, 3, :] + p3_in_origin

        new_p_0 = np.concatenate([new_p0[:, np.newaxis, :], new_p1[:, np.newaxis, :],
                                  new_p2[:, np.newaxis, :], new_p3[:, np.newaxis, :]], axis=1)  # N*4*2
    else:
        new_p_0 = np.zeros((0, 4, 2))
    # for angle < 0
    origin_1 = origin[angle < 0]
    d_1 = d[angle < 0]
    angle_1 = angle[angle < 0]
    if origin_1.shape[0] > 0:
        # unrotated box with its bottom-right corner at (0, 0):
        # (-W, -H), (0, -H), (0, 0), (-W, 0), then the pixel itself at (-right, -bottom)
        p = np.array([-d_1[:, 1] - d_1[:, 3], -d_1[:, 0] - d_1[:, 2],
                      np.zeros(d_1.shape[0]), -d_1[:, 0] - d_1[:, 2],
                      np.zeros(d_1.shape[0]), np.zeros(d_1.shape[0]),
                      -d_1[:, 1] - d_1[:, 3], np.zeros(d_1.shape[0]),
                      -d_1[:, 1], -d_1[:, 2]])
        p = p.transpose((1, 0)).reshape((-1, 5, 2))  # N*5*2

        rotate_matrix_x = np.array([np.cos(-angle_1), -np.sin(-angle_1)]).transpose((1, 0))
        rotate_matrix_x = np.repeat(rotate_matrix_x, 5, axis=1).reshape(-1, 2, 5).transpose((0, 2, 1))  # N*5*2

        rotate_matrix_y = np.array([np.sin(-angle_1), np.cos(-angle_1)]).transpose((1, 0))
        rotate_matrix_y = np.repeat(rotate_matrix_y, 5, axis=1).reshape(-1, 2, 5).transpose((0, 2, 1))

        p_rotate_x = np.sum(rotate_matrix_x * p, axis=2)[:, :, np.newaxis]  # N*5*1
        p_rotate_y = np.sum(rotate_matrix_y * p, axis=2)[:, :, np.newaxis]  # N*5*1

        p_rotate = np.concatenate([p_rotate_x, p_rotate_y], axis=2)  # N*5*2

        p3_in_origin = origin_1 - p_rotate[:, 4, :]
        new_p0 = p_rotate[:, 0, :] + p3_in_origin  # N*2
        new_p1 = p_rotate[:, 1, :] + p3_in_origin
        new_p2 = p_rotate[:, 2, :] + p3_in_origin
        new_p3 = p_rotate[:, 3, :] + p3_in_origin

        new_p_1 = np.concatenate([new_p0[:, np.newaxis, :], new_p1[:, np.newaxis, :],
                                  new_p2[:, np.newaxis, :], new_p3[:, np.newaxis, :]], axis=1)  # N*4*2
    else:
        new_p_1 = np.zeros((0, 4, 2))
    return np.concatenate([new_p_0, new_p_1])
```
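By my reading, both branches above end up applying the same rotation, R = [[cos θ, sin θ], [−sin θ, cos θ]], to the same corner offsets measured from the pixel: (−left, −top), (right, −top), (right, bottom), (−left, bottom). Each vertex is then origin + R·offset. The compact sketch below (hypothetical name `restore_rbox_simple`) implements that reading to make the geometry easier to see; treat it as an explanatory re-derivation and compare it against the original function before using it in its place.

```python
import numpy as np


def restore_rbox_simple(origin, geometry):
    """Hypothetical compact RBOX decoder.

    origin: N*2 pixel positions (x, y); geometry: N*5 (top, right, bottom, left, theta).
    Returns N*4*2 vertices: top-left, top-right, bottom-right, bottom-left (before rotation).
    """
    top, right, bottom, left, theta = geometry.T
    # corner offsets from the pixel in the box's own (unrotated) frame
    offsets = np.stack([np.stack([-left, -top], axis=1),
                        np.stack([right, -top], axis=1),
                        np.stack([right, bottom], axis=1),
                        np.stack([-left, bottom], axis=1)], axis=1)  # N*4*2
    cos = np.cos(theta)[:, None]
    sin = np.sin(theta)[:, None]
    # same rotation as restore_rectangle_rbox: x' = cos*x + sin*y, y' = -sin*x + cos*y
    x = cos * offsets[..., 0] + sin * offsets[..., 1]
    y = -sin * offsets[..., 0] + cos * offsets[..., 1]
    return np.stack([x, y], axis=-1) + origin[:, None, :]


origin = np.array([[10.0, 20.0]])
geometry = np.array([[2.0, 3.0, 4.0, 5.0, 0.0]])  # top, right, bottom, left, theta
print(restore_rbox_simple(origin, geometry)[0].tolist())
# [[5.0, 18.0], [13.0, 18.0], [13.0, 24.0], [5.0, 24.0]]
```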

Building on this function, the following recovers the box vertices from the score map and the geo map:

```python
import time

import cv2
import lanms  # locality-aware NMS extension used by EAST-style detectors
import numpy as np

# Toolbox: the project's helper module holding restore_rectangle_rbox (its import is not shown in the original)


def detect(score_map, geo_map, score_map_thresh=0.5, box_thresh=0.1, nms_thres=0.2, timer=None):
    '''
    Restore text boxes from the score map and geo map.
    :param score_map: 1 channel
    :param geo_map: 5 channels (4 distances + angle)
    :param timer: dict that collects timings; created if None
    :param score_map_thresh: threshold for the score map
    :param box_thresh: threshold for the average score inside a box
    :param nms_thres: threshold for NMS
    :return: (boxes, timer); boxes is N*9 (4 vertices + score), or None if nothing survives NMS
    '''
    if timer is None:
        timer = {}
    if len(score_map.shape) == 4:
        score_map = score_map[0, :, :, 0]
        geo_map = geo_map[0, :, :, :]
    # filter the score map
    xy_text = np.argwhere(score_map > score_map_thresh)
    # sort the text pixels by the y axis
    xy_text = xy_text[np.argsort(xy_text[:, 0])]
    # restore: (row, col) -> (x, y); * 4 because the maps are 1/4 of the input size
    start = time.time()
    text_box_restored = Toolbox.restore_rectangle_rbox(xy_text[:, ::-1] * 4, geo_map[xy_text[:, 0], xy_text[:, 1], :])  # N*4*2
    boxes = np.zeros((text_box_restored.shape[0], 9), dtype=np.float32)
    boxes[:, :8] = text_box_restored.reshape((-1, 8))
    boxes[:, 8] = score_map[xy_text[:, 0], xy_text[:, 1]]
    timer['restore'] = time.time() - start
    # NMS part
    start = time.time()
    # boxes = nms_locality.nms_locality(boxes.astype(np.float64), nms_thres)
    boxes = lanms.merge_quadrangle_n9(boxes.astype('float32'), nms_thres)
    timer['nms'] = time.time() - start
    if boxes.shape[0] == 0:
        return None, timer

    # filter low-score boxes by the average score map inside each box; this differs from the original paper
    for i, box in enumerate(boxes):
        mask = np.zeros_like(score_map, dtype=np.uint8)
        cv2.fillPoly(mask, box[:8].reshape((-1, 4, 2)).astype(np.int32) // 4, 1)
        boxes[i, 8] = cv2.mean(score_map, mask)[0]
    boxes = boxes[boxes[:, 8] > box_thresh]
    return boxes, timer
```
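The first lines of `detect` turn score-map pixels into input-image coordinates: `np.argwhere` returns (row, col) pairs, `[:, ::-1]` flips them to (x, y), and the factor 4 undoes the 1/4 resolution of the output maps. A tiny stand-alone sketch of just that step, with `candidate_pixels` as a hypothetical helper and the stride of 4 taken from the `* 4` in the code:

```python
import numpy as np


def candidate_pixels(score_map, thresh=0.5, stride=4):
    """Hypothetical helper mirroring the first lines of detect()."""
    yx = np.argwhere(score_map > thresh)               # (row, col) of each text pixel
    yx = yx[np.argsort(yx[:, 0], kind='stable')]       # top-to-bottom order
    xy = yx[:, ::-1] * stride                          # (row, col) -> (x, y) at input-image scale
    return yx, xy


score_map = np.zeros((3, 3), dtype=np.float32)
score_map[0, 1] = 0.8
score_map[2, 0] = 0.9
yx, xy = candidate_pixels(score_map)
print(yx.tolist())  # [[0, 1], [2, 0]]
print(xy.tolist())  # [[4, 0], [0, 8]]
```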
