Paper Reading: (AAAI 2021) End-to-End Text Detection and Recognition with PGNet + the Corresponding PaddleOCR Source Code

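Introduction
  • This is the second post in my paper-reading series on end-to-end text detection and recognition. At the time of writing, PGNet is the best algorithm in this area. It comes from Baidu and has been open-sourced and integrated into PaddleOCR.
  • PaddleOCR also gives a brief introduction to the algorithm; see PaddleOCR-PGNet for details.
  • This post briefly introduces each module in the paper and gives the corresponding PaddleOCR code, for study purposes.
Mapping to the PaddleOCR inference-stage code
  • The overall network structure is shown below: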
    PGNet Framework
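  • The structure can be divided into four parts: Backbone + FPN, the four branches, the loss, and post-processing.
Backbone + FPN
  • There are two backbones: a server version, ResNet50, and a mobile version, EfficientNet-B0. The repo only provides the server-version model, so the ResNet50 version is used below to show how the source code maps to the paper.
  • Because PGNet has been merged into PaddleOCR, its components are organized the same way as the other algorithms there. The backbone lives in ppocr/modeling/backbones/e2e_resnet_vd_pg.py, and its forward code is shown below. Note that out additionally collects two early features (the raw input and the output of the first convolution) for the FPN that follows.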
    class ResNet(nn.Layer):
        def __init__(self, in_channels=3, layers=50, **kwargs):
            super(ResNet, self).__init__()
            # (some code omitted here)
        
        def forward(self, inputs):
            # The raw input is also kept as a feature for the FPN
            out = [inputs]
            y = self.conv1_1(inputs)
    
            # The output of the first conv layer is kept for the FPN as well
            out.append(y)
            y = self.pool2d_max(y)
            for block in self.stages:
                y = block(y)
                out.append(y)
            return out
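
To make the seven entries of out concrete, here is a small sketch that computes their shapes for a given input size. The helper backbone_output_shapes is hypothetical (it is not part of PaddleOCR), and the channel counts and strides are assumptions read off the shape comments in the PGFPN excerpt for the ResNet50 backbone.

```python
# Hypothetical helper, not part of PaddleOCR: traces the shapes stored in `out`.
# Channels and strides are assumptions taken from the PGFPN shape comments.
BACKBONE_LEVELS = [
    # (name, channels, stride relative to the input image)
    ("c0", 3, 1),      # the raw input image itself
    ("c1", 64, 2),     # output of conv1_1
    ("c2", 256, 4),    # max pool followed by the first residual stage
    ("c3", 512, 8),
    ("c4", 1024, 16),
    ("c5", 2048, 32),
    ("c6", 2048, 64),
]


def backbone_output_shapes(height, width, batch=1):
    """Return {name: [N, C, H, W]} for the seven feature maps."""
    shapes = {}
    for name, channels, stride in BACKBONE_LEVELS:
        if height % stride or width % stride:
            raise ValueError("this sketch expects sizes divisible by every stride (up to 64)")
        shapes[name] = [batch, channels, height // stride, width // stride]
    return shapes


if __name__ == "__main__":
    for name, shape in backbone_output_shapes(768, 1408).items():
        print(name, shape)
```

With a 768 x 1408 input this gives the same c0 to c6 shapes that are annotated in the FPN code.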
    
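  • The FPN takes the 7 outputs of the ResNet and fuses them along two paths:
    • Path 1: FPN Down Fusion (c0→c1→c2)
    • Path 2: FPN Up Fusion (c6→c5→c4→c3→c2)
    • Finally, the outputs of the two paths are added element-wise to obtain the $F_{visual}$ feature described in the paper
  • This way, feature information from every scale is used. The code is mainly in ppocr/modeling/necks/pg_fpn.py; the key parts are: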
class PGFPN(nn.Layer):
    def __init__(self, in_channels, **kwargs):
        super(PGFPN, self).__init__()
        # (some code omitted)
        pass

    def forward(self, x):
        c0, c1, c2, c3, c4, c5, c6 = x
        # c0 shape: [1, 3, 768, 1408]
        # c1 shape: [1, 64, 384, 704]
        # c2 shape: [1, 256, 192, 352]
        # c3 shape: [1, 512, 96, 176]
        # c4 shape: [1, 1024, 48, 88]
        # c5 shape: [1, 2048, 24, 44]
        # c6 shape: [1, 2048, 12, 22]
        
        # FPN_Down_Fusion
        f = [c0, c1, c2]
        g = [None, None, None]
        h = [None, None, None]
        h[0] = self.conv_bn_layer_1(f[0])
        h[1] = self.conv_bn_layer_2(f[1])
        h[2] = self.conv_bn_layer_3(f[2])
        # After the operations above:
        # h[0] shape: [1, 32, 768, 1408]
        # h[1] shape: [1, 64, 384, 704]
        # h[2] shape: [1, 128, 192, 352]

        g[0] = self.conv_bn_layer_4(h[0])
        g[1] = paddle.add(g[0], h[1])
        g[1] = F.relu(g[1])
        g[1] = self.conv_bn_layer_5(g[1])
        g[1] = self.conv_bn_layer_6(g[1])

        g[2] = paddle.add(g[1], h[2])
        g[2] = F.relu(g[2])
        g[2] = self.conv_bn_layer_7(g[2])
        # After the operations above:
        # g[0] shape: [1, 64, 384, 704]
        # g[1] shape: [1, 128, 192, 352]
        # g[2] shape: [1, 128, 192, 352]
		
        f_down = self.conv_bn_layer_8(g[2])
        # f_down shape: [1, 128, 192, 352]

        # FPN UP Fusion
        f1 = [c6, c5, c4, c3, c2]
        g = [None, None, None, None, None]
        h = [None, None, None, None, None]
        h[0] = self.conv_h0(f1[0])
        h[1] = self.conv_h1(f1[1])
        h[2] = self.conv_h2(f1[2])
        h[3] = self.conv_h3(f1[3])
        h[4] = self.conv_h4(f1[4])
        # h[0] shape: [1, 256, 12, 22]
        # h[1] shape: [1, 256, 24, 44]
        # h[2] shape: [1, 192, 48, 88]
        # h[3] shape: [1, 192, 96, 176]
        # h[4] shape: [1, 128, 192, 352]

        g[0] = self.dconv0(h[0])
        g[1] = paddle.add(g[0], h[1])
        g[1] = F.relu(g[1])
        g[1] = self.conv_g1(g[1])
        g[1] = self.dconv1(g[1])

        g[2] = paddle.add(g[1], h[2])
        g[2] = F.relu(g[2])
        g[2] = self.conv_g2(g[2])
        g[2] = self.dconv2(g[2])

        g[3] = paddle.add(g[2], h[3])
        g[3] = F.relu(g[3])
        g[3] = self.conv_g3(g[3])
        g[3] = self.dconv3(g[3])

        g[4] = paddle.add(x=g[3], y=h[4])
        g[4] = F.relu(g[4])
        g[4] = self.conv_g4(g[4])
        # g[0] shape: [1, 256, 24, 44]
        # g[1] shape: [1, 192, 48, 88]
        # g[2] shape: [1, 192, 96, 176]
        # g[3] shape: [1, 128, 192, 352]
        # g[4] shape: [1, 128, 192, 352]
        
        f_up = self.convf(g[4])
        # f_down shape: [1, 128, 192, 352]
        # f_up shape: [1, 128, 192, 352]
        f_common = paddle.add(f_down, f_up)
        f_common = F.relu(f_common)
        # f_common shape: [1, 128, 192, 352]
        return f_common
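
The Up Fusion path only works if every paddle.add sees two tensors of the same shape. The sketch below traces (C, H, W) shapes through that path in plain Python, without Paddle. It rests on assumptions taken from the shape comments above: the lateral convolutions produce 256, 256, 192, 192 and 128 channels, each dconv exactly doubles H and W while switching to the next level's channel count, and each conv_g keeps the shape. The function names are hypothetical.

```python
# Lateral channel widths for c6, c5, c4, c3, c2 (from the h[i] shape comments).
LATERAL_CHANNELS = [256, 256, 192, 192, 128]


def deconv_x2(shape, out_channels):
    """Shape after a stride-2 transposed conv, assumed to double H and W exactly."""
    _, h, w = shape
    return (out_channels, 2 * h, 2 * w)


def add(a, b):
    """Element-wise add needs identical shapes; return the shared shape."""
    if a != b:
        raise ValueError(f"shape mismatch: {a} vs {b}")
    return a


def trace_up_fusion(spatial_sizes):
    """spatial_sizes: [(H, W)] for c6..c2, coarsest first. Returns the g[i] shapes."""
    h = [(c, hh, ww) for c, (hh, ww) in zip(LATERAL_CHANNELS, spatial_sizes)]
    g = [deconv_x2(h[0], LATERAL_CHANNELS[1])]           # dconv0
    for i in range(1, len(h)):
        fused = add(g[i - 1], h[i])                      # add, ReLU and conv_g keep the shape
        if i < len(h) - 1:
            fused = deconv_x2(fused, LATERAL_CHANNELS[i + 1])  # dconv1..dconv3
        g.append(fused)
    return g


if __name__ == "__main__":
    sizes = [(12, 22), (24, 44), (48, 88), (96, 176), (192, 352)]
    for i, shape in enumerate(trace_up_fusion(sizes)):
        print(f"g[{i}]", shape)
```

The traced g[0] to g[4] shapes agree with the comments in the forward code, and the last one matches f_down, which is why the final paddle.add(f_down, f_up) is valid.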
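The four branches
  • All four branches (Text center line (TCL), Text border offset (TBO), Text direction offset (TDO), and Text character classification map (TCC)) are implemented in the code as a series of $1\times1$ and $3\times3$ convolutions. In this part of the implementation there is no substantial structural difference between them.
  • The paper describes two variants on the same backbone, with and without the Graph Refinement Module (GRM). I could not find any GRM code in the source, so the open-sourced version appears to be the one without GRM.
  • The code is mainly in ppocr/modeling/heads/e2e_pg_head.py; the key parts are: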
    class PGHead(nn.Layer):
        def __init__(self, in_channels, **kwargs):
            super(PGHead, self).__init__()
            # (layer declarations omitted)
            pass
            
        def forward(self, x, targets=None):
            f_score = self.conv_f_score1(x)
            f_score = self.conv_f_score2(f_score)
            f_score = self.conv_f_score3(f_score)
            f_score = self.conv1(f_score)
            f_score = F.sigmoid(f_score)
    
            # f_border
            f_border = self.conv_f_boder1(x)
            f_border = self.conv_f_boder2(f_border)
            f_border = self.conv_f_boder3(f_border)
            f_border = self.conv2(f_border)
    
            f_char = self.conv_f_char1(x)
            f_char = self.conv_f_char2(f_char)
            f_char = self.conv_f_char3(f_char)
            f_char = self.conv_f_char4(f_char)
            f_char = self.conv_f_char5(f_char)
            f_char = self.conv3(f_char)
    
            f_direction = self.conv_f_direc1(x)
            f_direction = self.conv_f_direc2(f_direction)
            f_direction = self.conv_f_direc3(f_direction)
            f_direction = self.conv4(f_direction)
    
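            # Channel counts follow the loss code: 4 border offsets, 2 direction
            # offsets, and 37 character classes (36 characters + 1 background)
            # f_score shape: [1, 1, 192, 352]
            # f_border shape: [1, 4, 192, 352]
            # f_char shape: [1, 37, 192, 352]
            # f_direction shape: [1, 2, 192, 352]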
            predicts = {}
            predicts['f_score'] = f_score
            predicts['f_border'] = f_border
            predicts['f_char'] = f_char
            predicts['f_direction'] = f_direction
            return predicts
    
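Loss
  • Although the four branches are implemented almost identically, the loss directly determines what each branch learns, which is how training gives each branch its own role.
  • The loss is implemented mainly in ppocr/losses/e2e_pg_loss.py. Each of the four branches has its own loss, listed here, and the main code follows:
    • The TCL branch uses Dice Loss, which originates from medical image segmentation
    • TBO and TDO use Smooth $L_{1}$ Loss
    • TCC uses PG-CTC Loss.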
    class PGLoss(nn.Layer):
        def __init__(self,
                     tcl_bs,
                     max_text_length,
                     max_text_nums,
                     pad_num,
                     eps=1e-6,
                     **kwargs):
            super(PGLoss, self).__init__()
            self.tcl_bs = tcl_bs
            self.max_text_nums = max_text_nums
            self.max_text_length = max_text_length
            self.pad_num = pad_num
            self.dice_loss = DiceLoss(eps=eps)
    
        def border_loss(self, f_border, l_border, l_score, l_mask):
            l_border_split, l_border_norm = paddle.tensor.split(
                l_border, num_or_sections=[4, 1], axis=1)
            f_border_split = f_border
            b, c, h, w = l_border_norm.shape
            l_border_norm_split = paddle.expand(
                x=l_border_norm, shape=[b, 4 * c, h, w])
            b, c, h, w = l_score.shape
            l_border_score = paddle.expand(x=l_score, shape=[b, 4 * c, h, w])
            b, c, h, w = l_mask.shape
            l_border_mask = paddle.expand(x=l_mask, shape=[b, 4 * c, h, w])
            border_diff = l_border_split - f_border_split
            abs_border_diff = paddle.abs(border_diff)
            border_sign = abs_border_diff < 1.0
            border_sign = paddle.cast(border_sign, dtype='float32')
            border_sign.stop_gradient = True
            border_in_loss = 0.5 * abs_border_diff * abs_border_diff * border_sign + \
                             (abs_border_diff - 0.5) * (1.0 - border_sign)
            border_out_loss = l_border_norm_split * border_in_loss
            border_loss = paddle.sum(border_out_loss * l_border_score * l_border_mask) / \
                          (paddle.sum(l_border_score * l_border_mask) + 1e-5)
            return border_loss
    
        def direction_loss(self, f_direction, l_direction, l_score, l_mask):
            l_direction_split, l_direction_norm = paddle.tensor.split(
                l_direction, num_or_sections=[2, 1], axis=1)
            f_direction_split = f_direction
            b, c, h, w = l_direction_norm.shape
            l_direction_norm_split = paddle.expand(
                x=l_direction_norm, shape=[b, 2 * c, h, w])
            b, c, h, w = l_score.shape
            l_direction_score = paddle.expand(x=l_score, shape=[b, 2 * c, h, w])
            b, c, h, w = l_mask.shape
            l_direction_mask = paddle.expand(x=l_mask, shape=[b, 2 * c, h, w])
            direction_diff = l_direction_split - f_direction_split
            abs_direction_diff = paddle.abs(direction_diff)
            direction_sign = abs_direction_diff < 1.0
            direction_sign = paddle.cast(direction_sign, dtype='float32')
            direction_sign.stop_gradient = True
            direction_in_loss = 0.5 * abs_direction_diff * abs_direction_diff * direction_sign + \
                                (abs_direction_diff - 0.5) * (1.0 - direction_sign)
            direction_out_loss = l_direction_norm_split * direction_in_loss
            direction_loss = paddle.sum(direction_out_loss * l_direction_score * l_direction_mask) / \
                             (paddle.sum(l_direction_score * l_direction_mask) + 1e-5)
            return direction_loss
    
        def ctcloss(self, f_char, tcl_pos, tcl_mask, tcl_label, label_t):
            f_char = paddle.transpose(f_char, [0, 2, 3, 1])
            tcl_pos = paddle.reshape(tcl_pos, [-1, 3])
            tcl_pos = paddle.cast(tcl_pos, dtype=int)
            f_tcl_char = paddle.gather_nd(f_char, tcl_pos)
            f_tcl_char = paddle.reshape(f_tcl_char,
                                        [-1, 64, 37])  # len(Lexicon_Table)+1
            f_tcl_char_fg, f_tcl_char_bg = paddle.split(f_tcl_char, [36, 1], axis=2)
            f_tcl_char_bg = f_tcl_char_bg * tcl_mask + (1.0 - tcl_mask) * 20.0
            b, c, l = tcl_mask.shape
            tcl_mask_fg = paddle.expand(x=tcl_mask, shape=[b, c, 36 * l])
            tcl_mask_fg.stop_gradient = True
            f_tcl_char_fg = f_tcl_char_fg * tcl_mask_fg + (1.0 - tcl_mask_fg) * (
                -20.0)
            f_tcl_char_mask = paddle.concat([f_tcl_char_fg, f_tcl_char_bg], axis=2)
            f_tcl_char_ld = paddle.transpose(f_tcl_char_mask, (1, 0, 2))
            N, B, _ = f_tcl_char_ld.shape
            input_lengths = paddle.to_tensor([N] * B, dtype='int64')
            cost = paddle.nn.functional.ctc_loss(
                log_probs=f_tcl_char_ld,
                labels=tcl_label,
                input_lengths=input_lengths,
                label_lengths=label_t,
                blank=self.pad_num,
                reduction='none')
            cost = cost.mean()
            return cost
    
        def forward(self, predicts, labels):
            images, tcl_maps, tcl_label_maps, border_maps \
                , direction_maps, training_masks, label_list, pos_list, pos_mask = labels
            # for all the batch_size
            pos_list, pos_mask, label_list, label_t = pre_process(
                label_list, pos_list, pos_mask, self.max_text_length,
                self.max_text_nums, self.pad_num, self.tcl_bs)
    
            f_score, f_border = predicts['f_score'], predicts['f_border'],
            f_direction, f_char = predicts['f_direction'], predicts['f_char']
    
            score_loss = self.dice_loss(f_score, tcl_maps, training_masks)
            border_loss = self.border_loss(f_border, border_maps, tcl_maps,
                                           training_masks)
            direction_loss = self.direction_loss(f_direction, direction_maps,
                                                 tcl_maps, training_masks)
            ctc_loss = self.ctcloss(f_char, pos_list, pos_mask, label_list, label_t)
            loss_all = score_loss + border_loss + direction_loss + 5 * ctc_loss
    
            losses = {
                'loss': loss_all,
                "score_loss": score_loss,
                "border_loss": border_loss,
                "direction_loss": direction_loss,
                "ctc_loss": ctc_loss
            }
            return losses
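
To see what the regression and segmentation terms compute without the tensor bookkeeping, here is a plain-Python sketch on flat lists. The Smooth L1 branch and the masked normalization mirror border_loss and direction_loss above, and total_loss mirrors the weighting in forward. The dice_loss function is an assumption: DiceLoss itself is not shown in the excerpt, so I use the common masked Dice formulation. All function names here are hypothetical.

```python
def smooth_l1(diff):
    """0.5 * d^2 when |d| < 1, otherwise |d| - 0.5 (same split as border_sign above)."""
    a = abs(diff)
    return 0.5 * a * a if a < 1.0 else a - 0.5


def masked_offset_loss(pred, target, norm, score, mask, eps=1e-5):
    """Normalized Smooth L1, averaged over valid center-line entries.

    Every argument is a flat list with one value per map element; the caller is
    assumed to have already repeated norm, score and mask across channels.
    """
    num = sum(n * smooth_l1(t - p) * s * m
              for p, t, n, s, m in zip(pred, target, norm, score, mask))
    den = sum(s * m for s, m in zip(score, mask)) + eps
    return num / den


def dice_loss(pred, gt, mask, eps=1e-6):
    """Common masked Dice loss (assumed form; the real DiceLoss is not shown here)."""
    inter = sum(p * g * m for p, g, m in zip(pred, gt, mask))
    union = sum(p * m for p, m in zip(pred, mask)) + sum(g * m for g, m in zip(gt, mask)) + eps
    return 1.0 - 2.0 * inter / union


def total_loss(score_loss, border_loss, direction_loss, ctc_loss):
    """Same weighting as loss_all in forward: the CTC term is scaled by 5."""
    return score_loss + border_loss + direction_loss + 5 * ctc_loss


if __name__ == "__main__":
    print(smooth_l1(0.5), smooth_l1(-2.0))
    print(masked_offset_loss([0.0, 0.0], [0.5, 2.0], [1.0, 1.0], [1.0, 0.0], [1.0, 1.0]))
    print(dice_loss([1.0, 0.0], [1.0, 0.0], [1.0, 1.0]))
```

In the second call only the first element lies on the text center line (score 1), so the large error on the second element does not contribute to the loss.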
    
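Post-processing
  • The code is mainly in ppocr/postprocess/pg_postprocess.py and ppocr/utils/e2e_utils/pgnet_pp_utils.py. Two versions are provided, one fast and one slow.
  • I need to study this part carefully when I have time, so I am leaving it unwritten for now.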
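
As a small pointer for that later study: because the character branch is trained with a CTC loss, turning a sequence of per-point class predictions along a center line into a string needs some form of CTC decoding. The sketch below shows only the generic greedy variant (collapse repeats, then drop blanks). It is not the code from pg_postprocess.py, and the character table ordering (digits followed by lowercase letters, with index 36 as blank) is an assumption chosen to match the 36 + 1 split in the loss code.

```python
import string

# Assumed lexicon: 36 characters, with index 36 reserved for the blank class.
CHARSET = string.digits + string.ascii_lowercase
BLANK = len(CHARSET)  # 36


def ctc_greedy_decode(class_ids, blank=BLANK, charset=CHARSET):
    """Collapse consecutive repeats, then drop blanks (generic greedy CTC decoding)."""
    decoded = []
    prev = None
    for idx in class_ids:
        # Emit a character only when the class changes and is not the blank
        if idx != prev and idx != blank:
            decoded.append(charset[idx])
        prev = idx
    return "".join(decoded)


if __name__ == "__main__":
    # Per-point argmax classes sampled along one text center line (made-up values)
    ids = [17, 17, 36, 14, 21, 36, 21, 24]
    print(ctc_greedy_decode(ids))
```

The blank between the two 21s is what lets a doubled letter survive the repeat-collapsing step.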
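Summary

PGNet has the following main advantages:

  • The network uses no NMS or RoI operations, which noticeably reduces overall inference time
  • The PG-CTC Loss designed for training removes the need for character-level annotations
  • A strategy for restoring the reading order of a text line makes it possible to recognize text in non-traditional directions
  • Adding the GRM module effectively improves CTC recognition accuracy

Limitations:

  • The experiments in the paper focus on English and digit datasets, so how well it works on Chinese or mixed Chinese-English text is still unknown.

Things to explore:

  • Check whether the model can be converted to ONNX and, if so, how inference speed and accuracy hold up under ONNXRuntime.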