PSENet in PytorchOCR
A walkthrough of the paper itself is skipped here, since plenty of those exist online. This post only covers the project code.
Label Generation
- Borrowing a figure from the paper: we need to generate several (configurable; 6 in the paper) binary maps, where text pixels are white (1) and background pixels are black (0). The map with the largest white regions is the full text segmentation map; the smallest is what the paper calls the kernel map. The kernels are what make it possible to separate adjacent text instances.
- In ptocr/dataloader/DetLoad/MakeSegMap.py:
```python
import numpy as np
import pyclipper
import Polygon as plg

def shrink(self, bboxes, rate, max_shr=20):
    rate = rate * rate
    shrinked_bboxes = []
    for bbox in bboxes:
        area = plg.Polygon(bbox).area()
        peri = self.perimeter(bbox)

        pco = pyclipper.PyclipperOffset()
        pco.AddPath(bbox, pyclipper.JT_ROUND, pyclipper.ET_CLOSEDPOLYGON)
        # d = area * (1 - r^2) / perimeter, clamped to max_shr
        offset = min(int(area * (1 - rate) / (peri + 0.001) + 0.5), max_shr)

        shrinked_bbox = pco.Execute(-offset)
        if len(shrinked_bbox) == 0:
            # shrinking collapsed the polygon; fall back to the original box
            shrinked_bboxes.append(bbox)
            continue

        shrinked_bbox = np.array(shrinked_bbox)[0]
        if shrinked_bbox.shape[0] <= 2:
            shrinked_bboxes.append(bbox)
            continue

        shrinked_bboxes.append(shrinked_bbox)

    return np.array(shrinked_bboxes)
```
This function shrinks each annotated box to get the smaller polygons; OpenCV is then used to rasterize them into the segmentation maps.
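As a rough check of the shrink-distance formula above, for an axis-aligned rectangle we can compute the area, perimeter, and the resulting clipper offset by hand. This is a pure-Python sketch mirroring the arithmetic (the real code uses Polygon3 and pyclipper):

```python
import math

def shrink_offset(bbox, rate, max_shr=20):
    """Mirror the offset computation in shrink() for a polygon given as
    a list of (x, y) points. The code squares the rate, matching
    d = area * (1 - r^2) / perimeter."""
    area = 0.0
    peri = 0.0
    n = len(bbox)
    for i in range(n):
        x1, y1 = bbox[i]
        x2, y2 = bbox[(i + 1) % n]
        area += x1 * y2 - x2 * y1       # shoelace formula
        peri += math.hypot(x2 - x1, y2 - y1)
    area = abs(area) / 2.0
    rate = rate * rate
    return min(int(area * (1 - rate) / (peri + 0.001) + 0.5), max_shr)

# 100x20 rectangle, kernel scale 0.5:
# area = 2000, peri = 240, offset = 2000 * 0.75 / 240 ≈ 6.25 -> 6
box = [(0, 0), (100, 0), (100, 20), (0, 20)]
print(shrink_offset(box, 0.5))  # 6
```

With rate = 1 (no shrink) the offset comes out 0, so the full text map keeps the original box.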
Model Walkthrough
This detector is segmentation-based. The paper uses an FPN as the segmentation network, with ResNet-50 as the backbone; see ptocr/model/backbone/det_resnet.py, part of which is shown below:
```python
def forward(self, x):
    x = self.conv1(x)
    x = self.bn1(x)
    x = self.relu(x)
    x = self.maxpool(x)

    x2 = self.layer1(x)   # stride 4
    x3 = self.layer2(x2)  # stride 8
    x4 = self.layer3(x3)  # stride 16
    x5 = self.layer4(x4)  # stride 32
    return x2, x3, x4, x5
```
The backbone returns four feature maps (x2, x3, x4, x5) at 1/4, 1/8, 1/16 and 1/32 of the input resolution. These four maps then enter ptocr/model/head/det_FPNHead.py, as follows.
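The stride arithmetic is easy to verify: for a 640×640 input, the four maps come out at 160, 80, 40 and 20 pixels per side (a quick sketch, independent of the actual network):

```python
def fpn_map_sizes(h, w, strides=(4, 8, 16, 32)):
    """Spatial sizes of (x2, x3, x4, x5) for an h x w input."""
    return [(h // s, w // s) for s in strides]

print(fpn_map_sizes(640, 640))  # [(160, 160), (80, 80), (40, 40), (20, 20)]
```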
This is the FPN part that fuses maps from different depths:
```python
# Reduce channels of the deepest map
self.toplayer = ConvBnRelu(in_channels[-1], inner_channels, kernel_size=1, stride=1, padding=0, bias=bias)
# Smooth layers
self.smooth1 = ConvBnRelu(inner_channels, inner_channels, kernel_size=3, stride=1, padding=1, bias=bias)
self.smooth2 = ConvBnRelu(inner_channels, inner_channels, kernel_size=3, stride=1, padding=1, bias=bias)
self.smooth3 = ConvBnRelu(inner_channels, inner_channels, kernel_size=3, stride=1, padding=1, bias=bias)
# Lateral layers
self.latlayer1 = ConvBnRelu(in_channels[-2], inner_channels, kernel_size=1, stride=1, padding=0, bias=bias)
self.latlayer2 = ConvBnRelu(in_channels[-3], inner_channels, kernel_size=1, stride=1, padding=0, bias=bias)
self.latlayer3 = ConvBnRelu(in_channels[-4], inner_channels, kernel_size=1, stride=1, padding=0, bias=bias)
# Output map
self.conv_out = ConvBnRelu(inner_channels * 4, inner_channels, kernel_size=3, stride=1, padding=1, bias=bias)
```
in_channels and inner_channels must be set in the config yaml. in_channels holds the channel counts of the multi-scale outputs (x2, x3, x4, x5); if you swap the backbone, adjust these accordingly. inner_channels can be set freely, but is usually chosen to match the backbone.
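For the standard ResNet-50 backbone, the four stage outputs have 256, 512, 1024 and 2048 channels, so in_channels would be [256, 512, 1024, 2048]. A small sketch of how those values line up with the lateral layers above (channel counts from the standard torchvision ResNet-50; verify against the actual config files in the repo):

```python
# Channel counts of (x2, x3, x4, x5) for a standard ResNet-50 backbone.
in_channels = [256, 512, 1024, 2048]
inner_channels = 256  # common choice; tune to the backbone

# toplayer consumes in_channels[-1]; latlayer1..3 consume [-2], [-3], [-4]
lateral_inputs = {
    "toplayer": in_channels[-1],   # 2048 <- x5
    "latlayer1": in_channels[-2],  # 1024 <- x4
    "latlayer2": in_channels[-3],  # 512  <- x3
    "latlayer3": in_channels[-4],  # 256  <- x2
}
print(lateral_inputs)
```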
```python
def forward(self, x):
    c2, c3, c4, c5 = x

    # Top-down pathway: upsample the deeper map, add the lateral map, smooth
    p5 = self.toplayer(c5)
    c4 = self.latlayer1(c4)
    p4 = upsample_add(p5, c4)
    p4 = self.smooth1(p4)
    c3 = self.latlayer2(c3)
    p3 = upsample_add(p4, c3)
    p3 = self.smooth2(p3)
    c2 = self.latlayer3(c2)
    p2 = upsample_add(p3, c2)
    p2 = self.smooth3(p2)

    # Bring every level to p2's resolution and concatenate
    p3 = upsample(p3, p2)
    p4 = upsample(p4, p2)
    p5 = upsample(p5, p2)
    fuse = torch.cat((p2, p3, p4, p5), 1)
    fuse = self.conv_out(fuse)
    return fuse
```
The operation here is to interpolate each deeper map upward and fuse it with the map one level above, and finally to concatenate the maps from all scales, as described in the paper. That completes the FPN. Next comes ptocr/model/segout/det_PSE_segout.py:
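The upsample_add helper is not shown here; presumably it interpolates the deeper map to the shallower map's size and adds element-wise (the real code would use something like F.interpolate with bilinear mode). A minimal nearest-neighbour sketch on nested lists, just to make the fusion concrete:

```python
def upsample_add(deep, shallow):
    """Nearest-neighbour upsample `deep` (2D list) to `shallow`'s size,
    then add element-wise. Assumes shallow's size is an integer multiple
    of deep's size, as is the case for FPN levels."""
    dh, dw = len(deep), len(deep[0])
    sh, sw = len(shallow), len(shallow[0])
    fy, fx = sh // dh, sw // dw
    return [[deep[y // fy][x // fx] + shallow[y][x] for x in range(sw)]
            for y in range(sh)]

p5 = [[1, 2],
      [3, 4]]
c4 = [[0] * 4 for _ in range(4)]
print(upsample_add(p5, c4))
# [[1, 1, 2, 2], [1, 1, 2, 2], [3, 3, 4, 4], [3, 3, 4, 4]]
```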
```python
class SegDetector(nn.Module):
    def __init__(self, inner_channels=256, classes=7):
        super(SegDetector, self).__init__()
        self.binarize = nn.Conv2d(inner_channels, classes, 1, 1, 0)

    def forward(self, x, img):
        x = self.binarize(x)
        x = upsample(x, img)  # back to the input image size
        if self.training:
            pre_batch = dict(pre_text=x[:, 0])
            pre_batch['pre_kernel'] = x[:, 1:]
            return pre_batch
        return x
```
This outputs the segmentation maps and interpolates them back to the original image size. Seven maps are produced: channel 0 is the largest, corresponding to the full text regions in the image; the following channels shrink progressively, and the smallest kernel map (channel 6) is the one used to separate dense, adjacent text.
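The shrink ratios for the seven maps presumably follow the paper's formula r_i = 1 - (1 - m)(n - i)/(n - 1), where n is the number of maps and m the minimal scale (0.4 in the paper). A quick sketch of the resulting scales, under that assumption:

```python
def kernel_scales(n, m):
    """Shrink ratio r_i for each of the n maps, per the paper's formula
    r_i = 1 - (1 - m) * (n - i) / (n - 1). i = 1 is the smallest kernel
    (r = m); i = n is the full text map (r = 1)."""
    return [1.0 - (1.0 - m) * (n - i) / (n - 1) for i in range(1, n + 1)]

print(kernel_scales(7, 0.4))
# smallest kernel shrinks to 0.4; the last entry (full text map) stays 1.0
```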
Loss
The Dice loss commonly used in segmentation is employed here; from ptocr/model/loss/basical_loss.py:
```python
class DiceLoss(nn.Module):
    def __init__(self, eps=1e-6):
        super(DiceLoss, self).__init__()
        self.eps = eps

    def forward(self, pre_score, gt_score, train_mask):
        # flatten each sample to a vector
        pre_score = pre_score.contiguous().view(pre_score.size()[0], -1)
        gt_score = gt_score.contiguous().view(gt_score.size()[0], -1)
        train_mask = train_mask.contiguous().view(train_mask.size()[0], -1)

        # masked-out pixels contribute nothing to the loss
        pre_score = pre_score * train_mask
        gt_score = gt_score * train_mask

        a = torch.sum(pre_score * gt_score, 1)
        b = torch.sum(pre_score * pre_score, 1) + self.eps
        c = torch.sum(gt_score * gt_score, 1) + self.eps
        d = (2 * a) / (b + c)
        dice_loss = torch.mean(d)
        return 1 - dice_loss
```
Three inputs are needed: the 7 maps predicted by the network, the 7 ground-truth maps from label generation, and a train_mask for those maps. The train_mask's job is to exclude certain pixels from the loss computation (their loss contribution is zero).
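A tiny numeric check of the Dice computation above, in pure Python for a single flattened sample with the mask all ones:

```python
def dice_loss(pre, gt, mask, eps=1e-6):
    """Same arithmetic as DiceLoss.forward, for one flattened sample."""
    pre = [p * m for p, m in zip(pre, mask)]
    gt = [g * m for g, m in zip(gt, mask)]
    a = sum(p * g for p, g in zip(pre, gt))
    b = sum(p * p for p in pre) + eps
    c = sum(g * g for g in gt) + eps
    return 1 - (2 * a) / (b + c)

# Perfect prediction -> loss ~ 0; disjoint prediction -> loss = 1
print(dice_loss([1, 0, 1], [1, 0, 1], [1, 1, 1]))  # ~0.0
print(dice_loss([1, 0, 0], [0, 0, 1], [1, 1, 1]))  # 1.0
```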
OHEM (online hard example mining) is also used, as follows:
```python
def ohem_single(score, gt_text, training_mask):
    # positives, excluding those already masked out
    pos_num = int(np.sum(gt_text > 0.5)) - int(np.sum((gt_text > 0.5) & (training_mask <= 0.5)))

    if pos_num == 0:
        # selected_mask = gt_text.copy() * 0 # may be not good
        selected_mask = training_mask
        selected_mask = selected_mask.reshape(1, selected_mask.shape[0], selected_mask.shape[1]).astype('float32')
        return selected_mask

    neg_num = int(np.sum(gt_text <= 0.5))
    neg_num = int(min(pos_num * 3, neg_num))

    if neg_num == 0:
        selected_mask = training_mask
        selected_mask = selected_mask.reshape(1, selected_mask.shape[0], selected_mask.shape[1]).astype('float32')
        return selected_mask

    # keep the neg_num negatives with the highest scores (hardest negatives)
    neg_score = score[gt_text <= 0.5]
    neg_score_sorted = np.sort(-neg_score)
    threshold = -neg_score_sorted[neg_num - 1]

    selected_mask = ((score >= threshold) | (gt_text > 0.5)) & (training_mask > 0.5)
    selected_mask = selected_mask.reshape(1, selected_mask.shape[0], selected_mask.shape[1]).astype('float32')
    return selected_mask
```
This selects the negatives with the largest loss (highest scores), keeping positives and negatives at a 1:3 ratio: with 3 positive pixels, the 9 negatives with the largest loss are selected.
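The selection rule can be sketched in pure Python on a flat list of scores (the real code works on 2-D numpy arrays and reshapes the result):

```python
def ohem_select(score, gt_text, training_mask, ratio=3):
    """Return a 0/1 selection mask: all positives plus the ratio * pos_num
    hardest negatives (highest scores), restricted to training_mask."""
    pos_num = sum(1 for g, m in zip(gt_text, training_mask) if g > 0.5 and m > 0.5)
    if pos_num == 0:
        # no positives: fall back to the training mask, like ohem_single
        return [float(m > 0.5) for m in training_mask]

    neg_scores = sorted((s for s, g in zip(score, gt_text) if g <= 0.5),
                        reverse=True)
    neg_num = min(ratio * pos_num, len(neg_scores))
    threshold = neg_scores[neg_num - 1] if neg_num > 0 else float('inf')

    return [float(((s >= threshold and g <= 0.5) or g > 0.5) and m > 0.5)
            for s, g, m in zip(score, gt_text, training_mask)]

score = [0.9, 0.1, 0.8, 0.7, 0.2, 0.3]
gt    = [1,   0,   0,   0,   0,   0  ]  # 1 positive -> keep 3 hardest negatives
mask  = [1] * 6
print(ohem_select(score, gt, mask))  # [1.0, 0.0, 1.0, 1.0, 0.0, 1.0]
```

The threshold is the score of the neg_num-th hardest negative, so easy negatives (low scores) fall out of the loss entirely.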
Note: all figures in this post are from the paper.