Text Detection -- Differentiable Binarization

1. Abstract

Recently, segmentation-based methods have become quite popular in scene text detection, as segmentation results can more accurately describe scene text of various shapes, such as curved text.

However, the post-processing of binarization is essential for segmentation-based detection, which converts probability maps produced by a segmentation method into bounding boxes/regions of text.

In this paper, we propose a module named Differentiable Binarization (DB), which can perform the binarization process in a segmentation network.

Optimized along with a DB module, a segmentation network can adaptively set the thresholds for binarization, which not only simplifies the post-processing but also enhances the performance of text detection.

Based on a simple segmentation network, we validate the performance improvements of DB on five benchmark datasets, where it consistently achieves state-of-the-art results in terms of both detection accuracy and speed.

In particular, with a light-weight backbone, the performance improvements by DB are significant so that we can look for an ideal tradeoff between detection accuracy and efficiency.

Specifically, with a backbone of ResNet-18, our detector achieves an F-measure of 82.8, running at 62 FPS, on the MSRA-TD500 dataset.



2. Introduction

As a key component of scene text reading, scene text detection, which aims to localize the bounding box or region of each text instance, is still a challenging task, since scene text often comes in various scales and shapes, including horizontal, multi-oriented and curved text.

Segmentation-based scene text detection has attracted a lot of attention recently, as it can describe text of various shapes, benefiting from its pixel-level prediction results.

However, most segmentation-based methods require complex post-processing for grouping the pixel-level prediction results into detected text instances, resulting in a considerable time cost in the inference procedure.


Take two recent state-of-the-art methods for scene text detection as examples:

PSENet (Wang et al. 2019a) proposed the post-processing of progressive scale expansion for improving the detection accuracies;

Pixel embedding in (Tian et al. 2019) is used for clustering the pixels based on the segmentation results, which has to calculate the feature distances among pixels.


Most existing detection methods use a similar post-processing pipeline, as shown in Fig. 2:

Firstly, they set a fixed threshold for converting the probability map produced by a segmentation network into a binary image;

Then, some heuristic techniques like pixel clustering are used for grouping pixels into text instances.
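The conventional two-step pipeline described above can be sketched as follows. This is a minimal illustration, not the code of any cited method: the 0.5 threshold, the function names `binarize` and `group_pixels`, and the choice of 4-connected components as the grouping heuristic are all assumptions made for the example.

```python
from collections import deque

def binarize(prob_map, threshold=0.5):
    """Step 1: hard-threshold a probability map into a 0/1 map.
    The fixed threshold (0.5 here) is an assumed value."""
    return [[1 if p >= threshold else 0 for p in row] for row in prob_map]

def group_pixels(binary_map):
    """Step 2: group foreground pixels into text instances.
    4-connected components stand in for the heuristic grouping step."""
    h = len(binary_map)
    w = len(binary_map[0]) if h else 0
    seen = [[False] * w for _ in range(h)]
    components = []
    for y in range(h):
        for x in range(w):
            if binary_map[y][x] == 0 or seen[y][x]:
                continue
            # Breadth-first search from an unvisited foreground pixel.
            queue = deque([(y, x)])
            seen[y][x] = True
            pixels = []
            while queue:
                cy, cx = queue.popleft()
                pixels.append((cy, cx))
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if 0 <= ny < h and 0 <= nx < w and binary_map[ny][nx] == 1 and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            components.append(pixels)
    return components

prob_map = [
    [0.9, 0.8, 0.1, 0.2],
    [0.7, 0.1, 0.1, 0.9],
    [0.1, 0.1, 0.1, 0.8],
]
instances = group_pixels(binarize(prob_map))
```

Because the threshold is a constant chosen outside the network, the hard comparison in `binarize` passes no gradient back to the segmentation model.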

In contrast, our pipeline aims to insert the binarization operation into the segmentation network for joint optimization.

In this manner, the threshold value at every location of an image can be adaptively predicted, which helps to fully distinguish foreground pixels from background pixels.

However, the standard binarization function is not differentiable. We therefore present an approximate function for binarization, called Differentiable Binarization (DB), which is fully differentiable when trained along with a segmentation network.
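The text here does not spell out the approximate function. A common way to write it, and the form assumed in this sketch, is a steep sigmoid of the difference between the probability P and the threshold T: B̂ = 1 / (1 + exp(-k (P - T))). The amplifying factor k = 50 and all function names below are assumptions for illustration.

```python
import math

def standard_binarization(p, t):
    """Hard step function: its gradient is zero wherever it is defined."""
    return 1.0 if p >= t else 0.0

def differentiable_binarization(p, t, k=50.0):
    """Steep sigmoid of (p - t); k is an assumed amplifying factor."""
    return 1.0 / (1.0 + math.exp(-k * (p - t)))

def db_gradient_wrt_p(p, t, k=50.0):
    """Derivative of the sigmoid form with respect to p: k * b * (1 - b)."""
    b = differentiable_binarization(p, t, k)
    return k * b * (1.0 - b)

# Far from the threshold the two functions nearly agree;
# near the threshold the smooth version still carries a gradient.
soft_fg = differentiable_binarization(0.9, 0.3)
soft_bg = differentiable_binarization(0.1, 0.3)
grad_at_threshold = db_gradient_wrt_p(0.3, 0.3)
```

With a large k the output is close to 0 or 1 almost everywhere, so it behaves like a binary map while remaining differentiable in both P and T.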


The major contribution in this paper is the proposed DB module that is differentiable, which makes the process of binarization end-to-end trainable in a CNN.

By combining a simple network for semantic segmentation with the proposed DB module, we propose a robust and fast scene text detector.

From the performance evaluation of the DB module, we find that our detector has several prominent advantages over previous state-of-the-art segmentation-based approaches.

  1. Our method achieves consistently better performance on five benchmark datasets of scene text, including horizontal, multi-oriented and curved text.
  2. Our method performs much faster than the previous leading methods, as DB can provide a highly robust binarization map, significantly simplifying the post-processing.
  3. DB works quite well when using a light-weight backbone, which significantly enhances the detection performance with the backbone of ResNet-18.
  4. As DB can be removed in the inference stage without sacrificing the performance, there is no extra memory/time cost for testing.


3. Methodology

The architecture of our proposed method is shown in Fig. 3:

Firstly, the input image is fed into a feature-pyramid backbone.

Secondly, the pyramid features are up-sampled to the same scale and cascaded to produce feature F.

Then, feature F is used to predict both the probability map (P) and the threshold map (T).

After that, the approximate binary map (B̂) is calculated from P and T.
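The fusion step in the description above (up-sampling the pyramid features to one scale and concatenating them into F) can be sketched in plain Python. Nearest-neighbour interpolation, integer scale factors, and the names `upsample_nearest` and `fuse_pyramid` are assumptions for this example; the text does not say which interpolation the actual network uses.

```python
def upsample_nearest(channel, factor):
    """Repeat every pixel factor x factor times (nearest-neighbour)."""
    out = []
    for row in channel:
        wide = [v for v in row for _ in range(factor)]
        out.extend([list(wide) for _ in range(factor)])
    return out

def fuse_pyramid(levels):
    """Bring all pyramid levels to the finest resolution and concatenate.

    levels: list of feature maps ordered fine to coarse; each feature map
    is a list of channels, and each channel is a 2-D list of numbers.
    Returns the channels of the fused feature F.
    """
    target_h = len(levels[0][0])
    fused = []
    for channels in levels:
        # Assumes the target height is an integer multiple of this level's height.
        factor = target_h // len(channels[0])
        for ch in channels:
            fused.append(upsample_nearest(ch, factor))
    return fused

fine = [[[1, 2], [3, 4]]]   # 1 channel, 2x2
coarse = [[[5]], [[6]]]     # 2 channels, 1x1
feature_f = fuse_pyramid([fine, coarse])
```

In this toy case `feature_f` has three channels of size 2x2, which would then feed the two prediction heads for P and T.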

In the training period, the supervision is applied on the probability map, the threshold map, and the approximate binary map, where the probability map and the approximate binary map share the same supervision.
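A minimal sketch of how the three supervised outputs might be combined into one training loss is given below. It only illustrates that the probability map and the approximate binary map are compared with the same label, while the threshold map has its own label. The choice of binary cross-entropy and L1 terms, the weights `alpha` and `beta`, the factor `k`, and every function name are assumptions for this example, and maps are flattened to 1-D lists for brevity.

```python
import math

def bce(pred, target, eps=1e-6):
    """Mean binary cross-entropy over flattened maps."""
    total = 0.0
    for p, y in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= y * math.log(p) + (1.0 - y) * math.log(1.0 - p)
    return total / len(pred)

def l1(pred, target):
    """Mean absolute error over flattened maps."""
    return sum(abs(p - y) for p, y in zip(pred, target)) / len(pred)

def approx_binary(prob, thresh, k=50.0):
    """Approximate binary map from P and T (assumed sigmoid form)."""
    return [1.0 / (1.0 + math.exp(-k * (p - t))) for p, t in zip(prob, thresh)]

def total_loss(prob, thresh, text_label, thresh_label, alpha=1.0, beta=10.0):
    """Weighted sum of the three supervision terms (weights are assumed)."""
    binary = approx_binary(prob, thresh)
    loss_prob = bce(prob, text_label)       # supervision on P
    loss_binary = bce(binary, text_label)   # same label as P
    loss_thresh = l1(thresh, thresh_label)  # separate label for T
    return loss_prob + alpha * loss_binary + beta * loss_thresh

text_label = [1.0, 0.0]
thresh_label = [0.3, 0.3]
good = total_loss([0.9, 0.1], [0.3, 0.3], text_label, thresh_label)
bad = total_loss([0.1, 0.9], [0.3, 0.3], text_label, thresh_label)
```

A prediction that agrees with the text label (`good`) yields a smaller loss than one that contradicts it (`bad`), through both the probability term and the binary term.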

In the inference period, the bounding boxes can be obtained easily from the approximate binary map or the probability map by a box formulation module.





References

Real-time Scene Text Detection with Differentiable Binarization
