Text Detection -- CTPN

1. Abstract

We propose a novel Connectionist Text Proposal Network (CTPN) that accurately localizes text lines in natural images.

The CTPN detects a text line in a sequence of fine-scale text proposals directly in convolutional feature maps.

We develop a vertical anchor mechanism that jointly predicts location and text/non-text score of each fixed-width proposal, considerably improving localization accuracy.
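The vertical anchor idea can be sketched as follows. This is a minimal illustration, not the paper's implementation: all anchors share a fixed width (16 px, matching the VGG16 conv5 stride), and only the heights vary over a geometric progression; the exact count and height range used here follow the paper's design but should be treated as assumptions of this sketch.

```python
import numpy as np

def vertical_anchor_heights(n_anchors=10, h_min=11.0, ratio=0.7):
    """Heights for the fixed-width vertical anchors: start small and
    grow by 1/ratio each step, covering small through very tall text."""
    heights = [h_min / (ratio ** k) for k in range(n_anchors)]
    return np.round(heights).astype(int)

def anchors_at(cx, cy, heights, width=16):
    """Place one anchor per height at a single feature-map location
    (cx, cy), as (x1, y1, x2, y2) boxes. Width never changes; the
    network only needs to regress the vertical coordinates."""
    boxes = [(cx - width / 2, cy - h / 2, cx + width / 2, cy + h / 2)
             for h in heights]
    return np.array(boxes, dtype=float)

heights = vertical_anchor_heights()
boxes = anchors_at(100, 50, heights)
```

Because the width is fixed, each proposal only needs a text/non-text score and two vertical offsets, which is what makes fine-scale localization tractable.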

The sequential proposals are naturally connected by a recurrent neural network, which is seamlessly incorporated into the convolutional network, resulting in an end-to-end trainable model.
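Conceptually, the in-network recurrence runs a bidirectional recurrent layer along each row of the convolutional feature map, so every fine-scale proposal sees its left and right neighbors. The sketch below uses a plain tanh-RNN in place of the paper's BLSTM to stay short, and the per-proposal feature size (a 3×3×512 window on conv5) is an assumption of this sketch:

```python
import numpy as np

def birnn_over_row(X, Wx_f, Wh_f, Wx_b, Wh_b):
    """Run a simple bidirectional tanh-RNN over one feature-map row
    (X: [T, C], one feature vector per fine-scale proposal) and
    concatenate both directions, giving context-aware features."""
    T, _ = X.shape
    H = Wh_f.shape[0]
    hf = np.zeros((T, H)); hb = np.zeros((T, H))
    h = np.zeros(H)
    for t in range(T):                       # left-to-right pass
        h = np.tanh(X[t] @ Wx_f + h @ Wh_f)
        hf[t] = h
    h = np.zeros(H)
    for t in reversed(range(T)):             # right-to-left pass
        h = np.tanh(X[t] @ Wx_b + h @ Wh_b)
        hb[t] = h
    return np.concatenate([hf, hb], axis=1)  # [T, 2H]

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 3 * 3 * 512))   # 20 proposals in one row
Wx_f = rng.standard_normal((X.shape[1], 64)) * 0.01
Wh_f = rng.standard_normal((64, 64)) * 0.01
Wx_b = rng.standard_normal((X.shape[1], 64)) * 0.01
Wh_b = rng.standard_normal((64, 64)) * 0.01
H = birnn_over_row(X, Wx_f, Wh_f, Wx_b, Wh_b)
```

The output at each position then feeds the prediction heads, so each proposal's score and vertical regression are conditioned on the whole line rather than an isolated window.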

This allows the CTPN to explore rich contextual information of the image, making it powerful for detecting extremely ambiguous text.

The CTPN works reliably on multi-scale and multi-language text without further post-processing, departing from previous bottom-up methods requiring multi-step post filtering.

It achieves 0.88 and 0.61 F-measure on the ICDAR 2013 and 2015 benchmarks, surpassing recent results [8,35] by a large margin.

The CTPN is computationally efficient with 0.14s/image, by using the very deep VGG16 model [27].



2. Introduction

Reading text in natural images has recently attracted increasing attention in computer vision [8,14,15,10,35,11,9,1,28,32].

This is due to its numerous practical applications such as image OCR, multi-language translation, image retrieval, etc.

It includes two sub-tasks: text detection and recognition. This work focuses on the detection task [14,1,28,32], which is more challenging than the recognition task carried out on a well-cropped word image [15,9].

The large variance of text patterns and highly cluttered backgrounds pose the main challenges for accurate text localization.


Current approaches for text detection mostly employ a bottom-up pipeline [28,1,14,32,33].

They commonly start from low-level character or stroke detection, which is typically followed by a number of subsequent steps: non-text component filtering, text line construction and text line verification.

These multi-step bottom-up approaches are generally complicated with less robustness and reliability.

Their performance relies heavily on the results of character detection, for which both connected-components methods and sliding-window methods have been proposed.

These methods commonly explore low-level features (e.g., based on SWT [3,13], MSER [14,33,23], or HoG [28]) to distinguish text candidates from background.

However, they are not robust, because they identify individual strokes or characters separately, without context information.

For example, people can identify a sequence of characters far more confidently than an individual one, especially when a character is extremely ambiguous.

These limitations often result in a large number of non-text components in character detection, which are difficult to handle in the following steps.

Furthermore, these false detections are easily accumulated sequentially in bottom-up pipeline, as pointed out in [28].

To address these problems, we exploit strong deep features for detecting text information directly in convolutional maps.

We develop a vertical anchor mechanism that accurately predicts text locations at a fine scale. Then, an in-network recurrent architecture is proposed to connect these fine-scale text proposals in sequences, allowing them to encode rich context information.


Deep Convolutional Neural Networks (CNN) have recently advanced general object detection substantially [25,5,6].

The state-of-the-art method is Faster Region-CNN (R-CNN) system [25] where a Region Proposal Network (RPN) is proposed to generate high-quality class-agnostic object proposals directly from convolutional feature maps.

Then the RPN proposals are fed into a Fast R-CNN [5] model for further classification and refinement, leading to the state-of-the-art performance on generic object detection.

However, it is difficult to apply these general object detection systems directly to scene text detection, which generally requires a higher localization accuracy.

In generic object detection, each object has a well-defined closed boundary [2], while such a well-defined boundary may not exist in text, since a text line or word is composed of a number of separate characters or strokes.

For object detection, a typical correct detection is defined loosely, e.g., by an overlap of > 0.5 between the detected bounding box and its ground truth (e.g., the PASCAL standard [4]), since people can recognize an object easily from major part of it.
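The looseness of the PASCAL criterion is easy to see numerically. A hypothetical detection that crops off the last characters of a text line can still clear the IoU > 0.5 bar (the boxes and numbers below are made up for illustration):

```python
def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

gt  = (0, 0, 200, 20)   # full text line
det = (0, 0, 130, 20)   # detection that misses the trailing characters
overlap = iou(gt, det)  # 0.65: a "correct" detection by the PASCAL
                        # standard, yet useless for reading the line
```

This gap is exactly why text benchmarks adopt the stricter Wolf standard mentioned next.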

By contrast, reading text comprehensively is a fine-grained recognition task which requires a correct detection that covers a full region of a text line or word. Therefore, text detection generally requires a more accurate localization, leading to a different evaluation standard, e.g., the Wolf’s standard [30] which is commonly employed by text benchmarks [19,21].

In this work, we fill this gap by extending the RPN architecture [25] to accurate text line localization.

We present several technical developments that elegantly tailor the generic object detection model to our problem.

We take a further step by proposing an in-network recurrent mechanism that allows our model to detect text sequences directly in the convolutional maps, avoiding further post-processing by an additional costly CNN detection model.


2.1 Contributions

We propose a novel Connectionist Text Proposal Network (CTPN) that directly localizes text sequences in convolutional layers.

This overcomes a number of main limitations raised by previous bottom-up approaches building on character detection.

We leverage the advantages of strong deep convolutional features and sharing computation mechanism, and propose the CTPN architecture which is described in Fig. 1.

It makes the following major contributions:

First, we cast the problem of text detection into localizing a sequence of fine-scale text proposals. We develop an anchor regression mechanism that jointly predicts the vertical location and text/non-text score of each text proposal, resulting in excellent localization accuracy. This departs from the RPN prediction of a whole object, which struggles to provide satisfactory localization accuracy for text.
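The vertical regression can be sketched in the usual relative-coordinate form: the proposal's center-y and height are encoded against the anchor's, while the horizontal extent stays fixed. The encoding below follows the paper's formulation, but the concrete numbers are illustrative assumptions:

```python
import numpy as np

def encode_vertical(cy, h, cy_a, h_a):
    """Relative vertical targets for one anchor: normalized center-y
    offset and log-scale height ratio."""
    vc = (cy - cy_a) / h_a
    vh = np.log(h / h_a)
    return vc, vh

def decode_vertical(vc, vh, cy_a, h_a):
    """Invert the encoding to recover the predicted center-y and height."""
    cy = vc * h_a + cy_a
    h = h_a * np.exp(vh)
    return cy, h

# Round trip with made-up ground truth (cy=60, h=44) and anchor (cy=55, h=32):
vc, vh = encode_vertical(cy=60.0, h=44.0, cy_a=55.0, h_a=32.0)
cy, h = decode_vertical(vc, vh, cy_a=55.0, h_a=32.0)
```

Normalizing by the anchor height keeps the targets scale-invariant, so one regressor serves anchors of every height.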

Second, we propose an in-network recurrence mechanism that elegantly connects sequential text proposals in the convolutional feature maps. This connection allows our detector to explore meaningful context information of a text line, enabling it to detect extremely challenging text reliably.

Third, both methods are integrated seamlessly to suit the sequential nature of text, resulting in a unified end-to-end trainable model. Our method is able to handle multi-scale and multi-lingual text in a single process, avoiding further post filtering or refinement.

Fourth, our method achieves new state-of-the-art results on a number of benchmarks, significantly improving recent results (e.g., 0.88 F-measure over 0.83 in [8] on the ICDAR 2013, and 0.61 F-measure over 0.54 in [35] on the ICDAR 2015). Furthermore, it is computationally efficient, resulting in a 0.14s/image running time (on the ICDAR 2013) by using the very deep VGG16 model [27].



3. Related Work

  • Text detection.

Past works in scene text detection have been dominated by bottom-up approaches which are generally built on stroke or character detection.

They can be roughly grouped into two categories, connected-components (CCs) based approaches and sliding-window based methods.

The CCs based approaches discriminate text and non-text pixels by using a fast filter, and then text pixels are greedily grouped into stroke or character candidates, by using low-level properties, e.g., intensity, color, gradient, etc. [33,14,32,13,3].

The sliding-window based methods detect character candidates by densely moving a multi-scale window through an image. The character or non-character window is discriminated by a pre-trained classifier, by using manually-designed features [28,29], or recent CNN features [16].

However, both groups of methods commonly suffer from poor performance of character detection, causing accumulated errors in following component filtering and text line construction steps.

Furthermore, robustly filtering out non-character components and confidently verifying detected text lines are themselves difficult problems [1,33,14].

Another limitation is that the sliding-window methods are computationally expensive, because a classifier must be run on a huge number of sliding windows.
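The cost argument is easy to make concrete: even a modest image, a dense stride, and a few scales already yield tens of thousands of windows, each needing a classifier evaluation. The window size, scales, and stride below are illustrative assumptions, not values from any cited method:

```python
def sliding_windows(img_h, img_w, win=24, scales=(1.0, 0.75, 0.5), stride=4):
    """Enumerate every window a classical sliding-window detector would
    have to score: all positions, at several scales."""
    boxes = []
    for s in scales:
        w = int(round(win / s))  # window side length at this scale
        for y in range(0, img_h - w + 1, stride):
            for x in range(0, img_w - w + 1, stride):
                boxes.append((x, y, x + w, y + w))
    return boxes

n_windows = len(sliding_windows(480, 640))  # tens of thousands of windows
```

A fully convolutional detector like the CTPN replaces all of these independent evaluations with one shared pass over the feature maps.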


  • Object detection.

Convolutional Neural Networks (CNN) have recently advanced general object detection substantially [25,5,6].

A common strategy is to generate a number of object proposals by employing inexpensive low-level features, and then a strong CNN classifier is applied to further classify and refine the generated proposals.

Selective Search (SS) [4] which generates class-agnostic object proposals, is one of the most popular methods applied in recent leading object detection systems, such as Region CNN (R-CNN) [6] and its extensions [5].

Recently, Ren et al. [25] proposed a Faster R-CNN system for object detection. They proposed a Region Proposal Network (RPN) that generates high-quality class-agnostic object proposals directly from the convolutional feature maps. The RPN is fast by sharing convolutional computation. However, the RPN proposals are not discriminative, and require further refinement and classification by an additional costly CNN model, e.g., the Fast R-CNN model [5]. More importantly, text differs significantly from general objects, making it difficult to directly apply general object detection systems to this highly domain-specific task.
