Text Recognition -- CRNN

1. Abstract

In this paper, we investigate the problem of scene text recognition, which is among the most important and challenging tasks in image-based sequence recognition.

A novel neural network architecture, which integrates feature extraction, sequence modeling, and transcription into a unified framework, is proposed.

Compared with previous systems for scene text recognition, the proposed architecture possesses four distinctive properties:

(1) It is end-to-end trainable, in contrast to most of the existing algorithms whose components are separately trained and tuned.
(2) It naturally handles sequences of arbitrary lengths, involving no character segmentation or horizontal scale normalization.
(3) It is not confined to any predefined lexicon and achieves remarkable performances in both lexicon-free and lexicon-based scene text recognition tasks.
(4) It generates an effective yet much smaller model, which is more practical for real-world application scenarios.

The experiments on standard benchmarks, including the IIIT-5K, Street View Text and ICDAR datasets, demonstrate the superiority of the proposed algorithm over the prior arts.

Moreover, the proposed algorithm performs well in the task of image-based music score recognition, which evidently verifies its generality.



2. Introduction

In this paper, we are concerned with a classic problem in computer vision: image-based sequence recognition.

In the real world, visual objects such as scene text, handwriting and musical scores tend to occur in the form of a sequence rather than in isolation.

Unlike general object recognition, recognizing such sequence-like objects often requires the system to predict a series of object labels, instead of a single label.

Therefore, recognition of such objects can be naturally cast as a sequence recognition problem.


Another unique property of sequence-like objects is that their lengths may vary drastically.

For instance, English words can either consist of 2 characters such as “OK” or 15 characters such as “congratulations”.

Consequently, the most popular deep models, such as deep convolutional neural networks (DCNN), cannot be directly applied to sequence prediction, since DCNN models operate on inputs and outputs with fixed dimensions and thus are incapable of producing a variable-length label sequence.

Some attempts have been made to address this problem for a specific sequence-like object (e.g. scene text).

Such methods often require training a strong character detector for accurately detecting and cropping each character out from the original word image.

Some other approaches treat scene text recognition as an image classification problem, and assign a class label to each English word (90K words in total).

This results in a large trained model with a huge number of classes, which is difficult to generalize to other types of sequence-like objects, such as Chinese text, musical scores, etc., because the number of basic combinations for such sequences can exceed 1 million.

In summary, current systems based on DCNN cannot be directly used for image-based sequence recognition.


Recurrent neural network (RNN) models, another important branch of the deep neural networks family, were mainly designed for handling sequences.

One of the advantages of RNN is that it does not require the position of each element in a sequence object image in either training or testing.

However, a preprocessing step that converts an input object image into a sequence of image features is usually essential.

For example, Graves et al. [16] extract a set of geometrical or image features from handwritten texts, while Su and Lu [33] convert word images into sequential HOG features.

The preprocessing step is independent of the subsequent components in the pipeline, so the existing systems based on RNN cannot be trained and optimized in an end-to-end fashion.
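To make the kind of preprocessing meant here concrete, the following is a minimal sketch (not the exact pipeline of [16] or [33]) that slices a grayscale word image into overlapping column windows and flattens each window into one feature frame, yielding the fixed-dimensional feature sequence an RNN could consume. The window and stride values are illustrative assumptions.

```python
import numpy as np

def image_to_feature_sequence(image, window=8, stride=4):
    """Slice a grayscale word image (H x W) into overlapping column
    windows and flatten each window into one feature frame.
    Returns an array of shape (num_frames, H * window)."""
    h, w = image.shape
    frames = []
    for x in range(0, w - window + 1, stride):
        patch = image[:, x:x + window]    # vertical strip, H x window
        frames.append(patch.reshape(-1))  # flatten to one feature vector
    return np.stack(frames)

# A 32 x 100 word image yields a sequence of 24 frames of 256 features each.
img = np.random.rand(32, 100)
seq = image_to_feature_sequence(img)
print(seq.shape)  # (24, 256)
```

Because this step is handcrafted and fixed, its parameters cannot receive gradients from the recognition loss, which is exactly why such pipelines cannot be optimized end to end.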


Several conventional scene text recognition methods that are not based on neural networks also brought insightful ideas and novel representations into this field.

For example, Almazán et al. [5] and Rodriguez-Serrano et al. [30] proposed to embed word images and text strings in a common vectorial subspace, converting word recognition into a retrieval problem. Yao et al. [36] and Gordo et al. [14] used mid-level features for scene text recognition.

Though these methods achieved promising performance on standard benchmarks, they are generally outperformed by previous algorithms based on neural networks [8, 22], as well as by the approach proposed in this paper.


The main contribution of this paper is a novel neural network model, whose network architecture is specifically designed for recognizing sequence-like objects in images.

The proposed neural network model is named Convolutional Recurrent Neural Network (CRNN), since it is a combination of DCNN and RNN.

For sequence-like objects, CRNN possesses several distinctive advantages over conventional neural network models:

  1. It can be directly learned from sequence labels (for instance, words), requiring no detailed annotations (for instance, characters);
  2. It has the same property as DCNN of learning informative representations directly from image data, requiring neither handcrafted features nor preprocessing steps such as binarization/segmentation, component localization, etc.;
  3. It has the same property of RNN, being able to produce a sequence of labels;
  4. It is not constrained by the lengths of sequence-like objects, requiring only height normalization in both the training and testing phases;
  5. It achieves better or highly competitive performance on scene texts (word recognition) than the prior arts [23, 8];
  6. It contains far fewer parameters than a standard DCNN model, consuming less storage space.


3. The Proposed Network Architecture

The network architecture of CRNN, as shown in Fig. 1, consists of three components: the convolutional layers, the recurrent layers, and a transcription layer, from bottom to top.

At the bottom of CRNN, the convolutional layers automatically extract a feature sequence from each input image.

On top of the convolutional network, a recurrent network is built for making a prediction for each frame of the feature sequence output by the convolutional layers.
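The conversion from convolutional feature map to feature sequence can be sketched as follows. The shapes here (512 channels, height collapsed to 1, 26 columns) are illustrative assumptions, not values taken from the paper's configuration table: each column of the map becomes one frame, read left to right, so frame t describes a vertical strip of the input image.

```python
import numpy as np

# Hypothetical output of the convolutional layers for one image:
# 512 channels, height pooled down to 1, 26 columns wide.
feature_map = np.random.rand(512, 1, 26)

# Each column becomes one frame of the feature sequence for the RNN.
feature_sequence = [feature_map[:, :, t].reshape(-1)
                    for t in range(feature_map.shape[2])]

print(len(feature_sequence))      # 26 frames
print(feature_sequence[0].shape)  # (512,)
```

Since each frame corresponds to a receptive field in the original image, the recurrent layers can label image regions without any explicit character segmentation.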

The transcription layer at the top of CRNN is adopted to translate the per-frame predictions by the recurrent layers into a label sequence.

Though CRNN is composed of different kinds of network architectures (e.g., CNN and RNN), it can be jointly trained with one loss function.
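The simplest form of this transcription is greedy (best-path) CTC-style decoding: take the most likely label at each frame, collapse consecutive repeats, then drop blanks. The sketch below shows only this decoding rule; the paper's full transcription layer (the CTC loss, lexicon-based decoding) is more involved, and the label mapping used here is an invented example.

```python
def ctc_greedy_decode(frame_labels, blank=0):
    """Best-path decoding: collapse consecutive repeated labels,
    then remove blank symbols."""
    out = []
    prev = None
    for lab in frame_labels:
        if lab != prev and lab != blank:
            out.append(lab)
        prev = lab
    return out

# Per-frame argmax predictions; 0 is the blank symbol.
# With the (hypothetical) mapping 1='h', 2='e', 3='l', 4='o',
# the frames h h - e l l - l o collapse to "hello".
frames = [1, 1, 0, 2, 3, 3, 0, 3, 4]
print(ctc_greedy_decode(frames))  # [1, 2, 3, 3, 4]
```

Note how the blank between the two runs of 3 is what allows a doubled letter ("ll") to survive the repeat-collapsing step.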





References

An End-to-End Trainable Neural Network for Image-based Sequence Recognition and Its Application to Scene Text Recognition
