Text Recognition -- CRNN

1. Abstract

In this paper, we investigate the problem of scene text recognition, which is among the most important and challenging tasks in image-based sequence recognition.

A novel neural network architecture, which integrates feature extraction, sequence modeling, and transcription into a unified framework, is proposed.

Compared with previous systems for scene text recognition, the proposed architecture possesses four distinctive properties:

(1) It is end-to-end trainable, in contrast to most of the existing algorithms whose components are separately trained and tuned.
(2) It naturally handles sequences of arbitrary lengths, involving no character segmentation or horizontal scale normalization.
(3) It is not confined to any predefined lexicon and achieves remarkable performances in both lexicon-free and lexicon-based scene text recognition tasks.
(4) It generates an effective yet much smaller model, which is more practical for real-world application scenarios.

The experiments on standard benchmarks, including the IIIT-5K, Street View Text and ICDAR datasets, demonstrate the superiority of the proposed algorithm over the prior arts.

Moreover, the proposed algorithm performs well in the task of image-based music score recognition, which evidently verifies its generality.



2. Introduction

In this paper, we are concerned with a classic problem in computer vision: image-based sequence recognition.

In the real world, visual objects such as scene text, handwriting and musical scores tend to occur in the form of a sequence rather than in isolation.

Unlike general object recognition, recognizing such sequence-like objects often requires the system to predict a series of object labels, instead of a single label.

Therefore, recognition of such objects can be naturally cast as a sequence recognition problem.


Another unique property of sequence-like objects is that their lengths may vary drastically.

For instance, English words can either consist of 2 characters such as “OK” or 15 characters such as “congratulations”.

Consequently, the most popular deep models, such as deep convolutional neural networks (DCNN), cannot be directly applied to sequence prediction, since DCNN models operate on inputs and outputs with fixed dimensions and thus are incapable of producing a variable-length label sequence.

Some attempts have been made to address this problem for a specific sequence-like object (e.g. scene text).

Such methods often require training a strong character detector for accurately detecting and cropping each character out from the original word image.

Some other approaches treat scene text recognition as an image classification problem, and assign a class label to each English word (90K words in total).

This results in a large trained model with a huge number of classes, which is difficult to generalize to other types of sequence-like objects, such as Chinese text, musical scores, etc., because the number of basic combinations for such sequences can exceed 1 million.

In summary, current systems based on DCNN cannot be directly used for image-based sequence recognition.


Recurrent neural network (RNN) models, another important branch of the deep neural networks family, were mainly designed for handling sequences.

One of the advantages of RNN is that it does not require the position of each element in a sequence object image in either training or testing.

However, a preprocessing step that converts an input object image into a sequence of image features is usually essential.

For example, Graves et al. [16] extract a set of geometrical or image features from handwritten texts, while Su and Lu [33] convert word images into sequential HOG features.

The preprocessing step is independent of the subsequent components in the pipeline, so the existing systems based on RNN cannot be trained and optimized in an end-to-end fashion.
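To make the kind of preprocessing meant here concrete, the following is a minimal sketch (not the exact pipeline of [16] or [33]) that slices a grayscale word image into overlapping column windows and flattens each window into one feature frame, yielding the fixed-dimensional feature sequence an RNN could consume. The window and stride values are illustrative assumptions.

```python
import numpy as np

def image_to_feature_sequence(image, window=8, stride=4):
    """Slice a grayscale word image (H x W) into overlapping column
    windows and flatten each window into one feature frame.
    Returns an array of shape (num_frames, H * window)."""
    h, w = image.shape
    frames = []
    for x in range(0, w - window + 1, stride):
        patch = image[:, x:x + window]    # vertical strip, H x window
        frames.append(patch.reshape(-1))  # flatten to one feature vector
    return np.stack(frames)

# A 32 x 100 word image yields a sequence of 24 frames of 256 features each.
img = np.random.rand(32, 100)
seq = image_to_feature_sequence(img)
print(seq.shape)  # (24, 256)
```

Because this step is handcrafted and fixed, its parameters cannot receive gradients from the recognition loss, which is exactly why such pipelines cannot be optimized end to end.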


Several conventional scene text recognition methods that are not based on neural networks also brought insightful ideas and novel representations into this field.

For example, Almazán et al. [5] and Rodriguez-Serrano et al. [30] proposed to embed word images and text strings in a common vectorial subspace, converting word recognition into a retrieval problem. Yao et al. [36] and Gordo et al. [14] used mid-level features for scene text recognition.

Though these methods achieved promising performance on standard benchmarks, they are generally outperformed by previous algorithms based on neural networks [8, 22], as well as by the approach proposed in this paper.


The main contribution of this paper is a novel neural network model, whose network architecture is specifically designed for recognizing sequence-like objects in images.

The proposed neural network model is named Convolutional Recurrent Neural Network (CRNN), since it is a combination of DCNN and RNN.

For sequence-like objects, CRNN possesses several distinctive advantages over conventional neural network models:

  1. It can be directly learned from sequence labels (for instance, words), requiring no detailed annotations (for instance, characters);
  2. It has the same property as DCNN of learning informative representations directly from image data, requiring neither handcrafted features nor preprocessing steps such as binarization/segmentation, component localization, etc.;
  3. It has the same property of RNN, being able to produce a sequence of labels;
  4. It is not constrained by the lengths of sequence-like objects, requiring only height normalization in both the training and testing phases;
  5. It achieves better or highly competitive performance on scene texts (word recognition) than the prior arts [23, 8];
  6. It contains far fewer parameters than a standard DCNN model, consuming less storage space.


3. The Proposed Network Architecture

The network architecture of CRNN, as shown in Fig. 1, consists of three components: the convolutional layers, the recurrent layers, and a transcription layer, from bottom to top.

At the bottom of CRNN, the convolutional layers automatically extract a feature sequence from each input image.

On top of the convolutional network, a recurrent network is built for making a prediction for each frame of the feature sequence output by the convolutional layers.
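The conversion from convolutional feature map to feature sequence can be sketched as follows. The shapes here (512 channels, height collapsed to 1, 26 columns) are illustrative assumptions, not values taken from the paper's configuration table: each column of the map becomes one frame, read left to right, so frame t describes a vertical strip of the input image.

```python
import numpy as np

# Hypothetical output of the convolutional layers for one image:
# 512 channels, height pooled down to 1, 26 columns wide.
feature_map = np.random.rand(512, 1, 26)

# Each column becomes one frame of the feature sequence for the RNN.
feature_sequence = [feature_map[:, :, t].reshape(-1)
                    for t in range(feature_map.shape[2])]

print(len(feature_sequence))      # 26 frames
print(feature_sequence[0].shape)  # (512,)
```

Since each frame corresponds to a receptive field in the original image, the recurrent layers can label image regions without any explicit character segmentation.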

The transcription layer at the top of CRNN is adopted to translate the per-frame predictions by the recurrent layers into a label sequence.

Though CRNN is composed of different kinds of network architectures (e.g., CNN and RNN), it can be jointly trained with one loss function.
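The simplest form of this transcription is greedy (best-path) CTC-style decoding: take the most likely label at each frame, collapse consecutive repeats, then drop blanks. The sketch below shows only this decoding rule; the paper's full transcription layer (the CTC loss, lexicon-based decoding) is more involved, and the label mapping used here is an invented example.

```python
def ctc_greedy_decode(frame_labels, blank=0):
    """Best-path decoding: collapse consecutive repeated labels,
    then remove blank symbols."""
    out = []
    prev = None
    for lab in frame_labels:
        if lab != prev and lab != blank:
            out.append(lab)
        prev = lab
    return out

# Per-frame argmax predictions; 0 is the blank symbol.
# With the (hypothetical) mapping 1='h', 2='e', 3='l', 4='o',
# the frames h h - e l l - l o collapse to "hello".
frames = [1, 1, 0, 2, 3, 3, 0, 3, 4]
print(ctc_greedy_decode(frames))  # [1, 2, 3, 3, 4]
```

Note how the blank between the two runs of 3 is what allows a doubled letter ("ll") to survive the repeat-collapsing step.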





References

An End-to-End Trainable Neural Network for Image-based Sequence Recognition and Its Application to Scene Text Recognition
