A 1D-CNN Based Fully Convolutional Model for Handwriting Recognition

Handwriting Recognition, also termed HTR (Handwritten Text Recognition), is a machine learning task that aims to give machines the ability to read human handwriting from real-world documents (images).

Traditional Optical Character Recognition (OCR) systems are trained to handle the variations and font styles found in machine-printed text (from documents/images), and they work very well in practice (for example, Tesseract). Handwriting recognition, on the other hand, is a more challenging task due to the large variation among people's handwriting.

Recent progress in deep learning has led to the development of efficient OCR/HTR solutions. Although these models perform remarkably well in practice, they aren't easy to train, understand, and deploy due to the following limitations:

  1. They require a huge amount of labeled training data.

  2. Due to their large number of parameters, they are hard to train and slow at inference.

  3. Because they are slow, they require significant deployment cost (hardware) to be useful in real-time applications.

  4. The models are complex in nature and difficult to scale (stacked LSTMs, complex attention layers).

In this article, we will talk about a novel deep learning architecture (EASTER) that solves the above-listed challenges to some extent. This architecture is fast, scalable, simple, and more efficient than many complex alternatives for the tasks of OCR and HTR.

The EASTER model uses only one-dimensional convolutional layers for the tasks of HTR and OCR.

Here are the topics this article covers regarding the EASTER model:

  1. EASTER Overview

  2. 1D-CNN on images? Really? how?

  3. EASTER Model Architecture

  4. OCR/HTR Capability with zero Training Data

  5. Results

  6. Summary

EASTER Overview

EASTER (Efficient and Scalable Text Recognizer) is a fully convolutional architecture that uses only 1-D convolutional layers in the encoder and adds a CTC (Connectionist Temporal Classification) decoder at the end.

EASTER presents a new way of framing and efficiently solving OCR/HTR tasks with only 1-D convolutional layers.

Here are a few important points about the EASTER architecture:

  1. A fully convolutional architecture that can be trained in parallel on GPUs.

  2. Only 1-D convolutional layers, making it faster with fewer parameters.

  3. Works well even when training data is limited.

  4. No complex layers (easy to understand).

  5. Works well for line-level OCR/HTR tasks.

In addition to the EASTER architecture, this paper also presents a synthetic data generation pipeline with an augmentation setup. That means you can train your own OCR/HTR system with zero training data requirements.

Now the question arises: how do you apply one-dimensional convolutions to a two-dimensional image? This is a fair question, and the next section explains it.

1D-CNN on Images? Really? How?

Consider an input image of size 600 × 50 (W × H), as shown in the figure below.

Here, if you draw any vertical line in this image, it will cut through at most a single character (unless it passes through white space), while a horizontal line will probably cut through all the characters.

In other words, along the height of the image you only find the properties of a single character, while along the width you encounter all the different characters as you move from left to right.

[Image by Author]

So the width can be treated as a time dimension: as you move along it, you encounter the subsequent characters, while the height represents the properties of the character at a given time-step.

A one-dimensional filter of kernel size 3 is a filter of size 3 along the time dimension (the width, 3 pixels at a time) that covers the full height of 50 pixels (H). In other words, a filter of kernel size 3 has dimensions 3×50 (or 3×H), just as 1-D CNNs work on NLP word embeddings.
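To make this concrete, here is a minimal sketch, assuming a TensorFlow/Keras setup (the filter count and variable names are illustrative and not taken from the paper), of feeding a 600 × 50 line image to a 1-D convolution by treating the width as the time axis and each pixel column as a feature vector:

```python
# A minimal sketch (not the paper's code): treat a 600 x 50 line image as a
# sequence where the width is the time axis and each 50-pixel column is the
# feature vector for that time step.
import numpy as np
import tensorflow as tf

W, H = 600, 50
image = np.random.rand(1, W, H).astype("float32")   # dummy batch of one image

# kernel_size=3 spans 3 pixels along the width and (implicitly) the full
# 50-pixel height -- i.e. a 3 x H receptive field, as described above.
conv1d = tf.keras.layers.Conv1D(filters=64, kernel_size=3, padding="same")
features = conv1d(image)
print(features.shape)   # (1, 600, 64): one 64-dim feature vector per time step
```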

As shown in the figure above, the red rectangular box is a 1-D convolutional filter that scans the full height of the image as it moves along the time dimension (the width) from left to right. Each scan captures information about the observed character (or part of a character).

This information is finally passed to a softmax layer, which gives a probability distribution over all possible characters for each time-step along the width. This probability distribution is then passed to the CTC decoding layer to generate the final output sequence.

EASTER Model Architecture

The EASTER model architecture is quite simple: it uses only 1-D convolutional layers for the tasks of OCR and HTR.

The EASTER encoder consists of multiple stacked 1-D convolutional layers, where the kernel size increases with the depth of the model. The effectiveness of stacked 1-D convolutional networks for sequence-to-sequence tasks has already been demonstrated in the field of ASR (Automatic Speech Recognition).

Easter Block

The basic structure of an EASTER block is shown in the figure below. Each block has multiple repeating sub-blocks, and each sub-block is made up of 4 ordered components (a rough sketch in code follows the figure):

  1. 1-D Convolutional layer
  2. Batch-Normalization layer
  3. Activation layer (ReLU)
  4. A Dropout layer
[Image Source]
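As a rough illustration, one such sub-block could be written in Keras as follows; the filter count, kernel size, and dropout rate here are placeholder values, not the ones from the paper's configuration table:

```python
# A rough sketch of one EASTER sub-block (Conv1D -> BatchNorm -> ReLU -> Dropout);
# hyperparameter values are illustrative, not taken from the paper.
import tensorflow as tf

def easter_sub_block(x, filters, kernel_size, dropout=0.2):
    x = tf.keras.layers.Conv1D(filters, kernel_size, padding="same")(x)
    x = tf.keras.layers.BatchNormalization()(x)
    x = tf.keras.layers.Activation("relu")(x)
    return tf.keras.layers.Dropout(dropout)(x)

# An EASTER block repeats this sub-block several times; kernel sizes grow
# with the depth of the encoder.
inputs = tf.keras.Input(shape=(600, 50))   # (width, height) as (time, features)
x = easter_sub_block(inputs, filters=128, kernel_size=5)
x = easter_sub_block(x, filters=128, kernel_size=5)
block = tf.keras.Model(inputs, x)
```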

Final Encoder

The overall encoder is a stack of multiple repeating EASTER blocks (discussed above). In addition to the repeating blocks, there are four extra 1-D convolutional blocks in the overall architecture, as shown in the figure below.

Preprocessing Block (Downsampling Block)

This is the first block of the model and contains two 1-D convolutional layers with a stride of 2. It downsamples the original width of the image to width/4. Apart from the stride, all other components of these sub-blocks are the same as those discussed above.
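A minimal sketch of this downsampling block, again assuming Keras with illustrative hyperparameters, could look like this:

```python
# A sketch of the preprocessing (downsampling) block: two stride-2 1-D
# convolutional sub-blocks, each halving the width, so 600 -> 150 (width/4).
import tensorflow as tf

def downsampling_block(x, filters=64, kernel_size=3, dropout=0.2):
    for _ in range(2):
        x = tf.keras.layers.Conv1D(filters, kernel_size, strides=2, padding="same")(x)
        x = tf.keras.layers.BatchNormalization()(x)
        x = tf.keras.layers.Activation("relu")(x)
        x = tf.keras.layers.Dropout(dropout)(x)
    return x

inputs = tf.keras.Input(shape=(600, 50))
x = downsampling_block(inputs)
print(x.shape)   # (None, 150, 64): the width has been reduced to 600 / 4
```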

Post-Processing Blocks

There are three post-processing blocks at the end of the encoder. The first is a dilated 1-D convolutional block with a dilation of 2, the second is a normal 1-D convolutional block, and the third is a 1-D convolutional block whose number of filters equals the number of possible output characters (the model's vocabulary length), followed by a softmax activation. The output of this layer is passed to the CTC decoder.

[Image Source]
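Here is a hedged sketch of the three post-processing blocks in the same illustrative Keras style; the filter counts, kernel sizes, and vocabulary size are assumptions, not the paper's exact values:

```python
# A sketch of the three post-processing blocks: a dilated Conv1D (dilation 2),
# a regular Conv1D, and a final Conv1D with vocabulary-size filters + softmax.
import tensorflow as tf

VOCAB_SIZE = 80   # assumed vocabulary length (all possible characters + CTC blank)

def post_processing(x):
    x = tf.keras.layers.Conv1D(512, 5, padding="same", dilation_rate=2)(x)
    x = tf.keras.layers.BatchNormalization()(x)
    x = tf.keras.layers.Activation("relu")(x)

    x = tf.keras.layers.Conv1D(512, 1, padding="same")(x)
    x = tf.keras.layers.BatchNormalization()(x)
    x = tf.keras.layers.Activation("relu")(x)

    # one probability distribution over the vocabulary per time step,
    # which is what the CTC decoder consumes
    return tf.keras.layers.Conv1D(VOCAB_SIZE, 1, padding="same",
                                  activation="softmax")(x)
```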

CTC Decoder

The EASTER encoder passes the output probability distribution of the encoded sequence to a CTC decoder for decoding.

To map the predicted per-time-step characters to the final output sequence, the EASTER model uses a weighted CTC decoder. This weighted CTC decoder leads to faster convergence and gives better results than vanilla CTC when training data is limited.

The configuration of this weighted CTC decoder is described in detail in the original paper.
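The weighted CTC formulation itself is specific to the paper, so it is not reproduced here. For orientation, the sketch below shows what a plain (unweighted) CTC loss and greedy CTC decoding look like in Keras, operating on the encoder's (batch, time-steps, vocabulary) softmax output; the shapes and label values are dummies:

```python
# Plain (unweighted) CTC for reference -- the paper's weighted variant differs.
# Shapes and label values below are dummies.
import tensorflow as tf

batch, time_steps, vocab = 2, 150, 80
y_pred = tf.nn.softmax(tf.random.uniform((batch, time_steps, vocab)), axis=-1)

labels = tf.constant([[5, 12, 9, 0], [7, 3, 0, 0]])    # zero-padded label sequences
label_len = tf.constant([[3], [2]])                    # true lengths of the labels
input_len = tf.constant([[time_steps], [time_steps]])  # encoder output lengths

# CTC loss over the padded batch
loss = tf.keras.backend.ctc_batch_cost(labels, y_pred, input_len, label_len)

# greedy CTC decoding collapses repeated characters and removes blanks
decoded, _ = tf.keras.backend.ctc_decode(
    y_pred, input_length=tf.squeeze(input_len, axis=-1), greedy=True)
print(loss.shape, decoded[0].shape)
```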

3x3 Architecture Variant

EASTER 3x3: a 14-layer variant can be constructed using the table shown below. It is a very shallow and simple architecture with just 1M parameters, yet it is very effective for OCR/HTR tasks.

[Image Source]

This model can easily be scaled up to increase capacity and performance. In the experiments reported in the paper, a 5x3 variant achieves state-of-the-art performance on HTR and OCR tasks.

OCR/HTR Capability with Zero Training Data

In addition to the novel architecture, the EASTER paper also describes how to synthetically generate training data for both machine-printed text and handwriting recognition tasks.

Using these methods (well described in the paper), you can train your own optical character recognition (OCR) or handwriting recognition (HTR) system without any labeled data, since the configurable data generator presented in the paper prepares a synthetic labeled training dataset for you.
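As a toy illustration of the underlying idea (and not the paper's actual generator, which also covers handwriting-style fonts, backgrounds, and augmentations), rendering a known text string with a font immediately yields a labeled (image, text) pair; the font path below is an assumption:

```python
# A toy synthetic-data generator (not the paper's pipeline): render known text
# with a font so that every image comes with its label for free.
from PIL import Image, ImageDraw, ImageFont

def render_line(text, width=600, height=50,
                font_path="DejaVuSans.ttf"):   # font path is an assumption
    img = Image.new("L", (width, height), color=255)    # white background
    draw = ImageDraw.Draw(img)
    font = ImageFont.truetype(font_path, size=32)
    draw.text((10, 8), text, font=font, fill=0)         # black text
    return img, text                                     # (image, label) pair

sample_img, sample_label = render_line("hello world")
sample_img.save("synthetic_sample.png")
```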

The following figure shows some synthetically generated samples from the paper; they look very realistic.

[Image Source]

Results

The paper reports impressive results on the IAM-offline line recognition task. The handwriting recognition experiments show that the EASTER model works well even when training data is limited.

The handwriting recognition results of EASTER are compared against Google's paper 'A Scalable Handwritten Text Recognition System' (aka GRCL), in which the authors show good handwritten-line recognition results with a limited training dataset. The EASTER model outperforms GRCL even with fewer training samples, as shown in the table below.

[Image Source]

EASTER further shows state-of-the-art results on scene text recognition (machine-printed) tasks, without any augmentation and with a greedy-search decoding mechanism (no language-model decoding).

Here is a screenshot of the model's results on handwritten as well as machine-printed tasks, taken from the paper itself:

[Image Source]

Summary

In this article, we discussed a novel fully convolutional (using only 1-D convolutions), end-to-end OCR/HTR pipeline that is simple, fast, efficient, and scalable.

In addition to the architecture, we learned how a one-dimensional convolutional filter operates on an image to be recognized.

Finally, we discussed the synthetic data generation pipeline along with the recognition results reported in the original paper.

For more details, you can read the original paper, which gives a detailed explanation of all the aspects touched on in this article.

Thanks for reading! I hope this article was helpful. Please share your feedback in the comments. See you in the next article.

This article was originally published here.

Translated from: https://towardsdatascience.com/1d-cnn-based-fully-convolutional-model-for-handwriting-recognition-7853976f5784
