AON[1803]
Extracts feature maps along four directions and then fuses them.
The overall network structure is as follows
Performs well on the various datasets; should be effective for text that is tilted as a whole but not curved.
Rosetta[1810]
A real-time text recognition system.
Detection is a two-stage model based on Faster R-CNN, but detection and recognition are trained separately.
Recognition is FCN + CTC.
Training fixes a problem that has existed since CRNN:
Instead, at training time, we resize the word images to 32x128 pixels, distorting the aspect ratio only if they are wider than 128 pixels, and using right-zero-padding otherwise
An interesting finding (wider English word images give better results). Would narrower images give better results for Chinese?
We empirically found that a stretching factor of 1.2 leads to superior results than using the original aspect ratio.
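The resize-or-pad rule quoted above can be sketched as follows. This is a minimal NumPy sketch, not Rosetta's actual code; the nearest-neighbour `_resize` helper is a dependency-free stand-in for a proper image resize, and grayscale input is assumed.

```python
import numpy as np

def resize_for_ctc(img, target_h=32, target_w=128, stretch=1.2):
    """Sketch of the Rosetta-style resize: distort the aspect ratio only
    when the (stretched) width exceeds target_w, otherwise right-zero-pad.
    `stretch` is the empirical 1.2 stretching factor from the paper."""
    h, w = img.shape[:2]
    new_w = int(round(w * (target_h / h) * stretch))
    if new_w >= target_w:
        # wider than the canvas: distort the aspect ratio to fit exactly
        return _resize(img, target_h, target_w)
    # keep the (stretched) aspect ratio and zero-pad on the right
    out = np.zeros((target_h, target_w), dtype=img.dtype)
    out[:, :new_w] = _resize(img, target_h, new_w)
    return out

def _resize(img, h, w):
    # nearest-neighbour resize, keeping the sketch dependency-free
    ys = (np.arange(h) * img.shape[0] / h).astype(int)
    xs = (np.arange(w) * img.shape[1] / w).astype(int)
    return img[ys][:, xs]
```

A narrow word image keeps its stretched aspect ratio and gets zero columns on the right; a wide one is squeezed to exactly 32x128.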
Warm-up training from short to long words:
We started training with words of length≤3, for which the alignment would be easy and where the variations in length would be tiny, and increased the maximum length of the word at every epoch.
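A minimal sketch of this short-to-long warm-up. The paper only says the cap starts at length 3 and the maximum length is increased every epoch; growing it by exactly one per epoch is an assumption here.

```python
def curriculum_batches(samples, num_epochs, start_len=3):
    """Yield (epoch, subset) pairs: epoch 0 keeps only words of length
    <= start_len, and the cap grows by one each epoch (assumed step),
    mimicking the short-to-long warm-up described in the paper."""
    for epoch in range(num_epochs):
        max_len = start_len + epoch
        subset = [(img, word) for img, word in samples if len(word) <= max_len]
        yield epoch, subset
```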
Two design choices for efficiency:
ShuffleNet as the backbone (slightly better than ResNet-18)
Recognition uses neither LSTM nor attention
Mask TextSpotter [1808]
An end-to-end OCR model built on Mask R-CNN.
After the RPN the network splits into two branches: Fast R-CNN provides more accurate bounding boxes,
while the mask branch does text/background semantic segmentation plus character instance segmentation: the region is resized to a fixed 16×64 and classified pixel by pixel.
PixelLink[1801][AAAI2018]
Its idea comes from:
text/non-text predictions can not only be used as the confidences on regression results, but also as a segmentation score map, which contains location information in itself and can be used to obtain bounding boxes directly.
So the method segments text instances by predicting, for each pixel, whether it is linked to each of its 8 neighbors.
Where links exist, could a conv-LSTM be applied?
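The link-based grouping can be sketched with a union-find over positive pixels. This is a sketch, not the paper's implementation; how the two directed link predictions between a pixel pair are thresholded and combined is abstracted into the `links` callback.

```python
def group_pixels(text_mask, links):
    """PixelLink-style instance grouping (sketch).
    text_mask: H x W nested list of booleans (text / non-text).
    links: callable (y, x, dy, dx) -> bool, the predicted link from
    pixel (y, x) toward neighbor (y + dy, x + dx).
    Returns an H x W label map (0 = background)."""
    H, W = len(text_mask), len(text_mask[0])
    parent = {}

    def find(p):
        while parent[p] != p:
            parent[p] = parent[parent[p]]  # path halving
            p = parent[p]
        return p

    def union(a, b):
        parent[find(a)] = find(b)

    for y in range(H):
        for x in range(W):
            if text_mask[y][x]:
                parent[(y, x)] = (y, x)
    for y in range(H):
        for x in range(W):
            if not text_mask[y][x]:
                continue
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    if dy == dx == 0:
                        continue
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < H and 0 <= nx < W and text_mask[ny][nx] \
                            and links(y, x, dy, dx):
                        union((y, x), (ny, nx))
    # assign one label per connected component
    label_of, labels = {}, [[0] * W for _ in range(H)]
    for p in parent:
        root = find(p)
        label_of.setdefault(root, len(label_of) + 1)
        labels[p[0]][p[1]] = label_of[root]
    return labels
```

Two text pixels end up in the same instance exactly when a chain of positive links connects them, which is how segmentation scores alone yield bounding boxes.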
TextSnake[1807]
A more flexible representation of text regions; the idea comes from:
successively decomposing text into local components and then composing them back into text instances.
(an upgrade of SegLink)
Concretely: an FPN first produces a feature map at 1/2 the input resolution, from which the text region (TR) and the text center line (TCL) are predicted; the TCL is masked by the TR to separate the instances, and then
A striding algorithm is used to extract the central axis point lists and finally reconstruct the text instances.
The training loss has two parts; the r and θ losses are computed at every point of the TCL.
(all λ in the paper are set to 1)
ASTER[PAMI2018]
An Attentional Scene Text Recognizer with Flexible Rectification
First predicts a Thin-Plate Spline (TPS) transform, then rectifies the image with a Spatial Transformer Network (STN) before recognition.
The TPS transform is a high-dimensional linear transform in which T is a 2×(K+3) matrix, with K = 20 control points anchored along the top and bottom edges.
T can be approximated by the following formula
The paper mentions that the trick below greatly improves results:
Without tanh, the control points may fall outside image borders, so value clipping is performed in the sampler to ensure valid sampling
Also, the last fc layer (fc2) of this network must be initialized with zero weights and a bias corresponding to the identity transform.
Recognition uses an LSTM with attention.
Rectification network structure:
Recognition network structure:
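The identity initialization can be sketched as follows. This is a NumPy sketch; the fc input dimension of 512 and the exact control-point layout (K points split evenly between the top and bottom borders, in [-1, 1] coordinates matching the tanh output range) are assumptions, not taken from the paper.

```python
import numpy as np

def init_localization_fc(K=20, in_dim=512):
    """ASTER-style init for the TPS localization fc layer (sketch):
    zero weights, and a bias equal to the K control points of the
    identity transform -- K/2 points evenly spaced on the top border
    and K/2 on the bottom, in [-1, 1] coordinates (assumed layout)."""
    xs = np.linspace(-1.0, 1.0, K // 2)
    top = np.stack([xs, -np.ones(K // 2)], axis=1)
    bottom = np.stack([xs, np.ones(K // 2)], axis=1)
    bias = np.concatenate([top, bottom], axis=0).reshape(-1)  # 2K values
    weight = np.zeros((2 * K, in_dim))  # in_dim = 512 is an assumption
    return weight, bias
```

With zero weights, the layer initially outputs exactly the bias, so the STN starts as the identity warp and training can depart from it gradually.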
Attention-based Extraction of Structured Information from Street View Imagery[1704]
Oddly, results are reported only on the FSNS dataset; with no comparison there is no harm done.
Proposes adding a positional-information embedding to the attention-based LSTM,
changing it to
Judging from the results, the effect may be similar to template matching?
ESIR: End-to-end Scene Text Recognition via Iterative Image Rectification[1812]
The main contribution is on rectification, in two parts; one is
robust distortion rectification for optimal scene text recognition.
and the other is
accurate rectification of perspective and curvature distortions of texts in scenes.
The paper keeps stressing that the method is novel, but it is really more like an iterative ASTER (only the way the control points are determined differs).
Concretely, a new line-fitting transformation is designed to localize text: the text center line is a degree-K polynomial, plus a series of line segments that estimate the vertical extent of the text. It has 3L + K + 1 parameters; the network structure is as follows
The transformation matrix is estimated,
and this process is iterated N times.
In the comparison, the advantage over ASTER (38) is not obvious, while there is a considerable improvement over AON (8).
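The iterative scheme itself is just a loop; in this sketch `estimate_params` and `warp` are placeholders standing in for the parameter-regression network and the TPS-like sampler.

```python
def iterative_rectify(img, estimate_params, warp, n_iters=5):
    """ESIR-style iterative rectification (sketch): each iteration
    re-estimates the line-fitting parameters on the *current* image and
    applies the corresponding warp, so residual distortion shrinks step
    by step instead of being removed in one shot as in ASTER."""
    for _ in range(n_iters):
        params = estimate_params(img)
        img = warp(img, params)
    return img
```

The design point is that a single rectification pass rarely removes all distortion; re-running the estimator on the partially rectified image gives it an easier problem each time.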
Focusing Attention: Towards Accurate Text Recognition in Natural Images[1710]
Proposes the FAN network, which pulls drifted attention back to the correct positions.
Its FN (focusing network) first computes the attention center, then crops a neighborhood around that center and computes the attention distribution. The focusing loss obviously requires character-level annotations....
This is already close to the idea of a GAN; SSFL below is the GAN version.
Synthetically Supervised Feature Learning for Scene Text Recognition[ECCV2018]
Studies how to extract features by contrasting real and generated images:
leverage the difference between real and synthetic images, namely the controllability of the generation process, and control the generation process to generate paired training data.
Concretely,
Feature invariance requires that the encoder extracts the same feature for any input image x and its corresponding clean image x¯: E(x) = E(x¯).
trained with a GAN-style procedure.
Notably, the paper uses no attention mechanism yet matches FAN's results on IC03 and IC13.
This might be a viable replacement for rectification.
What is wrong with scene text recognition model comparisons? dataset and model analysis[1904]
1. Compares existing models,
all trained on the MJSynth and SynthText datasets for a fair comparison.
2. Introduces a four-stage framework:
transformation (Trans.): STN
feature extraction (Feat.)
sequence modeling (Seq.):
prediction (Pred.): CTC/Attention
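The four stages compose into one pipeline; a toy sketch (the stage implementations here are placeholders, any of which can be the identity, which is how one code path can span CRNN, RARE, R2AM, etc.):

```python
def build_str_model(trans, feat, seq, pred):
    """The paper's four-stage STR framework as a function pipeline
    (sketch): Trans. -> Feat. -> Seq. -> Pred."""
    def model(image):
        x = trans(image)   # e.g. TPS-STN rectification, or identity
        x = feat(x)        # e.g. VGG / RCNN / ResNet features
        x = seq(x)         # e.g. a sequence model, or identity
        return pred(x)     # e.g. CTC or attention decoder
    return model
```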
Experimental results:
- The diversity of training samples matters more for accuracy than their quantity (training on 20% of MJSynth and 20% of SynthText together (total 2.9M) provides 81.3% accuracy – better performance than the individual usages of MJSynth or SynthText)
- ResNet is fast; RCNN uses little memory
- Attention is not always worth it: The final change, Attn, on the other hand, only improves the accuracy by 1.1% at a huge cost in efficiency (27.6 ms).
- Failure cases: fonts; vertical text; special symbols like $; occlusion; low resolution; mislabeled ground truth
The appendix gives a detailed implementation of the rectification network (similar to ASTER and ESIR) and of the various backbones.
Scene Text Detection and Recognition: The Deep Learning Era[1812]
A survey that covers essentially all OCR papers up to mid-2018.
(1) introduce new insights and ideas; (2) highlight recent techniques and benchmarks; (3) look ahead into future trends.
Beyond detection and recognition algorithms, the most interesting part is the summary of auxiliary techniques, including synthetic data (SynthText/GAN), bootstrapping (combining MSER / character-level annotation), text deblurring, context information (potentially useful in specific scenarios), and adversarial attacks.
SqueezedText: A Real-Time Scene Text Recognition by Binary Convolutional Encoder-Decoder Network[AAAI2018]
Reduces computation and parameter count by binarizing the network.
Gets close to the SOTA of around 2016, so it has some reference value.
Unclear whether the same can be done for Chinese characters.
Recursive Recurrent Nets with Attention Modeling for OCR in the Wild[CVPR2016]
R2AM uses a convLSTM-like structure to extract features, followed by an attention-based recognition model. It was the SOTA at the time, but the convLSTM is shallow and there are still fc layers, so there should be plenty of room for improvement.
Gated Recurrent Convolution Neural Network for OCR[NIPS2017]
Adds a gate after each RC layer:
a gate to control the context modulation in each RCL layer, which leads to the Gated Recurrent Convolution Neural Network (GRCNN).
Neither of the two RNN+CNN combinations above really works; how to combine the two needs further thought.
Learning to Read Irregular Text with Attention Mechanism[1708]
The overall design actually has quite a few problems.
The structure is shown below.
An FCN predicts characters, and the FCN loss is added purely to improve the CNN's representational power (that is its only role);
the extracted features and a positional encoding are then summed for attention-based prediction;
and then there is no encoder, just a decoder...
The authors also design an attention loss that measures the distance between the attention and the ground truth...... an even bolder move than FAN above......
Char-Net: A Character-Aware Neural Network for Distorted Scene Text Recognition[2018AAAI]
Designs a character-level CharNet with a hierarchical attention mechanism (HAM).
The newly proposed HAM consists of two layers, namely the recurrent RoIWarp layer and the character-level attention layer.
word-level encoder (feature map via a CNN)
--> recurrent RoIWarp (crops the feature map of the current character)
--> character-level attention (STN + CNN + the form of a traditional attention mechanism)
--> decoder
RoIWarp details: the weights s are obtained from the previous hidden state (during training, the ground-truth s is a 2-D Gaussian).
For some reason the authors put the concrete architecture in Implementation Details....
- word-level encoder
  - 3×3 conv, 64 ch, ×3 + 2×2 max pool
  - 3×3 conv, 128 ch, ×2 + 2×2 max pool
  - 3×3 conv, 256 ch, ×2 + LRN
  - 3×3 conv, 256 ch, ×4 + LRN
  - 3×3 conv, 256 ch, ×4 + LRN; the last three stages are concatenated + 1×1 conv
- char level
  - conv, 256 ch, ×2 --> predict rotation angle (STN)
  - conv, 512 ch, ×3 (CNN)
The width and height of Ft_c are both set to 5.
This may strongly affect accuracy.
The position of the previous character relative to the current one should help the STN judge orientation, and could be exploited.
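The word-level encoder stack above can be sanity-checked by tracing the spatial size through it. A sketch, assuming the 3×3 convs are stride-1 with padding 1 (not stated explicitly in the notes), so only the two 2×2 max pools change the resolution and LRN changes nothing.

```python
def encoder_output_size(h, w):
    """Trace the spatial size through the Char-Net word-level encoder
    (sketch): size-preserving 3x3 convs (assumed stride 1, pad 1), each
    2x2 max pool halving the size, and LRN leaving it unchanged."""
    def pool(h, w):
        return h // 2, w // 2
    h, w = pool(h, w)  # 3 x conv3x3(64) + 2x2 max pool
    h, w = pool(h, w)  # 2 x conv3x3(128) + 2x2 max pool
    return h, w        # conv3x3(256) stacks + LRN: unchanged
```

So a 32×128 word image comes out of the encoder at 1/4 resolution, 8×32.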